Prolegomena Paedagogica

INTRAMENTAL EVOLUTION AND ONTOGENY OF TODDLERESE & FOUR SIMULATIONS

1st and 2nd volume of PhD dissertation by MSc. Bc. et Bc. Daniel Devatman Hromada

Daniel Devatman Hromada: Intramental Evolution & Ontogeny of Toddlerese, Propedeutica Didactica I., © June 2016

WHAT THIS TEXT IS NOT

This text is not a formal work in a mathematico-logical sense. Its aim is not to introduce the Theory of natural language, nor even a set of theorems whose validity could be proven by blind application of rules of symbolic substitution upon a pre-defined set of solid definitions and «self-evident» axioms. Assuming that

• indeed incomplete is every formal system whose explanatory power is at least equivalent to the explanatory power of the formal system of basic arithmetics (Gödel, 1931)

• the explanatory power of any natural language system is at least as exhaustive as that of any conceivable arithmetic system

we consider the temptation to explain natural languages in strictly formal terms to be a potentially counterproductive one.

Nor is this text a product of an analytical approach to science. It shall not limit itself to the study of a sub-problem of a problem which can sometimes be observed when one confronts the world with the particular terminology and methodology of a sub-branch of a highly specialized discipline. It does not, in Nietzschean terms, devote itself to the study of «the brain of the leech». In other terms, this text does not aim to attain knowledge – whatever it is – by the reductionist act of focusing one's attention upon one sole boring fragment of "truth".

Knowing that truth is complex, sometimes contextually bound and more often than not simply beyond the reach of an individual observer, we do NOT pretend that ALL hypotheses presented on the following pages are apodictically and universally true. For any hypothesis is just a piece of a bigger picture, and it is this picture itself which is supposed to represent Reality – i.e. to be «true» – and not the pieces. On their own, theses and hypotheses are just indices helping the scientist to find his way on a path to such a bigger picture. Thus, even invalid hypotheses can serve a productive purpose if ever they succeed in transposing the scientist into realms where (s)he has never been before. And as we shall try to indicate and re-indicate throughout this whole text, it is indeed «by descending into & traversing through the valley of falsehoods» that the researcher can ultimately attain a perspective which is «higher» (i.e. more «optimal») than the original one.

This principle, we believe, applies both to a baby language learner as well as to the evolution of a scientist's knowledge and, possibly, to the evolution of science in general.

WHAT THIS TEXT IS

This text is an attempt to elucidate «the mystery» of acquisition and development of linguistic competence in terms of evolutionary and complexity theory. Thus, it is principally a multidisciplinary scientific essay. By being «scientific», its aim has to be either analytic or synthetic; and since we have already stated that our main aim is not analytic, it follows that the goal of this essay is of a synthetic nature. More concretely, the synthesis in question aims to involve the following scientific disciplines: artificial intelligence and artificial life, cognitive psychology, developmental psycholinguistics, evolutionary computing, natural language processing, theory of complexity, and universal darwinism.
A mindmap localizing the central topic of this text within the wider scientific context is presented on Figure 1 (c.f. the list of Acronyms on page 355 if some abbreviations are unclear or ambiguous).

[Figure 1: Central notions of this dissertation – a mindmap relating notions like Machine Learning, NLP, Computational Linguistics, the Language Acquisition Device, Developmental Psycholinguistics, Universal Darwinism and Evolutionary Computation to the central node «Intramental Evolution of Linguistic Structures».]

To demonstrate the validity of our perspective, this text shall present three different proofs-of-thesis. The theoretical proof-of-thesis shall consist in making reference to & aligning with multiple theories scattered among different disciplines of the cognitive sciences. Ideally, many seemingly unrelated phenomena could thus be brought under the clef-de-voûte of one scientific paradigm. The observational / empiric proof-of-thesis shall aim to align the thesis with seemingly trivial observations of the linguistic behaviour of a certain human subject. Finally, the computational / experimental proof-of-thesis shall hopefully illustrate that diverse problems of language acquisition are computationally solvable if ever an evolutionary component is introduced.

Last but not least, this text is also a dissertation work with which we aspire to the attribution of the title Philosophiae Doctor. For this reason, all chapters of this first volume contain a certain quantity of remarks which partially surpass the informatic, cognitive and/or psycholinguistic paradigms and point in the direction of philosophy in general, and of epistemology in particular.

HOW IS THE TEXT ORGANIZED

The text is composed of two volumes which, taken together, contain four parts. The first volume consists of three parts, the second volume (Hromada, 2016d) consists of only one. Each part is divided into chapters. Every chapter consists of an introduction and a conclusion, respectively preceding and following more specific subchapters, which can fractally branch into sub-subchapters etc. All such parts, chapters, subchapters etc. can be considered to be «non-terminal» nodes of the structure presented by this text.

[margin: Basic structure of the text]

The first part, labeled «Theses», is the stem of the whole text. It will introduce multiple theses at varying degrees of generality which shall all be – in one way or another – more directly addressed in subsequent sections. In order to weave the basic conceptual fabric, some definitions of terms like «evolution» and «language learning» shall also be offered. All variants of the thesis shall be briefly related to other cognitive sciences.

The second branch, labeled «Paradigms», is composed of chapters dedicated to Universal Darwinism, Developmental Psycholinguistics and Computational Linguistics. In these chapters, the theses presented in the first chapter shall be more deeply interpreted and contextualized in terms of the respective disciplines.

The third branch, labeled «Observations», will describe multiple longitudinal observations of one concrete human child. Subsequent interpretations in terms of the evolutionary theoretical framework shall follow.

The ultimate branch, called «Simulations», shall present multiple computational models addressing three problems related to the language acquisition process. That is:

1. the problem of concept induction
2. the problem of induction of grammatical categories
3. the problem of induction of grammatical rules

A specific chapter, in which existing solutions shall be described, will be dedicated to each problem. Special focus shall be put on evolutionary solutions, if they exist. To each of the above-mentioned problems we shall try to offer our own unique evolutionary solution, and subsequently we shall discuss its performance. PERL source codes shall also be attached and published under the mrGPL licence in order to facilitate the reproducibility (Hromada, 2016e) of results by other scientists.

[margin: Text's nodes and their attributes]

As a whole, the text hereby presented can thus be considered to be a tree with four major branches which bifurcates all the way down to «terminal» (i.e. leaf) nodes. To all nodes of such a «tree» shall also be attributed one among the following types:

DEF (Definition) – an intensive or extensive definition (or a combination of both) of a term used throughout the book
TXT (Text) – a longer piece of text, often dedicated to one specific hypothesis, topic, theory or model; this is the default node type
OBS (Observation) – a transcription of an item from the observation journal
APH (Aphorism) – a comment presenting the author's stance in regards to a topic raised in a Text or Observation node; more subjective than TXT
SRC (Source code) – a snippet of PERL source code

The type of the node is specified in its title. Preceding the title is a unique numeric identifier which can serve as an anchor for cross-references. Thus, a text dedicated to Piaget's Genetic Epistemology which is contained in the fourth section of chapter eight will be introduced with the following expression:

8.4.4. Genetic Epistemology (TXT)

The end of every node is marked by an expression containing the node's numeric ID, its title and the token END. The above-described node will thus be terminated with the following expression:

8.4.4. Genetic Epistemology END

Because the nodes can be embedded within each other, such a syntax is needed to exclude any ambiguity. C.f. 1.0 and its relation to the embedded nodes 1.0.1 and 1.0.2 for a concrete example of such embedding[1].

Margin-notes shall also be employed to facilitate even further the orientation within the text and the cross-referencing between its diverse parts. Such a note shall usually be placed at the margin of the text whenever a new topic is addressed.

[margin: This is a self-referential margin-note]

IN WHAT LANGUAGE IS THE TEXT WRITTEN?

This dissertation is written in a language which shares the majority of its morphological, lexical and syntactic features with modern standard English[2]. Thus, the majority of words are English words and the majority of sentences can easily be parsed by a standard English-language parser. But it has to be noted that this text was not written by a native English speaker. Written mainly in Germany and deposited at a French university by a child of a Slovak mother and a Czech father, inspired by the compactness and eloquence of classic (i.e. Latin, Greek and Sanskrit) treatises, and often aiming to denote very subtle distinctions and novel meanings: all this often led to a Sapir-Whorfian feeling that the communication of certain thoughts is inconsistent with certain well-established schemas and rules. If ever such a situation occurred, it was the communicative intention and not the rule which was prioritized: hence the origin of many seemingly agrammatical constructions present in this work.
Thus, aside from multitudes of involuntary typos, and aside from multitudes of omitted and/or misplaced articles – a Slavic speciality – this work also exposes the reader to a certain amount of errors which are, in fact, not "bugs" but "features". In certain cases, italics and bold were used to mark the moments whereby the author intentionally broke an existing schema – or invented a new one – in order to emphasize a certain aspect of the-intention-to-be-communicated.

[1] Without this embedding, the arborescent structure of this Thesis would be reminiscent of Wittgenstein's Tractatus. But because this embedding is implemented, the structure resembles more a context-free (10.2.2) form of a valid XML document.

[2] Within the context of this dissertation, standard English is principally understood in terms of set theory as the union of British and American English. Given that it is defined as a union and not an intersection, both variants (i.e. American as well as British) are accepted as valid and used interchangeably in cases where the two languages diverge (e.g. both British "optimise" as well as American "optimize" can be accepted).

Part I

THESES

In the distant future I see open fields for far more important researches. Psychology will be based on a new foundation, that of the necessary acquirement of each mental power and capacity by gradation. — Charles Darwin

In this part we shall posit and discuss multiple theses whose validity or invalidity we shall try to demonstrate in subsequent parts of this dissertation. After a brief discussion of the Initial Thesis "mind evolves", the sense of the Hard Thesis "learning is a form of evolution" shall be more thoroughly criticized by exploring the conditions of its validity. The Soft Thesis "learning can be successfully simulated by means of evolutionary computation", the Softer Thesis "learning of natural language can be successfully simulated by means of evolutionary computation" and the Softest Thesis "learning of first language can be successfully simulated by means of evolutionary computation" shall be postulated next. At last, the Operational Thesis "learning of first language from its textual representations can be successfully simulated by means of evolutionary computation" shall turn out to be sufficiently concrete to become an object of computational simulations. Definitions of the terms mind, to evolve, evolution, brain, 2nd law of thermodynamics, evolutionary computation, natural language, first language and child shall also be provided. Aside from all that, a so-called "alternative" hypothesis concerning the non-local storage of information in the human brain shall also be introduced.

1 INITIAL THESIS

Mind evolves.

This is the Initial Thesis (IT) whose validity we hereby undertake to demonstrate. In order to do so, both terms of the statement are to be properly defined.

[margin: Definition of substantive "mind"]

1.1 mind (def)

An auto-organising set of structures and processes determining the characteristic behaviour of an individual.

end mind 1.1

[margin: Definition of verb "to evolve"]

1.2 to evolve (def)

Oxford Dictionary definition:

1. Develop gradually
2. Develop in time as a result of natural selection
3. (chemistry) Give off gas or heat

Etymological definition:

• 1640s: "to unfold, open out, expand," from Latin evolvere "to unroll," especially of books; ... from ex- "out" + volvere "to roll".
• 1832: "to develop by natural processes to a higher state"

end to evolve 1.2

[margin: IT seems to be a tautology]

IT means that an auto-organising set of structures and processes determining the characteristic behaviour of an individual is endowed with the propensity to gradually attain higher states of complexity. Hence, not only the structures stocked in and by the mind, but also the very processes which act in the mind are to be understood as subject to transformation.

The fact that the predicate "to evolve" is conjugated in the indicative mood, 3rd person singular, of the present simple tense suggests that the statement tends to denote a state-of-affairs independent of the temporal context within which the evolution of mind occurs. Thus, it can be reproached that IT is too general and potentially tautological. Since it is difficult to see how such a statement could, per se, become an object of positivist endeavour, let us now discuss IT's less tautological variants.

end initial thesis 1

2 HARD THESIS

The Hard Thesis (HT) is expressed as follows: «learning is a form of evolution».

The term evolution, as presented in HT, is to be understood in terms of the generalized form of Darwin's theory which is called Universal Darwinism (UD). In such a framework, evolution can be defined as follows:

[margin: Definition of substantive "evolution"]

2.1 evolution (def)

Evolution is a durative process emergent in any finite-resourced environment containing a population of information-encoding entities which:

1. Reproduce
2. Need resources for their reproduction
3. Vary because of inaccuracies inherent to the reproduction process

end evolution 2.1

[margin: How evolution "works"]

If ever there exists a causal relation between the information these entities encode (genotype) and the means by which they exploit the environment's resources (phenotype), the population will be led to a gradual optimisation of its relations with the environment, i.e. it will discover ways to exploit resources more efficiently than before. In this sense, the next generations of information-encoding entities shall be better «adapted» to their common environment.

[margin: Evolution has many forms]

It is important to realize that the notion of «evolution», as hereby defined, goes far beyond the traditional Darwinian theory, which was concerned with just one instance of evolution, namely the biological one. Some phenomena which could be interpreted or even modelled as instances of systems whose functioning is consistent with the precepts postulated by Universal Darwinism shall be discussed in somewhat closer detail in Chapter 8.

[margin: Logical necessity of evolution]

For a Universal Darwinist, evolution is not an empirical but a logical necessity. It has to necessarily occur within any system fulfilling the above-mentioned conditions. The emergence of evolution in a system fulfilling the above-mentioned conditions is independent of the concrete form of the «natural laws» & physical constants which determine the particularities of such a system.

[margin: HT is about ontological equivalence]

The Hard Thesis states that the psycho-pedagogical process of «learning» can not merely be interpreted and simulated as an evolutionary process. The Hard Thesis states that learning IS functionally equivalent to an evolutionary process. That, on an ontological level, «learning» is an instance of an evolutionary process and therefore IS an evolutionary process. In the UD-consistent sense.
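Before turning to the definition of «learning», the durative process described in definition 2.1 can be made palpable in a few lines of code. The following PERL fragment is a minimal illustrative sketch of ours – it is not one of the simulations presented later in this dissertation, and all its identifiers and parameter values are ad hoc assumptions – showing how reproduction, resource-limitation and variation jointly suffice for adaptation to emerge:

  use strict;
  use warnings;

  # a population of information-encoding entities: plain 8-bit strings
  my @population = map { join '', map { int rand 2 } 1 .. 8 } 1 .. 20;

  # the environment sustains at most 20 entities per generation
  my $resources = 20;

  for my $generation (1 .. 50) {

      # genotype -> phenotype causality: an entity's efficiency in
      # exploiting resources is here simply the number of 1-bits it encodes
      my %fitness;
      $fitness{$_} = ($_ =~ tr/1//) for @population;

      # selection: only the better-adapted half gets to reproduce
      my @parents = (sort { $fitness{$b} <=> $fitness{$a} } @population)[0 .. $resources / 2 - 1];

      # reproduction with variation: copying is inherently inaccurate
      my @offspring;
      for my $parent (@parents) {
          for (1 .. 2) {
              my $child = $parent;
              if (rand() < 0.1) {    # a rare copy error flips one random locus
                  my $locus = int rand length $child;
                  substr($child, $locus, 1) = substr($child, $locus, 1) eq '1' ? '0' : '1';
              }
              push @offspring, $child;
          }
      }

      # finite resources: the old generation is entirely replaced
      @population = @offspring;
  }

  # after a few dozen generations the population consists almost
  # exclusively of entities close to the optimum '11111111'
  print join(' ', @population), "\n";

Nothing in this sketch presupposes silicon: any substrate satisfying the three conditions of definition 2.1 would do.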
[margin: Definition of substantive / participle "learning"]

2.2 learning (def)

Learning is a mind-transforming, information-processing, constructivist and embodied process.

end learning 2.2

[margin: Attributes of learning]

The attribute «mind-transforming» denotes the finality of learning – it means that both the contents as well as the processes which determine the characteristic behaviour of an individual agent can be modified by means of learning. The attribute «information-processing» denotes the modality of learning – it implies that learning always involves the processing of information, namely the assimilation, accommodation, encoding, storage or decoding of information. The term «constructivist» suggests that learning is gradual and can potentially bootstrap itself. The term "embodied" suggests that learning could succeed only with great difficulty if it were not embedded in an individual monadic entity which keeps track – in one way or another – of its own trajectory.

The last term of HT is defined, in accordance with tradition, as follows:

2.3 form (def)

«Form is the possibility of structure.» (Wittgenstein, 1922)

end form 2.3

Given the definitions 2.1, 2.2 and 2.3, the Hard Thesis – presented as a conjunction of the terms «learning is a form of evolution» – can be considered to be true iff the following statements are true as well:

2.4 first condition of ht's validity (def)

Learning involves the reproduction of information-encoding entities.

end first condition of ht's validity 2.4

2.5 second condition of ht's validity (def)

These learning-enabling information-encoding entities consume resources in order to reproduce.

end second condition of ht's validity 2.5

2.6 third condition of ht's validity (def)

The process of reproduction of information-encoding entities can be influenced by stochastic phenomena which cause an unpredictable structural variation.

end third condition of ht's validity 2.6

2.7 fourth condition of ht's validity (def)

The resources of the environment within which the learning occurs are finite.

end fourth condition of ht's validity 2.7

The Hard Thesis, as proposed until now, defines learning in general, and as such can be said to describe the form of «learning» of both human and artificial minds. In this general sense it will be used in the majority of the text which shall follow. For the rest of chapter 2, however, we shall discuss «learning» as related solely to humans.

The material substrate of human learning[1] is the brain, and no positivist theory of learning can thus be considered adequate if it ignores the brain's essential attributes. We list these essential attributes in the following definition.

[1] Cellular memory being an exception with which we cannot deal here.

2.8 brain (def)

The human brain is a physical (i.e. four-dimensional) object of organic origin which consumes biochemical energy in order to process and/or store information in a non-local, highly parallel, and to a certain extent also plastic, equipotent and holographic fashion robust to both endogenous and exogenous perturbations.

end brain 2.8

[margin: Of validity of connectionist Level of Abstraction]

The fact that the brain disposes of the above-mentioned properties is usually explained in terms of «neural» connectionist theories whose validity is well demonstrated by multitudes of anatomical observations and clinical experiments. And it is indeed true that, when observed through a microscope, at a strictly material «level of abstraction» (LoA[2]), the brain is nothing else than a ball-sized walnut of wetware consisting of approximately one hundred billion neural cells. Subsequently, when one adopts a more computational LoA, one easily comes to the conclusion that the substrate of mutually interconnected neural cells can indeed yield a device capable of strongly parallelized computation. «Neural networks», «backpropagation», «stochastic gradient descent» – all these notions offer us useful conceptual tools which enable us to bridge the «objective» material reality of the brain with the information-processing, i.e. «computational», faculties of the mind.

[2] C.f. Philosophy of Information (Floridi, 2011, pp. 46-58) for a more exhaustive definition of «Level of Abstraction».

The ability to «learn» can, of course, be considered to be such a computational faculty. And the Hard Thesis states that above the «material» and «computational» LoAs from which the brain can be interpreted, there exists also a scientifically sound «evolutionary» LoA at which learning can be conceived, both functionally and structurally, as an instance of an evolutionary process – i.e. a process involving reproduction, variation and selection of information-carrying entities. If HT is valid, the functions of one and the same brain could thus ideally be interpreted through the prism of the «material», «computational» and «evolutionary» LoAs at the same time.

An «evolutionary» LoA can be considered to be scientifically sound only if it does not contradict empiric knowledge – in the case of HT, it should not contradict the anatomical and clinical knowledge concerning the brain. Nor should it contradict the connectionist «computational» theory. What we already know about «brain» and «learning» should rather be consistent with the meaning of HT. Is it the case? It can be, if ever the conditions of HT's validity (c.f. 2.4 – 2.7) are found to be consistent with current neuroscientific knowledge.

[margin: Of brain's consumption of resources]

The last condition, «the resources of the environment within which the learning occurs are finite», seems not to pose a problem, since both the environment about which we speak here – the brain itself – and its material and energetic resources are finite: even in the case of a most abnormal human being, a brain simply cannot consume more than 25-30 % of one's energy. Hence, it is impossible for a human brain, as an energy-consuming system, to go beyond the upper bound of circa 500 kilocalories per day (Mink and Blumenschine, 1981). In this sense, what holds for energy holds, mutatis mutandis, also for the limits of the nurturing chemical substances which the brain must metabolize in order to keep its vital functions in equilibrium. Their quantity is limited – even in the case of a well-nourished, healthy individual, the brain's material resources are finite.

The third condition, i.e. «the process of reproduction of information-encoding entities can be influenced by stochastic phenomena which cause an unpredictable structural variation», also does not seem to be very problematic when we consider the fact that the reproduction does physically occur within its environment – i.e. in the brain – and that this environment is an energy- & information-processing system. It is not problematic because of the 2nd law of thermodynamics.
2.9 2nd law of thermodynamics (def)

«Every process occurring in nature proceeds in the sense in which the sum of the entropies of all bodies taking part in the process is increased. In the limit, i.e. for reversible processes, the sum of the entropies remains unchanged.» (Planck, 1926)

end 2nd law of thermodynamics 2.9

[margin: Of brain and heat]

The human brain, when understood as a physical system, is not an exception to this law. Nor are its components – lobes, neural circuits, neurons, axons, dendrites, receptors, proteins etc. Whenever and wherever information is processed, energy transforms its form and some residual heat is generated. Heat is energy with increased entropy – in its essence it is kinetic energy kicking the surrounding molecules in all directions. As such, it can induce unexpected «unpredictable structural variation» of the molecular substratum of the brain's tissue. Thus, the very fact that the brain is an energy-consuming device implies the possibility of decay and loss of the information encoded in the brain's materia.

[margin: Of intracerebral sources of variation]

Heat aside, the brain is also confronted with other sources of «unpredictable structural variation». From quantum phenomena, free radicals and the different toxins contained in food and air, to the purely cognitive noise entering the brain through sensory channels – both the brain's processes and its structures are constantly confronted with endogenous & exogenous sources of «unpredictable structural variation». If a sort of replication of information-encoding entities were to take place in the brain, it would be highly improbable that it would not also be subject to such variation. Thus, when it comes to the human brain, we consider the third condition of HT's validity (2.6) to be fulfilled.

By its very definition, any activity of a material system involves consumption of energy, and learning, understood as an «information-processing, mind-transforming, constructivist process» (2.2), is not an exception. Thus, in the case of a material system like the brain, the second condition of HT's validity, i.e. «learning involves information-encoding entities which consume resources in order to reproduce» (2.5), can be considered to be necessarily valid if the first condition of HT's validity, i.e. «learning involves the reproduction of information-encoding entities» (2.4), is itself valid.

[margin: Of indirect evidence for intracerebral reproduction of information]

But here the thing gets complicated since, as far as we know, the existence of such «reproduction of information-encoding structures» within the brain has not been, as of November 2014, demonstrated with sufficient certitude. At least not directly[3]. But note that such reproduction of information has been at least indirectly implied since the 1950s, when neuroanatomic observations – primarily concerned with the effects of brain lesions upon the resulting behaviour of the brain – demonstrated that information in the brain is stored in a non-local fashion. As Karl Lashley, one of the greatest neuroscientists of the 20th century, who spent most of his life studying equipotentiality (i.e. the capacity of any part of a functional area to solve a particular task), once put it:

«The equivalence of different regions of the cortex for retention of memories points to multiple representations. Somehow, equivalent traces are established throughout the functional area.» (Lashley, 1950, p. 28)

[3] In 8.6 we shall see some theories interpreting certain neural phenomena not only as «reinforcement» but also as «reproduction of information».

There are at least two possible interpretations of such «non-local storage of information» based on "equivalent traces" and/or "multiple representations".
The first one is «connectionist»:

2.10 connectionist explanation of non-locality (txt)

Information stored in the brain cannot be localized at one particular spatial locus because it is spatially distributed among multiple synapses of the neural network.

end connectionist explanation of non-locality 2.10

In other terms, the connectionist interpretation states that a material representation of a cognitive structure S (or a cognitive function F) cannot be localized to this place «here», because it is also partially encoded «there» and «there» and «even there». From the «connectionist» perspective, it is indeed this distribution, this decentralization of information among synaptic weights, which gives a neural network both its robust character as well as its capacity for generalization.

But there exists also a second interpretation of the fact that information in the brain is not stored at one specific place:

2.11 alternative explanation of non-locality (txt)

Information stored in the brain cannot be localized at one particular spatial locus because it is materially encoded at multiple loci.

end alternative explanation of non-locality 2.11

From this other perspective, the brain stores the material representation of a cognitive structure S (or a cognitive function F) in multiple alternative places and|or in multiple forms. A trivial example illustrating the essential difference between the two approaches is presented on Figure 2, which visualises the "connexionist" and "alternative" representations of a corpus containing four tokens "MABA" and one token "MAPA".

[Figure 2: Distinction between "connectionist" (a) and "alternative" (b) representations of the same data. It is evident that the latter allows for more structural variation than the former.]

Far from being mutually exclusive with the first interpretation of the brain's non-locality, such an «alternative» representation has one advantage and one disadvantage. Given that each particular locus encodes a particular instance of structure S (or function F), i.e. S1, S2, S3, any individual instance can be modified while leaving all the other instances intact. Every instance is thus an independent individual with an individual history: that is a non-negligible advantage. The disadvantage is that, in order to get encoded, such "alternative" representations need more space than "compressed" connectionist representations, which superpose distinct instances one atop another in order to yield one ultimate representation.

Note that such diverse individual instances could well be confounded by an external observer who – had (s)he not been equipped with a fine-grained-resolution imaging apparatus – could easily believe to witness only the activation of one and only one neural circuit S. But closer inspection shall reveal – so we speculate – that the same stimulus and the same response is to be followed, respectively preceded, by the activation of distinct neural loci. Such an observation could potentially be interpreted as empiric evidence for the «alternative» interpretation of the non-local encoding of information in the brain.

It is true that, from a certain point of view, such an «alternative» way of storing information at multiple cerebral loci could be considered redundant. But redundancy does not necessarily mean suboptimality.
[margin: Of utility of redundancy in organic systems]

In the body of a multicellular organism, for example, the complete genetic code is stored in the nucleus of practically every single cell (erythrocytes and thrombocytes of higher vertebrates excepted). And it is verily this very fact – that every cell contains the schema for the whole – which gives, among other properties, to such an organism a somewhat «miraculous» capacity to regenerate itself. This being said, it can be further speculated whether the «miraculous» property of the brain called «plasticity» – i.e. the fact that the brain can, to some extent, restore the original knowledge even if some part of the brain was damaged or even fully lesioned – can also be explained, mutatis mutandis, in terms of redundant storage of information at multiple loci.

Now back to the question discussing the possibility of reproduction of information-carrying structures within the brain. If we accept that the «alternative» hypothesis concerning the brain's faculty to store information non-locally is at least partially valid, we may subsequently pose the question: «but how come that multiple individual instances of information S are stored at distinct loci L1, L2, L3?». A possible answer – «because sometimes, somehow[4], information from L1 is copied into L2» – could pave the way to experiments whose objective shall be to verify the 1st condition of HT's validity (2.4), demanding that learning should somehow involve the reproduction of information-encoding entities.

[4] For example during phases of «dreaming» or other activities of "repeating" and "rehearsing".

Note that for the purposes of the level of abstraction at which the Hard Thesis is postulated, it is secondary whether the replication of an information-encoding structure is materially realized as the creation of new material synapses, or as the synchronization of firings of neural circuits, or as the modification of oscillatory properties of certain fields, or as something completely different. The only thing, we believe, which is currently needed to offer ultimate neuroscientific evidence for the statement «learning is a form of evolution» is to directly observe the spontaneous intracerebral reproduction of one concrete chunk of information, from one locus to another. More formally, such a «reproduction» could be considered as taking place if, at the spatiotemporal locus T1L1, one would observe the emergence of a «child» representation R1 which is at least partially isomorphic with a «parent» representation R0 which has already been observed at the spatiotemporal locus T0L0 and is still observable at some spatial locus at time T1. Such an ensemble of observations would indicate that at least some part of the information was copied from L0 to L1 in a way which leaves the original representation R0 practically intact.

But until such neuroscientific evidence is given, the first condition of HT's validity cannot be considered as sound on empirical grounds. This logically implies the consequence that the whole Hard Thesis must be – given the current state of neuroscientific knowledge – considered as nothing else than a speculative conjecture. The only thing which we can do to make this dissertation less speculative, and hence more scientific, is to soften the Thesis by reducing the scope of the domain upon which it applies.
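Before the Thesis is softened, the difference between interpretations 2.10 and 2.11 – and the reason why only the latter yields individuals upon which an intracerebral evolution could operate – can be made palpable by a deliberately naive PERL sketch. It is an illustration of ours of the MABA/MAPA example of Figure 2, not a model of any actual neural mechanism, and all its names are ad hoc assumptions:

  use strict;
  use warnings;

  my @corpus = ('MABA', 'MABA', 'MABA', 'MABA', 'MAPA');

  # (a) "connectionist" storage: the five tokens are superposed into one
  # compressed representation (here reduced to mere type frequencies)
  my %connectionist;
  $connectionist{$_}++ for @corpus;          # MABA => 4, MAPA => 1

  # (b) "alternative" storage: every token is materially encoded at its
  # own locus, as an independent individual with an individual history
  my @alternative = @corpus;

  # only (b) permits one instance to vary - i.e. to be reproduced with a
  # copy error - while leaving all the other instances intact
  substr($alternative[2], 2, 1) = 'M';
  print "@alternative\n";                    # MABA MABA MAMA MABA MAPA

The price paid by the «alternative» store is space; what is gained is precisely a population of independently modifiable instances – the raw material which definition 2.1 requires.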
3 SOFT THESIS

The Soft Thesis (ST) is expressed as follows: «learning can be successfully simulated by means of evolutionary computation»

[margin: ST postulates explanatory and not ontological adequacy]

ST simply postulates a sort of explanatory adequacy between «learning» and «evolutionary computation». It does not, as HT does, express a statement about the ontological position of «learning»; it does not state what learning «is». It simply states that the behavior of a system whose functioning is in agreement with the principles of «evolutionary computation» could resemble the behavior of a system which is considered to be "learning".

3.1 evolutionary computation (def)

«Evolutionary computation uses computational models of evolutionary processes as key elements in the design and implementation of computer-based problem solving systems.» (Spears et al., 1993)

end evolutionary computation 3.1

[margin: Of EC and material substrate]

Evolutionary computation (EC) can thus be considered to be a subdiscipline of informatics. This does not mean that the principles of EC should be relevant only to the realm of silicon-based computers. It is so because informatics aims to yield a general theoretical framework for the description of information-processing systems, that is, a theory which could ideally be applied to both silicon-based (e.g. computers) and neuron-based (e.g. brain) computational devices[1]. In practice, however, hypotheses related to informatic science are best studied and most applied in relation to silicon-based universal Turing machines. Here are the reasons why it is so:

[margin: Advantages of EC simulations in silico]

• Minimal ethical concerns: it is considered ethically completely acceptable to program one's computer; it is less so to do that with one's neighbor, or his intestinal flora.

• Full initial control: a programmer can control practically all initial states of his informatic model as well as the initial form of the rules according to which the system shall subsequently behave.

• Reduced cost: construction, execution and evaluation of a model in silico is generally much less resource-demanding than construction, execution and evaluation of such a model in vitro or in vivo.

[1] And potentially to other types of computational devices. As of 2014, particularly promising seem to be the devices developed in the discipline of biomolecular computing. Note that the very essence of these devices (e.g. DNA-computers) is particularly favorable to problem-solving by means of evolutionary computation.

Since EC is a subdiscipline of informatics, it follows that the above-described utility of silicon-based machines for informatics is also appreciable in the domain of evolutionary computing. In fact, especially due to moral and security concerns, in silico seems to be the only way in which living evolution can be empirically studied on a time scale directly perceivable and interpretable by practically any human observer able to run a program on a computer. For this reason, when the Soft Thesis relates EC to «learning», it is principally a silicon-based computer which is supposed to be the subject of the «learning» process. With the exception of Part iii – where we shall mainly discuss the learning process as instantiated in human children – the computer shall, in the rest of this dissertation, be understood as an entity capable of learning. In 8.7 we shall discuss EC in somewhat closer detail.
There, we shall also introduce the most important EC paradigms, like «genetic algorithms» (8.7.1), «evolutionary strategies» (8.7.2) and «genetic programming» (8.7.3). But the particularities of these diverse approaches are not of great interest for the subject which concerns us in this chapter, that is: the elucidation of the meaning of the Soft Thesis. In order to do so, the term «successfully simulated» should be defined.

3.2 successful simulation (def)

A process P can be said to be «successfully simulated» by a system S iff the way in which the outputs oS1, oS2, ... oSn of the system S (given the inputs iS1, iS2, ... iSn) are generated is isomorphic, at a certain Level of Abstraction, to the way in which the process P reacts to the stimuli iP1, iP2, ... iPn when generating the outputs oP1, oP2, ... oPn. The morphism iPx → iSx can be understood as a representational mapping of inputs from the domain of the process P (i.e. «reality») into the domain of the simulation S.

end successful simulation 3.2

[margin: Of stimuli and input]

In less formal terms, a simulating system can be said to perform a «successful simulation» if and only if it tends to react to sequences of its inputs in the same way as the process-which-is-simulated reacts to the sequences of stimuli with which it is confronted. Note that, in order to distinguish the two, we use the term "stimulus" when we speak about the data entering the original physical process-which-is-simulated, and we use the term "input" when we speak about the data which enters the simulation.

In light of this definition, the Soft Thesis practically postulates that by implementing the precepts of Evolutionary Computation (Section 8.7), one can construct computational models which shall gradually transform inputs into outputs in a way that would be, for an external observer, indistinguishable from the mappings gradually produced by the process of «learning».

[margin: Morphisms among morphisms]

Let it be underscored that the above-mentioned definition speaks not only about simulating the outputs (results) of the process; it speaks also about the manner by means of which such results are obtained. It demands not only external but also internal adequacy between the simulation and the process which is being simulated. That is, NOT ONLY should the simulation yield the outputs which are the most accurate – i.e. which resemble the most the observable behaviours of the system – BUT ALSO should it execute the input → output mapping in a similar way. In the case of tentatives aiming to simulate human cognitive processes, we find it useful to speak about such "internal adequacy" in terms of cognitive plausibility.

3.3 cognitive plausibility (def)

«We label as "cognitively plausible" a model which tends to address some basic function/skill of human cognitive system not only by simulating, in a sort of "black-box apparatus", the mapping of inputs (stimuli, corpus data etc.) upon outputs (observed behaviors, results etc.), but also tends to faithfully represent – at least when interpreted from a certain LoA – the way how the respective function/skill is accomplished by a real human mind.» (Hromada, 2014b)

end cognitive plausibility 3.3
We believe that it is often pertinent to ask the question "is the computational model M of process P cognitively plausible?". In the case of the process of "learning" and its computational "machine learning" (ML) counterparts, an analysis through the prism of "cognitive plausibility" could potentially yield surprising results: while many ML models perform more than well in tasks which were previously the domain of exclusively human learning, they are far from being cognitively plausible.

[margin: Of researchers and engineers]

The extent to which the model successfully simulates the real process (i.e. its performance) and the extent to which the model does so in a way similar to the human mind (i.e. its cognitive plausibility) demarcate two independent axes which are not to be confounded. Engineers interested only in attaining the best results (i.e. the most adequate outputs, given the inputs) can often ignore the manner by means of which a natural system solves a given problem. On the other hand, researchers aiming to understand the functioning of the natural system are often more ready to accept a lesser performance of their model if ever it seems to exhibit the same properties and faculties as the natural system. Only in rare cases do such engineering (i.e. result-oriented) and scientific (i.e. knowledge-oriented) axes converge.

end soft thesis 3

4 SOFTER THESIS

An important question was left unanswered during our discussion of the Soft Thesis. That is: what shall be the object of the learning which is supposed to be successfully simulable by means of Evolutionary Computation? What shall be the nature of the stimuli iP1, iP2, ... iPn entering the learning process we aim to simulate? To concretely address this question, we are, once again, obliged to soften the Thesis somewhat more, thus obtaining the Softer Thesis (S²T), which can be expressed as follows:

«learning of natural language can be successfully simulated by means of evolutionary computation»

Contrary to ST, which relates EC to a very broadly defined notion of «learning», S²T specifies the object of «learning» which is supposed to be EC-simulable. It is the learning of natural languages.

4.1 natural language (def)

Natural language is a system composed of prosodic, phonetic, phonologic, morphologic, syntactic, semantic and pragmatic structures and principles which allows human beings to encode messages in a way that is comprehensible to other human beings.

end natural language 4.1

Further definitions related to natural language, notably those of the -ic terms, shall be presented in 9.2, for they are not inevitably needed for the elucidation of S²T's meaning. What we consider of bigger importance here is to introduce the reasons which have motivated us to study evolutionary computation in relation to the learning of natural languages.

[margin: Of essence of humanity]

4.2 why natural language? (aph)

Among all the faculties which distinguish man from other animals, the mastery of language is potentially the most salient one. This was already well known to the ancients, among which Aristotle, for example, defined man as ζῶον λόγον ἔχον, «an animal which word has». Centuries later, Wittgenstein (1953) indicated that the whole of philosophy, and potentially even more, can be understood as the realisation of some sort of perennial «language game»...
During the century which followed after de Saussure (1916) presented linguistics as a mainly positivist study of diverse forms of linguistic structures, linguistics had refined its methodology and terminology in a way such broad and deep that currently -as of 2014- among all other sciences studying one specific domain of human activity, linguistics has practically no equal in both quantity and quality of scientific knowledge which has been already accumulated. Thus, one reason why we have chosen to focus on the natural language is purely pragmatic one: natural languages are well-studied. For us it principially means that we are not obliged to «reinvent the wheel» and can instead use the already existing methodology and terminology, refer to past observations and experiments and potentially exploit the established corpora. Notably the discipline of developmental psycholinguistics, with its focus on the process of «language development» (Section 9.1) as well as an increasingly popular discipline of Natural Language Processing (NLP, Section 10.3), located on the border between linguistics and computer science, seem to be of particular importance in regards to potential proof of validity of S2 T. The second reason for focusing our interest on natural language is related to the role which natural language seems to play in development of every healthy human individual. This role is considered to be non-negligeable by those who consider the language to be the very fundament of human society ; and is considered to be vital by those who know that on its own, i.e. without society’s protective matrix, a human individual – and especially a human child – simply could not survive and/or develop full capacities of a self-realized member of homo sapiens sapiens species. Simply stated, language is a phenomen present in all cultures and as such can be considered to be the anthropological constant par excellence. By having already mentioned philosophy, anthropology, and linguistics, we consider it important to underline that the topic of natural language seems to be recurrent in all cognitive sciences. Neuroscience, for example, had fully established itself as an empiric science the very day when Broca (1861) realized that the damage of brain’s inferior frontal lobe of the dominant hemisphere leads to problems in production of language (he was later followed by Wernicke (1874) who noticed that the damage of superior temporal gyrus leads to troubles in language comprehension). Language plays also important role in both psychotherapy and psychology. In both Freundian and Jungian psychanalysis, in Rogerian person-centered psychotherapy, in Frankl’s logotherapy or individual 15 Of linguistics 1st reason 2nd reason Role of language in psychology 16 softer thesis Figure 3: Cognitive Hexagram The centroid of the hexagram 3rd reason psychology Adler (1976) and possibly in many other psychotherapeutic systems, language is considered to be therapeutic tool of utmost importance. What is more, in a very sound psychological "theory of multiple intelligences", as articulated by Gardner (1985a), is man’s faculty to understand and produce linguistic utterances important enough so that it merits to obtain the label of «verbal-linguistic» intelligence. Along with six other intelligences, this «linguistic intelligence» is considered to be the basic computational module of human cognitive system. 
Also within a theory coming from a different (Russian) tradition, that of Vygotsky (1987), language is considered to be a crucial component of man's psyche: in Vygotsky's framework, in fact, thinking itself is understood as a so-called inner speech.

[margin: The centroid of the hexagram]

All these arguments lead us to the belief that natural language is a topic which is localized very close to the centroid of the hexagram delimiting the object of study of all cognitive sciences (depicted on Figure 3). In one way or another, explicitly or implicitly, all cognitive sciences deal with natural language.

[Figure 3: Cognitive Hexagram]

[margin: 3rd reason]

On their own, these two reasons, «language is well studied» and «language is central», would yield, we believe, a sufficient answer to the question «Why does S²T relate evolutionary computing with the learning of natural language and not, for example, with the learning of deer-hunting or the learning of swimming?». But there is another, AI-related, reason for which we consider the study of language learning to be of particular importance in relation to evolutionary computing and/or computer science. More concretely, similarly to Turing (1950), who saw in language a means of addressing the question «Can machines think?» in an answerable way, we see in natural language a potentially first solid bridge between the realms of artificial and human beings.

end why natural language? 4.2

end softer thesis 4

5 SOFTEST THESIS

The Softest Thesis (S³T) is expressed as follows:

«Ontogeny of toddlerese can be successfully simulated by means of evolutionary computation.»

In this definition, the term "ontogeny" is used in a sense practically synonymous with "learning", the sole difference between the two being our intention to mark the notion that toddlerese is not only passively learnt, but that it emerges and is actively constructed. When it comes to toddlerese itself, it is hereby defined as:

5.1 toddlerese (def)

Toddlerese is a transitory protovariant of a natural language which is transferred from the minds of human adults into the mind of a child by means of repetitive exchange of sequences of contextualized symbols.

end toddlerese 5.1

[margin: Toddlerese is a transitory language]

Thus, the term "toddlerese" has a meaning similar to that of widely used terms like "first language" or "mother language". But contrary to these terms – which are used to denote not only the language which develops but also, and mainly, the end-state language resulting from such development – the term "toddlerese" is conceived to denote only a certain transitory state, or a sequence of states, in the development of such a "first language". In other terms, the mother language stays active in man's mind for the rest of (her|his) life, but the toddlerese language LT gradually disappears, or at least becomes latent, in parallel with the child's cognitive and physiological development away from the toddler state. The term "protovariant" is used to mark even more both this temporariness as well as its function as a base for the full-fledged language which shall unfold from LT in mid-childhood and later.

[margin: Age range of toddlerese]

More concretely, we define – for the purpose of this Thesis – toddlerese as the language LT emergent from the child's interactions with the world within the temporal interval (0, 2;6) years, id est between birth and two and a half years of age[1].

[1] In order to facilitate bridging between computer science and developmental psycholinguistics, we shall not use the decimal notation, but a year;month;week notation to speak about the child's age (e.g. 2;3;1 when speaking about a child which is two years, 3 months and one week old).

[margin: Of repetitivity and reproduction]

Another important notion included in the definition of «toddlerese» is repetitivity. The repetition of a symbol S can be understood as a sort of «reproduction» along the temporal axis, and in the following chapters we shall often interpret phenomena which repeat themselves not only as reactivations of the original schema, but rather in terms of the activity of multiple schemas which are being reproduced. We repeat; we restate; we reiterate: at a certain LoA, repetition can be understood as a form of reproduction.

[margin: The mirror metaphor]

But the most important terms of definition 5.1 are those of «transfer» and «exchange». Initially, these terms seem to denote divergent concepts: the term «transfer» carries with itself the connotation of a somewhat unidirectional movement from the origin (the mind of the parent) to the destination (the mind of the baby), while the term «exchange» denotes a bidirectional process whereby neither of the interactors plays the dominant role and both dispose of the faculty to partially influence or fully transform the behaviour of the other. But they can be reconciled through the metaphor of a «mirror». At first sight, a mirror is a completely passive device simply reflecting the objects which project (transfer) their shapes onto its surface. But by the very fact that the «mirror mirrors», it has also the power to influence the behaviour of the one who is looking into it, and thus to enter into exchange with the other. It is important to realize that since mirrors can be constructed differently, they can mirror things differently – the image they offer in exchange is thus not only dependent upon the-object-they-reflect, but is also determined by the material and by the way the mirror was physically forged[2].

[2] By interpreting the «tabula rasa» hypothesis as a particular case of the mirror metaphor hereby introduced, one could partially align the empirist and nativist doctrines.

[margin: Child mirrors its parents]

Something similar holds, mutatis mutandis, when it comes to the transfer of linguistic competence from the parent to the child. By means of diverse neural mechanisms (e.g. «mirror neurons» (Rizzolatti et al., 2008)), the child's plastic brain assimilates information from its environment. Among the objects of such assimilation we also count
The child somehow «parses» such information, processes, understands it and acts accordingly. This action is subsequently projected into external environment by diverse means – most prominent of which are undoubtably child’s vocal tract and child’s facial expressions – and by these means is the very environment transformed. Minds of parents including. We precise that by introducing the metaphor of the mirror we do not, of course, want to state that child is just a receptive informationassimilating entity passively reflecting its external environment. Such a statement would be completely contradictory to the fact of ceaseless activity which every healthy child continuously demonstrates. This fact of childs activity being in fact so salient, we propose to integrate it in the very definition of what the term «child » means: 5.2 child (def) Child plays. end child 5.2 It is by game that child mirrors the world; by playing the game which is pure activity without finality. Child sees around (her|him)self the world in movement, then understands that (s)he can also move and thus (s)he moves. Child’s way of mirroring is thus principially mirroring by playful action and it is by playful action that the child exerts influence in and upon its environment3 . Pages which shall follow, and notably the Part iii, shall furnish further illustrations of what we mean by «playful action» in regards to both language learning and evolutionary tâtonnement. Other computational language games shall also be introduced, mostly in form of programs able to induce sets of classes (10.4.7) or transcription rules (??) from diverse textual corpora. All programs shall apply the principles inherent to « evolutionary computing » in order to furnish some data validating (or falsifying) the hypothesis S3 T. On the other hand, none of the programs will be able to account for phonetic or pragmatic layers of languages under study. For this reason we are obliged to delimit, for the last time, the scope of our Thesis. end softest thesis 5 3 Notions of «game» and «playfulness» are not the same for adults and children. Adults often consider as hazardous activities which children consider as a game and vice versa, children often consider as serious the sandbox activities which are not at all perceived as such by adults. The transfer of adequate categories «game» and «serious» is an important goal of socialisation and possibly learning in general. 19 6 O P E R AT I O N A L T H E S I S The Operational Thesis (OT) is defined as follows: «Learning of toddlerese from its textual representations can be successfully simulated by means of evolutionary computation.» OT is thus very similar to the softest thesis, the only difference being the specification of the modality of representation of inputs in confrontation with which the toddlerese is supposed to be learnable, in simulation, by means of evolutionary computation. It is precised that such learnable modality is «textual». 6.1 text (def) Sequence of discrete graphemic symbols representing morphosyntactic and semantic contents of natural language utterances. end text Text does not have phonetic, prosodic and pragmatic layers. 6.1 This definition principially states that text encodes only subset of information which a normal « hearable » utterance contains. That is : semantic information related to its meaning and sense, and morphpsyntactic information related to its grammatical composition. C.f. sections 9.2.3 and 9.2.4 for discussion of «morphosyntax» and «semantics» respectively. 
By specifying the modality of data with which it shall operate, OT has drastically reduced the scope of applicability of the softest thesis. More concretely, by defining «text» as the modality of representation with which we shall confront our computational models, we have left aside the phonetic, phonologic, prosodic and pragmatic aspects of language. That is, aspects of language which have been - during practically all human history - crucial whenever the «speaker» intended to pass information to the «hearer». It is only during the last few centuries that communication by means of text became prominent, and only within the last decades has it become dominant, mainly because of the increasing role of computers in our lives. This is at least partially so because computers are essentially machines built for processing of sequences of discrete symbols, and that is what a text is – a sequence of discrete symbols. Contrary to the flux of spoken language, which is also a sequence, but one composed of units whose boundaries are often unclear and whose features overlap.

But the fact that practically no prosodic1, phonetic or pragmatic information shall be involved in our computational simulations does not mean that these simulations will not be concerned with natural language. On the contrary – it is evident, from the experience of every reader, that text indeed is a «communication system which human beings use to express information in a way comprehensible to other human beings» (4.1). In other terms: if the message is clear and if the productive linguistic competence of the writer overlaps with the receptive linguistic competence of the reader, the message shall make it possible for the reader to understand the writer. In this sense, text can be considered as a valid and functional modality of representation of natural language.

However, the question «Can text be also considered as a modality of representation sufficient for learning of language, and most notably of first language?» is still an open one. While some existing computational models indicate that at least for certain subproblems of language learning, like POS-induction (10.5) and grammar induction (10.6), the answer can be «yes», empiric observations of first language learning of human children also suggest that prosody and phonology play a crucial role (9.2.1) and that to ignore them would mean to miss a crucial component of the language learning process. But since children who are deaf, and thus without any access whatsoever to prosody or phonology, are able to learn sign language - and since sign language resembles text in the sense that it is visual and sequential - the operational reduction of language to text is potentially not a completely unreasonable one.

Thus, the operational definition language → text shall be principally used in sections dedicated to computational simulations of language learning. In other sections, however, this reduction shall not be applied and language will be most often discussed in its full extent, i.e. involving its phonetic, prosodic and pragmatic facets.

end operational thesis 6

1 One can argue that exclamation (!) or question (?) signs add a certain prosody to text since they can possibly represent an increasing or decreasing tone or accent. This is, however, debatable, because prosodic cues are present «along» the whole utterance while punctuation signs are normally located only at the sentence-final position.
7 SUMMA I

In the preceding sections we have introduced multiple theses which we consider as valid. These theses were discussed in deductive order, i.e. from the most general to the most specific one. The discussion started with the initial thesis «mind evolves» and the definition of mind as an «auto-organising set of structures and processes». Because such a thesis is so general that one may suspect that it is in fact more a tautological statement-of-faith than a verifiable hypothesis, a so-called Hard Thesis was subsequently introduced, stating that «learning is a form of evolution». Learning was principally defined as an information-processing constructionist process and it was further specified that the term «evolution» is meant in a Darwin-consistent sense, i.e. as an adaptive process based on reproduction, variation and selection of information-carrying structures. What was not yet explicitly said, however, is that both evolution and learning share an important feature: they involve trials and errors.

7.1 trial and error (def)
Most fundamental heuristics based on repetitive confrontation of system's activity with external and internal constraints and demands.
end trial and error 7.1

It is generally believed that in learning, trial events are related to other trial events only in a serial, vertical manner - one trial follows another one in time. On the other hand, in an evolutionary process, trials are related to other trials not only in a serial (i.e. one generation follows another) but also in a parallel (i.e. a generation consists of multiple individuals) manner. The principal sense of the Hard Thesis is to state that such a distinction is illusory and that the learning process almost always involves a sort of horizontality, a sort of population of parallelly co-existing structures which underlie and determine the observable manifestation of an individual "trial". What is more, HT postulates that, as in evolution, so in learning are such individual structures endowed with the faculty to reproduce the information which they encode into another locus. It was further postulated that

1. if ever a stochastic phenomenon can cause variation of the information content of an individual entity E generated by the reproduction process

2. if ever the information encoded by entity E influences the amount of resources consumed during the reproduction

3. and if ever such multi-iterative reproduction occurs within an environment having only a finite amount of resources

then, with logical necessity, a sort of adaptation of entities to their environment shall follow.

After proposing four conditions under which HT can be considered as plausible, it was further discussed whether the human brain could be potentially considered as such an "environment" for a sort of intracerebral evolutionary process. The brain was primarily defined as a finite physical object storing information in a non-local way. As a physical system, brain is subordinated to laws of physics like the 2nd law of thermodynamics: brain generates heat and heat can, with non-zero probability, cause variation of its own material content. Such variation of matter could subsequently result in variation of the information which the brain encodes. Thus, the very fact that brain is a finite physical system implies that the third and fourth conditions of HT's validity - when related to the learning faculty of the human brain - are to be considered as fulfilled.
Much more problematic are conditions 1 and 2 of HT's validity, relating to the question "does brain contain information-encoding structures able to reproduce?". Since reproduction of information-encoding entities has not yet been directly and irrefutably observed within the brain, conservative scientists are often reluctant to answer such a question in the affirmative. On the other hand, an "alternative" (2.11) explanation of the well-observed phenomenon of non-local storage of information implies that a process resembling reproduction - a process copying information to multiple loci - could, indeed, take place within the brain's wetware. It was also suggested that in natural ensembles like organisms, species or even societies, redundancy of information often makes systems more robust against unpredictable perturbations, and it was suggested that the same "robustness through redundancy" principle holds, mutatis mutandis, also for the human mind.

Unfortunately, the questions raised by HT are too wide to be addressed, in the extent they merit, in the limited scope of this dissertation. For this reason, the Hard Thesis is reduced into the soft form which states that learning can be simulated by means of evolutionary computation. ST thus does not postulate the ontological adequacy between the natures of evolutionary and learning processes - it simply postulates that computational models of the former can successfully simulate the latter. The notion of successful simulation was defined in terms of isomorphism between the input-to-output mapping of the simulation and the stimulus-to-reaction mapping of the process-which-is-simulated. The need to create not only externally but also internally adequate computational models of human faculties was also discussed. By introducing the notion of cognitive plausibility, we have proposed to focus not only on the result but also on the path which leads to the attainment of the result (Section 3.3).

Thus, when considering the realm of machines, ST postulates that there exists at least a certain class of problems - usually solved by means of traditional "machine learning" techniques - which could be also solved by means of evolutionary computation with similar or better results, and whose manner of functioning resembles the manner of functioning of the system which is simulated. A so-called Softer Thesis has subsequently specified that learning of natural languages is such a problem. Natural language was defined in a most liberal way as a "communication system which human beings use to express information to other human beings" (Section 4.1). Natural languages were chosen as the topic of our interest for three principal reasons: Primo, natural languages are well-studied. Secundo, natural languages are thematized, in one way or another, by all cognitive sciences. Tertio, the canonical (Turing's) method to answer AI's central question "Can machines think?" is principally a test evaluating machine's mastery in simulation of understanding and production of natural language utterances and discourses. Since the expression "learning of natural language" can cover too many phenomena, the S2T is further transformed into the Softest Thesis (S3T) which speaks only about the "learning of first language". First language is defined as a communication system transferred from the mind of the parent into the mind of a child by means of repetitive exchange of sequences of symbols.
The sequential and repetitive nature of first language was discussed, and the apparent contradiction between unidirectional "transfer" and bidirectional "exchange" was subsequently reconciled by means of the "mirror metaphor". Human child was defined in terms of its most distinctive propensity, i.e. the propensity to "play", to execute activity which lacks an absolute finality. The last thesis which has been presented is the Operational one. This specifies that the modality of representation, with which the "first language learning" evolutionary computation algorithms will be confronted, shall be textual. Given that text includes practically no phonetic, prosodic or pragmatic layer, the complexity of first language learning from text could be substantially reduced. The question whether such a reduction is not too strict was also addressed.

By positing 6 theses of varying degree of universality - i.e. Initial, Hard, Soft, Softer, Softest and Operational - we have delimited the level at which the rest of this dissertation shall operate. By defining terms like evolution, learning, form, brain, 2nd law of thermodynamics, evolutionary computation, successful simulation, cognitive plausibility, natural language, first language, child and trial & error, we have demarcated the basic form of a prism - a theory - through which one could see that the theses we posited hereby are, indeed, valid. This theoretical prism shall be polished in the following part.

Part II
PARADIGMS

Ideas are never static but develop across time and context, constantly cross-fertilizing with other currents of thought.
— Edwin F. Bryant

This part shall start the crossover of three seemingly unrelated scientific paradigms. In its initial chapter, devoted to Universal Darwinism, we shall introduce scientific disciplines and their respective theories which are either derived from - or at least consistent with - the Darwinian Theory of evolution, understood as gradual development of populations of information-carrying structures. Thus, not only biological evolution shall be discussed, but also evolutionary and genetic epistemology and psychology, memetics, neural darwinism and different branches and sub-branches of evolutionary computation.

In the subsequent chapter, devoted to Developmental Psycholinguistics, we shall introduce the fascinating field of study of acquisition of first language by human children. After defining a few necessary notions, we shall bring to the reader's attention a few widely accepted facts and give a brief historical overview of the most important language-acquisition theories. More concretely: associationist, behaviorist, nativist, constructivist and sociopragmatic theories shall be mentioned and thematised.

The last chapter of this part shall invite the reader into the realm of Computational Linguistics and Natural Language Processing. After a brief introduction into Formal Language Theory and its Grammar Systems Theory variant, the discussion shall be focused on the computational problems of concept construction, part-of-speech induction and grammatical inference. Some state-of-the-art computational models aiming to solve these problems shall be described in closer detail in order to pave the theoretical path towards future evolutionary models of first language acquisition.
8 UNIVERSAL DARWINISM

Universal Darwinism (UD) is a scientific paradigm regrouping diverse scientific theories extending the Darwinian theory of evolution and natural selection (Darwin, 1859) beyond the domain of biology. It can be understood as a generalized theoretical framework aiming to explain the emergence of many complex phenomena in terms of the interaction of three basic processes:

1. variation
2. selection
3. retention

According to the UD paradigm, interaction of these three components yields a «universal algorithm valid not only in biology, but in all domains of knowledge where we can extract informational entities – replicators, which are able to reproduce themselves with variations and which are subjects to selection» (Kvasnicka and Pospichal, 2007). This generic algorithm is nothing else than traditional Evolutionary Theory (ET) which, when considered as substrate-neutral, can be applied to such a vast number of scientific fields that it has been compared to a kind of «universal acid» which «eats through just about every traditional concept, and leaves in its wake a revolutionized world-view, with most of the old landmarks still recognizable, but transformed in fundamental ways» (Dennett, 1995).

UD is a source of both theoretical inspiration and practical precepts for many scientific disciplines, technological methods or artistic endeavours. The most prominent include:

1. biology
2. evolutionary art, e. psychology, e. music, e. linguistics, e. ethics, e. economics, e. anthropology, e. epistemology, e. computation
3. sociobiology (Wilson, 2000)
4. memetics (Blackmore, 2000)
5. quantum darwinism, neural darwinism, psycho darwinism
6. artificial life

et caetera. We shall now discuss some of them.

8.1 biological evolution

Evolutionary Theory was born when young Charles Darwin realised that the «gradation and diversity of structure» (Darwin and Bettany, 1890) which he had encountered among the mockingbirds of the Galapagos islands could be explained by a natural tendency of species to «adapt to a changing world». In parallel to Darwin's work, which was gradually clarifying the terms of variability and its close relation to environment-originated selective pressures, Gregor Mendel was assessing statistical distributions of colours of flowers of his garden peas in Brno, in order to finally converge to the fundamental principles of heredity. But it was only in 1953 that the double-helix structure of the material substrate of heredity of biological species – the DNA molecule – was described in the article of Watson et al. (1953).

In simple terms: in the DNA molecule, information is encoded as a sequence of nucleotides. Every nucleotide can contain one of four nucleobases and thus ideally carries 2 bits of information. A continuous sequence of three nucleotides gives a «triplet» which, when interpreted by the intracellular «ribosome» machinery, can be «translated» into an amino acid. Sequences of amino acids yield proteins which interact one with another in biochemical cascades. The result is a living organism with its particular phenotype, aiming to reproduce its genetic code.
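The informatic essence of this encoding can be illustrated by a minimal PERL sketch of the triplet-to-amino-acid «translation». The codon table below contains only five of the 64 mappings of the real genetic code, and the «gene» is a toy sequence of our own invention:

    # A fragment of the standard genetic code: every «triplet» of
    # nucleobases encodes one amino acid or a STOP signal.
    my %codon_table = (
        ATG => 'Met',     # methionine, also the canonical start codon
        TTT => 'Phe',     # phenylalanine
        GGC => 'Gly',     # glycine
        AAA => 'Lys',     # lysine
        TAA => 'STOP',    # one of the three stop codons
    );

    my $dna = 'ATGTTTGGCAAATAA';    # toy gene: 5 triplets, 15 nucleotides

    # Read the sequence triplet by triplet and «translate» it, as the
    # ribosome machinery does, into a chain of amino acids.
    my @protein;
    while ( $dna =~ /(...)/g ) {
        my $amino = $codon_table{$1} or last;
        last if $amino eq 'STOP';
        push @protein, $amino;
    }
    print join( '-', @protein ), "\n";    # prints: Met-Phe-Gly-Lys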
If, at a given time T, there are two organisms A and B whose genetic codes differ to such an extent that their phenotypes differ, and if ever the phenotype of organism A augments the probability of A's survival and reproduction in the external world W while B's phenotype diminishes such a probability, we say that A is better adapted to world W than B, or, more formally, that fitness(A) > fitness(B). Evolutionary Theory postulates that in case there is a lack of resources in world W, descendants of the organism B shall be gradually, after multiple generations, substituted by descendants of the more fit organism A. This is so because during every act of reproduction, the material reason for having a more fit phenotype – the DNA molecule – is transferred from parent to offspring, and the whole process is cumulative across generations.

It can, however, happen that the world W changes. Or a random (stochastic) event – a gamma ray, the presence of a free radical - can occur which would tamper with A's genetic code. Such an event – called «mutation» - shall result, in the majority of cases, in a decrease of A's fitness. Rarely, however, can mutations also increase it.

Another event which can transform the genetic sequence is called «crossover». It can be formalised as an operator which substitutes one part of the genetic code of the organism A with the corresponding sequence of organism B, and vice versa, the part of B with the corresponding part of A.

Figure 4: One-point and two-point crossovers. Figures reproduced from Morgan (1916).

It is indeed especially the crossover operation, first described in the article of Morgan (1916), which is responsible for the «mixing of properties» in case of a child organism issued from two parent organisms. In more concrete terms: the genetic code of such «diploid» organisms is always stored in X pairs of chromosomes. Each chromosome in the pair is issued from either the father or the mother organism which, during the process of meiosis, divide their normally diploid cells into haploid gamete cells (i.e. sperms in case of the father and eggs in case of the mother). It is especially during the first meiotic phase that crossover occurs, the content of the DNA sequences of two grand-parents being mixed and mapped during the crossover operation into the chromosome contained in the gamete which, if lucky, shall fuse with the gamete of another parent in the act of fecundation. The resulting «zygote» is again diploid and contains a mix of fragments of the genetic code originally present in the cells of all four grand-parents of the nascent organism. The zygote subsequently exponentially divides into a growing number of cells which differentiate from each other according to instructions contained in the genetic code, instructions which are triggered by biochemical signals coming from the cell's internal and external environment. If the genetic code endows the organism with properties that will allow it to survive in its environment until its own reproduction, approximately half of the genetic information contained in its DNA shall be transferred to the offspring organism. If not, the information as such shall disappear from the population with the death of the last individual who carries it.

end biological evolution 8.1

8.2 evolutionary psychology

We have already quoted Darwin's statement asserting that psychology in the distant future shall "be based upon a new foundation of the necessary acquirement of each mental power and capacity by gradation".
While two possible interpretations of this idea of Darwin's exist, the discipline of Evolutionary Psychology (EP) focuses only on the first one. It aims to explain diverse faculties of human soul & mind in terms of the selective pressures which moulded the modular architecture of the human brain during millions of years of its phylogenetic history. Its central premises state: «The brain's adaptive mechanisms were shaped by natural and sexual selection. Different neural mechanisms are specialized for solving problems in humanity's evolutionary past.» (Cosmides and Tooby, 1997)

In more concrete terms, Evolutionary Psychology explains quite successfully phenomena as diverse as the emergence of cooperation and altruistic behaviour (Hamilton, 1963); male promiscuity and parental investment (Trivers, 1972); or even the obesity of the current Anglo-Saxon population (Barrett, 2007). All this and much more is explained as a result of adaptation of homo sapiens sapiens (and its biological ancestors) to the dynamism of its ever-changing ecological and social niche. Thus, in the long run, EP tends to explain and integrate all innate faculties of the human mind in the evolutionary framework.

The problem with EP, however, is that in its grandiose aim to «assemble out of the disjointed, fragmentary, and mutually contradictory human disciplines a single, logically integrated research framework for the psychological, social, and behavioral sciences» (Cosmides and Tooby, 1997), it can sometimes happen that EP posits as innate, and thus explainable in terms of biological natural selection, cognitive faculties which are not innate but acquired. Thus it may be more often than rarely the case that whenever it comes to the famous "nature vs. nurture" (Galton, 1875) controversy, evolutionary psychologists tend to defend the nativist cause even where doing so means committing an epistemological fallacy1.

And what makes things even worse for the discipline of Evolutionary Psychology as it is currently performed is that the aforementioned Darwin's precognition has, aside from the nativist & biological one, also another interpretation. Id est, when Darwin spoke about mental powers and capacities acquired by gradation, one cannot exclude that he was speaking not only about gradation in the phylogeny of species, but also in the ontogeny of an individual.

end evolutionary psychology 8.2

1 If ever we accept the notion of falsifiability as an important criterion of acceptance or rejection of a scientific hypothesis (Popper, 1972), many hypotheses issued from EP would have to be rejected because, being based in the distant past which is almost impossible to access, they are less falsifiable than hypotheses explaining the same phenomena in terms of empiric data observable in the present.

8.3 memetics

The theory of memes, or memetics, is, in a certain sense, a counter-reaction to Evolutionary Psychology's aims to explain human mental and cognitive faculties in terms of innate propensities. Similarly to EP, memetics is also issued from the discipline of sociobiology, which was supposed to be «The extension of population biology and evolutionary theory to social organization» (Wilson, 2000).
But contrary to both EP and sociobiology, memetics does not aim to explain diverse cultural, psychological or social phenomena solely in terms of evolution operating upon biochemical DNA-encoded genes, but also in terms of evolution being realised on the plane of more abstract information-carrying replicators which Dawkins (1976) named «memes». The basic definition of the classical memetic theory is: «Meme is a replicator which replicates from brain to brain by means of imitation» (Blackmore, 2000). These replicators are somehow represented in the host brain as some kind of «cognitive structure» and, if ever externalised by the host organism – no matter whether in the form of a word, a song, a behavioral schema or an artefact – they can get copied into another host organism endowed with the device to integrate such structures2.

Similarly to genes, which often network themselves into mutually supporting auto-catalytic networks (Kauffman, 1995), memes can also form more complex memetic complexes, «memplexes», in order to augment the probability of their survival in time. Memes can thus do informational crossovers with one another (syncretic religions, new recipes from old ingredients or DJ mixes can be nice examples of such memetic crossover) or they can simply mutate, either because of the noise present during the imitation (replication) process, or due to other decay factors related to the ways how active memes are ultimately stored in brains or other information processing devices.

Memetic theory postulates that the cumulative evolutionary process applied upon such reproduction of information-carrying structures BETWEEN minds shall ultimately lead to the emergence of such complex phenomena as culture, religion or language. It can be thus considered to be mainly a theory of inter-mental reproduction of information. In complementarity with such a view, this dissertation claims the existence of reproduction of information WITHIN the individual mind. Thus, the theory hereby presented can be labeled as a theory of intra-mental memetics.

end memetics 8.3

2 In neurobiological terms, the faculty to imitate and hence to integrate memes from the external environment is often associated with «mirror neurons».

8.4 evolutionary epistemology

Epistemology is a philosophical discipline concerned with the source, nature, scope, existence and diversity of forms of knowledge. Evolutionary epistemology (EE) is a paradigm which aims to explain these by applying the evolutionary framework. But under the one EE label, at least two distinct topics are, in fact, addressed:

1. EE1 aims to explain the biological evolution of cognitive and mental faculties in humans and animals

2. EE2 postulates that knowledge itself evolves by selection and variation

EE1 can be thus considered as a sub-discipline of EP (Section 8.2) and, as such, is subject to EP-directed criticism. EE2, however, is closer to memetics since it postulates the existence of a second replicator, i.e. of an information-carrying structure which is not materially encoded by a DNA molecule.

The distinction between EE1 and EE2 can also be characterised in terms of «phylogeny» and «ontogeny». Given the definition of phylogeny as

8.4.1 phylogeny (def)
Process which shapes the form of a species.
end phylogeny 8.4.1
and contrasting it to ontogeny defined as

8.4.2 ontogeny (def)
Process which shapes the form of an individual.
end ontogeny 8.4.2

we find it important to reiterate that while EE1 is more concerned with knowledge as a result of the phylogenetic moulding of DNA, EE2 implies the moulding of non-DNA replicators in both phylogeny and ontogeny. Thus, the notion of EE2 can be subsequently analysed into two sub-notions:

• EE2-1: Knowledge can emerge by variation & selection of ideas shared by a group of mutually interacting individuals (Popper, 1972)

• EE2-2: Knowledge can emerge by variation & selection of cognitive structures within one individuum

This distinction is homologous to the distinction between inter- and intra-mental memetics, as discussed in Section 8.3.

It is worth noting that while the so-called recapitulation theory stating that «ontogeny recapitulates phylogeny» (Haeckel, 1879) is considered to be discredited by many biologists and embryologists, it is still held as valid by many researchers in human and cognitive sciences. In anthropology, for example, some scientists observe a «strong parallelism between cognitive development of a child and ... stages suggested in the archeological record» (Foiter, 2002). Also in relation to pedagogy it was observed that «education is a repetition of civilization in little» (Spencer, 1894).

8.4.3 individual creativity

In fact, evolutionary epistemology was born with the attempt of D.T. Campbell to explain both creative thinking and scientific discovery in terms of «blind variation and selective retention» (Campbell, 1960) of thoughts. Departing from the introspective works of the mathematician Henri Poincaré, who stated: «To create consists precisely in not making useless combinations and in making those which are useful and which are only a small minority. Invention is discernment, choice... Among chosen combinations the most fertile will often be those formed of elements drawn from domains which are far apart... What is the cause that, among the thousand products of our unconscious activity, some are called to pass the threshold, while others remain below?» (Poincaré, 1908), Campbell suggests that what we call creative thought can be described as a Darwinian process whereby previously acquired knowledge blindly varies in the unconscious mind of the creative thinker, and only some such structures are subsequently selectively retained.

The theory which interprets the creative process as an evolutionary one has been subsequently developed by Dean Simonton, who answers his rhetorical question "How do human beings create variations?" with a UD-consistent answer: «One perfectly good Darwinian explanation would be that the variations themselves arise from a cognitive variation-selection process that occurs within the individual brain.» (Simonton, 1999)

end individual creativity 8.4.3

8.4.4 genetic epistemology

«The fundamental hypothesis of genetic epistemology is that there is a parallelism between the progress made in ... organization of knowledge and the corresponding formative psychological processes. Well, now, if that is our hypothesis, what will be our field of study? Of course the most fruitful, most obvious field of study would be reconstituting human history: the history of human thinking in prehistoric man. Unfortunately, we are not very well informed about the psychology of Neanderthal man or about the psychology of Homo siniensis of Teilhard de Chardin.
Since this field of biogenesis is not available to us, we shall do as biologists do and turn to ontogenesis. Nothing could be more accessible to study than the ontogenesis of these notions. There are children all around us.» (Piaget, 1974)

When understood only superficially, Piaget's developmental theory of knowledge, which he himself called Genetic Epistemology (GE), may seem to be utterly non-Darwinian. Its concern is not the phylogeny of the human species; it is not even concerned with biochemical genes. In fact, during practically all his fecund life-long research, Piaget had focused solely on the study of the ontogeny of diverse cognitive faculties in human children. Thus, Piaget uses the term «genetic» to refer to a more general notion of «heredity», defined as a structure's tendency to guard its identity through time. These structures, which he called «schemas», can be defined as «a basic set of experiences and knowledge that has been gained through personal experiences that define how things should be and act in the person's environment. As the child interacts with their world and acquires more experiences these schemes are modified to make sense, or used to make sense of the new experience.» (Bee and Boyd, 2000)

There are basically two ways how such schemes can be modified. Either they «assimilate» data from the external environment. Or, if ever such assimilation is not possible - because the child's cognitive system simply cannot match the perceived external datum with an internal pre-existing category - the process of «accommodation» takes place, which transforms the internal category to match the external datum. Ultimately, the set of schemes gets so out-dated or so altered by past modifications that it is not useful anymore. Whenever such an «equilibration» occurs, the old set of schemas is rejected and the child tends to «start fresh with a more up-to-date model» (Bee and Boyd, 2000), thus attaining a new substage or stage of its development.

In the Piagetian system – which is based on very precise yet exhaustive observations of dozens of children, including his own – the order of stages is fixed and it is very difficult, or even fully impossible, for the evolving psyche to attain the pre-operational stage 2 or the concrete operational stage 3 if it has not mastered all that is to be mastered during the sensorimotor stage 1.

1. sensorimotor stage - repetitive but playful manipulation of objects without goal
2. egocentric stage - imitation of behavioral schemas of others without understanding of why it is done
3. cooperative stage - coordination of one's activity with one's environment
4. autonomous stage - understanding of procedures which allow to change rules governing one's environment

Given that the GE paradigm involves

• heredity – schemes tend to keep their identity in time
• variation – schemes are altered by the environment-driven assimilation or accommodation3
• selective pressures – only those schemas which are most well adapted to the environment and/or form the most functionally fit complexes with other schemas shall pass through the equilibration milestone

it can be briefly stated that Piaget's GE could be aligned with ET and UD. And what is more, it may be the case that the notion of Piagetian stages is consistent with the notion of attractors or locally optimal states whose emergence is, according to complex system theory (Kauffman, 1995; Flake, 1998), inevitable in a system as complex as child's brain, mind and psyche definitely is.

3 Note that in terms of EC, one can relate the Piagetian notion of assimilation to an operator of local variation which attracts the cognitive system to a locally optimal agreement with its environment, while accommodation suggests an interpretation in terms of more global variation operators (like cross-over), potentially allowing the CS to adapt to its physical and social environments in a more globally optimal way.
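The algorithmic skeleton of this assimilation/accommodation/equilibration cycle can be caricatured in a few lines of PERL. The sketch below is merely a toy model of our own making - schemas are reduced to numeric prototypes, «matching» to a fixed tolerance, and equilibration to a strain counter - but it makes the Piagetian loop concrete:

    # Toy Piagetian schema: a numeric prototype plus a count of how often
    # it had to be accommodated (i.e. transformed to match reality).
    my @schemas    = ( { prototype => 10, strain => 0 } );
    my $TOLERANCE  = 3;    # how far a datum may lie from a prototype
    my $MAX_STRAIN = 2;    # accommodations tolerated before equilibration

    for my $datum ( 11, 9, 20, 22, 40 ) {
        my ($schema) = grep { abs( $_->{prototype} - $datum ) <= $TOLERANCE } @schemas;
        if ($schema) {
            # Assimilation: the datum fits a pre-existing category.
            print "assimilated $datum into schema $schema->{prototype}\n";
        }
        else {
            # Accommodation: transform the closest category to match the datum.
            my ($closest) = sort {
                abs( $a->{prototype} - $datum ) <=> abs( $b->{prototype} - $datum )
            } @schemas;
            $closest->{prototype} = ( $closest->{prototype} + $datum ) / 2;
            $closest->{strain}++;
            print "accommodated schema towards $datum (now $closest->{prototype})\n";

            # Equilibration: an over-strained set of schemas is rejected and
            # the child «starts fresh with a more up-to-date model».
            if ( $closest->{strain} > $MAX_STRAIN ) {
                @schemas = ( { prototype => $datum, strain => 0 } );
                print "equilibration: starting fresh around $datum\n";
            }
        }
    }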
end genetic epistemology 8.4.4
end evolutionary epistemology 8.4

8.5 evolutionary linguistics

Analogically to Evolutionary Epistemology, the objects of interest of EL subdivide it into at least two branches:

• EL1: the study of the origin and development of faculties related to comprehension and production of linguistic signal by homo sapiens sapiens and its ancestors

• EL2: the study of the historical development of diverse languages

EL1 can be thus considered to be closely related to Evolutionary Psychology (Section 8.2) and discusses phylogenetic evolutionary phenomena taking place during hundreds of thousands of years, while EL2 can be said to take place in historical time (order of ten thousand years and less) and is thus closely related to disciplines like anthropology, culturology, comparative grammar and memetics. In simple terms, EL2 is dedicated to the study of linguistic ethnogeny.

8.5.1 ethnogeny (def)
Process which shapes the form of a human community.
end ethnogeny 8.5.1

EL2's central tenet that "language changes in time" is far from being new. Socrates believed that «...the primeval words (πρώτα ονόματα) have already been buried by people who wanted to embellish them by adding and removing letters to make them sound better, and disfiguring them totally, either for aesthetic considerations or as a result of the passage of time...» (Plato, 80BC), and the best of Plato's students was well aware that change can be expressed in terms of

• insertion
• deletion
• transposition
• substitution

(Aristotle, 42BC). Ancient syntacticians like Apollonius Dyscolus could subsequently apply such notions to describe particular linguistic phenomena (Householder, 1981).

It was, however, only centuries later that men of science realized that language change is far from being a linear degeneration of a primordial ideal, as the ancients had mostly believed. On the contrary: Sir William Jones discovered that Sanskrit is so similar to the Greek, Celtic, Gothic and Latin languages that they must all have «sprung from some common source, which perhaps no longer exists» (Jones, 1788). The subsequent realization that these similarities make it possible to cluster languages into hierarchical taxonomies, combined with the trivial fact that languages exchange their internal contents (e.g. word-borrowing), all this has led to an ever stronger belief that languages can be studied as living entities. Darwin himself was well aware of the parallelism between biology and linguistics: «The formation of different languages and of distinct species, and the proof that both have developed through a gradual process, are curiously parallel... We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation.» (Darwin, 1859)

Figure 5: Schleicher's Stammbaum of the family of Indo-European languages. Reproduced from Schleicher (1873).
Practically at the same time as Darwin was preparing the opus which was to change biology forever, the existence of such a parallelism was articulated on the linguistic side by Schleicher (1873) in his Stammbaum ("tree") theory of Indo-European languages. By publishing his theory, Schleicher had in fact triggered a completely new form of evolution - that is, the evolution of linguistic theories. In the dozen years that followed, the influx of articles related to the Stammbaumtheorie was so high that the Société linguistique de Paris decided, in 1866, to refuse any further articles on the subject. Which is somewhat a pity, because many theories which emerged during that period - for example "the wave theory" (Schmidt, 1872), taking into consideration not only temporal but also spatial (i.e. geographic) aspects of language spread - indeed anticipated the diffusion models which became prominent in biology only a century later.

One of the reasons for the Société's "ban" was the fact that languages, contrary to "biological species", do not leave fossil traces behind them, and therefore any endeavour to understand their distant past or even origin is only speculative and inconsistent with the empiric method of science. EL simply does not go well with the principle of scientific parsimony, the omnipresent Occam's razor. Notwithstanding this critique, which stays, we believe, as valid today as ever4, the advent of computers allowed EL2 to catch a second breath. An often criticized but nonetheless very important step in making EL computer-positive was the introduction of the "lexicostatistical" and glottochronological methodology originally based on cognate distance matrices (Swadesh, 1952). These numeric matrices, whose elements Mij denote the number of cognates - i.e. the number of similarly sounding words having the same meaning - subsequently allowed to computationally "discover" and (fals|ver)ify hypotheses concerning the kinship of existing or past languages. An article of Atkinson and Gray (2005), from which we partially reproduce Table 1 describing the parallelism between biological and linguistic evolution, offers a satisfactory introduction to some of EL2's state-of-the-art computational models, some of which pretend to unveil knowledge about the ancestry of languages as far back as the end of the last ice age (Pagel et al., 2013).

    biological evolution        linguistic evolution
    Discrete characters         Lexicon, syntax and phonology
    Homologies                  Cognates
    Mutation                    Innovation
    Horizontal gene transfer    Borrowing
    Hybrid plants               Creole languages

Table 1: Conceptual parallels between biological and linguistic evolution. Table partially reproduced from Atkinson and Gray (2005).

Section 8.7.5 shall discuss a so-called "Evolutionary Language Game" computational model. Since ELG addresses - and some may say that it also answers - the question "How may a coordinated system of sound-meaning mappings evolve ex nihilo in a community of mutually interacting agents?", it can be posited at the very border between EL1 and EL2. According to Pinker, who is one of the most famous proponents of the so-called "nativist" theory in developmental psycholinguistics (c.f. item 10.2), models like ELG «suggest ways of connecting the evolution of language to other topics in human evolution, allowing each to constrain the others» (Pinker, 2000).

4 C.f. the footnote in Section 8.2 or the citation of Piaget in Section 8.4.4 for other reformulations of the same critique.
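To give the lexicostatistical recipe a concrete shape, here is a minimal PERL sketch - our own illustration, not Swadesh's actual procedure - in which two meaning-aligned words count as «cognates» whenever their Levenshtein distance (the minimal number of insertions, deletions and substitutions transforming one word into the other) is small relative to their length; summing such pairs over a word list yields one element Mij of the cognate matrix. The word pairs are the Czech-Slovak examples which shall reappear below, and the 0.5 threshold is arbitrary:

    # Levenshtein distance: minimal number of insertions, deletions and
    # substitutions needed to turn string $s into string $t.
    sub levenshtein {
        my ( $s, $t ) = @_;
        my @d = ( [ 0 .. length $t ] );    # distance matrix, row 0 pre-filled
        for my $i ( 1 .. length $s ) {
            $d[$i][0] = $i;
            for my $j ( 1 .. length $t ) {
                my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
                my @candidates = (
                    $d[ $i - 1 ][$j] + 1,              # deletion
                    $d[$i][ $j - 1 ] + 1,              # insertion
                    $d[ $i - 1 ][ $j - 1 ] + $cost,    # substitution (or match)
                );
                ( $d[$i][$j] ) = sort { $a <=> $b } @candidates;
            }
        }
        return $d[ length $s ][ length $t ];
    }

    # Meaning-aligned toy word lists ("eight", "fog", "ant"):
    my @slovak = qw(osem hmla mravec);
    my @czech  = qw(osm mlha mravenec);

    # One element M_ij of the cognate matrix: the number of aligned pairs
    # whose relative edit distance falls under the cognacy threshold.
    my $cognates = 0;
    for my $k ( 0 .. $#slovak ) {
        my $dist = levenshtein( $slovak[$k], $czech[$k] );
        my $max  = length( $slovak[$k] ) >= length( $czech[$k] )
                 ? length( $slovak[$k] ) : length( $czech[$k] );
        $cognates++ if $dist / $max <= 0.5;
    }
    print "M(Slovak,Czech) = $cognates\n";    # all three pairs pass the threshold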
But there is another question related to the evolution of language which has not yet been sufficiently resolved by ELG nor by any other EL theory5. That is: "Why are languages subject to some types of changes and not to others?" Why indeed is the history of languages so full of insertions (e.g. "osm" in Czech and "osem" in Slovak), deletions (e.g. "mravenec" in Czech and "mravec" in Slovak), substitutions (e.g. all instances of what is the diphthong "ie" in Slovak are pronounced in Czech as the long vowel "í") and metathetic transpositions (e.g. "hmla" in Slovak and "mlha" in Czech)? Our answer to this question is as follows: because the changes observable in the ethnogeny of diverse languages, dialects and accents are, at their origin, triggered by "variation operators" inherent to every fundamental unit of any linguistic community, which is, of course, an individual human mind. Stated more simply: the reasons why language forms develop in the way they develop are to a great extent cognitive. Part iii shall present somewhat more concrete evidence of the activity of such operators of intramental variation, which potentially influence the process of language production in human children.

end evolutionary linguistics 8.5

5 We set aside the so-called neo-grammarian school of historical and comparativist philology, which believed that language change can be described in terms of sequences of universally applicable "laws which suffer no exceptions". We set them aside because we are strongly persuaded that evolution not only does suffer "exceptions" but, in fact, endorses them in order to be fully operational.

8.6 neural and mental darwinism

It was already the evolutionary biologist John Maynard-Smith who remarked that «there is a similarity between the dynamics of genetic selection and the operant conditioning paradigm of Skinner» (Maynard Smith, 1986)6. But it was only the book Neural Darwinism: The Theory of Neural Group Selection of the Nobel-prize winner Edelman (1987) which, as the first in the history of science, described in fine-grained detail how a process similar to evolution could be potentially instantiated within the human brain. Stated in one sentence, Edelman's theory postulates that «complex adaptations in the brain arise through process similar to natural selection» (Fernando et al., 2012). Stated in more fine-grained detail, the theory shows how epigenetically influenced interactions of "cell adhesion molecules" and "substrate adhesion molecules" can lead to the generation of a so-called primary repertoire. Synapses within diverse groups of this repertoire are subsequently, during postnatal ontogenesis, "differentially amplified" into a secondary repertoire by a process which is, according to Edelman, functionally equivalent to the process of selection as known in evolutionary theory. Edelman also believes that well-known processes like cell proliferation, cell migration, cell death, neurite branching or synaptic pruning are potentially also governed by analogic selective processes.

It is not possible for us to explain Edelman's tour de force in the limited scope of this section, and it would be, in fact, an act of scientific dishonesty to do so, since as computational linguists we do not feel competent to express any definite statement about truth or falsity in such an expert domain as neurology definitely is.

6 Skinner's behaviorist theory of verbal behaviour will be more closely discussed in 9.4.1.
But we nonetheless consider it important to emphasize that Edelman is definitely not alone in his view of things. Thus, for example, important authorities of the continental neurological tradition were not afraid to state that «the thesis we wish to defend...[is] that the production and storage of mental representations, including their chaining into meaningful propositions and the development of reasoning, can also be interpreted, by analogy, in variation-selection (Darwinian) terms within psychological time-scales.» (Dehaene and Changeux, 1989)

Figure 6: Possible mechanism of replication of patterns of synaptic connections between neuronal groups. Reproduced from Fernando et al. (2012).

It has to be noted, however, that both Edelman's "neural" and Dehaene's and Changeux's "mental" Darwinism describe processes fundamentally based on variation and selection, but not on replication, of information-encoding neuronal groups. It is a well-known fact that neural cells do not reproduce, and the possibility that the reproduction of neurons would yield a material basis for the existence of intracerebral replicators is thus a priori excluded. As is pointed out by Fernando et al. (2012), this fact in itself, however, does not mean that mental or neural darwinism are not evolutionary. They are evolutionary because one can postulate a sort of evolution for any system whose global development is governed by the famous Price's theorem (Price et al., 1970), which is - so the authors argue - also the case for the development of neuronal group structures.

The same authors also suggest a possible process of replication of information between neuronal groups. This process - fundamentally based upon a well-known form of Hebbian learning7 called "spike-timing dependent plasticity" (STDP) and upon the existence of a neural "topographic" map between the original replicans (circuit A) and the following replicandum (circuit B) - can be described as follows: «If a neuronal circuit exists in layer A and is externally stimulated to make its neurons spike, then due to a topographic map from layer A to layer B, neurons in layer B will experience similar spike pattern statistics as in layer A. If there is STDP in layer B between weakly connected neurons then this layer becomes a kind of causal inference machine that observes the spike input from layer A and tries to produce a circuit with the same connectivity, or at least that is capable of generating the same pattern of correlations.» (Fernando et al., 2012) The whole process is visualised on Figure 6.

While we strongly believe that such a mechanism does indeed operate in the human cortex, we reiterate what was already stated in 2.11: under the current state of knowledge, the existence of neural replicators is not indisputably demonstrated and stays speculative. But in regards to the overall objectives of this dissertation, this speculative nature of intracerebral replicators does NOT pose any hindrance. This is so because our aim is to apply evolutionary theory to explain linguistic phenomena. And linguistic phenomena are principally intangible, mental, high-order phenomena which are potentially irreducible to the tangible and physical phenomena labelled as "neural". On the other hand, it may be the case that a sort of theory of intramental evolution shall allow us to bridge the "explanatory gap" between the tangible and the intangible, the neural and the mental. Thus, for example, whenever we shall emit a hypothesis like "canonical babbling is a sort of replicatory process" (9.2.2), we thereby tacitly imply that neural mechanisms, like the one presented on Figure 6, are to be sought for in the Broca's area of one-year-old infants.

end neural darwinism 8.6

8.7 evolutionary computation

Evolution can be thought of as a universal, generic algorithm. But our growing knowledge of evolution serves not only descriptive and explanatory purposes: it is becoming normative. Thus, not only can «evolutionary theory» serve us to explain diverse phenomena around us, it can also be exploited for finding solutions to diverse problems. Many researchers in informatics have already realized that diverse "evolutionary precepts" offer useful heuristics making it possible to discover (quasi-)optimal ways out of a wide range of concrete practical issues.

7 The principle of Hebbian learning shall be more closely discussed in Section 9.4.1.
On the other hand, it may be the case that a sort of theory intramental evolution allow us to bridge the "explanatory gap" between tangible and intangible, neural and mental. Thus, for example, whenever we shall emit hypothesis like "canonical babbling is a sort of replicatory process" (9.2.2), we hereby tacitly imply that neural mechanisms, as the one presented on Figure 6, are to be sought-for in Broca’s area of one year old infants. end neural darwinism 8.6 8.7 Universal Algorithm evolutionary computation Evolution can be thought of as a universal, generic algorithm. But our growing knowledge of evolution serves not only descriptive and explanatory purposes. It is becoming normative. Thus, not only can « evolutionary theory » serve us to explain diverse phenomena around us, it can be also exploited for finding solutions to diverse problems. Many researchers in informatics have already realized that diverse of "evolutionary recepts" offer useful heuristics making it possible to discover (quasi)-optimal ways out of wide range of concrete practical isses. 7 Principle of Hebbian learning shall be more closely discussed in Section 9.4.1. 8.7 evolutionary computation Figure 7: Basic genetic algorithm schema. Reproduced from Pohlheim (1996) Evolutionary computing (3.1) approaches differ from classical optimization methods in following aspects : • using a population of potential solutions in their search • using probabilistic, rather than deterministic, transition rules » • using «fitness» instead of function derivatives Kennedy et al. (2001) First computational models which have the above-mentioned attributes were named « evolutionary strategies » by Rechenberg (1971), «genetic algorithms» by Holland (1975) and « evolutionary programming » by Fogel et al. (1966). These paradigms, along with the «genetic programming » paradigm later introduced by Koza (1992) constitute the most important sub-branches of «evolutionary computation» Sekaj (2005) branch of computer and informatic science. 8.7.1 genetic algorithms Basic principle of « genetic algorithms » is illustrated on Figure 7. GAs iteratively produce populations of data stuctures. Each individual data structure is a possible solution, population of every generation is thus a set of diverse solutions. Every individual solution is encoded as a vector of values (also called « chromosome » or « genome ») which can either vary or be copied verbatim from one to generation to the other. Designer choice related to the way how the problem solutions are encoded in chromosomal vectors, e.g. the type (Boolean ? Integer ? Float ? Set? ) of different elements of the vector is also a crucial one and can often determine whether the algorithm shall succeed or fail. In every generation – i.e. in every iteration of the algorithmic cycle represented by the circle on Figure 7 - all N individuals in the population are evaluated by the fitness function. Every individual thus obtains the « fitness » value, which subsequently governs the « selection » procedure choosing a subset of individuals from the current 41 42 universal darwinism generation as those, whose genetic information shall reproduce into next generations. More on fitness functions in 8.7.1. Another important design decision which every programmer of GAs have to do, is to choose the selection operator. An operator which is widely used, and which we shall also implement in all future EC simulations (c.f. 10.4.7, ?? & volume 2) is the «fitness proportionate selection». 
This operator, also called the «roulette wheel operator», normalizes the fitness $f_i$ of individual $i$ into the probability $p_i$ of its survival by means of the formula:

$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j}$

where N is the number of individuals in the population. Once these probabilities are calculated for the different individuals, one can use them to guide the process of selection of the individuals which shall be reproduced into the next generation. Minimal PERL source code for such a fitness proportional selection operator is:

Fitness Proportional Selection (SRC)

    sub fitness_to_proba {
        # Normalize raw fitness values into a probability distribution:
        # p_i = f_i / sum_j f_j
        my @weights = @_;
        my @dist    = ();
        my $total   = 0;
        foreach (@weights) { $total += $_; }
        for my $weight (@weights) {
            push @dist, $weight / $total;
        }
        return @dist;
    }

    sub weighted_rand {
        # Spin the «roulette wheel»: return index $i with probability $dist[$i].
        my @dist = @_;
        while (1) {    # retry protects against floating-point rounding at the tail
            my $rand = rand;
            my $i    = 0;
            for my $w (@dist) {
                return $i if ( $rand -= $w ) < 0;
                $i++;
            }
        }
    }

    # Usage example with three hypothetical fitness values:
    #   my @proba  = fitness_to_proba( 3, 1, 6 );   # (0.3, 0.1, 0.6)
    #   my $parent = weighted_rand(@proba);         # index 2 in ~60% of spins

end fitness proportional selection (src) 8.7.1.0

Another widely used selection operator is the so-called tournament selection, based on the repeated selection of the best individual of a randomly chosen subset of the population. The tournament selection operator offers multiple advantages: for example, by tuning the tournament size parameter one can easily adjust the selection pressures favoring or disfavoring fit candidates. And it can also be used in parallel computation scenarios.

Once the «most fit» candidates are selected by the selection operator, they are subsequently mutually recombined by means of «crossover» operators and/or modified by means of «mutation» operators. Many different types of selection, mutation and crossover operators exist. For the purpose of this work, let us just note that the probabilities of occurrence of mutation or crossover have to be fairly low, otherwise no fitness-increasing information could be transferred among generations and the whole system would tend to present non-converging chaotic behaviour (Nowak et al., 1999).

Another useful strategy, which guarantees that maximal fitness shall either increase or at least stay constant, is called elitism. In order to implement this strategy, one simply guards one (or more) individual(s) with the highest fitness unchanged for the next generation, thus protecting «the best ones» from variations which would, most probably, decrease rather than increase the fitness8. Yet another widely used approach reinforces the selection pressure by removal of the weakest individuals. Both the elitist «survival of the fittest» and the contrary «removal of the weakest» are often combined within the sequence of instructions which, altogether, form a genetic algorithm.

The selection of the most fit individuals from the old generation, their subsequent replication and/or recombination and diversification yields a new generation. Because individuals with lower fitness have been either completely or at least partially discarded by the selection process, one can expect that the overall fitness of the new generation shall be higher than the fitness of the old generation. With a little bit of luck, one can also hope that the most fit individuals of the new generation shall be a little bit fitter than the most fit individuals discovered in the old generation – this can happen if ever a «benign» mutation has occurred, i.e. a modification which has moved the individual from a lower point on the «fitness landscape» to a somewhat higher state.

end genetic algorithms 8.7.1

8 Note that in nature, elitism is often but not always the case. For it can happen that, due to stochastic factors, the most fit individuals die before they succeed to reproduce the information they encode. But, in such a case, are such individuals truly "the most fit"?

Fitness functions and fitness landscapes

The core component of every genetic algorithm is the objective «fitness function», able to attribute a cardinal value or an ordinal rank to any individuum in the population of potential solutions. In other terms, the fitness function yields the criterion according to which one candidate individuum is evaluated as a «more fit» solution, in regards to the problem under study, than other potential solutions present in the population. The choice of a good fitness function determines, more than anything else, the success or failure of a GA as a means to find the solution of the problem at hand. Ideally, the fitness function is a mathematical representation of the very essence of the problem which is to be solved.

For purely mathematical problems, the choice of the fitness function is straightforward - the fitness function is simply the function whose global optimum one wants to find. Also in many practical implementations - notably those of optimization of physical components - the fitness function is often evident: one can deduce it from well-established physical laws. But fitness functions for other problems are far from being certain. The "first language learning" which we aim to address in this dissertation belongs among such problems, since it is not trivial to answer the question: "which model of language is better (i.e. more fit): X, Y or Z?" Such an answer is strongly determined by the theoretical point of view one adopts: for an engineer preferring a sociopragmatic (9.4.4) or constructivist (9.4.3) theory of language acquisition, a model of language competence of a 12-month-old baby which generates utterances like "tato tek tete" would be considered more "fit" than a model generating utterances like "father, had your colorless green ideas slept furiously?" (Chomsky, 1957). Rather the contrary would be the case for an engineer who would decide to formalize his fitness function on the grounds of nativist (10.2) theories of language acquisition.

The notion of fitness landscape, first introduced by Wright (1932), is a metaphor useful for understanding, discussing and comparing diverse fitness functions. The landscape is depicted as a mountain range with peaks of varying height. The height at any point of the landscape corresponds to its fitness value; i.e. the higher the point, the greater the fitness of an individual represented by the given point of the landscape9. In such a representation, the evolution of the organism towards more and more «fit» forms can be depicted as a movement uphill, towards the closest peak (i.e. a local optimum) or towards the highest peak of the whole landscape (i.e. the global optimum). Figure 8 illustrates a fitness landscape of a very simple organism with only one gene (whose potential values are encoded by the illustration's X axis).

9 Note that to find an optimal solution of a problem with N variables, one has to look for it in an N-dimensional search space. This multi-dimensionality is what makes the search so difficult, since the number of possible solutions grows exponentially with the number of dimensions (i.e. variables of the problem).
end genetic algorithms 8.7.1 8 Note that in nature, elitism is often but not always the case. For it can happen that, due to stochastic factors, the most fit individuals die before they succeed to reproduce the information they encode. But, in such a case, are such individuals truly "the most fit"? 44 universal darwinism Fitness functions and fitness landscapes Of functional core What is fitness function? Fitness function as a design choice Fitness landscapes The core component of every genetic algorithm is the objective «fitness function» able to attribute a cardinal value or ordinal rank to any individum in the population of potential solutions. In other terms, the fitness function yields the criterium according to which one candidate individum is evaluated as «more fit» a solution, in regards to the problem under study, than other potential solutions present in the population. The choice of good fitness function determines, more than anything else, the success or failure of GA as a means to find the solution for the problem at hand. Ideally, the fitness function is a mathematical representation of the very essence of the problem which is to be solved. For purely mathematical problems, the choice of the fitness function is straightforward - fitness function is simply the function whose global optimum one wants to find. Also in many practical implementations - notably those of optimalization of physical components - the fitness function is also often evident: one can deduce it from well-established physical laws. But fitness functions for other problems are far from being certain. The "first language learning" which we aim to address in this dissertation belongs among such problem since it is not trivial to answer the question: "which model of language is better (i.e. more fit): X, Y or Z?". Such an answer is strongly determined by the theoretical point of view one adopts: an engineer prefering a sociopragmatic (9.4.4) or constructivist (9.4.3) theory of language acquisition, a model of language competence of 12-month old baby which generates utterances like "tato tek tete" would be considered to be more "fit" than model generating utterances like "father, had your colorless green ideas slept furiously?" (Chomsky, 1957). Rather contrary should be the case for an engineer who would decide to formalize his fitness function on the grounds of nativist (10.2) theories of language acquisition. The notion of fitness landscape, first introduced by Wright (1932) is a metaphor useful for understanding, discussion and comparison of diverse fitness functions. The landscape is depicted as a mountain range with peaks of varying height. The height at any point on the landscape corresponds to its fitness value; i.e. the higher the point, the greater the fitness of an individual represented by the given point of the landscape9 In such a representation, the evolution of the organism to more and more « fit » forms can be depicted as a movement uphill, towards the most closest peak (i.e. local optimum) or towards the highest peak of the whole landscape (i.e. global optimum). Figure 8 9 Note that to find an optimal solution of the problem with N variables, one has to look for it in the N dimensional search space. This multi-dimensionality is what makes the search so difficult since the number of possible solutions grows exponentially with the number of dimensions (i.e. variables of the problem). 
Figure 8 illustrates a fitness landscape of a very simple organism with only one gene (whose potential values are encoded by the illustration's X axis).

Figure 8: Possible fitness landscape for a problem with only one variable. Horizontal axis represents the gene's value, vertical axis represents fitness.

Every arrow on the figure represents one possible individual. Its length represents the variation which can be brought in by the mutation operator. The fact that individuals always tend to move « upwards » indicates that selection pressures are involved. It has to be added that without the implementation of the crossover operator, the globally optimal state (encoded by point C) could not be attained by individuals who did not originate on the slopes of C. Only some sort of crossover operator could ensure that individuals who attained the local optima (encoded by peaks A, B, D) could be mutually recombined (for example B with D) in a way that would allow them to leave the locally stable states and approach the globally optimal C. The fact that genetic algorithms, thanks to « crossover » operators, can combine two individuals from diverse sectors of the fitness landscape allows them to find solutions to problems where heuristics based on « gradient descent » would normally fail.

An important property of a fitness landscape is its "ruggedness". Some fitness functions can yield landscapes as flat as the Pannonian Plain: the algorithm will need a long time to find there a hill, if ever a hill there is. Others may yield landscapes as rugged as the mountains of northwest Vietnam: nothing is certain on such landscapes, where even the slightest mutation can produce a huge decrease or increase of fitness. Ideal landscapes are those which are rugged, but not too much: as on the slopes of the Himalayas, a steady progress towards some locally optimal – not necessarily the highest, but sufficiently high – vantage point can be assured. C.f. the NK theory introduced in Kauffman (1995) for further discussion of landscape ruggedness and of the ways how it can potentially be tuned.

end fitness functions and landscapes

Canonical Genetic Algorithms

A canonical genetic algorithm (CGA) is a genetic algorithm applied on populations (n-tuples) of binary strings (individuals) of length l. Each among the l bits is considered to be a gene, and each string of such genes is considered to be a potential solution of the problem which is to be solved. Given that the initial population is randomly generated, the CGA proceeds as follows:

Listing 1: Canonical Genetic Algorithm

    initialize the population
    determine the fitness of each individual
    perform selection
    repeat
        perform crossover
        perform mutation
        determine the fitness of each individual
        perform selection
    until some stopping criterion applies

In CGAs, fitness proportionate selection (8.7.1) is used as the selection operator. Mutation operates independently on each gene of each individual and consists of stochastic flipping of the current gene's value to its opposite. A "one-point crossover" (4) is most commonly used in CGAs; it consists of randomly choosing a section locus of the chromosome, dissecting the two selected parent individuals A and B along the section locus, and creating two children individuals C and D as concatenations of sections previously encoded in the two distinct parent organisms, i.e. $C = A_1 B_2$ and $D = B_1 A_2$.
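The two canonical variation operators just described can be sketched in a few lines of Perl; the helper names are ours, and the individuals are assumed to be bit-strings of equal length:

    sub one_point_crossover {
        my ($parent_a, $parent_b) = @_;
        my $locus = 1 + int rand(length($parent_a) - 1);   # random section locus
        my $child_c = substr($parent_a, 0, $locus) . substr($parent_b, $locus);  # C = A1 B2
        my $child_d = substr($parent_b, 0, $locus) . substr($parent_a, $locus);  # D = B1 A2
        return ($child_c, $child_d);
    }

    sub mutate {
        my ($individual, $rate) = @_;   # $rate has to stay fairly low (c.f. above)
        my @genes = split //, $individual;
        for my $gene (@genes) {
            $gene = 1 - $gene if rand() < $rate;   # stochastic bit flipping
        }
        return join '', @genes;
    }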
CGAs being thus defined (Holland, 1975; Goldberg, 1990), it has been demonstrated by Rudolph (1994) that such pure CGAs are unable to converge to the global optimum of the problem they tend to maximize. This is so because even if a CGA were able to discover the optimum, the unceasing activity of the mutation operators would force the system to depart from such an ideal state. On the other hand, if ever one implements the elitist trick of keeping the most fit individual, such convergence is assured. Thus, Rudolph's theoretical « analysis reveals that the convergence to the global optimum is not an inherent property of the CGA but rather is a consequence of the algorithmic trick of keeping track of the best solution found over time » (Rudolph, 1994). It is principally because of the CGA's

1. theoretical ability to converge to the global optimum
2. simplicity and architectural elegance

that our method of Evolutionary Localization of Semantic Attractors (ELSA, 10.4.7) is, in essentia, nothing else than a CGA endowed with an elitist strategy.

end canonical genetic algorithms

Parallel Genetic Algorithms

Parallel Genetic Algorithms (PGAs) add another level of complexity to traditional GAs. In PGAs, the global population of solutions is divided into multiple sub-populations which, most of the time, evolve independently from each other. One can understand such sub-populations as different societies or species evolving on isolated islands. Only during so-called "migratory periods" do the sub-populations communicate with each other, most often by means of "sending" their most fit individual to another, receptor sub-population. Grid (A, B), hierarchical (C), ring (D) and multi-hierarchical (E, F) architectures of such inter-insular migratory relations are depicted on Figure 9.

Figure 9: Different architectures of Parallel Genetic Algorithms. Reproduced from Sekaj (2004).

By introducing multiple independent populations, PGAs allow one to put into equilibrium the selective pressure (i.e. the preference for better individuals) and the population diversity (i.e. gene dissimilarity). In traditional single-population GAs, these two forces oppose each other: by increasing the selective pressure, an engineer reduces the diversity and thus exposes himself to the danger of converging "just" to a locally optimal state. On the other hand, by favoring too much diversity, one can significantly slow down the convergence rate. To find the equilibrium between these two forces is indeed an art. PGAs solve this tradeoff problem by allowing one to increase the selective pressure in one sub-population while augmenting the diversity of another. The gain seems to be particularly significant in the case of heterogeneous PGAs, whereby diverse sub-populations implement diverse search strategies. A further improvement, in the case of problems with "rugged" fitness landscapes, can be attained by introducing "sub-population re-initialisation" into the process: that is, the exchange of a population whose diversity is too low for a completely new, randomly generated population. Such « re-initialisation is able to remove differences between homogenous and heterogenous PGA's or between different PGA architecture types respectively. However, all the presented PGA modifications can speed up the search process and prevent the search algorithm from a premature convergence » (Sekaj, 2004). It seems that adding another level of complexity to GAs increases the probability of finding the globally optimal solution.
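The « migratory period » of the ring architecture (D on Figure 9) can be sketched as follows; the island representation and the helper names are our own illustrative assumptions, with best_of() standing for any routine returning a copy of an island's most fit individual:

    sub ring_migration {
        my @islands = @_;   # each island is a reference to its own population array
        my @emigrants = map { best_of($_) } @islands;   # assumed helper, see above
        for my $i (0 .. $#islands) {
            my $neighbour = ($i + 1) % @islands;        # clockwise neighbour on the ring
            push @{ $islands[$neighbour] }, $emigrants[$i];
        }
        return @islands;
    }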
It is true that even traditional single-population GAs explore the search space in multiple directions; in PGAs, however, such exploration is qualitatively augmented. By their faculty to centralize the decentralized, by their ability to speed up the convergence towards optimal solutions of diverse problems, as well as by allowing for the hierarchical stacking of independent information-processing units (in the study of Sekaj (2004), the hierarchical architectures C, E and F seem to be the most successful in approaching the global solutions of two specific mathematical functions), PGAs are reminiscent of so-called deep learning methods, principally based on the hierarchical stacking of diverse connectionist networks. What is more, by being partially localized and partially globally-integrative, PGAs may offer an interesting means of simulating certain functions of the human brain (c.f. 2.8), which seems to dispose of analogous properties.

end parallel genetic algorithms

8.7.2 evolutionary programming & evolutionary strategies

Evolutionary programming (E.Prog) and evolutionary strategies (E.Strat) are methods whose overall essence is very similar to GAs. There are, however, some subtle differences among the approaches. In E.Prog, mutation is the principal and often the only variation operator. While recombination is rarely used, « operators are freely adapted to fit the problem at hand » (Kennedy et al., 2001). E.Prog algorithms often double the size of the population by mixing children with parents and then halve the population by selection. The tournament selection operator is often used. Another difference is that while GAs were developed in order to optimize the numeric parameters of the mathematical function under study – variation thus directly modifying the genotype – in E.Prog one mutates the genotype but evaluates the fitness according to the phenotype. E.Prog is thus often used for the construction and optimization of structures like finite state automata (Fogel et al., 1966). A self-adaptation approach (Bentley, 1999), allowing for the mutation of the parameters of the evolution itself – e.g. the mutation rate – is also frequently used.

Such an approach of « evolving the evolution » is also used in E.Strat, which were discovered – in parallel with, but independently of, Holland's GAs – by Rechenberg (1971). The biggest difference between E.Prog and E.Strat is the fact that E.Strat often recombines its individuals before mutating them. A popular and well-performing strategy thus seems to be:

1. Initialize the population.
2. Perform recombination using P parents to form C children (a frequently used C/P ratio is 7).
3. Perform mutation on all children.
4. Evaluate the children population and select P members from it.
5. If the termination criterion is not met, go to step 2; terminate otherwise.

Given that in certain simulations (c.f. ??) we shall

1. encode solutions by means of non-numeric chromosomes
2. evaluate the fitness of individuals by means of additional « phenotypic algorithms »

we consider the works of Fogel & Rechenberg to be precursors of our approach.

end evolutionary programming & strategies 8.7.2

8.7.3 genetic programming

Contrary to GAs, E.Prog and E.Strat, which operate upon chromosomes (vectors) of fixed length filled with numeric/boolean/character values, individuals evolved by means of Genetic Programming (GP) encode programs of arbitrary length and complexity.
In other terms, one may state that while the above-mentioned EC methods look for the most optimal solution of a given problem, GP tends to produce a hierarchical tree structure encoding a sequence of instructions (i.e. a program) able to yield optimal solutions to a whole range of problems. Simply said: GP is a way how computer programs can automatically « discover » new and useful programs.

The most important thing to do in order to prepare a GP framework is to specify how the resulting individuals (programs) shall be encoded. The original choice of the founder of the discipline, John Koza, was to encode all individuals as trees of LISP S-expressions composed of sub-trees which are, themselves, also LISP S-expressions. Within such arborescent S-expressions, the terminal nodes (i.e. the leaf nodes where the branches end) represent the program's variables and constants, while the non-terminal nodes (i.e. internal tree points) represent diverse functions contained in the function set (e.g. arithmetic functions like +, -, *, /; mathematical functions like log, cos; boolean functions like AND, OR, NOT; conditional operators like if/else etc.).

Figure 10: Sequence of steps constructing the program sqrt(x+5).

Figure 10 illustrates how, during the initial run of the algorithm, an individual – calculating, for example, the square root of x+5 – could possibly be randomly generated by implementing the following procedure:

1. The « root » of the program tree is randomly chosen from the function set; it is the function sqrt.
2. The function sqrt has only one argument (arity(sqrt)=1), therefore it will take only one input, from the randomly determined functor + (addition).
3. The functor + takes two inputs (arity(+)=2), therefore the tree bifurcates into two lines in this node. It randomly chooses, as the first argument, the constant 5, and the variable x as the second argument.

Note that in step 3, both arguments were chosen from the terminal set. Had they been chosen from the function set, the tree would have bifurcated further. In order to prevent such growth of trees ad infinitum, a limiting « maximal tree depth » parameter is more often than not implemented in GP scenarios. Once such a program has been generated, one can evaluate its fitness by confronting it with diverse input arguments and comparing its output with a golden standard. Such random-program generation & evaluation is repeated for all N initial candidate programs; subsequently, the most fit individuals are selected and varied.

While GP's selection techniques can sometimes closely resemble the selection techniques used in GAs, the variation operators are often of an essentially different nature. This is so because in GP, it is not individual genomes or their linear sequences which are mutated or crossed-over, but rather complex and hierarchical networks of expressions. In the case of a cross-over, for example, one switches a whole sub-tree encoded within one individual for a sub-tree encoded within another one.

GP-based solutions cannot be expected to function correctly if they do not satisfy the theoretical properties of closure and sufficiency. In order to fulfill the closure condition, each function from the non-terminal set must be able to successfully operate both on the output of any function in the non-terminal set and on any value obtainable by a member of the terminal set. Even the behaviour of some simple operators thus has to be adjusted a priori (e.g. made to return 1 in the case of division by zero) in order to assure the correct functioning of the resulting program.
The sufficiency property, on the other hand, demands that the set of functors and terminals be sufficiently exhaustive; otherwise the solution could not be found. One cannot, for example, hope to discover an equation generating the Mandelbrot set if the initial set of terminals does not contain the notion of an imaginary number, nor the function set any other explicit or implicit reference to the notion of the complex plane. Thus, while the closure constraint delimits the upper bound beyond which the discovery of the solution is not feasible, the sufficiency constraint delimits the lower bound of the minimal set of « initial components » which have to be defined a priori, so that the discovery of the adequate program is at least theoretically possible. Other theoretical notions, as well as diverse subtleties of practical implementation (special operators, methods how to distribute the initial population in the search space, fitness function proposals, domains of application, etc.), are to be found in possibly the most important GP-concerning monography (Koza, 1992).

Grammatical evolution

Grammatical Evolution (Gr.Ev) is a variant of GP in the sense that it also uses evolutionary computing in order to automatically generate computer programs. The most important difference between Gr.Ev and GP is that while GP operates directly upon phenotypic trees representing the program's code itself (for example in the form of LISP expressions), Gr.Ev uses the evolutionary machinery for the purpose of generating grammars, which subsequently generate the program code.

In Formal Language Theory (c.f. also 10.2), a grammar is represented by the tuple {N, T, P, S} where N denotes the set of non-terminals, T the set of terminals, S is the start symbol, which is a member of N, and P denotes the set of production rules that substitute elements of N by elements of N, T or their combinations. Consider a grammar G exhaustive enough to encode programs able to perform an arbitrary number of operations of addition or subtraction of two variables:

Listing 2: An example of grammar G.

    N = {<expr>, <op>, <var>}
    T = {+, -, x, y}
    S = <expr>
    P = {
        <expr> -> <expr> <op> <expr> | <var>
        <op>   -> + | -
        <var>  -> x | y
    }

Such a grammar contains three non-terminals: the non-terminal <op>, which can be substituted for either the terminal + or the terminal -; the non-terminal <var>, which can be substituted for either the terminal x or the terminal y; and the non-terminal <expr>, which can be substituted either for a single non-terminal <var>, or for the sequence of non-terminals <expr> <op> <expr>. The fact that in this last production the non-terminal <expr> is present both on the left and on the right side of the substitution rule gives this grammar the possibility to recursively generate an infinite number of expressions (x+x, x+y, y+x, y+y, x-x, x-y, y-y, y-x, x+x+x, x+x-x, x-x+y-y, y+y+x+x+y-x, etc.). Even a very simple grammar – with only four terminal symbols and three non-terminal symbols, to each of which are associated only two production rules – can theoretically (i.e. if given an infinite amount of time for the application of production rules) produce an infinite number of distinct individual programs able to perform basic arithmetic operations with two variables. The generation of a given resulting expression is determined by the order of application of specific production rules, starting with the start symbol S. Such a sequence of applications of production rules is called a derivation.
For example, in order to derive the individual « x+x », one has to apply the production rules in the following order:

Listing 3: Production of expression x+x.

    S = <expr>
    <expr> ::= <expr> <op> <expr>
    <expr> ::= <var>             # <var> <op> <expr>
    <var>  ::= x                 # x <op> <expr>
    <op>   ::= +                 # x + <expr>
    <expr> ::= <var>             # x + <var>
    <var>  ::= x                 # x + x

while the individual « y-x » would be generated if ever the start symbol S were expanded by the following sequence of production rules:

Listing 4: Production of expression y-x.

    S = <expr>
    <expr> ::= <expr> <op> <expr>
    <expr> ::= <var>             # <var> <op> <expr>
    <var>  ::= y                 # y <op> <expr>
    <op>   ::= -                 # y - <expr>
    <expr> ::= <var>             # y - <var>
    <var>  ::= x                 # y - x

In Grammatical Evolution, it is this « order of application of production rules » which is encoded in the individual chromosome. In other terms, individual chromosomes encode when and where distinct production rules shall be applied. Figure 11 illustrates more closely, and puts into analogy with biological systems, the sequence of transformations which every binary chromosome undergoes during the process of its unfolding into a fully functional program.

Figure 11: Sequence of transformations from genotype to phenotype in both Gr.Ev and biological systems. Figure reproduced from O'Neil and Ryan (2003).

As Figure 11 indicates, the approach of Gr.Ev is quite intricate and involves multiple steps of information processing. The whole process starts with a binary chromosome which is subsequently split into 8-bit codons, each of which yields an integer specifying which production rule is to be used at a given moment of the program's generation. On many different layers does the « generation » process, as implemented in Gr.Ev, introduce and implement very original ideas like:

• « Degenerate genetic code » – similarly to « nature's choice » to encode one amino acid by means of many different triplets, one can encode the application of a unique production rule by more than one codon.

• « Wrapping » – under certain conditions, the whole genome can be « traversed » more than once during the process of phenotypic expression. A specific codon can thus be used more than once during the compilation of a single individual.

The rationale for the usage of such « biologically inspired tricks » is more closely presented in the work of the founders of the Grammatical Evolution field (O'Neil and Ryan, 2003). They claim that the focus on the genotype-phenotype distinction, especially in combination with the implementation of the « degenerate code » and « wrapping » notions, could result in a compression of representation (& a subsequent reduction of the size of the program search-space) and account for phenomena like « neutral mutation », well observed in biological systems, whereby a mutation occurs in the genotype but does not have any effect upon the resulting phenotype. Another important advantage mentioned by O'Neil and Ryan is that the Gr.Ev approach makes it very easy to generate programs in any arbitrary language. This is due to the versatility and generality of the notion of « grammar ». When compared with the traditional GP technique, Gr.Ev was outperformed in a scenario where one had to find solutions to the problem of symbolic regression. But in more complex scenarios like « symbolic integration », the « Santa Fe ant trail », or in a scenario where one had to discover the most precise « caching algorithm », Gr.Ev significantly outperformed GP. The seminal work of O'Neil and Ryan (2003) also presents some other interesting examples of practical applications of Gr.Ev, for example in the domain of financial market prediction.
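The genotype-to-phenotype mapping just described can be reduced, for the grammar G of Listing 2, to the following toy Perl sketch of ours: every codon, taken modulo the number of productions applicable to the leftmost non-terminal, decides which rule shall be applied (wrapping, degenerate code and other subtleties are omitted):

    my %rules = (
        expr => [ [qw(expr op expr)], ['var'] ],
        op   => [ ['+'], ['-'] ],
        var  => [ ['x'], ['y'] ],
    );

    sub derive {
        my @codons   = @_;          # the chromosome, already split into 8-bit integers
        my @sentence = ('expr');    # the start symbol S
        while (@codons) {
            # locate the leftmost non-terminal; stop when only terminals remain
            my ($pos) = grep { exists $rules{ $sentence[$_] } } 0 .. $#sentence;
            last unless defined $pos;
            my $options = $rules{ $sentence[$pos] };
            my $choice  = shift(@codons) % @$options;   # codon mod number of rules
            splice @sentence, $pos, 1, @{ $options->[$choice] };
        }
        return join ' ', @sentence;
    }

    # derive(0, 1, 2, 0, 1, 3) unfolds S into the phenotype « x + y »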
It is worth underlining that while in many points (« grammar », « evolution ») the work of O'Neil and Ryan significantly overlaps with ours, their aims significantly differ from our aim to interpret the process of language acquisition as an inherently evolutionary process. More concretely, while Gr.Ev tends to offer a very general toolbox for generating useful computer programs, written in an arbitrary programming language and used for solving arbitrary problems, we confront the evolutionary computation machinery with the task of shedding some light upon the diverse facets of one sole problem: that of the « learning of first language ». Another important difference between the approach of Gr.Ev and the one we shall present in our Thesis is that while in Gr.Ev, grammars are considered to be « generative devices », i.e. tools used for the generation of programs, in our Thesis we shall use them both as « generative » and as « parsing » devices. Another, even more fundamental difference is due to the fact that while « At the heart of GE lies the fact that genes are only used to determine which rule is applied when, not what the rules are » (O'Neil and Ryan, 2003), the evolutionary model of language-induction proposed in our Thesis shall aim to determine not only the order of application of the rules, but also the content of the rules themselves.

end grammatical evolution

end genetic programming 8.7.3

8.7.4 tierra

Another example of how one can materialise evolutionary principles within an in silico framework is offered by Tierra, an artificial life simulation environment programmed between 1990 and 2001 by Thomas S. Ray and his colleagues. Since Ray is an ecologist, his objective was not to develop an EC-like model in order to find or optimize solutions of a given problem; rather, he aimed to create a system where artificial entities could spontaneously evolve, co-evolve and potentially create whole artificial ecosystems. An artificial entity in Tierra's framework (Ray, 1992) is a program composed of a sequence of instructions chosen from an instruction set containing 32 quite traditional assembler instructions, somewhat tuned by the author so that their usage would facilitate the « replication » of the code. Every artificial entity runs in its own « virtual CPU », but its code stays encoded in the « soup », i.e. a piece of RAM which is potentially read-accessible to all other entities as well. Rare « cosmic ray » mutations flip the bits of the « soup » from time to time; more variation is ensured by bit-flipping during the procedure whereby the entity replicates (i.e. copies) its code from the « mother cell » section of the soup to the « daughter cell » section. Selection is, in a certain sense, emulated by a so-called Reaper process which tends to stop the execution of programs which are either too old or contain too many flawed instructions. Other than that, there is nothing which resembles the traditional notion of an exogenously defined « fitness function »: within Tierra, the survival (or death) of diverse species of programs is a direct consequence of a species' ability (or inability) to obtain access to limited resources (CPU & memory).
Thus, after one seeds the initially empty soup with a manually constructed individual containing 80 instructions allowing it to copy its code into the daughter cell of the memory; after the memory has been filled and the battle for resources has started; and once the mutations have generated a sufficient amount of variation, one can observe the emergence of dozens of new forms of replicable programs. Some of them being parasites, and some of them being able to create algorithmic counter-measures against parasites, one can literally observe the emergence of an artificial yet living ecological system. It is therefore little surprising that Tierra could automatically evolve, among others, an individual containing just 22 instructions, capable of replication – that is, a replicator almost 4 times shorter than the replicator manually programmed by the conceptor of the system and injected into the initial « soup ». Currently the most famous descendant of Tierra is the AVIDA system (Ofria and Wilke, 2004). Contrary to Tierra, however, every AVIDA individual is encapsulated within its own virtual CPU and memory space. Tierra's Darwinian metaphor of computer programs evolving by means of fighting for limited resources is thus not so strictly followed.

end tierra 8.7.4

8.7.5 evolutionary language game

The Evolutionary Language Game (ELG), first proposed by Nowak et al. (1999), is a stunningly simple yet mathematically feasible stochastic model addressing the question: « How could a coordinated system of meanings & sounds evolve in a group of mutually interacting agents? ». In most simple terms, the model can be described as follows: let's have a population of N agents. Each agent is described by an $r \times c$ associative matrix A. A's entry $a_{ij}$ specifies how often the individual, in the role of a student, observed one or more other individuals (teachers) referring to object i by producing signal j. From this associative matrix A one can derive the active « speaker » matrix S by normalizing its rows:

$$s_{ij} = \frac{a_{ij}}{\sum_{n=1}^{c} a_{in}}$$

and the passive « hearer » matrix H by normalizing A's columns:

$$h_{ij} = \frac{a_{ij}}{\sum_{n=1}^{r} a_{nj}}$$

The entries $s_{ij}$ of the matrix S denote the probability that, within the P-representations of an agent-speaker, object i is associated with sound j. The entries $h_{ij}$ of the matrix H denote the probability with which, within the C-representations of the hearer, sound j is associated with object i. (See the following chapter for a closer introduction of what C- and P-representations are.)

Subsequently, we can imagine two individuals A and A', the first one having the language L(S, H), the other having the language L'(S', H'). The payoff related to the communication of two such individuals is, within Nowak's model, calculated as follows:

$$F(A, A') = \sum_{i=1}^{r} \sum_{j=1}^{c} s_{ij} h'_{ji} = \mathrm{Tr}(S H')$$

And the fitness of the individual A in regard to all other members of the population P can be obtained as follows:

$$f(A) = \frac{1}{|P|-1} \sum_{\substack{A' \in P \\ A' \neq A}} F(A, A')$$

After the fitness values are obtained for all population members, one can easily apply traditional evolutionary computing methods in order to direct the population towards more optimal states, i.e. states where the individual matrices are mutually « aligned ». In Nowak's framework, this alignment represents the situation when hearer and speaker mutually understand each other, i.e. the speaker has encoded meaning M by sound S and the hearer has subsequently decoded sound S as meaning M.
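Under the assumption that the matrices are represented as arrays of array references, and that S and H' have already been obtained by the row- resp. column-normalizations given above, Nowak's payoff can be sketched in a few lines of Perl (the subroutine name is ours; the hearer matrix is stored sound-by-object, so that its entry [j][i] is the probability that sound j is decoded as object i):

    sub payoff {
        # F(A,A') = sum over objects i and sounds j of s_ij * h'_ji = Tr(S H')
        my ($S, $H_prime) = @_;   # speaker matrix of A, hearer matrix of A'
        my $f = 0;
        for my $i (0 .. $#$S) {                  # over objects
            for my $j (0 .. $#{ $S->[$i] }) {    # over sounds
                $f += $S->[$i][$j] * $H_prime->[$j][$i];
            }
        }
        return $f;
    }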
ELG beautifully illustrates how a mutually shared communication protocol can emerge from a population of randomly initialized sound-meaning associative matrices, if there is some « mutual associative reinforcement » mechanism involved. This mechanism allows information to be transferred from one individual to another. This is attained by creating a blank « student » matrix and then filling its elements, by means of a stochastic « matrix sampling » procedure, in a way such that the resulting student matrix shall partially correspond to – be aligned with – the matrices of the pre-existing « teacher » (or teachers). Further experiments with ELG are described in Kvasnicka and Pospichal (2007, 1999) and Hromada (2012b). All these studies point in the same direction and suggest not only that the emergence of a mutually shared communication protocol practically ex nihilo is possible whenever there exists a means of transfer of information among individuals, but also that without the presence of a certain low amount of noise during the learning process, the system as a whole would fail to converge to the « communicatively optimal » state. In other words, the ELG model indicates that the presence of noise – a minimal yet non-null amount of mal-transferred information – is necessary in order to assure that the population of mutually aligned sound-meaning matrices shall, sooner or later, converge to the most communicatively optimal state.

The role of the ELG model within the context of our Thesis is quite open. For while it is the case that ELG sheds some light upon the question of the emergence of language within a community of symbolically interacting agents, it does not principally address the problem of language learning by a concrete individual. Thus, ELG is rather a model of macroscopic phylogeny than of microscopic ontogeny: it addresses the problem of how small communities of homo habilis could, hundreds of thousands of years ago, gradually converge to a system of signs within which, for example, « baubau » could mean a banana and « wauwau » a lion. Or, in less fatal and more vital affairs, how such a system can be useful for synchronizing activities related to dating, mating etc., as represented on Figure 12.

Figure 12: A case whereby mutual alignment of sound-meaning mappings can be useful. Reproduced from Kvasnicka and Pospichal (2007)'s reproduction of Pinker (2000).

Unfortunately, ELG wasn't explicitly constructed to address the problem of ontogenetic alignment, id est the problem of how toddlerese adapts to motherese. But, we believe, it is not completely hors propos to imagine a slight variation of Nowak's model wherein one population of matrices would be much more stable (representing the linguistic competence of the mother, parent or teacher agent), while the second population of matrices would represent the linguistic competence of a « child ». Given that the fitness function would somehow succeed in representing the degree of alignment between such a « mother » and « child », we postulate that something like the child's language competence could spontaneously emerge, be distilled and induced from such an ontogeny-oriented variant of the Evolutionary Language Game.

end evolutionary language game 8.7.5

In this section we have discussed more closely diverse applications of Evolutionary Computing (as defined in Section 3.1), namely:

1. genetic algorithms (GA) and parallel genetic algorithms
2. evolutionary programming (E.Prog) and evolutionary strategies (E.Strat)
3. genetic programming (GP) and its variant, grammatical evolution (Gr.Ev)
4. an artificial ecology environment, Tierra
5. a model of ex nihilo induction of sound-meaning mappings called the Evolutionary Language Game (ELG)

While some of these applications may strongly differ from each other, they all materialize – sometimes in purely informatic or mathematical worlds, sometimes in worlds more material or even "social" – the basic premises of Universal Darwinism. They all implement, in one way or another, reproduction, selection and variation of populations of information-encoding entities. The content of

1. what these entities encode
2. the ways how they encode it and how it varies
3. the reasons why some structures are chosen into the next generation and some are not

varies substantially from application to application. But the trinity of principles – reproduction, selection, variation – is implemented in all of them; otherwise they could not be, ex vi termini, labeled as EC implementations.

Dozens of analytical studies – related to topics like fitness landscapes (8.7.1) or parallel genetic algorithms (8.7.1) – could, sooner or later, find their accomplishment in a general, formal and mathematical theory of evolution. The articulation of such a theory could yield a more rigorous base for the description of phaenomena which are nowadays explained in terms of the somewhat vague, speculative and conjectural doctrine of Universal Darwinism. For the one who would decide to establish such a theory, EC could furnish a tool as useful as Galileo's telescope was for Kepler.

As was already indicated, the aim of this dissertation is not to furnish, nor even to discuss, such a general theory. The aim is first to use the conceptual prism of the doctrine of Universal Darwinism in order to observe and interpret the phaenomena related to the topic of our interest – language acquisition. And subsequently – in order to furnish a sort of testimonium ex simulatione – to use the most simple evolutionary model possible to demonstrate that it may be useful to conceive the problem of language acquisition in terms of gradual optimization and co-evolution of populations of linguistic functions and structures.

We believe that for such a purpose, EC can furnish a very useful framework. The reason behind this belief is simple: during the few decades since its conception, EC-based systems have demonstrated their capability to find solutions to thousands of diverse problems and meta-problems. EC-based systems help designers and planners to invent optimal components, houses and cities; EC-based approaches are used to tune neural networks in robotic systems; EC-based systems help us not only to understand our world but also to change it. Simply stated: Evolutionary Computing works.

end evolutionary computing 8.7

Evolutionary Computing works because evolution itself works. And evolution – understood as gradual optimization of replicators – works because it is a logical necessity. Such is the doctrine of Universal Darwinism. The goal of this chapter was to furnish a brief overview of diverse scientific theories and paradigms based on, or inspired by, UD's explicatory power. First was mentioned biological evolution: it was the study of this form of evolution which gave birth to evolutionary theory. The discipline of Evolutionary Psychology was then discussed and partially criticized as being often too expansive in its aims.
It was reiterated that the aims of this dissertation are not those of Evolutionary Psychology: while Evolutionary Psychology tries to explain diverse human skills as results of biological evolution, the Hard Thesis postulates that human learning itself is an evolutionary process. Evolutionary epistemology, Campbell's & Simonton's explanation of individual creativity in terms of "blind variation and selective retention", and the notion of memetics were discussed as examples of evolution based on the reproduction, variation and selection of non-DNA replicators. It was further precised that, contrary to EE2-1 and traditional memetics, which study the evolution based on structures copied between brains, we shall tend to put the focus on the evolution going on within the brain.

The existence of a sort of 3rd replicator is thus posited. Aside from nucleic acids – which furnish the material base for the evolution of Nature – and aside from memes – which represent the basic units of the evolution of Culture – a third replicator is posited in order to explain certain properties of a mind (1.1) which learns. To honor Piaget's work in genetic epistemology (8.4.4), we tend to call such a replicator a "scheme". By being internal to both mind & brain, such "schemes" are very elusive, and it is of no surprise that they could potentially escape the attention of occidental "positivist" science. Even in the case of the other replicators, science took its time to recognize their nature and force. While breeding domesticated species for thousands of years, "science" was nonetheless ignorant of the principles of evolution until a sort of crossover between Mendel's and Darwin's ideae occurred. While being bombarded on a daily basis by propaganda memplexes and viral tweets, certain scholae still have a somewhat difficult time admitting the sheer existence of memes. And if the nature of such salient, objective, empiric phenomena escaped the analytic regard of scientific enquiry for such a long time, could anything be done – in the limited scope of this dissertation – to demonstrate the existence of such "subjective" schemes?

After putting aside introspection as an invalid method of validating hypotheses in a positivist way, we see only three possible means of proving the existence of such a 3rd replicator:

a. study of the reproduction of information within the brain by means of imaging techniques like fMRI, EEG etc.
b. study of "schemes" while they are still observable, i.e. before they are interiorized
c. computational simulations

Path A, the path of neurosciences, is too costly and thus beyond our reach. Hopefully it shall be undertaken by others with more resources and more patience. But luckily, the price of undertaking paths B & C is negligible, and it is thus in this direction that we shall proceed. For in order to make progress on path B, one just needs to observe the activity of minds which have not yet mastered the way to interiorize their perceptive and behavioral schemas into their subjective realm. Such minds are, according to Piaget, and even more so according to Vygotsky (1987), the minds of children. And when it comes to path C, nothing could serve us better than the EC branch of informatics (3.1). It has been suggested that EC is a sort of applied evolutionary theory: it can generate empiric proofs. Whenever a genetic algorithm discovers a useful solution which was not yet found, whenever a genetic programming scenario generates a piece of evolutionary art which the programmer hadn't even dreamt of, a tangible – and often beautiful – proof is furnished.
It is a proof of the belief that darwinists are definitely not further away than creationists from the knowledge of a noumenic Principle governing our phenomenal world.

end universal darwinism 8

DEVELOPMENTAL PSYCHOLINGUISTICS

Developmental Psycholinguistics (DP) is a scientific discipline studying the changes occurring in the human faculty of understanding and production of natural languages. As such, it is closely related to developmental psychology (a sub-field of psychology) and developmental linguistics (a sub-field of linguistics). While developmental psychology thematises the phaenomena of development of the human psyche, consciousness, mind, attention, reasoning, intellect, memory, perception, action, etc. on their own, DP does so always in relation to language. And contrary to linguistics, which often thematises language – or linguistic competence – as the product of some process P, DP ultimately strives to understand the process itself. In other terms, approaches common to DP « regard continuity of expression and function as critical clues to tracing the path children follow as they acquire language » (Clark, 2003).

We consider this distinction between "product versus process" to be of crucial importance for our tentative to align DP with UD. This is so because evolution itself is a process, and hence it would be impossible to align the two paradigms if ever the linguistic faculty were understood solely as a static product. In other terms, the alignment of DP and UD is possible only under the condition that DP's main object of interest is not a static product, but a dynamic process. As a name for such a process, we shall adopt the decision made by Harris (2013) and use the term « language development » in preference to the widely-used term « language acquisition ». A reason for this is the tentative to mark the fact that the child not only passively « acquires » the language from environmental input, but rather gradually builds it, in interaction with its environment. Sometimes the term « language learning » is also used to denote the same process; great care has to be taken, however, not to forget that the "implicit and natural" way how a child learns toddlerese differs substantially from the "explicit" drill used in the learning of second, third, foreign, etc. languages. This being said, we can now define the process which is, ex vi termini, the main object of interest of any developmental psycholinguist:

9.1 language development (def)

Language development (LD) – or the ontogeny of natural language L in human individual H – is a constructivist process gradually transforming L into an evermore optimized communication channel facilitating the exchange of information between H and her social surroundings.

end language development 9.1

The adjective constructivist indicates that LD should be, within the theory hereby introduced, considered as a process based on the gradual internalization and modification of mental representations, induced and re-induced by confrontations with external informations. Piaget's constructivist theory in relation to LD shall be more closely described in 9.4.3. Note also that by introducing the terms "exchange of information between H and her social surroundings", definition 9.1 places emphasis on the social aspects of human language.
By doing so, it embraces the so-called socio-pragmatic approach to LD (c.f. 9.4.4) more closely than the so-called generativist and nativist ones (c.f. 10.2). But the key component of LD's definition is the notion of "optimization". This notion, which goes hand-in-hand with the notion of "facilitation of the exchange of information", refers to the fact that, as language L develops – in infancy and beyond – it usually makes it possible to encode ever more precise information with a smaller quantity of signal. By integrating the notions of "optimization" and "facilitation of information", definition 9.1 thus ultimately states that language development is indeed a process which, if healthy and well-adapted to the environment, makes it possible to successfully exchange ever subtler and subtler meanings (signifiés) encoded by shorter – or at least not longer – sequences of articulated symbols.

An information can be successfully exchanged between a human sender and a receiver if and only if the following conditions are fulfilled:

1. C1: the sender is able to encode the information into the signal
2. C2: the signal can be decoded by the receiver
3. C3: the result of such decoding attracts the receiver's mind limitely close to the state intended and anticipated by the sender

One can speak about success in interpersonal communication only if the communicative act fulfills all of these conditions. As was already pointed out in 5.1, linguistic signals are usually strongly sequential and analysable into finite numbers of distinct, discrete elements. When the sender encodes his intention into such a sequence, (s)he is said to produce or generate the linguistic utterance. When the receiver decodes it, (s)he is said to parse the utterance. Ideally, when sufficiently strong a morphism exists between such meaning-encoding and signal-decoding interactors, the result of such parsing has, as a consequence, that precious moment which humans call "understanding".

Understanding, or comprehension, is closely related to condition C3. The fact that humans are able to understand each other – the fact that speaker and listener, writer and reader can share intentionality – is, according to usage-based theorists of LD, something which seems to be a unique propensity of the human species (c.f. 9.4.4 for a further introduction of usage-based theories).

It is important to realize that, in spite of disposing, at a certain level of abstraction, of a sort of symmetry, production and comprehension are nonetheless distinct processes. It is as with the movement of a hand, which involves different muscles when the hand goes up and different ones when the hand moves in the opposite direction; as with the human endocrine system, which uses one hormone to promote a certain activity and a completely different hormone to inhibit it; as with multitudes of other biological and cognitive phaenomena which seem to be mirror images of each other but in fact are not: production and comprehension are distinct. Distinct mechanisms are implemented to generate a sentence and distinct ones to parse it. Distinct brain regions are involved. Hearing is not speaking with the roles of speaker and hearer simply inverted: it is something fundamentally different.

The existence of such a mismatch between linguistic production and linguistic comprehension is so evident that many linguistic theories have ignored it, or at least set it aside as secondary. Practically all linguistic schools drawing inspiration from Formal Language Theory (FLT) (10.2), e.g. the generativist tradition, do not care much about this distinction. This is so because at the level of abstraction where FLT is postulated, parsing is practically the same thing as generation, and the only thing which differs is the direction in which the rules are applied. It is true that when parsing, the system proceeds from the surface structure towards the deep structure by always substituting the left side of a production rule for the right side; when generating, the system proceeds from the deep structure towards the surface structure by substituting the right sides of the production rules for the left sides. But all the rest – the content of the production rules, the alphabet, the lexicon, the very computational machinery – is the same. Given the more theoretical and much less empirical aims of FLT, one can understand the reasons why it practically ignores the mismatch between man's language comprehension and man's language production. One could even praise FLT's conceptors for the fact that, by pointing to the level of abstraction where production and comprehension meet, they point to some potentially fundamental unity. For adopting an attitude where production is a sort of inverted parsing can, indeed, yield some interesting and potentially useful computer programs. But to build psycholinguistic theories while ignoring the principle which every parent feels and every language teacher knows – such an attitude has to necessarily result in an inconsistent theory or a mal-functioning model. In order to avoid such an epistemologic disaster, a somewhat more mundane principle is hereby posited:
Practically all linguistic schools drawing inspiration from the Formal Language Theory (FLT) (10.2), e.g. the generativist tradition, do not care much about this distinction. This is so because at the level of abstraction where FLT is postulated, parsing is practically the same thing as generation and the only thing which differs is the direction in which rules are applied. . It is true that when parsing, system proceeds from surface structure towards deep structure by always substituting left-side of the production rule for the right-side; when generating, system proceeds from the deep structure towards the surface structure by substituting right-sides of the production rules for the left-sides. But all the rest - the content of production rules, the alphabet, the lexicon, the very computational machinery - is the same. Given more theoretical and much less empirical aims of FLT, one can understand the reasons why it practically ignores the mismatch between man’s language comprehension and man’s language production. One could even praise FLT’s conceptors for the fact that by pointing to the level of abstraction where production and comprehension meet, they point to some potentially fundamental unity. For adopting an attitude where production is a sort of inverted parsing can, ineed, yield some interesting and potentially useful computer programs. But to build psycholinguistic theories and ignore the principle which every parent feels and every language teacher knows, such an attitude has to necessarily result in an inconsistent theory or a mal-functioning model. In order to evit such an epistemologic disaster, a somewhat more mundane principle is being posited: Of asymmetry between production and comprehension Of symmetry between production and parsing Of human condition and insufficiency of FLT 9.1 language development (def) 9.1.1 central dogma of dp (def) C-representations precede P-representations. end central dogma Of C- and Prepresentations C-representations are not passive 9.1 Eve Clark, who had coined these (C|P)-representation terms further clarifies: « Children set up a representation for each new word or phrase they notice in the speech they hear, attach meaning to it, and adjust the representation in the light of further analyses. They can use it to access that meaning when they next encouter that form. As they hear more language, they add to their store of such representations. These representations for comprehension (C-representations) consist first of an acoustic template, to which children then add information about meaning, syntax, and use...Children also represent the information needed for producing each expression. For this, they need specifications for articulating the sounds in the target word or phrase. Their representations for production (P-representations), then, necessarily differ from C-representations.» (Clark, 2003) In simple terms, the central dogma states that humans understand language before they can speak it. A child comprehends what an airplane means long before it will be able to pronounce that word correctly. According to Clark, comprehension is always ahead of production, even in adult age when the mismatch is much less visible than in childhood. Important thing to realise is that utility of C-representations goes far beyond some passive involvement in recognition and comprehension of words and phrases. This is so, because C-representations can also influence and determine the direction of construction of Prepresentations. 
C-representations can provide targets with which P-representations are gradually aligned. This is how Clark describes the process: « How would this work? Suppose a child is trying to produce snow. If children can access the C-representation for snow, they can compare their own production with the C-representation, detect any mismatch, and repair their own utterance. The C-representation is a model of what the word should sound like so others can recognize it. Under this view, C-representations provide model targets for what children produce. They also provide the target that the product of a P-representation must match. So as children adjust their P-representations to match what they hear from others, they align them with their C-representations. It is this gradual alignment that mirrors changes in children's own production of words and phrases. » (Clark, 2003)

In other words, what Clark tacitly indicates is not only that human language development involves a gradual adaptation of one's internal representations (the C-ones) to structures observable in the external environment, but also that LD involves a sort of gradual adaptation of one set of internal representations (P) to another set (C). In the light of such a theory, many common phenomena, like canonical babbling (9.2.2), can be interpreted as highly useful and possibly inevitable ways how the infant's linguistic faculty tunes and bootstraps itself in a partially auto-programming and auto-poietic fashion.

Another reason why we consider the principle C-precedes-P to be of certain interest for this dissertation is that it indirectly addresses the debate we have already raised when discussing the hypothesis stating that "learning involves reproduction of information-encoding entities" (2.4). For if we accept that C precedes P, we have to accept, in the first place, that the C-representation of the word mama is somehow distinct from the P-representation of the same word. And more: if we accept that the C-representation of the word mama is distinct from the P-representation of the word mama, yet refers to the very same mother-referent in the external world, we have to accept that the information contained in the two representations has to be, at least partially, the same. We thus end up with two distinct representations, C and P, both originating in C and pointing to the referential content to which C alone referred when it was first set up. Couldn't this mean that the informational content of the locus which encodes C (e.g. in Wernicke's area) got replicated into an independent cortical locus which encodes P (e.g. in Broca's area)? Couldn't the neural basis of such a process be somewhat similar to the processes postulated by neural darwinists (8.6), for example the one depicted on Figure 6? We let the reader (him|her)self answer these and similar questions. Note, however, that answering these questions with "yes" would suggest that the statement which has been hereby labeled as the central dogma of developmental psycholinguistics does indirectly support the thesis that language development is a form of evolutionary process.
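If one wished to caricature this replicative reading of the central dogma in code, a toy of the following sort could do; it is emphatically our own illustrative sketch and not a model from the literature: a P-representation ("no") is randomly varied, and a variant is retained only when it lies closer – in Levenshtein edit distance, a crude stand-in for perceived acoustic mismatch – to the fixed C-representation target ("snow"). The distance() routine is the one exported by the CPAN module Text::Levenshtein.

    use Text::Levenshtein qw(distance);

    my $c_rep  = 'snow';   # the C-representation: the model target
    my $p_rep  = 'no';     # the P-representation: an early production attempt
    my @sounds = ('a' .. 'z');

    for my $generation (1 .. 1000) {
        my $variant = $p_rep;
        my $op = int rand 3;   # one blind variation per generation
        if ($op == 0 and length $variant) {    # substitution
            substr($variant, int rand length $variant, 1) = $sounds[rand @sounds];
        } elsif ($op == 1) {                   # insertion
            substr($variant, int rand(length($variant) + 1), 0) = $sounds[rand @sounds];
        } elsif (length $variant) {            # deletion
            substr($variant, int rand length $variant, 1) = '';
        }
        # selective retention: keep the variant only if it "sounds" closer to the target
        $p_rep = $variant if distance($variant, $c_rep) < distance($p_rep, $c_rep);
        last if $p_rep eq $c_rep;
    }

Blind variation and selective retention suffice, in this toy at least, to let the production converge onto the comprehension target – which is precisely the kind of intramental evolutionary loop this dissertation postulates.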
9.2 development of toddlerese

The goal of the following subsections is to present facts related to the development of multiple facets of toddlerese. In 5.1, toddlerese was defined as a proto-variant of the natural language, and natural language was defined as a system composed of prosodic, phonologic, morphologic, syntactic, semantic and pragmatic structures and principles (4.1). None of these layers is to be ignored by somebody aiming to have an adequate vision of the development of toddlerese. But taking into account all the scientific discussions which have, since the end of the 19th century, been preoccupied with the elucidation of the mystery of LD's universality, speed, and the fact that in the case of healthy individuals LD is practically always successful – taking into account all such scholastic schisms is not a path to knowledge either. Thousands of experiments and observations were done, hundreds of books published, dozens of theories and even whole doctrines were unleashed, sometimes sentencing whole generations of linguists into a scholastic hell filled with infinities, recursive rules and utterly inconvenient formalist games. In order to avoid such a destiny, the following paragraphs shall restrict themselves to a very "minimalist" presentation of a few evident or experimentally well-verified LD-pertaining facts. Thus, the brief exposé hereby introduced will only very rarely be concerned with any linguistic phenomena beyond the state of toddlerese, whose upper bound was operationalized, in 5, at 30 months (i.e. at age 2;6). And given that the "operational thesis" (6) restricted the scope of our interest to the textual modality of human interpersonal communication, we shall present in closer detail the psycholinguistic studies pertaining to the development of the morphosyntactic and semantic faculties. In contrast to these, the prosodic, phonic and pragmatic layers shall be described much more superficially than they rightfully merit. For when it comes to language as it was known to all our human predecessors, it was indeed the pragmatic and phonetic aspects which were at the core and inception of it all.

9.2.1 ontogeny of prosody, phonetics and phonology

Prosody is all that relates to the tempo, rhythm, stress and intonation of speech. Phonetics is concerned with the articulation, acoustics and audition of the physical properties of speech signs. Phonology, on the other hand, is less "material" and more "cognitive", in the sense that it is not concerned with such physical characteristics of phonemes as amplitude, frequency or timbre, but rather with the systems of abstract categories and rules whose existence is directly or indirectly observable in any human cognitive system which was exposed to phonemes and is able to perceive them.

Human beings are sensitive to language even in the prenatal period. The study of DeCasper and Spence (1986) has shown that new-born infants prefer to listen to a story which they have already "heard" in utero. Given the fact that in the uterus, frequencies above 1 kHz are weakened by transmission through maternal tissue, this preference can be explained principally in terms of prosodic, and not phonemic, information. Another study indicated that even 4-day-old newborns are able to distinguish between the mother language (e.g. French, in the case of French newborns) and a foreign language (Russian, English etc.), even when the speech is filtered by a 400 Hz low-pass filter (Mehler et al., 1988).
Clark summarizes the results of both studies in one statement: « what infants are attending to are the prosodic properties of the speech they have been exposed to prenatally » (Clark, 2003).

During approximately the first eight months which follow the birth, infants are capable of discriminating practically any phonetically plausible contrast between two sounds. But before attaining one year of age, children lose this capacity to distinguish practically any sound from any other, and their perceptual filters become more and more adapted to the phonology of the language spoken in their social environment. In other words, « infants can discriminate nonnative speech contrasts without relevant experience... there is a decline in this ability during ontogeny... data... shows that this decline occurs within the first year of life, and that it is a function of specific language experience » (Werker and Tees, 1984).

It shall be indicated multiple times in this dissertation that sometimes a loss or a limitation can serve a creative purpose. Such is also the case, we believe, with the above-mentioned loss of the capacity to distinguish practically any phoneme from any other. For by losing this capacity, an infant also gains something: she gains the capacity to distinguish language from non-language, the mother language from a language spoken by an alien passing by. When this problem is resolved, the child's cognitive system can focus more efficiently upon the upcoming problem: that of the discovery and extraction of recurring patterns in and from the speech stream.

A set of experiments performed by Jusczyk and his colleagues focused principally on infants' ability to "hear" such regularities. One type of regularities are prosodic ones, for example syllabic stress patterns (stronger stress on first syllables in English). Other regularities are, of course, due to repetitive occurrences of the same words. In order to remark that the word X was repeated, an infant has to be able to somehow identify the word as something which was already heard. The study of Jusczyk and Aslin, which focused on the perception of monosyllabic words, has shown that « some ability to detect words in fluent speech contexts is present by 7 and half months of age » (Jusczyk and Aslin, 1995). The same study has also indicated that 6-month-old infants still lack the ability to perceive (monosyllabic) words as perceptual units. At 9 months, infants are able to identify sequences of two and more syllables: « 9-month-olds appear to be capable of integrating sequential and suprasegmental information in forming wordlike (multisyllabic) phonological percepts, 6-month-olds are not » (Morgan and Saffran, 1995). Another study indicates that 9-month-olds also prefer to listen to words of their ambient (mother) language and not to words from another language (Jusczyk et al., 1993).

Before attaining the first year of age, children are thus able not only to discriminate, but also to identify familiar phonemic chunks of various sizes, extract them from the speech stream and potentially associate them with contextual information and other sensory modalities (visual, tactile etc.). It is therefore reasonable to assume that a 9-month-old healthy infant already disposes of dozens of C-representations which can be labeled as protolexical.
The ontogeny of the infant's faculty to produce intelligible verbal signals is no less fascinating. It starts, of course, with the cry of a newborn who is able to obtain any desired change of environment (food, warmth, diaper change etc.) with one loudly and adamantly expressed bit of information. But after circa 2 months, an infant starts to produce more gentle cooing sounds which seem to express, contrary to crying, the infant's satisfaction or agreement with the current state of the environment. In the three or four months which follow, these two modes of verbal production - crying and cooing - become ever more refined and are ever more accompanied by facial and gestural expressions. And sometimes - when the cooing vowel-like "ooo" and "aaah" are co-articulated with some occlusive consonants, thus forming sounds like "uuum", "baaa" or "maaa" - one can observe the occurrence of marginal babbling. And then, somewhere between six and ten months, comes canonical babbling.

Canonical babbling (DEF)
«Canonical babbling consists of short or long sequences containing just one consonant-vowel (CV) combination that is reduplicated or repeated.» (Clark, 2003)
end canonical babbling

Consonants occurring in canonical babbling are more often than not voiced and labial1 (b), labionasal (m), velar (g) and - a little bit later, when the child already has some teeth to block the airflux with - also dental (d). The few months which follow shall subsequently be dedicated to the variation of both the enveloping intonation contour of the babbling as well as the syllables contained in the babbling sequence: canonical "mamama" sequences shall thus evolve into sequences like "mamapapadadada?". At circa 1 year of age, «many babbled sequences sound compatible with the surrounding language using similar sound sequences, rhythm and intonation contours.» (Clark, 2003)

1 In our bachelor's thesis we advanced the hypothesis that the prominence of labial closures in early babbling is to be associated with suckling.

It is during the period of babbling that the first "words" appear. And according to a growing amount of evidence, the development of first words is a natural and continuous prolongation of the babbling phase. Elbers and Ton, for example, summarize their analysis of the monologues of a Dutch boy Thomas in the six weeks following the acquisition of his first word (1;3-1;5) with the conclusion: «new words may influence the character and the course of babbling, whereas babbling in turn may give rise to phonological preferences for selecting other new words» (Elbers and Ton, 1985). For example, the frequency of t-like consonants occurring in the babbling sequences increased significantly (from 15% to 40%) in the period when Thomas started to use his t-containing word ("aut(o)").

Toddlers thus seem to be selective in the word forms which they pronounce, «working first on what they can already do and only after that moving on to harder problems» (Clark, 2003). When they do not have enough practice with a certain sound or a word form, they tend to avoid it. This hypothesis was demonstrated by an ingenious experiment designed as follows: «during 10 bi-weekly experimental sessions, 12 children (1;0.21 - 1;3.15) were presented with 16 contrived lexical concepts, each consisting of a nonsense word and four unfamiliar referents.
For each child, eight words involved phonological characteristics which had been evidenced in production (IN) and eight had characteristics which had not been evidenced in production or selection (OUT)» (Schwartz and Leonard, 1982). The results of the experiment, presented in Table 2, made it evident that while children's ability to understand is independent of the form of the word-to-be-understood, toddlers and pre-toddlers prefer to "mention" mainly those things whose names contain only familiar phonetic forms (i.e. IN words).

                          IN words    OUT words
Produced spontaneously       33          12
Understood correctly         54          50

Table 2: Children avoid production of words with unknown characteristics. Reproduced from a table in Clark (2003) based on data in Schwartz and Leonard (1982).

To get from babbling to a rich spectrum of intelligible words is not an easy task. Every child uses her own unique strategy to solve it; every child traverses a different "path" in order to align her linguistic structures with those of her social environment. As the author of a thorough study comparing the acquisition of phonology by three children put it: «each of the three children is exhibiting a unique path of development with individual strategies and preferences and an idiosyncratic lexicon» (Ferguson and Farwell, 1975). We consider it important to underline that these paths are rarely a linear descent from random babbling to optimal (i.e. correct) pronunciation. As can be seen not only in the data collected by Ferguson & Farwell, it is often rather the contrary which is the case: «although the children tend to be quite accurate in their first production, their accuracy often declines over time, so later versions of the same words appear to be further from the adult targets» (Clark, 2003).

What seems to be common to all those paths, however, is that they flourish with variation. As William and Teresa Labov observed during the longitudinal observation (1;3-1;8) of their daughter Jessie, she revealed «continuous exploration, experimentation, practice and intense involvement with linguistic structure» (Labov and Labov, 1978). For 3 months of Jessie's life, this experimentation was concerned solely with the words "cat" and "mama"; overall she pronounced each of these terms at least 5000 times during the 5 months of observation. «In summary, what might be regarded as a rather flat plateau in Jessie's development, upon closer inspection, revealed a constantly changing series of small experiments where she progressively scrutinized and tried out different phonological options.» (Clark, 2003)

These small experiments can often be characterized in terms of the application (or non-application) of specific simplification routines. These routines, which we shall call "variation operators" in iii, can be coarsely divided into three big groups:

1. Substitutions
2. Assimilations
3. Transpositions

Substitutions are due to the simple replacement of one sound (or group of sounds) with another sound or group of sounds. Common are the voicing of initial voiceless consonants ("pie" pronounced as [bay]), the devoicing of final ones ([nop] <- "knob"), gliding ("ball" -> [baj]) etc. Also, children often do not pronounce some parts of the word at all.
These omissions - which can be understood as special cases of substitution whereby one sound is substituted for a "blank" or "non-terminal" sound which is not articulated - are also very common, especially in the case of consonants at initial ("tram" -> [am]) or final ("pes" -> [pe]) positions.

Assimilations «refer to the effect of sounds on those preceding or following them within a word or across word-boundaries» (Clark, 2003). Only one or a few features can be assimilated, e.g. in "orol" -> [olol], where the lateral feature of the final "l" overrides the trill feature of "r" (backward lateralization), or in "balon" -> [balol] (forward lateralization). But a whole cluster of features, or even whole sounds, can be assimilated as well: this is the case in particular in syllable reduplication, whereby one syllable completely overrides the other ("wasser" -> [vava]).

Another group of simplification procedures & variation operators are transpositions. Known as "metathesis" in historical and evolutionary linguistics (8.5) and analogous, mutatis mutandis, to so-called chiasms (Hromada (2011); Dubremetz (2013)) in rhetoric, these switches in order (AxB -> BxA) are already at play in the production of toddlers ("KOstOL" -> [okol]). All these examples shall be discussed in closer detail in iii. And in the second volume of this thesis, these cases shall be formalized and subsequently embedded as "variation operators" into evolutionary computation scripts; a first, minimal sketch of such operators is given below. But for the purpose of this exposé, let's just limit ourselves to the observation that the sequenced application of similar routines shall, in the course of ontogeny, allow the child to converge from quasi-random babbling to a correct articulatory program able to produce the word X.
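The following sketch shows how the three operator families could be rendered as naive string-rewriting functions. It is our own illustration: the replacement table and the chosen examples are invented, and a serious implementation would of course operate upon phonological feature bundles rather than raw letters.

```python
import random

def substitute(word, table=None):
    # Substitution: replace individual sounds according to a
    # child-specific table (here a hypothetical lateralization r -> l).
    table = table or {"r": "l"}
    return "".join(table.get(ch, ch) for ch in word)

def reduplicate(word):
    # Extreme assimilation: one syllable overrides the whole word,
    # as in "wasser" -> [vava].
    return word[:2] * 2

def transpose(word):
    # Transposition (metathesis): swap two segments, AxB -> BxA.
    if len(word) < 2:
        return word
    i, j = random.sample(range(len(word)), 2)
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)

print(substitute("orol"))    # -> olol
print(reduplicate("vasa"))   # -> vava
print(transpose("kostol"))   # e.g. -> soktol
```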
end ontogeny of ppp

9.2.3 ontogeny of lexicon and semantics

The raison d'être of language is to communicate meanings. Semantics is the scientific discipline devoted to the study of meanings. Meaning - also called signifié in the tradition established by de Saussure (1916) - is a fairly abstract entity which only rarely, if ever, exists on its own. In language, meanings are always coupled with "signifiants", i.e. with material phonetic or graphemic forms which denote some specific meaning. Signifiant, signifié and information related to morphosyntactic properties (c.f. 9.2.4) form a triad which, taken all together, composes a word. In modern linguistics, words are sometimes considered to be members of a "lexicon". A lexicon is simply the set of all words internalized by and represented within the individual cognitive system. In DP, the process of acquisition of the lexicon is also known as vocabulary development.

We consider the process of vocabulary development to be, to a huge extent, reducible to the problem of construction of semantic categories. Under such a view, the problem of understanding a new word W can be understood as the problem of:

1. detection of recurrence of W in speech
2. establishment of a mapping|association between W and a corresponding semantic category C3
3. reducing or increasing the extension of C so that it is neither too specific nor too general

None of these problems is computationally trivial, but children nonetheless solve all of them with stunning swiftness and ease. We think that this is so because the human brain (2.8) is principally a pattern-detecting computational device whose principal objective, especially during the initial stages of ontogeny, is to subsume huge amounts of contextual multi-modal information under and into as-neatly-as-possible packaged categories.

3 We consider it important to make precise that within the theory hereby proposed, semantic categories - understood as points, regions or subspaces of some sort of "absolute semantic space" - can be shared, i.e. accessed by multiple mutually independent cognitive agents.

Under such a view, a word W, a signifiant, is not only a "label" for its respective conceptual category; it is also a stimulus triggering a completely involuntary categorization process. As Kyra Karmiloff and her mother Annette put it: «there is a dynamic feedback between developing cognitive skills and growing vocabulary, and words can act as an invitation to form a category» (Karmiloff and Karmiloff-Smith, 2009). Since we shall return later (10.4) to a more theoretic discussion of what semantic categories "are" in the computational sense, and of how the mapping between them and their labels can be constructed in a general computational system, let's just focus on the question: "What are the particular aspects of the acquisition of semantic categories constructed in human children?"

As infants gradually overcome the perceptual limitations of the newborn state, they tend to see the world ever more clearly. This subsequently makes it possible that «very young infants can and do perceive even the most subtle differences between and across category members. One study showed that three-month-olds could not only differentiate between cats and dogs (a between-category distinction), but also distinguish among different kinds of cat (a within-category distinction)» (Karmiloff and Karmiloff-Smith, 2009). During the first year of age, initial C-representations are formed by associating such representations of perceptual categories with co-occurrent representations of the most frequent and salient forms which the infant succeeds to detect and identify in her linguistic environment.

Interaction with such an environment - consisting mainly of mother, father, siblings or other "tutors" - is dynamic, repetitive and goal-oriented. Roger Brown describes it as a "word game" of which the child is a principal player: «The tutor names things in accordance with semantic customs of the community. The player forms hypotheses about the categorical nature of the things named. He tests his hypotheses by trying to name new things correctly. The tutor compares the player's utterances with his own anticipations of such utterances and, in this way, checks the accuracy of fit between his own categories and those of the player. He improves the fit by correction.» (Brown, 1958)

To understand what object is meant by what name is not an easy task. For how does a child know that the word "milk" means the life-strengthening liquid and not white color, liquid in general, something to drink, or the vessel in which it is stored? A possible answer is: by application of diverse lexical constraints (LCs). Among the multiple LCs mentioned in the literature, we consider these:

1. whole-object assumption
2. basic-level assumption
3. taxonomic assumption
4. mutual exclusivity and fast mapping constraints

to be of biggest importance during the toddler stage of LD. The whole-object assumption «presupposes that children already have categories of objects, such that objects can be represented as whole entities distinct from their locations or from their relations to other objects or places.» (Clark, 2003)
It is evident that endowing humans with such an assumption could be quite useful for our survival as a species: to be able to immediately perceive and label a lion as a lion is a more "fit" strategy than to invest computational resources in seeing the details of the lion's fur or whiskers. The same applies to the basic-level assumption: the ability to partition the world into basic-level categories (Rosch, 1999), which are not too general (above basic level), nor too specific (below basic level), is crucial to survival. In comparison to one's ability to categorize a shark as a shark, the ability to categorize these predators into below-basic-level categories as blue, white or tiger sharks, or as members of the above-basic-level category of chordates, is somewhat secondary.4

4 This does not apply to professional biologists and philosophers, of course.

Another LC which is quite closely related to Rosch's theory of basic-level categories and prototypes (10.4.1) is the taxonomic assumption, which presupposes that labels should a priori be extended to objects of the same kind and not to objects which are thematically related. Its validity was demonstrated by an experiment in which «children saw a series of target objects (e.g., dog), each followed by a thematic associate (e.g., bone) and a taxonomic associate (e.g., cat). When children were told to choose another object that was similar to the target ("See this? Find another one."), they as usual often selected the thematic associate. In contrast, when the instructions included an unknown word for the target ("See this fep? Find another fep."), children now preferred the taxonomic associate.» (Markman and Hutchinson, 1984)

While the above-mentioned LCs are useful heuristics for determining either the nature or the scope of categories-to-be-constructed, the LCs of fast mapping and mutual exclusivity are heuristics facilitating the discovery of the relation between the label (signifiant) and the semantic category (signifié). Thus, «the mutual exclusivity constraint stipulates that in a given language an object cannot have more than one name, so if the child already knows the word "car," he will not think a new word refers to cars. In other words, in the early stages of word learning, the child does not expect synonyms. The second constraint, fast mapping, stipulates that novel words map onto objects for which the child does not already have a name.» (Karmiloff and Karmiloff-Smith, 2009)

People: mommy (1;0), daddy (1;0), baby (1;3)
Food: banana (1;4), juice (1;4), cookie (1;4), apple (1;5), cheese (1;5)
Body parts: eye (1;4), nose (1;4), ear (1;5)
Clothing: shoe (1;4), sock (1;6), hat (1;6)
Animals: dog (1;2), kitty (1;4), bird (1;4), duck (1;4)
Vehicles: car (1;4), truck (1;6)
Toys: ball (1;3), book (1;4), balloon (1;4)
Household objects: bottle (1;4), keys (1;5)
Routines: bye (1;1), hi (1;2), no (1;3)
Activities: uh oh (1;2), woof (1;4), moo (1;4), ouch (1;4)

Table 3: Words produced by at least half of children in the monthly sample. Reproduced from a table in Clark (2003) based on data from Fenson et al. (1994).

Both lexical constraints of mutual exclusivity and fast mapping can be understood as direct implications of the principle of contrast, whose definition follows the sketch below.
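Before stating the principle itself, we offer a minimal sketch of how mutual exclusivity and fast mapping could be jointly operationalized. Everything in it - the function name, the dictionary representation of the lexicon, the uppercase referent tokens - is our own illustrative convention, not a claim about how children actually store words.

```python
def fast_map(utterance_words, visible_referents, lexicon):
    """Fast mapping under mutual exclusivity: a novel word is mapped
    onto the single visible referent for which no name is known yet.
    `lexicon` maps known word forms to referent tokens."""
    novel_words = [w for w in utterance_words if w not in lexicon]
    unnamed = [r for r in visible_referents if r not in lexicon.values()]
    if len(novel_words) == 1 and len(unnamed) == 1:
        lexicon[novel_words[0]] = unnamed[0]
    return lexicon

lexicon = {"car": "CAR"}
# the child sees a car and an unfamiliar object, hears the novel word "fep"
print(fast_map(["fep"], ["CAR", "FEP"], lexicon))
# -> {'car': 'CAR', 'fep': 'FEP'}
```

Since "fep" is the only unknown word and FEP the only unnamed visible referent, the mapping is established after a single exposure - which is precisely what the "fast" in fast mapping denotes.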
The Principle of Contrast (DEF)
«Every two forms contrast in meaning.» (Clark, 1987)
end the principle of contrast

The importance of this fairly trivial principle with regard to LD is not to be underestimated. The acquisition of any kind of form-meaning mappings can be significantly catalysed by the sole fact that PoC applies. Take, for example, an information-processing agent which knows only what "mama" means, but often hears the expression "mama a tato" when it simultaneously sees her mother and father. The discovery that the form "tato" denotes "father" would be trivial for an agent with PoC embedded among her information-processing procedures. And quasi impossible, or very (computationally) costly, for an agent without it.

Thus, with the aid of a very restricted number of heuristic-like principles and constraints, and in combination with contexts which repeat themselves day after day and week after week, small infants shall start to associate the first linguistic forms with the first conceptual categories. But the sole establishment of this association between the word and the category is not sufficient. The scope, the extent, the region of semantic space covered by and attributed to the specific category has to be delimited as well. Before this becomes the case, the child shall commit many errors of either insufficient or excessive generalisation. For example, in the case of insufficient generalisation, she shall sometimes apply a generic label ("dog") to denote just one specific canine ("Lessie"). And in the case of excessive generalisation, she shall denote with a label ("cat") even referents ("lynx") upon which such a label is not commonly applied by the child's linguistic community.

Clark (2003) offers a nice example, extracted from Kuczaj's transcripts5 contained within the CHILDES corpus, in which a child (2;4) learns a new word which shall help her narrow down a general verbal meaning:

I (wanted to have his orange peeled): Fix it.
T: You want me to peel it?
I: Uh-huh. Peel it.

5 In transcripts of conversations with children we shall label child-generated sentences with I (meaning "infant") and adult-generated sentences with T (meaning "tutor").

Section 13 shall present some more detailed results related to such micro-conversations resulting in a correction of the child's semantic category. For the time being, let's just suggest that such parental or sibling corrections could be quite easily integrated into a darwinian model of language ontogeny, either as a sort of selection or as a variation operator. Such exogenous, environment-originated perturbations gradually divide the infant's conceptual space into a structure of partitions functionally isomorphic to the structure of partitions embodied in the child's tutor. Table 4 illustrates in a very brief but nonetheless telling way an example of how the relations between a few labels and their subjacent categories changed in the ontogeny of one particular child.

word   initial and subsequent referents                          more appropriate word
papa   father/grandfather/mother (1;0)                           mama (1;3)
Mann   any man (1;2); pictures of adults (1;5); any adult (1;6)  Mann (1;5); Frau (1;7)
ball   ball (1;0); balloon (1;4)                                 balloon (1;10)

Table 4: A case of development of word|meaning mappings. Based on data in Barrett (1978).

Thus, an important «part of learning a word meaning is also learning what the extension of each term is, by learning what counts as a possible referent. Children also try out some words in ways that are hard to link to any identifiable use.
The target word itself may not be identifiable, and the general absence of adult comprehension typically leads to the word's being abandoned» (Clark, 2003). We propose to interpret this tendency to "abandon non-identifiable words" as a sort of selection. The cumulation of multitudes of such selective events, combined with the playful variation inherent to every healthy child, shall, so we argue, gradually attract the child's mind into a state where she has the language of her surroundings at her disposal.

And the learning of concepts is indeed gradual. Analyses of maternal journals and estimations suggest that at 12 months of age, children understand on average at least ten words (Menyuk et al., 1991). In the following months, the size of the lexicon increases only slowly; the topics of the words which the child understands and is subsequently able to produce are also quite restrained: «not surprisingly, young children talk about what is going on around them: the people they see every day; toys and household objects they can manipulate; food they themselves can control; clothing they can get off by themselves; animals and vehicles both of which move and so attract attention; daily routines and activities; and some sound effects» (Clark, 2003). Table 3 contains a list of words produced at a given age by at least 50% among the 1803 children whose parental reports were studied by Fenson et al. (1994).

From the perspective of the end-state language, many among these first words are specific-object-denoting nouns. But children often use them with the function of verbs or, more specifically, as imperatives. Thus, when saying "milk" a small child expresses her wish, want and need, meaning "wanting milk" or "make me get that bottle". Only months later shall such proto-imperatives be accompanied by proto-declarative statements meaning "look, mother, there is milk!". This distinction between proto-imperatives and proto-declaratives is not to be underestimated, since it seems to stem from the child's growing will to share information. Since we shall return later (9.4.4) to this properly human tendency to share information, intentionality and attention with others, let's just express our agreement with the statement that «Using language simply to share a common experience with the listener is particular to human communication. Animals tend only to use communication in a proto-imperative way» (Karmiloff and Karmiloff-Smith, 2009).

It is approximately in the period of gradual passage from proto-imperative to proto-declarative use of language, i.e. between 16-20 months, that the rate of acquisition of vocabulary shall start to accelerate. This phenomenon, known as vocabulary explosion or vocabulary spurt, starts to express itself when the child's productive vocabulary attains approximately 150 words and can be described as follows: «Prior to the vocabulary spurt, children learn on average about three words per week. But when they enter the vocabulary spurt stage, their learning of new words increases dramatically to about eight to ten words per day» (Karmiloff and Karmiloff-Smith, 2009). We shall discuss the phenomenon of vocabulary spurt in somewhat more quantitative terms in 10.1.2, during the discussion of the Logistic law.
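Anticipating that discussion, and assuming the standard logistic form (our choice of formulation, not a formula taken from the cited studies), the growth of a productive lexicon L(t) bounded by some capacity K can be written as:

\frac{dL}{dt} = r\,L\left(1 - \frac{L}{K}\right) \quad\Longrightarrow\quad L(t) = \frac{K}{1 + e^{-r\,(t - t_{0})}}

The growth rate is maximal when L = K/2, i.e. at the inflection point t_0: under such a reading, the "spurt" is not a separate mechanism switching on, but simply the steepest stretch of one continuous curve.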
Figure 13: Development of productive vocabulary in early (a) and late (b) toddlerese. Figures reproduced from Fenson et al. (1994).

Figure 13 shows the development of productive vocabulary in 1803 children as measured by the MacArthur infant and toddler communicative development inventories. Authors like Marchman and Bates (1994) interpret the occurrence of this and other similar LD-related phenomena in terms of the attainment of a critical mass.6 It seems indeed reasonable to postulate that some sort of qualitative change of the toddler's linguistic faculties occurs during the period when she enters the vocabulary spurt: for approximately in the same period, the toddler shall start to juxtapose words side by side and construct first phrases. And that marks the advent of morphology and syntax, which shall be discussed in the following section.

6 Phenomena which occur only when a certain critical mass is attained are best studied by the theory of complexity (c.f. Kauffman (1995) for a gentle introduction). Such phenomena are often related to a so-called "phase transition" (e.g. water → ice; fuel → reactor) which can be accompanied by not only a quantitative but also a qualitative transformation of the observed system.

Before we end this very brief overview of word learning among toddlers, let's just reiterate the finding that «one of the interesting characteristics of words is that their meanings do not remain static; they can change» (Karmiloff and Karmiloff-Smith, 2009). And this "change" is part of a process which is usually called "learning". And this "learning" can, we suggest, be plausibly interpreted as a particular case of an evolutionary process which, during ontogeny, divides one's semantic space (10.4) into categories which shall tend to overlap with the categories "out there".

9.2.4 ontogeny of morphosyntax

In traditional linguistics, the most fundamental meaning-carrying units of linguistic analysis are not individual words, but so-called morphemes. That is, prefixes, suffixes, word roots or other materially encoded signifiants encoding a particular signifié. The particular system of mutual interactions of diverse categories of morphemes yields a particular morphology. When the interaction of morphemes surpasses an individual word and multiple words are concatenated into a full-fledged utterance, information can also be contained in the way in which the diverse components (words) are ordered: in the utterance's syntax (σύν -> "together", τάξις -> "ordering"). Since in many languages the distinction between morphology and syntax seems to be very fuzzy7, some linguists prefer to speak simply about morphosyntax.

7 Take, for example, the German word of the year 1999 "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" which, in fact, is quite minimalist in comparison to words uttered by classical Sanskrit poets. Are the rules which govern the composition of such words the rules of morphology, or the rules of syntax?

Similarly to many other forms of human activity (e.g. object-manipulation, food-preparation, rituals etc.), human languages are compositional and combinatorial. Compositionality can be defined as follows:

Compositionality (DEF)
«The meaning of a signal is a function of the meaning of its parts, and how they are put together.» (Brighton et al., 2003)
end compositionality

while combinatoriality means that a theoretically infinite - yet practically vast but finite - set of complex constructions can be obtained by means of combining a finite amount of elements (morphemes). The evolutionary and computational advantages of the compositionality and combinatoriality of natural languages being addressed elsewhere
(Brighton et al. (2003); Kvasnicka and Pospichal (1999); Pinker (2000)), let's now focus on other characteristics related to the development of syntax in practically any healthy human child.

One such "universalium" is that there exists both inter-linguistic (i.e. children of different communities acquire different languages) and intra-linguistic (i.e. children of the same community acquire their language differently) variability. The variability of developmental trajectories is in fact so huge that one could plausibly argue that there are no two children in the world - not even twins8 - who would acquire language in an absolutely identical way. This is so because language-learning is strongly dependent on the individual perspective as well as on the context within which the learning occurs. The contexts from which a child acquires language structures involve not only auditive, but also visual, emotional, social, etc. dimensions, and the internalization of structures thus involves many factors. Since some of these factors are stochastic, language acquisition itself can NOT be a fully deterministic process. This is the second "universalium".

8 In this context, we consider it worth mentioning that twins often develop a sort of language of their own, or "idioglossia", whose potential conflict with the ambient language can slow down the twins' language development.

Aside from compositionality, combinatoriality, variability, context-boundedness and non-determinism, we consider it worth mentioning these other characteristics which could be considered as universal and axiomatic:

• graduality: children tend to master shorter structures before they master longer structures
• cumulativity: children tend to "build upon" what they already know
• specificity: children tend to learn individual patterns in individual contexts of social interaction
• repetitivity: scenes in which children acquire an individual structure X contain certain recurrent features
• inductivity: specific structures can be transcontextually crossed-over to yield structures corresponding to more general meanings than the ones with which the child was already confronted

Of course the list does not end here, and other properties - like recursivity (Chomsky, 1957), syllabicity (Jackendoff, 2002), the importance of substitution, or intention-sharing (Tomasello, 2009) - have been rightfully highlighted. Since some of these shall be discussed in 9.4 and 10.2, let's now lay the generalities aside and focus upon facts.

First, production: while still in their babbling phase, children first produce one-word "holophrases" which they succeed to fit into an individual intonational contour. As the intentions they want to communicate get more and more complex, children couple these with movements like approaching or running away; with gestures like pointing, nodding or shoulder-shrugging; or even with more complex manipulations like object bringing, throwing or showing. In sum, «gestures appear to help young children communicate before they can pronounce the longer phonological sequences required for combining words» (Clark, 2003).
As the temporal span of intonational contours increases9 and as the child improves her pronunciation of individual words - thus reducing the cognitive cost related to the phonetic aspects of the utterance - she succeeds, normally around circa 18 months of age, to fit multiple words under the vault of a single intonational contour, thus creating a first two-word construction.

9 Possibly because of the slowing-down of the "internal oscillator" observable in experiments with so-called spontaneous tempo (c.f. 9.2.6)

According to Tomasello (2009, pp. 104), this primordial "word combinations" stage has two distinctive features:

• they partition the scene into multiple symbolizable units
• they are composed only of concrete pieces of language, not categories

A concrete example, MAMA NENE (meaning "mother-breast"), shall be further discussed in 12.7.1. The child's ability to concatenate two words and integrate them into a single intonational contour is swiftly followed by the emergence of so-called pivot schemas.

Pivot schema (DEF)
A two-word schema in which «one word (the "pivot") recurs frequently in the same position in combinations, and the other word varies10» (Braine and Bowerman, 1976)
end pivot schema

10 The word "varies" put in italics by the author of this Thesis.

A canonical example of what is meant by pivot words and pivot schemas is presented in the table reproduced on Figure 14. This table lists all comprehensible two-word combinations, noted by the mother, which the toddler named Andrew spontaneously produced during the first five months after leaving the single-word stage.

Figure 14: Corpus of two-word utterances produced by the toddler Andrew. Reproduced from Braine and Bowerman (1976).

In Andrew's case, the pivot words are "more", "no", "all", "other", "there", "off", "all gone", "all done", "byebye", "hi" and "see". It can be immediately seen that pivot words tend to be juxtaposed with words belonging to specific linguistic categories ("more" with nouns, "all" with adjectives or participles). Where there are categories, there is generalisation, and where there is generalisation, there is productivity; and indeed, such productivity of pivot schemas was experimentally demonstrated in (Tomasello et al., 1997). In retrospect, its author concludes it as follows: «22-month-old children who were taught a novel name for an object knew immediately how to combine this novel name with other pivot-type words already in their dictionary» (Tomasello, 2009). Another interesting result of the same study was that «children combined the novel nouns productively with already known words much more often than they did the novel verbs – by many orders of magnitude» (Tomasello et al., 1997). But because categories like "nouns" and "verbs" are results of adult categorization of certain lexical phenomena and not necessarily categories pertinent to the child's own linguistic experience, let's just limit ourselves to the trivial observation that specific pivot words have affinity to words with specific features. And vice versa.
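Such positional affinities are easy to operationalize. The sketch below - a deliberately naive one of our own, with an invented threshold and a toy corpus loosely modeled on Andrew's combinations - extracts pivot candidates from a list of two-word utterances as words which recur in a fixed position while their partners vary:

```python
from collections import defaultdict

def find_pivots(utterances, min_partners=3):
    # (word, position) -> set of partner words observed with it
    slots = defaultdict(set)
    for first, second in utterances:
        slots[(first, 0)].add(second)
        slots[(second, 1)].add(first)
    # a pivot candidate keeps its position while its partner varies
    return {word: sorted(partners)
            for (word, _pos), partners in slots.items()
            if len(partners) >= min_partners}

corpus = [("more", "car"), ("more", "juice"), ("more", "sing"),
          ("boot", "off"), ("light", "off"), ("water", "off")]
print(find_pivots(corpus))
# -> {'more': ['car', 'juice', 'sing'], 'off': ['boot', 'light', 'water']}
```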
These mutual affinities between "constant" pivot words and their variable "complements" result in the emergence of populations of microsystems of productive order, which Tomasello (2009, pp. 117-127) calls "item-based constructions". When observing his daughter, Tomasello realized that: «almost all ... multi-word utterances during her second year of life revolved around the specific verbs or predicative terms involved. This was referred to as the Verb Island hypothesis since each verb seemed like its own island of organization in an otherwise unorganized language system...Within any given verb's development there was great continuity such that new uses of a given verb almost always replicated previous uses and then made one small addition or modification.» (Tomasello, 2009)

Other experiments have indicated the validity of the claim that the stage of "pivot schemas" naturally develops into a stage of such "constructional islands" of productive order. For example, the study of Pine and Lieven (1997) has shown that children between 1 and 3 years of age tended to use the determiner "the" juxtaposed with one set of nouns and the determiner "a" juxtaposed with another, with rare overlap between the sets. In a parallel study conducted with the same group of 12 toddlers, the same authors observed that 91.6% of the first 400 distinct utterances could be "traced back" to only 25 initial patterns (Lieven et al., 1997). Since many results of these studies are English-specific (e.g. the importance of prototypical constructions like "want+X", "verb+it" etc.), we consider it important to emphasize those conclusions of these authors which seem to point in a more "universal" direction: «Our metaphor would be of language developing initially as a number of different islands of organization which gradually link up...These islands are initially segments (either words or phrases) which the child has identified to the extent that she can start analysing other systematic relations between what comes before, after or within them...We think, rather, that the data can support a view of structure as emergent.» (Lieven et al., 1997)

Based on the research of Lieven and her colleagues as well as on his own, Tomasello (2009, p. 308) lists three basic operations by means of which a child can produce an utterance:

1. retrieval of a rote-learned concrete expression and the repetition of the same form as was already heard
2. retrieval of an utterance-level construction and its modification in order to fit the current situation
3. «combining constituent schemas» (Tomasello, 2009)

Note that the first operation can be aligned with the notion of "imitation" and thus "replication of information", the third can be interpreted as a "cross-over", and the second is - so states our Thesis - equivalent to what universal darwinists call "variation operators". Tomasello lists three principal means of structural modification, illustrated in the sketch which follows the list:

1. extension: concatenation of a constituent to the end or beginning of an expression (e.g. ich auch + Yoga -> ich auch Yoga)
2. injection: «inserting a new constituent into the middle of an utterance-level construction or expression (the way a German child might insert auch11 [too] into a schema where nothing had ever before appeared)» (Tomasello, 2009)
3. slot-filling: inserting new content into a slot of an item-based construction (e.g. Brot + essen X -> Brot essen)

11 C.f. 12.7.2 for a closer discussion of the productivity of "auch" in the case of one specific child.
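A minimal rendering of these three operators over tokenized utterances might look as follows. It is our own sketch: in a real model the "constituents" would of course be constructions with their own internal structure, not bare strings.

```python
def extend(utterance, constituent):
    # extension: concatenate a constituent to the end of an expression
    return utterance + [constituent]

def inject(utterance, constituent, position):
    # injection: insert a new constituent into the middle
    return utterance[:position] + [constituent] + utterance[position:]

def fill_slot(schema, filler, slot="X"):
    # slot-filling: replace the slot symbol of an item-based construction
    return [filler if token == slot else token for token in schema]

print(extend(["ich", "auch"], "Yoga"))     # -> ['ich', 'auch', 'Yoga']
print(inject(["ich", "Yoga"], "auch", 1))  # -> ['ich', 'auch', 'Yoga']
print(fill_slot(["X", "essen"], "Brot"))   # -> ['Brot', 'essen']
```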
It is evident that such "slots" are, in fact, categories, and they are denoted by what is in formal linguistics (10.2) called non-terminal symbols. As category-representing symbols, they are undoubtedly a consequence of a category-construction (CC) process12. In the long run, the output of the CC process should be a set of categories which are functionally equivalent to the categories shared by, and inherent to, the child's social surroundings. But what was already said about lexical and semantic categories holds also, mutatis mutandis, for the grammatical ones: before the gap between the ambient and the individual is bridged, before the structure of partitions inherent to the latter converges to a structure isomorphic to the former, discrepancies between the two systems are to be observed.

12 We prefer to speak about CC and not simply about "categorization" to mark the distinction between the process by means of which a category is built, and the process during which an already built category is used in order to "categorize" diverse blobs of stimuli observable in the world.

The most salient and best studied among such discrepancies is the group of phenomena labeled as "over-regularization". Traditionally, over-regularization is supposed to account for cases whenever the child applies a production rule beyond the scope of its validity. The most famous example of over-regularization in English is that in a certain stage of their development, practically all children tend to apply the rule VPast → VPresent+ed to all verbs. Thus, especially during the period when their mean length of utterance (MLU) is circa 4-5 words, children generate past forms like «throwed» or «breaked» which they have never (or very rarely) heard. Another interesting aspect of over-regularization is that children often use the correct forms BEFORE they start producing incorrect over-regularizations: «Initially, children's uses of -ed past tense are all accurate. They may say melted or dropped, but not, as they later do, runned and breaked.» (Maratsos, 1988)

Sooner or later - but more often sooner than later - practically all grammatical over-regularizing discrepancies are corrected and the child's linguistic behaviour is aligned with that of her surroundings. It is difficult to describe this fact without taking for granted "the principle of precedence of the specific":

The principle of precedence of the specific (DEF)
«Whenever a newly acquired specific rule (i.e. a rule that mentions a specific lexical item) is in conflict with a previously learned general rule (i.e. a rule that would apply to that lexical item but also to many others of the same class), the specific rule eventually takes precedence.» (Braine, 1971)
end principle of precedence
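The blocking dynamics which this principle describes can be compressed into a few lines - a toy sketch of our own, not a claim about the psychologically real lookup mechanism:

```python
def past_tense(verb, exceptions):
    # precedence of the specific: a stored item-specific (irregular)
    # form blocks the general V+ed rule whenever it is available
    if verb in exceptions:
        return exceptions[verb]
    return verb + "ed"

exceptions = {"throw": "threw", "break": "broke"}
print(past_tense("melt", exceptions))   # -> melted
print(past_tense("throw", exceptions))  # -> threw
print(past_tense("throw", {}))          # -> throwed (over-regularization)
```

Under such a reading, an over-regularizing child is not one who lacks the general rule, but one whose store of specific exceptions is momentarily empty, incomplete or unreliably accessed - and the rote-learned correct forms which precede over-regularization are simply exceptions acquired before the general rule itself.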
This principle is defined here in terms of "rules". But as shall be seen in 9.4, the notion of "rule" is crucial only for certain - and not all - theories of language and LD. Thus, too much focus on the notion of "rule" can turn out to be misleading, the more so in this section of our exposé, where our objective has been to focus more on empirical and less on theoretical considerations of the process of development of individual morphosyntactic representations.

The body of empirical research which has explored this or that facet of language acquisition is indeed vast. For example, even a simplistic synthesis concerning the developmental, cross-linguistic or clinical aspects of MLU would easily account for a monograph thick as a brick. But in this Thesis we cannot dedicate to this topic more space than that which is dedicated to Figure 15. The same holds for other fascinating LD-related topics like cue competition (MacWhinney, 1987) in both comprehension and production, or the acquisition of verbal skills related to negations, questions or word order: all these problems, and many others, are simply too specific to be addressed appropriately here, where only the most general principles of LD are sought to be addressed.

Figure 15: Mean length of utterances produced by English and Italian children of different ages (in months). Figures reproduced from Devescovi et al. (2005).

However, what should be addressed and re-addressed, emphasized and re-emphasized, is the importance of linguistic input. This is so because both the content and the distribution of linguistic input significantly influence the content and distribution of the resulting representations and structures. This observation may seem trivial, but it is less so when one realizes "how special" the content of child-directed input and its distribution are. Since the content shall be more closely discussed in section 9.3, let's now end this brief overview of the development of the child's LD with the question: is there some particular distributional, statistical, computational property of linguistic input which facilitates the internalization of morphosyntactic representations?

And the answer seems to be: yes there is, and it seems to be somehow related to the fact that the acquisition of linguistic representations is governed, similarly to many other cognitive functions, by the principle of distributed practice.

Principle of distributed practice (DEF)
«Given an equal number of exposures, distributed (or spaced) practice at a skill is almost always superior to massed practice.» (Tomasello, 2009)
end principle of distributed practice

In other words, humans in general and children in particular internalize better when they are confronted with the structure-to-be-internalized within the contexts of N different sessions (ideally on different days), and worse when they are confronted with it N times during the same session (on the same day). In relation to LD, this phenomenon was first noticed in a study by Schwartz and Terrell (1983), who observed that both the group of 1-3 year old children who heard a new word once per session and the group of children who heard it twice per session needed approximately 6-8 sessions to learn it. Thus, «when the absolute number of presentations was held constant, distributed (infrequent) presentations led to greater acquisition than massed (frequent) presentations.» (Schwartz and Terrell, 1983)

Similar results were subsequently obtained in studies of the acquisition of grammatical constructions. For example, Ambridge et al. (2006) conclude their study: «for grammatical constructions, children are more able to analogize across exemplars and extract a relational schema when those exemplars are more widely distributed in time than when they are temporally contiguous» (Ambridge et al., 2006).
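Why spacing should help can be glimpsed even in a deliberately crude decay-and-increment memory model. Every ingredient below - the parameter values, the "close half the gap" encoding rule, the exponential forgetting - is our own toy assumption, meant only to make the spaced/massed asymmetry concrete:

```python
import math

def strength_after(exposure_days, test_day, alpha=0.5, decay=0.3):
    """Toy memory model (all parameters invented): trace strength
    decays exponentially between events, and each exposure closes a
    fraction alpha of the gap to full strength."""
    s, t = 0.0, None
    for day in exposure_days:
        if t is not None:
            s *= math.exp(-decay * (day - t))  # forgetting since last event
        s += alpha * (1.0 - s)                 # (re-)encoding
        t = day
    return s * math.exp(-decay * (test_day - t))

print(strength_after([0] * 6, 6))             # massed: ~0.16
print(strength_after([0, 1, 2, 3, 4, 5], 6))  # spaced: ~0.59
```

With six exposures packed into one day, the trace saturates quickly and then decays for six days; with the same six exposures spread over six days, each new exposure re-encodes an already partially forgotten trace, and the strength measured at the test is several times higher.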
And since it is possible that something like the "principle of distributed practice" exerts its force not only in the acquisition of human verbal behaviour but also in the development and optimization of other cognitive functions and skills, the same authors conclude: «a single set of general learning and cognitive processes is responsible for the acquisition of both individual lexical items (the lexicon) and regular and irregular grammatical constructions (the grammar)» (Ambridge et al., 2006). Agreeing with such conclusions, of which Piaget would undoubtedly have been quite fond, we terminate this section with the expression of our belief that, no matter whether taking place in the lexical or the morphosyntactic domain, such processes are principally based on the iterative, gradual and non-deterministic optimization of populations of internal representations which replicate, vary and are subject to selection. A belief which we shall try to defend in what shall follow.

end ontogeny of morphosyntax

9.2.5 ontogeny of pragmatics

Pragmatics goes hand in hand with practice. In linguistics, pragmatics is all that is somehow involved in the production or comprehension of an utterance but is not contained within the utterance itself. Thus, pragmatics is all that encompasses and envelops the communicative act; the pragmatic layer contains all the context within which the utterance is exchanged.

It was already stated that language is a social enterprise, and the context within which natural language utterances are exchanged is thus principally a social context. In such a context, multiple human agents are in mutual interaction, and the exchange of sequences of linguistic symbols is only one among many other ways in which these interactors modify each other's mental states. Other important channels of communication between two prototypical human subjects - a mother and a child - are illustrated on Figure 16.

Figure 16: Some modalities of information exchange between a mother and her child. Reproduced from Trevarthen (1993).

But the extralinguistic context is not limited to facial expressions and gestures. Nor would the introduction of olfactory (pheromonal) or haptic communication make the notion of "context" complete. For the context par excellence is given by the very spacetime region within which the linguistic exchange takes place, the region which contains specific physical objects or embodies certain processes.

Somewhat contrary to what is displayed on Figure 16, linguistic communication is rarely dyadic. Much more often it refers to an object or a state-to-be-attained which is external to both members of the interacting couple. It is rarely by chance that two humans encounter each other: more often they go towards each other because they want to be with each other. Even if the object of such wanting can be a simple "being with the Other". What's more, both interactors have mental states and they both have intentions. And to make things even more complex, they use language in order to mutually modify their mental states. They use language in order to augment the probability that their intention shall be materialized. With the help, and through the act, of the Other. Thus, an infant whose cry/not-cry signal emitter is not appreciated anymore has to change her strategies.
She has to learn what formulas work best in what contexts; what should be said and when, where, in what order and how it should be said, so that the mental states of the Other are modified appropriately. A Herculean task extending well beyond childhood and puberty towards adolescence and beyond: pragmatic knowledge seems to be acquired from the very first until the very last breath of one's ontogeny. Having many forms - from the cry of a newborn to the wisdom of an old man; from benevolent lies to manipulative propaganda - the acquisition of pragmatic knowledge seems to be too difficult a nut to crack. Given this awareness of the diverse intricacies and complexities of the pragmatic layer, as well as the respect before the maxims (Grice, 1975) and values which are its foundation, and in agreement with the principle «Whereof one cannot speak, thereof one must be silent.» (Wittgenstein, 1922), we decide not even to try to discuss the computational aspects of pragmatics-related phenomena.

end ontogeny of pragmatics

9.2.6 physiological and cognitive development

One cannot speak about early development and ignore the vast amount of physiological and cognitive changes which children undergo. Between birth and the end of the toddler stage, the height of the child's body almost doubles and both the weight and the lung volume more than triple. Muscles strengthen, bones ossify. Fontanelles close, thus enveloping the brain within the fully enclosed resonator called the skull. Primordial reflexes appear and disappear. In the first 12 months alone, the average brain volume increases from 369 cubic centimeters to 961 cc. This increase, however, is not to be explained in terms of an increase in the quantity of neurons (gray matter) but in terms of an increase of glial cells. In the context of what was already said about Neural Darwinism (8.6), we consider it important to underline that during development, the number of neurons in fact decreases, due to the process known as "synaptic pruning".

It is also during the early years of development that the linguistic faculty gets entrenched in a specific hemisphere of the brain. While there is still an ongoing debate concerning diverse aspects of this process of "lateralization" (see Clark, 2003, pp. 387-391 for an overview), it is nonetheless commonly accepted that the hemisphere of installation of the developing linguistic faculty is determined in the first 20 months of age.

Among the hundreds of other neurological and physiological changes which shall occur with apodictic necessity in any healthy human child, there are three which we consider to be particularly important for linguistic development and yet tacitly undiscussed by psychodevelopmental linguists. The first is related to the relatively trivial fact that, in comparison to other primates, human teeth erupt very late (Holly Smith et al., 1994). This, on one hand, allows for much longer breast-feeding and hence temporally reinforced emotional and social bonding between the mother and the child while, on the other hand, it makes it impossible for a child to articulate sounds with dental or alveodental acoustic features. The child's ability to correctly generate the language of her surroundings is not only cognitively but also physiologically limited.
The second physiological change to which we would like to point attention, in the context of the ontogeny of the human linguistic faculty, is related to rhythmical behaviour. As suggested by practically the whole tradition of research dedicated to the psychology of rhythm, at least from Fraisse (1974) to Provasi et al. (2014), every human can be characterized, in every stage of his development, by a so-called Spontaneous Motor Tempo (SMT). By being both the tempo at which people tap when asked "Please tap on the table with your hand at the most natural speed" and, also, the tempo which people choose as most natural when asked to choose between multiple tapping sequences, SMT seems to be a fundamental cognitive phenomenon integrating both passive (perceptive or even C-structure) and active (productive or even P-structure) components. In regard to ontogeny, it is worth underlining that SMT tends to slow down with age, which can help children to maturate «from facile acquisition of relatively brief events, such as phonetic categories, to enhanced proficiency with longer events» (McAuley et al., 2006). It is also only with age that children acquire the faculty to process and generate rhythmic patterns in a wider range of tempos, i.e. tempos which significantly differ from the SMT of their endogenous tact-giving "oscillator". Given the importance of tempo for the control of repetitive or oscillatory activity - including not only language but also walking, which, coincidentally or not, appears approximately in the same period when children leave the phase of canonical babbling and enter the phase of word production - we believe that the study of SMT and other rhythm-related phenomena could be useful for any further study of human cognition. Within the scope of the current Thesis, however, these phenomena shall serve only a peripheral role.

Last but not least, the third non-negligible development occurring in early childhood is related to changes in the length, distribution and composition of sleep cycles. Thus, «the newborn infant spends two-thirds of each 24-h period asleep; by 6 months he spends half of his time asleep and half of his time awake. Sleep consolidation is another important aspect of infant's sleep development. By the age of 6 months sleep condensed into fewer periods of longer duration so that sleep periods are lengthened from 4 to 6 h.» (Gertner et al., 2002) But not only do newborns sleep significantly more than toddlers, who in turn sleep significantly more than older children and adults; not only do they sleep not in one nocturnal block, as adults do, but in multiple blocks, both diurnal and nocturnal; but - and this is an important "but" - they spend significantly more time in the dreaming "rapid eye movement" (REM) phase than they ever shall in the future: «REM sleep assumes a high proportion of total sleep in the first days of life and its amount and ratio diminish as maturation proceeds» (Roffwarg et al., 1966).

The beautiful line of research studying pre-sleep "crib talk" monologues (Nelson, 2006) aside, the relation of LD and sleep has not yet been studied to the extent it merits. Note, however, the words with which the author of a few among the very few experiments studying the impact of sleep upon the processing of linguistic stimuli concludes her results: «memory consolidation associated with sleep introduces flexibility into learning, such that infants recognize a pattern at test regardless of whether it is instantiated exactly as it was before.
Sleep then sustains the learning of previously encountered information in a form that enables children to generalize to similar but not identical cases, and it also introduces flexibility into learning» (Gómez, 2011). Consistently with such conclusions, we end this short exposé with the statement that, from the point of view of the theory of intramental evolution of linguistic representations which we hereby aim to introduce, such "memory consolidation" occurring during sleep could be interpreted in terms of the activity of mutation and cross-over operators acting upon, and mixing, already encoded structures.

end physiological and cognitive development

9.3 motherese

The child's closest social environment are her parents, most notably her mother. Hundreds of studies have been conducted to study the nature of «motherese», the special simplified child-directed register which mothers use when speaking with their children. While some studies point in divergent directions, they more or less agree that «maternal speech has certain characteristics that distinguish it from speech to other adults. These characteristics are in essence simplicity, brevity and redundancy» (Harris, 2013). Other characteristics generally associated with child-directed speech are:

1. higher pitch (the mean fundamental frequency of speech to 2-year-olds is circa 267 Hz, versus 198 Hz in the case of speech to adults)
2. exaggerated intonation (wider pitch range)
3. slower speech due to both more and longer pauses
4. repetitions and variation sets (9.3)

Clark (2003) summarizes the properties of child-directed speech as follows: «adults consistently produce shorter utterances to younger addressees, pause at the ends of their utterances around 90% of the time (50% in speech to adults), speak much more fluently, and frequently repeat whole phrases and utterances when they talk to younger children. They also use higher than normal pitch to infants and young children, and they exaggerate the intonation contours so that the rises and falls are steeper over a larger range (up to one-and-a-half octaves in English).» (Clark, 2003)

Multiple studies indicate the existence of a causal link between the quantity and simplicity of motherese utterances and the speed of the child's linguistic development. More concretely, it has been observed that «mothers' choice of simple constructions facilitated language growth» (Furrow et al., 1979), while a more complex style can slow development down. Other studies specify that «children who showed the earliest and most rapid language development received significantly more acknowledgments, corrections, prohibitions and instructions from their parents» (Ellis and Wells, 1980). Another means by which LD can be stimulated are variation sets.

Variation sets

A variation set (VS) is composed of two or more subsequent utterances which are all derived from one common item-based construction. «Variation sets are identified by three types of phenomena: (1) lexical substitution and rephrasing, (2) addition and deletion of specific referential terms, and (3) reordering of constituents.» (Küntay and Slobin, 2002)

The most simple form of a VS is a sequence of utterances U1...Ux sharing the same word W, which serves as a link between subsequent utterances:

MOT: we lost that piggy bit .
MOT: so that bit goes there .
MOT: and (.) we've lost that horsie bit which is a bit of a pain .
MOT: that bit goes there .
MOT: and that bit there .
MOT: shall we try and find the lost bits ?
MOT: found two bits .
Just a little more complex are VS where the "linking" between U1 ... UX is done first with one word (or construction) W1 which, at a certain moment, co-occurs with another word (or construction) W2 which subsequently "links" the following sentences. In the illustratory example above, extracted from the data/Eng-UK/Lara/1-11-27.30.cha transcript of the CHILDES corpus (MacWhinney, 2014), the words "bit" and "that" fulfill such a fixing role. More complex variation sets involve slight variation of longer expressions. For example, the transcript data/Eng-UK/Lara/1-11-27.30.cha of the same mother-daughter couple, taken 7 months later, contains the following VS:

MOT you have to sing to her , Lara , if you want her to go to sleep
MOT sing her a lullaby
MOT you have to sing her a song

In this VS, it is the whole expression "sing something to someone" which is varied, first by removal of the non-obligatory dative marker "to" and subsequently by variation of the object of singing, from a lullaby to a song.

Many researchers argue that exposure to variation sets can facilitate the acquisition of both semantic and syntactic categories and/or rules. For example, Küntay and Slobin (1996) observed that

1. the use of variation sets is positively correlated with the child's acquisition of certain verbs
2. VS make up cca 20% of child-directed speech.

A similar observation, i.e. that 1/5th of child-directed speech consists of variation sets, was made by Brodsky et al. (2007). In a study involving the analysis of the CHILDES corpus, the authors explain the advantages of VS in information-theoretic terms: « variation sets seem to be ideal environments for learning lexical items and constituent structures...a pair of utterances that have nothing in common is not informative, and neither is a pair of identical utterances. An optimally informative pair would therefore balance between overlap and change.» (Brodsky et al., 2007)

Note that the notion of « variation set » can be interpreted in UD-consistent terms, given that:

1. repetition is equivalent to « replication in time », and every single instance of the utterance can therefore be considered an independent, individual structure
2. alteration of form between subsequent utterances can be interpreted as a consequence of a variation operator influencing the production of new sentences

To illustrate the extent some variation sets can reach, the 1st appendix contains the longest variation game discovered in, and extracted from, the CHILDES corpus.

end variation sets 9.3

Studies like those of Harris (2013) suggest that the relation between the complexity of motherese and the complexity of the child's own production is in fact reciprocal. Thus, mothers adjust their language according to the stage of the child's linguistic development. In the context of the current Thesis, we propose to interpret this mutual convergence of motherese and toddlerese in terms of adaptation and co-evolution of two populations of linguistic structures. The mother adapts to the toddlerese of her child, the toddler adapts to motherese: both co-evolve. In the long run, it is the adult who leads the dance. This is so because internalization of the language of the Other is in the child's, and not the parent's, vital interest. Thus, the child's P-structures which correctly modify the adult's behaviour in an intended direction could be considered more fit, and thus more prone to intramental replication, than P-structures which do not yield the intended effect.
end child-directed speech 9.3

9.4 language acquisition paradigms

In practically no modern scientific discipline is the age-old trialectics between realists, nominalists and idealists as ardent as in psycholinguistics. Different perspectives and terminologies notwithstanding, it is nonetheless the eternal "problem of universals" which is being targeted. Travestied as rationalists, mentalists or nativists, one group does its best to convince the public that the intangible "general" is prior to the "specific"; in the other camp, the empiricists battle for their belief that the observable and specific is prior to the general. Over the course of centuries, hours of hand-waving, ink-spilling and dozens of metaphysical chimeras have accompanied the process whose unfolding is supposed to bring the scientific community ever closer to the "most fit" narrative about the origins and development of language in onto-, phylo- or even cosmo- (De Chardin et al., 1965) genesis. Let's now glance at a few among its most distinctive figures.

9.4.1 classical

One of the first attempts to describe the process of language acquisition was made by Saint Augustine in his Confessions: « Passing hence from infancy, I came to boyhood, or rather it came to me, displacing infancy. Nor did that depart,- (for whither went it?)- and yet it was no more. For I was no longer a speechless infant, but a speaking boy. This I remember; and have since observed how I learned to speak. It was not that my elders taught me words (as, soon after, other learning) in any set method; but I, longing by cries and broken accents and various motions of my limbs to express my thoughts, that so I might have my will, and yet unable to express all I willed, or to whom I willed, did myself, by the understanding which Thou, my God, gavest me, practise the sounds in my memory. When they named any thing, and as they spoke turned towards it, I saw and remembered that they called what they would point out by the name they uttered. And that they meant this thing and no other was plain from the motion of their body, the natural language, as it were, of all nations, expressed by the countenance, glances of the eye, gestures of the limbs, and tones of the voice, indicating the affections of the mind, as it pursues, possesses, rejects, or shuns. And thus by constantly hearing words, as they occurred in various sentences, I collected gradually for what they stood; and having broken in my mouth to these signs, I thereby gave utterance to my will.» (Augustine, 1838)

Not being a theory per se, Augustine's naive but honest reflection concerning the origins of his own mental faculties nonetheless had a non-negligible impact upon all theories of LD which followed. From our current perspective, Augustine's Confessio could most probably be labeled a precursor of the associationist school. This is so because Augustine principally explains the ontogeny of his semiotic faculty in terms of associations between "words" and "things". Associationism was one of those very rare prototheories of cognition which succeeded in surviving the millennium and a half which followed. David Hume, John Locke and J.S. Mill adhered, in one way or another, to the camp of those who were convinced that a great deal of mental phenomena - or possibly ALL phenomena - could principally be explained in terms of the mind's tendency to "link" its internally represented "signs" with things-in-the-world or with other "signs".
Thus, thanks to its usefulness in learning and its evidence in introspection, the associationist school succeeded in crossing the centuries to see the day when the neuropsychologist Donald Hebb postulated the material (i.e. neural) basis of what had before been considered solely mental phenomena: « When one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell.» (Hebb, 1964) Since the discovery of this phenomenon, often summarized by the saying "cells that fire together, wire together; neurons that fire out of sync fail to link", Hebb's rule has helped to yield explanations of such neuroscientific challenges as the emergence of mirror neurons13 (Keysers and Perrett, 2004). The core idea behind many functional artificial neural network architectures - e.g. Hopfield networks (Hopfield, 1982) - is also essentially Hebbian. In 10.4.2, Hebb's postulate shall be mentioned as a potential explanation of the validity of the so-called "distributional hypothesis", the idea behind practically all efficient computational models of semantic vector spaces.

Behaviorism is another school of thought derived from the associationist school, and one in which Hebb's rule can play the role of the most fundamental principle. For, in its very essence, behaviorism simply substituted the notion of association between two signs with the notion of conditioning. By adopting the terminology of stimuli and reflexes, of rewards and punishments, and by renouncing all methods which were not strictly positivist and empiric, the behaviorist school renounced any attempt to understand the internals of the mind. Since behaviorist precepts worked - and worked not only when applied to Pavlov's dogs or Skinner's pigeons but to humans as well - and because science lacked both computers and subtle experimental neuroimaging apparati, the attempt to explain man's mind in terms of reinforcement of relations between stimuli and reflexes was the principal preoccupation of western psychology in the first half of the 20th century. And it would possibly have dominated until now if the central figure of the field, B.F. Skinner, hadn't decided to apply the behaviorist doctrine to the domain of linguistics. Skinner's book Verbal Behavior (Skinner, 1957) claimed that language is learned by operant conditioning, id est, that the child learns language because the expressions of her verbal behaviour - the fact that she utters X and not Y - are reinforced by parental rewards. For example: « In all verbal behavior under stimulus control there are three important events to be taken into account: a stimulus, a response and a reinforcement. These are contingent upon each other... The three term contingency ... is exemplified when, in the presence of a doll, a child frequently achieves some sort of generalised reinforcement by saying doll.» (Skinner, 1957) In Skinner's theory, the vectors of reinforcement are not necessarily just dolls, milk and breasts but can be quite abstract: parental attention itself can be rewarding; lack of it may punish.

13 It was already indicated, during our discussion of memetic theory, that mirror neurons are often mentioned in relation with imitation. But in the theory hereby introduced, they can also be understood as a neural substrate for bridging C-representations with P-representations.
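Skinner's three-term contingency is easily rendered as a toy computational sketch. The following fragment - our own illustration, not Skinner's formalism, with all names and numbers chosen arbitrarily - shows how reinforcement makes an apt response progressively dominate a repertoire:

import random
from collections import defaultdict

# Operant strength of each (stimulus, response) pair; all start equal.
strength = defaultdict(lambda: 1.0)

def respond(stimulus, repertoire):
    # A response is sampled proportionally to its current operant strength.
    weights = [strength[(stimulus, r)] for r in repertoire]
    return random.choices(repertoire, weights=weights)[0]

def reinforce(stimulus, response, reward=0.5):
    strength[(stimulus, response)] += reward

repertoire = ["doll", "milk", "mama"]
for _ in range(200):
    response = respond("sees-doll", repertoire)
    if response == "doll":            # the caregiver rewards the apt utterance
        reinforce("sees-doll", response)

print(strength[("sees-doll", "doll")])   # has grown far beyond the others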
By focusing on the child's and the parent's basal needs and wants, and on the behaviours with which they attain them, Skinner articulated a theory which has some overlaps with current interactionist and socio-pragmatic theories of LD (9.4.4). The idea of founding the fitness functions of our grammar induction systems not only on measures internal to the system, but also on the environment's responses to the system (14.4.1), is also traceable back to a similar behaviorist point of view.

9.4.2 generativists and nativists

Then came Chomsky. In his revolutionary Syntactic Structures (Chomsky, 1957) he proposed to adopt a rule-based, algebraic, transformationalist approach to explain the mystery of grammars able to generate an infinite number of utterances out of finite sets of elements. Two years later, young Noam gained in prominence by an overt and uncompromising review (Chomsky, 1959) of the aging Skinner's Verbal Behavior. More carpet bombing than review, this critique - in some circles considered the most influential rhetorical exercise of the 20th century - irreversibly hallmarked the rupture: the turn from the "behaviorist" to the "cognitive" approach to the study of language acquisition. Whether one rightfully praises Chomsky, like Gardner (1985b), or rightfully criticizes him, as does Tomasello (2009), one has to admit that he was among the first who attempted to interpret linguistic representations and processes as fundamentally computational phenomena. Thus, the surface structure of an utterance was to be understood as the output of a series of substitutional rules acting upon a certain deep structure offered as input. While the notion of substitution rules was already known to the Sanskrit scholar Panini more than two millennia before Chomsky, and was in the 19th century practically deified by the neogrammarians who spent non-negligible effort using the notion of a universally applicable rule to explain the mystery of historic language change (8.5), it was nonetheless Chomsky who, strongly influenced by his predecessors Jakobson and Harris, developed a theory whereby substitution rules are supposed to act also in individual human cognitive systems.

Panini's Grammar (APH)

Panini's (cca 400 BC) grammar is the oldest attested work in descriptive linguistics. Composed at the end of the Vedic period and at the beginning of the Classical period, it contains 3996 rules of Sanskrit morphosyntax and, to a lesser extent, also of semantics. It was transmitted orally - from masters to their students in myriads of schools spread throughout whole India - in the form of sutras, i.e. verses to be memorized. The grammar begins with the Shiva Sutras, an enumeration of 14 fundamental phonological classes from which one can generate 281 pratyāhāras, i.e. classes of second order, which are subsequently processed by application of one or more among the almost 4000 "Ashtadhyayi" substitution rules and meta-rules. In terms of modern linguistics, these 14 sutras list all Sanskrit terminal phonemes (16 vowels and 33 consonants) and associate them with anubandha labels (non-terminals). A PERL-consistent transcription of the Shiva Sutras follows: every line presents one sutra in the form of a substitution rule. Parentheses contain individual phonemes; the symbol enclosed between the second and third / denotes the anubandha.

Shiva Sutras (SRC)

s/ ( a | i | u ) / ṇ /
s/ ( ṛ | ḷ ) / k /
s/ ( e | o ) / ṅ /
s/ ( ai | au ) / c /
s/ ( ha | ya | va | ra ) / ṭ /
s/ la / ṇ /
s/ ( ña | ma | ṅa | ṇa | na ) / m /
s/ ( jha | bha ) / ñ /
s/ ( gha | ḍha | dha ) / ṣ /
s/ ( ja | ba | ga | ḍa | da ) / ś /
s/ ( kha | pha | cha | ṭha | tha | ca | ṭa | ta ) / v /
s/ ( ka | pa ) / y /
s/ ( śa | ṣa | sa ) / r /
s/ ha / l /

end shiva sutras (src) 9.4.2
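A pratyāhāra denotes the contiguous sequence of phonemes which starts at a given sound and ends at the sutra closed by a given anubandha. The extraction mechanism itself is simple enough to be sketched computationally; the following fragment is our own hypothetical illustration, with a crude ASCII transliteration of the phonemes and invented marker names, not Panini's notation:

# The 14 Shiva Sutras as (phonemes, anubandha) pairs, ASCII-transliterated.
SUTRAS = [
    (["a", "i", "u"], "N1"), (["r", "l"], "K"),
    (["e", "o"], "NG"), (["ai", "au"], "C"),
    (["ha", "ya", "va", "ra"], "T"), (["la"], "N2"),
    (["nya", "ma", "nga", "na1", "na"], "M"), (["jha", "bha"], "NY"),
    (["gha", "dha1", "dha"], "S1"), (["ja", "ba", "ga", "da1", "da"], "SH"),
    (["kha", "pha", "cha", "tha1", "tha", "ca", "ta1", "ta"], "V"),
    (["ka", "pa"], "Y"), (["sha", "sa1", "sa"], "R"), (["ha"], "L"),
]

def pratyahara(start, anubandha):
    # Collect phonemes from `start` up to the sutra closed by `anubandha`.
    collecting, result = False, []
    for phonemes, marker in SUTRAS:
        for p in phonemes:
            if p == start:
                collecting = True
            if collecting:
                result.append(p)
        if collecting and marker == anubandha:
            return result
    raise ValueError("no such pratyahara")

print(pratyahara("i", "K"))   # iK, the simple vowels: ['i', 'u', 'r', 'l']
print(pratyahara("a", "C"))   # aC, all vowels: ['a', 'i', ..., 'ai', 'au']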
A description of the means by which all 281 (14*3 + 13*2 + 12*2 + 11*2 + 10*4 + 9*1 + 8*5 + 7*2 + 6*3 + 5*5 + 4*8 + 3*2 + 2*3 + 1*1 - 14 - 10) pratyāhāras are to be generated from the list of classes above, and a description of the almost 4000 rules (e.g. vṛddhir ādaiC) which subsequently allow the production - so it is believed - of the whole corpus of one amongst the most complex languages ever known to man: all that surpasses the objectives of this dissertation. What does not surpass them, however, is the question: "How could Panini (or any lineage which preceded Panini) ever discover such a grammar?" Computationally, the task is enormous. Revelatory explanations - so popular in India - aside, we see only one possible answer: by means of intramental evolution.

end panini's grammar 9.4

Not only Panini, Jakobson and Harris influenced Chomsky, but also Turing, whose idea of a symbol-substituting machine crossed over with the discovery of the transistor, thus yielding a new and powerful generation of computers around the time Chomsky entered MIT. And it was in, by and through contact with computers that Chomsky understood the generativity of a rule which is applied many times and which can consider its own past outputs as its present or future inputs. Once recursivity was understood, there followed the insight that recursivity is present in natural languages as well, e.g. in expressions like: She knows that he knows that he knows that she knows... Et caetera, theoretically ad infinitum. For an old-school generativist, the very theoretical possibility of realizing such an infinite regress constitutes a sort of proof of the belief that grammars, understood as systems of rules which can be recursively applied upon sequences of symbols chosen from a finite alphabet, can ultimately generate infinite amounts of such sequences.14

That recursion, combined with substitution, can simulate and/or generate practically anything was already known to Cantor and Gödel, let alone Turing. But contrary to Gödelian proving-god-through-arithmetics and Turing's Enigma-breaking, Chomsky decided to program the "universal machine" with a goal in mind which his audience could understand: language. To make things formal - and contrary to centuries of knowledge which say otherwise - language was subsequently reduced to a set of sequences of symbols chosen from the alphabet (10.2). Other definitions, axioms and theorems followed, often with huge importance for the subsequent development of ever more complex assemblers, parsers and compilers of artificial languages. For example, it is difficult to imagine how informatics could have moved from assembler to C++ or PERL without having at its disposal the theoretical framework of the Chomsky-Schützenberger containment hierarchy of formal grammars. Thus, a formal system was developed which turned out to be useful for certain subdisciplines of informatics.

14 C.f. the "Halting problem" in (Hromada, 2008) for a closer theoretical discussion of this "infinitist" fallacy which is, we believe, the source of many problems which have haunted generative linguistics from the very moment of its conception.
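The generativity of recursion which so impressed the early generativists can be demonstrated in a few lines. The toy grammar below is our own illustration, not one proposed by Chomsky: a single recursive production suffices to generate an unbounded set of "she knows that..." sentences:

import random

# One recursive rule ("S" can rewrite into something containing "S")
# makes the set of generable sentences infinite.
RULES = {
    "S": [["NP", "knows that", "S"],    # recursive production
          ["NP", "sleeps"]],            # terminating production
    "NP": [["she"], ["he"]],
}

def generate(symbol="S"):
    if symbol not in RULES:                  # terminal symbol: emit as-is
        return symbol
    expansion = random.choice(RULES[symbol])
    return " ".join(generate(s) for s in expansion)

print(generate())   # e.g. "she knows that he knows that she sleeps"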
But as is often the destiny of formal systems which fail to delimit their domain of applicability, its proponents started to confound the map with the territory. As a result, practically a whole generation of linguists following the "discovery of the generativity of recursivity" got lost in the labyrinth of futile attempts to fit the expressive diversity of the natural into a monolithic framework able to account only for the simplest among the artificial. Thus, in a somewhat paradoxical turn of events, thousands of linguists were transformed into pigeons pecking X-bars, conditioned by the reinforcement principle of "publish or perish". New models and theories with names as binding as "Government and Binding" or as non-minimalist as the "Minimalist Program" were proposed and turned into full-fledged doctrines shared among castes of initiates. To give at least some meaning to the ever more esoteric symbol-substituting passe-temps, a noble quest was launched: the quest for a so-called Universal Grammar.

In its very essence, the notion of Universal Grammar (UG) is Chomsky's answer to the problem raised by Gold, related to the fact that from one specific language sample one can induce multiple grammars which are able to generate such a sample. But if multiple grammars can be obtained, how could a child know which one is the correct one? Chomsky's answer was: because her choice is constrained by "something" innate: her majesty the UG.15 One can explain "innate" either in creationist or in emergentist terms. Hoping that nativists do not belong to the first group, there is only one way the existence of UG could be explained: by an evolutionary process. This is, we believe, a point of reconciliation, a point of convergence between non-creationist yet nativist doctrines which postulate UG and the theory of intramental evolution as hereby introduced. On the other hand, there is also a significant point of divergence: while we suggest that it is more computationally feasible to produce language-generating and language-constraining representations during ontogeny, Chomsky's "nativist" disciples like Pinker (1994) spent a significant part of their careers arguing that UG is somehow produced by evolution which is phylogenetic. Simply stated: nativists believe that UG is encoded in DNA.

As was already indicated, the raison d'etre of UG is to constrain and direct the learning of grammar. The problem that English children learn the grammar of English and Chinese children learn the grammar of Chinese was "solved" by the so-called Principles & Parameters theory as follows: during acquisition of language LX, the child extracts from the utterances she hears a set PX of parameters specific to LX, and inserts these parameters PX into the UG, thus obtaining the specific grammar GX able to generate LX. Id est:

GX = UG(PX)

According to nativists, a child would not be able to learn GX without UG's intervention in the acquisition process. The necessity is supposed to be both empiric and theoretic.

15 Note that there is a certain symmetry between the couples (Turing Machine, Universal Turing Machine) and (Grammar, Universal Grammar). In both couples, the unicity of the latter furnishes a frame for the diversity of the former. But there is a difference as well: while a UTM allows one to emulate any Turing Machine, UG is supposed to constrain the set of relevant grammars.
The empiric necessity is related to the problem of the "poverty of stimulus": the claim that the utterances children hear are qualitatively incorrect and quantitatively insufficient to account for the fact that the child, indeed, learns the language of its environment. Unfortunately for nativists, a vast body of rigorous empiric research (Clark, 2003; Karmiloff and Karmiloff-Smith, 2009; Tomasello, 2009) in DP indicates that the notion of "poverty of stimulus" was nothing other than a chimera and that the reality of a healthy child surrounded by a healthy social environment is rather the contrary: one should speak not about poverty, but rather about the "power of stimulus". A scientist who agrees with the statement « in summary, child-directed speech and other sources of language – overheard speech, stories read aloud, speakers heard on radio or TV, for instance – provide such rich input that children should eventually learn enough of their language for all their needs » (Clark, 2003) can thus discount the poverty of stimulus as simply irrelevant.

On theoretical grounds, the necessity of having something like UG embedded in the human Language Acquisition Device (LAD) is often claimed to be a necessary consequence of Gold's Theorem, an important result obtained in the "learnability theory" sub-branch of formal language theory.

Refutation of Gold's Theorem (APH)

In short, the theorem postulated by Gold (1967) states that « Any class of languages with the Gold Property is unlearnable.» (Johnson, 2004) In Gold's formal system, a class C of languages has the Gold Property if and only if:

1. C contains a countable infinity of languages Li such that Li ⊂ Li+1 for all i > 0
2. C contains a further language L∞ such that for any i > 0, x is a sentence of Li only if x is a sentence of L∞, and x is a sentence of L∞ only if x is a sentence of Lj for some j > 0

Learnability, on the other hand, is defined in terms of:

1. an environment E, which is supposed to be an infinite sequence of sentences of the language to be learned
2. an ideal learner, which « learns L given E iff there is some time tn such that at tn and all times afterward, the learner correctly guesses that L is the target language present in the environment » (Johnson, 2004)

Gold's Theorem therefore simply states that an infinite16 set of mutually embedded languages is « unlearnable » by a system which never forgets and which is exposed to an infinite environment. It is difficult to see, however, how this purely theoretical relation between purely theoretical infinite sets can have any further implication for concrete individual languages, understood as finite sets of utterances exchanged in specific extralinguistic contexts. Given that the lifetime of an individual human learner is always finite, the linguistic environment of such a learner also has to be finite. The first condition of what "learnability" means in Gold's formal system is thus irrelevant to human learners. The second condition is irrelevant as well: the human mind is not a storage system which faithfully internalizes and stores, unmodified and for "all times afterward", every piece of linguistic data to which it was exposed. Humans forget; children all the more. It is therefore somewhat unclear how the very notions of learnability and unlearnability, as defined by Gold, could apply to human beings in general and to children in particular.
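The gist of the Gold Property can be made tangible with a toy construction of our own devising: take the class Li = { "a"*n : 1 ≤ n ≤ i } together with L∞ = { "a"*n : n ≥ 1 }. A learner that always guesses the smallest language consistent with its sample will, when fed an enumeration of L∞, revise its guess forever:

def smallest_consistent_guess(sample):
    # Conservative learner: guess L_i where i = length of the longest
    # sentence seen so far (the smallest consistent language).
    return max(len(s) for s in sample)

sample, guesses = [], []
for n in range(1, 11):           # environment: an enumeration of L_inf
    sample.append("a" * n)
    guesses.append(smallest_consistent_guess(sample))

print(guesses)   # [1, 2, 3, ..., 10]: the guess never stabilizes,
                 # so L_inf is never identified "in the limit"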
This being said, we find it appropriate to state that, more than an important proof telling us something about LD in human beings, the whole fuss about Gold's Theorem is rather evidence of how a multidisciplinary scientific endeavour can get stuck for decades in a blind alley just because of an inter-disciplinary quaternio terminorum (Sokol, 1998) fallacy. In other words, the term « unlearnable » in Gold's Theorem is just a term which Gold uses within his tautological statement to denote certain properties of certain infinite, hierarchically embedded sets of sequences of symbols and, as such, relates only abstractly to the concrete condition of human learners.

end refutation of gold's theorem 9.4.2

16 Taking Gold's Theorem seriously in regards to language learning is equivalent, mutatis mutandis, to the belief that children shall never learn basic arithmetics because, in order to understand addition, they would first have to be confronted with all integers between one and infinity.

A theoretical pillar of the necessity to postulate UG being thus somewhat undermined by the above aphorism, there are not many reasons not to use the lex parsimoniae of William of Occam to raze the notion of a DNA-encoded UG from the terminological toolbox of 21st-century linguistics. This does not mean that there are no faculties and features which are universally present among all human languages.17 Take, for example, the fact that in all human cultures people group consonants and vowels into specific clusters. Given the universality of the phenomenon, one is more than tempted to agree with authors like Jackendoff (2002) (a nativist of the second generation) and state that syllabization is a component of UG. But when one realizes that syllabization is potentially just a consequence of the concrete application of a deeper cognitive process, known as "chunking", upon articulatory programs constrained by the trivial fact that consonants cannot be pronounced without vowels, one can immediately ask whether syllabization - or any other component of the so-called UG - is in fact not a consequence of particular interactions of more general cognitive processes, the neurophysiological characteristics of human beings (9.2.6) and their particular social environments.

Somewhat contradictorily to what the term "universal" normally means, a faithful nativist would consider the following equation:

LD = General cognitive processes(Linguistic Input + Extralinguistic Input)

an inexcusable heresy. This is so because he considers UG to be the core component of a language-specific cognitive module and not a general cognitive process. For some nativists, there is only one domain where the mind uses rule-based grammars: language. Others may be ready to accept that some other domains of human activity - from walking, dancing, body exercising and mating, through music-generating, food preparation and object creation, to ritual-performing, healing or simple arithmetics - can also be rule-based and encapsulated in specific cognitive modules (Fodor, 1983), or can even have sorts of grammars of their own. Thus, while all nativists believe that the foundations of the language faculty are to be found in genes, only few would accept that the triggering of those very same genes could result in the activity of drumming or salsa-dancing as well.

17 Some universals like compositionality, graduality, specificity etc. were mentioned in section 9.2.4.
To summarize: more than half a century ago, the generativist approach to language threw an energetic spark into the muddying waters of structural linguistics, thus igniting a passionate interdisciplinary debate between linguistics and computer science. The strictly formalist, transformationalist approach failed to furnish a complete, consistent and elegant framework for the study of natural languages, but it significantly facilitated the construction and further development of artificial and programming languages. What failed to a greater extent, however, was the nativist enterprise aiming to discover DNA-encoded "innate" predispositions specific to language. In spite of two generations of effort by linguists, psychologists, geneticists and clinicians, no "language gene" was discovered, and the answers to the questions

• Which components of the Language Acquisition Device are innate?
• What is the nature of its core, the Universal Grammar?
• Which processes are purely language-specific and which are more general?

seem as obfuscated as they ever were, potentially because their terms mean slightly different things for computer scientists, logicians, psychologists and linguists. But the lack of answers to such questions notwithstanding, the orthodox nativist position adopted by Chomsky nonetheless resulted in a state profitable for everybody: at the beginning of the 21st century, the vast majority of cognitive scientists agrees that man's language faculty is a result of interaction between at least two major components:

1. cognitive and physiological characteristics tuned by the phylogenesis of the human species
2. the input to which the language learner is exposed during its prenatal and postnatal ontogeny

and the whole nature & nurture debate is no longer led in terms of a mutually exclusive either/or, but focuses on the degree and forms of mutual triggering and epigenetic interactions between innate and acquired programs. Let's now leave the discussion of those who emphasize the importance of the first component and focus on those who emphasize the role of the second: empiricists and constructivists.

end generativists and nativists 9.4.2

9.4.3 empiricists and constructivists

Empiricists argue that human knowledge arises principally from experience (ἐμπειρία). They thus explain the acquisition of a certain word
or expression in terms of the perceived contexts within which the child hears the given word or expression. The empiricist paradigm is thus quite similar to the associationist and, to a lesser extent, also the behaviorist paradigms mentioned above (9.4.1). But what about the acquisition of structures and principles which are not salient, evident or even perceivable at all? What about the acquisition of all those directly unperceivable entities - be they rules, schemas, patterns, templates or something else - which determine the result of linguistic comprehension and production, yet are not present per se in any utterance? What about all the word order, long-distance dependency or chiasmatic principles which definitely have to be somehow encoded in the mind - because they act - but are detectable only through the consequences of their actions? Because they express themselves only through their instances, and because these instances vary, pure empiricism has to encounter serious epistemological problems when explaining, solely through sensory experience, the acquisition of linguistic representations operating with and on more general levels of abstraction.

Through hundreds of years of both theoretical reflection and methodic experimentation, empiricists gradually evolved into constructivists. Being firmly rooted in the phenomenology of everyday human experience, constructivists do not deny the existence of more general representations. They simply state that all those concrete-surpassing quantifiers, rules, principles, categories, templates or schemas are as natural a consequence of the exposure of mind to repetitive, contextualized stimuli as photosynthesis is a natural consequence of the exposure of a plant's leaves to light. For constructivists and their connectionist descendants, the mind's hardware - the brain - is a generalization device par excellence, and therefore there is truly nothing mysterious about the fact that the mind is able to transcend the concrete and the arbitrary. Thus, contrary to their nativist counterparts who postulate that the child's mind tries to "deduce" the concrete grammar of her ambient language by entering specific parameters into the formal system of universal grammar, constructivists postulate that the child's mind in fact "induces" her grammar from and out of the myriads of specific utterances she hears.

In the famous Chomsky vs. Piaget debate, the position of the father of all constructivists could be characterized as follows: « Piaget maintains that one's linguistic structures are not defined by the genome, but instead, are 'constructed' by 'assimilating' (organizing) things in the environment in terms of pre-linguistic structures, and 'accomodating' (modifying) these as they prove insufficient. This mode of functioning, called 'reflective abstraction' is innate, as is some elementary reflex behaviour (e.g., sucking, grasping), but the cognitive structures, even the pre-linguistic ones, are not.» (Piatelli-Palmarini, 1980)

Aside from the two modes of schema application and modification which Piaget called 'assimilation' and 'accommodation', and which we have already discussed in the presentation of Piaget's Genetic Epistemology theory (8.4.4), and aside from the notion of 'reflective abstraction', which Piaget uses to explain cognitive development occurring well after the age of toddlerese, Piaget introduced other terminology which we consider particularly useful when aiming to explain certain facets of LD: circular reaction, schema coordination and interiorization. Circular reactions are related to the propensity of the cognitive system to repeat, reproduce and reactivate its schemas. Primary circular reactions occur between 1 and 4 months of age and are triggered by the child's discovery that acts originally performed by accident can bring about a pleasing consequence, which leads to subsequent repetition of the action. Secondary circular reactions occur between 4 and 8 months of age; they are still repetitive and habitus-forming but also involve external objects (e.g. switch switching). Schemas thus formed are subsequently mutually combined and recombined, coordinated and recoordinated, thus generating an ever bigger variety of behaviours and habits which the child finds useful, pain-reducing or simply pleasurable. According to Piaget's theory, the schemas of the first 18 months of age are principally sensorimotor.
But later, after the basic perception-action couplings have been mastered and optimized to a sufficient extent, the child leaves the "sensorimotor stage" and enters the "preoperational stage", wherein she starts to "internalize" the schemata. Internalization does not mean that the child simply creates neural representations of her sensorimotor couplings: such "neural substrate encoding" is axiomatic for any organism with a nervous system and takes place already in prenatal development. Internalization in Piaget's theory means that a child creates neural representations of mental substitutes - symbols - which themselves refer to certain sensory, motor and, later, also symbolic "realities". Internalization shall subsequently allow the child to execute certain operations purely mentally, without the need to materially realize them in physical reality: it is to a great extent thanks to internalization that the child can find the shortest way out of an unknown space without the need to physically toddle through all possible paths.

In parallel to Piaget, at the other border of Europe, but in a less Kantian and somewhat more "dialectical materialist" a space, the mind of Lev Vygotskij was slowly converging to practically identical conclusions: the process of internalization of schemas was to explain the ontogeny of thinking. Believing that « the internalization of socially rooted and historically developed activities is the distinguishing feature of human psychology, the basis of the qualitative leap from animal to human psychology » (Vygotsky, 1978), and knowing that language is potentially the most important exemplar among such "socially rooted and historically developed activities", Vygotskij went even further than Piaget and postulated that thinking is a form of internalized language: thoughts are inner-speech utterances.

There are, of course, subtle differences between the theories of Piaget and Vygotskij. For example, while Vygotskij's theory focuses more on the social and cultural forces behind the split between the outside "social" speech and the inner speech, Piaget's theory emphasizes the child's individual, egocentric, knowledge-constructing acts. But it would be false to state that Piaget wasn't aware of the importance of social aspects for the development of cognitive functions, as is evident, for example, from the statement « The individual would not come to organize his operations in a coherent whole if he did not engage in thought exchanges and cooperation with others.» (Piaget, 1947) One can reconcile such a point of view with Vygotskij's: « An operation that initially represents an external activity [e.g., egocentric speech] is reconstructed and begins to occur internally [e.g., private and internal speech]. . . An interpersonal process [e.g., social language] is transformed into an intrapersonal one [i.e., inner speech]. . . The transformation of an interpersonal process into an intrapersonal one is the result of a long series of developmental events.» (Vygotsky, 1978) Such a long series of developmental events is, according to the theory of intramental evolution, equivalent to an evolutionary process wherein diverse schemata are replicated through processes of internalization and articulation, and are selected according to their ability to induce intended changes in the (social) environment. In other terms: pragmatic, environment-related concerns are to be present in the function evaluating the survival and reproduction fitness of such structures, as the sketch below suggests.
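A hedged sketch of what such a fitness function might look like - our own toy construction anticipating the simulations of 14.4.1, with the Caregiver class, the weights and the lexicon all invented for illustration - combines an internal measure (coding economy) with the environment's response:

class Caregiver:
    # Toy environment: rewards utterances it can "parse" (here: known words).
    def __init__(self, lexicon):
        self.lexicon = set(lexicon)
    def reward(self, utterance):
        words = utterance.split()
        return sum(w in self.lexicon for w in words) / max(len(words), 1)

def fitness(utterance, environment, w_internal=0.3, w_social=0.7):
    internal = 1.0 / (1.0 + len(utterance.split()))  # economy: prefer brevity
    social = environment.reward(utterance)           # intended effect attained?
    return w_internal * internal + w_social * social

mom = Caregiver(["more", "milk", "doll", "look"])
print(fitness("more milk", mom))     # high: brief and socially effective
print(fitness("gloop blarg", mom))   # low: no effect on the environment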
end empiricists and constructivists 9.4.3

9.4.4 socio-pragmatic and usage-based paradigms

Piaget's and Vygotskij's theories are not theories of LD. They are much more general: they are theories of the development of knowledge and thought; they are theories of learning. In light of such generality, concrete particularities are secondary: thus, neither Piaget nor Vygotskij offers the specific quantitative values which would have to be defined by any engineer aiming to reproduce LD-like processes in silico. They rather offer a general gradual, environment-oriented and ludic framework within which one can do so: that alone suffices. But what about the concrete aspects of the child's social learning (Bandura and McClelland, 1977) of language? One can hardly speak about science if specific processes and representations are not experimentally explored and verified; there is no science where specific relations and correlations between variables and phenomena are not evaluated. Only an enterprise which allows one to find concrete answers to concrete questions by analyzing concrete phenomena is truly scientific.

Jerome Bruner was among the first scientists who performed a detailed analysis of a child's LD and interpreted their findings in "social environment" terms. By performing a longitudinal study focused on two boys - Richard and Jonathan - by visiting their homes every fortnight from when they were 5 (resp. 3) months old until they were 24 (resp. 18) months old, and by taking half-hour audio-video recordings of their play with their mothers, Bruner initiated a paradigm both rigorous and "natural", and completely non-violent (because conducted with the mother and in the home environment). During the first months, Bruner focused on games through which the child learns how to manage interactions with the closest social environments. Through games with a cloth clown which the mother makes gradually appear, disappear and reappear, or through the game of peek-a-boo, the infant gradually learns the basic conditions of social and participatory activities. Bruner concludes these observations with the words: « If the "teacher" in such a system were to have a motto, it would surely be: "where before there was a spectator, let there now be a participant".» (Bruner and Watson, 1983), indicating that such games help to establish the social conventions upon which later language use shall be based. Later, Bruner explored how the referential meaning of first words is born through mother-originated object-highlighting and child-originated pointing. Or he analyzed the motherese articulated during picture-book reading, discovering that « The variety of mother's utterance types in book reading is strikingly limited. She makes repeated use of four key utterance types, with a surprisingly small number of variant tokens of each. These types were (1) the Attentional Vocative, e.g. Look; (2) the Query, e.g. What's that?; (3) the Label, e.g. It's an X; and (4) the Feedback Utterance, e.g., Yes.» (Bruner and Watson, 1983) A complete list of such constructions which occurred more than once during the session at 1;1.1 is presented in Table 5. It is also worth noting that such utterances were observed to occur almost always in the sequence:

1. Attentional Vocative
2. Query
3. Label
4. Feedback

Some members of the sequence can be left out - e.g. the attentional vocative is left out when the mother simply responds to what the child does - but Bruner noticed that the order of the utterances is practically never switched. Along with the extra-linguistic context (e.g.
book-reading), such a sequence can be understood as the child's first format.

Table 5: Utterances classified as tokens of the four major types of the motherese. Reproduced from table 4.2 in (Bruner and Watson, 1983, pp. 79-80).

type / tokens               frequency
I. Attentional Vocatives        65
   Look!                        61
   Look at that!                 4
II. Query                       85
   What's that?                 57
   What are those?               8
   What are they doing?          6
   What is it?                   5
III. Label                     216
   X (= a stressed label)       91
   It's an X                    34
   That's an X                  28
   There is an X                12
   An X                         12
   That's X                      6
   There is X                    6
   Lots of X                     5
   They are X-ing                5
   More X                        3
   They are X                    3
   These are the X               3
   The X                         2
IV. Feedback                    80
   Yes                          50
   Yes, I know                   8
   It's not an X                 5
   That's it!                    3
   Isn't it?                     2
   Not X                         2
   No, it's not X                2

Format (DEF)

« A format is a standardized, initially microcosmic interaction pattern between an adult and an infant that contains demarcated roles that eventually become reversible.» (Bruner and Watson, 1983)

end format 9.4

Contrary to pivot schemas or variation sets (9.3), which are "individual" in the sense that they are constructed, stored and articulated by a single individual, formats are interactive, mutual and shared. It is also important to emphasize the extra-linguistic and pragmatic facets of such a microcosmic scene: « format is a routinized and repeated interaction in which an adult and child do things to and with each other » (Bruner and Watson, 1983). The successful unfolding of a "format" from its beginning until its very end is possible only if both participants succeed in focusing their attention upon the same object of interest. Such "joint attention" is the cornerstone not only of Bruner's theory, but also of Tomasello's. As a primatologist by training, Tomasello stresses the fact that humans are the only apes which use symbols

• to acknowledge the sharing of attention with others
• to redirect the attention of others to external objects, states or processes
• to modify the mental states of others.

In other terms, humans are capable of "intention-reading": they are capable of "joint attention" surpassing the dyadic I-You relation by integrating an external object (or mental state) into a triadic I-You-It relation (Buber, 1937). For Tomasello, intention-reading « is the foundational social-cognitive skill underlying children's comprehension of the symbolic dimensions of linguistic communication » (Tomasello, 2009). It is supposed to be a domain-general skill allowing not only linguistic communication but many other practices as well (rituals, tool and house manufacture, co-ordinated warfare, healing, non-reproductive mating etc.), and it is strongly intertwined with other phenomena studied by the theory of mind (imitation, perspective-taking etc.). The principal reason why intention-reading is supposed to be foundational is its ability to attribute function to diverse linguistic expressions or their components (e.g. words): « identifying the functional roles of the components of utterances is possible only if the child has some (perhaps imperfect) understanding of the adult's overall communicative intention—because understanding the functional role of X means understanding how X contributes to some larger communicative structure.» (Tomasello, 2009) Thus, not only the observable "microcosmos" within which the articulation of the utterance AXB took place, but also the "intention of the speaker" has to be understood if the child is to understand the function of X's use in her language. Which, according to Wittgenstein, is equivalent to the meaning of X.
Aside from the foundational processes of intention-reading and joint attention, the ambition of Tomasello's theory is to understand the nature of further cognitive processes involved in the LD process:

« 1. schematization and analogy, which account for how children create abstract syntactic constructions out of the concrete pieces of language they have heard
2. entrenchment and competition, which account for how children constrain their abstractions to those that are conventional in their linguistic community
3. functionally based distributional analysis, which accounts for how children form paradigmatic categories of various kinds of linguistic constituents » (Tomasello, 2009)

Crucial to the explanation of these processes is the nature of their input and output representations. Contrary to Bruner, whom the generativist Zeitgeist forced to interpret his data through the terminological prism of "deep" and "surface" structures, Tomasello conceived his theory in a period when generativism was already somewhat out of fashion. Thus, instead of a mysterious UG, an isolated lexicon, a monolithic set of transformation rules and omnipresent arborescent structures, his usage-based theory of language acquisition is based on "item-based constructions", "expressions", "schemas" and "templates" containing "slots" and variable through diverse operators. A big theoretical advantage of these forms of representation is their ability to encode multiple levels of generality at once. As such, they have no problem accounting for the acquisition of fixed or semi-fixed idiomatic expressions like ça va?, gonna, dunno or kick the bucket, which seem rule-generated but are in fact learned by rote. Generative models based on rules - supposed to be generally applicable - and a lexicon - whose members are supposed to be as concrete and atomic as possible - have huge difficulties with such fixed or semi-fixed entities.18 Usage-based models, on the other hand, have no problem whatsoever in accounting for the existence of such hybrid structures.

18 Take as an example ça va?, the French equivalent of How do you do? The form is not purely rule-generated, because the completely normal demonstrative pronoun ça cannot be substituted by other demonstratives (il, elle) without the whole losing its meaning completely. On the other hand, it cannot be a member of the lexicon either, because it is decomposable and the second component va (i.e. "goes") can, in some argot contexts, be substituted by specific verbs (i.e. ça tourne; ça roule).
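How such a hybrid, partially open structure might be represented is easy to sketch. The fragment below is our own illustration - the Construction class and its fields are invented for this purpose and are not Tomasello's notation - of a template mixing frozen items with an open slot and the set of items attested in it:

from dataclasses import dataclass

@dataclass
class Construction:
    template: list   # fixed items (terminals) and slot names, intermixed
    slots: dict      # slot name -> items attested in that slot so far

    def instantiate(self, **fillers):
        # Replace slot names with fillers; keep fixed items as they are.
        return " ".join(fillers.get(piece, piece) for piece in self.template)

# Semi-fixed idiom: the pronoun is frozen, the verb slot varies in argot.
ca_va = Construction(template=["ça", "V"],
                     slots={"V": {"va", "tourne", "roule"}})
print(ca_va.instantiate(V="roule"))   # "ça roule"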
As Tomasello puts it: « The impossibility of making a clear distinction between the core and the periphery of linguistic structure is a genuine scientific discovery, and it has far-reaching theoretical implications...it suggests that language structure emerges from language use, and that a community of speakers may conventionalize from their language use all kinds of linguistic structures—from the more concrete to the more abstract, from the more regular to the more idiomatic, and with all kinds of mixed constructions as well.» (Tomasello, 2009) In usage-based linguistics, the mind is free to mix terminals with non-terminals as is wanted, necessary and appropriate for the realization of a communicative intention framed by a specific situation.

This being said, we consider it futile to try to summarize all the details of Tomasello's broad and detailed synthesis. Instead, the reader is hereby invited to read the monograph itself: its reading should significantly facilitate the interpretation of what shall follow in Part ii and Part iii, and it could be used as a sort of prolegomena to these. For the purpose of the present exposé, let's just conclude with the following citation: « As children attempt to read the intentions of other persons as expressed in utterances, they extract words and functionally coherent phrases from these utterances, but they also create item-based constructions with open slots on the level of whole utterances. Few theorists of language acquisition deal with these humble creations, and those who have dealt with them (e.g., Braine, 1976) have not provided an account by means of which they evolve into more abstract and adult-like constructions.» (Tomasello, 2009) As a follow-up to this citation, the exploration - both phenomenological and computational - of the extent to which the evolution of populations of such "humble creations" could be characterized as a process involving intramental replication, variation and selection is now defined as the principal objective of this dissertation.

end socio-pragmatic and usage-based 9.4.4

Having thus glanced at the history of just a few among the legions of savants who spent part of their lives seeking, in one way or another, to propose an answer to the question "How is it possible that humans understand each other yet often do not agree with each other?", we conclude this brief overview with a simple truism which could, with a little bit of goodwill, reconcile all the above-mentioned positions: "Because we can do so and want to do so." And we note that the debate about what "can" and "want" mean in the context of "understanding" and "agreement" surpasses by far the scope of our current proposal.

end language acquisition paradigms 9.4

The intention behind the less than 50 pages of this chapter was to acquaint the reader with certain facts, concepts and theories related to language development. LD was principally defined as a "constructivist process" (9.1), and it was indicated that the ontogeny of language competence can be understood as a process of gradual optimization of one's linguistic structures and processes. The difference between "comprehension" and "production" of language was emphasized, and a "dogma" was postulated stating that, in the developing mind, language comprehension is to precede language production. Certain facts specific to all facets of linguistic competence were brought to the reader's attention. These were presented in order to indicate that certain domain-general processes - related to contrast detection, category construction aided by input-driven distributional analysis, schematization, pattern-matching etc. - operate on all levels, from the prosodic to the pragmatic and beyond. A gradual increase in the diversity and complexity of representations was observed in many different cases: in the learning of phonotemplates, in vocabulary development, in the construction of item-based constructions from pivot schemas etc. Input and social interactions were also often mentioned. Brown's "word game" and Bruner's "formats" were discussed in the case of vocabulary learning, and "variation sets" were said to facilitate the acquisition of morphosyntax. The special properties of "motherese" were praised for their ability to significantly reduce the computational complexity of the deciphering process.
This was the reason for the adoption of a somewhat critical stance towards formal "learnability" and "nativist" theories, which see a perfect learner in an imperfect environment there where we tend to see an imperfect learner in a perfect environment. For this reason we prefer to end this chapter with a citation from a book which, aside from (Clark, 2003; Tomasello, 2009), was our first guide in the ever-complex labyrinth of LD-related data and theories: « If grammar were innately specified in the infant brain and simply triggered by hearing the correct forms, why would it take so long to manifest itself? In such a case, one might expect grammar to be an inherent part of the child's output from the start. After all, by the time infants reach their first birthdays they will have had considerable exposure to linguistic input and, as we saw, they already have a significant receptive vocabulary. The average three-month-old has already had approximately 900 waking hours or 54,000 minutes of auditory input. And these calculations do not even take into account the last three months of intrauterine life...» (Karmiloff and Karmiloff-Smith, 2009) Hundreds of thousands of minutes when the child is still a toddler, millions of sequences of linguistic tokens pre-processed and pre-formatted by those-who-love-the-one-to-whom-they-speak: plenty of high-quality data to process by ever-evolving populations of cognitive schemata. Plenty of information to induce useful patterns from.

end developmental psycholinguistics 9.4

10 COMPUTATIONAL LINGUISTICS

Computational linguistics (CL) is a discipline positioned at the intersection between linguistics and informatics. The extent of this intersection is huge, because both informatics and linguistics have one important property in common: on the most formal level, they both deal with sequences of symbols. In this chapter, such an abstract and theoretical perspective shall be more closely discussed in the section dedicated to Formal Language Theory (10.2) and its modular counterpart, the theory of Grammar Systems (10.2.3). Subsequently, more "real-life" problems related to Natural Language Processing (10.3) shall be mentioned, with special focus being put on the problems of:

• geometrization of semantics, attained by projection of natural language corpora into an N-dimensional vector space
• part-of-speech tagging and part-of-speech induction, which make it possible to automatically attribute grammatical category membership to different tokens occurring in the corpus
• grammar induction, which makes it possible to infer the grammar GL of language L from the corpus CL

But before doing so, let's briefly discuss that sub-discipline of CL which is older than CL itself.

10.1 quantitative and corpus linguistics

Centuries before the first computers were invented, preceptors had already been counting words in different corpora. Panini and his disciples contemplated the Vedic corpus (9.4.2) in order to invent the most cognitively efficient means of transmission of the Corpus through human cerebral wetware without ever writing it down, Dominicans were creating concordance tables of biblical verses, Arabs analyzed the Quran and kabbalists the Torah: and it cannot be excluded that practically all members of these otherwise divergent currents found a particular pleasure in doing so.
The advent of computers changed such an opaque hermeneutic passe-temps of a few most devoted philologists into a full-fledged and highly empiric science. The symbol-reading and symbol-manipulating faculties of Turing machines, embedded in first thousands, then millions, then billions of transistor flip-flops, have made it possible to process all the words of one's library in a few seconds. Frequencies of occurrence of a word W - i.e. the answer to the question "How many times does the word W occur in corpus C?" - were evaluated for bigger and bigger corpora; probability distributions of relative frequencies - i.e. fW normalized by the number of all words in C - were assessed. And new evidence was given that natural language corpora contain regularities so salient that one can, or even must, explain them in terms of mathematical "laws".

10.1.1 zipf's law

The basic form of Zipf's law can be expressed by the equation:

fW ∗ rW ≈ C

where fW is the frequency of occurrence of word W in the corpus, rW is the word's rank in the table where all words of the corpus are sorted according to their frequency in descending order (i.e. the most frequent word has rank 1, the second rank 2, etc.) and C is a constant. In other terms, Zipf's law states that the frequency of a word is inversely proportional to its rank in the frequency table, which is equivalent to the statement that « the frequency of a word in a text and its rank is approximately linear when plotted on a double logarithmic scale » (Ferrer-i Cancho and Elvevåg, 2010). In terms of probability distributions, this law states that the frequencies of occurrence of words in the corpus are independent and identically distributed random variables with distribution

p(f) = α ∗ f^(−1−1/s)

id est, the "power law"/Pareto distribution with exponent s. G.K. Zipf was profoundly convinced that this regularity is an expression of a domain-general cognitive eco(nom|log)y principle of least effort. More concretely, he conjectured that the observed regularity is a consequence of the tendency of the linguistic system to attain the state of vocabulary balance, i.e. the state of equilibrium between two opposing forces, the force of unification and the force of diversification, which he characterized as follows: « on the one hand, the Force of Unification will act in the direction of decreasing the number of different words to 1, while increasing the frequency of that 1 word to 100%. Conversely, the Force of Diversification will act in the opposite direction of increasing the number of different words, while decreasing their average frequency of occurrence towards 1. Therefore number and frequency will be the parameters of vocabulary balance.» (Zipf, 1949)

The next generations of linguists - Chomsky included - and mathematicians - e.g. Benoit Mandelbrot - were less enthusiastic when it came to the importance which Zipf attributed to his "law". The point of conflict was not whether the frequencies of words in natural language texts follow the power law distribution: each new analysis demonstrated that this is, verily, the case. The argument arose when some authors started to consider Zipfian distributions a tautological necessity, a phenomenon emerging anytime, even in randomly generated artificial corpora.
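Whether on a natural or on a randomly generated corpus, the law's claim is easy to check computationally. The following minimal sketch (the corpus filename is a placeholder) computes the rank-frequency table and the product fW ∗ rW, which Zipf's law predicts to stay roughly constant:

from collections import Counter

# Hypothetical corpus file; any plain-text corpus would do.
corpus = open("corpus.txt").read().lower().split()
freqs = Counter(corpus)

for rank, (word, f) in enumerate(freqs.most_common(10), start=1):
    print(rank, word, f, rank * f)   # f*r should stay in the same ballpark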
It is very much related to the particular representation one chooses, i.e., rank as the independent variable» (Li, 1992) and reiterated Mandelbrot's remark that ZL is "linguistically very shallow". On the other hand, a more recent article convincingly demonstrates that a «good fit of random texts to real Zipf's law-like rank distributions has not yet been established. Therefore, we suggest that Zipf's law might in fact be a fundamental law in natural languages» (Ferrer-i Cancho and Elvevåg, 2010).

Lateral support for such a claim comes also from a study which focused on the "evolution" of ZL - notably of the exponent s - in language ontogeny. Given the statistically significant observations that «in children the exponent of the law tends to decrease over time while this tendency is weaker in adults... Our analysis also shows a tendency of the mean length of utterances (MLU), a simple estimate of syntactic complexity, to increase as the exponent decreases. The parallel evolution of the exponent and a simple indicator of syntactic complexity (MLU) supports the hypothesis that the exponent of Zipf's law and linguistic complexity are inter-related» (Baixeries et al., 2013), we add that it would be somewhat difficult to observe such ontogeny-related modifications of ZL's exponent if ever ZL was just a pure artefact owing its existence to one's choice of mathematical formalism.

At last but not least, we consider it worth reiterating that similar Zipf-Mandelbrot distributions were observed in sciences other than linguistics. In ecology, for example, the distribution of the number of species observed as a function of their abundance is understood as a zipfian phenomenon (Mouillot and Lepretre, 2000). Given that ecology is principially a science about equilibrium-seeking systems consisting of populations of entities which interact and replicate, we consider the fact that similar scaling phenomena operate both

• in the realm of words - and, Zipf would add, also in the realm of "meanings" because «words are tools that are used to convey meanings in order to achieve objectives... the reader may infer from the orderliness of the distribution of words that there may well be a corresponding orderliness in the distribution of meanings because, in general, speakers utter words in order to convey meanings» (Zipf, 1949)
• in ecology

to support the Thesis that certain neurolinguistic structures intramentally interact and replicate.

end zipf's law 10.1.1

10.1.2 logistic law

Another among the multiple "quantitative laws" which seems to be of particular interest for anyone aiming to understand and create evolutionary models of language ontogeny is the "logistic law", often known as Piotrowski's law. This law postulates that language development follows the logistic curve, formalizable into mathematical notation as

p(t) = 1 / (c + a · e^(−b·t))

whereby t denotes time, p(t) denotes the quantified value of an observable property of the linguistic system at time t, e is Euler's constant and a, b, c are parameters of the model. We consider it important to mention that what the research of Best (2006) and other participants in the "Göttingen project" indicates is that the law applies not only to ethnogenic, cultural and historic (i.e. Sprachwandel, c.f. 8.5) but also to ontogenic development of linguistic systems (i.e. Spracherwerb).

Figure 17: Logistic law in relation to historic and ontogenetic linguistic processes. Data taken from Best (2006).
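Read as a recipe, the formula above is trivial to evaluate. The following toy sketch - its parameter values are invented placeholders, not the estimates reported by Best (2006) - prints the characteristic first-slow-then-fast-then-slow-again development of the logistic curve:

use strict;
use warnings;

# p(t) = 1 / (c + a * exp(-b * t)), with invented illustrative parameters
my ($A, $B, $C) = (100, 1.0, 1.0);    # the a, b, c of the formula above

sub p {
    my $t = shift;
    return 1 / ($C + $A * exp(-$B * $t));
}

for my $t (0 .. 12) {
    printf "t=%2d  p(t)=%.4f\n", $t, p($t);
}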
Figure 17 illustrates examples of these two cases:

• the points on image (a) represent the increase of the amount of words of Arab origin in German between the 14th and the 20th century, while the line represents the ideal logistic curve with parameters (a=7.41, b=0.696, c=160)
• the right image (b) represents the gradual increase of Mean Length of Utterance (9.2.4)

While members of certain schools may argue adamantly that many phenomena in both ethnogeny and ontogeny can be explained or even modelised in terms of logistic curves, other data1 shall rightfully oblige others to express a certain scepsis towards the capacity of the logistic curve to cover practically ALL quantitative aspects of the language acquisition process, and to do so with sufficient statistic significance. On the other hand, some phenomena like the "vocabulary spurt" (c.f. 13) sometimes seem to follow the logistic first-slow-then-fast-then-slow-again development so faithfully that it would be unwise to a priori ignore such a salient, formal and high-order analogy between the ethno- and onto-geny of intra- and inter-personal linguistic eco-systems.

1 C.f., for example, Figure 15 to see some data which, while aiming to represent practically the same phenomena as Figure 17, seem unsubsumable under the logistic curve.

Be it as it may, instead of trying to adequately address the Haeckel-like conjecture that some processes of linguistic ontogeny are formally isomorphic to certain processes in linguistic ethnogeny, it seems more appropriate to focus the attention of the reader upon the fact that closing the previous paragraph with the plural form of the term eco-system was intentional. This is so because it was indeed ecology where logistic curve models were deployed for the first time: introduced in 1838 by Pierre-François Verhulst as a model whereby the reproduction of a population is proportional to both the existing population and the amount of available resources, and canonized later by (Lotka, 1925) as the law of population growth, the logistic curve is closely related to the predator-prey (or Lotka-Volterra) differential equations which are, even more than a hundred years after their conception, still considered a model of reference for the population dynamics of biological and ecological systems within which two or more species interact.

end piotrowski's law 10.1.2

In this brief overview of Corpus and Quantitative linguistics we have mentioned two hallmark "laws" postulated (or discovered?) by proponents of this discipline: Zipf's law and Piotrowski's logistic law. We have indicated that both of these laws have certain ontogeny-pertinent aspects which make them worthy of interest not only to researchers interested in historical linguistics, but also to those known as "psycholinguists". What's more, it was also indicated that there exists a certain analogy, a certain partage of features, between developmental psycholinguistics and ecology:

• not only the frequencies of words in corpora, but also species abundances in ecology are Zipf-distributed
• logistic curves are used to model not only the rate of (pro|intro)duction of new words into the corpus, but also the population dynamics of diverse mutually-interacting species within a specific ecosystem

Given that there exists a certain formal similarity between models of the dynamics occurring within ecological or linguistic systems, the transposition of certain principles from ecology into psycholinguistics may seem to be appropriate.
end quantitative and corpus linguistics 10.1

10.2 formal language theory

Formal Language Theory (FLT) is a computational theory of formal languages and formal grammars. Being rooted in apodictic definitions of computer science, mathematics and logic, its aim is to offer a solid, coherent and scientifically valid framework useful for

1. the design of new artificial (e.g. programming) languages
2. the elucidation of the structure and function of natural languages

No-one denies that when it comes to the first objective mentioned above, the practical utility of FLT-originated concepts and principles is demonstrated anytime a computer translates source code into machine code. It is true that without any solid theory thematising the rules of production and parsing of symbolic sequences, it would be highly problematic to proceed all the way from the romantic intuitions of lady Ada Lovelace, the notebooks of Gottlob Frege and Zuse's Plankalkül through assembler, C, C++ all the way to the parsers, linkers, and compilers of modern high-level programming languages like Python, PERL or R. But the capper evidence that FLT can also yield a framework useful for the attainment of the second goal is yet to be furnished. In spite of the effort initiated by Chomsky's focus on generativism (), that is, in spite of more than half a century of intellectual work of thousands of the most brilliant minds of their generation, no pure FLT-based model2 has been proposed which could account for the diversity of forms of even such a morphologically poor language as English. Sadly for science, sectarian disputes within the FLT community are of an envergure which makes it impossible to answer even the most trivial problems, like that of the positioning of natural languages within the Chomsky-Schützenberger hierarchy. This being said, let's just introduce the conceptual pillars upon which FLT stands.

2 By pure FLT model, we mean a model that does not contain any statistic components.

10.2.1 basic tenets (def)

FLT is based on the notions of symbols, sequences and sets. Thus,

1. alphabet A is defined as a finite set of symbols including the empty symbol ε
2. string S is defined as an ordered sequence of concatenated symbols contained in A
3. language L is defined as a set of strings over A
4. * (Kleene star) is a free monoid unary operator generating all possible strings over a certain alphabet:

A* = ∪_{i∈ℕ} A^i = {ε} ∪ A ∪ A² ∪ A³ ∪ A⁴ ∪ ...

A* therefore denotes the infinite set of all possible strings over A, and language L is either a subset of, or equivalent to, A*, i.e. L ⊆ A*.

Given this, the grammar G_L of language L is a means of characterizing which among the members of A* are to be contained in L, and which are not. In traditional FLT, Grammar is defined as follows:

Grammar and Rule (DEF)
A grammar G is a tuple {V_T, V_N, X, R}, where V_T is the set of terminal elements, V_N is the set of non-terminals, X - an "axiom symbol" - is a member of V_N (X ∈ V_N), and R is a finite set of N rules R = {r_1, r_2, ..., r_N}. A (rewriting|(produc|substitu)tion) rule r has the form foo → bar and fundamentally denotes a 2-ary substitution operation wherein the first operand foo is substituted by the second, bar, or vice versa.
end grammar and rule (def) 10.2.1

The expression vice versa is quite important here, for it denotes that a grammar can be useful in both
1. (produc|genera)tion of a terminal string-expression E of language L, which is started by the input of the "entry axiom" X, takes place when rules are applied in right-wise order (i.e. foos are substituted by bars) and is to be terminated only when the string does not contain any non-terminal symbols.

2. parsing / comprehension of a string (sentence), which is started by the input of E and terminates when some substitution transforms E or its derivates into X. In other terms, this scenario occurs when rules are applied in the reverse, left-wise order (i.e. bars are substituted by foos within the string) and terminates when the working string does not contain any terminal symbols.

Practically all currently widely used notations take this symmetry between the substituens (that-which-substitutes) bar and the substituendum (that-which-is-substituted) foo as granted. Table 6 illustrates "plain", "compressed" and "uncompressed" grammars written down in three common notations3.

3 In all notations we follow the common convention of denoting non-terminal symbols with uppercase characters (e.g. "B", "M", "S", "X") and terminal symbols with lowercase characters (e.g. "a", "b", "m" ...).

Note that the "uncompressed" and "compressed" grammars of G_L are equivalent only when it comes to the language they cover, but not in the way the G_L is represented. They are functionally but not structurally isomorphic. Thus, where "uncompressed" grammars represent a disjunction in terms of multiple trivial rules (for any disjunction, one has as many rules as there are disjunct elements), compressed grammars represent a disjunction by one rule only. The price, however, is the need to introduce the disjunctive symbol | and to use it every time a disjunction needs to be marked.

                S-notation     Backus-Naur notation    PERL-notation
Plain           X → mama       X := <mama>             s/X/mama/
                X → baba       X := <baba>             s/X/baba/
Compressed      X → SS         X := SS                 s/X/SS/
                S → ba|ma      S := <ba> | <ma>        s/S/ba|ma/
Uncompressed    X → MM         X := MM                 s/X/MM/
                X → BB         X := BB                 s/X/BB/
                M → ma         M := <ma>               s/M/ma/
                B → ba         B := <ba>               s/B/ba/

Table 6: Diverse notations of three grammars covering the language L = {"mama", "baba"}.

For a logic-oriented reader, it may be useful to conclude that FLT considers languages and their respective grammars to be equivalent to a sort of formal system. Thus, the set of all strings A* being understood as the set of all (i.e. both true and false) propositions, the strings belonging to language L can be understood as true theorems and the act of deriving them by G is equivalent to theorem proving.

end basic tenets 10.2.1

10.2.2 chomsky-schützenberger hierarchy (txt)

A language L can be classed according to the type and form of the rules which its respective grammar G_L contains. Undoubtedly the most common typology is the Chomsky-Schützenberger hierarchy of languages which sorts all possible languages into one among four classes. These are defined as follows:

1. unrestricted grammars contain rules which can contain any combination of terminals and non-terminals in both substituens and substituendum
2. context-sensitive grammars have rules of the form αAβ → αγβ with A a nonterminal and α, β and γ strings of terminals and|or nonterminals. Strings α and β may be empty; γ, however, must be nonempty.
3. context-free grammars have rules of the form A → γ with A being a nonterminal and γ being a string of terminals and/or nonterminals.
4. regular grammars have rules with one single non-terminal on the left side and one terminal with at most one juxtaposed non-terminal on the right side

These classes of languages are mutually embedded. Thus, the class of regular languages is a specific subset of the context-free languages, and it follows that while any regular language is a context-free one, not every context-free language is a regular one. Idem for embeddings of higher order: context-free languages are specific cases among the context-sensitive ones, and context-sensitive languages are just a certain specific subset of the vast "unrestricted" ocean of Type-0 languages.

The main categorization being so canonized, a great deal of FLT is occupied with the study of the algebraic and computational properties of these classes. It is thus known that languages produced by regular grammars can be recognized by finite state automata (FSAs), context-free languages can be recognized by non-deterministic push-down automata, context-sensitive languages are recognizable by means of linear-bounded non-deterministic Turing machines, while an arbitrary Type-0 language is not to be recognized by anything less complex than a Turing machine. It is indeed in such overlap regions between computer science and algebra where the conceptualization of things in terms of the C-S hierarchy finds its utmost utility. Utmost and practical: for as it was stated - but merits to be re-stated - purely theoretical explorations of the mutual relations between diverse types of grammars and diverse types of symbol-manipulating automata can and indeed do have serious material consequences: faster encoding and faster decoding mean faster machines.

But in relation to the diversity of expressions of natural languages, FLT taxonomies can be quite misleading. For example, as is nicely illustrated in the overview of (Jiménez López et al., 2000, pp. 87-97), even after decades of debates, "linguists" cannot even find an agreement on whether English alone fits into the class of context-free languages or whether it is more appropriate to consider it a priori a context-sensitive language. For experts coming from FLT-ignorant domains of linguistics - be it linguistic typology, comparative grammar etc. - such debates express nothing else than a sad waste of intellectual resources. Confronted on a daily basis with the astounding diversity of linguistic structures grounded in the substrate of their usage, adherents of such schools would at most dare to utter: "In the world of natural languages, nothing is certain and nothing is fixed. Aside from the fact that natural languages belong to the class of Type-0 languages. Maybe."

end c-s hierarchy 10.2.2

10.2.3 grammar system theory (txt)

A spin-off branch of Formal Language Theory which is of particular interest in regard to the overall objectives of this Thesis is devoted to the study of Grammar Systems (GS). A grammar system is a «set of grammars working together, according to a specified protocol, to generate a language» (Jiménez López et al., 2000). Thus, contrary to the definitions of canonic FLT in which one grammar generates one language, in GS several grammars work together in order to generate one language. Grammar Systems can therefore be considered a sort of multi-agent variant of traditional «monolithic» FLT. Such a multi-agent nature of GSs implies cooperation, communication, distribution, modularity, parallelism, or even the emergence of complexity.
Let’s take as an example the most simple among the GS, so-called "language colonies", defined in (Kelemenová and Csuhaj-Varjú, 1994) as follows: Language Colony (DEF) A language colony colony C is an (n+2)-tuple C = (T , R1 , ..., Rn , S), where 1. Ri = (Vi , Ti , Pi , Si ) , for every i, 1 6 i 6 n, is a regular grammar generating a finite language; Ri is called a component of C; 2. S = Si for some i, 1 6 i 6 n; S is called the startsymbol of C; S 3. T ⊆ i = 1n Ti is called the set of terminals of C S And the total alphabet of C is denoted by V, i.e. V = i = 1n (Ti ∪ Ni ) end language colony 10.2.3 Figure 18 illustrates a very simple bi-component (n=2) "language colony" variant of a GS. What is striking in case of even such a simplistic colony is that the very fact of sharing and exchange of strings between two otherwise finite regular grammars results in generation of an infinite language. Polylithic model 10.2 formal language theory Figure 18: Emergence of "miraculous" infinite generative capacity by means of interlock of two finite grammars. Figure reproduced from Kelemen (2004). Blackboard model Other Grammar Systems GST still mostly theoretical Let it be reiterated: by allowing two or more finite components to communicate through a common symbolic environment, one can generate a set of strings - a language - with potentially infinite cardinality ! Kelemen (2004) denotes such behaviour - which is very common in the world of GS - with the term «miracle». The cornerstone idea of not only language colonies but also of any other GS is that diverse "component" grammars share a common "environment". This environment is nothing else than a shared string whereupon and wherein diverse components grammars apply their rules of production. In analogy to class (population) of individual students which together solve the problem on the blackboard they see, the term "blackboard model" is often used to denote the idea. For psychologues this model can be somewhat reminiscent of "working memory" accessible and accessed by diverse independent and encapsulated cognitive modules. Computer scientists, on the other hand, may see some similarity with multiple computational threads accessing the same address space in the shared memory. Aside "language colonies" and GST introduces and precisely defines many other theoretical and formal constructs like "Cooperating Distributed Grammar Systems", "Parallel Communicating Grammar Systems" and "eco grammar systems". Notably due to life-long work of Erzsébet Csuhaj-Varju and substantial contributions by George Paun and Jozef and Alica Kelemens are these constructs developed in such a detail that it is practically impossible for us to introduce here, in extent and rigour they merit, the exact formalisms of GS theory in closer detail. Instead, we forward a potentially interested reader to the doctoral dissertation of Jiménez López et al. (2000) which contains many persuasive arguments for application of GS upon the study of natural human languages. On the other hand, the forereferred dissertation is limited by the fact that it mostly proposes to use the Grammar System Theory as a framework explaining the final, i.e. "adult" linguistic component, and not as a framework which could elucidate the very process of 123 124 computational linguistics language development and language acquistion. 
Aside from "language colonies", GST introduces and precisely defines many other theoretical and formal constructs like "Cooperating Distributed Grammar Systems", "Parallel Communicating Grammar Systems" and "eco grammar systems". Notably due to the life-long work of Erzsébet Csuhaj-Varjú and substantial contributions by George Paun and Jozef and Alica Kelemen, these constructs are developed in such detail that it is practically impossible for us to introduce here, in the extent and rigour they merit, the exact formalisms of GS theory. Instead, we forward a potentially interested reader to the doctoral dissertation of Jiménez López et al. (2000) which contains many persuasive arguments for the application of GS to the study of natural human languages. On the other hand, the forereferred dissertation is limited by the fact that it mostly proposes to use Grammar System Theory as a framework explaining the final, i.e. "adult", linguistic component, and not as a framework which could elucidate the very process of language development and language acquisition. In fact, we are not aware of any study which would use GST as a theoretical explanatory framework for the process of LD, nor of any tentative aiming to implement GST in concrete programs offering solutions to concrete practical "natural language processing" (NLP) problems.

end grammar systems 10.2.3

FLT unites set theory, algebra and the theory of formal systems into a highly abstract and subtle conceptual framework aiming to help us (and machines) to conceive more optimal sequences of operations within the realms of sequences (strings) of symbols. It introduces many useful notions like that of

1. terminal symbols, i.e. those symbols which materially occur in the articulated utterance (i.e. are parts of the "signifiant")
2. non-terminal symbols, i.e. those symbols which denote generic properties inherent in and specific to the utterance
3. substitution rules and grammars (10.2.1)

which are, in one form or another, to be found in all linguistic theories at least since Panini (9.4.2). One simply cannot have a linguistic theory - no matter whether general, descriptive, generative, psycholinguistic or developmental - without postulating both material observables (terminals) and non-material non-observables (non-terminals), plus something like a list of principles which relate the two.

Unfortunately, FLT was canonized in an era when computer scientists and computational linguists had to think about the allocation of every byte of memory.4 In such a context, CPU-register-manipulating recursive while-loops were considered magical means of generating large amounts of output from minimal input. Thus, a sort of obsession with the notion of recursivity was born, which led generativists to

1. tentatives to explain a huge part of (or all of) human linguistic creativity in terms of recursivity
2. ignorance of the role which memory plays not only in concrete situations of linguistic performance, but also for the overall stability of the system underlying one's linguistic competence

What's more, FLT is strictly about syntax. It is, ex vi termini, a self-encapsulated formal system, and any tentative to make reference to the world of semantics beyond syntax is predestined5 to put FLT into a state of irreversible havoc. For the world of meanings is the world of passionate contextual transpositions, useful metaphores, implicit ambiguities and fuzzy approximations; FLT, on the other hand, brings about the realm of evermore-abstract arborescent hierarchies of pure reason. Fitting one into the other, subsuming syntax to semantics or semantics to syntax, thus seems to be at least as absurd a problem as the good old egg-chicken dilemma.

4 However, the contemporary generation of computer scientists is not subjected to such constraints. Memory is cheap in a world where 640kb ought NOT to be enough for anybody.

end formal language theory 10.2

10.3 natural language processing

Natural Language Processing (NLP) is a field of artificial intelligence and linguistics which explores the machine's faculty to understand, produce and interact in natural languages. In contrast to both quantitative and corpus linguistics, which mainly concentrate on the discovery of general quantitative principles and sometimes on data-mining, as well as in contrast to FLT, whose ultimate challenge is purely theoretical, NLP is concerned with concrete, practical and real-life problems of verbal interaction between humans and machines.
As was already noted in Chapter 4, the so-called Turing's Test (TT) is - at least in the canonic6 form in which Alan M. Turing had proposed it - in its very essence nothing else than an NLP challenge. This is so because in the canonic TT, the interaction between the human tester and the artificial testee is mediated solely through the written verbal modality.

The task of creating an artificial system which would truly pass the TT is not as easy as Turing and early computer scientists had believed. Natural languages are multi-layered structures whose components mutually interact both with each other as well as with their external environments, the very personal identity of their host not excepted. Natural languages serve many goals - giving commands, transfer of information (or deceit), telling stories - and often exploit highly irregular means with which these goals are attained. Machines, on the other hand, are regular and ordered. If not programmed otherwise, they blindly follow the path towards the stationary state; if not programmed otherwise, they are unable to deal with any irregularity whatsoever. Thus, in order to bring the ordered world of machines together with the unpredictable world of living language, NLP engineers usually proceed step after step: one minute linguistic problem is understood, formalized and subsequently tackled in one's source code. Then another.

5 Take, as an example, the introduction of Θ roles into Chomsky's Government & Binding Theory.
6 C.f. Hromada (2012a) for a description of a taxonomy of TT-consistent scenarios allowing the evaluation of not only linguistic, but also emotional, spatial, visual, corporal, moral etc. intelligences of an artificial agent.

Indeed many are such problems:

• author attribution
• plagiarism detection
• named entity disambiguation
• word and/or morphological segmentation
• sentiment analysis
• relationship extraction
• rhetoric figure detection (Hromada, 2011)
• automatic summarization
• discourse analysis
• anaphora resolution
• parsing
• automatic translation
• natural language understanding
• natural language generation
• question answering

All these are just a few among the dozens of tasks which NLP experts aim to tackle. In practice, these are almost always solved by means of the adoption of NLP's ultimate methodology: machine learning.

10.3.1 machine learning

Machines can learn. That is, machines are able to discover the underlying general patterns and principles governing the concrete input data and can subsequently exploit such general knowledge in contact with data which they have never seen before. They «can use experience to improve performance or make accurate predictions» (Mohri et al., 2012). And in an ever-bigger number of domains, they do so still better and better than their human teachers.

Since the moment when machine learning (ML) was first defined, in relation to the game of checkers, as the «field of study which gives computers the ability to learn without being explicitly programmed» (Samuel, 1959), the discipline of ML has evolved to an extent which is hardly compressible into a single book (Mohri et al., 2012) and certainly incompressible into the limited scope of this subsection. This is so not only because the number of domains of ML's application grows from year to year, but firstly because the quantity of distinct ML methods is already counted in dozens, if not in hundreds.
The general framework - sometimes also called "learning theory" (LT) - however, stays the same. No matter whether in psychology or in computer science, LT principially studies how an information-processing system (e.g. a brain or a computer) processes, represents and stores data sensed from the external environment, how it internally transforms them and how the outputs of such transformations influence the subsequent activity of such a system (including the sensing and processing of future data). There is thus a system that learns (the learner system, LS), the learnt information (LI) and the process of learning (PL). Interactions among these three components, the question whether one should postulate fewer (e.g. in the case when LS ≠ PL) or more such components (e.g. in the case when the sensed data differ from the learnt information), and many other topics - some stemming from the neurosciences, others from pure mathematics - are all to be explored by a full-fledged LT.

A distinction which is most pertinent for the purposes of this Thesis - and, one may argue, for ML in general as well - is the distinction between supervised and unsupervised learning. Supervised learning, also called learning-with-Teacher7, is based upon the idea that a full cycle of a learning process consists of two stages:

1. training|learning stage - the LS is first exposed to a set of problems and their respective solutions, and then aims to create a model associating the two
2. testing|evaluation stage - the LS exploits the previously constructed model in order to furnish solutions to problems to which she wasn't exposed during the training stage. Its performance is then evaluated according to certain evaluation metrics.

7 Or learning-with-Oracle, if the Teacher system is able to correctly solve the problem (e.g. furnish the answer) immediately after it has received an input sufficiently describing the problem (e.g. a meaningful question).

In unsupervised learning, on the contrary, it is expected that the one-who-launches-the-program shall not furnish any explicit solution|answer-related information to the LS. The training phase is thus practically equivalent to the testing phase: both contain questions; neither contains answers. The LS is simply furnished a huge dataset - in unsupervised NLP practice, the dataset is almost always equivalent to a textual Corpus - and is asked to do something reasonable with it. Cluster the corpus contents into classes, for example.

While the distinction between supervised and unsupervised seems to be crystal-clear for anyone practicing the NLP fach, the "cognitive plausibility" of fully unsupervised learning is more than disputable. Primo, the distinction turns out to be problematic for any model of phenomena in which the very order of exposure - i.e. the fact that the corpus to which the LS was exposed contains first the token baba and only later the token mama - can significantly influence the learning process. Thus, for models for which the statement «the engineer's decision to confront the algorithm with corpus X and not Y, and to do so in the moment T1 and not T2, is already an act of supervision» (Hromada, 2014b) holds, the method cannot be considered strictly un-supervised even in the absence of any explicit answers. Secundo, in the case of the modeling of LD processes, one cannot say that the toddler undergoes "unsupervised" learning just because the input to which she is exposed does not contain any explicit corrections, cues or answers.
The very corpus is the answer and - from the toddler's point-of-view - the very authority of the adult who furnishes the corpus mints the corpus with the justification of its truthfulness and validity. The very notion of a "valid solution" or a "correct input-output mapping" loses a non-negligible part of its importance when one realises that the LSs which we aim to discuss here can be conditioned to perceive agrammatical and false utterances as grammatical and true. No matter whether it is the case of a child in the middle of the ego-centric stage or of a victim of a propaganda machinery, it is often NOT the adequacy with external reality, nor the consistency with as big a set of propositions as possible, which counts. Instead, it is the repetition, the frequency of co-occurrence, the self-referential and self-reinforcing set of references to the minimal "seeding" set of symbols, which counts and which directs the learning process.

Tertio, it is evident that both the accuracy as well as the speed of learning to solve a particular class of problems is, at least in the case of human learners, significantly catalyzed by the presence of a teacher skilled enough to adapt the input-to-be-taught to the momentaneous state of the LS. Vygotsky's "zone of proximal development" (Vygotsky, 1978) is too salient and too omnipresent a fact to be ignored: humans learn more efficiently with a skilled teacher. And this, as constructivists would also argue, is a domain-general fact which is to govern not only singing, drawing, cooking or bicycle riding but... all facets of natural language learning as well.

Principially for these "cognitive plausibility"-related reasons shall we attribute, in volume 2, a certain conceptual priority to supervised ML in evolutionary models of the ontogeny of toddlerese. But before doing so, let's focus on that which both the supervised and the unsupervised branches of ML have in common: evaluation.

Evaluation

It is in the degree of sharing and conventionality of formal, quantitative and objective means of evaluation that science can be distinguished from art and, to a lesser extent, engineering from science. And NLP, as principially more a skill than a science, is not an exception in this regard. To paraphrase the same thing somewhat differently: it is not from the existence of diverse means of evaluation, but from the partage commune of the need to evaluate and of the knowledge of the usefulness of such evaluations, that the very productive unity of NLP stems. Productive unity in the field there is, and diversity - luckily for the field's survival - is there as well. Thus, what holds for the already-mentioned diversity of NLP's learning methods holds also for the diversity of evaluation metrics. This is so because there exists no wide agreement about the meta-criterion which could help to decide what criteria exactly a good evaluation metrics should fulfill. Hence, aside from the fact that a good evaluation metrics should make the result of an arbitrary experiment as comprehensible as possible even to an un-initiated greenhorn, and aside from the observation that there, verily, exist evaluation metrics which describe certain classes of phenomena better than others, it should not be a priori accepted that there exists "the" evaluation metrics which is the best of all.
Things being as they are, the NLP and "information retrieval" communities often tend to use the traditional evaluation formulas for Precision and Recall:

Precision and Recall (DEF)

Recall = (number of retrieved relevant entities) / (total number of all relevant entities)

Precision = (number of retrieved relevant entities) / (total number of all retrieved entities)

end precision and recall 10.3.1

whereby the relevancy of a "relevant" document is defined in regard to an external, ideally manually annotated étalon (i.e. golden standard), corrected by a human judge and subsequently furnished to the LS by the teacher or evaluator. Precision thus, in a certain sense, carries information about how much the set X retrieved by the algorithm is stained with "false positives" which do not belong to X according to the golden standard. Recall, on the other hand, carries information about how many among the entities which are labelled as "true" in the golden standard were selected (i.e. labeled as "positives") by the algorithm. Values of both are always constrained to the interval [0,1] and can be further combined into their "harmonic mean", commonly known as F-score:

F = 2 · (precision · recall) / (precision + recall)

which also yields a score from the interval [0,1] whereby 0 is obtained by the worst possible and 1 by the ideally performing algorithm.

This being said, it should be evident that precision and recall are concepts useful especially in the case of binary classification tasks, i.e. tasks in which one aims to categorize a certain set of entities into two groups (i.e. a is X or not-X). Given that the notion of binary distinction is indeed a powerful one, it is not uncommon that some studies succeed to get crowned with laurel - thanks to some additional averaging - even when they use precision & recall based metrics for the evaluation of pure multiclass classification problems, i.e. problems where one aims to categorize a certain set of entities into N > 2 groups, or clusters.
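Before proceeding to multiclass measures, the retrieval-oriented definitions given above may be illustrated by a toy computation; the "golden standard" and the "retrieved" sets below are invented solely for the sake of the example:

use strict;
use warnings;
use List::Util qw(sum);

my %relevant  = map { $_ => 1 } qw(d1 d2 d3 d4);   # invented golden standard
my @retrieved = qw(d2 d3 d5);                      # invented algorithm output

my $true_positives = sum(map { $relevant{$_} ? 1 : 0 } @retrieved) // 0;

my $precision = $true_positives / @retrieved;        # 2/3
my $recall    = $true_positives / keys %relevant;    # 2/4
my $f_score   = 2 * $precision * $recall / ($precision + $recall);

printf "precision=%.3f recall=%.3f F=%.3f\n", $precision, $recall, $f_score;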
Different measures were developed which target specifically the problem of multiclass clustering, the most traditional among these being purity, defined as:

purity(Ω, C) = (1/N) · Σ_k max_j |ω_k ∩ c_j|

where Ω = {ω_1, ..., ω_K} is the set of K clusters hypothetized by the LS, C = {c_1, ..., c_J} denotes the J classes present in the golden standard and N is the total number of clustered entities (e.g. documents). During the estimation of purity, each among the K hypothetized clusters is assigned to the class which is most frequent in the cluster. The accuracy of the assignment is subsequently assessed by counting the number of correctly assigned documents and dividing it by the total number of documents N. Similarly to all notions closely introduced in this section, ideal results have a value of 1 while bad results shall be close to 0.

Purity aside, literally dozens of other measures of clustering accuracy have already been developed; see Rosenberg and Hirschberg (2007) for an overview of the most important among them. The same article also introduces a measure called V-measure, defined as:

V-measure (DEF)

h = 1 − H(C|Ω) / H(C)
c = 1 − H(Ω|C) / H(Ω)
V = ((1 + β) · h · c) / (β · h + c)

where H(C) denotes the entropy of the collection of classes; H(Ω) denotes the entropy of the collection of hypothetized clusters; H(C|Ω) denotes the conditional entropy of C given Ω and H(Ω|C) the conditional entropy of Ω given C; and β specifies the weight between h and c8.
end v-measure 10.3.1

8 β is often set to 1 in order to bias the value of V neither towards homogeneity nor towards completeness.

Aside from the fact that its values also come from the interval [0, 1], V-measure disposes of multiple properties which make it worthy of interest for anyone willing to use an elegant measure of cluster evaluation. Not only is V-measure a harmonic mean of h (also called "homogeneity") and c ("completeness"), and thus strongly reminiscent of F-score, it also has the property of being stable in regard to the variation of the number of clusters. For these, as well as for other reasons more closely elucidated in Rosenberg and Hirschberg (2007); Christodoulopoulos et al. (2010); Hromada (2014a), shall V-measure be used in the "part-of-speech induction" chapter of the 2nd volume of this Thesis.

In order to work, Recall, Precision, F-score, Purity and V-measure require a golden standard which, in NLP, often has the form of manually annotated corpora. These measures, based on "external criteria", must not be, ex vi termini, used to modulate the execution of an unsupervised learning process. In learning scenarios in which the only source of knowledge is a pure non-annotated dataset, one is obliged to evaluate the clustering only according to criteria inherent in the dataset itself. Many such "internal criteria" have already been discussed in the literature (e.g. the silhouette coefficient, the Dunn index, the Davies-Bouldin index); one more - the "prototypicity coefficient" - shall be introduced in volume 2.

Let's now move forward with just one little warning: in no way does the sketchy overview hereby presented pretend to be a complete overview of NLP evaluation techniques, let alone of the learning methods themselves. Given the amount of research being done in the domain, this is simply impossible. Thus, in order to restrict this exposé to a reasonable length, the topic of the evaluation of continuous, i.e. "regression", ML models was completely set aside and all attention was concentrated upon the evaluation of ML algorithms which tend to "learn" models composed of two or more discrete categories. This design choice was mainly motivated by the belief that it is more reasonable to aim to explain the functioning of language cognition in terms of categorization, and not in terms of regression9.

9 Note that in the reasoning that shall follow, operations acting upon continuous domains are not to be completely excluded. Take as an example the notions of 1) the temporal half-life (i.e. decay interval) of a cognitive schema, 2) the selection of the locally-nearest-neighbor according to a similarity defined in the cosine metrics.

end evaluation 10.3.1

At last but not least, it is important to mention that machine learning is able to yield programs and applications which work, and work very well. And it is indeed especially NLP which is, aside from "computer vision"10, a field in and for which ML is developed. It is thus not too surprising that recent days have seen, for example in the article of Karpathy and Fei-Fei (2014), the results of some quite successful efforts to unite the two.

10 C.f. (Hromada et al., 2010) for an older application of ML methodology in the training of smile-detection classifiers.

end machine learning 10.3.1

ML-inspired methodologies for:

1. the problem of the ontogeny of semantic categories (equivalent to supervised learning of word meanings)
2. the problem of the ontogeny of morphosyntactic categories (also known as part-of-speech induction)
3. the problem of the ontogeny of grammars (also known as grammar induction)

shall be described in closer detail in the following sections, as well as in Volume 2.

end natural language processing 10.3

10.4 semantic vector architectures

It was already mentioned (9) that natural language furnishes a communication channel for the exchange of meanings. Meaning («signifié») is intentional: it refers to some external entity («referent»). Within the language L, a meaning M can be denoted by a token («signifiant») and it is by the exchange of physical (phonic in the case of spoken language, graphemic in the case of written language etc.) manifestations of these tokens that the producer (speaker|writer) and the receiver (hearer|reader) communicate.

Traditionally, the meaning of a word, i.e. its «semantics», was often considered as something almost «sacred» and not-to-be-formalized by mathematical means. The maximum which could be done - and had been done since Aristotle until the middle of the 20th century - was to define a concept in terms of lists of «necessary and sufficient features». Two types of features were considered to be both necessary and sufficient for the definition of the majority of concepts: the first specifying the concept's genus (or superordinated concept) and the second specifying the particular property (differentia) which distinguishes the concept from the other members of the same genus. Thus, for example, «dog» could be defined as a domesticated (differentia) canine (genus). An important property of such a system of concepts was that it allowed no ambiguous or fuzzy border cases: the logical «law of excluded middle» guaranteed that all entities which were not both canines and domesticated at the same time (e.g. a chihuahua which passed all her life in wilderness) could not be called a dog. Even in contemporary CL practice, projects like WordNet (Miller, 1995) incarnate such an aristotelic view in the form of datasets organizing the items of the human lexicon in what is principially an arborescent hierarchy of sub- and super-ordinated terms (i.e. of hyponyms and hyperonyms).

The change of the classical paradigm came slowly with the works of late Wittgenstein11 but especially with the empirical studies of Eleanor Rosch. What these studies (e.g. Rosch (1999)) found out was that not only are concepts often defined by bundles of features which are neither necessary nor sufficient, but also that the degree with which a feature can be associated with a concept often varies. Subsequently, Rosch proposed a «prototype theory» of semantic categories whose basic postulate is that some members of the category (or some instances of the concept) can be more «central» in relation to the category (resp. concept) than others. Thus, in some cultures "rose" is more "flower" than "daisy"; in other cultures the contrary is the case.

10.4.1 category prototype (def)

A prototype P of the category C is a member of C which shall be retrieved with the highest probability whenever one queries C for its most salient concrete representative. Such a member of C is to be as similar as possible to all other members of C and as dissimilar as possible from the members or prototypes of other categories.

end category prototype 10.4.1
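Read operationally, this definition suggests a straightforward retrieval procedure. The sketch below is purely hypothetical - the member vectors are invented and the choice of the cosine as the similarity criterion is our assumption - and returns as prototype the member of C with the highest average similarity to all other members:

use strict;
use warnings;
use PDL;

sub cosine {
    my ($u, $v) = @_;
    return inner($u, $v) / (sqrt(inner($u, $u)) * sqrt(inner($v, $v)));
}

# return the member most similar, on average, to all other members
sub prototype {
    my %members = @_;    # name => PDL vector
    my ($best_name, $best_score);
    for my $x (keys %members) {
        my @sims = map { cosine($members{$x}, $members{$_})->sclr }
                   grep { $_ ne $x } keys %members;
        my $score = 0;
        $score += $_ for @sims;
        $score /= @sims if @sims;
        ($best_name, $best_score) = ($x, $score)
            if !defined $best_score or $score > $best_score;
    }
    return $best_name;
}

# invented example vectors: "rose" and "tulip" huddle together
print prototype(
    rose   => pdl(1.0, 0.9, 0.1),
    tulip  => pdl(0.9, 1.0, 0.2),
    cactus => pdl(0.2, 0.1, 1.0),
), "\n";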
Prototype theory, as well as other both theoretic and empirical advances like the formalization of the notion of similarity, have - in combination with the development of information-processing technologies - paved the way to an operationalization of semantics which allows to transform the meanings of words into mathematically commensurable entities. In modern semantics, concepts are operationalized as geometric entities. Thus, the meaning of a token X observable within a language corpus C is often characterized as a vector of the relations which X holds with other tokens observable within the corpus. The set of such vectors associated to all tokens observable in C yields a «semantic space» which is a vector space within which one can effectuate diverse numeric and|or geometric operations.

Since a methodological objective of this dissertation is to bridge developmental psycholinguistics with the computational one, we consider it important to underscore that in NLP practice, the transformation of a corpus C into a semantic feature space S is practically always based on the empirical validity of the "distributional hypothesis" (DH) which states that «a word is characterized by the company it keeps» (Harris, 1954)12.

11 «For a large class of cases of the employment of the word 'meaning'—though not for all—this word can be explained in this way: the meaning of a word is its use in the language» (Wittgenstein, 1953)
12 DH can also be restated in somewhat more algebraic terms: «In the most simple case can be the vector which denotes concept X calculated as a normalized linear combination of vectors of concepts in context of which X occurs.» (Hromada, 2014d)

The practical usefulness of DH in practically all models of the geometric operationalization of meaning is undisputable. But DH has also a non-negligible theoretical importance. For, stated as it is, it supports «associationist» theories based on the notion that the essence of mind is somehow related to mind's ability to create relations, i.e. associations, between successive states. In addition to what was said in (9.4.1), we suggest that both mind's faculty to create associations, as well as the distributional hypothesis "the meaning of symbol X can be defined in terms of the meanings of the symbols with which X co-occurs", can be neurologically explained in terms of the already-mentioned Hebb's postulate: «The general idea is an old one, that any two cells or systems of cells that are repeatedly active at the same time will tend to become 'associated', so that activity in one facilitates activity in the other» (Hebb, 1964)

One can assume that IF

1. Hebb's rule governs the activity of not only single neurons but also of neural ensembles
2. distinct words Wx and Wy are somehow processed and represented by distinct neural ensembles Nx and Ny

THEN it shall follow that whenever a hearer shall hear (or a speaker shall speak) the two-word phrase Wx Wy, the ensemble of material (synaptic?) relations between Nx and Ny shall get reinforced. In more geometrical terms, on a more «mental» level, such a «rapprochement» of Nx and Ny would be characterized by the convergence of the geometrical representations of both circuits towards their common geometrical centroid. Thus, after processing the phrase Wx Wy, the vectorial representations of both Nx and Ny will be closer to each other than before hearing (or generating) the phrase.
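This conjectured «rapprochement» can be caricatured in a few lines of code; everything below - the learning rate, the two-dimensional vectors, the number of exposures - is invented solely for illustration:

use strict;
use warnings;
use PDL;

my $alpha = 0.1;            # invented learning rate
my $wx = pdl(1.0, 0.0);     # invented initial representation of Wx
my $wy = pdl(0.0, 1.0);     # invented initial representation of Wy

for my $exposure (1 .. 5) {
    my $centroid = ($wx + $wy) / 2;
    $wx += $alpha * ($centroid - $wx);    # each exposure to "Wx Wy" pulls
    $wy += $alpha * ($centroid - $wy);    # both vectors towards the centroid
    my $distance = sqrt(sum(($wx - $wy) ** 2));
    printf "after exposure %d: distance = %.3f\n", $exposure, $distance->sclr;
}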
10.4.2 hebb-harris analogy (aph)

For a corpus linguist, the distributional hypothesis means, mutatis mutandis, the same thing as Hebb's law for a neuroscientist.

end h-h aphorism 10.4.2

We conjecture that an associationist principle, similar to the one described above, is indeed at work whenever a mind projects the stimuli perceived from the external world unto an internally represented semantic space. Such a «semantic vector space» can subsequently be divided, partitioned or tesselated into diverse subspaces, each of which represents a diverse semantic category, class or concept. Or maybe even more than just represents: such partitions are concepts.

The big advantage of approaches modelling the «geometry of thought» (Gärdenfors, 2004) is that they allow, among other things, to measure and assess similarities and distances between two or more concepts. By doing so, they seem to be much closer to actual human experience with meanings than other computational methods (expert systems, ontologies, RDF etc.) based principially on the application of logical rules of inference. For programs which work with concepts as if they were geometrical entities have no problem whatsoever to answer questions like "what is more similar to a dog - a cat or a wolf?". Such questions - which any child would love to answer - couldn't be answered by an expert system without the intervention of a human operator who would explicitly declare the criterion of similarity according to which the similarity is to be assessed. But a system considering all three terms - "dog", "cat", "wolf" - as being just labels denoting geometrical points would have no problem to do so, if ever it was already confronted with a corpus in which the three terms occurred. And given the fact that these geometric models make it possible to calculate, evaluate or compare similarities between meanings, it is of no surprise that these very models make it quite easy to create artificial simulations of such cognitively salient phenomena as analogies, metaphors (Lakoff, 1990) and intuitions.

Let's now glance at a few such NLP models which process meanings as if they were geometric entities.

10.4.3 bag-of-terms

Bag-of-Terms (BoT) models distinguish contained and containing entities. Most often, words are understood as the contained entities and sentences or whole documents as the containing ones. What is important for such Bag-of-Words (BoW) models is that the document D1 contains a certain set of words while the document D2 contains another set of words. Such quantitative information about the number of occurrences of diverse words in diverse documents can be used to construct vectorial representations of these documents. This is done by representing every distinct document with a row vector whose specific elements denote specific words. Table 7 illustrates this for three sentences13, considered as individual documents. The order of words and other aspects (e.g. morphosyntax, phonology, prosody) are considered irrelevant: in pure BoW, it is only the occurrence of the word that counts.

13 Sentences like these (meaning "mama has ema", "ema has mama" and "mama has mama") are often among the first used in Slovak language primers.
              mama   má   emu   ema   mamu
mama má emu    1     1     1     0     0
ema má mamu    0     1     0     1     1
mama má mamu   1     1     0     0     1

Table 7: Vectorial representations of three sentence-sized documents. Every distinct word yields a distinct column.

This, however, is not necessarily the case in BoTs which implement another definition of the "contained entity" - i.e. of the component term by means of which one characterizes the "containing" document. For one can also work with terms which are either smaller, bigger or utterly different from words. One can look for occurrences of syllables or, simpler yet, of distinct sequences of N characters (N-grams). The construction of vectorial representations based on the occurrence of 3-gram terms is presented in Table 8.

     "mam" "ama" "ma " "a m" "má " "á e" " em" "emu" "á m" " ma" "amu"
D1     1     1     1     1     1     1     1     1     0     0     0
D2     2     1     1     1     1     0     0     0     1     1     1

Table 8: Vectorial representations of the sentence-sized documents D1 = "mama má emu" and D2 = "mama má mamu". Every distinct character trigram yields a distinct column.

In this case, one can see that some information about the word order is also included in the vectorial representation. This is so because the word-dividing empty space character " " is also taken into account, which was not the case in the pure BoW presented in Table 7. On the other hand, by focusing on trigram features and not on whole words, one may observe the feature "mam" to occur twice in document D2. Hence X_{2,1} = 2.

No matter what definition of document and term one uses, one obtains, at the end, a list of N D-dimensional row vectors where N is the number of documents in the corpus and D is the number of distinct tokens observed in the corpus. One thus obtains a term-document matrix X. In NLP practice, it is common and recommendable to further process the values of such a matrix with the so-called term frequency–inverse document frequency (tf-idf) weighting scheme.

TF-IDF (DEF)
Let tf(t, d) denote the term frequency, i.e. the number of times the term t occurs in document d, and let idf(t, D), denoting the inverse document frequency, be obtained as follows:

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

where N denotes the total number of documents in the corpus and |{d ∈ D : t ∈ d}| denotes the number of documents in which t occurs. The term frequency–inverse document frequency (tf-idf) is then calculated as follows:

tfidf(t, d, D) = tf(t, d) · idf(t, D)

in order to yield a numerical weight reflecting how important a word is to a document contained in a corpus.
end tf-idf 10.4.3
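A minimal sketch of the weighting scheme just defined - a toy computation over the two sentence-documents of Table 8 treated as word-level documents, not the indexing routine of our simulations - might look as follows:

use strict;
use warnings;

my %docs = (
    D1 => [qw(mama má emu)],
    D2 => [qw(mama má mamu)],
);
my $N = keys %docs;    # total number of documents

# term frequencies per document and document frequencies per term
my (%tf, %df);
for my $d (keys %docs) {
    my %seen;
    for my $t (@{ $docs{$d} }) {
        $tf{$d}{$t}++;
        $df{$t}++ unless $seen{$t}++;
    }
}

# tf-idf weight of every term in every document
for my $d (sort keys %docs) {
    for my $t (sort keys %{ $tf{$d} }) {
        printf "%s %-5s %.3f\n", $d, $t, $tf{$d}{$t} * log($N / $df{$t});
    }
}

Note how "mama" and "má", which occur in both documents, obtain the weight tf · log(1) = 0: terms which occur everywhere discriminate nothing.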
Verily is tf-idf a very simple yet very effective means by which an NLP engineer can increase the accuracy of one's vectorial model. But it also has some disadvantages. Primo, it adds a second pass to the construction of term-document matrices, which can, especially in the case of BigData NLP, bring about certain computational and memory costs. Secundo, the cognitive plausibility of tf-idf models is still to be demonstrated. In other terms: while practically the whole history of NLP empirically demonstrates that tf-idf represents an information-processing component wherein the statistical properties of the whole influence the weights of individual associations, current psycho-linguistic knowledge seems to fail to identify a cerebral mechanism functioning as tf-idf's neural correlate. Be it as it may, tf-idf brings even more order and information into the metric space given by the entities represented by the term-document matrix. And given that these entities are already of a numeric, quantified nature, they can be commensurated. The distance between words can be obtained by measuring the distance between two column vectors; the distance between documents can be obtained by assessing the distance between two row vectors. Multiple metrics (e.g. the Jaccard index, the Euclidean distance, the cosine for real-valued vectors, the Hamming distance for binary ones etc.) are used in order to do so.

end bag-of-terms 10.4.3

10.4.4 latent semantic analysis (txt)

A major disadvantage of term-document occurrence matrices, as generated by BoW models, is their sparsity. Given, for example, a corpus containing N = 1 million documents and M = 50000 distinct terms, BoW postulates the existence of a rectangular term-document matrix with fifty billion elements. And given that only a relatively small subset of distinct words shall occur in any specific document, the vast majority of values in such a matrix shall be zero.

Latent Semantic Analysis (LSA) was one among the first solutions aiming to address this sparsity problem in the NLP scenario. By unfolding the formula known in algebra as the singular value decomposition (SVD):

X = U Σ V^T

it transforms the original term-document matrix X into orthogonal matrices U and V and a diagonal matrix Σ. By selecting the D biggest values of Σ and the vectors of U and V associated with these values, one can reduce the dimensionality of the original matrix X to only D dimensions with the smallest possible error (in the least-squares sense, as guaranteed by the Eckart-Young theorem).

Algebraic and dimensionality-reduction aspects aside, LSA was, in its time, revolutionary for one principal reason: it allowed to compare not only documents with documents and terms with terms, but also terms with documents. It also allowed for a means of optimization: one could tune the model's performance by modifying the dimensionality14.

The feats furnished by LSA were, at the time of its conception, so astounding that LSA's conceptors considered their model to be the answer to the problem of category induction and to the antique problem concerning the essence of knowledge in general, hence promoting their computational model to the status of «a solution to Plato's problem: the latent semantic analysis theory of knowledge» (Landauer and Dumais, 1997).

LSA is indeed able to furnish dense, low-dimensional vector spaces of semantic categories and concepts. It also seems to yield interesting solutions for dozens of other problems; let's mention, as an example, the problem of grapheme-to-phoneme conversion in speech synthesis (Bellegarda, 2005). And it is also true that the transition through the site http://lsa.colorado.edu has been and is - for at least one generation of all sorts of cognitive science students - an important, useful, and potentially obligatory rite of passage of their academic parcours.

But it is also true that LSA has certain drawbacks. Computationally speaking, LSA is costly because SVD is costly. And cognitively speaking, it is somewhat difficult to see how a human brain could perform such a precise deterministic operation as SVD, let alone the dimensionality optimization which should precede it15. As LSA's conceptors put it: «It still remains to understand how a mind or brain could or would perform operations equivalent in effect to the linear matrix decomposition of SVD and how it would choose the optimal dimensionality for its representations, whether by biology or an adaptive computational process.» (Landauer and Dumais, 1997)

We propose to address the problem by simply ignoring the SVD altogether and rather focusing on another means of dimensionality reduction: the random projection.
14 According to (Landauer and Dumais, 1997), an optimal dimensionality for the problem of concept induction from English language corpora is approximately 300.
15 Note that the dimensionality optimization could have occurred during development, either phylogenetic or ontogenetic, or both.

latent semantic analysis end

10.4.5 random indexing (txt)

Random Indexing (RI) is a method of representation of textual corpora with dense, low-dimensional vector spaces. In theory, RI is justified by a lemma of Johnson-Lindenstrauss whose corollary «states that if we project points in a vector space into a randomly selected subspace of sufficiently high dimensionality, the distances between the points are approximately preserved» (Sahlgren, 2005). In more formal terms, the dimensionality of an r×c-dimensional term-document occurrence matrix X can be reduced by projection through a c×d-dimensional random matrix R, whereby the target number of dimensions (d) is the parameter of the projection and is smaller than the initial number of columns (i.e. d ≪ c):

X′_{r×d} = X_{r×c} R_{c×d}

In NLP practice, the simplest yet quite efficient variant of creation of such a slightly distorted d-dimensional matrix X′ is implemented by the following procedure: «Given the set of N objects (e.g. documents) which can be described in terms of F features (e.g. occurrence of the string in the document), to which one initially associates a randomly generated d-dimensional vector, one can obtain a d-dimensional vectorial representation of any object X by summing up the vectors associated to all features F1, F2 ... observable within X. The original random feature vectors are generated in a way that out of d elements of the vector, only S among them are set to either -1 or 1 value. Other values contain zero. Since the "seed" parameter S is much smaller than the total number of elements in the vector (d), i.e. S ≪ d, initial feature vectors are very sparse, containing mostly zeroes, with occasional value of -1 or 1.» (Hromada, 2014c). The Perl Data Language (PDL)-compliant source code of the procedure is presented in Listing 5.

Listing 5: Random Indexing Source Code

use PDL;

my $dimensions = 1000;    # target dimensionality d
my $seed       = 5;       # number S of non-zero elements per initial vector

my %doc_vectors;
my %term_vectors;

# generate a sparse random initial vector: $seed elements set to -1 or 1
# at random, non-colliding offsets; zeroes everywhere else
sub generate_initvector {
    my $value;
    my %set;
    my $vec = zeroes $dimensions;
    for (1 .. $seed) {
        (rand > 0.5) ? ($value = 1) : ($value = -1);
        my $offset = round(($dimensions - 1) * rand);
        while (exists $set{$offset}) {
            $offset = round(($dimensions - 1) * rand);
        }
        $set{$offset} = $value;
        index($vec, $offset) .= $value;
    }
    return $vec;
}

# document vector = sum of the initial vectors of all terms it contains
# (@document_list is assumed to contain the corpus)
for my $document (@document_list) {
    my @words = split(/[^\w]/, $document);
    for my $word (@words) {
        $term_vectors{$word} = generate_initvector
            if !exists $term_vectors{$word};
        $doc_vectors{$document} = zeroes $dimensions
            if !exists $doc_vectors{$document};
        $doc_vectors{$document} += $term_vectors{$word};
    }
}

Simply stated, the vectorial representation of a document A is obtained as a simple linear combination^16 of the initial vectors associated to terms T1, T2, T3 ... observable in A. For any such term, a d-dimensional initial vector is randomly generated, containing d − S zero elements and S elements whose value is either -1 or 1. The output of this simple variant of RI is a set of d-dimensional document vectors which can be used to calculate similarity among the documents. Normalization of these vectors is needed when one uses the cosine metric.
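As a hypothetical usage example - assuming %doc_vectors has been filled by Listing 5 for the toy documents of Table 8 - the cosine similarity of two document vectors can be computed with PDL's inner product:

use PDL;

# cosine similarity of two PDL vectors
sub cosine {
    my ($x, $y) = @_;
    return inner($x, $y) / (sqrt(inner($x, $x)) * sqrt(inner($y, $y)));
}

my $similarity = cosine($doc_vectors{"mama má emu"},
                        $doc_vectors{"mama má mamu"});
print "cosine similarity: ", $similarity->sclr, "\n";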
But one can go further: for one can additionally "reflect" the whole process, forget the random vectors (initially attributed to individual terms) and now calculate the vectorial representation of the term Tx as a linear combination of the documents in which Tx occurs. After 2 or 3 iterations^17 of such "reflection of information" from documents to terms and vice versa, one obtains numeric representations of both documents and terms, projected into one holistic metric space. Thus, in the spaces generated by Reflective Random Indexing (RRI) (Cohen et al., 2010), there is no distinction of essence between words and documents or, more generally, between objects and the contexts of their use. All can be understood as points or vectors of the same d-dimensional space. Not only that, such geometric entities can also be interpreted in terms of subspaces: one can speak about the region whose centroid is the entity E, or one can speak about subspaces orthogonal to E's vector. Once the world of meanings is thus geometrized, verily many are the applications of such "vector symbolic architectures" (Widdows and Cohen, 2014).

16 Weighting the term vectors with related tf-idf values is strongly recommended.
17 Note that due to convergence properties of random projection, more than 2 or 3 iterations of the reflective process often tend to degrade the accuracy of RI's semantic discrimination. On the other hand, multi-iterative convergence of associationist matrices yields highly useful results in other NLP tasks, including the estimation of the "importance of the sign" (Hromada, 2009) commonly known as PageRank (Hromada, 2010a).

random indexing end

10.4.6 light stochastic binarization

The raison-d'être of all semantic space architectures is information|knowledge retrieval. No matter whether one encodes one's dataset in form of bag-of-words, LSA, RI or RRI vectors, the objective is often the same: to implement the model in real-life applications which are able to identify members of the dataset which are semantically closest to some user-specified query. And to do so in reasonable time. Thus, the computational complexity of the retrieval phase is at least as important as the computational complexity of the indexing (encoding) phase. More so in the BigData scenario where one aims to find a needle in a haystack of billions of documents.

In case of data of very low dimensionality (d < 10), the solution is quite straightforward: one can sort one's data and create indices for it, by use of binary trees or other indexing techniques^18. Unfortunately, because of the so-called "curse of dimensionality", it is practically impossible to create retrieval indices for entities of higher dimensionality. In layman's terms this is so because two entities close to each other in many dimensions can still be considered far from each other (because they are really far from each other in just a few dimensions); or because two entities far from each other in some dimensions can still be considered relatively close to each other (because they are quite close in many other dimensions). Thus, in huge-dimensional spaces, usage of indices (e.g. k-d trees) in retrieval can sometimes turn out to be more costly than a simple "linear" search in which one compares one's query with all vectors stored in the dataset. Given that the complexity of such a linear search is N ∗ d, given that one cannot reduce the size of one's dataset (i.e. N), and given that one accepts that the "curse of dimensionality" is inevitable in semantic spaces, one can still hasten, in silico, the retrieval by at least two possible means:
1. construct semantic spaces of the smallest possible (yet still sufficiently high to encode semantically relevant distinctions) dimensionality d
2. execute operations with binary vectors (instead of integer, float or complex ones)

Combination of these two means into one algorithm yields Light Stochastic Binarization (LSB).

18 Dataset indexing is often explained in terms of a huge library with one shelf containing a sorted cartotheque of cards which specify the book's position in the library.

The idea behind LSB is fairly trivial and is inspired by approaches like Locality Sensitive Hashing (LSH, Datar et al. (2004)) or Semantic Hashing (SH, Salakhutdinov and Hinton (2009)). In these hashing approaches, the objective is to use a "hashing function" able to attribute a short and concise binary vector (i.e. "a hash") to any document in the dataset in a way that if two documents are similar (or identical), their hashes will also be similar (or identical). In this sense, LSB can also be understood as a sort of hashing algorithm which simply uses Reflective Random Indexing (10.4.5) as its hashing function. Once a document (or a query) Q is transformed by RRI into its vectorial representation q, whose n-th element we denote with q_n, one obtains the resulting binary hash h by trivial thresholding:

h_n = 0 if q_n < 0
h_n = 1 if q_n >= 0

Expressed verbally, when the value generated by RRI is equal to or greater than zero, one puts 1 into the respective position of the binary hash, otherwise one puts 0.^19 At its very core, it is nothing else than a mapping of RRI's output integer|float range onto the binary range. A mapping which exploits a mathematically beautiful intuition of Sahlgren (2005) that the random projection - as performed by RI and RRI - should be seeded solely with values of -1 and 1.

The study (Hromada, 2014c) has indicated that in case of classification scenarios where low recall is allowed if high precision is attained, LSB yields results comparable to (or better than) both binarized LSA and the renowned deep-learning technique proposed by Salakhutdinov and Hinton (2009). Figure 19 displays these results for the problem of multiclass classification (C=20). All models thereby represented used dimensionality d = 128; the size of a document hash was thus exactly 16 bytes.

light stochastic binarization end

10.4.7 evolutionary localization of semantic attractors

Reflective procedure aside, LSB involves neither optimization nor machine learning components. But given that it produces the simplest data structures possible - id est, low-dimensional binary vectors - it can be easily embedded into more complex frameworks. Evolutionary Localization of Semantic Attractors (ELSA, Hromada (2015)) aims to do exactly that.

19 This trivial thresholding is applicable only in case of huge (BigData) corpora where the law of large numbers applies. C.f. Hromada (2014c) for LSB's variant usable in cases of smaller corpora.

Figure 19: Comparison of reflective LSB (I=2 iterations) and unreflective LSB (I=0) with Semantic Hashing and binarized Latent Semantic Analysis. Reproduced from Hromada (2014c).

ELSA is a result of embedding the LSB into an evolutionary computation framework.
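Before turning to ELSA's details, the thresholding step described above can be sketched in a few lines of plain Perl (a hypothetical illustration, not the code of Hromada (2014c)):

# map a real-valued RRI output vector onto its binary hash
# by thresholding at zero
sub binarize {
    return map { $_ >= 0 ? 1 : 0 } @_;
}

my @q    = (0.7, -1.2, 0.0, 3.4, -0.1);   # RRI output vector
my @hash = binarize(@q);                  # yields (1, 0, 1, 1, 0)

# packed as a bit string, a d=128 hash occupies exactly 16 bytes
my $packed = pack "b*", join "", @hash;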
More concretely, ELSA uses canonic genetic algorithms (8.7.1) to localize a set of category prototypes (10.4.1) best adapted to the document classes encoded in the training corpus. ELSA thus aims to address the problem of supervised document classification and as such expects to be trained with a corpus containing documents and associated category labels. It first processes the whole corpus with the LSB algorithm and, once documents are transformed into binary vectors, it starts to look for the optimal set of category prototypes. In ELSA, the search for category prototypes is equivalent to the discovery of such a set of prototypes which minimizes the function:

F(P) = α Σ_{t ∈ c_P} H(t, P) − β Σ_{f ∉ c_P} H(f, P)    (1)

whereby P denotes the vector representation of the prototype in the binary space, H denotes the Hamming distance^20, t denotes the vector representation of a "true" document belonging to the same class (c_P) as the prototype, f is the vector of a "false" document belonging to some other class of the training corpus, and α and β are weighting parameters. Thus, a candidate prototype P of category c_x is considered to be most fit if it is as close as possible (i.e. has the smallest Hamming distance) to all documents which are attributed to c_x in the training corpus, and as far as possible from documents which are not attributed to c_x in the training corpus. In ELSA, the solution to the multiclass classification problem is formalized as such a group of prototypes which minimizes the distance to members of the categories they should represent and maximizes the distance to others. Given that the training corpus divides its documents into |C| classes, and given that every document and every prototype can be represented as a d-dimensional binary vector, the chromosomes which are to be optimized by ELSA are binary vectors of length |C| ∗ d. The rest is work in progress. C.f. Hromada (2015) for a comparison of ELSA with binarized LSA, non-optimized LSB, or Semantic Hashing.

Given that ELSA introduces in one unified framework three components which we claim to be cognitively plausible, id est:

1. dimensionality reduction by means of random projection
2. theory of semantic prototypes
3. evolutionary computation

and given that its binary nature predestines it to execute very fast on any transistor-based computer, we shall aim to implement ELSA, in one way or another, in the majority of simulations described in volume 2.

20 The Hamming distance of two binary vectors h1 and h2 is the smallest number of bits of h1 which one has to flip in order to obtain h2.

elsa end 10.4.7

In this section we have presented multiple architectures which all have one thing in common: they succeed in transforming textual documents into geometric and|or mathematical entities. To keep the overview as simple and concise as possible, only scalars, vectors and matrices were discussed; the reader is to be reminded that other mathematical models of semantics exist which also involve tensors of higher order. Even the very introduction of these, however, surpasses by far the objectives of this Thesis. Thus, instead of a closer discussion of fascinating topics like the interrelations between "binding operators", "circular convolution", "complex numbers" and "quantum logic" (Widdows and Cohen, 2014), we have preferred to acquaint the reader with the idea that meanings are subspaces of d-dimensional semantic spaces.
Departing from the simple word-document occurrence matrices of the first bag-of-words models, passing through LSA's ambitions to answer perennial questions - What are ideas, how are they stored and how are they accessed? - and discussing other, more natural means of dimensionality reduction, we finally approach the Point where multiple divergent streams converge into one. But before exploring it somewhat further, let's see whether the realms of semantic and syntactic categories do not have something in common. In a computational sense, for example.

semantic vector architectures end 10.4

10.5 part-of-speech induction

The term part-of-speech induction (POS-i) designates the process which endows a human or an artificial agent with the competence to attribute POS-labels (like "verb", "noun", "adjective") to any linguistic token observable in the agent's linguistic environment. POS-i can be understood as a «partitioning problem» since one's objective is to partition the initial set of all tokens occurring in corpus C (which represents the agent's linguistic environment E) into N subsets (partitions, clusters) whose members would correspond to grammatical categories as defined by the gold standard. Because one does not use any information about the «ideal» gold standard grammatical categories during the training phase and uses it only for the final evaluation of the performance of the model, POS-i is considered to be an «unsupervised» machine learning problem.

POS-i's «supervised» counterpart is the problem of POS-tagging. In POS-tagging, one trains the system by serving it, during the training phase, a sequence of couples (word W, tag T) where tag T is the label denoting the grammatical category into which the word W belongs. POS-tagging is thus simpler than POS-i, where no information about ideal labels is furnished during the learning. Training of POS-tagging systems is of particular importance especially for languages where many word forms can potentially belong to many part-of-speech categories (in English, for example, almost any noun can also play the role of a verb; a token like «still» can be interpreted as substantive, verb, adjective and even adverb (Páleš, 1994), its POS-category being determined by its context). On the contrary, in morphologically rich languages where such a «homonymy of forms» is present in lesser degrees and relations between word types and classes are less ambiguous, one can often train the POS-tagging system by simply memorizing an exhaustive list of (W, T) couples.

10.5.1 non-evolutionary pos-i

The paradigm currently dominating the POS-i domain was fully born with the article published by Brown and his colleagues in 1992 (Brown et al., 1992). Brown and his colleagues applied the information-theoretic notion of «mutual information» M:

M(w1, w2) = log ( P(w1, w2) / (P(w1) P(w2)) )

upon all word bigrams (i.e. sequences of two tokens w1, w2 which co-occur with probability P(w1, w2)) and subsequently devised a merging algorithm able to group words into classes in a way that the mutual information within a class would be maximized. In the two decades since the publication of Brown's study Brown et al.
(1992), the word n-gram co-occurrence approach has inspired hundreds of studies: be it hidden Markov Models tweaked with variational Bayes, Gibbs sampling, morphological features, or graph-oriented methods - all such approaches and many others consider co-occurrence of words with n-gram sequences to be the primary source of relevant information for the subsequent creation of part-of-speech clusters. In all these models, one aims to discover the ideal parameters of Markovian statistical models, often employing a so-called Expectation-Maximization (EM) algorithm to discover the optimal partitioning. Unfortunately, EM is unable to quit locally optimal states once it has discovered them. Notwithstanding this disadvantage, the comparative study of Christodoulopoulos et al. (2010) suggests that probabilistic models of part-of-speech induction can indeed be very performant.

POS-induction can also be realized by means of the k-means clustering algorithm, or one of its variants. The K-means algorithm (MacQueen et al., 1967; Karypis, 2002) partitions N observations, described as vectors in D-dimensional space, into K clusters by attributing every observation to the cluster with the nearest centroid (i.e. mean). If one considers these centroids to denote prototypes of the categories in the center of which they are located, then one can consider the k-means algorithm to be consistent with the «prototype theory of categorization», as proposed by Rosch. Table 9 illustrates a simple K-means partitioning of tokens present in the English version of Orwell's 1984, as contained in Multext-East (Erjavec, 2004).

cluster   nouns   verbs
0            10       3
1           568      67
2            97     668
3            13    1011
4          1173      67
5           608     958
6          1977      97

Table 9: K-means clustering of tokens according to both suffixal and co-occurrence informations. Table partially reproduced from Hromada (2014b).

In this example case we have clustered all tokens observable in the corpus into 7 clusters according to features both internal to the token - i.e. suffixes^21 - and external - i.e. co-occurrence with other tokens. Note that even in such a simple model, where no machine learning or optimization was performed, the K-means algorithm somehow succeeds in distinguishing verbs from nouns. As is shown in Table 9, whose columns represent the "gold standard" tags and whose rows denote the artificially induced clusters, even such a naïve computational model has assigned 83.6% of nouns to clusters 1, 4 and 6 while assigning 91.8% of verbs to clusters 2, 3 and 5.

non-evolutionary pos-i end 10.5.1

10.5.2 evolutionary

Usage of evolutionary computing in NLP is - in comparison to other methods like neural networks, Hidden Markov Models, Conditional Random Fields or SVMs - still very rare. This is also the case for NLP's sub-problem of part-of-speech tagging: we are not aware of any tentative to resolve the POS-i problem with evolutionary means, and of only one tentative to use genetic algorithms to train a part-of-speech tagger:

21 That suffixes are of particular importance for POS-induction is more closely demonstrated in our article Hromada (2014a).

In Araujo (2002), a system of POS-t involving crossover and mutation operators is described. What is particularly interesting about Araujo's system is that a separate evolution process is run for every separate sentence of the test corpus. The training corpus, on the other hand, serves mainly as a source of statistical information concerning co-occurrences of diverse words and tags in diverse word & tag contexts.
This information concerning the «global» statistical properties of the training corpus is later exploited in the computation of fitness. Let's take, for example, the phrase «Ring the bell». Since words like «ring» and «bell» are in English sometimes used as verbs and sometimes used as nouns, such a sentence can be tagged in at least 4 different ways:

N D^22 N
V D V
N D V
V D N

Such sequences of tags yield the individual members of Araujo's initial population of chromosomes. In languages like English, where almost every word can be attributed to more than one POS category & the number of possible tag sequences therefore increases with the length of the phrase-to-be-tagged, one will most probably be obliged to choose such initial individuals randomly.

The fitness of every individual possibly tagging a sentence of n words is subsequently calculated as a sum of accuracies of the tags (genes) at positions i:

Σ_{i=0}^{n} f(g_i)

The accuracy of an individual gene g_i is calculated as:

f(g_i) = log ( context_i / all_i )

whereby the values of context_i and all_i are extracted from the training table which was constructed during the training phase and represent the overall frequency of occurrence of word w_i within specific (context_i) and all (all_i) contexts. Once fitness is evaluated, fitness-proportional crossing-over (50%) and mutation (5%) are realized. Notwithstanding the fact that Araujo doesn't seem to have used any other selection mechanism, in less than 100 generations populations seemed to converge to sequences of tags which were more than 95% correct in regard to the gold standard. This is a result comparable to other POS-tagging systems, but with lesser computational cost. It is also worth noting that Araujo's experiments indicate that working solely with the contextual window W_L, W, W_R, i.e. just looking one word to the left and one word to the right, seems to yield, in the case of POS-tagging of English, higher scores than extracting data from larger contextual spans.

22 The non-terminal symbol D denotes the category of determiners containing such elements as the articles "the", "a / an" etc.

When it comes to the «unsupervised» variant of the POS-t problem, id est the problem of part-of-speech induction, up to this date there have been - as far as we know - no tentatives to address the POS-i problem by means of evolutionary computing. For this reason, we shall aim to propose our own solution in volume 2.

evolutionary pos-i and pos-t end 10.5.2

pos-i and pos-t end 10.5

10.6 grammar induction

The input of the Grammar Induction (GI) process is a corpus of sentences written in language L; its output is, ideally, a grammar (i.e. a tuple G = (S, N, T, P) as defined in 10.2) or a language model able to generate sentences of L, including such sentences that were not present in the initial training corpus. The nature of the resulting grammar is closely associated to the content of the initial corpus as well as to the nature of the inductive (learning) process.

According to their «expressive power», all grammars can be located somewhere on a «specificity - generality» spectrum. On one extreme of the spectrum lies the grammar having the following production rules:

1 → 2*
2 → a | b | c ... Z

whereby * means «repeat as many times as you want». This very compact grammar can potentially generate any text of any size and as such is very general.
But exactly because it can accept any alphabetic sequence and thus does not have any «discriminatory power» whatsoever, such a grammar is completely useless as an explication of the system of any natural language. On the other extreme lies a completely specific grammar which has just one rule:

1 → <corpus>

This grammar contains exactly what corpus C contains and is thus not compact at all (it is even two symbols longer than C). Such a grammar is not able to encode anything else than the sequence which was literally present in the training corpus and is therefore also useless for any scenario where novel sentences are to be generated (or accepted).

The objective of the GI process is to discover, departing solely from corpus C (which is written in language L), a grammar which is neither too specific, nor too general. If it is too general, it shall «overregularize» (9.2.4), i.e. it shall be able to generate (or accept) sentences which the common speaker of L wouldn't consider as grammatical. If it is too specific, it shan't be able to represent all sentences contained in C or, if it shall, it shan't be able to generate (or accept) any sentence which is considered to be a sentence of L but was not present in the initial training corpus C.

10.6.1 existing non-evolutionary approaches

One of the first serious computational models of GI is Wolff's «Syntagmatic - Paradigmatic» (SNPR) model (Wolff, 1988). Its core algorithm is presented in Listing 6.

Listing 6: Outline of Processing in the SNPR Model (reproduced from Wolff (1988))

1. Read in a sample of language.
2. Set up a data structure of elements (grammatical rules) containing, at this stage, only the primitive elements of the system.
3. WHILE there are not enough elements formed, do the following sequence of operations repeatedly:
   BEGIN
   3.1 Using the current structure of elements, parse the language sample, recording the frequencies of all pairs of contiguous elements and the frequencies of individual elements. During the parsing, monitor the use of PAR elements to gather data for later use in rebuilding of elements.
   3.2 When the sample has been parsed, rebuild any elements that require it.
   3.3 Search amongst the current set of elements for shared contexts and fold the data structures in the way explained in the text.
   3.4 Generalize the grammatical rules.
   3.5 The most frequent pair of contiguous elements recorded under 3.1 is formed into a single new SYN element and added to the data structure. All frequency information is then discarded.
   END

We consider the SNPR model to be of particular importance because of its aim to explain the process of Grammar Induction as a sort of cognitive optimization: «The central idea in the theory is that language acquisition and other areas of cognitive development are, in large part, processes of building cognitive structures which are in some sense optimal for the several functions they have to perform» (Wolff, 1988). Wolff also associates his «cognitive optimization hypothesis» with Brown's «law of cumulative complexity» (c.f. REFREF) which Wolff paraphrases in the statement: «if one structure contains everything that another structure contains and more then it will be acquired later than that other structure» (Wolff, 1988).

Figure 20: Equivalence classes and production rules induced from English language samples by the SNPR model. Fig. reproduced from Wolff (1988).
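Step 3.5 - the formation of a new SYN element from the most frequent pair of contiguous elements - can be sketched as follows (a hypothetical plain-Perl illustration of the compression step, not Wolff's original code):

# one compression cycle: substitute the most frequent contiguous pair
# of elements by a new non-terminal (SYN) element
my @sample = qw(a b c a b d a b c);

my %pair_freq;
$pair_freq{"$sample[$_] $sample[$_+1]"}++ for 0 .. $#sample - 1;
my ($best) = sort { $pair_freq{$b} <=> $pair_freq{$a} } keys %pair_freq;
print "new rule: SYN1 -> $best\n";           # here: SYN1 -> a b

# rewrite the sample, replacing every occurrence of the pair
my @rewritten;
for (my $i = 0; $i <= $#sample; $i++) {
    if ($i < $#sample && "$sample[$i] $sample[$i+1]" eq $best) {
        push @rewritten, "SYN1";
        $i++;                                # skip the pair's second element
    } else {
        push @rewritten, $sample[$i];
    }
}
print "@rewritten\n";                        # SYN1 c SYN1 d SYN1 c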
The grammar resulting from such a contact between the language sample and the SNPR inducing mechanism is displayed on Figure 20. In Wolff's theory, optimization is further understood as compression. Within the SNPR model, such compression is realized in part 3.5 of his algorithm (sketched above), where the most frequent pair of contiguous elements (either terminals or non-terminals) is substituted by a new non-terminal symbol. For this reason, the size of the grammar able to generate the initial language sample ideally decreases with every cycle of the model's «while» loop until the process converges to a state where there is no redundancy left to «compress».

Wolff proposes that Grammar Induction is a process which should maximize the coding capacity (CC) of the resulting grammar while minimizing its size, i.e. its Minimal Description Length (MDL). He defines the ratio CC/MDL to denote the grammar's efficiency, and it may be the case that within a more evolutionary framework, where one would work with populations of grammars, a very similarly defined notion of efficiency could be used as the core component of the fitness function. Unfortunately, Wolff's 1988 SNPR model is not evolutionary since it involves neither stochastic factors nor the notion of multiple candidate solutions. SNPR is simply confronted with the language sample, deterministically compresses redundancies in a way that sometimes resembles human grammar (and sometimes does not), subsequently gets stuck in a local optimum, and there is no way out of it.

Another famous model of GI is that of Elman (1993). Contrary to Wolff's algorithm, which is principally «symbolic», Elman's model is a «connectionist» one. More concretely, Elman succeeded in training a simple recurrent neural network which was «trained to take one word at a time and predict what the next word would be. Because the predictions depend on the grammatical structure (which may involve multiple embeddings), the prediction task forces the network to develop internal representations which encode the relevant grammatical information.» (Elman, 1993). The most important finding of Elman's study seems to be the evidence for a so-called «less is more hypothesis» which Elman himself labels with the terms «importance of starting small»: «Put simply, the network was unable to learn the complex grammar when trained from the outset with the full "adult" language. However, when the training data were selected such that simple sentences were presented first, the network succeeded not only in mastering these, but then going on to master the complex sentences as well» (Elman, 1993).

Something similar occurred also when he tuned the capacity of the «internal memory» of his networks rather than the corpus itself. Elman observed: «If the learning mechanism itself was allowed to undergo "maturational changes" (in this case, increasing its memory capacity) during learning, then the outcome was just as good as if the environment itself had been gradually complicated» (Elman, 1993). Thus, not only do the results of Elman's computational model point in the same direction as many developmental and psycholinguistic studies of «motherese» (c.f. Section 9.3); they also show the importance of gradual physiological changes for the ultimate mastering of the maternal language.
He even goes so far as to state that the prolonged infancy of human children can possibly go hand in hand with the fact that only humans develop language to the extent we do: «In isolation, we see that both learning and prolonged development have characteristics which appear to be undesirable. Working together, they result in a combination which is highly adaptive» (Elman, 1993).

Notwithstanding these interesting results, which are not to be underestimated, we see two disadvantages of Elman's approach. Primo, as is often the case for connectionist neural networks, his resulting model is somewhat difficult to interpret: given the training constraints mentioned above, the network seems to predict quite well the next word in the phrase, but it is not evident why it does what it does. Elman himself dedicates a major part of his article to descriptions of his tentatives to understand how his «blackbox» functions. Secundo, Elman confronted his model only with artificial corpora, i.e. corpora generated from manually created grammars. Thus, his model accounts only for a limited subset of properties of one language (English) and as such is still quite far from a full-fledged solution to the problem of natural language GI.

The model called «Automatic Distillation of Structure» (ADIOS) seems to be touched by this second disadvantage to a lesser extent since, as Solan and his colleagues state: «In grammar induction from large-scale raw corpora, our method achieves precision and recall performance unrivaled by any other unsupervised algorithm. It exhibits good performance in grammaticality judgment tests (including standard tests routinely taken by students of English as a second language) and replicates the behavior of human subjects in certain psycholinguistic tests of artificial language acquisition. Finally, the very same algorithmic approach also is proving effective in other settings where knowledge discovery from sequential data is called for, such as bioinformatics.» (Solan et al., 2005)

ADIOS is a graph-based model. It considers sentences to be paths in a directed pseudograph (i.e. loops and multiple edges are allowed), each sentence being delimited by special «begin» and «end» vertices. Every lexical entry (i.e. a word type) is also a vertex of the graph; thus, if two or more sentences share the same word X, they cross in the vertex V_X; if they contain the same subsequence XY, their paths share the common subpath (edge) V_X V_Y, etc.

Figure 21: Equivalence classes and production rules induced from English language samples by ADIOS algorithm. Reproduced from Solan et al. (2005).

The authors of ADIOS describe their algorithm as follows: «The algorithm generates candidate patterns by traversing in each iteration a different search path (initially coinciding with one of the original corpus sentences), seeking subpaths that are shared by a significant number of partially aligned paths. The significant patterns (P) are selected according to a context-sensitive probabilistic criterion defined in terms of local flow quantities in the graph... Generalizing the search path, the algorithm looks for an optional equivalence class (E) of units that are interchangeable in the given context [i.e., are in complementary distribution].
At the end of each iteration, the most significant pattern is added to the lexicon as a new unit, the subpaths it subsumes are merged into a new vertex, and the graph is rewired accordingly... The search for patterns and equivalence classes and their incorporation into the graph are repeated until no new significant patterns are found» (Solan et al., 2005).

In other terms, ADIOS starts with a so-called Motif Extraction (MEX) procedure which looks for bundles of the graph's subpaths which obey certain conditions. Once such «patterns» are found, they are subsequently «substituted» for non-terminal symbols and the graph is «rewired» to incorporate such newly constructed non-terminals. Such a «pattern distillation» procedure of generalization bootstraps itself until no further rewiring is possible. The output of the whole process is a rule grammar combining patterns (P) and their equivalence classes (E) into rules, able to generate even phrases which weren't present in the initial corpus. An example of how ADIOS progressively discovers more and more abstract combinatorial patterns is presented on Figure 21.

ADIOS is undoubtedly one of the most performant GI systems which currently exist. It combines statistic, probabilistic and graph-theoretic notions with the notion of rule-based grammar and as such is also of great theoretical interest. On the other hand, ADIOS does not involve any source of stochasticity; it seems to be purely deterministic and as such incapable of dealing with the highly probable convergence towards locally optimal grammars. In confrontation with some partial corpora this may possibly not cause any problems but, we predict, without any stochastic variation whatsoever, ADIOS could not account for more than a few «advanced» & real-life properties of natural languages and as such shall possibly share the destiny of the SNPR model.

end non-evolutionary gi 10.6.1

10.6.2 existing evolutionary approaches

Multiple authors have proposed to solve the GI problem with different variants of evolutionary computing - in the following paragraphs we shall describe five different approaches:

1. hill-climbing induction of finite state automata (Tomita, 1982)
2. the GIG method for inference of regular languages (Dupont, 1994)
3. evolution of stochastic context-free grammars (Keller and Lutz, 1997)
4. an evolutionary method of inducing grammars from POS tags of nine different English language corpora (Aycinena et al., 2003)
5. the genetic algorithm of Smith & Witten (Smith and Witten, 1995) for inducing a LISP s-expression grammar from a simple corpus of English sentences

Tomita's 1982 paper can be considered to be one of the first empiric studies of grammatical inference. The study focused on inference of grammars of 14 different regular languages - which are often called «Tomita languages» in subsequent literature - by means of deterministic finite state automata. Tomita first encoded any possible finite state machine with n states in the following manner:

((A_1, B_1, F_1)(A_2, B_2, F_2) ... (A_n, B_n, F_n))

whereby every block «(A_i, B_i, F_i) corresponds to the state i, and A_i and B_i indicate the destination states of the 0-arrow and the 1-arrow from the state i, respectively. If A or B is zero, then there is no 0-arrow or 1-arrow from the state i, respectively. F_i indicates whether state i is one of the final states or not. If F_i is equal to 1, the state i is one of the final states.
The initial state is always state 1.» (Tomita, 1982) Thus, for example, the string ((1 2 1) (3 1 1) (4 0 0) (3 4 1)) encodes the finite state automaton illustrated on Figure 22.

Figure 22: Finite state automaton matching all strings over (1 + 0)* without an odd number of consecutive 0's after an odd number of consecutive 1's. Reproduced from Tomita (1982).

Such encoding allowed Tomita to subsequently apply his hill-climbing approach. Hill-climbing can be considered to be a precursor of the more extended genetic programming, since it employs both random mutations to explore the surrounding search-space and a sort of selection algorithm which always prefers to use, in the following iteration of the algorithm, such individual solutions for which the value of the evaluation function E increases. Tomita's definition of E is very simple:

E = r − w

«where r is the number of strings in the right-list accepted by the machine, and w is the number of strings in the wrong-list accepted by the machine» (Tomita, 1982). The right-list is a positive sample corpus while the wrong-list is the negative sample. Thus, if a random mutation transforms an individual X_n into an individual X_{n+1} so that E(X_{n+1}) > E(X_n) - i.e. if an automaton is discovered which matches more positive sequences, or fewer negative sequences, or both - it will be X_{n+1} which will be mutated in the next cycle of the algorithm.

Tomita's approach cannot be considered to be fully evolutionary because he did not use populations, nor did he employ any kind of cross-over operator. For this reason, Tomita's regular grammar-inferring algorithm did sometimes get stuck in local maxima from which there was no way out. Notwithstanding this small imperfection - of which Tomita himself was well aware - his work served, and still serves, the role of an important hallmark on the path to full-fledged GI.

Dupont (1994), for example, also focused his study on the induction of 15 different regular Tomita languages. In his formally very sound work, he defines the problem of inference of regular languages as a problem of finding an optimal partition of the state space of a finite «maximal canonical automaton» (MCA) able to accept the sentences from the positive sample. The fitness function also takes into account the system's tendency to reject the sentences contained in the negative sample. By using a so-called «left-to-right canonical group encoding», Dupont succeeds in representing diverse individual automata in a very concise way which allows him to subsequently evolve them by means of structural mutation («the structural mutation consists of a random selection of a state in some block of a given partition followed by the random assignment of this state to a block» (Dupont, 1994), e.g. MUTATE(((1, 3, 5), (2), (4))) → ((1, 5), (2, 3), (4))) and structural crossover («the structural crossover consists of the union in both parent partitions of a randomly selected block» (Dupont, 1994), for example ((1, 4), (2, 3, 5)) ⊗ ((1, 3), (2), (4), (5)) → ((1, 3, 4), (2, 5)) and ((1, 3, 4), (2), (5))).

Because «the search space size dramatically increases with the size of the positive sample, making the correct identification more difficult when we have a larger positive information on the language» (Dupont, 1994), Dupont has also proposed an incremental procedure allowing one to start the search process from a smaller yet pertinent region of the search space. The procedure unfolds as follows: «first sort the positive sample I+ in lexicographical order.
Consequently, the shortest strings are first taken into account. Starting with the first sentence of I+, we construct the associated MCA(I+) and we search for the optimal partition of its state set under the control of the whole negative sample I−. Let A1 denote the derived automaton with respect to this optimal partition. Let snext denote the next string in I+. If snext is already accepted by A1, we skip it.» (Dupont, 1994) Otherwise, the automaton A1 is extended so that it can also cover snext. The search under the control of the whole negative sample is then restarted and the whole process is repeated until all sentences from the positive sample have been considered.

With a population size of 100 individuals, a maximum number of 2000 evaluations, crossover rate 0.2, mutation rate/bit 0.01 and the semi-incremental procedure implemented, Dupont's approach attained, on average, a classification rate of 94.4%. For five among the fifteen Tomita languages, grammars were constructed which attained 100% accuracy (i.e. accepted all sentences from the positive sample and rejected all strings from the negative sample). Results have also indicated that if ever the semi-incremental procedure is applied, the sample size has a positive influence upon the accuracy of inferred grammars - a bigger sample yields more accurate grammars.

While Tomita's results indicate, and Dupont's results further confirm, the belief that induction of grammars by means of evolutionary computing is a plausible thing to do, they do so only in regard to the simplest type of grammars - the regular ones. Grammars of natural languages, however, are definitely not regular, and models of GI of more expressive «context free» (CFG) or «context sensitive» grammars are needed.

Keller and Lutz (1997) employed a genetic algorithm to evolve parameters of stochastic context-free grammars (SCFG) of 6 different languages. SCFGs are similar to traditional CFGs (see 10.2 for the definition of CFGs), but extended with a probability distribution, so that there is a probability value in the range [0, 1] associated to every production rule of the grammar. These values are called the SCFG's parameters and these are the values which the algorithm of Keller & Lutz aims to optimize by means of GAs. Their approach involves the following steps:

1. Construct a covering grammar that generates the corpus as a (proper) subset.
2. Set up a population of individuals encoding parameter settings for the rules of the covering grammar.
3. Repeatedly apply genetic operations (cross-over, mutation) to selected individuals in the population until an optimal set of parameters is found.

Their fitness function F(G) is based on the idea of Minimal Description Length (MDL). More formally, Keller & Lutz aimed to maximize:

F(G) = K_C / ( L(C|G) + L(G) )

by minimizing the denominator, which is defined as the number of bits needed to encode the grammar G (L(G)) plus the number of bits needed to encode corpus C, given the grammar G (L(C|G)). The numerator K_C is just a corpus-dependent normalization factor assuring that the value of fitness shall be in the range [0, 1]. When confronted with positive samples of cca 16000 strings (typically of length 6 or 8) of 6 different context-free languages:

1. EQ : language of all strings consisting of equal numbers of a's and b's
2. language a^n b^n (n > 1)
3. BRA1 : language of balanced brackets
4. BRA2 : balanced brackets with two sorts of bracketing symbols
5. PAL1 : palindromes over a,b
6. PAL2 : palindromes over a,b,c

their algorithms converged, in the majority of cases, to such combinations of parameters of their SCFGs which allowed them to accept more than 95% of the strings presented in the positive sample. Such results indicate that genetic algorithms can be used as a means for unsupervised inference of parameters of stochastic context-free grammars.

Note that Keller & Lutz confronted their algorithm, during both testing and training, only with the positive sample. While doing so for training is justifiable - since the objective of their study was to investigate whether grammars can be inferred solely from positive evidence - testing only on the positive sample makes it uncertain to what extent their inferred grammars overgeneralize. Another huge disadvantage in regard to the aims of our Thesis is the simple fact that their approach also seems to be very costly («number of parses that must be considered increases exponentially with the number of non-terminals» (Keller and Lutz, 1997)). And since they confronted their algorithms only with corpora composed of sentences of artificial and not natural languages, we shall not aim to imitate their approach of «tuning SCFG parameters» in our simulations.

By being context-free and not simply regular, the grammars studied in Keller and Lutz (1997) or (Choubey and Kharat, 2009) could be considered to be more similar to grammars of natural languages. Nonetheless, languages composed of palindromes and sequences of balanced brackets are still a far way off from natural languages, and the question «to what extent are results concerning GI of artificial languages applicable to GI of natural languages?» is far from being answered. Rather than trying to answer it, we now proceed to a discussion of two approaches where evolutionary GI has been applied upon natural language sentences.

The first method, proposed in Aycinena et al. (2003), focuses on the induction of CFG grammars from nine different part-of-speech-tagged natural language corpora. Sentences contained in these corpora, thus composed of sequences of part-of-speech tags (see Section 10.5), were used as positive examples, while randomly generated sequences of POS-tags yielded negative examples. The initial population was composed of linear encodings of randomly generated context-free grammars; for example, the string SABABCBCDCAE would represent this CFG:

S → AB
A → BC
B → CD
C → AE

During the evaluation of an individual grammar G, one would first try to parse both positive and negative corpora with the grammar G and subsequently calculate the final fitness by applying the following formula:

F = γ^{max(0, |α| − |P|)} C(α) − δ I(α)

«where P is the set of preterminals, C(α) is the number of parsed sentences from the corpus, I(α) is the number of sentences parsed from the randomly generated corpus, δ is the penalty associated with parsing each sentence in the randomly generated corpus, and γ is the discount factor used for discouraging long grammars.» (Aycinena et al., 2003)

In their study, Aycinena and her colleagues placed a randomly generated population of 100 individual grammars on a two-dimensional 10 x 10 torus grid. Subsequently, they applied the following select-breed-replace strategy:

1. Select an individual randomly from the grid
2. Breed that individual with its most fit neighbor to produce two children
3. Replace the weakest parent by the fittest child

In their framework, «cross-over is accomplished by selecting a random production in each parent.
Then a random point in these productions is selected and cross-over is performed, swapping the remainder of the strings after the cross-over points» (Aycinena et al., 2003). Every symbol of a resulting string can be subsequently mutated (mutation rate = 0.01). «A mutation is simply the swapping of a non-terminal or pre-terminal with another non-terminal or pre-terminal» (Aycinena et al., 2003).

Figure 23: Grammars induced from nine different POS-tagged corpora. Reproduced from Aycinena et al. (2003).

Figure 23 shows the number of generations each run was able to complete, the grammar G that last evolved, the percentage of positive examples parsed by G, the percentage of negative examples parsed by G, and G's fitness. While the results displayed above may seem encouraging, the authors have noticed that in the majority of cases their approach «gives a grammar that is very capable of detecting whether a sentence is valid in English, but it has not learned much English structure» (Aycinena et al., 2003). In other terms, Aycinena et al. succeeded in breeding grammars which have a certain discriminatory power but are practically useless as models of the English language. They even go so far as to state, in the ultimate paragraph of their work, that «It is still possible that English grammar is too complex to be learned from a corpus of words» (Aycinena et al., 2003) and that other external clues are necessary for successful GI of English.

A big disadvantage of the above-mentioned algorithm was also the fact that its input was sequences of already attributed POS-tags and not sequences of words themselves. Thus, even if the approach discovered some interesting grammars, a justified reproach could be made that in fact it only re-discovered the rules of the tagging system which was used in the first place. From the perspective of our Thesis, another disadvantage of Aycinena et al.'s approach is related to the fact that it is anything but a model of grammar development in a human child. For it is evident that children learn the grammar of their language in an incremental fashion - they are not confronted with the whole corpus from the very beginning. Nor does the corpus stay identical after each iteration of the learning process. On the contrary: as the child grows, its linguistic environment - the corpus - also grows. Both in length and complexity.

An interesting evolutionary approach to GI which both tries to create its own non-terminal categories and takes such «incrementality» into account is presented in the work of Smith and Witten (1995). In their scenario, candidate grammars are evolved after the presentation of every new sentence. Grammars have the form of LISP s-expressions whereby AND represents a concatenation of two symbols (i.e. a syntagmatic node) and OR represents a disjunction (i.e. a paradigmatic node). The whole process is started as follows: «The GA proceeds from the creation of a random population of diverse grammars based on the first sample string. The vocabulary of the expression is added to an initially empty lexicon of terminal symbols, and these are combined with randomly chosen operators in a construction of a candidate grammar... If the candidate grammar can parse the first string, it is parsed into the initial population.» (Smith and Witten, 1995)

Figure 24: Two simple grammars covering the sentence "the dog saw a cat". Fig. reproduced from Smith and Witten (1995).
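The binary AND|OR representation can be made concrete with a short plain-Perl sketch (a hypothetical illustration of the representation, not Smith & Witten's code) which enumerates all strings covered by such an s-expression grammar:

# enumerate all strings covered by a binary AND|OR s-expression grammar
sub expand {
    my ($node) = @_;
    return ($node) unless ref $node;           # terminal symbol
    my ($op, $l, $r) = @$node;
    my @left  = expand($l);
    my @right = expand($r);
    return (@left, @right) if $op eq 'OR';     # paradigmatic node
    my @out;                                   # syntagmatic node (AND)
    for my $x (@left) {
        for my $y (@right) { push @out, "$x $y"; }
    }
    return @out;
}

my $grammar = ['AND', ['OR', 'the', 'a'],
               ['AND', 'dog',
                ['AND', 'saw', ['AND', ['OR', 'a', 'the'], 'cat']]]];
print "$_\n" for expand($grammar);
# the dog saw a cat / the dog saw the cat / a dog saw a cat / a dog saw the cat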
Figure 24 displays two sample grammars for the sentence «the dog saw a cat». S-expression sequences representing individual grammars are subsequently mutated. Couples of parent grammars can also switch their nodes - the probability of being chosen for such cross-over is inversely proportional to the grammar's size: shorter grammars are preferred. Crossover is non-destructive; parents thus also persist. The events of reproduction are grouped in cycles; at the end of each cycle, the population of candidate grammars is confronted with a new sentence from the sample of positive evidence. In their article, Smith & Witten demonstrate how, after the presentation of the sentences «the dog saw a cat», «a dog saw a cat», «the dog bit a cat», «the cat saw a cat», «the dog saw a mouse» and «a cat chased the mouse», their system naturally converged to a grammar which had quite correctly subsumed determiners like «a», «the» under one group of OR nodes, verbs like «chased», «saw», «bit» under another, and nouns like «dog», «cat», «mouse» under yet another.

The grammar which they finally obtain is not ideal but, as they argue, it could get better if confronted with new sentences. «It is an adaptive process whereby the model is gradually conditioned by the training set. Recurring patterns help to reinforce partial inferences, but intermediate states of the model may include incorrect generalizations that can only be eradicated by continued evolution. This is not unlike the developing grammar of a child which includes mistakes and overgeneralisations that are slowly eliminated as their weaknesses are made apparent by increasing positive evidence.» (Smith and Witten, 1995)

While strongly agreeing with the above citation, we nonetheless cannot ignore certain drawbacks of Smith & Witten's approach. Most importantly, by using LISP's s-expressions as a way of representing their grammars, they ultimately have to end up with highly bifurcated binary trees (since the arity of the AND|OR operators is 2). Thus, one can easily subordinate two terminals to one non-terminal (e.g. OR(cat,dog)), but in the case of three subordinated terminals, one is obliged to use a complex expression involving three non-terminal nodes (e.g. OR(OR(cat,dog),OR(mouse,NULL))). Therefore, in such an s-expression-based representation, any class having more than two members is necessarily represented by a longer sequence → is more prone to mutation → is highly «handicapped» in regard to much shorter expressions subordinating just two nodes. Another drawback of Smith & Witten's work which cannot be ignored is related to the fact that while they used English language sentences to train their system, the sentences were very simple and the relevance of their findings to GI of «natural» English is more than disputable. In fact, they seem to achieve, with quite complex evolutionary machinery, even less than Wolff's deterministic SNPR model had achieved almost a decade before. Notwithstanding these two drawbacks, we nonetheless consider as particularly inspiring their approach aiming to solve the problem of GI of natural languages by uniting, in one framework, the notions of adaptability, evolvability and statistical sensitivity to recurring patterns.

We summarize: all five above-mentioned approaches indicate that evolutionary computing can potentially yield useful solutions to the problem of Grammar Induction of both artificial (regular, context-free) and natural language grammars.
The length of the candidate grammar is frequently used as an input argument of the fitness function. Note also that the solutions of both Dupont and Smith & Witten use a sort of «incremental» procedure whereby individual solutions gradually adapt to every new sentence. Especially Dupont's findings are reminiscent of what was already said about the «importance of starting small» when discussing the computational model of Elman (Section 10.6.1).

On the other hand, none of the above-mentioned models was confronted with a corpus of child-directed (i.e. «motherese») or child-originated utterances. The objective of our Thesis shall be to fill this gap.

end evolutionary models of gi 10.6.2

Aside from these non-evolutionary and evolutionary algorithms for grammar induction, there also exist first tentatives to solve the GI problem by means of Grammar Systems (10.2.3). The pioneering work in this regard is the study of Sosík and Štýbnar (1997). Contrary to the majority of GS-inspired authors, who focus on the productive (i.e. generative) aspects of GS, Sosík & Štýbnar focused on GS's language-accepting properties. In a hybrid connectionist-symbolic architecture, they used a «neural pushdown automaton» to infer a language colony (10.2.3) able to cover some simple artificial context-free languages like balanced-parenthesis or palindrome languages. While their results demonstrate that it is indeed viable to perform grammatical inference by means of grammar systems, the artificial nature of the input languages makes it difficult to see whether their approach could be of any use in modeling the acquisition of natural language.

This being said, we conclude with the statement that as of 2015, ADIOS (Solan et al., 2005) seems to be the only full-fledged computational model of unsupervised grammar induction which is

• publicly available (at least partially^23)
• capable of inducing grammars even from child-speech transcript input data (Brodsky et al., 2007)

For this reason we shall compare, in the second volume of this Thesis, results of our ELSA-based simulations with those induced by ADIOS.

end grammar induction 10.6

As of 2015, NLP is one of the "hottest" and most active sub-disciplines not only of computational linguistics, but also of computer and, potentially, cognitive sciences in general. Without being aware of it, the lives of billions of people are influenced on a daily basis by client platforms, applications, marketing bots or search engines which implement some kind of NLP technique. In NLP, accuracy - defined, for example, in terms of precision and recall (10.3.1) - is always important because it is easier for human users to interact with more accurate systems. But in real-life applications, accuracy is not the only constraint which has to be taken into account: speed and computational complexity of the task are also crucial.

23 A demo version of ADIOS can be downloaded from http://adios.tau.ac.il/download.html

To support our point, let's take the Turing test as an example: question-answering systems which need hours to generate the most accurate and valid answer shall not pass the test; the test shall be passed by machines which offer an approximate answer in a few seconds.
Hence, even in case of the challenge from which the whole discipline of NLP originates, accuracy of one's model is not a goal per se and is, in fact, useless if one forgets that the expression «natural language» does not denote a piece of dead, static corpus stored on one's disk, but rather a set of sequences of symbols always expressed in a context, and always expressed with an intention. end natural language processing 10.6.2

Computational linguistics is a symbiont of computer science and linguistics. In this chapter, we have explored its three principal components:

1. Quantitative and Corpus Linguistics (QCL), devoted to discovery of patterns and laws within linguistic corpora

2. Formal Language Theory (FLT), devoted to formalization of principles of syntax in terms of set theory and algebra

3. Natural Language Processing (NLP), devoted to amelioration of machines' faculty of processing the information which they exchange with human beings

During the introduction to QCL, Zipf's law («the frequency of a word is inversely proportional to its rank», i.e. f(r) ∝ 1/r) and the logistic law («the increase is first slow, then fast, then slow again») were discussed in somewhat closer detail. It was noted that both of these laws are relevant descriptive mechanisms for diverse diachronic processes, in linguistic ethnogeny as well as in linguistic ontogeny. The fact that both of these laws yield very successful models for description of ecological phenomena was also brought to attention.

The brief overview dedicated to FLT has offered only the very basic definitions: language L was defined as a potentially infinite set of strings of symbols chosen from a finite alphabet; grammar G_L was defined as a formal system containing rules of production able to generate, as its theorems, exactly all and only strings of L. Classes of regular, context-free, context-sensitive and unrestricted grammars were described and the usefulness of such a hierarchical view of things was mentioned, notably in relation to artificial (e.g. programming) languages. A brief excursion to multi-agent, non-monolithic, parallelized and modular «Grammar Systems» has illustrated that «miraculous» things – like the ability to generate an infinite language by the interlock of two finite grammars – can happen whenever individual component grammars share their input/output string environments.

The major part of the chapter was dedicated to NLP. Methodological aspects which NLP shares with the machine learning field of artificial intelligence were first pointed out. Subsequently, three classes of problems were addressed:

1. the problem of geometrization of meaning was principially presented as projection of semantic features into N-dimensional metric spaces

2. the problem of part-of-speech induction was principially presented as projection of morphosyntactic features into N-dimensional spaces + subsequent attribution of specific partitions of morphosyntactic metric spaces with specific non-terminal labels

3. the problem of grammatical induction was principially presented as a problem of part-of-speech induction + gradual optimization of content and order of substitution rules

A few exemplar solutions to these problems were mentioned, both deterministic and, if existing, also non-deterministic and evolutionary. It was noted that some encouraging results were already attained but that there is still plenty of work to be done. So let's do it. end computational linguistics 10.6.2

SUMMA II

Different paradigms have been presented in preceding chapters:

1. universal darwinism
2. developmental psycholinguistics

3. computational linguistics

the 1st offering the theoretical framework; the 2nd offering the data, the materia, the object of interest; the 3rd offering the method by which the validity of the theory in relation to the materia is to be ultimately demonstrated.

The framework: a theory of intramental evolution. Id est, a theory stipulating that not only genes or memes evolve, but that there exists yet another, third kind of evolutionary force which moulds man's destiny. An evolutionary force which is neither phylogenetic like the unceasing development of the DNA-molecule, nor ethnogenetic and cultural like the memetic evolution occurring between mutually communicating minds. An evolutionary force which is profoundly ontogenetic: a sort of process limited by the life span of the individual in whose mind the process occurs.

The materia, the object of interest: a mind of a child. Id est, a mind in constant change, an exploring mind, a playful mind. A mind that masters, in less than three years of existence and practically completely ex nihilo, the most fundamental structures of her mother language. Indeed, in less than three years do the representations encoding the universally perturbing cry of a newborn into our world evolve into ever more precise, robust and well-adapted prosodic, phonological, phonetic, morphosyntactic, semantic and pragmatic representations. Being unafraid of committing an error and feeling no shame nor guilt when doing so, a soul of an infant, previously so alien to our world, gradually and swiftly learns how to live in it. Gets grounded in it, gets informed how to live in it with us.

The method: a computational simulation. Id est, a simulation aiming to reproduce, in silico, at least a few key processes through which a child learns its mother language. A simulation that would succeed to partition the world of its representations into categories or clusters similar to those which an organic child would construct, if ever presented with the same data. A simulation able to discover and provide grammars whose products would be indistinguishable from utterances produced by normal human children in course of their daily interactions. If such a goal were to be attained by means of evolutionary computation, the success of such a simulation could be used as a non-invasive, indirect proof that a sort of ontogenetic, intramental evolutionary process governs the process of language acquisition in human children.

A theory of intramental evolution, a mind of a child and a computational simulation: among this trinity of cornerpoints embedded in a semantic space representing our current knowledge, one can observe overlapping regions, one can observe common topics. To start with, note the notion of a gradual yet continuous change: no matter whether in subdisciplines of UD, DP or CL, outputs of phase T_N serve as inputs for the next phase T_{N+1}. There exists an analogy between successive stages of a developing child and successive iterations of an NLP algorithm: both invest present energy into processing of knowledge attained in the past so that more accurate performance can be attained in the future. Cognitive representations continuously change but the processes which make the change possible are always present. Note that such gradually changing continuity does not exclude that, from time to time, paradigm-shifting, phase-transiting phenomena shall be observed.
On the contrary, such moments of global equilibration of the whole psycholinguistic system are necessarily implied by any theory that considers the child's linguistic faculty in moment T to be a nexus of parallel activity of many modular entities whose means of interaction are complex and potentially non-deterministic. The notion of «parallel activity» is thus equally crucial both for the theory, as well as for correct understanding of the observations and simulations that shall follow. That the human brain is a device which processes information is a well-known fact; the sequential nature of language can, however, lead one to the conclusion that language is processed in a monolithic, serial fashion. To a somewhat «monotheistic» conclusion that to every language utterance (in production) or to its understanding (in comprehension) there leads only one correct sequence of applications of rules extracted from one correct grammar. We consider such conclusions fallacious. Knowing how nature usually tends to proceed, we do not consider it necessary to postulate a cold, fixed, static, formal, universal and omnipresent order there, where much more local notions of dynamism, variation, interaction, exchange and convergence clearly suffice. Given that the notion of «convergence» is flexible enough to account for the fact that, in course of time, completely different species (e.g. humans and cephalopods) «obtained» an organ with identical function (e.g. the eye) by following two completely different evolutionary trajectories, we believe that it should also be flexible enough to explain the «mystery» of language acquisition: Children learn language by converging to it. And as we shall now proceed to demonstrate, it is through interaction with peers and parents that the point of convergence is to be discovered. end synthesis of part ii 11

Part III

OBSERVATIONS

A child's spontaneous remark is more valuable than all questioning in the world.
Jean Piaget

This part shall describe certain observations related to ontogeny of linguistic structures and interpret them in terms of the theory of intramental evolution. Its first chapter is principially a longitudinal qualitative study of one particular human child. At its beginning, a non-invasive, phenomenological, observational data-collecting method shall be described and a few salient moments of the subject's prenatal and postnatal development shall be mentioned. The major part of the study shall be devoted to the subject's linguistic development during the toddler period, id est between 10 and 30 months of age. Among others, some of the subject's first words, first phrases, first pivot grammars and first variation sets shall be presented. A set of «evolutionary» notions shall be developed and defined in order to facilitate the interpretation of the obtained data in evolutionary terms. Notions like intralexical|intraphrastic|interlinguistic crossover shall thus be introduced and multiple real-life cases shall be furnished for each notion. These notions will play an important role in the following chapter devoted to quantitative observations. When possible, they will be transcribed into the form of PERL-compatible regular expressions. The corpus of child-language transcriptions (CHILDES) shall be subsequently processed by such regexps in a series of simple and reproducible data-mining, pattern-extracting experiments. Ideally, patterns and statistical regularities shall be discovered which are not only language-specific but also language-independent.
That is, occurring not only in English but ideally in all languages attested in the CHILDES corpus.

12 QUALITATIVE

12.1 method and data collection

Limits of traditional method

In no domain of scientific endeavour are the limits of the Galileo-Cartesian dubitating yet experimental method as visible and problematic as in studies of subtle mental and psychic layers of human subjects. And in case of studies of human children, this problematic situation is marked to the very extreme: due to a sort of psychosocial uncertainty principle, the very act of observation significantly modifies the properties of the observed subject. Trying to fix a healthy, curious, vivid human child in an artificial experimental setting is plainly and simply contradictory to any tentative of evaluation of the child's natural behaviour.

Significance levels are arbitrary

Neither is the traditional quantitative «psychological» paradigm – in which one proves one's hypothesis through statistical comparison of a study group with a control group – reassuring. Even if all went well and one would succeed to solve the unsolvable and limit the influence of external and hidden variables to a very minimum, and even if all children would behave as expected during the experiment (a very improbable «if» indeed), and even if all subsequent statistical evaluation was sound and solid, one would end up with one null hypothesis, a few coefficients and a p-value. «So You state that those kids cross-over such linguistic structures and those others don't. And that the difference is significant because the p-value is 0.045. But, You know, our community has decided not to bow in front of the Fisher-defined p<0.05 significance level threshold (Fisher, 1925)» could be a provocative, yet valable denial of such a result.

Problem of experimental invasivity

Asides and above all such criticisms thrones the ethical problem of the invasivity of one's experiments. One cannot have a theory that postulates that any stimulus – no matter how small and ephemere – can influence the child's lifelong trajectory and still aim to prove such a theory by means of putting a child into artificial, non-human, mentally perturbing experimental conditions. Of course, in the mental world of experimentators who depart from the axiom that children neither feel nor reason, such a methodology is still allowed. Others can also somehow bridge the cognitive dissonance which necessarily follows. But for a scientist who departs from the belief that children feel and reason much more than adults ever shall – and such was, indeed, our bias of departure – experimental invasivity is an important κριτήριον which significantly constrains one's ways of doing responsible and sustainable science (Hromada, 2010b).

Given that our objectives were not (medi|clini)cal but rather those of recherche fondamentale, we have not found any reason which could potentially justify the use of any kind of invasivity. All these considerations + practically zero funding taken into account, the traditional quantitative methodology of experimental psycholinguistics was discarded as inappropriate cca in the 2nd year of our doctoral studies. Such a methodological design choice was further motivated by the information announcing the «good news» that a child was to be born, in whose closest presence we could spend the years to come.
This has put us into the position of savants like Piaget, Braine, Labov or Tomasello, who all had the honor and luck to confront their theories with years-lasting, longitudinal observations of their own children. Thus, our rejection of the purely Cartesian attitude seems not to have had disastrous consequences, neither for the validity nor for the reproducibility of the observations which have followed.

And what have followed is this: from the moment of the subject's birth (0;0;0), the author of this dissertation has kept a journal. The journal was first written as an objective «observation log» but quite soon (0;7) it obtained the form of a personal monologue addressed, in the 2nd person singular, to the adult person which the child herself shall, ideally, become. Entries in the journal have been written down according to a sort of biased, random sampling procedure: that is, whenever the subject generated an event which was sufficiently salient and whenever all other conditions were fulfilled (i.e. father observed the event or mother told about it to the father; journal was in the proximity; pen or pencil was in the proximity; observer had enough time to note the observation down, etc.), then and only then was the entry written down. Given such a relaxed methodology, 123 hand-written journal pages have been filled with 167 records (14 recorded by mother; 153 by father) before the subject attained the upper bound of the toddler period (2;6;0).1 What shall follow in this chapter are principially biased descriptions and biased interpretations of such biased observations.

12.1.1 biases

parvuli deûm regnum

In the retrospective analysis of the observation journal, which started at (2;6;0) and ended at (2;6;14), the observer is struck by the omnipresence of the following biases:

1. the observer considers the subject to be endowed with consciousness

2. the observer considers the subject to be somebody who shall evolve into a conscious adult

3. the observer is the parent of the child

4. the observer and the vast majority of other personnae mentioned in the journal seem to be strongly attached to the child with a bond which is difficult to describe without referring to the meaning of the word «love» (14.4.2)

5. the observer focused on noting down the observations which match his theory and was, in fact, unable to note down observations which do not match the theory

1 Asides the hand-written journal, cca 20 gigabytes of audiovisual material were collected, often in situations when the subject played, ate, was in REM-phase, danced or simply toddled and babbled. With two or three minor exceptions, this data shall not be published in the present work.

The disadvantage of such biases is that they distort the objective state-of-affairs. But this disadvantage can be reduced if such biases are known. And in case of biases 1-4, the disadvantage can even turn out to be advantageous: for these biases are well-known to the vast majority of those who were ever blessed with having a child. Thus, instead of making our observations more subjective, they help us to establish a common prism through which our communicative intention could potentially be understood. The 5th bias, of course, is problematic. From our current perspective there is little which can be done to combat such a sort of cognitive blindness which has made us ignore practically all data which does not fit our theory.
Thus, instead of solving the situation by pretending that we have observed all that was to be observed, we prefer to honestly admit that, in regards to all that could have been noted down but wasn't, the observer has often acted as a strongly biased, cognitively blind, hormonally reprogrammed Fachidiot. end biases 12.1.1

It has been known since time immemorial that a conscious, reflected, sattvic awareness of one's biases is a condition sine qua non of a viable and valable methodology. But it was Husserl and his followers who gave the method its western name by calling it the phenomenological method. It is, indeed, a sort of phenomenological methodology which can be understood as the method behind the words to come. end method 12.1

12.2 subject

The subject was conceived as a result of an emotionally charged yet fully conscious decision of two adult individuals. In the prenatal period, the mother had included consumption of magnesium-rich mineral waters and iron-containing supplements into her otherwise healthy, dairy&vegetable&fish-based diet. Pregnancy progressed without any major complications and in the 6th month (0;-3), the father could feel, during a week-lasting music festival, that the child was already able to attune its kicking with musical beats. Birth occured approximately three weeks before the expected term and was probably precipitated by the mother's passing in the proximity of an active asphalt-drilling machine. Birth itself lasted an exhausting 28 hours: the mother asked for an epidural injection after 23 hours of tentatives to mentally influence the extent of cervical dilation. From now on, the initial letters I.M. of her two names shall be used to refer to the healthy girl thus born.

Given that the first-born came to the world approximately two months before winter solstice, her first tentatives to move corresponded with the increased luminosity of longer days. Standard unfoldment of the universal sensori-motor algorithm followed: rotation from back to belly at (0;5), first unsuccessful crawling tentatives at (0;5;25), sitting on a chair at (0;7;14), crawling on four at (0;8), autonomous standing at (0;10) and the first step at (0;11;20). Lateralization expressed by a right-handed object-manipulation preference was noted down at (0;6;25). Eruption of the first teeth was noted at (0;9;17). In spite of this fact, breast-feeding continued until the material bond between mother and daughter was broken – after multiple unsuccessful tentatives – at (1;10) by a more or less bilateral agreement of both participants involved. In the toddler period, neither IM nor the members of her closest social surroundings suffered any serious illness or traumatic experience. IM can thus be considered as what is often known in the developmental literature as a «normal child». end subject 12.2

12.3 linguistic environment

IM's linguistic competence developed in a multilingual environment. Both parents are of Slovak (western Slavic) origin. However, since the mother spent more than half of her life in Germany, and since the child was born and raised in Germany, IM-directed «motherese» was at least 60% German-based. The father migrated to Germany just a few months before the child was born and was thus struggling with the problem of secondary language acquisition in practically the same period as the child was struggling with first language acquisition. Between themselves, the parents spoke mostly Slovak. Father's IM-oriented language was also mostly Slovak.
173 174 qualitative But in majority of other regular daily interactions, IM was mostly exposed to german. In non-negligeable amount of cases, IM could observe one or both of her parents verbally interact in czech, english, french, spanish and, in much lesser extent, polish, ukrorussian, sanskrit and tibetan (sorted in descending order according to structural exposure frequency). IM started going to creche two days after her first birthday (1;0;2). There, she was mainly surrounded by peers verbally interacting by means of german-ressembling idioglossias. end linguistic environment 12.3 12.4 crying and babbling After few months of more&more differentiated crying forms, "happy cooing" was, along with smiling, noted down at (0;2;18). Three months later, as soon as of (0;5;11) mother had noted down the presence of canonical babbling sequences bäh, bäh, bäh; dwn dwn dwn; mamamama. In the same record the mother conjectures that the sequence hop hop hop corresponds to knee-bending and tou tou tou corresponds to stretching of hands. Being more sceptical about IM’s ability to verbally communicate, paternal record from (0;5;25) observed in child’s production the presence of vocalizations with occlusive labial, velar, glottal and laryngal features. Glide-like dwndwn like and trill-like drndrn were also noted during the period. Paternal scepsis notwithstanding, a synchronicity between the overall context and child’s communicative intention had made the father to note down, already at (0;7;14), the hypothesis that bwí could potentially mean porridge [Breie]. First sequences composed of different syllables were observed at (0;7;23). At (0;10;13), babbling sequence of a sort tititatatetedededidi was recorded and a week later, syllables ma;pa;ba;ta;da;te;ti;ne;de; me; pe;be;we;bwe were enumerated as most salient. As late as (1;8;7) such canonical babbling was listed as one among multiple modes of communication: 1. crying of a hungry newborn 2. squeling disapproval of a pampered child 3. "mentor mode" (observable especially when IM communicated with smaller children, acommpanied with vivid gests) 4. melody singing (especially when in stroller or in bike sit) 5. canonical babbling 12.5 first words In spite of the fact that both bursts of cry as well as production of expressions highly repetitive yet gently variating syllabic streams was observed as far as the end of toddler period2 , we conclude that the babbling schemas had lost their dominant position not later than at (1;6). For at this period it became evident that at least certain forms of IM’s language had lost their private, idioglottic character. Convergence of IM’s neurolinguistic structures towards an optimal communicative system was on its way. end babbling 12.4 12.5 first words As was indicated in the previous section, mother had detected the sequence "mama mama" as early as of (0;2;18). Father had noted down the marked repetition of sequence m@m@ at (0;7;19) and few days later, at (0;8) had noted down that m@m@denotes disagreement. However, it was only at when (0;9;17) father had noted down that "it is possible that the term m@m@- which becomes more and more phonetically similar to MAMA3 - already denotes the mother not only as a source of food, but also as a person whom You love and whose presence makes You happy". Given that IM had often used, in following months, the word "mama" in contexts as diverse as 1. request for food 2. call for help 3. declaration of joy 4. looking at father’s photo (c.f. below) 5. 
5. approaching «home»4

it seems to be the case that even the meaning of such a fundamental signifiant is not completely fixed and varies in time. But given that IM's mother has practically always interpreted such a term as a signal which made her personally and immediately concerned, the term potentially got quite fixed and served as a sort of label denoting IM's mother. Later, already in the two-word phase and after she had «discovered» that every peer in the creche has his own distinct «mama», IM started to denote her one and only mother with the term «MAJNE MAMA».

2 C.f., for example, a sequence recorded between the 30th and 70th second of the video downloadable at http://wizzion.com/im/latebabbling.avi. Recorded at (2;5;8).

3 We shall use upper-case letters to mark such signifiers which most probably already encoded a specific meaning. Lower-case transcriptions shall represent sequences whose meaning, at the moment of production, seemed to be absent or highly ambiguous.

4 This was noticed at (1;7;12) when the father used the «Google Streetview» application to perform a small experiment. IM could see, on the monitor, the streets she already knew from real life. Once the walk ended in front of the entrance to the house where IM lives, IM pointed to the monitor and cried «MAMA!».

When it comes to the paternal term, which is «tato» in Slovak, the father had noted down the production of the sequence «tata» as soon as (0;7;23), the mother at (0;8;21). The first indication that IM's brain associates the term with the father was furnished by the mother who, during a trip to the seacoast where the father was absent (0;9;9), saw IM looking at father's photo, uttering «TATO» and then observing the sea for a long time, silent. Three months later, at (1;0;23), such a romantic view was somewhat perturbed by father's observation that IM used the term «mama» when looking at a photo on which only the father was depicted. Thus, it was only after months-lasting experimentation with the pronunciation of dental occlusives in sequences like «ata» (0;7, 1;0;1), «ada; dada» (0;8;21), «toto; tete; tata» that the father noted, at (1;3;16), that «the most popular word is currently TATO and it is quite possible that it also means what it is supposed to mean, since You often say it either when I disappear from Your view, or when You want something from me».

A first non-parental term whose repetitive usage was considered as worth recording both by the mother (0;8;21) and soon afterwards by the father (0;8;29) was ENTE. Since at such an early age IM had used this term – which means «duck» in German – in an exclusive and strongly repetitive fashion when she was confronted with books with ducks, bathtub ducks as well as real organic instances of the species Anas platyrhynchos, it can be stated that IM succeeded to create a cognitive representation of the word ENTE whose extension strongly overlaps with the one held by IM's social surroundings. This can potentially be explained as a consequence of the «duck-feeding and duck-observation» rituals in which IM participated on a regular weekly basis since the third week of her life. But given that ducks are often mentioned in «lists of first words» (c.f. Table 3) or «first word combinations» (c.f. for example (Braine and Bowerman, 1976, pp. 23, 32, 44, 49)) presented by other western authors, one is tempted to state that IM's obsession with the form ENTE is to be explained not only as a sort of caprice of the ontogeny of an individual psyche, but can also have cultural or even phylogenetic roots.
At (1;0;23) the father noted down that IM often uses the term BABA when speaking to and/or demanding the presence of her grandmother. This is consistent with the fact that the term is a Slovak colloquialism denoting a grandmother, or an old woman in general5. Later, the term was often used as a part of the fixed construction «HALO BABA» (1;2;21, 1;4;10), potentially imitated by rote from mother's telephone talks to her own mother.

5 Note that in many languages, the term «baba» is often associated to meanings which C.G. Jung would most probably understand as instances of the archetype of «old and wise authority». Thus, asides its well-known use as a honorific in Sanskrit, Persian, Turkish or Arabic, the term baba denotes an old and wise man among the Shona people of Zimbabwe or the Yoruba people of Nigeria, and potentially in other ethnics as well.

People: MAMA (0;9), TATO (M0;9), BABA (1;3)
Food: BAJA/ANAN [banana] (F1;5), MI [milk] (F1;5), BROT [bread] (F1;4)
Body parts: NENE (F1;0;23), HÁE [hair] (1;5)
Places: KITA [creche] (F1;4), ŠPIPA [playground] (F1;5)
Animals: ENTE [duck] (MF0;8), uau-uau [dog] (F1;5), mjau [cat] (F1;5)
Toys: BAJ [ball] (1;6), TEDY [teddy-bear] (1;6)
Household objects: KE [keys] (1;5)
Routines: halo (F1;4), e-e [refusal] (M1;4), najn [no] (M1;5)
Activities: papa [to eat] (1;2), hají [sleep] (1;5), daj [give!] (F1;5), auke [sway!] (F1;5)

Table 10: IM's productive lexicon before attainment of 18 months. Words in the brackets denote the most plausible meaning, as decoded by either father (F) or mother (M). Compare with Table 3.

Table 10 contains the list of words noted down before IM attained one and a half years of age. The list is fairly standard and resembles other such lists reported in the literature. Food- and game-related imperatives were common, as well as animal-like onomatopoeias. In the majority of cases, an initially idioglottic, private sound-form of a produced word developed in a sense which would ideally match the «ideal» sound-form of the parents. IM's C- and P-structures adapted to her surroundings. There occurred, however, multiple cases where C-structures of the parents adapted to private P-structures of the child. Most salient among these was the case of the word NENE, noted down quite early (F6 1;0;23), referring to mother's «breast». The mother swiftly included the paedologism into her own productive lexicon, as her (M1;5;24) journal entry documents, where she used the term as a component of the wider declinated expression «meine nene».

6 From now on, all references to the observation log shall be preceded by the consonant specifying the author of the entry, e.g. (f)ather or (m)other.

12.5.1 nene & taboo (aph)

Humans are essentially mammals. In a healthy, normal situation, the first communicative channel between the child and the world passes through mamelles de mama. And indeed, many are the indices that the bond created by and during breast-feeding can significantly influence the ontogeny of the child's cognitive and linguistic structures (Hromada, 2009). It is thus somehow surprising to see that the topic of breast and breast-feeding is either ignored or tacitly cast aside by major figures of contemporary DP. Indeed, one shall not find a single occurrence of the word «breast» in (Tomasello, 2009) or (Karmiloff and Karmiloff-Smith, 2009).
Also in Pinker's Language Instinct, which pretends to introduce The New Science of Language and Mind, the breast is mentioned only once, in a context quite unrelated to ontogeny («Proto-Indo-European melg "to milk" resembles Proto-Uralic malge "breast" and Arabic mlg "to suckle"» (Pinker, 1994)). Thus the only monography which somehow saves the score and mentions the breast in a developmental context is (Clark, 2003) where, in table 4.2 on page 83, the term «nenin», produced by a French child, is translated as breast.7 end nene & taboo 12.5.1

The fact that the term breast seems to be taboo for contemporary psycholinguists is even more striking when one realizes that it was already one of the fathers of the discipline, Roman Jakobson, who pointed out that «often the sucking activities of a child are accompanied by a slight nasal murmur, the only phonation which can be produced when the lips are pressed to mother's breast or to feeding bottle and the mouth is full. Later, this phonatory reaction to nursing is reproduced as an anticipatory signal at the mere sight of food and finally as a manifestation of a desire to eat, or more generally, as an expression of discontent and impatient longing for missing food or absent nurser, and any ungranted wish. Since the mother is la grande dispensatrice, most of the infant's longings are addressed to her, and children gradually turn the nasal interjection into a parental term, and adapt its expressive make-up to their regular phonemic pattern.» (Jakobson, 1960)

Asserting that our observations of IM's interactions confirmed Jakobson's insight, we propose the following developmental analysis of IM's πρώτα ονόματα: the left part of Figure 25 suggests that the development of the structures MAMA and NENE can be understood in terms of a general process during which the sucking reflex extends into a vocalized labial P-structure (M@M@). Subsequently, this «centroid» schema differentiates into two schemas, MAMA and NENE. The right part of the figure conjectures that such a differentiation can be explained in terms of replication, variation and selection:

Figure 25: First differentiation between the whole and its part (a) and its evolutionary explanation (b). [The figure depicts the innate schema M@M@ replicating into multiple M@M@ copies which subsequently mutate into MAMA resp. NENE.]

7 Note that none of IM's parents was aware that the term NENE means «breast» in French argot. This was «discovered» only post hoc, after the term NENE was already unambiguously used and understood by all family members. Given that the same signifiant was found out to denote the same referent in two independent language systems (i.e. IM's idioglossia and French argot), the theory of the «arbitrariness of the sign» (de Saussure, 1916) is to be partially revisited.

1. first, the initial structure (M@M@) gets reproduced

2. some of the resulting replicas are subject to mutation (a shift towards open vowels in case of the emergence of MAMA, a shift towards alveolar nasals in case of NENE)

3. structures which turn out to be useful (e.g. they increase the probability of being breast-fed) get reinforced, fixed and succeed to survive in time (contrary to «less fit» structures not resulting in the fulfilment of the child's communicative intention)

This being said, we shall now focus on other phenomena of IM's linguistic development which seem to fit into such an evolutionary framework; a minimal computational rendering of the differentiation scenario is sketched below.
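The differentiation scenario of Figure 25b can, as announced, be rendered as a minimal replicate-mutate-select loop. The Python sketch below is purely illustrative: the mutation table, the mutation rate and the set of environmentally reinforced forms are our own toy assumptions, not measured quantities.

import random

# Toy rendering of Figure 25b: the schema M@M@ is replicated, replicas
# occasionally mutate, and only replicas "understood" by the environment
# (i.e. reinforced, e.g. by resulting in being breast-fed) survive;
# all other replicas fall back on their parent form.
MUTATIONS = {"@": ["a", "e"], "m": ["n"]}   # assumed directions of phonic drift
UNDERSTOOD = {"mama", "nene"}               # forms reinforced by caretakers

def mutate(form, rate=0.5):
    return "".join(random.choice(MUTATIONS[c])
                   if c in MUTATIONS and random.random() < rate else c
                   for c in form)

population = ["m@m@"] * 8                   # replication of the innate schema
for cycle in range(20):
    replicas = [mutate(f) for f in population]
    population = [r if r in UNDERSTOOD else p   # selection step
                  for r, p in zip(replicas, population)]

print(set(population))   # usually contains 'mama' and/or 'nene' besides 'm@m@'

end first words 12.5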
12.6 repetitions and replications

Repetitio est mater studiorum et repetitio replicatio est. Repetition is a form of replication (3). It may be argued, of course, that this formula is not always valid: take as an example an agent without any memory whatsoever which just executes random movements and, by sheer caprice of hasard, repeats the same movement it has already executed sometime in the past. But in case of agents with a mnemonic substrate powerful enough to project the temporal onto the spatial (e.g. the human brain), we see no a priori reason why the formula should be rejected. Hence, repetition of information is a form of replication of information. By repeating information, children's brains replicate information. We distinguish two major types of processes behind replications:

1. intersubjective replications

2. intrasubjective replications

As everything in the human mind, these processes mutually interact. But in early development, so we argue, they can be discriminated as independent. In an intersubjective replication, a structure S is articulated, performed and|or expressed by two or more distinct subjects. Thus, when mother's saying of the word «TATO» is followed by the child's utterance of the same word, one observes a minute intersubjective replication. Intersubjective repetition can thus be understood as equivalent to imitation. One observes an intrasubjective replication whenever a structure S is articulated, performed and|or expressed by one subject in two distinct moments. A replication of the syllable MA in the word MAMA can be understood as one amongst its most simple cases. Canonical babbling or many among the Piagetian «circular reactions» can also be understood as expressions of such a general cognitive process.

In IM's case, it was around her first birthday when the interplay between these major processes started to express itself in observable forms of verbal interaction. More concretely, at (f1;0;7), IM produced a bi-syllabic MAMA after hearing a bi-syllabic MAMA and a tri-syllabic MAMAMA after hearing a tri-syllabic MAMAMA. Her internal and potentially innate tendency to repeat was exposed to parsable and reproducible stimuli: the result was one among the first bipartite micro-dialogues noted down. The interplay between the two processes became more salient half a year later when IM started to consistently use her private words in recurrent contexts. Parents could therefore quite easily decode the «meanings» of such intralexically repetitive terms as BIBIBIBI (f1;6;12) [in presence of a «baby»], ANAN (f1;10;15) [when requesting a «banana»], VAVA (f1;8;0) [when playing with «water»] or NANA (f1;8;6) [when looking into the mirror]8. Given that these words do not exist per se in either German or Slovak (but, as will be shown in 12.10.1, some can be understood as cross-over forms between the two languages), they had gradually disappeared from IM's lexicon. The disappearance of these pre-syntactic, protolexical structures notwithstanding, intrasubjective replication did not cease to play an important role in the development of IM's linguistic faculty. Consistently with what is known in the literature, such repetitions prevailed whenever IM became aware of the existence of a new form whose articulation was to be perfectioned and mastered.

8 Later, at (f1;11;16), it was noted down that IM tended to use the term ICH when she was the agent of the action and NANA when she was the receptor or benefactor of the action.
For example, after having understood that the difficult-to-pronounce form AUTOBUS refers to instances of a large, noisy, useful yet dangerous species, IM produced (f2;0;2) the term 63 times in less than 30 minutes. Given that during this time interval there was sometimes no autobus in sight, and given that the articulatory sequences were sometimes interrupted by minutes-lasting pauses or by sequences dedicated to other topics, one is obliged to explain such loops in terms of structures and processes whose temporal span extends well beyond the millisecond- or second-span of the standard Millerian short-term memory.

At the end of IM's toddler period, we constate that plain intrasubjective replications are more and more rare. Sometimes they still occur when the child is playing alone, especially with water or her child-resembling puppets. Or they occur in situations where the term is too difficult to pronounce on its own (i.e. IM's pronunciation of SAMBASAMBhAVA when exposed to the picture of the buddhist saint Padmasambhava (f2;9;3)). And some intrasubjective repetitions are still observable in communicative scenarios (e.g. saying MEINE, MEINE in order to emphasize that a certain toy or food should not be taken away). Whether these cases still represent a rudiment of a subjacent cognitive processus, or whether they are simply expressions of structures which were culturally acquired9, opens an argument which we have no intention to enter.

9 Note that rhetorical figures as diverse as antanaclasis, epizeuxis, conduplicatio, anadiplosis, anaphora, epistrophe, mesodiplosis, diaphora, epanalepsis, diacope or chiasm all exploit, in one way or another, the impact of repetition upon one's C-structures. Note also that «reduplication» is a phenomenon observed in practically all major language families of the world.
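Anticipating the regular-expression transcriptions announced in the introduction to this Part, the simplest intralexical form of intrasubjective replication, i.e. reduplication, can already be captured by a single back-referencing PCRE. A minimal sketch (the token list is our own toy example), using Python's PCRE-like re module:

import re

# A syllable-sized unit (two or more characters), repeated one or more
# times, makes up the whole token: MAMA (MA+MA), VAVA and BIBIBIBI match,
# whereas TATO does not (TA differs from TO).
REDUP = re.compile(r"^(\w{2,})\1+$")

for token in ["MAMA", "VAVA", "BIBIBIBI", "NENE", "TATO", "AUTOBUS"]:
    print(token, bool(REDUP.match(token)))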
end repetitions 12.6

12.7 first constructions

While repetitive sequences can be rightfully considered as «constructions» because they contain multiple juxtaposed (con-) elements (structions), we label as «constructions» only such expressions which fulfill the following conditions:

1. they contain sequences of two or more elements which, when taken alone, are distinct from each other

2. the basic elements are at least as complex as a morpheme or a syllable

Under such a definition, the sequence «ma» is not a construction because it does not fulfill the second condition («m» and «a» are neither morphemes nor syllables); and the sequence «mama» is not a construction because its basic elements (two «ma» syllables), when taken alone, are not distinct from each other. Under such a definition, products of plain intrasubjective replication are not to be considered as «constructions». What is needed in order to obtain «constructions» thus defined is not only replication, but also variation.

12.7.1 first word combinations

The first observed, decoded and registered multi-word combination which IM had uttered was: MAMA NENE (F1;4;25). The construction was uttered in the context of a request for breast-feeding. Note that without previous knowledge of what NENE (12.5.1) means, it would be impossible to decode the signal as a legitimate phrase on its own, and given its C1V1C1V1 C2V2C2V2 structure, it could even be considered, by an external observer unable to parse IM's idioglossia, to be a meaningless babbling fragment.

A month later, at (f1;4;30), IM uttered the expression TATO MAMA BABA ALA. Given that IM called her paternal grandmother by the nickname ALA, IM's mother immediately understood the 4-word (!) utterance as a nominal phrase meaning «father's mother is grandmother Alena». Three weeks later, at (f1;5;23), IM pronounced the utterance MAMATATO immediately after waking up, potentially requesting the attention of (or greeting?) both parents by means of a concatenation of both parental terms. At (f1;8;6) it was suspected that TATOTUTO means «here, father», since «tuto» is a standard local demonstrative of the Slovak language. But the advent of the full-fledged two-word stage was noted down at (f1;8;9) when IM said into the interphone «HALO TATO», while usually she was saying either HALO or TATO. Only two weeks later, at (f1;8;23), the mother immediately decoded the expression AJs NANA MAMA AKUKE as a ditransitive construction meaning «ice-cream me mother buy». Given that the utterance was indeed produced in the proximity of an ice-cream stand, and given that it was accepted by both parents at least since (m1;8;23) that AKUKE means «einkaufen» (e: to shop, s: nakúpiť), and taken for granted that IM uses the term NANA to refer to herself (c.f. the previous section), the father would have been obliged to set aside his scepticism and buy both girls an ice-cream if the younger one had not fallen asleep in the meantime. end first word combinations 12.7

12.7.2 first pivot(s)

«Names» like MAMA, TATO or NANA were sometimes used as components of longer and more complex constructions. They were also partially productive in the sense that some of these constructions (like MAMATATO) were never uttered by the parents and thus could not have been learnt by rote. But before the introduction of pivot words, the productivity of such nominal terms was highly restricted: they never occurred in more than a handful of constructions. Things changed with the arrival of the first pivot term; in IM's case, it was the term AUCH (meaning «too», «also»). Thus, an (f1;10;0) entry mentions the following constructions: TATO AUCH (as a father-addressed request to eat as IM does); MAMA AUCH; NANA AUCH (when requesting to eat the same food as the parents eat); ENTE AUCH (when feeding ducks). The pivot AUCH can thus be understood as a productive «seed» of the following micro-grammar:

(MAMA | TATO | NANA | ENTE) AUCH

Table 11: IM's seeding grammar: AUCH at the ultimate position.

Depending on the context and the agent-term of the construction, the pivot carried meanings as diverse as the imperative «You (father) do that (eat) as I do», the declarative «I do it (put clothes on) as You do» or «They (ducks) also want to eat». In general, it seems that the term was quite closely related to the fact of imitation and/or to the intention that the activities of two distinct agents should be aligned. Thus, the next recorded constructions were:

ICH AUCH NACH HAUZE (f1;10;15 – when wanting to go home)
AKE, NANA AKE (f1;10;17 – when seeing father swinging on a seesaw; AKE = g: schaukeln, e: to swing)
YOGA (f1;10;30)

Table 12: Seeding grammar extended: AUCH in the central position.
Note, however, that the term AUCH allowed IM to articulate thoughts encompassing realities well beyond the here&now. For example, when watching the scene of her favorite animated movie in which the benevolent mole bottle-feeds an orphaned eagle, IM declared: ICH AUCH MI (f1;11;1), meaning something like «I am also used to drink milk». Or putting the milk-bottle on mother's breasts and saying NENE AUCH (f1;11;5). Or, when reading a book about Babar the elephant (f1;11;5), IM stated ICH AUCH AM (AM = tram) when observing a picture on which Babar the elephant takes the tram, potentially intending to declare that she also takes the tram; a few pages later, ICH AUCH LIEALO (LIEALO = s: lietadlo, e: airplane)10 was declared when observing a picture on which Babar exercises yoga at the airport. During the same evening reading session it was, however, IM's act of uttering ICH AUCH, accompanied with pointing to the image of the Eiffel tower, which made both parents feel utterly perplexed. Not only because IM had indeed visited the Eiffel tower more than 2 months before, but also because no «ICH AUCH» was uttered during the lecture of the subsequent pages, on which Babar exercises yoga in Yosemite park, near the Golden Gate bridge etc.

10 In a sort of cognitive and phonotactic economy par excellence, IM had consistently used the Slovak signifier LIEtAdLO when mentioning airplanes in her otherwise germanophone constructions. Cognitive: airplanes were strongly associated with departures and arrivals of the Slovak-speaking TATO. Phonotactic: it is definitely easier for a child to pronounce a word full of laterals and vowels than the German «Flugzeug», containing such phenomena as a voiced velar occlusive juxtaposed with an affricate. Being reassured that she masters the syllable «LIE» well, IM had later consistently preferred to use the term LIEnKA (meaning «ladybird») instead of the German «Marienkäfer».

Given the recurrence of the construction ICH AUCH, one would be tempted to state that it was this longer complex, and not the simple AUCH, which was the true pivot. But this was not the case since, more often than not, AUCH agglutinated to and with other agential terms than the simple ICH. Thus, TATO AUCH LIEtAdLO (f1;11;2) was uttered when observing airplanes in the sky; the expression TATO AUCH UHE (UHE = g: Schuhe, e: shoes) ordered the father to put on his shoes. What's more, in her pre-sleep monologue of (f1;11;4), IM spontaneously generated all utterances given by the paradigm:

(ICH | MAMA | TATO) AUCH (AJA [e: egg] | KUCHEN [e: cake])

Table 13: Another AUCH-centered paradigm.

and did so in a repetitive and combinatorial fashion (i.e. produced all 6 combinations) normally common to scholastic methods or text-books in secondary language acquisition. This being said, both parents unanimously agree that IM's first pivot «strong» enough to structure around itself a whole system of constructions was the intersubjective term AUCH. This pivot was only slightly antecedent to the gain of force of another pivot, namely the egocentric MAJnE (d: meine, s: moje, e: my), expressed at (m1;11;2) in such utterances as MAJnE MAMA or MAJnE MIAU. Soon after, these phrases were also cried out from sleep: MAJnE MIAU at (f2;0;0, f2;0;21), MAJnE MAMA at (f2;0;21). But it was already at (f1;11;21) that this «pivot of personal property» was strong enough to cause IM to cry out the expression MAJnE UHE (my shoes!) amidst the REM-phase of one of her sleeping cycles. Somewhat contrary to other children reported in the literature, IM started to use her term MEA (meaning d: mehr, e: more) as a productive pivot only relatively late. Often, she simply used other means (including the usage of AUCH or NOCH) to express longing for bigger quantities of food or for the reproduction of a certain action.
end first pivot(s) 12.7.2

12.7.3 first micro-grammars

Once pivot words had helped IM to «understand» the meaning-specifying expressive force behind the act of juxtaposition of specific tokens, IM swiftly and naturally proceeded to the application of such a «combinatorial trick» in other contexts and for other uses. Asides the proto-islands of order structured around AUCH and MAJnE, instances acceptable by the following micro-grammars were noted down (f2;0;7) as most salient and recurrent:

Agent → MAMA | TATO | NANA | ICH | BABA | BEJBY
Patient → MIAU | METE
Food → BROT | AJA | ANAN
Drink → MI | VAVA
Action → HAJI | ESSEN | TRINKEN
G1 → Agent AUCH
G2 → MAJNE Patient
G3 → Food ESSEN
G4 → Drink TRINKEN
G5 → Action MACHEN
G6 → Agent KOM Action     (2)

Grosso modo, this proto-grammar already includes references to those actions (eating, drinking) and agents (parents, self) which are most vital for IM's survival. But in rules G5 and G6, one can already observe «the seed» of a much more general knowledge: a knowledge that certain precise actions can be «made» (G5) and, in a sort of half-imperative, half-causative fashion, other agents can be incited to «come» and actualize them (G6). From such knowledge, the child is only one cognitive step away from the reflected and conscious meta-knowledge of the fact that it is by language and language alone that such precise incitations can be made. From there on, the whole evolution of IM's syntactic P-structures has become complex, filled with non-monotonic returns, asynchronic detours, parallel developments and both intra- and inter-insular population dynamics. Since a sufficient accounting of such development would demand a book of its own, let's now shift away from the terminology of «grammars» towards a more dynamic terminology speaking about «mutations», «crossovers» and «life»; before doing so, a minimal generative rendering of proto-grammar (2) is sketched below.
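In the Python sketch below, the encoding of the rules as a dictionary and the random expansion strategy are our own illustrative choices, not a claim about IM's cognitive machinery:

import random

# Proto-grammar (2): non-terminals map to lists of alternative expansions,
# each expansion being a sequence of terminals and/or non-terminals.
RULES = {
    "Agent":   [["MAMA"], ["TATO"], ["NANA"], ["ICH"], ["BABA"], ["BEJBY"]],
    "Patient": [["MIAU"], ["METE"]],
    "Food":    [["BROT"], ["AJA"], ["ANAN"]],
    "Drink":   [["MI"], ["VAVA"]],
    "Action":  [["HAJI"], ["ESSEN"], ["TRINKEN"]],
    "G1": [["Agent", "AUCH"]],
    "G2": [["MAJNE", "Patient"]],
    "G3": [["Food", "ESSEN"]],
    "G4": [["Drink", "TRINKEN"]],
    "G5": [["Action", "MACHEN"]],
    "G6": [["Agent", "KOM", "Action"]],
}

def generate(symbol):
    """Expand one randomly chosen derivation of `symbol`."""
    if symbol not in RULES:                    # terminal token
        return [symbol]
    expansion = random.choice(RULES[symbol])
    return [tok for sym in expansion for tok in generate(sym)]

# e.g. 'BABA AUCH', 'MAJNE MIAU' or 'TATO KOM HAJI'
print(" ".join(generate(random.choice(["G1", "G2", "G3", "G4", "G5", "G6"]))))

end first micro-grammars 12.7.3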
But luckily for all the parents of the Earth, the positives seem to outweigh the negatives. Thus, most of the time, verbal interaction with children is simply beautiful, comforting and -let’s not forget the another important aspect motivating all parties involved - are child’s first linguistics constructions perceived and felt as cutely and adorably funny. end first constructions 12.7 12.8 mutations Mutations (from lat. mutare "to change") are basic atomic units of change. Mutations occur in time; in informatic terms mutations are events caused by transition of information-encoding substrate from one state into another. Given that the physical nature of substrate of linguistic representations is still speculative, and in great extent unknown (8.6), we shall present, in the following paragraphs, just a handful of illustrations of such transitions occuring in ontogeny of IM’s linguistic structures and processes. 12.8.1 context-free substitutions Context-free substitutions are mutations characterized by substitution (replacement) of each occurence of the original symbol So rigin 11 If the statement « The limits of my language are the limits of my world» (Wittgenstein, 1922) is true, than agent’s exploration of limits of her language equivauts the exploration of limits of her world. 12.8 mutations with exactly one instance of the target symbol St arget. Given that all occurences of So rigin are substituted, CSM operators are, so to say, agnostic of substituents position. A first example was already given: transition M@M@ → MAMA (c.f. 25) can be explained as a substitution of a central vocalization @ for a more marked A, i.e. as a result of application of a rule @ → a. Other particularly illustrative example of a CSM was given by IM on three consecutive days, during which she was observed to utter sequences of a form BABIJÁ (f1;4;16) MAMIJÁ (f1;4;17) PAPIJÁ (f1;4;18) Within the framework of the theory hereby proposed, such transitions could be explained by mutation of the content attributed to non-terminal Clab,occ within the template: Clab,occ aClab,occ ijá which is equivalent, at certain level of abstraction, to substitutions b → m and m → p which most probably occured in IM’s mind during the first (resp. second) night between the observations. Note, however, that in spite of being labeled as context-free, even these mutations are not "global". It would be utterly false to believe that the fact that every B within the construction BABIJÁ was substituted by P resulted in the situation whereby IM ceased to pronounce the sound B alltogether. This was, of course, not the case and the sound "B" did not disappear from IM’s repertoire. Thus, in regard to the local "template" in which it occured, the substitution could be considered as context-free. But not more: the mutation had practically no impact beyond the local micro-grammar within which it took place. Or, to come back to the example of the primary differentiation Figure 25, the fact that @ was replaced by A in case of insula slowly converging to meaning of "mother" did not have any impact whatsoever upon the fact that within the insula slowly converging to meaning of "breasts" another mutation (i.e. MA → NE) took place. This is so because in moment of the mutation, both insulae were already materially encoded in at least partially distinct neural loci. To summarize: context-free mutations are mutations which alter all instances of a certain symbol. 
But the scope of their action is still constrained to only a specific template | insula | micro-grammar 12 . Or a restricted group of these. end context-free mutations 12.8 12 In following sections we shall use terms template, micro-grammar and insula in a mutually interchangeable, synonymic fashion to mark the fact these notions are computationally equivalent 187 188 qualitative 12.8.2 First vocatives context-sensitive substitution The scope of impact of a context-sensitive mutation is also constrained to a specific template or to a strongly restricted group of these. But in addition to this constraint, scope of applicability of CS-mutation is also limited by the context | position | neighborhood within the template itself. To illustrate with first well-documented CS-substitution: during her stay by IM’s czech-speaking BABA, the mother has documented IM’s production of expressions MAMI and BABI (m1;4;9). Emergence of these forms, which are completely correct vocatives in czech, could be explained by a context-sensitive mutation A$ → I$ 13 occuring in IM’s mind. In the observation journal, mother had commented the phenomenon: I suppose these came because of my calls "Babi" tu my own grand-mother and "Mami" to my mother.. Further analysis can unveil, however, that acquisition of such vocatives could have been synergetically catalyzed by the presence of a dog called DEXI and a cat JESI in grand-mother’s appartement. Since it was one among first IM’s exposures to animal life and since IM did not hesitate to establish not only visual, but also verbal (by production of onomatopees like HAU-HAU and MIAU-MIAU) and haptic communicative interlock, it is undoubtable that representations (i.e. signifiees) of both pets attained a highly salient status within IM’s mind. And given that in czech language vocative forms of I-terminated animal pet names are identic to the nominative forms, it cannot be excluded that the very presence and saliency of pet-denoting −I$ protonominals had stimulated IM’s nominative-to-vocative transition within more general a class of living beings. Thus, IM’s success in mastering of vocatives seems to be result of interplay of three mechanisms: 1. an endogenous mutation which caused the −A$ → −I$ transition within certain among IM’s private P-structures 2. exogenous gold-standard structures (i.e. persons in IM’s social environment which use −I$ nominals within certain contexts) 3. a cerebral mechanism reinforcing or even replicating such private representation which match public structures Nature of these three mechanisms correctly understood, one can see how development of practically any expressions - from initial babbling all the way through infantese, toddlerese, pupilese to the "correct" adult-like pronounciation - can be characterized as a sequence of such CSSs. In IM’s case, for example, one can see the trajectories along which the words for "milk", "water", "baloon" podded out of the initial babbling: 13 Consistently with the syntax of Perl Compatible Regular Expressions (PCREs) we shall denote the "ultimate position" with the dollar sign $. 12.8 mutations Context-sensitive substitutions (EXT) MiMi*14 → MI (f1;5;12) → MICH (f2;0;13) → MILCH UaUa* → VAVA (f1;8;10) → VASA (f2;1;8) → VAS BALaL* → BALOL (f1;10;30) → BALOND (f2;0;13) → BALON (f2;4;19) end context-sensitive substitutions 12.8.2.0 In these cases, mutations had often counteracted child’s tendency for elision, assimilation or fronting of certain phonemes at certain positions (9.2.1). 
In each example, the symbol → tends to denote a moment, or a group of moments, whereby IM's linguistic structures underwent a structural change, i.e. a mutation. In reality the situation is, of course, much more continuous and much less discrete than in our transcriptions. To describe the whole phonic development more closely, one would have to use a more refined transcription alphabet (e.g. the International Phonetic Alphabet), but even this one could be criticized as too coarse-grained for the task at hand. But no matter what transcription system one chooses, and independently even of whether one stays faithful to continuous reality or discretizes the phenomena into already existing boxes, one thing stays certain: IM's interiorization of any individual linguistic structure consisted of multiple intermediate steps.

end context-sensitive mutations 12.8

By stating that the development of any individual linguistic structure consists of multiple intermediate steps, we want to focus the reader's attention on the fact that not only P-structures and articulated signifiers develop, but - and this is important - any C-structure (i.e. conceptual signifié), as well as the structures relating the two, does so as well. In the preceding paragraphs we have focused mainly on the development of P-structures because their development is easier to assess. But this does not mean that the world of C-structures does not develop, i.e. that it is not subject to mutations. The contrary is, in fact, the case: in the course of her development, IM's innermost structures were constantly modified by a multitude of events of exogenous origin. By myriads of minute interactions and couplings of linguistic inputs with other auditive, visual, haptic, olfactory, gustatory, vestibular, nociceptive or proprioceptive inputs. By parental questions and parental corrections, and by the facts that a certain question and a certain correction were given in one context but not in another. But other, more endogenous factors related to playing, dreaming and φαντασία, well beyond the traditional adult notions of "abstraction and generalization", had to be active as well, in order to account for the emergence as well as the correction of such cases of poietic over-generalization as:

1. at (f1;11;8), saying ZONE (e: sun; d: Sonne) when seeing a full moon in the evening sky
2. at (f2;0;19), naming the circle of light projected by the lamp upon the bedroom's ceiling with the term BALONd
3. at (f2;3;19), saying LIENKA (e: ladybug) when seeing, in a picture book, a red ball with white dots
4. at (f2;5;8), using the term KUGEL (e: sphere) to describe a ping-pong ball (correctly called "BAL" a year before)
5. at (f2;6;15), answering NENE when asked to describe what is on a swimming-pool tile with two concentric circles
6. and the DING-DONG mystery

12.9 case study of semantic mutations: the ding-dong mystery (aph)

To demonstrate the arbitrariness of any system of categorization, or even of any epistemology, both Michel Foucault and Eleanor Rosch fondly cite the taxonomy fictitiously attributed by Jorge Luis Borges to an ancient Chinese encyclopedia entitled the Celestial Emporium of Benevolent Knowledge:

«On those remote pages it is written that animals are divided into (a) those that belong to the Emperor, (b) embalmed ones, (c) those that are trained, (d) suckling pigs, (e) mermaids, (f) fabulous ones, (g) stray dogs, (h) those that are included in this classification, (i) those that tremble as if they were mad, (j) innumerable ones, (k) those drawn with a very fine camel's hair brush, (l) others, (m) those that have just broken a flower vase, (n) those that resemble flies from a distance.» (Borges, 1952)

Such a Borgesian account is something which has to come, willy-nilly, to one's mind when confronted with the case of the DING-DONG mystery (DDM). Contrary to Borges's taxonomy, however, the DDM is not fictitious but rooted in the reality of facts. These are as follows:

The first mention of DING-DONG (f2;0;7) clearly places the term in the context of church bells. Indeed, IM had used the expression to express her will to be in the proximity of Bratislava's St. Martin cathedral exactly at 18:00, when the cathedral's bells ring the most. The same record mentions, however, that the term cannot refer to "church" in general, since another church building was labeled as OKOL (s: kostol).

That the concept develops started to become evident a month later (f2;1;4), when it was noted down, during a visit to the library, that IM had picked from a bookshelf a book about European history and labeled the building depicted on the front cover as DING-DONG. A week later, the (f2;1;8) record continues: "You are still occupied with the DING-DONG concept. It seems to denote all big buildings; today, for example, You saw the picture of a skyscraper and called it a DING-DONG."

A later (f2;2;1) log indicates that things started to get somewhat more complex. Thus, during a simple walk between Berlin's central station and Hackescher Markt, IM used the term DING-DONG when labeling the following objects:

• stone sculptures on the bridge
• a tower in the distance
• the fluttering German flag atop the Bundestag
• a building just next to the Bundestag
• the synagogue's golden dome
• the cross atop Berlin's cathedral
• the buildings of the Marienkirche and the Bode Museum

which indicates that at that period, the DING-DONG concept still overlapped with something similar to an adult concept of a "fancy piece of masonry" or a "building's top". The fact that the later (f2;2;11) log states "Still occupied by DING-DONG, You were completely fascinated by youtube videos of St. Martin's cathedral" seems to support the hypothesis.

While the (f2;2;18) log entry stated that "DING-DONG fades into background", it also stated that "from time to time, You still label something with that term: picture of castle in the book, two noodles stuck together...". The entry logged at (f2;5;7), i.e. 5 months after the initial use of the term, states: "with the word DING-DONG You have labeled the picture on the 'ace of staffs' Crowley's tarot card as well as a flute". And two weeks later (f2;5;21), it was written that "You are still occupied with DING-DONG.
It seems that You use it especially to denote spiky things; for example, green buoys on the Elbe river are DING-DONG. But the red ones, without the spike, are not." Approximately in the same period, the father also considered quite plausible the hypothesis that the term could also denote the property of being long (d: lang).

Given the importance of the term within IM's world, a small constructional island coalesced around it. At (f2;3;19), a recurrent usage of the construction dING-dONG mACHEN was noted down when building towery lego churches;15 at (f2;5;8), intense repetitive production of the expression ING ONG OJTET (d: lautet, e: rings) was noted down; and the (f2;5;23) entry recorded the following playful variations:

dING dONG lOJTET
dING dONG lOJTET
dING dONG lOJTET
lOJTET dING dONG
lOJTET dING dONG
lOJTET dING dONG

which repeatedly transgress even the most primitive subject-precedes-verb syntactic rule of the German language.

15 However, a general term for other constructions built from lego or wooden cubes was BAUT (f2;2;18, f2;3;19), potentially derived from the past participle of to build (d: gebaut). What's more, when asked to label diverse lego blocks, IM consistently used the term BAUK (f2;2;1). Note that such a term exists neither in Slovak nor in German.

Such indeed is the mystery of DING-DONG: unconcerned by the "correct word order", unconcerned even by appropriate, adequate and optimal conceptual boxes, the infantine mind plays the poetic game. Shamelessly, joyfully and naturally it plays the poetic game, and does so at all levels. Isn't that Borgesian?

end the ding-dong mystery 12.9

The above aphorism indicates that besides context-free and context-sensitive substitutions, yet other variation operators are at play in a developing mind. Not only formal but also semantic substitutions; not only the replacement of a symbol by an empty one, but also the diminution (or expansion) of the extension of a concept C (when C is understood as a set). Or, when concepts are understood in more geometric terms, mutations consisting of either an increase or a decrease in the volume of the C-localizing subspace, or of a translation of C's centroid to some other position.

But as was already indicated not only by IM's playful switch from the grammatical dINGdONG lOJTET to the agrammatical lOJTET dINGdONG, but first and foremost by the enumeration of different intralexical metatheses (12.9.1), yet another class of mutation operators seems to act within the developing mind: switchings of position within the sequence, permutations within the temporal order.

12.9.1 first transpositions

A transposition occurs when two or more elements of a bigger whole (e.g. phonemes within a word or words within a phrase) exchange their positions. The frequencies of occurrence of elements within the sequence thus do not change; their relative positions, however, do.

Already a relatively early transcription (f1;5;30) of IM's sometimes babbling, sometimes one-word "stream of consciousness" improvisations, produced at the breakfast table, contains sequences like ÁU, UÁ and ÍTÁ, ÉTÍ. Such were indeed IM's first attempts to switch the positions of two protophonemes in her protowords. IM's later productions indicated the activity of more complex (i.e. involving more than 2 transposed elements) metathesis-like reorganizations, such as:

Context-sensitive metatheses (EXT)

APUK (f2;1;10) → "kaput"
IPEK (f2;1;12) → Wipke
UKAKS (f2;1;24) → "Rucksack"
MAKTA (f2;6;0) → "matka"

etc.
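Such reorganizations can again be given a minimal computational form. The following Perl sketch is a hypothetical illustration of ours (the positions to be swapped are chosen by hand, whereas in a developing mind they would presumably be determined by articulatory and mnemonic pressures):

    #!/usr/bin/perl
    # Transposition (metathesis): two positions of a sequence exchange their
    # contents; the frequencies of the elements stay intact.
    use strict;
    use warnings;

    sub transpose {
        my ($form, $i, $j) = @_;
        my @symbols = split //, $form;
        @symbols[$i, $j] = @symbols[$j, $i];   # exchange positions $i and $j
        return join '', @symbols;
    }

    print transpose('AU', 0, 1), "\n";      # AU    -> UA   (cf. AU, UA above)
    print transpose('MATKA', 2, 3), "\n";   # MATKA -> MAKTA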
end context-sensitive metatheses 12.9.1.0

Given the prominence of such "errors" and mis-productions16 in IM's speech, we are tempted to state that a non-negligible amount of the transpositions commonly studied in evolutionary linguistics (e.g. e: "fog", s: "hmla", czech: "mlha") had their origin in slips of the tongue of individual toddlers, which were subsequently accepted and spread through the wider community.

16 From an adult-like point of view.

At the end of this development, IM started to permutate the positions not only of individual phonemes, but also of phonetic clusters or even of words within phrases. Thus, (f2;4;5) mentions a permutation

HUNDUNDMIAU
MIAUUNDHUND

undoubtedly stimulated by the symmetric, non-preferential binary coordinative UND (e: and). But as is demonstrated already by the following entry (f2;4;6), noted down during a game whose objective was to put diverse wooden animals into the correspondent slots:

DA IST KKO, IST DA KKO

as well as by an entry noted down a day later, during the marble game:

KUGEL EINE, DA IST KUGEL EINE

IM's propensity to permutate the order often disregarded even the most fundamental among the syntactic constraints of Germanic languages - that is, the constraint that the article (EINE) should precede the noun (KUGEL), and definitely not the other way around.

In the following chapter, dedicated to quantitative analyses of the CHILDES corpus, we shall aim to shed somewhat more light upon the question whether this situation - in which IM's urge to permutate word order was stronger than the most fundamental among the syntactic constraints - was peculiar to IM, who, as a partially Slavic person, potentially feels less bound by the need to correctly prefix substantives with determinants, or whether it is a more general trend present even among Germanic and Anglo-Saxon toddlers.

end first transpositions 12.9.1

In the above subsections we have presented multiple variation operators which, we believe, could rightfully be labeled as "mutations". Substitution of nothing with something, of something with nothing, of something with something else; expansion or diminution of extension; switches in positions, fillings of empty slots: in one way or another, all this was already known not later than after Aristotle and his followers (8.5). It was evident to Gödel and Turing as it is evident to the proponents of FLG: any computable number (resp. construable string of symbols) can be obtained by means of insertions, deletions, substitutions and transpositions.

Until now, practically nothing new in comparison with traditional cognitivist symbolic architectures. But everything changes when the most noble among all variation operators is introduced: the crossover.

end mutations 12.8

12.10 crossovers

Boldly speaking, crossover is the operator of unitas in diversitate. As indicated by the figures attached to our brief discussion of biological evolution (Figure 4) and fitness landscapes (Figure 8), the power of crossover consists in its ability to:

1. let two (or more) parent structures project their features upon one (or more) child structures
2. allow the evolving system to get out of locally optimal states (i.e. to fly away from the peaks into the unknown realms in between)

The second point implies that systems involving a crossover operator are able to continue evolving even there and then where other, gradient-following approaches are doomed to get stuck.
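Before turning to concrete observations, the formal core of the operator can be made explicit with a minimal Perl sketch. This is a hypothetical single-point string crossover of our own devising (the cut points are here chosen by hand, whereas in a genuine evolutionary system they would be chosen stochastically); the two example calls anticipate the forms PIJEN and BANÁN discussed below:

    #!/usr/bin/perl
    # Single-point crossover: the child schema inherits a prefix of one
    # parent schema and a suffix of the other.
    use strict;
    use warnings;

    sub crossover {
        my ($parent1, $parent2, $cut1, $cut2) = @_;
        return substr($parent1, 0, $cut1) . substr($parent2, $cut2);
    }

    print crossover('pije', 'trinken', 4, 6), "\n";   # pije x trinken -> pijen
    print crossover('BAJA', 'ANAN',    2, 1), "\n";   # BAJA x ANAN    -> BANAN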
Computationally speaking, crossover's ability to direct the search into regions of the useful unknown yields the ultimate coup-de-grâce which a model involving the operator apodictically gives to any model which does not involve it.

When it comes to the first point, consider, for example, a sort of semantic crossover:

(HUMAN) × (WINGS) → (ANGEL)

Without resorting to the crossover mechanism, and without falling into the trap of pseudoscientific divinatory explanations, we consider it very difficult - if not impossible - to offer a scientific account of a cognitive mechanism by means of which representations of entities without a material referent - all angels and chimeras, all centaurs and mermaids, all mythological visions as well as technoscientific insights - could have ever entered the mind of primordial man. And in whom else, if not in children, could one see such a process at work?

12.10.1 multilingual crossovers

Given the lucky coincidence (12.3) due to which the structures in IM's linguistic environment primarily consisted of structures coming from two distinct sub-branches (i.e. Germanic and Balto-Slavic) of the same language tree (Figure 5), IM was exposed, on a regular basis and in analogous contexts, to instances of constructions which were both similar and distinct at the same time. What resulted is a phenomenon of interlingual "mixing" which is well known to practically every parent of a healthy bilingual child. Let's now focus on the two most prominent types of such mixing.

Intralexical crossovers

A multilingual intralexical crossover is a mixing of two word-representing schemas S_L and S_J intersubjectively replicated from exogenous oracles using diverse languages L and J. Given that the schemas-to-be-combined are defined as word-representing, and given that their sources are oracles (e.g. parents, grandparents, teachers), they are expected to occur during:

1. cases of bi- or multilingual language acquisition (hence "multilingual")
2. the acquisition of base-level terms (i.e. signifiers) for base-level meanings and referents (hence "intralexical")

In the following exposure, crossovers shall be presented consistently with the following formula:

referent
german
slovak
TODDLERESE

whereby the first row shall contain the English term for the referent R, the second row the R-denoting term most frequently used by IM's mother, and the third row the R-denoting term used by IM's father. The last row of every example shall contain the transcript of IM's idioglottic production consistently (i.e. more than once) uttered in R-cooccurrent contexts.

The first three multilingual intralexical crossovers, all noted down at (f1;7;30), were:

eyes
augen
oči
OGE

and

water
vas@
voda
VAVA

and

shoes
šúhe
boty
OGHE

Another couple of salient crossovers was noted down during games with animal picture books: at (f1;8;11)

monkey
afe
opica
API

and at (f2;1;9)

elephant
elefant
slon
OLOOND

Besides substantives, other parts of speech were mixed as well. Verbs, for example (f1;8;24):

buy
ajnkaufen
nakúpiť
AKUKE

as well as possessives (f2;2;22):

my
majne
moje
MAJE

Many other cases were noted down in which IM had opted for a form sharing as many features as possible with the forms of both ambient languages. Thus the (f2;0;14) entry recorded

stick
štok
papek
AK

and it has to be added that in the weeks that followed, IM used the term AK and|or @K to denote practically any piece of wood she could easily carry and manipulate.
This was done in spite of initial parental attempts to correct her, and resulted, in fact, in a parental adaptation whereby the parents resigned and used the convenient term AK as well.17

17 The form @K withdrew into the background once IM mastered the pronunciation of the more correct forms OK (f2;6;13), TOK and ŠTOK. The form @K, however, soon reappeared in order to mean "wolf" (sk: vlk).

We believe that these examples illustrate that in many cases of production of new words, IM tended to:

1. produce forms with known characteristics
2. produce forms which are as close as possible to both parental forms

The first tendency seems to be the case for all healthy children, no matter whether they are raised in a monolingual or a multilingual environment (cf. Table 2 and the associated discussion of "preference" and "avoidance"). The second, centroid-form-seeking tendency is definitely most easily assessed in cases of multilingual acquisition, wherein the forms to be crossed over are distinct. To prove our point, we conclude this brief enumeration of IM's multilingual intralexical productions with a final example (f2;0;17):

drink
trinken
pije
PIJEN

as well as with a link, http://wizzion.com/thesis/videos/pijen.mp4, which, between 4:20 and 4:34 (as well as at 5:47), demonstrates our case.

end intralexical crossovers 12.10.1.0

Intraphrastic crossovers

Multilingual intraphrastic crossovers are crossovers which mix, within one construction, morphemes originating from multiple languages. They are well known to practically any person subjected to second-language acquisition: one wants to form a construction in language 1 but somehow "involuntarily" populates it with certain items proper to language 2.

IM started to produce her first intraphrastic mixes when she was still in the "pivot grammar" stage. It had thus often been the case that certain slots within constructions pivoted by the German AUCH were filled with an item of Slovak origin:

(f1;11;3) TATO AUCH LIEtAdLO (nom. sg. sk: "airplane")
(f1;11;4) ICH AUCH LIEtAdLO
(f2;0;0) NANA AUCH rUKY (acc. pl. sk: "hands")
(f2;0;7) LIENKA (nom. sg. sk: "ladybug") AUCH

Alongside these AUCH-pivoted intraphrastic crossovers, IM's production was also full of utterances composed of a Slovak noun and a German predicate. For example, during the period dedicated to the story of the mole and the eagle, the following utterances were very common:

(f1;11;12) OLOL (nom. sg. sk: "eagle") šAUEn (inf. de: "to watch")
KKO (nom. sg. sk: "mole") šAUEn
(f1;2;18) OLOL ÍflGt (3p. sg. pres. de: "to fly")

It is, however, debatable whether one could count such constructions as "interlexical" crossovers. This is so because in the concrete cases of usage, IM used the terms KKO, OLOL etc. similarly to terms like TATO, MAMA, BABA, i.e. as personal names. Not knowing any other instance of an eagle or a mole than the one which was presented to her, it seems more plausible to state that IM had juxtaposed language-agnostic names, and not language-specific nouns, alongside her German-originated predicates.

But 6 months later, with her toddler period coming to an end, a sudden phase transition in both the amount and the diversity of intraphrastic crossovers occurred. Thus, during an interval of only three days, the production of the following Germano-Slavic structures was observed:

1. (f2;5;13) TIETO (sk: "these") ČpANuCHy (sk: "sock pants") AJNgekAUFT (de: "bought")
2. (f2;5;15) WO (de: "where") IST (de: "is") mOTYl (sk: "butterfly")?
3. (f2;5;18) TO (sk: "that") JE (sk: "is") MAJNS (de: "mine")
4. (f2;5;18) NANDA tAM (sk: "there") BYVA (sk: "lives"), MAMA AUCH tAM BYVA
5. (f2;5;18) DA (de: "there") IST (de: "is") mUCHA (sk: "a fly")

Closer inspection of these examples reveals that sometimes IM used the Slovak (ex. 3 and 4) and sometimes the German (ex. 5) forms to express the meaning "there is". On the very same day, even a small multilingual grammar was noted down:

NICHt + { chrOBÁČIK (sk: "beetle") | mrAVČEK (sk: "ant") | ČMELJAK (sk: "bumblebee") }

Table 14: Interlinguistic micro-grammar.

Both parents were unaware of ever having produced such "negation in German + Slovak animate substantive" constructions. Given that it is highly unlikely that someone in IM's wider environment would have exposed her to such constructions, the sole explanation of their existence has to be sought among IM's endogenous cognitive processes. We agree with Piaget that at this stage, one among such processes can be the child's egocentricity and her tendency to playfully negate any information that comes from exogenous oracles. And this was, in IM's case, expressed notably by means of the German pivots NAJN and NICHt, whose productive affinity at this period was such that they succeeded in forming constructions even with non-Germanic words.

end intraphrastic crossovers 12.10.1.0

On the preceding pages we have presented a few cases of multilingual crossovers, i.e. crossovers between schemas embedded in distinct languages. Two main groups - intralexical and intraphrastic - were introduced in order to organize the presentation. We consider it highly plausible that besides these two types, the super-group of interlinguistic crossovers contains other types of operators as well. But instead of studying each one of them in detail, let's just close this brief discussion of bilingual acquisition with an aphorism stating that:

Of crossover and calques (APH)
If the reader has understood that the operators which we have labeled as "interlingual crossovers" could elucidate phenomena which traditional linguistics calls "calques" or even "faux amis", then the reader has understood us well.
end of crossover and calques 12.10.1.0

and now focus upon crossovers occurring not among elements of multiple languages, but among elements of one sole language.

end multilingual crossovers 12.10.1

12.11 monolingual crossovers

A monolingual crossover is a crossover between two or more input schemas which all originate from, and are extracted from, the same language L.

A schema is the most fundamental element of the theory hereby introduced. It is a template, a pattern, a sort of micro-grammar which, when embedded within a human brain or within a computational agent, can be useful for both comprehension and production. In comprehension, a schema's role is to "match" an external stimulus (e.g. a linguistic utterance). In production, e.g. when coupled with articulatory circuitry, a schema determines the process of generation and execution of a specific action (e.g. the pronouncing of a word or a phrase). Schemas themselves are composed of atomic features, and it is important to realize that, in theory, one individual schema can integrate in itself features of different types: conceptual, semantic, syntactic or morphophonologic features can all be considered constitutive elements of one individual schema S. In theory.

In spite of the fact that certain schemas S_X, S_Y can integrate in themselves "semantic" (i.e. signified) and "morphophonologic" (i.e.
signifier) components to an extent which strongly resembles the entities W_X, W_Y commonly known as "words", it would be a mistake to simply state that words are schemas and schemas are words. For it may be the case that a certain word is encoded by multiple schemas. Let's now glance at a few crossover types which indicate that this can, indeed, be the case.

12.11.1 intralexical

Given that porridge with bananas was her favorite breakfast, the word denoting banana (de: banáne, sk: banán) was among the first items in IM's lexical repertoire. Thus, at (f1;5;12) it was noted down that IM consistently used the P-schema BAJA to denote the fruit. A few months later, however, at (f1;10;8), it was observed that IM used the P-schema ANÁN to denote the same referent. A month later, as IM was still consistently using the incorrect pronunciation to ask for the fruit, the father tried to exogenously induce the correct pronunciation:

IM: ANAN, ANAN
F: banan, ba, ba, baba, banan
IM: ANAN

without success, since IM kept responding with the pronunciation ANÁN. But knowing that a few months before there used to be a period in which IM labeled the fruit with a schema correctly beginning with B, the dialogue continued:

F: baja
IM: BANAN

id est, IM pronounced a correct form which she was unable to pronounce otherwise. This pedagogical "success story" can be quite easily explained in terms of a monolingual crossover. Thus, knowing that IM used to produce the P-schema BAJA before, the father simply uttered the token which reactivated the latent schema. Subsequently, during a moment of practically instantaneous cognitive crossover, the latent schema mixed with the dominant one:

BAJA × ANÁN → BANÁN

and the correct "centroid form" of the two protoforms was obtained.

end intralexical crossovers 12.11.1

12.11.2 interlexical

It may be the case that the mind sometimes mixes together even schemas which encode different semantic contents. Thus the first case of crossover recorded by the father (f1;4;25) was the spontaneous usage of the vocative MAMI a minute or two after the reading of the book about the cat called MIMI:

MAMA × MIMI → MAMI

(cf. 12.8.2 for a description of other exogenous factors which had primed IM for the acquisition of vocatives).

Another couple of quite interesting crossovers was observed amidst IM's "eagle period".18 As was already mentioned, the word which dominated IM's production during this period was OLOL (sk: orol, en: eagle), and given the frequency of occurrence of the term in IM's production, it is undeniable that the P-schema OLOL$ was strongly activated. It may be for this reason that at (f1;10;30), IM's term for "balloon" was BALOL, which could be explained in terms of a crossover:

OLOL$ × BALÓN → BALOL

But since it could be argued that the production of the word BALOL could also be explained as an assimilation of the lateral feature by the terminating nasal consonant, and since we want to avoid confusion between causes and effects,19 let's just focus on the second case, which we consider particularly instructive. Thus it happened that at (f1;11;20), during her pre-sleep oratory, IM tried to list the names of all her kindergarten friends. But given that she forgot to mention her friend Nikol, IM's mother turned the monologue into a dialogue:

M: NIKOL
IM: KOLOL

thus producing a word which exists neither in German nor in Slovak, and doing so in a context which undeniably indicates that her communicative intention was to say "Nikol".
One could, of course, argue that IM produced such a term as a result of her avoidance of the syllable ni- in the initial position (demonstrated, for example, by her calling one of her friends KITA instead of Nikita), combined with reduplication. But if such were the whole explanation, one could hardly see why IM opted for the term KOLOL and not for *KOKOL. Thus, another force had to be at play, and we argue that it was the productive affinity of the schema OLOL$ and the subsequent crossover

OLOL$ × NIKOL → KOLOL

which took its toll.

18 During this period IM was exposed, at her own request, to dozens or potentially even hundreds of instances of the same narrative concerning the friendship of a benevolent mole and an orphaned eagle. The exposure was multimodal: IM sometimes watched the movie without commentaries, sometimes it was commented; sometimes the picture book was read, sometimes the story was narrated without any visual support whatsoever. Cf. http://wizzion.com/thesis/videos/olol.mp4

19 What was first? BALOL or OLOL?

Another interesting interlexical crossover was observed during another pre-sleep dialogue (f2;1;14). When IM was asked to describe the games she plays in her kindergarten with another kid, she answered with the word MAUEN. Given that such a word does not exist in the German language, and given that frequent usage of the terms "mahlen" (en: to draw, to paint) and "bauen" (en: to build) had been noted down already a month before (f2;0;16), it cannot be excluded that the term was the result of the following crossover:

paint and build
mahlen
bauen
MAUEN

and that, potentially, it had the meaning of both building (e.g. with lego, wooden cubes etc.) and painting (a common activity in IM's kindergarten) at the same time. If that was the case, IM's answer by means of the term MAUEN could potentially suggest that crossover may be useful not only for the explanation of the development of surface morphophonologic signifiers, but also for the explanation of much deeper semantics- and concept-related signifieds.

end interlexical crossovers 12.11.2

12.11.3 intraphrastic crossovers

Monolingual intraphrastic crossovers are operators which mix together components (e.g. morphemes) originating from different phrase-encoding schemas. Let's look at just one video,20 recorded at (f2;5;15), to see what this could mean. The video shows IM and her mother during a creative session initiated by stone-painting and terminated by the sticking of small artificial eyes onto the painted stones. Many interesting things happen in the video, including:

1. within 304 seconds, IM uses the fixed construction UNK@BLAU ("dark blue") 18 times in three "bursts"
2. at 4:21, IM produces a multilingual crossover VO (de: "where") ISt (de: "is") OKO (sk: "eye", nom. sg.); subsequently she is corrected by her mother, whom she imitates, producing the full Slovak construction ĎE jE OKO at 4:25

20 http://wizzion.com/thesis/videos/augen.mp4

In regard to monolingual intraphrastic mixing, it is already the first phrase,

ICH MAHLEN

pronounced at the 9th second, which is of certain interest. This is so because this phrase - agrammatical on its own due to the non-agreement of the pronoun (1p. singular) with the verb form (infinitive or 1p. plural) - can be understood as the result of a crossover of two grammatically correct phrases:

ich mahle × wir mahlen → ICH MAHLEN

The same holds, mutatis mutandis, for the incorrect pronunciations which came later, such as:

1. 03:08 VO ISt AUGEN? (where is eyes?)
2. 04:02 VO ISt mAJN AUGEN? (where is my eyes?)
3. 04:14 mAJN AUGEN VEG (my eyes (is) away)

Thus, all such syntactically incorrect constructions can easily be explained as the consequence of a crossover between correct forms which the child could easily have heard in her environment. For example:

wo ist auge? × wo sind augen? → VO ISt AUGEN?

This being said, we feel no need to spam the reader with other instances of such "monolingual intraphrastic" crossovers, produced by IM aplenty since ca. 2 years of age. Instead, we conclude with yet another aphorism:

Of crossover and overgeneralizations (APH)
If the reader has understood that the operators hereby labeled as "monolingual intraphrastic crossovers" could elucidate the phenomena which developmental linguistics labels as "overgeneralizations", then the reader has understood us well.
end of crossover and overgeneralizations 12.11.3.0

In other words, the notion of "monolingual intraphrastic crossover" can be a useful conceptual aid for anyone aiming to explain the problem of over-generalization or to construct a theory thereof.

end intraphrastic crossovers 12.11.3.0

Many among the above-mentioned cases of monolingual crossover were triggered, induced or even primed by an exogenous event (i.e. a parent asking or saying something). The detection of crossover forms of purely endogenous origin is much more complicated: it is easier for a parent to believe that the child speaks agrammatical and meaningless gibberish than to admit that the toddler communicates meanings to which he (the parent) no longer has access. For this reason we have restricted, with the exception of MAUEN, this introduction to purely surface crossovers between morphophonologic and syntactic P-schemas. To go deeper would be too speculative. This being said, we conclude this introduction with the remark that hearing or seeing a child produce a monolingual crossover is, verily, a revelatory event: it is as if, for a brief moment, one had indirectly regained access to the realm of long-forgotten knowledge.

end monolingual crossovers 12.11.3

On the preceding pages, we have used the term "crossover" to denote operators, acting within a cognitive system, which are able to yield a new child schema by means of the mixing of multiple parent schemas. It was tacitly indicated that the existence of many phenomena, including linguistic calques, creole languages or overgeneralizations, could be explained in terms of the activity of such operators in the human brain. The following table recapitulates the basic distinctions:

                Multilingual                 Monolingual
Intralexical    PIJE + TRINKEN = PIJEN       BAJA + ANÁN = BANÁN
Interlexical    ??? (difficult to assess)    BAUEN + MAHLEN = MAUEN
Intraphrastic   VO ISt OKO? (calques etc.)   VO ISt AUGEN? (overgeneralizations)

Table 15: Recapitulation of crossover types observed in IM's production.

Due to the abstract nature of the "operator" entities it aims to organize, this taxonomy can rightfully be criticized as both crude and arbitrary. Thus, for example, the distinction between multilingual and monolingual could be considered arbitrary by anyone asserting that the child is exposed to a multilingual linguistic environment (composed of, for example, motherese, fatherese, teacherese etc.) even in the case when all members of the social environment speak the same dialect.
The distinction between intraphrastic and intralexical could also be attacked on the sole ground that in many morphosyntactically rich languages, the very nature, or even the existence, of the distinction between the notions of lexeme and phraseme is not as straightforward as it may seem. But be that as it may, such theoretical hassles are of little use for the phenomenological objectives of this chapter. Aiming to stay as faithful as possible to our initial method of describing but not categorizing, we cast this taxonomy aside as secondary and specify that all the above-introduced ^int- terms21 were introduced and categorized not because we would be 100% sure that such operators indeed materially operate within the human brain, but because we hope that their introduction could potentially stimulate or even facilitate further discussions. One such discussion, concerning the assessment of crossover-like phenomena in the CHILDES corpus, shall soon be introduced.

21 Consistently with the syntax of PCRE, we shall use the symbol ^ to denote the initial position. The expression ^int hence matches all expressions prefixed by the trigram int.

end crossovers 12.11

12.12 other phenomena

Many unexpected and surprising events occur during such a complex and years-lasting process as language development definitely is. But since many among these phenomena are already exhaustively described in the literature, let's just briefly describe two observations which were, in a certain sense, "salient":

12.12.1 multilingual c-scheme mismatch

The journal log entry (f2;5;18) describes an interesting dialogue which happened one afternoon after the mother picked up IM from kindergarten:

FAT: ako bolo v (sk: "how was it in") Kite? (sk. locative of the German word meaning kindergarten)
IM: MAMA ABHOLEN (de: "mother pick up")
FAT: ako bolo v Kite?
IM: MAMA ABHOLEN
FAT: ako bolo v Kite?
IM: MAMA ABHOLEN

At first sight it is somewhat difficult to see why IM responded, three times in a row, with the answer "mama picked me up" to the question "how was it in kindergarten?". The thing, however, gets more lucid when one realizes that the sequence "ako bolo" and the sequence "abholen" have certain morphophonologic features in common. In other terms, both can in fact be matched by the following C-scheme:22

a.*?bh?olo?

22 The C-scheme is a valid Perl regular expression which matches both strings "ako bolo" and "abholen". In such regexps, the symbol "." matches any possible symbol, the symbol "*" means "match zero or more occurrences of the preceding element" and the symbol "?" means "match zero or one occurrence of the preceding element".

Given that it is evident that the notions associated with the event of "being picked up from kindergarten" are, within the child's mind, definitely more important than smalltalk questions about past events; and given that the father's question was terminated with the term "Kita", which was practically the only attribute of the term "abholen" to which IM was exposed on a frequent basis, IM's thrice-repeated answer was neither nonsense nor surprising. On the contrary, it was a meaningful and true answer to a question which her C-schemes processed as a question meaning something like "who picked You up from kindergarten?". Hence, not the term "slip of the tongue" but rather "slip of the ear" could be used to describe such a phenomenon.

We consider this case of multilingual perceptive parapraxis to be of particular interest because it can potentially result in a method, or even a set of experiments, allowing one to elucidate the problem of the development of C-schemas, which is, contrary to the development of directly observable P-schemas, quite difficult to empirically measure and assess.
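The C-scheme can also be verified mechanically. The following short Perl sketch (an illustration added here, not one of the thesis scripts) confirms that the regexp indeed matches both strings:

    #!/usr/bin/perl
    # Verify that the C-scheme matches both "ako bolo" and "abholen":
    # in "ako bolo", h? and the final o? match zero occurrences;
    # in "abholen", .*? matches the empty string and o? matches nothing.
    use strict;
    use warnings;

    my $cscheme = qr/a.*?bh?olo?/;
    for my $utterance ('ako bolo', 'abholen') {
        printf "%-10s : %s\n", $utterance,
            $utterance =~ $cscheme ? 'matched' : 'not matched';
    }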
end c-scheme mismatch 12.12.1

12.12.2 compression of information

Another interesting phenomenon was observed at (f2;6;18), at the onset of a period of increasing phrasal productivity. During a trip through the forest with another family with the following members: H = father, M = mother, J = older son, T = younger son, IM enumerated the list of people who should go home in the following manner:

ALLE NACH HAUZE (de: "all home")
AUCH T, M AUCH (de: "also T, M also")
PAPA AUCH J (de: "father too J")

What is striking is the last sentence, which was uttered after a few seconds of silence during which IM apparently tried to remember the name of J's father (i.e. H). Since she could not remember it (or avoided its pronunciation), she ultimately found her way out by producing

PAPA AUCH J

which, in correct German, would have to be the 6-word "auch J und auch sein papa". But not caring much about the correct rules of grammar, which would oblige her to articulate sentences twice as long as necessary, IM expressed the same communicative intention with three words only. Hence, at least in this case, the "optimizing" forces inviting her to express her intention with as few resources as possible were definitely stronger than the socializing and normative forces obliging her to produce only grammatically correct constructions.

end compression of information 12.12.2

In this manner we could continue, and discuss one entry of the observation log after another. For example, we could discuss not only IM's pre-sleep monologues, but also mention productions which she used to cry out of her sleep, or uttered immediately after waking up. We could focus on one meaning and describe the development of the labels which IM used to denote it. Or, as was the case when discussing the DING-DONG mystery, we could focus on one label and describe the development of its meanings. Or we could list IM's first adjectives, questions and syllogisms. Or publish digital versions of the observation log as well as of all other recorded materials. But given the present lack of IM's conscious and reflected consent to the publication of her personal data, we think it is now time to conclude this chapter dedicated to the development of this particular child.

end other phenomena 12.12

f(2;4;6)
ICH HABE AJN HUND
HABE AJN HUND
HABE AJN HUNDI
HABE AJN HUND
LA LA LA
DA BAUEN DA BAUEN
JA DA BAUEN TATO

end qualitative 12

13 quantitative

13.1 method

The method of the previous chapter mainly consisted of observations and interpretations thematizing the structures produced by one individual toddler. But knowing that science should always aim to unveil not only the individual and specific, but also, and especially, the universal and generic, a hard-core empiricist could rightfully reproach us that what was presented until now was maybe cute, but it was not science. Thus, in order to close and complement this methodological gap, all the effort presented in this chapter shall be subordinated to the two ultimate virtues of the Cartesian method. They are, of course:

1. reproducibility
2. quantification

Reproducibility is to be attained by the exact specification of the input data and by the publication of the computational machinery which transforms the data into information, or even knowledge.
More concretely, every analysis shall include a list of the corpora which were analysed, as well as the bash|PERL|R code which performed the analysis. Thus, instead of using traditional logico-mathematical formalisms, another formalism - a less theoretical and more practical one - shall be used: that of PERL and its regular expressions (regexps). When it comes to quantification, it shall be exactly the use of regexps which shall allow us to transform texts into numbers. By using regexps, which are in their essence nothing else than strings of characters able to match sets of strings of characters, it should be possible to identify, detect and measure the frequencies of occurrence of quite abstract patterns or schemas.

Summa summarum, the method of this chapter shall mix a little bit of data-mining with a little bit of statistics and information extraction in order to attain the goal commonly known as "knowledge extraction".

end method 13.1

13.2 data

The «Child Language Data Exchange System (CHILDES)» (MacWhinney, 2014) is undoubtedly the biggest publicly accessible collection of both recordings of child speech and their transcripts. Since its foundation in 1984 by Brian MacWhinney and Catherine Snow, CHILDES has attracted the interest of thousands of researchers from all over the world and has thus become the most important dataset for the nascent DP discipline. Given its open yet standardized design, CHILDES contains hundreds of megabytes of transcripts representing children's verbal productions and interactions in more than two dozen world languages. What's more, some of these transcripts include morphosyntactic annotations and/or audiovisual recordings which allow a more thorough contextualization of the otherwise pure-text transcripts.

Note, however, that not all transcripts downloaded from the site of the CHILDES project1 shall be analysed. Primo, both the directory "Frogs" and the directory "PhonBank-Phon" are to be removed from the workbench, since they do not contain .CHA transcripts made "in vivo". Secundo, all transcripts of children whose age is higher than the upper bound of toddlerese (i.e. >30 months) are also excluded from the analysis. This can be done by running the agesort.pl2 script, whose main functionality, however, is to divide the transcripts into two datasets:

1. PROTOTODD - the "prototoddlerese" dataset contains transcripts of children not older than 16 months
2. TODDLER - the "toddlerese" dataset contains transcripts of children between 16 and 30 months

1 $ wget -r --no-parent http://childes.psy.cmu.edu/data/
2 http://wizzion.com/thesis/code/childes/agesort.pl

The transcripts contained in the two datasets thus obtained follow the .CHA format, which stipulates that:

1. lines with child-originated speech are marked with the token *CHI
2. lines with mother-originated speech (motherese) are marked with the token *MOT
3. lines with father-originated speech (fatherese) are marked with the token *FAT

and in all the tables which follow, we shall apply the same CHI|MOT|FAT notation to denote the child, resp. the mother or the father. The distribution of the different line types is presented in Table 16.

           CHI        MOT       FAT
PROTOTODD  2248553    320454    13974
TODDLER    1453931    893357    154964

Table 16: Activity of different speakers in two age groups.

Every line of the .CHA file roughly represents a distinct and unique utterance. Thus, Table 16 suggests a first distinction between the two age groups: in the prototoddler period mothers in general produced 42% more utterances than children, while the ratio was more than inverted in the later group.4

4 The ratio 1453931/893357 ≈ 1.614 is quite close to the number φ ≈ 1.618, better known as the "golden ratio" or "golden section". Sapienti sat.
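The raw numbers behind Table 16 can be obtained with a few lines of Perl. The following sketch is our own illustrative re-implementation, not the published agesort.pl (whose actual logic may well differ), and the directory name toddler/ is an assumption:

    #!/usr/bin/perl
    # Count speaker-marked lines (*CHI, *MOT, *FAT) in a directory of
    # .CHA transcripts.
    use strict;
    use warnings;

    my %lines;
    for my $file (glob 'toddler/*.cha') {          # directory name is illustrative
        open my $fh, '<', $file or die "$file: $!";
        while (<$fh>) {
            $lines{$1}++ if /^\*(CHI|MOT|FAT):/;   # speaker-marked lines only
        }
        close $fh;
    }
    printf "%s %d\n", $_, $lines{$_} // 0 for qw(CHI MOT FAT);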
In comparison to mothers, fathers seem to play only a marginal role within both datasets; their presence, however, seems to be significantly higher in the case of the older group (FAT_PROTO/MOT_PROTO ≈ 4.3%, FAT_TODD/MOT_TODD ≈ 17.3%).

end data 13.2

13.3 universals

This section offers an analysis of CHILDES transcripts coming from different languages. Table 17 shows the number of distinct .CHA files (transcripts) which are to be analysed, as well as the languages in which they were spoken.

           ara  deu  eng   fra  jpn  rum  spa  tha  biling  other
PROTOTODD  54   176  1026  152  42   53   56   142  31      107
TODDLER    25   591  2505  410  235  42   140  46   801     1063

Table 17: Repartition of languages in the studied corpus.

It is thus evident that, all in all, CHILDES is strongly biased towards Indo-European languages in general and English in particular. This bias notwithstanding, we shall, in the following analyses, throw all the data into one bag, as if we were studying one sole language.

13.3.1 letters

Let's now run the script6 performing the most simple analysis possible, i.e. the measurement of the frequencies of occurrence of diverse graphemes (i.e. letters) within the utterances produced by children and their parents. This yields the distributions presented in Table 18.

6 http://wizzion.com/thesis/code/freq_1gram.pl

CHI-PROTO   CHI-TODDL   MOT-PROTO   MOT-TODDL    FAT-PROTO   FAT-TODDL
a 32187     a 108278    e 400499    e 1448021    a 21516     a 208869
e 21151     e 103443    o 371032    a 1220454    o 12563     e 195921
o 19400     o 700249    a 344629    t 1101875    n 10394     o 155729
h 17569     n 665280    t 319301    o 1069481    e 10258     t 132600
m 12472     i 654456    h 252510    i 865696     i 9571      n 127344
t 11557     t 611047    n 233620    n 853893     u 8399      i 126082
d 10969     h 482184    i 228587    h 801306     t 8063      s 107667
u 10949     s 438493    s 194898    s 755952     h 7273      h 87652
n 10668     r 383941    u 173597    r 607236     r 5230      r 85586
i 10068     m 364861    r 160583    u 532238     k 5191      u 85416
b 8768      u 338297    y 142311    l 501064     s 4995      l 68172
p 7200      d 323376    l 137550    d 450710     m 4689      d 64320
y 6754      l 310237    m 109696    m 351217     j 4071      m 52445
l 6704      c 234078    d 106698    y 345154     d 3876      y 44005
r 6501      k 226683    w 94381     c 290880     p 3683      c 41410
c 6163      p 192730    g 85473     g 282157     l 3555      k 39026
s 5589      g 189097    k 77850     w 280615     w 2825      g 35873
â 5328      y 186999    c 76826     k 234218     y 2436      p 34373
g 5233      b 184783    p 72531     p 216647     c 2379      w 30227
k 4581      w 138086    b 67397     b 191091     b 1632      b 26719
w 3857      j 107099    f 33203     f 123552     g 1581      v 18730

Table 18: 20 most frequent graphemes according to speakers and age groups.

It can be seen that no matter the speaker and no matter the age group, the vowels A, E and O are always among the four most frequent entities. But a closer inspection of the data can lead to the discovery of certain interesting developmental phenomena occurring between the groups whose contrast interests us the most, that is, CHI-PROTO and CHI-TODDL. It can thus be seen that the utterances of children not older than 15 months are dominated by occlusive consonants (H, M, T, D, N, P), and that other types of consonants, like fricatives (S), trills (R) or laterals (L), attain more dominant positions only in the later period.

A particularly instructive case seems to be the decrease in the ranking of the labionasal occlusive M. While this consonant is the 5th most
Given that all four FAT and MOT distributions consistently place M at rank 12 or 13, the phenomenon of "decrease of importance of M" - and in lesser extent also of P and B - can be potentially explained in terms of divergence from certain potentially innate labiotactic schemata (c.f. 12.5.1) and gradual convergence towards more socially determined articulations. We leave to readers’s ingenuity detection and discussion of other phenomena presented by the table, including mother’s preference for the vowel E and father’s and children’s preference for the vowel A. end letters 13.3.1 13.3 universals 13.3.2 213 n-grams Let’s now focus on distributions of N-grams, that is, the sequences of N letters. Since we have already presented the distribution of letters which is equivalent to distribution of 1-grams, Table 19 presents the distribution of 2-grams (bigrams) as assessed in 7697 transcripts which our script7 had analysed. Table 19: 20 most frequent bigrams according to speakers and age groups. CHI PROTO MOT TODDL PROTO FAT TODDL PROTO TODDL ’a_’ 6168 ’e_’ 341873 ’e_’ 167116 ’e_’ 562814 ’a_’ 6085 ’e_’ 72319 ’e_’ 5487 ’a_’ 273927 ’t_’ 108290 ’t_’ 383994 ’n_’ 4173 ’a_’ 53852 ’^ a’ 4688 ’n_’ 191424 ’_t’ 92762 ’s_’ 335865 ’e_’ 3722 ’s_’ 42541 ’^ b’ 4089 ’t_’ 173103 ’s_’ 89654 ’_t’ 311683 ’aa’ 3242 ’t_’ 39194 ’^ d’ 4072 ’o_’ 157271 ’th’ 89035 ’th’ 266289 ’oo’ 2663 ’n_’ 36965 ’^ m’ 3744 ’s_’ 153264 ’he’ 76868 ’he’ 246986 ’j_’ 2660 ’_t’ 34962 ’y_’ 3699 ’_t’ 123289 ’ou’ 73519 ’n_’ 243248 ’i_’ 2642 ’o_’ 34081 ’h_’ 3660 ’er’ 120991 ’n_’ 66505 ’a_’ 220702 ’o_’ 2448 ’an’ 25338 ’ma’ 3649 ’an’ 118561 ’re’ 57772 ’ha’ 198677 ’_t’ 2233 ’ d’ 24430 ’ah’ 3451 ’i_’ 116078 ’ha’ 57707 ’ou’ 186602 ’_n’ 2194 ’ a’ 24018 ’da’ 3085 ’he’ 113844 ’a_’ 57042 ’o_’ 183667 ’_m’ 2022 ’er’ 23690 ’n_’ 3072 ’in’ 111541 ’yo’ 55805 ’_d’ 182416 ’aj’ 1990 ’ s’ 23146 ’oo’ 3024 ’th’ 102953 ’y_’ 53855 ’er’ 180576 ’an’ 1981 ’i_’ 22763 ’ba’ 2716 ’_a’ 96622 ’an’ 51556 ’ a’ 174639 ’uu’ 1933 ’th’ 22579 ’h_’ 263 ’_d’ 94794 ’u_’ 50789 ’an’ 170761 ’t_’ 1880 ’he’ 22503 ’an’ 2361 ’ch’ 94477 ’er’ 49019 ’re’ 170253 ’_k’ 1866 ’r_’ 22353 ’o_’ 2357 ’_m’ 94073 ’at’ 48593 ’in’ 167113 ’u_’ 1813 ’ha’ 21978 ’^h’ 2289 ’h_’ 93204 ’o_’ 46667 ’at’ 160793 ’ha’ 1758 ’u_’ 21764 ’de’ 2254 ’ma’ 92728 ’_a’ 44251 ’ i’ 159637 ’ p’ 1712 ’ou’ 20780 ’^p’ 2219 ’en’ 92239 ’_y’ 43725 ’r_’ 158742 ’ii’ 1574 ’re’ 20519 ’t_’ 2188 ’ha’ 91373 ’on’ 41773 ’_s’ 156205 ’th’ 1574 ’en’ 19847 ’at’ 2056 ’r_’ 89388 ’_s’ 41663 ’u_’ 147883 ’na’ 1555 ’on’ 17843 In our notation, symbol ^ means "beginning of utterance" and symbol _denotes the pause between the words (normally denoted by a simple blank space) and is understood as a symbol in its own right. In general, vowels A and E at the ultimate word position tend to dominate the lists but in case of the group which interests us most, i.e. CHIPROT O they are followed by a group of bigrams denoting either vowel A or occlusives B, D, and M (and somewhat later also H and P( occuring at the initial position of whole utterance. 7 http://wizzion.com/thesis/code/freq_2gram.pl 214 quantitative It is also worth noting that for this group, the most frequent bigrams having the consonent-vowel (CV) syllabic form are MA, DA and BA and bigrams following the VC form are AH, AN and AT. We consider these findings as consistent with both data commonly reported in DP litterature, as well as with qualitative observations of IM’s first protowords (c.f. Table 10 like MAMA, DADA or BABA. 
As usual, we set aside other potentially interesting questions, like "is the predominance of the long vowels AA, OO, UU, II in prototoddler-directed fatherese a sheer artefact of the corpus8, or do these results point to a somewhat more profound phenomenon?", and hereby direct the reader's attention to Table 20, the output of the scripts assessing the frequencies of 3-grams.

8 These long vowel sequences seem to originate, to a great extent, from transcripts of Japanese and Tamil fatherese.

CHI-PROTO    CHI-TODDL    MOT-PROTO    MOT-TODDL     FAT-PROTO    FAT-TODDL
'^ba' 1692   '_th' 57081  '_th' 58738  'the' 140513  'aa_' 1994   '_th' 187369
'^ma' 1629   'er_' 51736  'you' 54163  'you' 127201  'aj_' 1990   'the' 140513
'mam' 1619   'en_' 51386  'the' 41636  'hat' 112441  '_th' 1151   'you' 127201
'ah_' 1448   'the' 51022  'ou_' 40205  're_' 104532  'ii_' 1148   'hat' 112441
'^da' 1428   're_' 48700  '_yo' 38424  '_yo' 100048  'an_' 1074   're_' 104532
'ama' 1323   '^ja' 41974  're_' 35344  'he_' 98865   'oo_' 1016   '_yo' 100048
'det' 1145   'in_' 40925  'hat' 34041  'ou_' 97958   'on_' 1016   'he_' 98865
'dad' 1118   'no_' 40168  'at_' 30480  'at_' 96754   '_ma' 980    'ou_' 97958
'aa_' 1030   '^no' 39940  'he_' 28589  'is_' 76056   '_na' 823    'at_' 96754
'et.' 998    'her' 39541  'her' 28009  'her' 74624   'aw_' 819    'is_' 76056
'^ah' 971    'ne_' 38460  'ere' 25958  'er_' 74623   're_' 805    'her' 74624

Table 20: 10 most frequent trigrams according to speakers and age groups.

In general, it can be stated that the trigram-related phenomena seem to extend quite naturally the phenomena which were already observed and discussed in relation to bigrams. Word-onset syllables BA, MA and DA thus dominate the prototoddlerese list. But since these trigrams are not fully qualified (they contain the meta-character ^), it can be stated that the most frequent trigrams with equally trigramic phonemic correlates are MAM, AMA, DET and DAD.

In the later period, i.e. in the CHI-TODDL transcripts, one can observe a bias towards the distribution of standard English, marked, of course, by the dominant position of the graphemic trigram (and phonemic bigram) denoting the most frequent word of the English language, the determiner THE. This bias notwithstanding, the word-onset syllables ^JA and ^NO appear at the highest positions of the list, for a reason which we can briefly elucidate only in a footnote.9

9 Execution of grep -P "CHI:\tno" ./toddler/* indicates that within the corpus of later toddlerese, the high frequency of ^no is principally caused by the augmentation of the child's egocentric tendency to answer questions in the negative. In the case of ^ja, execution of the command grep -P "CHI:\tja" ./toddler/* indicates that the situation is complicated by the fact that the grapheme J denotes different phonemes in different languages (compare "jagen" in German with "Jack" in English or "jagami" in Sanskrit). This complication notwithstanding, it seems that the high frequency of ^ja can, to a non-negligible extent, be explained in terms of the Balto-Slavo-Germanic "yes". Thus, for example, the sole transcript 11312/c-00023045-1 shows how the small German boy Leo answered 104 times with the word JA.

Leaving again the question of fatherese aside as a problem of its own, let's now look at motherese. In general, both distributions indicate that the corpus was strongly biased towards English. Thus, the obligatory THE is present (as well as its fragment _TH, preceded by the pause), as are the trigrams HAT and ERE, owing their high ranks to the highly frequent words what/that, resp. where/there/here, within which they occur. What is striking, however, is the position of the trigram YOU. While in frequency lists generated from "standard English" corpora10 the word You is the 17th most frequent and occurs ≈ 9.3 times less often than the most frequent word THE, in speech directed to younger infants it is the trigram You which dominates the list of fully qualified trigrams, occurring 1.3 times more often than THE!

10 https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/1-10000

Among all the phenomena observed until now, we consider the mothers' tendency to say You to be the most salient example of what we consider to be the very essence of motherese.
end n-grams 13.3.2

13.3.3 intrasubjective replications

It has already been repeated in multiple places (5, 12.6) that the interpretation of "repetition of information" as a sort of "replication of information" is one among the main tenets of the theory hereby presented. Thus, let's now try to assess the extent to which children repeat their own productions.

Intralocutory duplications

Intrasubjective duplications can be detected by searching for the repetition of a sub-string X within the enveloping utterance-string U. If X is a bigram, this can easily be done by matching the utterance with the regexp

$U =~ /(.{2})\1/g

and for any duplicated sub-string at least two characters long, the matching pattern is

$U =~ /(.{2,})\1/g     (3)

Note that these patterns match only adjacent repetitions, id est such cases where the two instances of the repeated substring are juxtaposed side by side. The script11 confronting the second pattern (i.e. length(X) >= 2) with child-produced utterances yields the outputs presented in Table 21.

11 http://wizzion.com/thesis/code/isipr.pl

CHI-PROTO     CHI-TODDL     MOT-PROTO    MOT-TODDL
ma 1117       ma 11756      ma 1266      ma 2633
pa 294        pa 4545       is_ 733      is_ 2408
da 290        ko 2970       bye 696      em 1904
bye 142       is_ 1651      no_ 547      mm 1338
an 140        ba 1412       it_ 532      pa 1283
ba 136        la 908        na 523       it_ 1177
ta 76         an 764        da 443       e_ 963
na 75         da 730        mm 382       na 820
ah 65         no_ 647       ba 336       a_ 700
woof_ 64      ta 616        an 254       an 692
uh 60         na 600        em 197       ba 505
open_ 59      do 588        boo 177      to_ 468
mommy_ 57     be 580        ing 167      ko 435
cou 53        bye 552       uh 160       no_ 384
vov 48        e_ 536        nyan 158     in 374
mm 45         bo 535        man 157      ing 349
ga 41         pi 468        ha 147       bye 328
he 40         ca 430        pa 143       er_ 319
no_ 40        in 387        nai 135      cher 311
book_ 39      cha 372       to_ 134      li 293
ha 39         ka 344        cou 132      _we 290

Table 21: Duplicated expressions and numbers of child-originated and child-directed utterances in which they occur.

Postponing the discussion12 of the specificities of the data hereby presented to a later date, let's focus on a scientifically more pertinent fact: the overall statistics of duplications. These are shown in Table 22, whose values were calculated by normalization by means of the formula

P(duplication) = ALL_matching / ALL_utterances

whereby ALL_matching denotes the number of utterances produced by CHI (resp. MOT) matchable by the regexp presented in Formula 3, and ALL_utterances denotes the number of all utterances uttered by the person.

12 "Mothers do not say woofwoofwoof as babies do, mothers say manmanman."
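Formula 3 translates into a complete, runnable filter in just a few lines. The following sketch is our own illustration, not the published isipr.pl (whose internals may differ); it estimates P(duplication) for the child-produced lines of the transcripts passed on the command line:

    #!/usr/bin/perl
    # Estimate P(duplication): the share of *CHI utterances containing an
    # adjacently duplicated substring of two or more characters (Formula 3).
    use strict;
    use warnings;

    my ($utterances, $matching) = (0, 0);
    while (my $line = <>) {
        next unless $line =~ /^\*CHI:\s*(.+)/;     # swap CHI for MOT to assess motherese
        $utterances++;
        $matching++ if $1 =~ /(.{2,})\1/;          # adjacent 2+gram duplication
    }
    printf "P(duplication) = %.3f\n",
        $utterances ? $matching / $utterances : 0;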
        CHI     MOT
PROTO   0.041   0.086
TODDL   0.066   0.058

Table 22: Probability that an utterance shall contain at least one adjacently duplicated 2+gram.

This table indicates that intralocutory duplication is most probably to be observed in motherese directed to younger children. Younger children, on the contrary, tend to produce fewer adjacently duplicated sequences13. In the later period, however, they tend to replicate, within one utterance, the fragments of their production more frequently than their mothers.

end intralocutory replications 13.3.3.0

Translocutory replications

Let's now focus on reproduction observed not within one individual utterance, but between two adjacent utterances. Given the speaker S who utters U1 before uttering U2, one can look for replication of patterns between U1 and U2 by simply

1. creating a new datastructure, a "couplet", which concatenates the two utterances and the divisor symbol #, i.e. couplet = concatenate(U1, #, U2)
2. matching the couplet with a regex like

$couplet =~ /(.{3,}).*#.*\1/g    (4)

and this is exactly what is being done by the 3rd line of the script14 whose outputs are in part presented in Table 23. Note that in contrast to Formula 3, the regexp in Formula 4 contains the expression {3,} and not {2,}. This means that in this analysis, we have been looking for repeated strings of three or more characters (3+grams). This design choice was made in order not to pollute the results with repeated bigrams, among which many (e.g. "th", "ch") represent in many languages just a sole phoneme, and whose repetition is thus highly probable. Other design choices are, of course, possible.

13 Or transcribers do not transcribe them as such.
14 http://wizzion.com/thesis/code/isitd.pl
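Again, a minimal sketch of ours (not the original isitd.pl) may illustrate the procedure; it assumes that the input contains the utterances of one single speaker, one per line and in temporal order:

#!/usr/bin/perl
# minimal sketch: probability that two adjacent utterances of one speaker
# share at least one 3+gram
use strict; use warnings;
my ($replicating, $couplets) = (0, 0);
my $previous;
while (my $utterance = <STDIN>) {
    chomp $utterance;
    if (defined $previous) {
        my $couplet = $previous . '#' . $utterance;   # divisor symbol #
        $couplets++;
        # a 3+gram occurring before the divisor must reoccur after it
        $replicating++ if $couplet =~ /(.{3,}).*#.*\1/;
    }
    $previous = $utterance;
}
printf "P(replication) = %.3f\n", $replicating / $couplets if $couplets;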
Table 23: Most frequent translocutory 3+grams.

CHI PROTO MOT TODDL PROTO TODDL det. 332 ja. 8160 you 8698 you 19229 kore. 223 no. 4461 the 3809 the 15302 maman. 210 the 3328 that 1554 what 557 mama. 182 da. 3071 here 1531 that 441 eh. 174 yeah. 2870 what 908 here 300 baby. 162 ein 1979 and 776 ing 2423 ball. 126 that 1670 t’s 627 and 2088 no. 121 aa. 1655 look 550 t’s 2056 daddy. 104 nein. 1536 ing 520 das 1612 aa. 102 en. 1535 there 393 there 1437 papa. 89 there. 1363 that’s 382 she 1399 mommy. 84 here 1190 her 366 her 1323 ooh 81 das 1099 your 344 that’s 1 up. 76 die 1088 where 338 tha 1179 dah 75 der 1006 come 325 ein 1093 ah. 73 this 973 see 300 ich 1091 da. 73 and 883 one 272 est 1069 dog. 67 you 830 yeah 261 der 1030 dada. 62 yes. 772 no. 249 one 1020 uhoh. 61 ich 698 n’t 1005

Some interesting phenomena pop up here. In general, the table is populated by deictic pronouns15, determiners, answer particles and various forms of positional adverbs. In motherese, an injunction to action appears from time to time in the form of a verb ("look", "come", "see"). And, of course, it is very probable that if the current motherese utterance contains the word "you", the next utterance shall contain it as well. The presence of motherese expressions like "t's" and "n't" also suggests the occurrence of first variation sets (that's vs. it's, isn't vs. don't etc.)

What's more, one can see quite clearly a distinction between the language of younger and older children. While the distribution of translocutory duplications of older children is quite similar to motherese16, this is in no way the case for younger children. Repetition of "abstract" deictics is quite rare and seems to be limited to a few particular children like Jens and Hiromi. On the other hand, the list of repeated tokens is dominated by words denoting concrete persons and objects ("maman", "mama", "baby", "ball", "daddy", "papa", "mommy", "dog") and particles with undefined content ("eh", "aa", "ooh", "ah", "uhoh") potentially referring to emotional states. Even the adverb/preposition "up" is present, sometimes probably serving the function of the injunction "raise me up!" or "look up!".

This being said, let's now look at the overall statistical properties of the distributions thus obtained:

        CHI    MOT
PROTO   0.08   0.37
TODDL   0.28   0.38

Table 24: Probability that both parts of an utterance couplet shall contain at least one identical 3+gram.

A significant increase in the amount of translocutory replications is observed when one compares the data of younger and older children. This is consistent with what was observed in the case of intralocutory duplications (Table 22), but here the phenomenon is even more marked. Motherese, on the contrary, seems to keep the property of repeating a 3+gram slightly more often than once in three utterance couplets.

end translocutory replications 13.3.3.0

Many minor phenomena aside, the preceding subsubsections have briefly shown:

1. a fast17 and frugal18 regexp-based method of extraction of repetitive patterns from huge corpora
2. that the language of children younger than 15 months contains fewer intralocutory resp. translocutory replications of 2+ resp. 3+gram sequences than the language of older toddlers

This being said, let's now focus on replication of structures which is to be observed not in and/or between utterances produced by one speaker but in utterances produced by multiple speakers.

end intrasubjective replications 13.3.3

15 DET is the Danish deictic meaning "that" and KORE is the Japanese deictic meaning "this". Transcripts of the Danish children Anne (e.g. 11312/c-00021705-1) and Jens (e.g. 11312/c-00021750-1) and of the Japanese girl Hiromi (e.g. 11312/c-00009753-1) seem to be to a great extent "responsible" for the high scores of these words.
16 The most salient exception to this being the tendency to repeatedly utter "ja." or "no.".
17 All presented analyses were performed in less than a minute on one single 2.5GHz core.
18 All scripts are shorter than 42 lines of pure PERL, including loading the corpora, cleaning them of metadata and most salient noise, parsing and printing the result.

13.3.4 intersubjective replications

Intersubjective replication is equivalent to imitation. It is observed if and only if two distinct subjects produce the same construction in a very limited timespan. To make things simple, this section shall be concerned only with detection of the most trivial intersubjective replications: those which immediately follow each other.

PROTO CHI-INIT | PROTO MOT-INIT | TODDL CHI-INIT | TODDL MOT-INIT
ball 74 | ball 42 | the 2038 | the 3058
baby 68 | baby 40 | that 1534 | here 1731
daddy 50 | here 33 | here 1045 | that 1175
up. 47 | det 26 | you 969 | you 997
guh 40 | apple 21 | no. 895 | ing 993
det 36 | byebye. 21 | what 764 | what 539
dada 33 | the 20 | yeah. 528 | one 502
more 33 | daddy 20 | ing 492 | there 447
that 30 | that 19 | there 466 | ein 436
byebye. 29 | down 19 | das 463 | and 361
book 29 | open 19 | ein 444 | t's 333
hi. 28 | mommy 18 | want 439 | das 314
water 25 | hi. 17 | one 348 | ich 285
car 25 | dada 17 | ja. 346 | det 261
down. 24 | book 16 | t's 342 | der 240
block 23 | block 16 | and 328 | want 234
open 23 | big 16 | non 327 | this 230
mama. 23 | guh 15 | yeah 326 | que 216
bottle 22 | you 15 | daddy 324 | est 209
big 20 | water 14 | hat's 303 | can 203
non 20 | car 14 | det 303 | die 184
uhoh. 19 | boo 14 | baby 275 | c'est 161
no. 18 | dad 13 | oh. 274 | see 158
agu 18 | bye. 13 | her 262 | it's 155
backpack 17 | okay 12 | where 245 | oh. 153
duck 17 | bye 12 | there. 230 | den 149
doggie. 17 | and 12 | mhm. 226 | they 144
apple 17 | uhoh. 12 | ich 219 | baby 138
here 16 | duck 12 | that's 217 | that's 138
dirty 16 | sticky 11 | can 213 | car 138

Table 25: Most frequent words replicated from child to mother (CHI-INIT) and mother to child (MOT-INIT).

Technically, the extraction is performed by means very similar to those which extract intrasubjective translocutory replications (c.f. previous section). The only difference being, of course, the origin of UTT1 and UTT2. While in the detection of intrasubjective replications UTT1 and UTT2 are uttered by the same person, in the case of intersubjective replications it cannot be so, and the additional condition speaker(UTT1) ≠ speaker(UTT2) has to be implemented in the code. Another thing which is to be carefully considered is the identity of the person who initiated the replication (i.e. uttered UTT1) in contrast to the identity of the person who reacted (i.e. uttered UTT2). On the following pages these shall be distinguished by the attributes INIT resp. REACT.

Listings generated by the script19 implementing such considerations are presented in Table 25. They indicate, among other things, that

• entities intersubjectively replicated and shared between mothers and younger toddlers tend to denote concrete physical referents ("ball", "baby", "book", "water", "car", "mama", "bottle", "backpack", "doggie", "apple", "block"), their properties ("big", "dirty", "sticky") or directions along the vertical axis ("down", "up")
• entities intersubjectively replicated and shared between mothers and older toddlers tend to encode more abstract linguistic entities (deictic pronouns, locative adverbs) as well as basic syntactic constructions ("that's", "c'est", "it's")
• children initiate exchanges about different "topics" than mothers do20

Overall statistical properties assessed by the script are presented in Table 26. These are: the number of couplets in which a child utterance precedes or follows the mother utterance (N_C); the number of couplets which have at least one 3+gram in common (N_R); and the probability that a MOT-CHI or CHI-MOT couplet will have at least one 3+gram in common (P_R|C = N_R / N_C).

        CHI-INIT                                     CHI-REACT
PROTO   N_C = 46795, N_R = 6005, P_R|C = 0.128       N_C = 46923, N_R = 4167, P_R|C = 0.088
TODDL   N_C = 378958, N_R = 130713, P_R|C = 0.344    N_C = 378712, N_R = 92340, P_R|C = 0.243

Table 26: Basic statistics concerning the replication of 3+grams between mother and child.

It may thus be seen that in both groups, toddlers initiate more intersubjective replications than they react to. Or, in other terms, that mothers tend to prefer reacting to topic changes over changing the topic themselves. It is as if mothers, not children, were adapting themselves to the currently addressed topic. But it can also be seen that this asymmetry is less marked in exchanges with older toddlers. For while mothers reproduce at least one trigram after approximately every third utterance, their children repeat at least one trigram approximately after every fourth utterance. In the younger group this is not so: the child reproduces a fragment of the mother's talk only once in every 12 utterances and the mother does so only once in every 8 utterances.

19 http://wizzion.com/thesis/code/tsr.pl
20 For example, in the younger toddler group mothers repeated 33 times the word "more" uttered by their child, but only in 9 cases was the word "more" uttered by a mother repeated by her child. Or, in the older group, mothers reproduced the "no." of their children in 895 cases; toddlers repeated the "no." of their mothers only in 133 cases.
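A minimal sketch of ours (not the original tsr.pl) illustrates the role of the speaker condition; it assumes input lines of the form "SPEAKER<tab>utterance" in temporal order:

#!/usr/bin/perl
# minimal sketch: probability that a couplet of two adjacent utterances of
# two distinct speakers shares at least one 3+gram, grouped by initiator
use strict; use warnings;
my (%n_c, %n_r);
my ($prev_spk, $prev_utt);
while (<STDIN>) {
    chomp;
    my ($spk, $utt) = split /\t/, $_, 2;
    if (defined $prev_spk and $prev_spk ne $spk) {   # speaker(UTT1) != speaker(UTT2)
        $n_c{$prev_spk}++;                           # couplet initiated by previous speaker
        $n_r{$prev_spk}++ if "$prev_utt#$utt" =~ /(.{3,}).*#.*\1/;
    }
    ($prev_spk, $prev_utt) = ($spk, $utt);
}
printf "%s-initiated: P(R|C) = %.3f\n", $_, ($n_r{$_} // 0) / $n_c{$_} for sort keys %n_c;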
These distinctions notwithstanding, we consider it worth mentioning that there seems to be, in fact, one thing common to both age groups: the ratio between the probabilities. That is, Table 26 indicates that, statistically speaking, it is ≈ 1.4 times more probable21 that a replication-containing couplet was initiated by the child and not by the mother.

end intersubjective replications 13.3.4

Thus ends our brief excursion through the realm of linguistic universalia. It could undoubtedly continue, for example by following the direction indicated by Table 27:

        CHI        MOT
PROTO   1148^22    3329^23
TODDL   2454^24    1319^25

Table 27: Distributions of occurrences of the marker for laughing in diverse subsets of the CHILDES corpus.

and a lot of ink could be spilled by tentatives trying to offer a serious, scientific, fully cartesian and p-value-endowed answer to the question "how is it possible that the CHA format's marker laugh is 2.5 times more frequent in transcripts of prototoddlerese when it contains 4013 fewer transcripts than the corpus of toddlerese?". But given the importance, intensity, diversity and perennial actuality of the topic (Aristotle, 5 BC), the role of laughing in the development of mind cannot be addressed in the current pamphlet to the extent it merits.

Instead of doing so, let's now fully admit that in the case of analyses of a corpus so strongly biased towards the English language as the one hereby studied, it is maybe wiser to stop babbling about "universals"26 and rather start assessing, evaluating and interpreting the central tenets of our theory in "English-specific" terms.

end universals 13.3

21 0.128/0.088 = 1.45; 0.344/0.243 = 1.42
22 $ grep laugh ./prototoddler/* |grep CHI |wc -l
23 $ grep laugh ./prototoddler/* |grep MOT |wc -l
24 $ grep laugh ./toddler/* |grep CHI |wc -l
25 $ grep laugh ./toddler/* |grep MOT |wc -l
26 Or do so elsewhere, c.f. (Hromada, 2016e).

13.4 english-specific

In this section we shall present the results of a few data-mining experiments which concerned only those parts of the CHILDES corpora which:

1. transcribe interaction between English-speaking adults and English-speaking children
2. also contain morphological and grammatical annotations (i.e. every utterance line is also followed by a %mor line and a %gra line)

Table 28 contains overall statistics27 of the datasets fulfilling these conditions, obtained by running the script langsort.pl28.

                  PROTO (<16 months)   TODDL (>16 months, <31 months)
Investigators     10                   35
Subjects          86                   288
Transcripts       330                  1335
CHI utterances    42229                293751
MOT utterances    196781               370972
CHI words         132927               1035341
MOT words         1076028              1921131

Table 28: Counts related to morphologically annotated English-language transcripts analyzed in this section.

As can be seen, the corpus still contains a non-negligible amount of data describing interactions of almost a hundred younger toddler subjects and almost three hundred older toddler subjects. Given that the data were collected and transcribed by dozens of diverse investigators, it can be expected that certain knowledge about generic tendencies could be attained if ever the data was to be processed in a stringently quantitative manner.

27 All values were obtained by means of the standard UNIX utility wc (e.g. the amount of letters in CHI utterances was obtained by executing the shell command $ grep -P '^CHI' ../toddl_english/*.cha |wc -c). Note that wc's operational definition of "word" (i.e. a continuous sequence of characters separated from other words by blank spaces) strongly overlaps with, but is nonetheless not completely equivalent to, what "word" means in linguistics.
28 http://wizzion.com/thesis/code/langsort.pl
Instructions and definitions of the CHILDES Manual (MacWhinney and Snow, 1991) should also be taken into account more strictly than was the case in the preceding "universals" section. Other details of text preprocessing are mentioned in the annex (??).

13.4.1 utterance-level constructions

Table 29 contains the most frequent utterance-level constructions obtained by launching one simple command29. That communication of younger children is dominated by non-linguistic behaviours (vocalizations, babbling, crying, laughing etc.) is hardly surprising. Nor is it much surprising that younger children tend to produce shorter utterances. Nor the fact that the vast majority of multiword motherese utterances are short fixed expressions (e.g. "come on", "that's right", "oh dear", "good girl").

Observation of certain similarities between the distributions of child-directed speech and speech produced by older children can lead one to the hypothesis that these distributions correlate. In order to verify the hypothesis, a simple script was programmed30 which merged two complete distributions into one table. Subsequently, Pearson correlation coefficients were calculated; they are presented in Table 30.

One may thus observe the existence of statistically significant (i.e. p < 0.05) correlations in all cases except one: no statistically significant correlation was observed between MOT-TODDL and CHI-PROTO. This seems reasonable, for how could the language of a young toddler a priori correlate with the language which the mother shall use when the child is older? In the reverse direction, however, a weak (cor = 0.022) but nonetheless statistically significant correlation is observed: thus, there exists a certain relation between the distribution of utterances in language produced by older children and the distribution of utterances in language heard by younger children. The strongest correlation, however, is to be observed between MOT-TODDL and CHI-TODDL, which can be potentially explained in terms of convergence of toddlerese towards "the golden standard" actualized by the language of the mother.

This being said, let's now conclude this brief overview of utterance-level distributions with Table 31 which presents the following quantities:

• N_d: the number of distinct utterances present in the corpus
• P_d: the probability that an utterance is distinct (N_d normalized by the number of all utterances, c.f. Table 28)
• H: the Shannon entropy of the utterance distribution, calculated31 as H = −Σ_i P(x_i) log2 P(x_i) (Shannon, 1948), where P(x_i) denotes the probability of occurrence of the i-th utterance (e.g. its relative frequency of occurrence)

Given that the Shannon entropy can be understood as a measure of uncertainty and unpredictability, it may be stated that the production of younger children yields the most predictable transcripts. The production of older children is much less predictable and every new utterance seems to bring about twice as much information content (≈ 13.7 shannons instead of 6.8). And utterances produced by mothers are even less predictable.

29 grep -h -P "CHI: [^0]" ./toddl_english/* | sort |uniq -c |sort -g -r
30 http://wizzion.com/thesis/code/correlator.pl
31 http://wizzion.com/thesis/code/entropycalc.pl
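A minimal sketch of ours (not the original entropycalc.pl) of such an entropy computation, assuming one utterance per input line:

#!/usr/bin/perl
# minimal sketch: Shannon entropy (in shannons) of an utterance distribution
use strict; use warnings;
my (%freq, $total);
while (<STDIN>) { chomp; $freq{$_}++; $total++; }
die "empty input\n" unless $total;
my $H = 0;
for my $count (values %freq) {
    my $p = $count / $total;        # relative frequency P(x_i)
    $H -= $p * log($p) / log(2);    # H = -sum_i P(x_i) log2 P(x_i)
}
printf "N_d = %d, H = %.3f\n", scalar(keys %freq), $H;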
end utterance-level constructions 13.4.1

13.4.2 pivot schemas

In item 9.4.4, a pivot schema was defined as a two-word schema in which one word ("the pivot") recurs frequently in the same position and the other word varies. In order to detect potential pivot words, let's define a sort of "pivoteness" score as:

score_pivoteness = F_N-gram × length(N-gram) = F_N-gram × N

which is to be calculated for every continuous N-gram which occurs in the corpus and has more than X characters (i.e. N > X). For example, if the corpus contained only 4 utterances consisting solely of the expression "dogs" and one utterance consisting of the expression "dog", and if the parameter X was set to 2, the score-attributing script32 would attribute the score 4 × 4 = 16 to the tetragram "dogs", the score 5 × 3 = 15 to the trigram "dog" and the score 4 × 3 = 12 to the 3gram "ogs". However, the bigrams "do", "og", "gs" as well as the unigrams "d", "o", "g", "s" would be ignored since the parameter X = 2.

Table 32 lists the top thirty 8+grams (i.e. X = 7)33 extracted from all CHILDES transcripts of English-speaking children not older than 2 years and 7 months34. As may be seen, more than half of the most salient pivots are onset expressions initiating the utterance (marked by the starting symbol ^) and the rest is divided between expressions which end the utterance ("in there", "'s that?", "on there", etc.) or are in the midst of it (" in the ", " on the ", "another", "little").

It is, however, quite probable that even among these pivot candidates there would be some which are not true pivots because they occur only in a restricted amount of contexts. But in a case like ours, when all contexts are known, such "false pivots" can be potentially identified by an algorithm which, for every pivot candidate C:

1. assesses the distribution of contexts35 D_C
2. calculates the Shannon entropy of D_C

and this is, indeed, the procedure actualized by the script pivotentropy.pl36 whose outputs37 are introduced in Table 33. As may be seen, the results presented in Table 33 are quite similar to the results already presented in Table 32. There exists, indeed, a statistically significant correlation between score_pivoteness and H_contextual (i.e. Spearman's non-parametric rank correlation test yields p-value < 2.2e-16, ρ = 0.474). Since the evaluation of score_pivoteness is less costly than that of H_contextual, and since the entropy values are, so to say, more precise than score_pivoteness, the fact that these two measures tend to correlate may turn out to be quite useful in applied NLP practice.

32 http://wizzion.com/thesis/code/exh.pl
33 Note that the choice of the parameter X was to a great extent arbitrary and only to a much lesser extent motivated by the "magical number seven, plus or minus two" postulated by Miller (1956).
34 The complete list of all 8+gram expressions and their associated pivoteness7 scores is available at http://wizzion.com/thesis/results/pivots_english_7.
35 In what shall follow, the term "context" means "two words to the right" if the pivot initiates the utterance, "two words to the left" if it terminates the utterance and "one word to the left and one word to the right" if it is in the midst of it.
36 http://wizzion.com/code/thesis/pivotentropy.pl
37 The list of 1000 pivot7 schemas with highest pivoteness and their CHILDES H_contextual entropies is downloadable at http://wizzion.com/thesis/results/pivot7_entropies_english.1000
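A minimal sketch of ours (not the original exh.pl) of the score_pivoteness computation, assuming one utterance per input line and ^ as the onset symbol:

#!/usr/bin/perl
# minimal sketch: frequency * length scores for all N-grams with N > X
use strict; use warnings;
my $X = 7;
my %freq;
while (my $utt = <STDIN>) {
    chomp $utt;
    $utt = '^' . $utt;                          # onset meta-character
    for my $n ($X + 1 .. length $utt) {         # every N-gram with N > X
        $freq{ substr $utt, $_, $n }++ for 0 .. length($utt) - $n;
    }
}
my %score = map { $_ => $freq{$_} * length $_ } keys %freq;
my @top = sort { $score{$b} <=> $score{$a} } keys %score;
print "$score{$_}\t$_\n" for @top[0 .. (@top < 30 ? $#top : 29)];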
Summa summarum, constructions extracted from CHILDES by means of the above-mentioned methods strongly resemble Bruner's "formats" (5).

end pivot schemas 13.4.2

13.4.3 pivot instances

Let's now focus on pivot instances, that is, on expressions which are matched by pivot schemas. We define: an utterance U instantiates the pivot schema P if and only if U can be matched by the P-representing pattern. In case we choose PERL regexes as a means of representation of pivot schemas, this definition can be formalized as

$U =~ /$P/

whereby =~ denotes the regex-matching operator. This notion is implemented by the script pivot_utterance_global.pl38 which, when initialized with a list of pivots as its input data, returns the frequencies of utterances which instantiate one among the ten pivots with such high informational content.

38 http://wizzion.com/thesis/code/pivot_utterance_global.pl
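A minimal sketch of ours (not the original script) of such pivot-instantiation counting; the pivot list given here is purely illustrative:

#!/usr/bin/perl
# minimal sketch: frequencies of utterances instantiating some pivot schema
use strict; use warnings;
my @pivots = map { qr/$_/ } ('^that\'s', '^I want', 'in there\.$');   # illustrative pivots
my %freq;
while (my $U = <STDIN>) {
    chomp $U;
    for my $P (@pivots) {
        if ($U =~ $P) {        # U instantiates the pivot schema P
            $freq{$U}++;
            last;
        }
    }
}
print "$freq{$_}\t$_\n" for sort { $freq{$b} <=> $freq{$a} } keys %freq;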
The most frequent among such pivot-instantiating utterances are listed in Table 34 along with their respective frequencies of occurrence. The list makes evident certain usage-oriented, egocentric (e.g. "I want it", "that's mine") and attention-sharing ("look at this one") tendencies potentially inherent to human toddlers.

But in order to be sure that this is indeed the case and not just an artefact of the method with which we treat the corpus, let's now slightly readjust the methodology: let's NOT throw all data coming from all children into one bag which is subsequently analyzed, but instead, let's keep all utterances well associated to their locutors in order to identify such utterances which are being spoken by the biggest number of individual locutors. Operationalization of such a methodology into a PERL script39 makes it possible to pose the following question: Which pivot-instantiating utterance was uttered by the biggest number of distinct children? The 45 top-ranking utterances are listed in Table 35 as an answer40.

As before, this more horizontal analysis indicates that toddlerese tends to be dominated by level-0 constructions:

1. encoding deictic focusing of attention on some object
2. expressing wanting or egocentric possession
3. asking for more information

or level-1 crossovers of such level-0 schemas like, for example, (another one) × (I want) → (I want another one). Q.E.D.

end pivot instances 13.4.3

13.4.4 pivot grammars

The three tables which follow aim to elucidate more closely the concrete substance of some pivots with big H_contextual41. Before leaving, we remind the reader of the fact that all "grammars" presented hereby are "intersubjective" in the sense that they were extracted from a corpus of transcripts produced by distinct children. Thus, it is more reasonable to label the micro-grammars hereby introduced as "social" rather than purely individual (e.g. "cognitive").

But given the size of the CHILDES sample which was analyzed, and given that the condition of "random sampling"42 would hold - which is not granted - it could be, more or less, expected that the "popularity" of the utterances hereby unveiled characterizes not only the English language as a mutually shared intersubjective entity, but could also characterize the intensity with which certain structures are encoded in the mind of an individual.

end pivot grammars 13.4.4

It seems as if a non-negligible amount of salient phenomena were revealed during the analysis of the English parts of the CHILDES corpus. Primo, distributions of utterance-level constructions indicated that

• communicative tentatives of prototoddlers are dominated by non-linguistic means (29)
• the distribution of utterances which mothers say to younger children significantly correlates with language produced by older children (30)
• productions of mothers and older toddlers are less predictable than productions of younger toddlers (31)

Secundo, analyses aimed at pivot schemas and their instances (35) suggest that

• the most salient (32) and potent (33) pivot schemas coalesce around expressions used for location-related and deictic "pointing" and expressions of "wanting"
• the most frequent instances of pivots tend to refocus the interlocutor's attention to something else ("another one"), reinforce the current situation ("I want some more"), or express the child's egocentric ownership ("that's mine")
• certain utterances can be easily explained as crossovers between two frequent pivots (i.e. "I want" × "another one" → "I want another one")

Tertio, closer inspection of certain pivot schemas instantiated in utterances of the biggest number of distinct children shows that

• non-abstract referents of the child's linguistic pointing are mainly animates (Daddy, elephant, cow, horse) and color attributes (green, red, yellow, orange, blue) (36)

39 http://wizzion.com/thesis/code/pivot_utterance_distinct.pl
40 http://wizzion.com/thesis/results/utterances_with_pivots_distinct_children_sorted
41 Quantities in square brackets denote the utterance's "popularity", i.e. the number of distinct children who have uttered the construction.
42 Id est, that CHILDES corpora in general represent a random sample of the child's normal verbal interactions.
• children "want" a drink, to see, and to play (37)
• the pivotal affinity of the adjective "little" is in part caused by concrete referents (piggy, baby, ball), in part by fixed expressions ("a little bit") and in part by expressions belonging to both classes ("Mary had a little lamb", "twinkle twinkle little star")43

This being said, the reader is cordially invited to explore the "results" files in order to identify other interesting (ir)?regularities potentially allowing us to increase the amount of knowledge we have about the Weltanschauung of a modal English-speaking toddler.

end english-specific 13.4

But many other, somewhat more universal "facts" were mined from the CHILDES corpus in the first half of this chapter. Aside from the fact that mothers interacting with younger children laugh significantly more often than mothers interacting with older children (27), our initial attention was captivated by the relatively frequent44 occurrence of the nasal labial consonant M in productions of younger children (18). When it comes to expressions composed of more than one signifier, one fact issuing from the analysis of child-directed motherese struck us as a particularly salient one: the use of the 2nd person singular pronoun "you" (13.3.2), significantly more common than in standard corpora.

Extraction of two or more replicas of 2+gram sequences juxtaposed to each other within one utterance led us to the conclusion that intralocutory duplications (13.3.3) are most frequently observed in motherese directed to younger children. Subsequently, the analyses of translocutory duplications - that is, repetitions spanning multiple utterances - revealed a structural distinction between the language of younger and older children: while prototoddlers use repetition of meaning-carrying "lexical" morphemes ("mama", "baby", "ball", "daddy"), repetitions of older toddlers are populated by members of the closed set of "grammatical" morphemes ("the", "this", "yes", "here") (23). The latter distribution being similar to distributions of the adult grammar, it was hypothesized that during the process of development, the child's language gradually adapts to the language system of surrounding linguoracles, especially the mother.

A following analysis of "intersubjective replications" - i.e. of cases where a word uttered by one locutor was immediately re-uttered by the other (25) - indicated that the hypothesis of a child unilaterally adapting to the mother is not sufficient. More concretely, the summary results presented in Table 26 led us to state that "mothers tend to prefer reacting to topic changes over changing the topic themselves". Thus, it seems that in the long run - during weeks, months and years - it is the child who adapts to the mother, but in a short span - in concrete scenes lasting seconds and minutes - it is the mother who adapts her topic, her focus, her attention to that of the child.

Aside from all these phenomena - and all others explicitly discussed in the preceding chapters - we find it important to repeat once more the methodological objective behind this chapter. That is, to show that both relevant and interesting "knowledge" can be extracted from CHILDES corpora by means of a simple, fast and unambiguously reproducible method of extraction of patterns attained by means of matching the corpus with PERL-compatible regular expressions.

end quantitative 13

43 The child's growing exteroceptic, proprioceptic and/or spatial awareness of the fact that she is a "little girl" also plays, of course, an important role.
44 I.e. in contrast with older children.

CHI MOT PROTO TODDL PROTO &=vocalize . 10683 yeah . 2337 1344 &=babble . 9126 no . 1180 &=nonspeech . 3856 oh . 1009 &=cry . 3003 894 &=involuntary . 381 &uh . 5206 TODDL &=involuntary . 6609 oh . 1969 yeah . 5219 no . 1785 &=nonspeech . 3852 yeah . mhm . 1458 okay . 3777 okay . 2769 yes . 1201 &hmm ? 2591 yes . 2640 there . 1011 &=speechplay . 2563 mhm . 379 &=laugh . 1204 huh ? 1000 come on ! 2002 right . 239 &ah . 1129 look . 985 here . 1900 &hmm ? 223 Mama . 816 that . 871 huh ? 1582 there . 206 &=cough . 773 here . 860 uhoh . 1576 what ? 106 Dada . 717 what’s that ? 846 no . 1456 that’s right . 104 &=labial . 601 Mummy . 764 &=laugh . 1377 what ? 1156 well . 86 &eh . 587 okay . 719 look ! 1153 look . 79 &=laughs . 514 uhhuh . 637 oh . 1144 come on . 67 ball . 489 yup [= yes] . 628 there you go . 1115 that’s it . 55 ooh@b . 422 that one . 558 come (h)ere ! 734 oh dear . 54 &=raspberry . 412 &mm . 476 that’s right . 726 thank_you . 54 &mm . 410 on there . 474 yay . 722 pardon ? 52 baby . 406 in there . 472 ahhah . 690 no ? 52 &u:h . 404 oh no . 472 whoa . 582 whoops . 49 Mommy . 396 this . 454 hello . 509 what is it ? 48 byebye . 389 more . 433 &mm . 464 what’s this ? 47 guh@b . 382 what ?
414 that’s right . 726 uhoh . 331 yay . 722 what’s this ? 47 up . 380 oh dear . 329 ahhah . 690 what is that ? 45 no . 375 baby . 317 whoa . 582 there you go . 42 oo@b . 354 car . 311 hello . 509 here . 38 &a:h . 337 I don’t know . 307 &mm . 464 oh no . 36 Mom . 318 me . 291 hey . 451 uhhuh . 35 uguh@b . 317 what’s this ? 281 what’s that ? 409 good girl . heh@b . 310

Table 29: Most frequent utterance-level constructions produced by English-speaking mothers and children in 2 phases of their development.

            MOT-PROTO                                          MOT-TODDL
CHI-PROTO   t = 28.4625, df = 32679, p ≤ 2.2e-16, cor = 0.155    t = 0.5, df = 74078, p = 0.6126, cor = 0.0019
CHI-TODDL   t = 5.9801, df = 70555, p = 2.24e-09, cor = 0.022    t = 317.27, df = 110006, p ≤ 2.2e-16, cor = 0.692

Table 30: Correlations between distributions of frequencies of utterances.

        CHI                                       MOT
PROTO   N_d = 3645, P_d = 0.0863, H = 6.824       N_d = 83267, P_d = 0.423, H = 14.2
TODDL   N_d = 120219, P_d = 0.41, H = 13.7        N_d = 199704, P_d = 0.538, H = 15.37

Table 31: Number of distinct utterances in diverse datasets and entropies of their distributions.

Score Pivot | Score Pivot | Score Pivot
18472 ^that’s | 6507 another | 4320 ^I can’t
16368 ^ what’s | 5950 ^that’s a | 4288 ’t know.
16360 ^ I want | 5841 want to | 4239 ^I wanna
13527 ^ where’s | 5810 on there. | 4235 ^ there’s a
10640 in there. | 5808 little | 3950 ^ that one
9632 in the | 4860 ^this is | 3790 going to
9513 ^there’s | 4760 that one. | 3740 ^I want to
8328 ’s that? | 4734 ^ another | 3600 ^ it’s a
7335 ^I don’t | 4667 ^ where’s the | 3591 , Mummy.
7320 on the | 4608 don’t know. | 3552 ^ here’s

Table 32: Thirty 8+grams with highest score_pivoteness.

Pivot | H_contextual
^that’s X Y | 9.25876131528133
^I want X Y | 8.96609935363205
^where’s X Y | 8.95606540894548
^there’s X Y | 8.79381971491988
X in the Y | 8.74578245657441
X on the Y | 8.65695029616604
X Y in there. | 8.5923250584143
^this is X Y | 8.34192768618957
^that’s a X Y | 8.20433784100614
X little Y | 8.01430314990973

Table 33: Ten CHILD-produced pivot7 schemas with highest contextual entropy (in shannons).

Utterance | Frequency
another one. | 143
I want it. | 84
where’s it gone? | 72
that’s it. | 69
what’s in there? | 68
that’s mine. | 67
where is it? | 65
I want another one. | 60
I can’t do it. | 49
look at this one. | 48

Table 34: CHILDES utterances most frequently instantiating some pivot7 schema.

Utterance Children | Utterance Children | Utterance Children
another one. 33 | what’s in there? 13 | little girl. 10
that’s mine. 27 | that’s right. 13 | I want another one 10
I want it. 27 | look at this. 13 | I can’t find it. 1
I want that. 26 | it’s all_gone. 13 | a little one. 10
that’s it. 25 | in the car. 12 | where go? 9
yes please. 23 | I like that. 12 | where are you? 9
I can’t do it. 22 | here’s one. 12 | there , look. 9
where is it? 19 | what’s this one? 11 | that’s red. 9
look at that. 17 | there’s one. 11 | I want this one. 9
go in there. 16 | and there. 11 | go in here. 9
I want that one. 15 | what’s in here? 10 | where’s this go? 8
where’s it gone? 14 | there’s another one. 10 | where’s other one? 8
that’s better. 14 | that’s green. 10 | that’s yellow. 8
little one. 14 | that’s Daddy. 10 | that’s a . 8
I want some more. 14 | put it in there. 10 | that one there. 8

Table 35: Pivot-instantiating CHILDES utterances pronounced by the biggest number of distinct children.

that’s mine [27] it [25] better [14] right [13] green [10] Daddy [10] red [9] yellow [8] orange [7] all [7] nice [6] blue [6] a elephant [6] a cow [6] you [5] my [5] horsie [5] good [5] a car [5] ...

Table 36: Most popular instances of pivot "^that’s X"
I want it. [27] that. [26] that one. [15] some more. [14] another one. [10] this one. [9] this. [8] some. [7] a drink. [7] two. [6] to see. [6] one. [6] more. [6] to play. [5] down. [5]

Table 37: Most popular instances of pivot "^I want X"

little one. [14] girl. [10] lamb. [5] boy. [5] bit. [5] man. [4] car. [4] piggy. [3] ball. [3] baby. [3] a little one. [10] boy. [3] bit. [3] box. [3] that little one [3]. twinkle twinkle little star. [4] Mary had a little lamb. [3]

Table 38: Most popular instances of pivot "X little Y"

14 SUMMA III

Ideas are never static but develop across time and context, constantly cross-fertilizing with other currents of thought. — Edwin F. Bryant

Hence ends the last part of the first volume of the Thesis aiming to offer certain fragments of evidence of the validity of the theory of intramental evolution. Two principal ways of acquisition of such fragments have been presented:

1. qualitative: holistic, naturalistic and multi-modal observations of the development of one specific child, from the prenatal period onwards
2. quantitative analyses of patterns recurrent in transcripts produced by hundreds of children immortalized in the CHILDES corpus

but their combination is, of course, also possible.

14.1 crossroads of thoughts

As is often the case in science, a crossover between the methods can also lead to interesting results. Thus, it was by means of PERL regex pattern matching that the following schema was detected in the behaviour of the son of Kuczaj (Kuczaj and Maratsos, 1975), named Abe (2;5.23):

Listing 7: Some lines from abe009.cha (PID 11312/c-00016245-1) transcript

301 *FAT: okay (.) here's another one a cow ate the carrot .
306 *CHI: cow ate the carrot .
310 *FAT: okay (.) now do this one the boy fell down the stairs .
315 *CHI: boy fell down the stairs .
322 *FAT: dinner was eaten by the boy .
325 *CHI: dinner eaten by the boy .
...
365 *FAT: the cow ate a carrot .
368 *CHI: cow ate a carrot .
372 *FAT: the little boy is happy .
375 *CHI: happy .
...
398 *FAT: a cow ate carrot the .
401 *CHI: a cow ate carrot the .
...
507 *FAT: the cow did not eat his dinner .
510 *CHI: cow didn't eat his dinner he can't get snacks .
516 *FAT: that's right no dinner (.) no snacks here's another game the elephant cannot go home .
523 *CHI: elephant can't go home .
526 *FAT: nobody hit me .
529 *CHI: nobody hit me .
533 *FAT: the boy did not eat any cookies .
536 *CHI: boy can't eat any cookies .
539 *FAT: the cow cannot eat grass .
542 *CHI: cow can't eat grass .
...
551 *FAT: the boy did not sleep .
554 *CHI: boy can't not sleep .
...
582 *FAT: the goat eat did his dinner .
585 *CHI: goat didn't eat his dinner .
588 *FAT: the boy not did eat any cookies .
591 *CHI: boy can't eat any cookies .
...
642 *FAT: we can play some more tomorrow too (.) okay .
647 *CHI: tomorrow too (.) boy can't eat his carrots boy can't eat his carrots
654 *MOT: do you want to go outside for awhile (.) Abe ?
659 *CHI: play outside (.) boy can't eat his carrots .

Closer inspection of the above-listed father-son interaction unveils multiple interesting phenomena:

Primo, Kuczaj's son consistently used the construction "boy can't" in cases where he should repeat his father's "boy did not". Well beyond the objectives of this Thesis is the question whether this phenomenon is to be explained by a mismatch on the level of passive, perceptive morphosyntactic C-structures (c.f. also 12.12.1) or whether it has more to do with a mismatch of productive P-structures.
But mismatch there is, and for a reason unknown, Abe was consistently crossing-over the external schemata of the form "boy did not X" with the private schema "boy can't X".

Secundo, other crossovers between external stimuli (e.g. utterances produced by external linguistic oracles like parents, peers or teachers) and the child's private world of needs, wants and protothoughts are to be observed on lines 510 and 647 of the transcript. In the first case, the father's utterance "cow did not eat his dinner" is augmented with Abe's private "he can't get snacks", which makes the father react to the "snack" topic1 without any preliminary intention to do so.

Even more important - for the purpose of verification of the theory hereby presented - is the crossover construction which emerges at the very end of the transcript, in the moment where the father closes the session with the words "we can play some more tomorrow", thus putting Abe in a position of a brief vacuum where anything can be said. The vacuum is immediately filled by Abe's production and replication (twice on line 647, once on line 659) of the construction "boy can't eat his carrots". Note that nowhere in the transcript had the father uttered a construction with "boy" as a subject and "carrot" as an object2. Thus, the expression with which Abe closes the language game seems to be his own invention, an invention which we consider to be the product of the crossover summarized in Table 39.

306 cow ate the carrot
361 cow ate a carrot
510 cow didn't eat his dinner he can't get snacks
542 cow can't eat grass
536 boy can't eat any cookies
554 boy can't not sleep
591 boy can't eat any cookies
647, 659 boy can't eat his carrots

Table 39: Interphrastic crossover behind Abe's "boy can't eat his carrots".

Given all this, a question can be posed: "why carrots?". Why not "dinner", "cookies", "cheese" or "grass", which are also used as direct objects of "eat-ing" mentioned in the transcript? Why was it the substantive "carrot" which, as Tomasello (2009) would say, had "filled the slot"? It may be the case that multiple cognitive processes and biases are to be taken into account in order to answer the question:

1. the primacy effect: the term "carrot" is the first concrete object of eating mentioned in Kuczaj's "repeat after me" language game
2. the frequency effect: Abe was three times exposed to his father's production of the term "carrot", i.e. more than in the case of "cookies" (2 times), "grass" (2 times) or "cheese" (once)
3. the perturbation effect: the term "carrot" was once heard (line 398) and once produced (line 401) in the syntactically anomalous construction "a cow ate carrot the"3
4. the semantic consistency effect: "boy can't eat his carrots" refers to a more plausible scenario than, for example, "boy can't eat his grass"
5. priming etc.

It seems to us evident that all these processes and biases are to be taken into account by anyone hoping to develop a reasonable theory of crossover among linguistic structures which does not contradict but rather naturally extends the cognitivist, connectionist and usage-based4 paradigms which dominate contemporary developmental psycholinguistics.

1 Snacks are also mentioned in other Abe transcripts: on line 45 of abe004.cha the mother urges the child to eat with the threat "okay (.) come eat or no snacks later on ." and on line 516 of abe017.cha Abe offers his father an apple "as a snack"
2 The word "carrot", in fact, does not occur in any other Abe transcript, only in abe009.cha
3 Exposure to such anomalous stimuli can potentially be assessed in terms of the P600 event-related potential. It cannot be excluded that such P600-related anomalies attain a higher level of salience and activation than terms occurring in coherent contexts.
But since the objectives addressed in the second volume of this work will be principally computational ones, let's now start concluding this theoretical volume with one sole principle which can be immediately deployed in a functional program.

14.1.1 the linguistic crossover principle

The fitness of the product of the crossover of (linguistic) schemes A and B is proportional to the fitness of A and B as well as to the amount of features which A and B share.

end the linguistic crossover principle 14.1.1

A more formal and geometric variant of this principle shall be furnished in the second volume of this work. For the time being, let's just elucidate that by the term "features" we mean not only the overlap between "semantic" features (c.f. "the semantic consistency effect" in the enumeration above) of two "parental" schemes, but also the overlap between prosodic, phonologic, morphologic, syntactic or even pragmatic characteristics of the schemes which are to be fused. As in the case of any creative, poietic act, the form and content, the program and the data, fuse. Thus, "AFE" and "OPICA" yield "API" (12.10.1), "BAJA" and "ANAN" yield "BANAN" (12.11.1) etc. not only because they denote the same meaning but also because they are phonetically similar. "MAHLEN" and "BAUEN" yield "MAUEN" (12.11.2) not only because their signifiants can be matched by the pattern /Labocc A*EN/ but also because within a certain subspace of the enveloping semantic space they tend to be quite close (i.e. they both denote object-manipulating, constructive, creative activities etc.).

4 And with a little bit of luck also mentalist ones.
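Stated as such, the principle can already be cast into code. The following is a minimal sketch of ours, with illustrative fitness values and with feature overlap crudely approximated by shared character bigrams - one among many conceivable feature spaces, not the operationalization promised for the second volume:

#!/usr/bin/perl
# minimal sketch of the linguistic crossover principle: fitness of the
# offspring of schemes A and B grows with parental fitness and feature overlap
use strict; use warnings;

sub overlap {                       # number of character-bigram types shared by two schemes
    my ($a, $b) = @_;
    my (%in_a, %seen, $shared);
    $in_a{ substr $a, $_, 2 } = 1 for 0 .. length($a) - 2;
    for my $i (0 .. length($b) - 2) {
        my $gram = substr $b, $i, 2;
        $shared++ if $in_a{$gram} and not $seen{$gram}++;
    }
    return $shared // 0;
}

sub crossover_fitness {             # proportionality turned into a product, for illustration
    my ($fit_a, $fit_b, $a, $b) = @_;
    return $fit_a * $fit_b * (1 + overlap($a, $b));
}

# hypothetical parental schemes with illustrative fitness values
print crossover_fitness(0.9, 0.8, "cow can't eat grass", "boy can't eat any cookies"), "\n";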
Hence, Abe joyfully utters "the boy can't eat his carrots" not only because "boy eats" is semantically closer to "carrot" than to "grass" but also because - on the morphosyntactic level - the expressions "cow can't eat grass" and "boy can't eat any cookies" are similar enough to induce the activation of a pathway like "X can't eat Y", subsequently filled with the most affine fillers ("boy" before "can't" and "carrot" after "eat").

Subtleties aside, the linguistic crossover principle can be further elucidated by the following aphorism:

14.1.2 of crossovers and analogies (aph)

If the reader has understood that the events which we have labeled as "linguistic crossovers" could elucidate phenomena to which traditional cognitive science refers by the term "analogy", then the reader has understood us well.

end of crossovers and analogies 14.1.2

... and the precept "whenever You notice an analogy or schematization, seek the implicit structural crossover behind it" can turn out to be a useful methodological "rule of thumb" for any researcher potentially interested in our proposal.

end crossroads of thoughts 14.1

14.2 axes of analysis

Diverse aspects of the crossovers produced by IM, Abe or other toddlers can be studied. Of non-negligible importance is the analysis in terms of the temporal interval between the last activation of the crossover's input schemata and the crossover's output product. Thus, in the case of Abe's carrots, minutes had to pass between Abe's productions of all the initial "carrot" and "boy can't" (input) expressions and his final "boy can't eat his carrots" (output). Many crossovers uttered by IM also had the property of mixing together schemata separated from each other and from their product by minutes of other content, c.f. (MAMA + MIMI → MAMI, 12.11.2). But sometimes - as in the case of PIJEN (12.10.1) - the timespan seemed to be even shorter and the crossover seemed to be occurring in short-term memory or even in a much more volatile phonologic buffer. And yet in other cases (12.11.1), a simple trick of letting the child hear "AJAN" caused a 5-month-old latent schema to get reactivated, fuse with the much more recent BAJA and form the globally optimal form BANAN.

Another important aspect is the origin of the input schemata. Analyzed from this perspective, one can state that the nature of the majority of crossovers noted down in this volume was of the following kind:

external × personal → CROSSOVER

whereby "external" denotes the schemata encoded in the stimuli to which the child is exposed (e.g. motherese utterances etc.) while "personal" denotes private and often unique idioglottic structures already encoded and productive within the mind of the given child. Crossovers between two or more purely "personal" schemata also seem possible. Unfortunately for empiric science, they are either impossible to access (as is the case with "dreaming"5) or difficult to recognize as what they truly are (e.g. certain babbling sequences etc.)

end axes of analysis 14.2

14.3 the source of variation

Encoded in the material substrate of the brain, schemata are subject to the same physical laws of entropy and decay as the brain and body itself. Cognitive schemata are not engraved onto some kind of eternally lasting crystal. Humans forget6. Forgetting is a form of variation and, as every form of variation, it can sometimes lead to disastrous loss of information. But not rarely, it can also cause one to discard previous "locally optimal" information, thus giving one the impetus to seek more globally optimal states.

Another source of variation inherent to the child is her tendency "to want another one" and "to play". While many phenomena related to craving and wanting more can be to a large extent explicated in terms of standard behaviorist theories (reinforcement, reward etc.) or the "3rd noble truth" already posited by Shakyamuni some 25 centuries ago (Lama et al., 2005), the child's ever-actual readiness to play does not cease to strike us with such intensity that even after months of observations and empiric research, we still consider our initial definition of the "child" (Section 5.2) as a reasonable and valid one.

What's more, our research has led us to the conviction that a modal toddler is much more a member of the species Homo Ludens (Huizinga, 1956) than of the species Homo Sapiens. And if there is one single thing which should potentially be reproached to the otherwise most advanced and complete theory of linguistic development - i.e. the usage-based theory of Tomasello - then let it be this one:

14.3.1 extending usage-based paradigm (txt)

That language development could not be possible without the child's ability to share attention with other humans is true.

5 For if there is a realm inaccessible to the reason of an adult man, then it is indeed the realm of toddlers' dreams.
6 Sometimes the tendency to forget is so strong that some researchers (c.f. 9.4.2) have even forgotten that humans forget
And it is also true that the recurrence and distribution of patterns among and within the diverse "usage scenarios" to which the child is exposed and in which she is supposed to act, all that is an indispensable prerequisite to the success of the whole process. But a similarly indispensable prerequisite is the child's tendency to play with sounds, words, sentences and whole contexts. To laugh, to sing, to talk to herself, to say "no" when the child already knows that the only word which her interlocutor does NOT want to hear is..."no". To playfully explore the limits of principles and rules and to do so in order to break them. To playfully explore the limits of one's world.

end extending usage-based paradigm 14.3.1

And to feel Joy during and because of all of that.

end the source of variation 14.3

14.4 from selection to replication

The principal source of variation thus elucidated, the theory of intramental evolution still lacks a component without which it could be neither formalized nor translated into a functional computer program. That is, the description of the bridge between the process of "selection" and the process of "replication". What is still missing is such a fitness function which could be pertinent to the process of language development. In other terms, what we still lack is a criterion by means of which one's language-processing system could evaluate which schemata (or their ordered sets) are "fit" for linguistic communication and which are not. We posit the following principle in order to fill this gap.

14.4.1 the principle of exogenous selection (def)

The more a schema S encoded in a cognitive system C matches the data produced by an external oracle O, the more probable it is that S shall replicate into another region of C.

end the principle of exogenous selection 14.4.1

Stated in more Piagetian and less probabilistic terms: whenever the schema succeeds in assimilating a linguistic stimulus produced by a person endowed with implicit authority7, the schema gets copied into another region of the child's mind. Stated even more simply, the principle can be compressed into the following precept:

14.4.2 mpr precept (aph)

Matching Pattern Replicates.

end mpr precept 14.4.2

And that's it. Given that within the brain of a child replicated schemata are practically immediately subjected to the forces of decay and (play|forget)ful variations, the three words of the MPR precept prepare the territory for the great deal of adaptation which could potentially follow.

Under this view, the computational burden related to information-processing, noise-filtering and the selection of structures is delegated to external oracles. By "uttering this and not that", by exposing the child's schemata to this "data" and not that "data", indeed by such indirect mediated means do the model persons influence the development of structures in the child's mind. Aside from a few dozen innate schemata, the mind of the nascent child is filled mainly with the unceasing swarming of images issued from the unknown realm of φαντασία. All the rest - including labels, rules and criteria - comes from outside, neatly packaged, preprocessed and preselected by caring oracles. It is in this sense that the adjective exogenous is to be understood.

end the principle of exogenous selection 14.4.2

7 The child experiences on a daily basis how persons like mother, father, grandparents, teachers, older siblings, older peers etc. succeed in solving problems which she is unable to solve on her own. In computational theory such problem-solvers able to yield an immediate and correct answer are called oracle machines (Turing, 1939). C.f. Clark (2010) for a discussion of how the involvement of certain oracles, called Minimally Adequate Teachers, can reduce the computational complexity of the problem of grammatical inference of context-free languages.
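Indeed, the precept is directly executable. What follows is a minimal toy sketch of ours - not the program promised for the second volume - in which every schema matching an oracle utterance replicates, and every replica is immediately exposed to (play|forget)ful variation:

#!/usr/bin/perl
# minimal sketch of the MPR precept: a Matching Pattern Replicates,
# and its replica may immediately decay (a crude form of variation)
use strict; use warnings;
my @population = ('mama', 'ba', 'can.t eat');               # illustrative schemata
my @oracle = ('mama is here', "the boy can't eat grass");   # illustrative oracle data
for my $utterance (@oracle) {
    my @replicas;
    for my $schema (@population) {
        next unless $utterance =~ /$schema/;                # Matching Pattern ...
        my $replica = $schema;                              # ... Replicates,
        substr($replica, int rand length $replica, 1) = ''
            if rand() < 0.1 and length $replica > 1;        # and may decay
        push @replicas, $replica;
    }
    push @population, @replicas;
}
print scalar(@population), " schemata after exposure: @population\n";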
Nothing precludes that in a healthy symmetric relation between the parent and the child, the parent can approach the child as if she were a computational oracle able to immediately solve certain types of problems. In such a case, adaptation and evolution lead to a sort of bilateral co-adaptive, co-evolutive interlock in which the child does for the parent what the parent does for the child. That is, by selecting the data-to-be-matched and exposing the parent to it, the child indirectly influences the population dynamics of the schemata encoded in the parent's mind. Et vice versa.

The willingness of many mothers to adapt to the topic proposed by their child (Table 25) as well as their readiness to perceive a fragile and powerless hominid not as an alien but as a "2nd person singular" (Table 23) leads us to the belief that authentic, non-superficial comprehension of "the Other" (Buber, 1937) - i.e. "love" - is not a privilege but rather an essential prerequisite of the successful co-adaptation of two minds and souls whose destinies are inexorably bound to each other.

Love (DEF)

«Strong positive emotional relation to persons, things, ideas or self. Conscious, effective, voluntary acceptation of the value of the other in one's life. Readiness to be hostage for the other (Levinas). The Platonic tradition accentuates that the less perfect is attracted by the more perfect (love as a longing for what one does not have, especially beauty). In the Christian tradition humans respond to the gift of existence (life, world, happiness, friendship, family) with love, id est by devotion to the well-being of the other, which does not await anything in return (love as devotion). Love expresses itself on all levels of human being, physical, personal and spiritual. It is the only solid bond between humans and the ultimate source of everything in the world which has real value.» (Sokol, 1998)

end love (def) 14.4.2

end from selection to replication 14.4

This being said, we end the first volume of this work with the expression of a simple hope. Of a hope that on the preceding pages we have already succeeded in furnishing some fragmented, preliminary and undoubtedly incomplete yet consistent evidence supporting the theory initiated by the two words forming the Thesis: Mind evolves.

end summa iii 14

Part IV

SIMULATIONS

There is an appealing symmetry in the notion that the mechanisms of natural learning may resemble the processes that created the species possessing those learning processes. — D.E. Goldberg and J. Holland

This part can be understood as a collection of four scientific articles. Each article describes a distinct simulation and can be read individually. The aspiration common to all articles is to provide different facets of a cognitively plausible, ex computatione et simulatione proof-of-concept for the theory of intramental evolution.

The zeroth simulation aspires to demonstrate that Evolutionary Computation (EC) can offer useful insights to an agent hoping to break the code of an unintelligible corpus (e.g. to help decode a riddle as cryptic as the Voynich Manuscript).
The first simulation aspires to demonstrate that EC can be a useful means of multiclass classification of textual documents according to their semantic content (and in a Big Data scenario could potentially lead to results as good as those produced by connectionist "deep learning" methods).

The second simulation aspires to demonstrate that EC can help to identify useful solutions to the problem of multiclass part-of-speech classification.

The third simulation aspires to demonstrate that EC can pave the way to induction of plausible micro-grammars from a solely positive corpus of motherese utterances.

15 BREAKING INTO UNKNOWN CODE

15.1 generic introduction

A cryptologue posed with an unbroken cipher is, in a certain sense, in a position similar to that of a child (P+19) which has just been born into our common world. Both the cryptologue and the child are confronted with novel constellations of symbols and features. Both assume that the data with which they are confronted - a motherese (P+90-93) utterance perceived by the child or a cipher studied by the cryptologue - ultimately carry a certain meaningful message. Both combine their ingenuity with relentless perseverance: both accept that the path to success leads through an ocean of trials and errors (P+22). Ultimately, they both transcend their initial state of limited knowledge and attain understanding: the child shall understand the world and the scholar shall understand the cipher.

This analogy between a child and a cipher-breaker can be pushed even further in case we speak about the cipher stored in the enigmatic medieval Voynich Manuscript (VM). This is so because VM contains a non-negligible amount of visual content and it can be rightfully speculated that if VM contains a cipher to be decoded, then the deciphering process (and its subsequent evaluation) shall be founded on the discovery of associations between VM's visual content and the adjacent "voynichese" script. This is - we believe - similar to the position of a visually non-impaired human child who acquires a non-negligible amount of information about her world and her language by means of associating the components of surrounding visual scenes with simultaneously heard phonemic sequences (e.g. "red ball in mama's hand").

This being said, let's now present the first implications of our "child as a cryptologue" analogy, as published in the article Hromada (2016a).

15.2 abstract

The Voynich Manuscript is a corpus of unknown origin written down in a unique graphemic system and potentially representing the phonic values of an unknown or potentially even extinct language. Departing from the postulate that the manuscript is not a hoax but rather encodes authentic contents, our article presents an evolutionary algorithm which aims to find the most optimal mapping between voynichian glyphs and candidate phonemic values. The core component of the decoding algorithm is a process of maximization of a fitness function which aims to find the most optimal set of substitution rules allowing one to transcribe the part of the manuscript - which we call the Calendar - into lists of feminine names. This leads to micro-grammars which allow us to consistently transcribe dozens among three hundred calendar tokens into feminine names: a result far surpassing both "popular" as well as "state of the art" tentatives to crack the manuscript. What's more, by using name lists stemming from different languages as potential cribs, our "adaptive" method can also be useful in the identification of the language in which the manuscript is written.
As far as we can currently tell, results of our experiments indicate that the Calendar part of the manuscript contains names from Balto-Slavic, Balkanic or Hebrew language strata. Two further indications are also given: primo, the highest fitness values were obtained when the crib list contained names with specific in-fixes at the token's penultimate position, as is the case, for example, for Slavic feminine diminutives (i.e. names ending with -ka and not -a). In the most successful scenario, 240 characters contained in 35 distinct voynichese tokens were successfully transcribed. Secundo, in the case of a crib stemming from the Hebrew language, the whole adaptation process converges to significantly better fitness values when transcribing voynichian tokens whose order of individual characters has been reversed, and when lists of feminine and not masculine names are used as the crib.

15.3 introduction

The Voynich Manuscript (VM) undoubtedly counts among the most famous unresolved enigmas of the medieval period. On approximately 240 vellum pages currently stored as manuscript (MS) 408 in Yale University's Beinecke Rare Book and Manuscript Library, VM contains many images apparently related to botany, astronomy (or astrology) and bathing. Written beside, above and below these images are bulks of sequences of glyphs. All this is certain. Also certain seems to be the fact that in 1912, VM was re-discovered by the Polish book-dealer Wilfrid Voynich in a large palace near Rome called Villa Mondragone. Alongside the VM itself, Voynich also found the correspondence - dating from 1666 - between the Collegio Romano scholar Athanasius Kircher and the contemporary rector of Charles University in Prague, Johannes Marcus Marci. Other attested documents - e.g. a letter from 1639 sent to Kircher by the Prague alchemist Georg Baresch - also indicate that during the first half of the 17th century, VM was to be found in Prague. The very same correspondence also indicates that VM was acquired by the famous patron of arts, sciences and alchemy, Emperor Rudolf II.1 Aside from this, one more fact can be stated with certainty: the vellum of VM was carbon-dated to the early 15th century (Hodgins, 2014).

15.3.1 pre-digital tentatives

Already during the pre-informatic era of the first half of the 20th century, dozens, if not hundreds, of men of distinction had invested non-negligible time of their lives into tentatives to decipher the "voynichese" script. Being highly popular in their time, many such tentatives - like that of Newbold, who claimed to "prove" that VM was encoded by Roger Bacon by means of a 6-step anagrammatic cipher (Newbold, 1928), or that of Strong (Strong, 1945), who claimed VM to be a 16th-century equivalent of the Kinsey Report - may seem to be, when looked upon through the prism of computer science, somewhat irrational2. C.f. (d'Imperio, 1978) for an overview of other 20th-century "manual" tentatives which resulted in VM-deciphering claims. After the description of these tentatives and after the presentation of an informationally very rich introduction to both VM and its historical context, d'Imperio adopts a skeptical stance towards all scholars who associated VM's origin with the personage of Roger Bacon3. In spite of the skeptic she was, d'Imperio had not a priori disqualified the set of hypotheses that the language in which the VM was ultimately written was Latin or medieval English. And such, indeed, was the majority of hypotheses which gained prominence all along the 20th century.4
1 Savants who passed through Rudolf's court included Johannes Kepler, Tycho Brahe or Giordano Bruno. The last one is known to have sold a certain book to the emperor for 600 ducats.

2 Note, for example, Strong's "translation" of one VM passage: "When the contents of the veins rip, the child comes slyly from the mother issuing with leg-stance skewed and bent while the arms, bend at the elbow, are knotted like the legs of a craw-fish." (Strong, 1945) Note also that such a translation was the product of a man who was "a highly respected medical scientist in the field of cancer research at Yale University" (d'Imperio, 1978).

3 "I feel, in sum, that Bacon was not a man who would have produced a work such as the Voynich manuscript...I can far more easily imagine a small society perhaps in Germany or Eastern Europe" (d'Imperio, 1978, 51)

4 Note that such pro-English and pro-Latin bias can be easily explained not by the properties of VM itself, but by the simple fact that the first batches of VM's copies were primarily distributed and popularized among Anglo-Saxon scholars of medieval philosophy, classical philology or occidental history.

15.3.2 post-digital tentatives

The first tentatives to use machines to crack the VM date back to the prehistory of the informatic era. Thus, already during the Second World War, the cryptologist William F. Friedman invited his colleagues to form an "extracurricular" VM study group - programming IBM computers for the sorting and tabulation of VM data was one among its tasks. Two decades later - and already in the position of the first chief cryptologist of the nascent National Security Agency - Friedman formed the 2nd Study Group. Again without ultimate success. One member of Friedman's 2nd Study Group was Prescott Currier, whose computer-driven analysis led him to the conclusion that VM in fact encodes two "statistically distinct" (Currier, 1970) languages. What's more, Currier seems to have been the first scholar who facilitated the exchange and processing of the Voynich manuscript by proposing a transliteration5 of voynichese glyphs into standard ASCII characters. This had been the predecessor of the European Voynich Alphabet (EVA) (Landini and Zandbergen, 1998), which has become a de facto standard when it comes to the mapping of VM glyphs upon a set of discrete symbols. The canonization of EVA, combined with the dissemination of VM's copies through the Internet, has allowed more and more researchers to transcribe the sequences of glyphs in the manuscript into ASCII EVA sequences. It is thanks to the laborious transcription work of people like Rene Zandbergen, Jorge Stolfi or Takeshi Takahashi that verification or falsification of VM-related hypotheses can nowadays be to a great extent automatized. For example, Stolfi's analyses of the frequencies of occurrence of different characters in different contexts have indicated that the majority of voynichese words seem to implement a sort of tripartite crust-core-mantle (or prefix, infix, suffix) morphology. A later study has indicated that the presence of such morphological regularities could be explained as the output of a mechanical device called the Cardan grille (Rugg, 2004). The "hoax hypothesis" is also supported by the study of Schinner (2007), who suggested that "the text has been generated by a stochastic process rather than by encoding or encryption of language". Pointing in a similar direction, the analysis also concludes that "glyph groups in the VM are not used as words".
On the other hand, a methodology based on "first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series", presented in (Amancio et al., 2013), led its authors to the conclusion that VM "is mostly compatible with natural languages and incompatible with random texts". Simply stated, the way diverse "words" are distributed among different sections of VM indicates that these words carry certain semantics. And this indicates that VM, or at least certain parts of it, is not a hoax.

5 In this article we distinguish transliteration and transcription. Transliteration is a bijective mapping from one graphemic system into another (e.g. a VM glyph is transliterated into ASCII's EVA subset). Transcription is a potentially non-bijective mapping between symbols on one side and sound- or meaning-carrying units on the other.

15.3.3 our position

Results of (Amancio et al., 2013) made us adopt the conjecture "VM is not a hoax" as a sort of fundamental hypothesis accepted a priori. Surely, as far as we stand, it cannot be excluded that VM is the work of an abnormal person, of somebody who suffered severe schizophrenia or was chronically obsessed by internal glossolalia (Kennedy and Churchill, 2005). Nor can it be excluded that the manuscript does not encode full-fledged utterances but rather lists of indices, sequences or proper names of spirits-which-are-to-be-summoned, or sutra-like formulas compressed in a sort of private pidgin or sociolect. But given VM's ingenuity, given the effort which the author had to invest into the conception of the manuscript, and given a sort of "elegant simplicity" which seems to permeate the manuscript, we have felt, since our very first contact with the manuscript, a sort of obligation to interpret its contents as meaningful. That is, as having the capability of denoting objects outside of the manuscript itself. As being endowed with the faculty of reference to the world (Frege, 1994) which we, 21st-century interpreters, still inhabit centuries after VM's most plausible date of conception. It is with such a bias in mind that our attention was focused upon a certain regularity which we have later decided to call "the primary mapping".

15.3.4 primary mapping

A condition sine qua non of any act of deciphering is the discovery of rules which allow one to transform an initially meaningless cipher into meaningful information. In the most trivial case, such deciphering is facilitated by a sort of Rosetta Stone (Champollion, 1822) which the decipherer already has at his disposal. Since both the cipher-text as well as the plain-text (also called "the crib") are explicitly given by the Rosetta Stone, the discovery of the mapping between the two is usually quite straightforward. The problem with VM is, of course, that it seems not to contain any explicit key which could help us to decipher its glyphs. Thus, the only source of information which could potentially help us to establish reference between VM's glyphs and the external world are VM's drawings. One such drawing, present atop folio f84r, is shown on Figure 26. Figure 26 displays twelve women bathing in eight compartments of a pool. Bathing women are a very common motif in VM and there seems to be nothing peculiar about them. The fact that word-like sequences are written above the heads of these women is also trivial.
Figure 26: Drawing from folio f84r containing the primary mapping.

One can, however, observe one regularity which seems to be interesting. That is, in case two women bathe in the same compartment, the compartment contains two word-like sequences. If one woman bathes in the compartment, there is only one word-like sequence written above her head. One figure - one word; two figures - two words. This principle is stringently followed and can be seen on other folios as well. What is more, the words themselves are sometimes similar, but they are not the same. Such trivial observations lead to a trivial conclusion: these word-like sequences are labels. And since these labels are juxtaposed to feminine figures, it seems reasonable to postulate that these labels are, in fact, feminine names. This is the primary mapping.

15.3.5 three conjectures

The method which shall be described in the following sections can be considered as valid only under the assumption that the following conjectures are valid:

1. "the primary mapping conjecture": voynichese words beside feminine figures are feminine names

2. "diachronic stability of proper names": proper names are less prone to diachronic change than other language units

3. "Occam razor": instead of containing a sophisticated esoteric cipher, VM simply transmits a text written in an unknown script

Further reasons why we consider "the primary mapping conjecture" as valid shall be given alongside our discussion of "the Calendar". When it comes to the conjecture postulating the "diachronic stability of proper names", we could potentially refer to certain cognitive peculiarities of how the human mind tends to treat proper names (Imai and Haryu, 2001). Or focus the attention of the reader on the fact that for practically every human speaker, one's own name undoubtedly belongs among the most frequent and most important tokens which one hears or utters during one's whole life. This results in a sort of stability against linguistic change and allows the name to cross the centuries with higher probability than words of lesser importance and frequency. But instead of pursuing the debate in such a direction, let's just point out that the successful decoding of the Mycenaean Linear B script (Ventris and Chadwick, 1953) would have been much more difficult if certain toponyms like Amnisos, Knossos or Pylos had not succeeded in carrying their phonetic skeleton through aeons of time. Last but not least, the "Occam razor conjecture" simply makes explicit the belief that a reasonable scientist should not opt to explain VM in terms of anagrams and opaque hermeneutic procedures if similar - or even more plausible - results can be attained when approaching VM as if it were a simple substitution cipher.

15.4 method

The core of our method is an optimization algorithm which looks for such a candidate transcription alphabet A_X which, when applied upon the list of word types occurring in VM's Calendar section, yields an output list whose members should ideally be present in another list, called the Crib. The optimization is done by an evolutionary strategy - an individual chromosome encodes a candidate transcription alphabet, and the fitness function is given as the sum of lengths of all tokens which were successfully transcribed from the Calendar to a specified Crib.

15.4.1 calendar

Six among the twelve words present on Figure 26 occur only on folio f84r.
Six others occur on other folios as well, and five of these six words occur also as labels near feminine figures displayed on the 12 folios of the section commonly known as the "Zodiac". It is in this way that our attention was directed from the limited corpus of the "primary mapping" towards the more exhaustive corpus contained in the Zodiac. Every page of the Zodiac displays multiple concentric circles filled with feminine figures. Attributes of these figures differ - some hold torches, some do not, some are bathing, some are not - but one pattern is fairly regular. Beside every woman there is a star, and beside every star there is a word. While some authors postulate that these words are names of stars or names of days, we postulate that these words are simply feminine names6. From Takahashi's transliterations of the twelve folios of the Zodiac we extract 290 tokens which instantiate 264 distinct word types. To avoid possible terminological confusion, we shall denote this list of 264 labels7 with the term Calendar. Hence, Zodiac is the term referring to folios f70v2 - f73v, while Calendar is simply a list of 264 labels. The total length of these 264 labels is 2045 letters. These characters are chosen from a 19-symbol (|A_cipher| = 19) subset of the EVA transliteration alphabet.

15.4.2 cribbing

Cribbing is a method by means of which the hypothesis that the Calendar contains lists of feminine names can potentially lead to the deciphering of the manuscript. For if the Calendar is indeed such a list, then one could use lists of existing and attested feminine names as hypothetical target "cribs". In cryptanalytic terms, the intuition that the Calendar contains feminine names makes it possible to perform a sort of known-plain-text attack (KPA). We say "a sort of" because in the case of VM, the "cribs" upon which we shall aim to map the Calendar are not known with 100% certainty. Hence, it is maybe more reasonable to understand the cribbing procedure as a plausible-plain-text attack (PPA). This being said, we label as "cribbing" a symbol-substituting procedure P_cribbing which replaces symbols contained in the cipher (i.e. in the Calendar) with symbols contained in the plain-text. Hence, not only the cipher but also the plain-text is an input of the cribbing procedure. Every act of execution of P_cribbing can be followed by an act of evaluation of the usefulness of P_cribbing in regards to its inputs. The ideal procedure would result in a perfect match between the rewritten cipher and the plain-text, i.e.

P_cribbing(cipher) == plain-text

On the other hand, a completely failed P_cribbing results in two corpora which do not have anything in common. And between the two extremes of the spectrum, between "the ideal" and "the completely failed", one can place multitudes of other procedures, some closer to the ideal than others. This makes room for optimization.

6 It cannot be excluded, however, that they are all of this at once. Note, for example, that in many central European countries, it is still a fairly common practice to attribute specific names to specific days of the year, i.e. "meniny".

7 Available at http://wizzion.com/thesis/simulation0/calendar.uniq

Listing 8: Discrete cross-over

#discrete crossover: the child inherits a gene from a randomly
#chosen parent only where the parents' alleles differ
my $child_genome;
my $i = 0;
for (@mother_genome) {
    if ($_ ne $father_genome[$i]) {
        rand > 0.5
            ? ($child_genome .= $mother_genome[$i])
            : ($child_genome .= $father_genome[$i]);
    } else {
        $child_genome .= $mother_genome[$i];
    }
    $i++;
}
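Listing 8 shows only the cross-over operator. For completeness, a minimal sketch of a complementary per-gene mutation operator is given below. It is our reconstruction based solely on the parameters reported in this chapter (per-gene mutation probability below 0.001, genes drawn from the crib alphabet) and not an excerpt from the published voynich.PERL script, so all names and the alphabet used are illustrative.

#sketch of per-gene mutation (illustrative, not the published script):
#each gene is replaced, with small probability, by a random symbol
#of the crib alphabet
my @crib_alphabet = split //, "abcdefghijklmnopqrstuvwxyz";  # hypothetical crib alphabet
my $mutation_rate = 0.001;                                   # per-gene probability
sub mutate {
    my ($genome) = @_;
    my @genes = split //, $genome;
    for my $gene (@genes) {
        $gene = $crib_alphabet[int rand @crib_alphabet]
            if rand() < $mutation_rate;
    }
    return join '', @genes;
}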
15.4.3 optimization

All experiments described in the next section of this article implement an evolutionary computation algorithm strongly inspired by the architecture of the Canonical genetic algorithm (CGA, P+46) (Holland, 1992; Rudolph, 1994). Hence, the initial population is randomly generated and fitness-proportionate (i.e. "roulette wheel", P+42) selection is used as the main selection operator. But contrary to CGAs, our optimization technique does not implement a classical single-point crossover but rather a sort of "discrete crossover" which takes place only in case the parent individuals have different alleles of a specific gene. Another reason why our solution can be considered to be more similar to evolutionary strategies (Rechenberg, 1971) than to CGAs is related to the fact that it does not encode individuals as binary vectors (P+48). Instead, every individual represents a candidate mono-alphabetic substitution cipher, the application of which could, ideally, transform the Calendar into a crib. More formally: given that the cipher is written in symbols of the alphabet A_cipher and given that the crib is written in symbols of the alphabet A_crib, each individual chromosome will have a length of |A_crib| genes and every individual gene can encode one among |A_cipher| values. The size of the search space is therefore $|A_{cipher}|^{|A_{crib}|}$. The search for optima in this space is governed by the fitness function:

$$F_{P_{cribbing}} = \sum_{w \in cipher \,\wedge\, P_{cribbing}(w) \in crib} length(w)$$

where w is a word type which occurs in the cipher (i.e. in the Calendar) and which, after being rewritten by P_cribbing, also matches a token in the input crib. Given that the expression length(w) simply denotes w's character length, the fitness of the candidate transcription procedure P_cribbing is thus nothing else than the sum of character lengths of all distinct labels contained in the Calendar which P_cribbing successfully maps onto the feminine names contained in the input crib.

Listing 9: Cipher2Dictionary adaptation fitness function

#Fitness Function
my $text = $calendar;
my $old  = "acdefghiklmnopqrsty";   # the 19 EVA symbols of the Calendar
my %translit;
@translit{split //, $old} = split //, $individual;
# core transcription of calendar content
$text =~ s/(.)/defined($translit{$1}) ? $translit{$1} : $1/eg;
my %matched;
for (split /\n/, $text) {
    my $token = $_;
    if (exists $crib{$token}) {
        # map the matched token back to its original EVA form
        my %antitranslit;
        @antitranslit{split //, $individual} = split //, $old;
        $token =~ s/(.)/defined($antitranslit{$1}) ? $antitranslit{$1} : $1/eg;
        my $t = $token;
        $matched{$t} = 1;
    }
}
# fitness of individual $i: sum of lengths of distinct matched labels
for (keys %matched) { $Fitness[$i] += length $_; }

15.5 experiments

Within the scope of this article, we present results of two sets of experiments which essentially differed in the choice of name-containing cribs. Other input values (e.g. Takahashi's transliteration of the Calendar used as the cipher) and evolutionary parameters (total population size = 5000, elite population size = 5, gene mutation probability < 0.001) were kept constant between all experiments and subexperiments. Each experiment consisted of ten distinct runs. Each run was terminated after 200 generations.

15.5.1 slavic crib

What we label as the "Slavic crib" is a plain-text list of feminine names which we had compiled from multiple sources publicly available on the Internet. Principal sources of names were websites of western Slavic origin.
This choice was motivated by the following reasons:

1. The oldest more or less certain trace of VM's trajectory points to the city of Prague - the center of western Slavic culture.

2. The orthography of western Slavic languages relatively faithfully represents the pronunciation. That is, there are relatively few digraphs (e.g. the bi-gram "ch", which denotes a voiceless velar fricative). Hence, the distance between the graphemic and the phonemic representations is not as huge as in the case of English or French.

3. Slavic languages have a rich but regular affective and diminutive morphology which is often used when addressing or denoting beloved persons by their first name.

The third reason is worth introducing somewhat further: in both eastern and western Slavic languages, a simple in-fixing of the unvoiced velar occlusive "k" before the terminal vowel "a" of a feminine name leads to the creation of a diminutive form of such a name (e.g. alena → alenka, helena → helenka etc.). The fact that this morphological rule is used both by western as well as eastern Slavs indicates that the rule itself can be quite old, may date to common Slavic or even pre-Slavic periods, and hence was quite probably in action already in the period when VM was written. For the purpose of this article, let's just note that application of the substitution s/a$/ka/ allowed us to significantly increase the extent of the "Slavic crib". Thus, we have obtained a list of 13815 distinct word types which are in quite close relation to the phonetic representations of feminine names used in Europe and beyond8. The alphabet of this crib comprises 38 symbols; hence there exist $19^{38}$ possible ways how symbols of the Calendar could be replaced by symbols of this crib.

8 The Slavic crib is publicly available at http://wizzion.com/thesis/simulation0/slavic_extended.crib

Figure 27: Evolution of individuals adapting labels in the Calendar to names listed in the Slavic crib.

Figure 27 shows the process of convergence from populations of randomly generated chromosomes towards more optimal states. In the case of the runs averaged in the "SUBSTITUTION" curve, the procedure P_cribbing consisted in a simple mapping of the Calendar onto the crib by means of the substitution cipher specified in the chromosome. But in the case of the runs averaged in the "REVERSAL + SUBSTITUTION" curve, the whole process was initiated by the reversal of the order of characters present within individual tokens of the Calendar (e.g. okedy → ydeko, otedy → ydeto etc.).

Let's now look at the contents of individuals which were "identified" by the optimization method. More concrete illustrations can also turn out to be quite illuminating. Hence, if the most elite individual of run 1 (i.e. the one with fitness 197) is used as a means of substitution of EVA characters contained in the Calendar, one will see the appearance of names like ALENA, ALETHE, ANNA, ATENKA, HANKA, HELENA, LENA etc. And when the last one (i.e. the one with fitness 240) is used, the resulting list shall contain tokens like AELLA, ALANA, ALINA, ANKA, ANISSA, ARIANNKA, ELLINA, IANKA, ILIJA, INNA, LILIJA, LILIKA, LINA, MILANA, MILINA, RANKA, RINA, TINA etc.

This being said, the observation that all reversal-implementing runs have converged to genomes which:

1. transcribe e in EVA as the nasal n

2. transcribe k in EVA as the velar k

3. transcribe t in EVA as the nasal n

4. transcribe y in EVA as the vowel a

5. transcribe a in EVA as a vowel (80% of the time as "i", 10% as "e", 10% as "o")

6.
transcribe l in EVA as either a liquid consonant (80% "l", 10% "r") or as "m" (10%)

...could also be of certain use and importance.

Table 40: Fittest chromosomes which map reversed tokens in the Calendar onto names of the Slavic crib (each row pairs a run's fitness with its fittest chromosome; chromosome positions correspond to the EVA symbols a c d e f g h i k l m n o p q r s t y).

15.5.2 hebrew crib

At this point, a skeptical mind could start to object that what our algorithm adapts to is in fact not the Calendar, but the statistical properties of the crib. And in the case of such a long and sometimes somewhat artificial list like Crib_Slavic, such an objection would be to a great extent justified. For the adaptive tendencies of our evolutionary strategy are indeed so strong that it would find a way to partially adapt the Calendar to any crib which is long enough9.

For this reason, we have decided to target our second experiment not at the biggest possible crib but rather at the oldest possible crib. And given that our first experiment has indicated that it seems to be more plausible to interpret labels in the Calendar as if they were written in reverse, id est from right to left, our interest was gradually attracted by the Hebrew language10. This led us to two lists of names:

• Crib_Hebrew-men contains 555 masculine names11

• Crib_Hebrew-women contains 283 feminine names12

Both lists were extracted from the website finejudaica.com/pages/hebrew_names.htm and were chosen because they did not contain any diacritics and hence transcribe Hebrew names in a similar way as they had been transcribed millennia ago.

9 This has been, indeed, shown by multiple micro-experiments which we do not report here due to lack of space. No matter whether we used cribs as absurd as a list of modern American names or the Enochian of John Dee and Edward Kelly, we could always observe a sort of adaptation marked by an increase of fitness. But it was never as salient as in the case of Crib_Slavic or Crib_Hebrew.

10 Other reasons why we decided to focus on Hebrew include: the important presence of the Jewish diaspora in the Prague of Rudolph the 2nd (c.f. the story of rabbi Loew and the Golem of Prague); the ritual bathing of Jewish women known as mikveh; the usage of VM-resembling triplicated forms (e.g. amen, amen, amen) in Talmudic texts; the attested existence of the so-called Knaanic language which seems to be principally a Czech language written in Hebrew script, et caetera et caetera.

11 http://wizzion.com/thesis/simulation0/jewish_men

12 http://wizzion.com/thesis/simulation0/jewish_women

Figure 28: Evolution of individuals adapting labels in the Calendar to names listed in the Hebrew cribs.

Figure 28 displays the summary of all runs which aimed to transcribe the Calendar with Hebrew names. As may be seen, the whole system converged to the highest fitness values when Crib_Hebrew-women was used in concordance with the reversal of the order of characters. In such a scenario, the minimal attained fitness was F_min(hebrew28,283,hfr) = 52 and the maximal attained fitness was F_max(hebrew28,283,hfr) = 63. The difference between the results of the hebrew, reverse batch of runs and the results of other batches is statistically significant (Welch two sample t-test, p-value < 7e-10).
Subsequently, a list of 283 tokens was randomly generated in such a way that the distribution of lengths of the randomly generated sequences was identical to the distribution of lengths of names in the Hebrew crib. Among 10 runs aiming to adapt the Calendar to such a random crib, the maximal attained fitness was F_max(random28,283,hfr) = 26. The statistical difference between the results of the batch of runs adapting to the valid character-reversed Hebrew crib hebrew28,283,hfr and the equidistributed randomly generated crib random28,283,hfr turned out to be strongly significant (Welch two sample two-sided non-paired t-test: t = 22.0261, df = 15.442, p-value = 4.384e-13).

The highest fitness value was attained by the cribbing procedure which first reverses the order of the characters whose EVA representations are subsequently substituted by the following chromosome. This chromosome transcribes the voynichese Calendar labels okam, otainy, otey, oty, otaly, okaly, oky, okyd, ched, otald, orara, otal, salal and opalg to feminine Hebrew names (i.e. Bina, Gabriela, Ghila, Gala, Galila, Galina, Gina, Degana, Diyna, Deliyla, Yedidya, Lila, Lilit and Alica).

Worth mentioning are also some other phenomena related to these transcriptions. One can observe, for example, that the label "otaly" - translated as Galina - is also present on folios f33v, f34r or f46v, which all contain drawings of torch-like plants. This is encouraging because the word "galina" is not only a Hebrew name, but also a substantive meaning "torch". Similarly, the word "lilit" is not only a name but also means "of the night". This word supposedly translates the voynichese token "salal" which is very rare - beside the Calendar it occurs only on the purely textual folio f58v and on folio f67v2 which, surprise!, may well depict circadian rhythms of sunrise, sunset, day and night. Or it could be pointed out that the huge majority of occurrences of the voynichese trigram "oky" (potentially denoting the name "gina", which also means "garden") is to be observed on herbal folios. Or the distribution of instances of "okam" (transcribed as "bina", which means "intelligence and wisdom"13) could, and potentially should, be taken into consideration. Or maybe not.

13 Note that "bina" is one among the highest sephirots, located at the north-western corner of the kabbalistic tree of life. In this context it is worth noting that the only partially readable EVA group "...kam" occurs as a third word near the north-western "rosette" of folio 85v2. Such considerations, however, bring us too far.

15.6 conclusion

In 2013, BBC Online announced a "Breakthrough over 600-year-old mystery manuscript". The breakthrough was to be effectuated by Stephen Bax who, in his article, describes the process of deciphering as follows: « ?» (?) What Bax does not add, unfortunately, is that the Voynich crossword puzzle is so big that anyone who looks at it closely enough can find in it small islands of order, local optima where a few characters seem to fit the global pattern. Thus, even if Bax had succeeded, as he states, in the "identification of a set of proper names in the Voynich text, giving a total of ten words made up of fourteen of the Voynich symbols and clusters", this would mean nothing else than that he had identified a locally optimal transcription alphabet.

In this article, we have presented two experiments employing two different lists of feminine names.
Both experiments have indicated that if labels in the Zodiac encode feminine names, then these have been originally written from right to left14. The first experiment led to the identification of multiple substitution alphabets which allow one to map 240 EVA letters, contained in 40 distinct words present in the Calendar, onto 35 feminine-name-resembling sequences enumerated among the 13815 items of Crib_Slavic. Results of the second experiment indicate that if ever the Calendar contains lists of Hebrew names, then these names would more probably be feminine rather than masculine. This is, as far as we can currently say, all that could potentially be offered as an answer to the question «Can Evolutionary Computation Help us to Crib the Voynich Manuscript?» (Hromada, 2016a). Everything else is - without help coming from experts in other disciplines - just speculation.

14 Note, however, that this does not necessarily imply that the scribe of VM (him|her)self had written the manuscript in right-to-left fashion. For example, in case (s)he was just reproducing an older source which (s)he didn't understand, his|her hand could trace movements from left to right while the very original had been written from right to left.

15.7 generic conclusion

Looked upon from a superficial point of view, the article presented in this "zeroth analysis" contains nothing else and nothing more than:

1. a very brief description of a particular enigma commonly known as the "Voynich Manuscript"

2. the introduction of a so-called "primary mapping" hypothesis potentially able to direct any future tentative to decipher the manuscript

3. a discussion of the inner workings of an "evolutionary algorithm" whose source code is hereby transferred to the public domain15

4. the presentation of fairly reasonable results obtained after confrontation of the manuscript with the algorithm, which takes lists of Slavic and Hebrew names at its input

15 http://wizzion.com/thesis/simulation0/voynich.PERL

What is meant by the attribute fairly reasonable is, of course, a place for argument. And contrary to legions of other researchers, we do not pretend that we have succeeded to "crack" the manuscript. We simply state that after being executed on a single core of a 1.8GHz CPU, a simple 160-line script written in pure PERL can yield, in just a few hours, intelligible transcriptions of "lattices of terms" contained in a previously unknown corpus. Thanks to a fairly trivial derivative of a Canonical Genetic Algorithm, an average home PC can closely approximate a brute-force search which would otherwise run for weeks (at least) even when executed on state-of-the-art computational clusters. Simply stated, our 0th simulation indicates that which has already been indicated many times before:

Evolution narrows down the search to regions where the most plausible hypotheses reside.

A non-negligible speed-up goes hand in hand with such narrowing-down. And it is evident that such a speed-up can be useful for any system which can invest only a limited amount of time and energy into its search for the most optimal hypothesis. It does not really matter whether the system of which we speak in this context is a PERL script, a child's mind or Nature herself: a problem-solving system which implements evolutionary principles tends to converge (Rudolph, 1994) to "the answer" in less time, and with fewer resources wasted, than a system which does not implement such principles.
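The magnitude of this narrowing-down can be made concrete with a back-of-the-envelope comparison between the brute-force search space and the number of fitness evaluations actually spent by the evolutionary strategy, assuming the parameters reported in section 15.5 (19-symbol cipher alphabet, 38-symbol crib alphabet, population of 5000, 200 generations, ten runs). The snippet below is purely illustrative and not part of the published script.

#back-of-the-envelope comparison: brute force vs. evolutionary search
my $brute_force  = 19 ** 38;         # all candidate substitution alphabets
my $evolutionary = 5000 * 200 * 10;  # individuals x generations x runs
printf "brute force:  %.1e candidate alphabets\n", $brute_force;   # ~3.9e+48
printf "evolutionary: %.1e fitness evaluations\n", $evolutionary;  # 1.0e+07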
At least as fascinating as her ability to speed things up is evolution's propensity to produce adaptations. The zeroth simulation is particularly instructive in this regard: as noted in footnote 9, the VM-to-crib transcribing EA produced certain results even in cases when cribs such as a "list of 20th-century American names" were used as target dictionaries. In spite of the absurdity of such cribs - for it is indeed highly improbable that VM initially contained names like Butch or Mitch - the EA succeeded in discovering certain inherent similarities between the two texts in order to exploit them in its future search. Thus, the main conclusion of the 0th simulation can be stated as follows:

Evolution is able to facilitate the search for an optimal mapping between distinct corpora encoded in distinct forms of representation.

In this simulation, the distinct forms of representation were the so-called EVA alphabet (into which VM is transliterated) and the phonemic alphabets common to Slavic or Hebrew languages. The mapping itself was nothing else than a simple substitution of one symbol from one alphabet by one symbol from another alphabet. A mapping - a hypothesis - was considered "the fittest" if it succeeded to transcribe the initially unintelligible EVA corpus into an intelligible list of names. Both the EVA corpus and the name list were the EA's inputs and thus, in a certain sense, "innate" to each individual run of the algorithm. What was "acquired" during the process was the set of mono-alphabetic substitution rules. The EA presented in the 0th simulation is thus an example of evolution which processes strictly "symbolic" representations. This will not be the case in the simulations which are now to follow: let's now descend to the realm of sub-symbolic (vectorial) entities in order to propose an evolutionary solution to the problem of category induction.

16 EVOLUTIONARY LOCALIZATION OF SEMANTIC PROTOTYPES

16.1 generic introduction

How does a child create mappings between "signifiers" and "signifieds" (de Saussure, 1916), between words and their meanings? How do concepts emerge in the mind of a child? These questions are addressed in many places of Conceptual Foundations. Be it during our discussion of the "ontogeny of lexicon and semantics" (P+72-78) or classical theories thereof (P+93-95), be it during the definition of "category prototype" (P+132) or in the Hebb/Harris analogy (P+133) suggesting a sort of equivalence between Hebb's law, well-known to neuroscientists, and the so-called "distributional hypothesis", well-known to linguists, it has been indicated in multiple places that what contemporary linguists label as "vocabulary development" is, in its essence, nothing else than a usage-based, goal-oriented, associationist process. And that Chomsky's critique of Skinner (P+95) is, in regards to the acquisition of meanings, quite inappropriate: in fact, it does not even apply. This is so because the first syntactic representations (P+173-179) are acquired, tuned and perfected later than the first semantic constructions (P+179-184).

And how could such "vocabulary development" be simulated by an engineer willing to do so? In an ideal world, such an engineer would have to have, at least, two things at his disposal:

• a corpus C representing the world of a modal toddler: it should contain representations of objects with many attributes (some of them could and should mutually overlap)

• an algorithm A capable of clustering objects into categories in a "cognitively plausible" (P+13) way (i.e.
similar to the way a child's mind does it)

Unfortunately, as of 2016, no such C is available, at least not in a textual form which could be processed by means of methods commonly used in computational linguistics (P+112-164). The corpus CHILDES (P+207-209) is as close as one can get to C but, and this is a non-negligible "but", CHILDES contains transcripts representing interactions within certain worlds BUT does not contain descriptive representations of these worlds themselves. And as we have noted elsewhere (Hromada and Gaudiello, 2014), the construction of such a corpus surpasses by far the possibilities of any individual engineer and thus also the possibilities of this dissertation.

Willing to develop A but without a proper C, one is obliged to approximate. In regards to simulations of the induction of meaning, a plausible approximation could be proposed as follows: let's suppose that text documents are "objects" and that groups of objects which have similar semantic content (i.e. refer to or speak about similar things) delimit a certain "semantic category". Under such a supposition - and under such a supposition only - can one reduce the problem of vocabulary development to a problem of multi-class categorization of documents. Under such ceteris paribus - and under such ceteris paribus only - can one pretend that the model first published in the article «Genetic Optimization of Semantic Prototypes for Multi-class Document Categorization» (Hromada, 2015) could, in the long run, potentially lead to full-fledged computational models of vocabulary development.

16.2 introduction

In computational theories and models of learning, one generally works with two types of models: regression and classification. While in regression models one maps a continuous input domain onto a continuous output range, in models of classification one aims to find mappings able to project input objects onto a finite set of discrete output categories. This article introduces a novel means of construction of a particular type of the latter kind of learning models. Due to the finite and discrete nature of its output range, classification - also called categorization by more cognition-oriented researchers - seems to be of utmost importance in any cognitively plausible (Hromada, 2014b) model of learning. But under these terms, two distinct meanings are confounded, and the term categorization thus often represents both:

1. the process of learning (e.g. inducing) categories

2. the process of retrieving information from already learned (induced) categories

which crudely correspond to the training, resp. testing phases of supervised learning algorithms. In the rest of this section we shall more closely introduce an approach combining the notions of category prototype, dimensionality reduction and evolutionary computing in order to yield a potentially "cognitively plausible" means of supervised machine learning of a multi-class classifier. We shall subsequently present the specificities of a Natural Language Processing (NLP) simulation which was executed in order to assess the feasibility of our approach. Results hence obtained shall subsequently be compared with the comparable "deep learning" semantic hashing technique of (Salakhutdinov and Hinton, 2009). The article shall be concluded with a few remarks integrating the whole research into more generic theories of neural and universal Darwinism.
16.2.1 geometrization of categories

In contemporary cognitive science, categories are often understood as entities embedded in a ∆-dimensional feature space (Gärdenfors, 2004). The most fundamental advantage of such models, whose computer science counterparts are the so-called "vector symbolic architectures" (VSAs) (Widdows and Cohen, 2014), is their ability to geometrize one's data, i.e. to represent one's data-set in a form which allows one to measure distances (similarities) between individual items of the data-set. Thus, even entities like "word meanings" or "concepts" can be geometrically represented, either as points, vectors or sub-spaces of the enveloping vector space S. One can subsequently measure distances between such representations, e.g. the distance of the meaning of the word "dog" from the meaning of "wolf" or "cat" etc.

Geometrization of one's data-set once effectuated, the space S can subsequently be partitioned into a set R of |C| regions, $R = \{R_1, R_2, ..., R_{|C|}\}$. In an unsupervised scenario, such partitioning is often done by means of diverse clustering algorithms, the most canonical among which is the k-means algorithm (MacQueen et al., 1967). Such clustering mechanisms often characterize a candidate cluster C_X in terms of a geometric centroid of the members of the cluster. The feasibility of a certain partition is subsequently assessed in terms of "internal clustering criteria" which often take into account the distances among such centroids. In the rest of this article, however, we shall aim to computationally implement a supervised learning scenario, and instead of working with the notion of a category's geometric centroid, our algorithm shall be based upon the notion of a category's prototype.

The notion of the prototype was introduced into science notably by the theory of categorization of Eleanor Rosch, which departed from the theoretical postulate that "the task of category systems is to provide maximum information with the least cognitive effort" (Rosch, 1999). In the seminal psychological and anthropological studies which followed, Rosch realized that people often characterize categories in terms of one of their most salient members. Thus, a prototype of category C_X can be most trivially understood as such a member of C_X which is the most prominent, salient member of C_X. For example, "apples" are prototypes of the category "fruit" and "roses" are prototypes of the category "flowers" in the western cultural context. But the studies of Rosch have also suggested another, more mathematical notion of how prototypes can be formalized and represented. A notion which is based upon the notion of closeness (e.g. "distance") in a certain metric space:

"items rated more prototypical of the category were more closely related to other members of the category and less closely related to members of other categories than were items rated less prototypical of a category" (Rosch and Mervis, 1975)

Given that this notion is essentially geometric, the problem of the discovery of a set of prototypes can potentially be operationalized as a problem of minimization of a certain fitness function. The fitness function, as well as the means by which it can be optimized, shall be furnished in section 16.3. But before doing so, let's first introduce certain computational tricks which allow one to reduce the computational cost of such a search for the most optimal constellation of prototypes.
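Before moving on, a minimal illustration of the Rosch and Mervis operationalization just quoted: the following sketch picks, out of a toy set of feature vectors, the member of a category which is closest to the other members of its own category and furthest from members of other categories. Both the Euclidean metric and the data layout are illustrative assumptions, not choices of the simulation described below.

#illustrative sketch of Rosch & Mervis prototypicality: the prototype
#of a category is its member closest to its own category and furthest
#from the other categories
sub euclid {
    my ($u, $v) = @_;
    my $s = 0;
    $s += ($u->[$_] - $v->[$_]) ** 2 for 0 .. $#$u;
    return sqrt $s;
}
sub prototype_of {
    my ($members, $others) = @_;     # two array-refs of vector-refs
    my ($best, $best_score);
    for my $m (@$members) {
        my $score = 0;
        $score += euclid($m, $_) for @$members;  # closeness to own category
        $score -= euclid($m, $_) for @$others;   # remoteness from the rest
        ($best, $best_score) = ($m, $score)
            if !defined($best_score) || $score < $best_score;
    }
    return $best;                    # the most prototypical member
}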
16.2.2 radical dimensionality reduction

There is potentially an infinite number of ways how a data-set D consisting of |D| documents can be geometrized into a ∆-dimensional space S. In NLP, for example, one often looks for occurrences of diverse words in the documents of the data-set (e.g. corpus). Given that there are |W| distinct words occurring in the |N| documents of the corpus, one used to geometrize the corpus by means of an N × W co-occurrence matrix M whose X-th row vector represents the X-th document N_X, whose Y-th column vector represents the Y-th word W_Y, and whose element at position M_X,Y represents the number of times W_Y occurred in N_X. Given the sparsity of such co-occurrence matrices, as well as for other reasons, such bag-of-words models are more or less abandoned in contemporary NLP practice for the sake of more dense representations, whereby the dimensionality of the resulting space, d, is much less than |W|: d ≪ |W|.

Renowned methods like Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997) set aside because of their computational cost, we shall use the Light Stochastic Binarization (LSB) (Hromada, 2014c) algorithm to perform the most radical dimensionality-reducing geometrization possible. LSB is an algorithm issued from the family of algorithms based on so-called random projection (RP). The validity and feasibility of all these algorithms, be it Random Indexing (RI, (Sahlgren, 2005)) or Reflective Random Indexing (RRI, (Cohen et al., 2010)), is theoretically founded on the so-called Johnson-Lindenstrauss lemma, whose corollary states that "if we project points in a vector space into a randomly selected subspace of sufficiently high dimensionality, the distances between the points are approximately preserved" (Sahlgren, 2005). The methods of application of this lemma in concrete NLP scenarios being described in the references above, we precise that LSB can be labeled as the "most radical" variant of RP-based algorithms because:

• it tends to construct spaces with as small a dimensionality as possible (in LSB, d < 300; in RI or RRI models, d > 300)

• LSB tends to project the data onto binary and not real or complex spaces

It can, of course, be the case that such dimensionality reduction and binarization lead to a certain decrease in the discriminative accuracy of LSB-produced spaces. On the other hand, given that dimensionality reduction and binarization necessarily bring about a reduction of the computational complexity of any subsequent algorithm which could be used to explore the resulting space S, such a decrease of accuracy is to be all the more swiftly counteracted by subsequent optimization. The goal of this study is to explore whether such post hoc optimization of classifiers operating within dense, binary, LSB-produced spaces is possible, and whether the combination of the two can be used as a novel means of machine learning. But before describing such evolutionary optimizations more closely, let's precise that because of its low-dimensional and binary nature, LSB can also be understood as yielding a sort of "hashing function" aiming to attribute similar hashes to similar documents and different hashes to different documents. In this sense, LSB is similar to approaches like Locality Sensitive Hashing (LSH, Datar et al. (2004)) or Semantic Hashing (SH, Salakhutdinov and Hinton (2009)), often used, or at least presented, as solutions to the multi-class classification of Big Data corpora.
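The random-projection idea underlying LSB can be conveyed in a few lines of code. The fragment below is only a minimal illustration under stated assumptions - every word receives a sparse random "seed" vector with 4 non-zero values (the value reported in section 16.4), and a document's binary hash is obtained by summing the seed vectors of its words and thresholding the sums - and is in no way the LSB reference implementation, which is specified in (Hromada, 2014c).

#minimal illustration of random-projection binarization (not the LSB
#reference implementation): words receive sparse random seed vectors,
#documents receive binary hashes obtained by summing and thresholding
my $dim = 128;                     # target dimensionality (delta)
my %seed;                          # word => seed vector (array-ref)
sub seed_vector {
    my @v = (0) x $dim;
    $v[int rand $dim] += (rand() < 0.5 ? -1 : 1) for 1 .. 4;
    return \@v;
}
sub hash_document {
    my (@words) = @_;
    my @sum = (0) x $dim;
    for my $w (@words) {
        $seed{$w} //= seed_vector();
        $sum[$_] += $seed{$w}[$_] for 0 .. $dim - 1;
    }
    return join '', map { $_ > 0 ? 1 : 0 } @sum;   # binarization step
}
print hash_document(qw(red ball in mama s hand)), "\n";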
It is with the results of the latter, "deep learning" approach (SH) that we shall compare our own results in section 16.5.

16.3 genetic localization of semantic prototypes

Let $D = \{d_1, ..., d_{|D|}\}$ be a training data-set consisting of |D| documents, to each of which the training data-set attributes one among |L| corresponding members of the set of class labels $L = \{L_1, ..., L_{|L|}\}$. Let $\Gamma$ denote a tuple $\Gamma = (C_1, ..., C_{|L|})$ whose individual elements are sets containing the indices of those members of D to which the same label $L_l$ is attributed in the training corpus (e.g. $C_1 = \{3, 4, 5\}$ if the training corpus attributes its 1st label only to documents $d_3$, $d_4$ and $d_5$). Let $H = \{h_1, ..., h_{|D|}\}$ be a set of ∆-dimensional binary vectors attributed to members of D by a hashing function $F_H$, i.e. $h_X = F_H(d_X)$. Let S be the ∆-dimensional binary (Hamming) space into which members of H were projected by application of the mapping $F_H$. Then the classificatory pertinence $F_{CP}$ of the candidate prototype $P_K$ of the K-th class ($K \leq |C|$) can be calculated as follows:

$$F_{CP}(P_K) = \alpha \sum_{t \in C_K} F_{hd}(h_t, P_K) - \omega \sum_{f \notin C_K} F_{hd}(h_f, P_K) \qquad (5)$$

whereby $P_K$ denotes the position of the prototype in S, $F_{hd}$ denotes the Hamming distance1, $h_t$ denotes the hash of a "true" document belonging to the same class as the prototype, $h_f$ is the vector of a "false" document belonging to some other class of the training corpus, and $\alpha$ and $\omega$ are weighting parameters. In simpler terms, an ideal prototype of category C is as close as possible to members of C and as far away as possible from members of other categories. Given such a definition of an ideal prototype, an ideal |C|-class classifier I can be trained by searching for such a set $P = \{P_1, ..., P_{|L|}\}$ of individual prototypes which minimizes the overall classification pertinence:

$$I = \min \sum_{K=0}^{|L|} F_{CP}(P_K) \qquad (6)$$

In simpler terms, an ideal |C|-class classifier I is composed of |C| individual prototypes which are as close as possible to documents of their respective categories, and as far away as possible from all other documents. Equations 5 and 6 taken together, one obtains a fitness function which can be optimized by evolutionary computing algorithms. And given that one explores prototypical constellations embedded in a binary space, one can use canonical genetic algorithms (CGAs, Goldberg (1990)) for the optimization of the problem of discovery of the ideal constellation of most pertinent prototypes. We choose CGAs for three principal reasons. Primo, we choose CGAs mainly for their property, proven in Rudolph (1994), of converging to the global optimum in finite time if ever they are endowed with a best-individual-protecting, elitist strategy. Secundo, one can obtain a practically useful and exploitable increase in speed simply due to the fact that CGAs are conceived to process binary vectors, and do so on CPUs which are essentially built for processing such vectors. Tertio, CGAs offer a canonical, well-defined, "baseline" gateway to much more sophisticated evolutionary computing (EC) techniques and are well understood by both neophytes as well as the foremost experts of the EC community. For this reason, we consider it superfluous to describe in closer detail the inner workings of a CGA: instead, the references (Goldberg, 1990; Rudolph, 1994) are to be followed and read.
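In a binary space, both equation 5 and the Hamming distance on which it relies reduce to a handful of bit operations. The following sketch is a minimal illustration (with α = ω = 1), assuming hashes and prototypes are stored as packed Perl bit strings so that XOR and bit-counting can be performed with built-in string operators; it is not the code used in the reported experiments.

#minimal sketch of classificatory pertinence F_CP (equation 5),
#with alpha = omega = 1; hashes are packed bit strings
sub hamming {
    my ($h1, $h2) = @_;
    return unpack '%32b*', ($h1 ^ $h2);   # count of differing bits
}
sub pertinence {
    my ($prototype, $true_hashes, $false_hashes) = @_;
    my $f = 0;
    $f += hamming($_, $prototype) for @$true_hashes;    # own category
    $f -= hamming($_, $prototype) for @$false_hashes;   # other categories
    return $f;    # lower values are better: F_CP is to be minimized
}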
Given that the particular values of the mutation and cross-over parameters shall be specified in the following section, the only thing of which the reader now needs to be reassured is her correct understanding of the nature of the data structures which the algorithm hereby proposed shall implement in order to encode an individual |C|-class classifier: given that equation 5 defines a prototype candidate as a position in a ∆-dimensional Hamming space, and given that equation 6 stipulates that an ideal |C|-class classifier is to be composed of representations of |C| ideal prototype candidates, the data structure representing an individual solution can be constructed by a simple concatenation of |C| ∆-dimensional vectors. Thus, the individual members of the populations which the CGA shall optimize are, in essentia, nothing else than binary strings of length |C|*∆.

1 The Hamming distance of two binary vectors h1 and h2 is the smallest number of bits of h1 which one has to flip in order to obtain h2. It is equivalent to the number of non-zero bits in the XOR(h1, h2) binary vector.

16.4 corpus and training parameters

In order to be able to compare the performance of our algorithm with non-optimized LSB and SH, the same corpus and dimensionality parameters were chosen as those already reported in the previous studies (Salakhutdinov and Hinton, 2009; Hromada, 2014c). Thus, the dimensionality of the resulting binary hashes was ∆=128. Every document of the corpus was hence attributed a 16-byte long hash. The so-called "20newsgroups" corpus2 has been used. The corpus contains 18,845 postings taken from the Usenet newsgroup collection, divided into a training set containing 11,314 postings and a testing set of 7531 postings (|D_training| = 11313, |D_testing| = 7531). Both training and testing subsets are divided into 20 different newsgroups which each correspond to a distinct topic. Given that every distinct topic represents a distinct category label, |C| = 20. Documents of the corpus were subjected to a very trivial form of pre-processing: documents were split into word-tokens by means of the [^\w] separator. Stop-words contained in the PERL library Lingua::StopWords were subsequently discarded. The 3000 word types with the highest "inverse document frequency" value were used as initial terms, to each of which the initial random indexing iteration attributed 4 non-zero values. The hashing function was F_H = LSB(∆ = 128, Seed = 3, Iterations = 2), because there were 2 "reflective" iterations preceding the ultimate stage of "binarization". Once hashes were attributed to all documents of the corpus, the Hamming space S was considered as constructed and stayed unmodified during all phases of subsequent optimizations and evaluations. As a CGA-compliant algorithm, the optimization applied generated the new generation by crossing over two parent solutions chosen by the fitness-proportionate (i.e. roulette wheel) selection operator. Each among the 2560 (128*20) genes was subsequently mutated (i.e. the corresponding bit was flipped to its opposite value) with a probability of 0.1%. The population contained 200 individuals; the zeroth generation was randomly generated. An elitist strategy was implemented so that all individuals with equally best fitness survived intact the transition to the next generation. Parameters α and ω (c.f. equation 5) used in fitness estimation were both set to 1.

2 http://qwone.com/~jason/20Newsgroups/
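A minimal sketch of the pre-processing just described is given below: tokenization on non-word characters and removal of English stop-words via the Lingua::StopWords library named above. The IDF-based selection of the 3000 initial terms is only indicated by a comment, since its exact implementation is not reproduced here.

#minimal sketch of the document pre-processing of section 16.4
use Lingua::StopWords qw(getStopWords);
my $stopwords = getStopWords('en');   # hash-ref of English stop-words
sub preprocess {
    my ($document) = @_;
    return grep { length($_) && !$stopwords->{$_} }
           split /[^\w]+/, lc $document;
    # in the actual simulation, only the 3000 word types with the
    # highest inverse document frequency are subsequently kept
}
my @tokens = preprocess("Subject: re: space shuttle thermal tiles");
print "@tokens\n";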
Information concerning the category labels guided the optimization during the training phase. During the testing phase, such information was used only for evaluation purposes. Multiple independent runs were executed, and the values of precision and recall were averaged among the runs in order to reduce the impact of stochastic factors upon the final results.

16.5 evaluation and results

Every 250th generation, the classificatory accuracy of the individual solution with minimal overall classification pertinence (c.f. equation 6) was evaluated in regards to the 7531 documents contained in the testing part of the corpus. The following aspects of the classifier's performance were evaluated in order to allow comparison with the Precision-Recall curves presented in (Salakhutdinov and Hinton, 2009; Hromada, 2014c):

Precision = (number of retrieved relevant documents) / (total number of retrieved documents)

Recall = (number of retrieved relevant documents) / |D_testing|

The notion of relevancy is straightforward: an arbitrary document D_T contained in the testing corpus is considered to be relevant to a query document D_Q if and only if they were both labeled with the same category label, L_Q = L_T. On the other hand, the correct understanding of what is meant by "retrieved" is the key to the correct understanding of the core idea behind the functionality of the algorithm hereby proposed. That is: the prototypes induced by the CGA optimization are to be used as retrieval filters. We precise: given a hash h_Q of a query document d_Q, one can easily identify - among the |C| prototypes encoded as components of a quasi-ideal constellation I furnished by the CGA - such a prototype P_N which is nearest to h_Q. Subsequently, each among the N documents whose hashes are the N nearest neighbors of the prototype P_N should be considered as retrieved by d_Q. The prototypes discovered during the training phase therefore primarily specify, during the testing phase, which documents are to be considered as retrieved, and which not. For all LSB curves present on Figure 29, the size of such a retrieval neighborhood was set to N=2000. Also, in order to obtain viable precision-recall curves, the radius R = (0, ..., ∆ = 128) of the Hamming ball was used as a trade-off parameter.

Figure 29: Retrieval and 20-class classification performance in 128-dimensional binary spaces. Non-LSB results are reproduced from Figure 6 of the study (Salakhutdinov and Hinton, 2009), plain LSB from (Hromada, 2014c).

For every data-point of the plot on Figure 29, h_N was considered as retrieved by the query h_Q only if the Hamming distance between the query and the candidate document was smaller than R (hd(h_Q, h_N) < R). Points on the very left of the plot thus correspond to R=0 (i.e. h_Q and h_N collide), while points on the right correspond to R=128 (i.e. h_Q does not have a single bit in common with h_N). As the comparison of curves on the figure indicates, the biggest increase in performance is attained by the decision to use prototypes as retrieval filters. Thus, when one uses the most fit among 200 randomly chosen prototype constellations as a retrieval filter (c.f. curve CGA1(LSB)), one obtains significantly better results than when one does not use any prototypes at all (c.f. curve "Plain LSB"). If the process is followed by further genetic optimization (c.f. CGA500 for the situation after 500 generations), one observes a non-negligible increase of precision in the high-recall region of the spectrum.
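Before turning to the limits of the optimization, the retrieval-filter mechanism just evaluated can be condensed into a few lines. The sketch below is an illustration, not the code of the reported experiments; it assumes the hamming() helper from the earlier sketch, and it combines the two filters described in section 16.5: the N-nearest-neighborhood of the prototype nearest to the query, and the Hamming ball of radius R around the query.

#minimal sketch of prototype-based retrieval (N and R as in section 16.5)
sub retrieved_by {
    my ($query, $prototypes, $doc_hashes, $N, $R) = @_;
    # 1. prototype nearest to the query hash
    my ($nearest) = sort { hamming($query, $a) <=> hamming($query, $b) }
                    @$prototypes;
    # 2. the N nearest neighbours of that prototype form the retrieval pool
    my @pool = (sort { hamming($nearest, $a) <=> hamming($nearest, $b) }
                @$doc_hashes)[0 .. $N - 1];
    # 3. the Hamming ball of radius R around the query acts as a filter
    return grep { defined($_) && hamming($query, $_) <= $R } @pool;
}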
For every data-point of the plot on Figure 29, hN was considered as retrieved by the query hQ only if the hamming distance between the query and the candidate document was smaller than R, i.e. hd(hQ, hN) < R. Points on the very left of the plot thus correspond to R=0 (i.e. hQ and hN collide), while points on the right correspond to R=128 (i.e. hQ does not have a single bit in common with hN).

As comparison of the curves on the figure indicates, the biggest increase in performance is attained by the decision to use prototypes as retrieval filters. Thus, when one uses the most fit among 200 randomly chosen prototype constellations as a retrieval filter (c.f. curve CGA1(LSB)), one obtains significantly better results than when one does not use any prototypes at all (c.f. curve "Plain LSB"). If the process is followed by further genetic optimization (c.f. CGA500 for the situation after 500 generations), one observes a non-negligible increase of precision in the high recall region of the spectrum. But it can also be seen that the optimization has its limits: there is a slight decrease between the 500th and 1000th generation which potentially corresponds to a situation whereby the induced prototype constellation tends to over-fit the training data-set. This leads to a subsequent decrease in overall accuracy of classification of documents contained in the testing data-set.

Figure 29 also suggests that the genetic discovery of sets of prototypes - and their subsequent use as retrieval filters - seems to produce results which are better than those produced by both binarized Latent Semantic Analysis and SH. An exception to this is SH's 20% precision at the recall level of 51.2%. Note, however, that since on page 6 of their article, Salakhutdinov and Hinton (2009) claim to have used their hashes as retrieval filters of a neighborhood of size N=100, and given that the average size of a category in the 20newsgroups corpus is ≈ 390 documents, such a result is not even theoretically possible. This is so because even in the case where the classifying system would retrieve only relevant documents (i.e. precision would be 100%), the maximal attainable recall would still be just 100/390 ≈ 25.6%. Both authors were contacted by mail with a request to rectify a possible misunderstanding. Unfortunately, none of them replied.

16.6 conclusion

Results hereby presented indicate that supervised localization of constellations of semantic prototypes can significantly increase the accuracy of classifiers which use such constellations as retrieval filters. Given that the localization of such constellations is governed by the training corpus, but the increase is also significant when one confronts the system with a previously unseen testing corpus, we are allowed to state that our algorithm is capable of generalization. This was principally attained by a combination of the following ideas:

1. projection of documents into a low-dimensional binary space
2. definition of the fitness of a prototype in terms of distances to documents of its own category, as well as distances to documents of other categories
3. search for the fittest prototype constellations
4. use of the most fit prototype constellation as a sort of retrieval filter

In spite of its generalizing and thus "machine learning" capabilities, our algorithm is essentially a non-connectionist one. Thus, instead of introducing synapses between neurons, or speaking about edges between nodes of a graph - briefly, instead of speaking about deep learning of multi-layer encoders of stacks of Restricted Boltzmann Machines fine-tuned by back-propagation, as (Salakhutdinov and Hinton, 2009) do - we have found it preferable to reason in geometric and evolutionary terms. It is indeed due to this "geometric" perspective that the computational complexity of the algorithm is fairly low: ∆|D||C| for the evaluation of the fitness of one individual prototype constellation. In a future study, we aim to explore the performance of a slightly modified fitness function whose complexity ∆|D| + |C|² could be of particular interest in cases of huge data-sets (i.e. big |D|) with a fairly limited number of classes (|C|).

In practical terms, it is also advantageous that both the fitness function evaluation as well as the final retrieval assess distances in terms of the binary hamming distance measure. In both cases, one can use basic logical operations like XOR plus some basic assembler instructions to execute this sort of "conceptual geometry" with particular swiftness and ease.
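For illustration, PERL itself exposes this kind of operation: the unpack "%32b*" checksum template sums the bits of a string, so the XOR-plus-popcount computation of a hamming distance between two packed 128-bit hashes can be written as follows (a standard idiom, not the dissertation's own routine):

use strict;
use warnings;

# two 128-bit hashes packed into 16 bytes each
my $h1 = pack 'B128', '0110' x 32;
my $h2 = pack 'B128', '0011' x 32;

# Hamming distance = popcount(h1 XOR h2);
# '%32b*' sums all bits of the XORed string into a 32-bit checksum
my $xor     = $h1 ^ $h2;
my $hamming = unpack '%32b*', $xor;

print "Hd(h1, h2) = $hamming\n";   # prints 64 for these two hashes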
Given these properties, plus the fact that the hashes which are manipulated are fairly small (in one gigabyte of memory, one can store 16-byte hashes for more than 60 million documents), one can easily predict the existence of a future application-specific integrated circuit (ASIC) potentially executing billions of query2document comparisons per second.

Computational aspects aside, our primary motive in developing the algorithm hereby proposed was to furnish a sort of cognitively plausible (Hromada, 2014b) "experimental proof" for our doctoral Thesis which postulates that a sort of evolutionary process exists not only in the realm of biological species, but also in realms populated by "species" of a completely different kind. Id est, in realms of linguistic structures and categories, in realms of word meanings, concepts and, who knows, maybe even in the realm of mind itself. Being uncertain about whether we have demonstrated, with sufficient clarity, that it is reasonable to postulate not only neural (Edelman, 1987), but also intramental evolutionary processes, we conclude by saying that the formula hereby introduced offers a simple yet reasonably accurate method of solving the problem of multi-class categorization of texts.

16.7 generic conclusion

Speaking less concretely, this article shows that a model implementing evolutionary search within a certain type of vector space can bring practically applicable results. Given that the results obtained with training data are to a non-negligible extent transposable to testing data, one can consider such a model to instantiate a particular case of machine learning (P+125-130). The training data-set is labeled and the labels are exploited to direct the evolutionary search: hence, the algorithm can be understood as a supervised one.

Concretely speaking, this article shows how one can perform multi-class (N=20) classification of textual documents. Hence, newsgroup postings were considered as the entities which are to be classified, and occurrence frequencies of words contained within the postings were used as the features by means of which these entities are characterized.

And speaking less concretely again, this chapter indicates that evolutionary computation can provide the means to identify constellations of regions in a semantic space which roughly correspond to constellations of semantic categories³. Ideally, the process converges to a state where correct category labels are attributed to correct regions with correct extension. It is in this sense that the approach hereby introduced can be, mutatis mutandis, understood as a potential model of vocabulary development within an individual child. This is so because the aim of vocabulary ontogeny is analogical: one aspires to attribute correct phonic representations ("words", "signifiers", "labels") to correct regions of the conceptual space. As has been observed by other researchers (P+173) or illustrated by the Borgesian Ding-Dong Mystery (P+177-179), such a process of attributing appropriate handles to appropriate vessels is far from being a monotonic descent to the most optimal state. Rather, the process of acquisition of vocabulary is full of periods where a category is either too exhaustive or too specific, full of small adjustments, detours and returns. It is in this sense that the conjecture «learning of words is an evolutionary process» should be interpreted, and it is in this sense that the aspirations of the algorithm hereby introduced are to be understood.
3 Note that we use the terms "semantic category", "semantic class" and "concept" as synonyms.

17 EVOLUTIONARY INDUCTION OF A LIGHTWEIGHT MORPHOSEMANTIC CLASSIFIER

17.1 generic introduction

The aim of the previous chapter was to show that one can use evolutionary computation to induce sufficiently pertinent semantic categories from a corpus of text documents. Individual text documents were understood as "entities", words present within such documents were understood as their "features" and the topics¹ to which diverse documents were attributed were understood as "semantic categories". Analogies between such a process of induction of semantic categories and the process of "vocabulary development", occurring in practically every human being from birth until death, have also been drawn.

In this chapter we shall explore evolutionary models of induction of yet another type of categories which also play a non-negligible role in human linguistic communication. Id est, induction of grammatical categories. And given that the commonly used definition of a grammatical category (GC) as a grouping of language units sharing some common feature or function is very general and vague, this chapter shall focus on a particular type of GCs, that of "parts-of-speech" (e.g. "nouns", "verbs", "adjectives" etc.). There are three main "technical" reasons which motivate this choice:

• part-of-speech induction (POS-i, P+135-136) and POS-tagging are well-known NLP problems
• in spite of being well-known, relatively few researchers have proposed evolutionary means to solve these problems (P+137-139)
• certain transcripts within the CHILDES (P+196-222) corpus are tagged with POS-labels

and it is the 3rd reason which is to be understood as the most decisive one in regards to the "psycho-linguistic" aims of this dissertation. But the ultimate reason for which we have opted to focus on part-of-speech categories is a theoretical one: part-of-speech categories tend to integrate a word's semantic content with its grammatical function.

In other terms, the very information that "X belongs to the category of nouns" informs the one who already disposes of a certain notion of what a noun is that X most probably denotes a thing or a state. And the very information that "Y is a member of the category of verbs" suggests that Y most probably denotes a process or an activity. In this regard, the membership of the word W in the category C is an irreplaceable clue not only to W's function and position in the enveloping utterance, but also to W's meaning. This is maybe not so important when the meaning of W is already known, but in the case of a language-learning toddler, the ability to recognize that W ∈ C could significantly reduce her difficulties in solving the problem "to which components in a recently perceived scene should a novel W be associated?". Simply stated: POS-categories can help the child to bootstrap (Karmiloff and Karmiloff-Smith, 2009, pp.111-118) herself into the language.

But how does a child construct such categories in the first place? The aim of the article hereby introduced, recently submitted to the journal Computational Linguistics (Hromada, 2016c), is to propose an evolutionary answer.

1 Note the congruence between the fact that the word "topic" is derived from the Greek τόπος, which means "place", and the fact that in computational semantics a topic is literally understood as a "place" within the semantic space.
17.2 introduction

What is the essence of linguistic categories, how are such categories represented in the human mind and how do such representations develop? These are questions which have intrigued linguists and philosophers since time immemorial, questions of such elusive nature that any proposal aspiring to answer them has to be, per definitionem, only partial and incomplete. Such epistemological problems notwithstanding, contemporary computer science tends to offer an instructive answer: categories are classes, and classes can be operationalized as regions within a ∆-dimensional vector space S∆. Under such a definition, the training of a categorizing system (i.e. a "classifier") can be simulated as a search for the most accurate partitioning of S∆. This holds for categories in general and hence it also holds for linguistic categories in particular.

One possible way how such partitioning can be performed is offered by so-called Support Vector Machines (SVM, Cortes and Vapnik (1995)). The basic idea behind SVMs is simple: the algorithm aims to find a hyper-plane (also called a "decision boundary") which cuts the vector space into two sub-spaces, each of which shall ideally contain only data-points attributed to one class. But not only that: given that many such decision boundaries are often possible and identifiable, an SVM tends to identify the one which maximizes the gap (i.e. margin) between the data-points themselves and the boundary. The motivation behind such a choice is simple: the more the margin is maximized in regards to objects extracted from the training data-set, the more it can be expected that objects extracted from a previously unseen "testing data-set" shall also be projected onto the correct side of the boundary. And very often they indeed are: SVMs are able to generalize.

17.2.1 from planes to prototypes

In spite of their theoretical elegance, SVMs - as well as their neural network "perceptron" counterparts - have one important drawback. That is: SVMs and perceptrons look for a "plane" which cuts the space into partitions. But as is illustrated by Figure 31, data-to-be-classified is very often not "linearly separable": a linear decision boundary is nowhere to be found (Minsky and Papert, 1969). In SVM practice, the problem is often solved by applying a certain "kernel function" (Hofmann et al., 2008) which projects the initial data-set onto a space of higher dimensionality where - if the kernel was well chosen - the data can be separated. While kernel functions have other pleasing mathematical properties², they are highly abstract and of significant «mathematical slant» (Hofmann et al., 2008). This, we believe, makes it almost impossible that kernel-based models could ever be labeled as "cognitively plausible" (Hromada, 2014b). In other terms: it is highly improbable that the human cognitive and neurolinguistic system would implement as mathematically precise, pure and fragile a machinery as kernels definitely are.

In this article we shall argue that it is to a great extent possible to bypass the problem of "linear separability". This is to be attained by focusing one's attention on neighborhoods of points PA, PB, ..., PX supposedly representing categories A, B, ..., X instead of focusing it on linear boundaries BAB, BAX, BBX, ... which supposedly represent the distinction between A and B, A and X, etc. Hence, categories are to be defined in terms of their prototypes (Rosch and Mervis, 1975; Hromada, 2015).

2 The most prominent of which is related to the so-called "kernel trick" which can significantly speed up the classifier-training process.
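The resulting classification rule is almost trivially simple, as the following hedged PERL sketch illustrates (the prototype bit strings are invented for the sake of the example):

use strict;
use warnings;

sub hd {                        # Hamming distance of '0'/'1' strings
    my $x = $_[0] ^ $_[1];
    return $x =~ tr/\0//c;
}

# one 64-bit prototype per category (values purely illustrative)
my %prototype = (
    ACTION    => '0110' x 16,
    SUBSTANCE => '1010' x 16,
    PROPERTY  => '0001' x 16,
);

# a data-point is simply assigned to the category of its nearest prototype
sub classify {
    my ($hash) = @_;
    my ($best) = sort {
        hd($prototype{$a}, $hash) <=> hd($prototype{$b}, $hash)
    } keys %prototype;
    return $best;
}

print classify('1010' x 16), "\n";   # prints SUBSTANCE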
Prototypes themselves are points in S tending to satisfy the following condition: a point PC can be understood as an optimal prototype of a category C if and only if all data-points attributed to C are closer to PC than to any other prototype (PX, PY) simultaneously represented within the system.

In spite of its surface simplicity, the problem posed by this definition of "the optimal prototype" is not an easy one to tackle in a multi-class scenario: the constraint closer than any other simultaneously represented prototype substantially complicates the case. If this constraint weren't present, the problem of identification of "optimal prototypes" would be trivial: the prototype would simply be the centroid of all members of C. But the condition "closer than any other prototype" makes all components of the system mutually dependent on each other. In the end, one faces a problem somewhat analogous to the famous three-body problem in physics. That is, a problem of which it is well known that it is unsolvable by analytic means (Poincaré and Magini, 1899).

17.2.2 from prototypes to constellations

This article aims to demonstrate that the problem of discovery of constellations of optimal prototypes can be approximated by a nature-inspired non-connectionist method. In other terms, we shall use a relatively simple evolutionary algorithm in order to "induce" constellations of prototypes which are closer to the training data-points to which they should be close and further from the training data-points from which they should be far. Thus, an individual solution contains a position of each component prototype. Every individual has a genome of length |C|∆, whereby |C| denotes the number of distinct classes and ∆ is the dimensionality of the space within which the search is performed. As is common in evolutionary algorithms (EAs), these individual solutions are subjected to processes of replication, selection and variation across multiple generations. Notions of "far" and "close" are implemented directly in the fitness function so that the evolutionary search minimizes the number of incorrectly positioned "nearest prototypes". Ideally - id est, if the EA's parameters have been correctly specified and iff the problem of prototype constellation is optimizable at all - the system should converge to such a constellation of prototypes which could accurately classify both testing and training data.

17.2.3 from constellations to lightweight classifiers

Note that if EAs could discover and optimize such constellations, then these constellations would yield truly "lightweight" classifiers: a solution to the |C|-class classification problem of objects in ∆-dimensional space has length |C| ∗ ∆. To be even more radical, let us specify that the search shall operate within binary ∆ = 64 spaces, which means that the position of every data-point, as well as of every candidate prototype, can be defined by exactly 8 bytes. The 5-class classifiers presented in the next sections are thus described by no more and no less than 5 ∗ 8 = 40 bytes. Another reason why these classifiers can be considered as "lightweight" is the nature of the features used to project diverse textual tokens into such 64-dimensional Hamming spaces. Being aware of results issued from our previous empiric simulations (Hromada, 2014a), we have decided to use three features only, i.e.
• the suffix of the word W (i.e. the last three characters of the word-to-be-categorized)
• the suffix of the word WLeft (i.e. the word immediately preceding W)
• the suffix of the word WRight (i.e. the word immediately following W)

in order to transform tokens into geometric entities. No other feature has been used during the geometrization phase of the algorithm. All this in order to propose a nature-inspired model of induction of part-of-speech categories which is, we believe, at least as "minimalist" as Chomsky's "minimalist" program (Chomsky, 1995).

17.3 method

The algorithm presented in this article is very similar to the one presented in (Hromada, 2015). The procedure starts with the characterization of training-corpus entities (i.e. "words") in terms of their features (i.e. the "suffixes" of W, WL and WR). These features are subsequently used to project all entities into a 64-dimensional Euclidean space SE(64): this component is known as Random Indexing (Sahlgren, 2005). In the following steps, the whole "space" is reflected so that entities and features "implicitly connected" in the original corpus are pushed closer to each other than entities and features which are not so connected: this component is known as Reflective Random Indexing (Cohen et al., 2010). At last but not least, all vectors are "binarized" by a simple binary thresholding procedure known as Light Stochastic Binarization (Hromada, 2014c). All these steps yield a binary Hamming space SH(64). Once SH(64) is constructed, one can proceed to the localization of the most optimal constellations of category prototypes. This is done by a fairly standard evolutionary algorithm (EA) which is more closely described in 17.3.2. The most fit solutions obtained after a certain number of generations are subsequently confronted with data extracted from the testing corpus in order to assess the EA's capability beyond the training set.

17.3.1 corpus

This article is conceived as a part of a dissertation addressing the possibility of developing evolutionary models of induction of linguistic categories in (and by) human children. This makes the choice of the corpus quite straightforward: the corpus from which we shall aim to extract first linguistic categories is to be contained in the Child Language Data Exchange System (CHILDES, (MacWhinney and Snow, 1985)). However, not all among the 30 thousand transcripts contained in CHILDES (Hromada, 2016e) contain part-of-speech labels. The quality of the labels also varies: this is no surprise given that some transcripts were manually labeled and/or corrected by multiple annotators while other transcripts have been labeled only by automatic NLP tools (Sagae et al., 2007). For this reason we have ultimately focused our interest on one particular corpus: Brown's (Brown, 1973) transcriptions of verbal interactions of a girl named Eve. Primo, because Brown's work is seminal for the whole discipline of developmental psycho-linguistics. Secundo, because it is indeed the Eve section of Brown's corpus whose POS-labels have been, according to (Sagae et al., 2007), manually corrected by human annotators.

Classes

According to (Sagae et al., 2007), each token of the CHILDES corpus is labeled with one among 31 part-of-speech tags. However, the majority of these tags are used only very rarely and/or denote such categories (e.g. AUX for auxiliaries, REL for relativizers or CONJ for conjunctions) of words which encode only a little amount of semantic or deontic information.
It is certain that mastery of words belonging to categories like AUX, REL or CONJ plays an important role in the development of full-fledged adult-like competence. But given that an objective of our dissertation was to elucidate how evolutionary computation can simulate the bootstrapping of morphosyntactic categories from semantics (and vice versa), we have decided to focus on the induction of five classes only. These are enumerated in Table 41.

Class      | CHILDES POS tags | Example words
ACTION     | v, part, cop     | "think", "saying", "is"
SUBSTANCE  | n                | "cookies", "cow", "ball"
PROPERTY   | adj, qn          | "better", "blue", "three"
RELATION   | prep             | "on", "with", "to"
REFERENCE  | pro, det, art    | "I", "you", "this", "the"

Table 41: Five classes of interest, their corresponding CHILDES part-of-speech tags and some example word types which instantiate them.

What is common to these classes is that their member words very often denote visible and tangible entities, states and processes. Id est, when a child hears these words it can be the case that she also perceives their referents by other senses. Classification of words labeled with tags OTHER than "v", "part", "cop", "n", "adj", "prep", "pro", "art", "det", "qn" has been excluded from the following analysis. Primo,

• because such words do, more often than not, lack easily recognizable visual semantic contents and should not thus be mixed with words which encode such contents

secundo,

• because in the ontogeny of a normal child, items belonging to such more abstract classes are mastered later (i.e. after the "toddlerese" (P+17) stage) than words denoting concepts subsumed under the five classes listed in Table 41 (Tomasello, 2009)

tertio,

• because the problem of classification of words into 5 classes is, of course, less computationally complex and hence more tractable than the problem of classification into 31 classes

and finally,

• because it is far from certain whether categories like "auxiliaries" or "relativizers" are represented per se within the minds of normal verbally communicating humans, or whether such categories are simply abstractions developed by linguists for their own purposes.

All these arguments taken together made us renounce our tentatives to train a 31-class POS-classifier and made us focus on the training of a 5-class classifier only.

Pre-processing

10443 "motherese" utterances have been extracted from twenty transcripts of Brown's Eve corpus. These are very easy to detect because in CHILDES, every utterance is on a separate line and begins with a trigram denoting the locutor of the utterance (in the case of mothers, the trigram is MOT). The 10443 lines which follow these "motherese" utterances and begin with the marker %mor have been also extracted: these are the lines which contain the manually annotated POS-labels. Thus, 10443 line-couplets like this:

Listing 10: Motherese utterance from CHILDES corpus + associated morphological tier.
eve05.cha:*MOT: that s a duck .
eve05.cha-%mor: pro:dem|that cop|be&3S art|a n|duck .

have been obtained by executing a simple shell command³. Lines beginning with MOT and %mor have been subsequently merged by a PERL script enrich_pos.pl⁴, which yields the output exemplified by Listing 11, the primary data format of this simulation: each token is characterized on a separate line along with the utterance in which it occurred, as well as with its "gold standard" class-label which was attributed to it by the manual annotators.
3 cd Brown/Eve; grep -A3 -P '^MOT' *|grep -P '(MOT|%mor)'
4 Publicly available at URL http://wizzion.com/thesis/simulation2/enrich_pos.perl

Listing 11: Primary input format of this simulation.
that###REFERENCE###train###that s a duck .
s###ACTION###train###that s a duck .
a###QUANTIFIER###train###that s a duck .
duck###SUBSTANCE###train###that s a duck .

Individual columns are separated by the ### separator. The first column denotes the entity itself (the word token), the second column contains its class, the third column specifies whether the token occurred in the training or testing part of the corpus and the last column contains the whole context within which the token entity occurred (i.e. the enveloping utterance). Let us specify that the training corpus was extracted from the first 12 Eve transcripts (i.e. files eve01.cha - eve12.cha) which describe verbal interactions which occurred before Eve attained 2 years of age. The testing corpus, on the other hand, was composed of 8 files (eve13.cha - eve20.cha) transcribed down as Eve was 2 - 2.21 years old. The script enrich_pos.pl thus outputs 12453 training corpus tokens and 8746 testing corpus tokens instantiating 972 (training) and 934 (testing) word types. Almost one half (449) of the word types occurring in the testing corpus do not occur in the training corpus.

17.3.2 algorithm

This is the core of the model. It consists of two major components:

1. "vector space preparation" (VSP): a trivial suffix-extracting filter is used in order to project text from the primary input onto a 64-dimensional Hamming space
2. "evolutionary optimization": searches SH64 for the most discriminative constellations of prototypes

Vector Space Preparation

The approach which was used to "geometrize" the primary textual input shares its essential features with that of Random Indexing (Sahlgren, 2005) as well as with other Vector Symbolic Architectures (Cohen et al., 2012) based on so-called Random Projection (Hromada, 2013). We describe it elsewhere as follows: «Given the set of N objects which can be described in terms of F features, to which one initially associates a randomly generated d-dimensional vector, one can obtain a d-dimensional vectorial representation of any object X by summing up the vectors associated to all features F1, F2 observable within X. Initial feature vectors are generated in a way that out of d elements of the vector, only S among them are set to either -1 or 1 value. Other values contain zero. Since the "seed" parameter S is much smaller than the total number of elements in the vector (d), i.e. S « d, initial feature vectors are very sparse, containing mostly zeroes, with occasional values of -1 or 1.» (Hromada, 2014c).

Section 17.2.3 has already indicated the nature of the features which we shall use to initiate the process of geometrization of textual input. We reiterate: we shall characterize every token T with three principal features only:

1. T's own suffix⁵
2. the suffix of the token to T's right
3. the suffix of the token to T's left

5 What we label as the suffix SFXT of token T is, for the purpose of this text, equivalent to T's terminal character trigram (i.e. T's last three letters).

Asides this, only two other "lateral features" are used: token T has the feature INIT if it is the initial (i.e. first) token of the utterance. Conversely, it is endowed with the feature END if it is the last (i.e. terminal) token of the enveloping utterance. These 3 principal and/or two lateral features are extracted - during the initial phase of VSP - by the following feature-extracting snippet.
Listing 12: PERL code of the suffix-feature extractor

sub suffix3_featurefilter {
    my @f;
    my @wrdz = split / /, shift;   # utterance in 1st parameter
    my $nam  = shift;              # token of focus in the 2nd
    my ($index) = grep { $wrdz[$_] eq $nam } 0 .. $#wrdz;
    $index += 1;
    my $pos = 1;
    for my $w (@wrdz) {
        my $lw = lc $w;
        my $s  = substr $lw, -3;   # terminal character trigram
        my $n  = $index - $pos;
        $n = $n * -1;              # features with minus to the left
        push @f, $n.$s if (abs($n) < 2);   # main 3 features
        $pos++;
    }
    push @f, "INIT" if $index == 1;              # lateral feature
    push @f, "END"  if $index == scalar(@wrdz);  # lateral feature
    return @f;
}

For example, when the Random Indexing procedure makes the following call:

suffix3_featurefilter("that s a duck", "that")

it returns three features characterizing this concrete occurrence (i.e. token) of the word "that":

INIT 0hat 1s

Accordingly, the features −1hat, 0s, 1a would be used to characterize this instance of the token s, and the features −1a, 0uck, END would characterize this instance of duck. This is the last level of representation which can still be understood as "symbolic". Subsequently, Random Indexing associates a random, sparsely non-zero init vector to each distinct feature (e.g. INIT, END, 0hat, −1hat, 1s, 0s, −1s, −1a, 1a, 0a, 1uck, 0uck etc.) present in any motherese utterance of the Brown/Eve corpus. All in all, the presence of 1321 distinct features has been assessed in the training corpus.

Once features are extracted, things go geometric. Vector representations for individual tokens are obtained as sums of the vector representations of their associated features. Subsequently, the initial random feature vectors are discarded and the features themselves are characterized as sums of the vector representations of their associated tokens. This step marks the first "reflective" iteration of the process called Reflective Random Indexing (RRI). C.f. Cohen et al. (2010) for a closer description of how and why RRI works. For the purpose of this article, let us just note that the introduction of 2, at most 3, "reflective iterations" practically always improves the results of one's experiment. This is, in a sense, quite expected: for what the reflective process does is not only enrich the representations of entities (e.g. tokens, documents) with information about their features (suffixes, resp. word occurrences), but also enrich the representations of features with information about the entities within which they occur. For example, not only should the word thinking be characterized by the feature "ends with the suffix ing", but, conversely, the feature "ing" is in part characterized by its occurrence in the word thinking.

Note that all vectors produced by RI and RRI are euclidean. After every "reflection", vectors are normalized to unit length. After the last such reflection, each real-number element of each vector is transformed into a Boolean value by the binary thresholding process known as Light Stochastic Binarization (Hromada, 2014c). Such binarization is the last step of the vector space preparation. At its end, one obtains a binary vector "hash" tending to have a property common to other convergent⁶ hashing methods (Datar et al., 2004; Salakhutdinov and Hinton, 2009): similar inputs tend to have similar hashes.

6 A hashing function FH is said to be convergent if similarity between its inputs implies similarity of its outputs. On the other hand, FH is said to be "divergent" if similarity between inputs does not imply similarity between output hashes. Being of strongly divergent nature, functions like SHA2 or MD5 are not to be confounded with the convergent hashing which we discuss here.
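To give the reader a feel for the whole pipeline, the following toy PERL sketch condenses the Random Indexing summation and a naive sign-threshold binarization into a few lines. Note that the thresholding shown here is a deliberate simplification of Light Stochastic Binarization, and all names and parameters are merely illustrative:

use strict;
use warnings;

my $DIM  = 64;   # dimensionality Delta
my $SEED = 3;    # number of non-zero elements per init vector

# a sparse random init vector: mostly zeroes, $SEED elements set to +/-1
sub init_vector {
    my @v = (0) x $DIM;
    $v[ int rand $DIM ] = ( rand() < 0.5 ? -1 : 1 ) for 1 .. $SEED;
    return \@v;
}

my %feature_vec;   # every distinct feature keeps one init vector

# token vector = sum of the init vectors of its features; then binarize
sub token_hash {
    my (@features) = @_;
    my @sum = (0) x $DIM;
    for my $f (@features) {
        $feature_vec{$f} ||= init_vector();
        $sum[$_] += $feature_vec{$f}[$_] for 0 .. $DIM - 1;
    }
    return join '', map { $_ > 0 ? 1 : 0 } @sum;   # naive sign threshold
}

# tokens sharing features (here INIT and 0hat) obtain overlapping hashes
my $h1 = token_hash(qw(INIT 0hat 1s));
my $h2 = token_hash(qw(INIT 0hat 1a));
print "$h1\n$h2\n";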
The moment of attribution of a binary hash to each token occurring in the corpus marks the end of the "vector space preparation" phase of the algorithm. In the current model, this VSP occurs only once - at the beginning of the simulation - and is not repeated.

Evolutionary Optimization

The ensemble of all binary hashes obtained from the corpus yields a hamming space SH with fairly low dimensionality. This is technically very advantageous since measuring distances can be very swift in such spaces: calculating the hamming distance between two binary strings is definitely⁷ less costly than calculating a distance between two real (or even complex) vectors. The fact that we can measure distances swiftly is crucial for our evolutionary approach, for the measurement of distances constitutes the very core of the fitness function which is to be evaluated for every individual member of every single generation of every single run of the simulation. This is exemplified by the following snippet of PERL pseudo-code.

7 Or at least on an ordinary transistor-based 21st century Turing machine.

Listing 13: PERL pseudocode of the prototype-inducing fitness function

for @individual (@population) {
    my $fitness = 0;
    for $training_token (@training_tokens) {
        $training_token_hash  = $hashes{$training_token};
        $training_token_class = $correct_classes{$training_token};
        $true_prototype_distance = hamming_weight(
            $training_token_hash XOR $individual[$training_token_class]);
        for $incorrect_prototype ($incorrect_classes{$training_token}) {
            # the innermost cycle
            $fitness-- if (hamming_weight(
                $training_token_hash XOR $individual[$incorrect_prototype])
                <= $true_prototype_distance);
        }
    }
}

As may be seen, the innermost cycle of the fitness function evaluation contains three operations:

1. XOR between the vector of the training object ~o and the vector of a "false" prototype ~pF: this yields a new vector with true values on those positions where the elements of the input vectors differ

2. calculation of the hamming weight (i.e. the number of non-zero bits) of XOR's result⁸: this is equivalent to the hamming distance Hd(~o, ~pF)

3. penalization (decrementation of the fitness value) for every incorrect prototype ~pF which is not further from ~o than ~o's true prototype ~pT, i.e.

Hd(~o, ~pF) <= Hd(~o, ~pT)    (7)

8 An assembler routine for hamming_weight calculation exploiting the POPCNT instruction implemented (on the hardware level) in SSE4.2-compliant CPUs (Suciu et al., 2011) is accessible at URL http://wizzion.com/thesis/simulation2/popcount.asm

This concrete instance of the prototype-inducing fitness function can be further elucidated by the formula

Fobject(~i, ~o) = |PF| whereby PF = { ~px : ~px ≠ ~pT ∧ Hd(~o, ~px) <= Hd(~o, ~pT) }    (8)

which defines the object-wise fitness Fobject(~i, ~o) of an individual solution ~i, in regards to the vector representation of a training object ~o, as the number (i.e. the cardinality of the set) of "false" prototypes PF which are not further from ~o than ~o's corresponding (i.e. "true") prototype ~pT. Subsequently, the overall fitness of the individual chromosome ~i, in regards to each and every object occurring in a training corpus T, is the sum

Ftotal(~i) = − Σo∈T Fobject(~i, ~o)    (9)

The sum is inverted so that the whole function is a maximization one.
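A small worked example (with invented numbers) may help: suppose |C| = 3 and consider a training object ~o whose true prototype lies at Hd(~o, ~pT) = 5, while the two false prototypes lie at Hd(~o, ~p1) = 3 and Hd(~o, ~p2) = 7. According to equation 8, ~p1 satisfies the penalization condition while ~p2 does not, hence PF = {~p1} and Fobject(~i, ~o) = 1; the object ~o thus decreases Ftotal(~i) by exactly one.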
Under such a definition, the maximum fitness value is 0 and corresponds to the situation where all training corpus objects are closer to their true prototypes than to any other prototype. In theory, it may be the case that multiple global optima of such kind exist. In practice, and in the case of many vector spaces, such global optima may not exist at all and the fitness of any locally optimal state will have a negative value.

The fitness function thus defined, the form of representation of individual solutions is quite straightforward: an individual solution ~i encodes a constellation of all candidate prototypes of |C| categories. This means that, in regards to every single object ~o present in the training corpus T, ~i shall encode not only the "true" prototype ~pT associated to ~o by the training corpus. It shall also encode all prototypes which are not ~o's true prototypes and which - if ever located closer to ~o than ~pT - should be evaluated as members of the set of "false positives" PF. In practice, an individual solution ~i is represented as a vector or an ordered tuple which concatenates all its components. The number of possible distinct individuals is 2^(∆∗|C|), where ∆ is the dimensionality of the space and |C| is the number of classes. Since in our simulations we have focused on the partitioning of a 64-dimensional space into five classes (|C|=5), there exist potentially 2^(64∗5) = 2^320 constellations. The fitness landscape is thus finite, but its complete traversal seems to be impossible to execute in a reasonable amount of time⁹.

9 At least on clusters of ordinary transistor-based 21st century Turing machines.

Two evolutionary heuristics have been deployed in order to explore the landscape:

1. CANONIC: a heuristic strongly reminiscent of Canonical Genetic Algorithms (Goldberg, 1990)
2. MERGE1: an extension to CANONIC which merges independent runs of CANONIC into one big population and continues the evolution further

In both approaches, every generation starts with fitness evaluation for all individuals in the population. Subsequently, a so-called 2-way tournament selection operator (Sekaj, 2005) selects the members of the mating pool. The size of the mating pool equals the size of the population. Members of the new generation are obtained from the mating pool as follows: two parent (mother and father) individuals are randomly chosen from the mating pool in order to be subsequently "cut" at a randomly chosen point. The segment before the cut is taken from the mother, the segment after the cut is taken from the father, and a new offspring is obtained. Any gene of the offspring's genome can be mutated with 0.2% probability: a mutation is equivalent to the flipping of a bit. Elitism is not implemented and even the most fit individual can be subjected to decay.

There are thus only two aspects in which CANONIC and MERGE1 differ. One difference is the population size: in CANONIC, populations are fairly small (100 individuals) while MERGE1 implements somewhat bigger ones (1000 individuals). Both heuristics also differ in the way their initial populations are generated. In CGAs one departs ex nihilo, and the CANONIC heuristic is no exception to this rule: genes present in the gene pool of generation 0 are randomly generated.
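The selection and variation operators common to both heuristics can be sketched in a few lines of PERL (a hedged illustration under the parameters just described; the fitness table and all names are assumed):

use strict;
use warnings;

# 2-way tournament: pick two random individuals, keep the fitter one
sub tournament {
    my ($pop, $fitness_of) = @_;
    my ($a, $b) = map { $pop->[ int rand @$pop ] } 1 .. 2;
    return $fitness_of->{$a} >= $fitness_of->{$b} ? $a : $b;
}

# one-point crossover followed by 0.2% per-gene mutation
sub offspring {
    my ($mother, $father) = @_;
    my $cut   = int rand length $mother;
    my $child = substr($mother, 0, $cut) . substr($father, $cut);
    for my $i (0 .. length($child) - 1) {
        substr($child, $i, 1) = 1 - substr($child, $i, 1)
            if rand() < 0.002;
    }
    return $child;
}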
Things are slightly different in the case of the MERGE1 heuristic: MERGE1 is initiated by the populations yielded by different runs¹⁰ of CANONIC after 200 generations. CANONIC and MERGE1 taken together can thus be understood as a very primitive form of "parallel genetic algorithm" (PGA) (Sekaj, 2004). Under this view, 100 independent runs of CANONIC can be understood as independent nodes on the lower level of the hierarchy and MERGE1 as the node of the higher level. A "migration" from all low-level nodes occurs after 200 generations. There follows a big tournament in which the initial MERGE1 population is constituted. Subsequently, MERGE1 evolves further.

Parameters

VSP:              Input corpus = Brown-Eve motherese¹¹, Feature Filter = suffix3, Dimensionality ∆ = 64, Seed S = 3, Reflections I = 3
CANONIC:          Population size N = 100, Selection = Tournament, Crossover = One-point, Mutation rate M = 0.2%, Initial population = ex nihilo, Generations G = 200, Elitism E = 0, Runs R = 100
MERGE1:           Population size N = 1000, Selection = Tournament, Crossover = One-point, Mutation rate M = 0.2%, Initial population = results of CANONIC, Generations G = 300, Runs R = 6
Machine Learning: Classes |C| = 5

Table 42: Parameters of simulation 2.

10 Note that one common "vector space preparation" phase preceded all CANONIC runs. Hence, in spite of the fact that diverse runs of CANONIC followed different evolutionary trajectories, they always did so in the space S64 explored by the other runs as well. This makes it possible to "merge" the results of different runs.
11 Available at http://wizzion.com/thesis/simulation2/eve12-8-5classes.mot

17.3.3 evaluation

The accuracy of induced classifiers was primarily evaluated in terms of the quantity of correctly predicted category labels (i.e. true positives). Hence, a maximum score of 100% would correspond to the situation when all objects have been successfully classified. On the contrary, a classifier attributing category membership at random would have a precision of cca. 20% in the case of classification into 5 equidistributed classes. The overall classification accuracy of classifiers induced by the CANONIC and MERGE1 heuristics has been evaluated after every 10 generations of the training process. Asides this, each class has been explored individually in order to yield class-specific precision and recall values.

Three other classification methods have been evaluated in order to compare the evolutionary method with non-evolutionary approaches:

• CENTROID_HAMMING and CENTROID_EUCLIDEAN baselines
• MSVM (i.e. a Multi-class Support Vector Machine)

The two baseline approaches characterize every class by its centroid. In the CENTROID_HAMMING approach, the centroid CX of a category X is a hash obtained as an average of the hashes of all objects belonging to X. Things are similar in the case of CENTROID_EUCLIDEAN, the only difference being due to the fact that the elements of object and centroid vectors are now represented in their real-valued form. Id est, a representation issued from the last reflective iteration of the RRI component of the VSP phase of our algorithm.

At last but not least, the binary vector space issued from the VSP phase has been partitioned by means of an MSVM implemented in the open-source package MSVMPack (Lauer and Guermeur, 2011). Default settings of the package have been used: a linear kernel has been applied and the training of the MSVM2 (Guermeur and Monfrini, 2011) model has been stopped after converging to the 98% accuracy level.
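As we read it, the "averaging" of binary hashes used by the CENTROID_HAMMING baseline amounts to a bit-wise majority vote, which the following PERL sketch (our own illustrative reading, not the baseline's actual code) makes explicit:

use strict;
use warnings;

sub hamming_centroid {
    my (@hashes) = @_;          # '0'/'1' strings of equal length
    my $centroid = '';
    for my $i (0 .. length($hashes[0]) - 1) {
        my $ones = grep { substr($_, $i, 1) eq '1' } @hashes;
        # the centroid bit is set iff the majority of members set it
        $centroid .= ( 2 * $ones > @hashes ) ? '1' : '0';
    }
    return $centroid;
}

print hamming_centroid('1100', '1010', '1001'), "\n";   # prints 1000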
17.4 discussion of results

Table 43 summarizes the main results of the five compared methods.

Method               | Training corpus | Testing corpus
CENTROID_HAMMING     | 455 (42.12%)    | 412 (40.47%)
CENTROID_EUCLIDEAN   | 572 (52.96%)    | 533 (52.35%)
MEAN(GA_CANONIC)     | 631 (58.44%)    | 589 (57.88%)
MEAN(GA_MERGE1)      | 718 (66.51%)    | 657 (64.57%)
FITTEST(GA_MERGE1)   | 772 (71.48%)    | 699 (68.66%)
MSVM2                | 781 (72.31%)    | 736 (72.30%)

Table 43: Overall results of five different approaches. GA results have been averaged across diverse runs (R = 6*100 for CANONIC, R = 6 for MERGE1).

The smallest amount of correctly classified tokens was attained by the baseline CENTROID approaches: this was expected since these approaches do not include any optimization at all¹². The observation that CENTROID_HAMMING is less precise than CENTROID_EUCLIDEAN is also trivial: the transformation of real-valued vectors into binary ones brings about a non-negligible information loss. The worse performance of binary-based classifiers is a result of this information loss.

12 Note, however, that the classification accuracy of these models is still significantly superior to a random classifier.

Optimization, however, can significantly reduce or even counteract the impact of such loss. Hence, even the fairly simple CANONIC genetic algorithm discovers, in just five sweeps through the hamming space, constellations of prototypes whose precision is higher than that of the Euclidean centroids. This is exemplified by Figure 30, which plots the evolution of precision across generations.

Figure 30: Evolutionary optimization increases the precision of a multi-class classifier. Curves represent results averaged across diverse runs (R = 6*100 for CANONIC, R = 6 for MERGE1).

It may be seen that the introduction of a PGA-like approach - as exemplified by MERGE1 - results in a significant increase in the amount of precisely classified tokens. The score is still not as high as that of MSVM2 (compare 781 with 718 for the training corpus, resp. 736 with 657 in the testing corpus), but the jump between CANONIC and MERGE1 suggests that another PGA architecture, the introduction of elitism, or a different choice of parameters or operators could potentially result in a significant boost.

Table 44: MSVM2 training corpus confusion matrix.

      ACT  SUB  PROP  REL  REF
ACT   266   54    0    0    1
SUB    55  495    4    0    0
PROP   21   66   18    0    0
REL    20   12    1    2    0
REF    15   47    3    0    0

Table 45: MSVM2 testing corpus confusion matrix.

      ACT  SUB  PROP  REL  REF
ACT   271   38    4    0    1
SUB    55  450    8    1    0
PROP   21   62   15    0    0
REL    20    6    3    0    0
REF    20   38    5    0    0

Table 46: Training corpus confusion matrix produced by FITTEST(GA_MERGE1).

      ACT  SUB  PROP  REL  REF
ACT   278   28    4    4    7
SUB    56  427   34   18   19
PROP   19   39   43    3    1
REL    15    5    1   11    3
REF     9   35    7    1   13

Table 47: Testing corpus confusion matrix produced by FITTEST(GA_MERGE1).

      ACT  SUB  PROP  REL  REF
ACT   269   21    9    6    9
SUB    62  371   41   25   15
PROP   16   35   40    3    4
REL    15    3    4    5    2
REF    11   26    8    4   14

As may be seen in the confusion matrices shown in Tables 44 and 45, MSVM2 fails to correctly classify any testing corpus token attributed to the minor REL and REF categories (i.e. recall = 0%) and the situation is not better in the case of the PROP class either (testing recall 15.3%)¹³. On the other hand, this handicap is counteracted by MSVM's higher recall rates in regards to the dominating SUB and ACT classes. This could potentially suggest that MSVM still tends to behave like a good old "dualist" Support Vector Machine rather than a truly multi-class classifier.
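The class-wise figures discussed below follow directly from these matrices; the following PERL sketch, taking the training matrix of Table 46 as its input, reproduces for instance the 86.6% recall and ≈73.7% precision of the ACTION class:

use strict;
use warnings;

my @classes = qw(ACT SUB PROP REL REF);
my @m = ( [ 278,  28,  4,  4,  7 ],    # rows: gold classes
          [  56, 427, 34, 18, 19 ],    # columns: predicted classes
          [  19,  39, 43,  3,  1 ],
          [  15,   5,  1, 11,  3 ],
          [   9,  35,  7,  1, 13 ] );

for my $c (0 .. $#classes) {
    my $tp   = $m[$c][$c];
    my $gold = 0; $gold += $m[$c][$_] for 0 .. $#classes;   # row sum
    my $pred = 0; $pred += $m[$_][$c] for 0 .. $#classes;   # column sum
    printf "%-4s recall %5.1f%%  precision %5.1f%%\n",
        $classes[$c],
        $gold ? 100 * $tp / $gold : 0,
        $pred ? 100 * $tp / $pred : 0;
}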
The confusion matrices in Tables 46 and 47 indicate that FITTEST(GA_MERGE1) also performs quite well when it comes to the classification of tokens into the major categories ACTION (86.6% recall; 73.74% precision) and SUBSTANCE (73.74% recall; 79.96% precision). Asides this, it also attains 40% testing recall for the PROPERTY class and 22% testing recall for the REFERENCE class. This suggests that even categories of minor importance play a certain role in models induced by the evolutionary search for prototype constellations.

13 These low recall rates imply that the average F1 score of MSVM is, in fact, inferior to that of FITTEST(GA_MERGE1). This is the case for both the training (F_MSVM = 0.481; F_FITTEST(MERGE1) = 0.518) as well as the testing (F_MSVM = 0.426; F_FITTEST(MERGE1) = 0.474) phase.

17.5 conclusions

17.5.1 computational conclusion

Figure 31: Centroidal tessellation of twelve data-points belonging to three distinct classes. Dots represent data-points, crosses are category prototypes and colors denote category membership. Black lines denote tessellation boundaries.

Figure 31 displays a potential training data-set composed of twelve data-points attributed to three distinct classes. One can observe that it is not possible to draw a single straight line which would separate all data-points of one class from the data-points of the other classes. Hence, these data-points are plainly not separable by a linear boundary: many a researcher would be tempted to say that in order to classify such a data-set, one would be obliged to apply a certain kernel and project it into a space of higher dimensionality. This is, however, not necessary if one applies a machine learning strategy which looks for constellations of points instead of lines, planes or hyper-planes. Denoted on Figure 31 by crosses of different colors, such points - labeled as "category prototypes" - satisfy one simple condition: every data-point is closer to its prototype than to any other prototype.

The search for constellations of prototypes which satisfy such a condition can thus be understood as a problem closely related to the problem of Voronoi-Dirichlet tessellations (Aurenhammer, 1991). But contrary to such approaches, where the "seed" or "generator" points are given in advance, the positions of such points are, in our approach, induced by means of evolutionary computation.

The inductive process described in this article took place in a 64-dimensional binary space. The reasons behind this choice were of a pragmatic nature:

1. optimization involving the calculation of Hamming distances can be very fast, especially when implemented on arrays of dedicated Field Programmable Gate Arrays (Sklyarov and Skliarova, 2014) or Application Specific Integrated Circuits
2. binary hashes are a very concise form of representation: our approach could thus be useful in Big Data scenarios¹⁴

14 In the case of 64-bit hashes, one could potentially need as little as 800 Megabytes of storage volume in order to store hash representations of 100 million documents.

These reasons aside, nothing forbids one to bypass the "binarization procedure" and search for constellations of prototypes in a Euclidean space. It can be expected that the precision of classifiers induced in a Euclidean space would be higher than the precision of classifiers induced in binary spaces. However, since there is no free lunch, such a euclidean search would undoubtedly be more demanding when it comes to the consumption of both memory and computational resources. Shortcomings related to the decision to execute the search in a binary space notwithstanding, the obtained results are quite encouraging.
Hence, in a scenario aiming to classify tokens occurring in the Brown-Eve section of the CHILDES corpus into 5 morphosemantic classes, classifiers induced by evolutionary optimization identified almost as many true positives as a multi-class SVM (Lauer and Guermeur, 2011). In terms of the F1-score obtained as a harmonic mean of average recall and average precision, the performance of the most fit prototype constellation FITTEST(GA_MERGE1) turned out to be even higher than that of MSVM2. This, however, is more a residuum of the F1-score metrics than a result which would merit to be reported elsewhere than in footnote 13.

17.5.2 psycholinguistic conclusion

Table 48 lists the tokens located in the closest neighborhoods of the three major prototypes which have been encoded in the constellation FITTEST(GA_MERGE1).

P_ACTION               P_SUBSTANCE                P_PROPERTY
H  TOKEN       POS     H  TOKEN          POS      H  TOKEN         POS
10 pointing    part    10 penny          noun     18 *whistle      noun
10 tripped     v       10 tummy          noun     20 *bent         v
11 slipped     v       11 cracker        noun     21 *graham       noun
11 squashing   part    11 graham+cracker noun     21 tough         adj
12 *building   noun    11 key            noun     21 alright       adj
12 burped      v       11 *matter        v        22 other         adj
12 cutting     v       11 paddle         noun     22 *pitcher      noun
12 dripping    part    12 letter         noun     22 *sweetheart   noun
12 *drinker    noun    12 *numbers       v        22 *fixed        v
12 mix         v       12 paper          noun     23 *a            art
13 *hit        n       12 snowman        noun     23 *cough        noun
13 hit         v       12 nap            noun     23 *fun          noun
13 playing     v       13 bx             noun     23 *grannie_hart noun
13 dropped     v       13 face           noun     23 worse         adj
13 are         v       13 maam           noun     23 *lemon        noun
13 saw         v       13 purple         noun     23 little        adj
13 standing    part    13 soup           noun     23 *through      prep
13 swim        v       13 stove          noun     24 all_gone      adj
13 want        v                                  24 good          adj
13 wiped       v                                  24 bigger        adj
                                                  24 busy          adj

Table 48: Testing corpus tokens closest to the prototypes of ACTION, SUBSTANCE and PROPERTY encoded in the FITTEST(GA_MERGE1) constellation: Hamming distance H(token, prototype) and the token's CHILDES part-of-speech annotation. False positives are marked by an asterisk.

A subsequent inspection of the false positives present in Table 48 turns out to be quite instructive. Hence, the token "building", present in the utterance "what are you building here?" on line 5417 of the eve05.cha transcript, is clearly not a noun, as the CHILDES annotators supposed, but rather a participle - and hence an instance belonging to the ACTION class, as correctly predicted by FITTEST(GA_MERGE1). Idem for "hit", present in the utterance "did you hit your head?" on line 4145 of the eve01.cha transcript: the token is clearly not a noun, as postulated by the CHILDES annotators, but, as predicted, a verb and hence a member of the ACTION class. And one can continue: the token "matter", annotated on lines 2152 and 5688 of the CHILDES corpus as a verb, is clearly not a verb but a noun - and hence a member of the class SUBSTANCE - because it twice occurs in the utterance "what's the matter?". And in spite of the fact that CHILDES labels the token "numbers" as a verb, it is definitely not a verb when it occurs in the utterance "the numbers are going around too" (eve15.cha, line 6276). Et caetera, et caetera. Thus, in spite of the fact that the POS tokens in the Brown/Eve section of the CHILDES corpus are supposedly «annotated with high accuracy» (Sagae et al., 2007), it is, sometimes, not really the case.
In this regard, one would be tempted to state that, as of 2016 AD, the frontier between developmental and computational psycholinguistics still resembles a structure standing on clay feet. This is a first conclusion which could be potentially useful to any (comp|dev) psycholinguists willing to undertake the path initiated by the study hereby introduced.

The fact that our approach has allowed us to identify errors in the corpus which even humans did not succeed in identifying is indeed encouraging. And it is even more encouraging when one realizes how simple the feature set was which was used to construct the vector space in which all subsequent classifications took place. We repeat: every token T was primarily characterized by:

1. T's three last characters
2. the three last characters of the token which precedes T
3. the three last characters of the token which follows T

Asides this, the only other information taken into account concerned T's potential position at the very beginning or end of the utterance. The reason for departing from such a restricted feature set has been in part empiric (Hromada, 2014a). But there exist other, more profound reasons why we have initiated the training of a verbally interacting computational agent with a focus on suffix-like features. Primo: the "less is more" hypothesis, whose implication for neural-network-based processing of natural language has been so beautifully demonstrated by Elman (1993). Secundo: note Slobin's operating principle A: «Pay attention to the ends of words.» (Slobin, 1973) which, according to its author, is a "general developmental universal". In this regard, our analysis does indeed demonstrate that "ends of words" offer features strong enough to initiate a supervised process of induction of categories which have been, for the purpose of this article, labeled as "morphosemantic". And that the whole process can yield fruit even when a 5-class classifier is encoded in a representation as concise as a 40-byte long vector definitely is.

17.6 generic discussion

This chapter has presented an algorithm which succeeds in correctly classifying a significant amount of tokens into so-called "morphosemantic classes" (MS-classes). But why should one speak about such MS-classes instead of staying faithful to the well-established term "parts of speech"? The answer is simple: because MS-classes are sometimes not equivalent to parts-of-speech categories. For example, the MS-category labeled as "ACTION" includes not only verbs, but also participles. The motivation behind this distinction is quite simple: it may potentially make sense for an expert linguist to state that "eating" functions as a participle but "to eat" as a verb. However, a modal toddler of 20 months shall most probably turn out to be ignorant of such a distinction (Tomasello, 2009). For what counts for such a toddler is the fact that she can associate both the word "eat" and the word "eating" with the fact of simultaneously observing a certain invariant structural property of her¹⁵ surrounding environment (i.e. she observes the activity of putting something into one's mouth). Table 41 introduced the five initial MS-classes¹⁶.
These MS-categories have been defined very loosely within the limited scope of this study: all substantives were defined as belonging to the class SUBSTANCE, diverse verbal, participial and infinitival forms as those instantiating ACTION, adjectives and numerals were collapsed into the MS-class PROPERTY, everything which had something to do with pointing, specification and deictics was subsumed under REFERENCE, and prepositions were taken to instantiate the notion of RELATION. Said in more practical terms: the introduction of the notion of the MS-class allowed us to enrich a certain section of the CHILDES corpus (i.e. Brown's 20 transcripts of a girl named Eve) (Brown, 1973) with a certain amount of loosely semantic information.

Loosely, because MS-classes, as used in this chapter, are loosely constructed themselves. For it is not always true that POS-substantives always denote substances and POS-verba always denote actions: no serious linguist could defend such a general view in more than one article and still stay unostracized by the linguistic community. Loosely, but in regards to the "motherese" addressed to a modal toddler (P+17), also semantic. For what is more vital for an 18-month-old child: to understand & express the difference between the verb "eat" and the participle/property "eating", or rather to understand & express the difference between the act of eating and the object being eaten?

We summarize: the act of making a notational turn from the concept of "parts-of-speech" to the notion of the "morphosemantic class" led to an enrichment of the CHILDES corpus with a few bits of semantic information. A few bits maybe, but still more bits than noise. The subsequent coupling of this information with the morphological information contained in suffixes, followed by optimization by means of an evolutionary algorithm, allowed us to converge to very concise, 40-byte long multi-class classifiers. These classifiers have subsequently resulted in the identification of errors produced by much more complex and - so the authors pretend - also «highly accurate» (Sagae et al., 2007) POS-tagging systems supposedly corrected by multiple human annotators. These considerations make us believe that the notion of a morphosemantic classifier could be of certain use and applicability for any present or future researcher aiming to deploy, develop or fine-tune certain nature-inspired yet cognitively plausible (Hromada, 2014b) models of ontogeny of linguistic categories¹⁷.

15 To stay consistent with Conceptual Foundations as well as with other books of the psycholinguistic tradition, we refer to toddlers and children with the feminine pronouns "she", "her" etc.
16 We leave to the reader's own ingenuity the exploration of the extent to which these MS-classes could correspond to Aristotle's categories, or Kant's and Piaget's "forms of pure reason".
17 Proof-of-concept source code of this simulation is freely available at URL http://wizzion.com/thesis/simulation2/ELLA.tgz under the mrGPL licence.

17.7 second simulation bibliography

18 EVOLUTIONARY INDUCTION OF 4-SCHEMA MICROGRAMMARS FROM CHILDES CORPORA

18.1 general introduction

The first simulation has indicated that one can use evolutionary computation in order to partition a semantic feature space into regions which roughly correspond to certain "topics". The second simulation has shown how an evolutionary search succeeds in increasing the accuracy of so-called morphosemantic classifiers.
Both simulations differed in regards to the corpus which was analyzed (the 20 Newsgroups corpus in simulation 1, the CHILDES/Brown/Eve corpus in simulation 2) as well as in the feature set used to project the initial text into a binary vector space. However, both simulations:

1. were optimized by means of an evolutionary algorithm
2. succeeded in transposing knowledge present in the training set in order to correctly classify the elements of the testing set (i.e. generalization)
3. used a labeled corpus as the input of the learning process

Taken together, points two and three indicate that simulations 1 and 2 can be understood as particular instances of supervised machine learning. That is, a case of learning which demands more than exposition to the plain input corpus. In the case of supervised learning, one needs to have another, parallel source of information as well. Category labels which have been manually attributed by human annotators are the most common case of such a "parallel" source of information.

It may be the case, however, that certain problems do not necessitate exposure to such additional input at all. Such is, according to some linguists, also the problem of grammar induction (P+148-162), whereby one aims to infer a grammar of a language L solely from a corpus of utterances of L. Because of this, computational models of GI are considered to be particular cases of unsupervised machine learning [1].

This chapter shall aim to present one particular model of GI. That is, an evolutionary model strongly resembling the models presented in previous chapters. But also a model aspiring to induce certain generic "microgrammars" from nothing else than the Brown/Eve section of the CHILDES corpus. The article presented in this chapter has been submitted to the journal Evolutionary Computation (Hromada, 2016b).

[1] Note, however, that the very act of choosing, at the moment T0 (and not at T1), an input corpus CX (and not CY) can also be considered as an act of supervision. C.f. (Hromada, 2014b, 2016f) for further discussion of the "unsupervised" vs. "semi-supervised" dilemma.

18.2 introduction

The input of a Grammar Induction (GI) process is a corpus of sentences written in a language L; its output is, ideally, a grammar (P+117-P+124) or a transparent language model able to generate sentences of L, including sentences that were not present in the initial training corpus. In spite of the seemingly simple nature of the problem, the induction of grammars from natural language is quite a difficult nut to crack. Thus, symbolic models like the Syntagmatic-Paradigmatic GI (Wolff, 1988) or the graph-based ADIOS (Solan et al., 2005; Brodsky et al., 2007) do, indeed, attain interesting results in their efforts to extract English grammar from English corpora. But given the deterministic nature of these models, they tend to converge to certain local optima from which there is no way out. To make things worse, such models often do not dispose of means which would allow them to purge themselves of unwanted over-regularizations (P+83).

In this chapter, we shall present a GI model aiming to harness evolution's ability to discard the unwanted. What's more, we shall exploit the genotype - phenotype distinction (Fogel, 1995) in order to perform sub-symbolic variation of sets of symbolic sequences. By doing so, we shall obtain a model which integrates entities represented at two levels of abstraction:

1. sub-symbolic feature vector spaces
2. symbolic Perl-compatible regular expressions

Ideally, such a model could be both robust as well as flexible enough to find its middle path between grammars which cover just one thing and grammars which cover everything.

18.2.1 two extremes

The nature of the resulting grammar is closely associated with the content of the initial corpus as well as with the nature of the inductive (learning) process. According to their «expressive power», all grammars can be located somewhere on a «specificity - generality» spectrum. On one extreme lies the grammar having the following production rules:

1 → 2*
2 → a | b | c | ... | Z

whereby * means «repeat as many times as You Want» and | denotes disjunction. This very compact grammar can potentially generate any text of any size and as such is very general. But exactly because it can accept any alphabetic sequence and thus does not have any «discriminatory power» whatsoever, such a grammar is completely useless as an explication of the system of any natural language.

On the other extreme of the spectrum lies a completely specific grammar which has just one rule:

1 → <Corpus>

This grammar contains exactly what Corpus contains and is therefore not compact at all (in fact, it is even two symbols longer than Corpus). Such a grammar is not able to encode anything else than the sequence which was literally encoded in the training Corpus. Such a grammar is therefore completely useless for any scenario where novel sequences are to be generated (or accepted).

The objective of the GI process is to discover, departing solely from a Corpus (written in a language L), a grammar which is neither too specific nor too general. If it is too general, it shall «over-regularize» (P+83). That is: such a G shall be able to generate (or accept) sentences which the common speaker of L would never ever consider as grammatical. On the other hand, if G is too specific, it shan't be able to represent all sentences contained in Corpus or, if it shall, it shan't be able to generate (or accept) any sentence which is considered to be a sentence of L but was not present in the initial training Corpus.

18.2.2 definitions

G-Category (DEF)
Let's have a set of N objects (O1, O2, ..., ON) embedded within a ∆-dimensional space S (i.e. every object OX can be described by a vector o⃗X = (V1, V2, ..., V∆)). Then a geometrized category (G∆-category) C is defined as the content of an S-embedded ∆-dimensional sphere with

1. a centroid whose coordinates are given by a vector c⃗ = (C1, C2, ..., C∆)
2. a radius R

Under such a definition, all objects OY, OZ, ... positioned within the volume of C are to be understood as members of C.
end g-category 18.2.2.0

We reinforce: under this view, a G∆-category is a convex region within S (Gärdenfors, 2004) [2]. Concrete geometric properties of such a ball (e.g. the increase of its volume with the increase of its radius etc.) are, of course, dependent on the nature of the metric space in which the sphere is embedded (e.g. V(r) = (4/3)πr³ for 3E-categories, i.e. categories embedded within a 3-dimensional Euclidean space).

In our simulations 2 and 3, we have used the Light Stochastic Binarization (Hromada, 2014c) algorithm to project the initial objects onto positions within 128- or 64-dimensional binary Hamming spaces. We define categories within such spaces as follows:

H∆-Category (DEF)
An H∆-category is a Hamming ball within a ∆-dimensional Hamming space.
end H∆-category 18.2.2.0

[2] Those endowed with synesthesia could potentially visualize G-categories as ∆-dimensional pearls (Hesse, 1967) or balls of certain material, state and color.
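As an illustration of this definition: in a Hamming space, testing whether a word belongs to an H-category amounts to a bitwise XOR followed by a population count. The following sketch is our own illustration (with hamming_weight written out as an assumed helper, consistent with its later usage in Listing 15; the example values are the toy hashes of Table 49 below, treated as zero-padded integers):

    # number of set bits in an integer (population count)
    sub hamming_weight {
        my $bits = sprintf '%b', shift;   # binary representation
        return $bits =~ tr/1//;           # count the '1' characters
    }

    # membership in an H-category: Hamming distance below the category radius
    sub in_h_category {
        my ($word_hash, $centroid, $radius) = @_;
        return hamming_weight($word_hash ^ $centroid) < $radius;
    }

    # e.g. "duck" (0x7720) lies within a category centered at 0x7723 with
    # radius 7, since only 2 bits differ between the two hashes
    print in_h_category(0x7720, 0x7723, 7) ? "member\n" : "outside\n";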
Given that

1. the radius of an H∆-category cannot be higher than ∆ (for such a sphere would envelop the whole space S)
2. any integer from the range ⟨0, ∆⟩ can be represented with log₂ ∆ bits
3. log₂ 128 = 7 and log₂ 64 = 6

it is evident that one needs exactly 135 bits of information [3] in order to unambiguously specify a specific H128-category embedded in a 128-dimensional Hamming space. And one needs 70 bits of information in order to unambiguously specify an H64-category embedded in a 64-dimensional Hamming space. In this simulation, we shall juxtapose vectors representing diverse H64-categories in order to obtain more complex schemata.

N∆-Schema (DEF)
An N∆-schema is the result of the concatenation of N vectors g⃗1, g⃗2, ..., g⃗N whereby each vector g⃗1, g⃗2, ..., g⃗N represents a G-category located within a ∆-dimensional space S∆.
end N∆-schema 18.2.2.0

The focus of the current simulation shall be on the induction of schemata in the case where N = 4. Given that the basic units of such 4-schemata will be H64-categories, it can easily be seen that such 4-schemata can be encoded by no more and no less than 4 * 70 = 280 bits.
end definitions 18.2.2

[3] 128 bits to specify the coordinates of the centroid and 7 bits to specify the radius.

Under these definitions, the model and the simulation described in this text can be understood as a method which aims to infer - departing solely from a plain-text Corpus written in a language L_Corpus - a 4-schema (or a set of 4-schemata) able to generate utterances which were originally not in the Corpus but are nonetheless still syntactically correct utterances of the language L_Corpus.
end introduction 18.2

18.3 model

In its essence, the model presented in this simulation is reminiscent of the model presented in Chapter 17. Hence, during the phase of "vector space preparation", texts from English-language transcripts of CHILDES corpora are first projected into a 64-dimensional Hamming space H64. Subsequently, a search within H64 is realized by means of an evolutionary algorithm. There exists, however, a certain difference which ultimately causes the algorithm hereby presented to be an essentially non-supervised one. Thus, in the present situation, an HX-category increases the probability of its survival in time if and only if HX is contained in an utterance-like N-schema which matches as many utterances as possible.

18.3.1 vector space preparation

Listing 14: Perl code of the neighbor-word feature extractor

    sub word_juxtaposition_featurefilter {
        my @all_words = split / /, shift;
        my $word      = shift;
        my @features;
        my ($word_position) = grep { $all_words[$_] eq $word } 0 .. $#all_words;
        if ($word_position == 0) {                   # word begins the utterance
            push @features, "INIT";
            push @features, "1" . $all_words[$word_position + 1];
        } elsif ($word_position == $#all_words) {    # word ends the utterance
            push @features, "-1" . $all_words[$word_position - 1];
            push @features, "END";
        } else {
            push @features, "-1" . $all_words[$word_position - 1];
            push @features, "1" . $all_words[$word_position + 1];
        }
        return @features;
    }

The method known as Light Stochastic Binarization (LSB) (Hromada, 2014c) is used to project the input text onto H64.
Note, however, that the initial features slightly differ both from the approach presented in Chapter 16, which used word frequency distributions to project documents onto a resulting semantic space, and from the approach presented in Chapter 17, which used suffixal information to project words onto a resulting morphosemantic space. In contrast to both these methods, the feature extractor presented in Listing 14 focuses on two sources of information only: the identity of the word WL juxtaposed to the left side and of the word WR juxtaposed to the right side of the target word WX.

For example, the function call

    word_juxtaposition_featurefilter("this is a dog", "dog")

returns an array @features characterizing this concrete token of the word "dog" in terms of two features: -1a, END. In this case, the first feature encodes the fact that the token is preceded by the indefinite article "a" while the second feature encodes the fact that "dog" is the last token of the utterance. Similarly, the token "this" would be characterized by the features INIT, 1is; the token "is" would be characterized by the features -1this, 1a; and the token "a" would be characterized by the features -1is, 1dog.

Once each word of each utterance is characterized by its features, one follows a standard Random Indexing procedure (Sahlgren, 2005) in order to attribute to each distinct feature a distinct, randomly generated, 64-dimensional, sparsely non-zero "init" vector. Subsequently, the Euclidean representation of every word type WX is obtained as a sum (i.e. an unweighted linear combination) of the init vectors of the features with which WX is associated in the corpus. These Euclidean vectors are later normalized and enter the binarization procedure, which leads to concise 8-byte hashes having the property:

The more the words WX and WY tend to occur in similar contexts, the smaller the Hamming distance between LSB(WX) and LSB(WY) shall be.

It is, indeed, this property which shall potentially allow us to effectuate successful evolutionary searches within the H64 space, a space which could potentially be labeled as "morpho-syntactic".
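For readers unfamiliar with Random Indexing, the following sketch illustrates the init-vector summation just described. It is our own simplified illustration (with S = 3 non-zero components per init vector, c.f. the Seed parameter in Table 51), not the thesis code; normalization and the LSB binarization step are left out:

    my $DIM = 64;          # dimensionality of the target space
    my %feature_vec;       # feature => reference to its random "init" vector

    # sparse ternary init vector: S = 3 randomly placed +1/-1 components
    sub init_vector {
        my @v = (0) x $DIM;
        $v[ int rand $DIM ] = (rand() < 0.5 ? -1 : 1) for 1 .. 3;
        return \@v;
    }

    # Euclidean word vector = unweighted sum of its features' init vectors
    sub word_vector {
        my @features = @_;    # e.g. output of word_juxtaposition_featurefilter
        my @sum = (0) x $DIM;
        for my $f (@features) {
            $feature_vec{$f} //= init_vector();
            $sum[$_] += $feature_vec{$f}[$_] for 0 .. $DIM - 1;
        }
        return \@sum;         # to be normalized and binarized by LSB
    }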
end vector space preparation 18.3.1

18.3.2 bridging the sub-symbolic and symbolic realms

In order to better understand the model hereby presented, one needs to understand a certain distinction often implemented by proponents of evolutionary programming (Fogel, 1995) or of evolutionary strategies (Rechenberg, 1971). Id est, the distinction between the genotype and the phenotype.

Genotype
The information-encoding substrate potentially modifiable by variation and replication operators. Unambiguously translatable into a phenotype.
end genotype 18.3.2.0

Phenotype
The concrete manifestation of a specific genotype against which fitness can be evaluated. A distinct phenotype PX can potentially manifest multiple distinct genotypes.
end phenotype 18.3.2.0

Listing 15: Transcription of vector representations (genotypes) into regular expression phenotypes

     1   $regex = '';
     2   $extension = 1;
     3   for $component (0 .. $#genotype) {
     4       $cregex = '';
     5       $cextension = 0;
     6       $radius = $genotype_radius[$component];
     7       for $word (@all_words_in_corpus) {
     8           $word_hash = $word_hashes{$word};
     9           $word_hcategory_distance = hamming_weight($word_hash ^ $genotype[$component]);
    10           if ($word_hcategory_distance < $radius) {
    11               $cregex = $cregex ? $cregex . '|' . $word : '(' . $word;
    12               $cextension++;
    13           }
    14       }
    15       $regex .= $cregex . ')' if $cregex;
    16       $extension *= $cextension if $cextension;
    17   }
    18   $regex = '^' . $regex . '$';   # utterance-anchored

In the context of the current simulation, N-schemata (18.2.2) of length N = 4, i.e. 4-schemata, are to be understood as individual genotype instances. As is always the case in evolutionary computation, these schemata replicate, mutate, cross-over etc. But in order to get their fitness attributed, these genotypes have to be translated into phenotypes. Such a translation is realized by means of the procedure displayed in Listing 15.

The core idea of the genotype - phenotype translation is to be found on lines 9-11. On line 9, the Hamming distance between the hash of each among the N components of the candidate genotype N-schema and the hash of each word WX represented in the H64 vector space is evaluated. On line 10, the algorithm checks whether the obtained distance is smaller than the radius which is also included in the genotype. If yes, then the literal sequence of the signifiant of the word WX is injected into the resulting phenotype in such a way that the resulting phenotype is a syntactically correct Perl-Compatible Regular Expression (Wall et al., 1994; Hromada, 2011, 2016e).

In other terms, the code displayed in Listing 15 can be understood as a method of translation of sub-symbolic (feature-based) binary vector representations into symbolic representations known as regular expressions. For example, let's look at Table 49, which illustrates a small hypothetical CorpusMini containing only the words "that", "this", "it", "is" ... and their corresponding binary hashes [4].

Table 49: Words of a CorpusMini and hexadecimal representations of their potential hashes.

    word    hash   |   word    hash
    this    BABA   |   that    BABB
    it      BAAB   |   is      0F23
    are     0F11   |   a       C123
    the     C125   |   not     5FF5
    duck    7720   |   dog     7725

Then, if ever a 5-schema [5] like the one presented in Table 50 were identified by the evolutionary search, it would be translated into the regular expression:

Table 50: A candidate genotype which could potentially be induced from the hypothetical CorpusMini.

    syntagma    H1      H2      H3      H4      H5
    center      BABC    0F20    5FF0    C124    7723
    radius      17      5       7       3       7

    ^(this |that |it )(is )(not )(a |the )(dog|duck)$

which represents the microgrammar

    Utterance → Syn1 Syn2 Syn3 Syn4 Syn5
    Syn1 → this | that | it
    Syn2 → is
    Syn3 → not
    Syn4 → a | the
    Syn5 → dog | duck        (10)

potentially covering 12 distinct utterances [6]. It would, however, not match utterances of the sort "this are not the dog" because the Hamming distance between the word "are" and the centroid of the 2nd component is bigger than the radius of that very component (i.e. HD(LSB("are"), Centroid2) = HD(0F11, 0F20) = 9 > Radius2). In such a way, one can determine the exact form of a Perl-compatible regular expression (PCRE) by means of distance measurements in the underlying H64 space.

[4] As usual, 64-bit hashes are presented in hexadecimal format as sequences of four characters from the range 0-9A-F.
[5] In order to stay aligned with traditional linguistics, we shall sometimes use the term "syntagma" (resp. its abbreviated form "syn") as a synonym for the term "component".
[6] We shall further denote the maximal theoretical number of covered utterances with the term extension.
And given that PCREs are

1. strings of symbols which describe sets of strings of symbols
2. a sort of lingua franca of many engineers active in the domains of Natural Language Processing, data-mining or information retrieval
3. well-tuned and optimized by almost three decades of development by not only the Perl but also the C++, Python, or R communities
4. transparent to inspection by human examinators [7]

one can potentially start to see a certain usefulness in developing an architecture which can unambiguously transform sub-symbolic geometrized genotypes into comprehensible, symbolic, and manually modifiable PCRE phenotypes.

18.3.3 fitness function

The fitness of an N-schema NX is principally determined by two characteristics:

1. its extension E, or maximal theoretically possible sensitivity: a finite natural number representing the quantity (i.e. the cardinality of the set) of all utterances which could be matched by NX
2. its Corpus sensitivity Y: the number of utterances, present in the Corpus, which have been matched by NX

More formally: let's have an N-schema X composed of N H64-categories HX1, HX2, ..., HXN. Then X is said to have an overall extension E defined as the multiplicative product of the extensions of its individual categories:

    E_X = ∏_{k=1}^{N} I_{H_k}        (11)

whereby the individual extension I_{H_k} of the k-th category H_k is defined as the number of members of H_k, i.e. I_{H_k} = |H_k|, where |H_k| denotes the cardinality of the set of objects whose distance from the centroid h⃗_k is less than the radius of the category H_k. For example, the extension E of the 5-schema presented in Table 50 is 12 because I_{H1} * I_{H2} * I_{H3} * I_{H4} * I_{H5} = 3 * 1 * 1 * 2 * 2 = 12.

In contrast to E, which is more of an information-theoretic quantity, the sensitivity Y is a value which is always relative to a certain corpus:

    Y_X = |{ U ∈ Corpus : N_X matches U }|

This notion is further exemplified by the first line of the following listing.

Listing 16: Perl code behind the fitness function Fitness1

    use List::MoreUtils qw(true);   # true() counts elements for which the block holds

    my $sensitivity = true { /$regex/ } @corpus;
    # returns the number of utterances in @corpus matchable by $regex
    if ($sensitivity) {
        $f = ($sensitivity**2) / $extension;
    } else {
        $f = 0;
    }

Extension and sensitivity thus defined, the fitness value of the schema NX has been, for the purpose of the current simulation, defined as:

    Fitness_1(N_X) = (Y_X * Y_X) / E_X        (12)

The rationale behind our choice of this and not of another fitness function [8] is simple: given that we shall tend to maximize the fitness function, we put the extension into the denominator (i.e. the divisor) while putting the sensitivity into the numerator (i.e. the dividend). Thus aligned, it may be expected that the implementation of such a fitness function shall direct the evolutionary search towards schemata with both low extension as well as high sensitivity. For this reason, the sensitivity is squared in order to somewhat counteract the impact of the extension, which is itself a multiplicative product of the extensions of its components.

[7] Only 5 PCRE meta-characters are used in this article: ( denotes the beginning of a disjunctive group; ) denotes the end of a disjunctive group; | is a separator between two members of a disjunctive group; ^ denotes the beginning and $ the end of the expression.
[8] Many other fitness functions are, of course, possible and only very few of them have been tested. It cannot be excluded that more useful fitness functions shall be identified in the future. If not, then the fitness function Fitness1 hereby defined could potentially be thought of as an expression of a certain cognitive law. Such conjectures, however, would bring us too far.
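To see Fitness1 at work, consider the following toy computation of ours, in the spirit of the Table 50 example (the corpus and the consistently spaced variant of the phenotype regex are our own illustrations):

    # Toy corpus; the last utterance lacks "not " and is thus not matched.
    my @corpus = (
        "this is not a dog",
        "that is not a duck",
        "it is not the dog",
        "this is a dog",
    );
    my $regex = qr/^(this |that |it )(is )(not )(a |the )(dog|duck)$/;

    my $extension   = 3 * 1 * 1 * 2 * 2;              # E = 12 (Eq. 11)
    my $sensitivity = grep { /$regex/ } @corpus;      # Y = 3
    my $fitness     = $sensitivity
                    ? $sensitivity**2 / $extension    # Fitness1 = Y^2 / E (Eq. 12)
                    : 0;
    printf "E=%d Y=%d Fitness1=%.2f\n", $extension, $sensitivity, $fitness;
    # prints: E=12 Y=3 Fitness1=0.75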
18.3.4 evolutionary strategy

The INDUCTOR1 evolutionary algorithm implemented in this simulation is similar to the algorithm CANONIC presented in 17.3.2. The tournament operator is used as the main and only method of selection of fit individuals from the population into the mating pool. The size of the mating pool is equal to the population size and mutations of centroid coordinates are equivalent to "bit flipping". There exist, however, certain important differences which distinguish the algorithm hereby presented from CANONIC:

1. implementation of the phenotype-genotype distinction
2. evolution of both centroid coordinates as well as category radii
3. the zeroth population is not generated in a purely random fashion
4. crossover occurs only at specific locations
5. a re-focusing strategy is implemented

Taken together, these differences result in an algorithm endowed with certain characteristics of an evolutionary strategy (Rechenberg, 1971) or of evolutionary programming (Fogel, 1995).

18.3.5 evolution of both centroids and radii

As has already been indicated, the individual solutions identified by INDUCTOR1 are essentially nothing else than 4-schemata. That is, binary vectors which encode a syntagmatic sequence of four H64-categories. Given that an H64-category is defined in terms of both its center as well as its radius, INDUCTOR1 tries to identify not only the most optimal coordinates of a category's centroid (as was the case in Chapter 17), but also the most optimal "extension", which is principally represented by H's radius. Information about the radius of each category is thus also part of the chromosome and is encoded as an integer value from the range ⟨0, ∆⟩. The probability of mutation of a radius-encoding gene is 0.2%. If subjected to mutation, the radius is either decremented or incremented by 1: this corresponds to the category becoming less, resp. more exhaustive.

18.3.6 pseudo-random initialization of the 0th population

Every single individual of the initial population of N-schemata is generated as follows:

1. choose a random word W1 occurring in the corpus and retrieve its geometric coordinates w⃗1
2. define w⃗1 as the center of the first category H1
3. choose a random word W2 occurring in the corpus and retrieve its geometric coordinates w⃗2
4. define w⃗2 as the center of the second component H2
5. ...
6. choose a random word WN occurring in the corpus and retrieve its geometric coordinates w⃗N
7. define w⃗N as the center of the last syntagmatic component HN

Subsequently, a radius which is neither too big nor too small is attributed to each among the N components. In the case of INDUCTOR1, the radius was set to the value 13 [9] which, in the context of a 64-dimensional Hamming space, seems to denote a distance which is neither too small nor too big.

Thus, contrary to the ex nihilo initialization of CANONIC, which started the induction process from randomly generated positions of all centroids, INDUCTOR1's initial 0th population is only partially random. This is so because, at the end of the initialization process, the center of each component of every individual N-schema is identical with the position of a certain word present in the Corpus [10]. A sketch of this procedure follows.

[9] A big radius results in a big extension of the corresponding category and hence in many false positives. A small radius causes the category to have a small extension and hence to potentially miss many true positives.
[10] Such an approach significantly boosts the inductive process, which could otherwise have certain difficulties in booting itself up.
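The following is our own minimal sketch of this initialization (variable names such as %word_hashes and @all_words_in_corpus are assumed, echoing Listing 15; this is not the simulation's actual code):

    # Each centroid is seeded with the hash of a randomly chosen corpus word;
    # every radius starts at 13.
    sub init_individual {
        my ($n_components, $words, $hashes) = @_;   # $hashes: word => 64-bit hash
        my (@centroids, @radii);
        for (1 .. $n_components) {
            my $word = $words->[ int rand @$words ];   # random corpus word
            push @centroids, $hashes->{$word};         # its position becomes a center
            push @radii, 13;                           # neither too small nor too big
        }
        return { centroids => \@centroids, radii => \@radii };
    }

    # a population of 100 pseudo-random 4-schemata:
    my @population = map { init_individual(4, \@all_words_in_corpus, \%word_hashes) } 1 .. 100;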
18.3.7 locus-constrained cross-over

INDUCTOR1's cross-overs took place only at specific loci: namely at positions 64, 128 and 192 of the chromosome specifying the centers of the diverse G-categories. In more practical terms, such a design choice assured that the information specifying all coordinates of a G-category of the parent individual X was substituted by the information specifying all coordinates of another G-category encoded in another parent individual Y.

This distinction aside, the usage of cross-over in the INDUCTOR1 strategy has been fairly standard: every individual of a new generation was obtained as the result of a cross-over between two randomly chosen members of the mating pool.

18.3.8 re-focusing strategy

Another particular aspect is related to INDUCTOR1's ability to prioritize, with every new run, the induction of new schemata. In practice, this is attained by starting every new run with the execution of the code presented in Listing 17.

Listing 17: Perl code behind the re-focusing strategy

    @corpus = grep { !/$previous_fittest_schema/ } @corpus;

Literally speaking, this line of code removes from the corpus all utterances matched by the fittest N-schema of the previous run. This results in a gradual shrinking of the size of the corpus against which the fitness of all future candidate schemata shall be evaluated. In more general terms, the re-focusing strategy orients the process towards the inference of schemata from such utterances from which no schema has yet been induced [11]. And said in more "cognitive" terms, the algorithm invests more attention into the exploration of structural regularities within data which have not yet been explored. The resulting outer loop is sketched below.

[11] An inductive process lacking the re-focusing strategy would often "lock" itself onto the most salient patterns present in the corpus, which would result in distinct runs often converging to similar schemata.
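The outer loop implied by this strategy can be sketched as follows (evolve_one_run is a hypothetical stand-in for one complete evolutionary run returning the fittest phenotype regex; this is our own illustration, not the thesis code):

    my @working_corpus = @corpus;
    my @microgrammar;                       # collection of induced schemata
    for my $run (1 .. 100) {
        # one full evolutionary run over the (shrinking) working corpus
        my $fittest_regex = evolve_one_run(\@working_corpus);
        push @microgrammar, $fittest_regex;
        # re-focusing: covered utterances leave the working corpus
        @working_corpus = grep { !/$fittest_regex/ } @working_corpus;
        last unless @working_corpus;        # stop once everything is covered
    }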
Figure 32: Data flow among the main components of INDUCTOR. Lime color denotes components related to evolutionary optimization, royal blue color denotes components of the preliminary VSP phase.

18.4 simulation

The simulation presented in this section has implemented the evolutionary strategy INDUCTOR in order to induce sets of regexp-like rules from four-word English utterances contained in the CHILDES corpus. A diagram elucidating the relations between the main INDUCTOR components is visible on Figure 32. The simulation was invoked twice, once in a 64-dimensional space (INDUCTOR64) and once in a 128-dimensional space (INDUCTOR128). The vector space preparation phase (c.f. Section 18.3.1) yielded a vector space in which all subsequent INDUCTOR runs took place. Each among the 2 * 100 distinct runs of INDUCTOR was initialized by a pseudo-random generation of the zeroth population.

18.4.1 corpus

This article is conceived as a part of a dissertation addressing the possibility of developing evolutionary models of the induction of linguistic rules in (and by) human children. This makes the choice of the corpus quite straightforward: the corpus from which we shall aim to extract first linguistic categories is to be contained in the Child Language Data Exchange System (CHILDES, (MacWhinney and Snow, 1985)). Inspired by the "less is more" hypothesis (Elman, 1993), the input corpus used in the simulation hereby presented consisted of 1047 four-word "motherese" [12] utterances extracted from the English section of CHILDES [13]. No other data has been used to guide the inductive process.

[12] In CHILDES, lines containing motherese utterances begin with the marker *MOT.
[13] Available at http://wizzion.com/thesis/simulation3/utterances.4

18.4.2 parameters

Table 51: Parameters of the diverse components of the INDUCTOR algorithm.

    VSP                 Input corpus          CHILDES-English
                        Feature filter        word_juxtaposition
                        Dimensionality        ∆ = 64 or ∆ = 128
                        Seed                  S = 3
                        Reflections           I = 0
    INDUCTOR            Population size       N = 100
                        Selection             Tournament
                        Crossover             One-point
                        Mutation rate         M = 0.2%
                        Initial population    pseudo-random
                        Generations           G = 100
                        Elitism               E = 0
    Machine Learning    Runs                  R = 100
                        Syntagms              N = 4

18.5 observations

?? lists 100 regexp-like rules which have been evaluated as "fittest" at the end of the distinct INDUCTOR runs which took place in an H64 space. These hundred rules match 176 of the 1047 utterances present in the input corpus (16.8%). ?? lists 100 regexp-like rules which have been evaluated as "fittest" at the end of the distinct INDUCTOR128 runs. These runs took place in an H128 space. These hundred rules match 176 of the 1047 utterances present in the input corpus (15.8%).

As marked in both Appendices by the token GENERAL, INDUCTOR was also able to identify many completely grammatical 4-schemata which are able to accept (or generate) even utterances which have not been present in the input corpus. Such a generalization faculty was observed in 82% of the resulting individuals in the case of H64 and in 77% of the individual 4-schemata induced in H128. Among the individuals induced in H64, 32 have been manually evaluated as ALLGOOD, id est capable of accepting|generating only grammatically correct utterances of the English language. For example, the most fit schema of the sixth run of INDUCTOR64:

    ^(that )(is )(a )(bag|banana|basket|bridge|cherry|cow|gate|horse|kleenex|motorcycle|puzzle|rabbit|raccoon|shoe|spoon|story|timer|tractor)$

is able to accept|generate 18 grammatically correct English utterances in spite of the fact that only 5 among these 18 sentences have been explicitly present in the input corpus.

Excessive over-regularization was observed in the case of 21 individuals willing to accept|generate at least one WRONG utterance. Asides this, 4-schemata issued from 28 runs of INDUCTOR64 have been marked as DISPUTABLE. That is, as capable of accepting|generating utterances which would be classified as "ungrammatical" by an orthodox grammarian, but could nonetheless occur in real-life usage. These border cases include utterances like:

    where is the clever (individual 9)
    what are we joey (individual 18)
    there is what one (individual 34)
    there does he go (individual 55)
    what are you joey (individual 83)
    oh what is i (individual 87)

as well as utterances which are syntactically correct, but semantically doubtful:

    oh you are strawberries (individual 63)
    oh you are fries (individual 63)
    okay that is thumb (individual 91)

et caetera, et caetera. In the case of INDUCTOR128, 37 induced 4-schemata have been manually evaluated as ALLGOOD and 17 as DISPUTABLE.
18.5.1 diachronic observations

A deeper, time-oriented inspection of the processes taking place during individual runs can also be of certain interest.

Listing 18: First exemplar of a non-monotonic ontogenetic trajectory

    #ITERATION 30 FITNESS 1.333333
    ^(do )(you )(like )(candy|some|strawberries)$
    #ITERATION 40 FITNESS 1.14285714285714
    ^(do )(you )(like )(bananas|box|candy|cover|fell|ketchup|nana|not|papa|popsicles|some|sorry|strawberries|tired)$
    #ITERATION 50 FITNESS 1.8
    ^(do )(you )(like )(box|candy|ketchup|some|strawberries)$

On Listing 18 it may be seen that after 30 iterations, INDUCTOR1 had identified a 4-schema able to accept|generate the utterances "do you like candy", "do you like some" and "do you like strawberries". However, this schema was lost in the following 10 generations and the fitness fell from 1.33 to 1.14 [14]. Hence, an over-regular schema gained in prominence which was able to accept even such constructs as "do you like sorry" or "do you like tired". But in the following ten generations, the population dynamics of the whole system not only led to the correction of the previous errors, but even brought about an increase in fitness to 1.8, which went hand in hand with the schema's ability to match utterances like "do you like box" or "do you like ketchup".

Another run, presented on Listing 19, also exemplifies such non-monotonic, error-correcting aspects of the INDUCTOR1 algorithm. As may be seen, the incorrect utterance "what is he going" was acceptable by the fittest individual of the 40th and 50th iterations. This was corrected in the 60th generation, but further development brought about yet another batch of mistakes: utterances like "what is he cute" and "what is he share" were thus acceptable by the most fit individual of the 80th generation. This has been subsequently corrected and the run terminated, after 100 generations, with a GENERAL, ALLGOOD 4-schema.

Listing 19: Second exemplar of a non-monotonic ontogenetic trajectory

    #ITERATION 30 FITNESS 1.33333333333333
    ^(what )(is )(he )(doing|playing|saying)$
    #ITERATION 40 FITNESS 1.8
    ^(what )(is )(he )(doing|going|holding|playing|saying)$
    #ITERATION 50 FITNESS 1.5
    ^(what )(is )(he )(doing|drinking|going|holding|playing|saying)$
    #ITERATION 60 FITNESS 1.8
    ^(what )(is )(he )(doing|drinking|holding|playing|saying)$
    #ITERATION 70 FITNESS 2.25
    ^(what )(is )(he )(doing|holding|playing|saying)$
    #ITERATION 80 FITNESS 2.28571428571429
    ^(what )(is )(he )(called|cute|doing|holding|playing|saying|share)$
    #ITERATION 90 FITNESS 1.5
    ^(what )(is )(he )(doing|drinking|going|holding|playing|saying)$
    #ITERATION 100 FITNESS 2.25
    ^(what )(is )(he )(doing|holding|playing|saying)$

[14] This is, of course, due to the fact that INDUCTOR1 does not implement any form of elitism which would safeguard the fittest individuals from destructive variations.
18.6 conclusion

Almost one third (32%) of the 4-schemata - identified by INDUCTOR1 sweeping a 64-dimensional Hamming space representing 1047 English "motherese" utterances - produce only correct generalizations. The collection of all induced N-schemata yields what we call a "microgrammar". Such a microgrammar is more a construction-based (Fillmore et al., 1988; Lakoff, 1990) or usage-based (Tomasello, 2009) grammar than a grammar in the sense of Formal Language Theory (P+117-122) or in the sense commonly accepted by proponents of the generativist doctrine (Chomsky, 2002). But given that such a microgrammar (c.f. ??) is capable of generating more syntactically correct utterances than those which had been presented through the training corpus, one can still consider it to be, in certain regards, modestly generative.

We say "modestly" because the generative faculty is kept on a leash by evolution's tendency to discard such schemata which would be either too concrete (i.e. have low sensitivity Y) or too exhaustive (i.e. have high extension E). Hence, the thorny problem of over-generalization is - at least in the case of an algorithm implementing the INDUCTOR1 evolutionary strategy - not resolved by any a priori knowledge embedded in some kind of Chomskyan "Universal Grammar". Far from it: we propose to depart from the idea that grammar-inducing agents are not "ideal learners" in the sense of Gold's Theorem (Gold, 1967; Johnson, 2004). On the contrary: the process of grammar induction can only fully succeed if some information-encoding representations are, sometimes, irreversibly forgotten or subjected to variation. In this article, variation was attained by operators which:

1. mutate the coordinates of the centers of syntagmatic G-categories
2. mutate the radii of syntagmatic G-categories (i.e. increase or decrease a category's extension)
3. substitute G-categories from one N-schema with G-categories from another N-schema (i.e. locus-constrained crossover)

By causing these operators to perform their operations in a sub-symbolic vector space, and by evaluating the results of their activities on a symbol-sequence level, one can obtain a system able to induce simple 4-schema microgrammars from a simplified corpus of English "motherese" utterances which are four words long. This [15], however, is only the beginning.

[15] Proof-of-concept source code of this simulation is available at URL http://wizzion.com/thesis/simulation3/EGI.tgz under the mrGPL license.

18.7 general discussion

There is an appealing symmetry in the notion that the mechanisms of natural learning may resemble the processes that created the species possessing those learning processes.
— D.E. Goldberg and J. Holland

More generally and beyond syntax, the operators implemented in the 3rd simulation can be associated with the following psychological phenomena:

1. mutation of an N-schema - synaptic pruning (P+38), information decay, forgetting etc.
2. crossover between two N-schemata - related to creativity, dreaming (P+89-90) and phantasia

Other variation operators - corresponding to certain forms of

1. playing certain language games (Wittgenstein, 1953; Nowak et al., 1999), or "intrapsychic" (Brams, 2011) games
2. imitating certain phenomena observed in the linguistic behavior of human children (P+184-204)

could also be deployed.

Another subsequent enhancement of the GI method hereby introduced could potentially result from the introduction of additional feature sets. For example, one could take a fit N-schema X, decompose it into its component G-categories G1, G2, ..., GN and, if ever a certain component G-category Gα turns out to be disjunctive, enrich the vectorial representations of all its members with the information that they belong to Gα. For example, one could enrich the vectorial representations of the tokens "doing", "holding", "playing", "saying" with the information that they turned out to be subsumed under a G-category present in one quite fit 4-schema (c.f. Listing 19). And one could enrich the vectorial representations of the tokens "ketchup", "strawberries" etc. with the information that these tokens turned out to be subsumed under yet another G-category present in another schema (c.f. Listing 18). A sketch of such an enriched feature filter follows.
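A minimal sketch of our own (the category name and the %category_of map are hypothetical), reusing the feature filter of Listing 14:

    # Words known to belong to an induced disjunctive G-category receive an
    # additional feature naming that category: induced syntagms thus feed
    # the next round of geometrization.
    my %category_of = map { $_ => 'G_doing' } qw(doing holding playing saying);

    sub feedback_featurefilter {
        my ($utterance, $word) = @_;
        my @features = word_juxtaposition_featurefilter($utterance, $word);
        push @features, 'CAT' . $category_of{$word}
            if exists $category_of{$word};
        return @features;
    }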
Note that the introduction of such feature sets could be interpreted as the introduction of a feedback loop into the system. The essence of such a system could thus be considered to be not only linguistic, but also cybernetic (Wiener, 1961; Lorenz, 1973). It could be postulated that the introduction of such a feedback, bootstrapping (Hromada, 2014b; Karmiloff and Karmiloff-Smith, 2009, pp.111-118) loop into the system would not only result in the identification of more complex microgrammars, but would also cause the system to follow ontogenetic trajectories similar to those of children who undergo the so-called syntagmatic-paradigmatic shift (Nelson, 1977).

Figure 33: Data flow among the main components of an extended variant of INDUCTOR introducing a syntagmatic-paradigmatic feedback loop.

All such operators, features and feedback loops taken together, and coupled with

1. the fact that the brain (P+5) is a finite material object with finite resources which is subjected to the 2nd law of thermodynamics (P+7)
2. the fact that the linguistic input which the child receives is preprocessed by loving (P+241) and caring computational oracles (Turing, 1939; Clark, 2010) like mothers, fathers, care-takers etc.
3. the fact that the acquisition of language takes place in informationally very rich, contextually grounded, usage-based scenarios (Tomasello, 2009)

one cannot exclude that a sort of evolutionary, ecological, equilibrium-seeking process indeed takes place in the mind of a modal healthy language-acquiring toddler. And given that certain high-profile developmental linguists terminate their inquiry concerning the informatic properties of the language input with the conclusion that «internal mechanisms are necessary to account for the unlearning of ungrammatical utterances» (Marcus, 1993), we allow ourselves to conclude with the suggestion that the internal mechanism which Marcus mentions is, in reality, not a sort of universal grammar (P+98-101) black-box but instead a potentially "general cognitive process" (P+101, (Piaget, 1974)) whose very essence is to discard that which is non-functional: Evolution (P+3).

Part V

SUMMA

19 SUMMA SUMMARUM

The natural selection paradigm of such knowledge increments can be generalized to other epistemic activities, such as learning, thought and science.
— D.T. Campbell

The objective of this dissertation was to provide computational evidence for the "operational thesis" (P+20):

«Learning of toddlerese can be successfully simulated by means of an evolutionary algorithm processing textual representations of motherese.»

Given that

• the third simulation used no other input than the plain-text corpus of motherese utterances

and given that

• the third simulation resulted in the identification of schemata able to generate grammatically correct utterances which have not been present in the initial corpus

one may, in Popperian fashion, consider the "operational thesis" as temporarily unfalsified.
In this sense, we consider any future effort to falsify or verify "the softest thesis" (P+17-19):

«Ontogeny of toddlerese can be successfully simulated by means of evolutionary computation.»

as an effort worthy of interest. It is worth noting in this regard that certain notions, like that of a 4-schema or of a morphosemantic class, are not to be considered as ultimate elements of some sort of ewige Theorie but rather as temporary, limited building blocks of an architecture which is to be surpassed.

Surpassed by what? Maybe by models which introduce not only 4-schemata but also 2-schemata, 3-schemata, 5-schemata ... N-schemata. Or by procedures which integrate semantic, morphological and syntactic spaces within a single "linguistic" space SL. Given what we have seen until now, it cannot be a priori excluded that the results of certain types of evolution-inspired simulations taking place within such an SL would turn out to be consistent with "the softer hypothesis" (P+14-16) which states that

«learning of natural language can be successfully simulated by means of evolutionary computation»

But when speaking about optimization taking place within a linguistic space SL, shouldn't it also be possible to speak about optimization taking place within an even more generic space SG? For nothing prohibits that the category-inducing methods hereby introduced could be used to induce classifiers of partially or even fully non-linguistic entities. For example, a research project stemming from this dissertation may potentially explore the extent to which the evolutionary search for prototypes could be useful in Computer Vision: the only thing which would be fundamentally different would be the essence of the input entities (i.e. images and not texts) and of the features occurring in such entities (e.g. Haar features (Hromada, 2010c; Hromada et al., 2010) or others). In fact, nothing forbids the use of one among the three CI models hereby introduced whenever one needs to perform:

1. multiclass classification of entities (exemplified by the "supervised" simulations 1 and 2)
2. induction of rules from a positive corpus only (exemplified by the "unsupervised" simulation 3)

In other terms, the combination of the "vector space" and "evolutionary computation" components can be understood as a "generic optimization toolbox" (GOT) which could potentially be applied upon any set of features. It is, however, primarily the nature of the input corpus and the nature of the features which are extracted from the corpus which should most closely determine the nature of the categorization-performing agent thus induced. Hence, when applied upon data-sets describing "spatial" trajectories within a group of "labyrinths", one could aspire to induce rules allowing a certain robot, a certain automatized vehicle, or a certain sort of embedded artificial classifier system (Booker et al., 1989) to find its way out of a "labyrinth" it never saw before. Or - if one were to depart from so-called "morally relevant features" (Hromada and Gaudiello, 2014) - one could even hope to simulate the ontogeny of categories and rules of a somewhat different kind. That is, of categories and rules which are commonly labeled as "aesthetic" (i.e. beautiful / ugly), "moral" (i.e. good / bad), or "deontologic" (i.e. forbidden / allowed) (Hromada, 2016f).
Asides the "linguistic", "visual", "spatial" or "moral", the implementation of EML GOTs in the induction of other types of intelligence (Gardner, 2011) or of their combinations (Karpathy and Fei-Fei, 2015) in artificial agents and robots is also a task to be explored. If successful, it cannot be excluded that such explorations would potentially bring scientific and engineering communities one step closer to the deployment of metamodular (Hromada, 2012a) artificial agents able to:

1. integrate (Tononi, 2004) multi-modal (i.e. linguistic, visual, proprioceptive etc.) information
2. use a nature-inspired, evolutionary computational core to identify the most fit groupings of such information

By doing so, an ultimate ex computatio atque simulatio proof of the "soft thesis" (P+11-13):

«learning can be successfully simulated by means of evolutionary computation»

could, potentially, be given. To offer such a proof, however, is a task which by far surpasses the limits of any individual researcher. What is more, alternative machine learning paradigms (e.g. deep learning) currently predominate and it may be the case that the popularity of such approaches decreases the amount of attention which could - and should - be focused on the exploration of common grounds between computational models of learning and computational models of evolution.

Let's now enumerate certain advantageous properties of the evolutionary machine learning (EML) models which have been presented in simulations one, two and three. These EML models are:

1. functional: the function of the model is principally determined by the choice of the fitness function and of the selection/variation operators
2. alternative: at any moment TX, the learning system contains multiple alternative solutions of the problem (P+8-10)
3. population-based: the behavior of the learning system can be interpreted in terms of population dynamics (P+116)

Contrary to these, connectionist models are more "structural" than "functional"; they do not explicitly encode representations of diverse solutions and their convergence towards optimal states is more easily interpretable in terms of the differential "gradient descent" of "backpropagation" than in terms of population dynamics. What's more, by coupling the notion of evolution with that of a vector space, and by implementing a fairly trivial phenotype - genotype transcription (Section 18.3.2), one can obtain unsupervised EML models

1. bridging the sub-symbolic (vectorial) and symbolic (regexps and grammars) realms
2. transparent to investigation and modulation by a human investigator (i.e. easy to interpret and to teach)

Note that the property of being transparent to investigation and modulation is not a property which should be taken à la légère. For it could result in the creation of an inter-subjective bond between the artificial system which is being (investig|modul)ated and the human who (investig|modul)ates. In other terms, it could potentially result in the emergence of entities of non-organic origin who could, and should, be considered not only as objects of machine-learning but also as subjects of machine-teaching. Such considerations, however, bring us further than paradigms like machine learning or even computer science could ever bring us.
Such considerations bring us towards the meta-paradigm [1] of paedagogy and didactics (Komenský et al., 1991), which alone can demonstrate the validity and usefulness of the Theory of Intramental Evolution (Hromada, 2015). Such considerations bring us towards such regions of SG whereby the very "hard thesis" (P+2-10)

«Learning is a form of evolution»

could be evaluated as valid.

Valid or not, nothing forbids the sign-manipulating [2] mind (P+1) to realize a transposition (P+190-192) which savants like Bateson (Bateson, 2006) once realized. That is, a transposition between two terms, each of which denotes one big stochastic system; a transposition between "Mind" and "Nature"; a transposition which obliges one to state:

«Evolution is a form of learning [3]»

Such is, indeed, the ultimate result of the dissertation with which we aspire for the attribution of the title Philosophiae Doctor in both cybernetics as well as cognitive psychology. Such is, indeed, the result of the work commenced by the two words forming the "initial thesis" (P+1):

«Mind Evolves»

* **

[1] A scientific paradigm (Kuhn, 2012) transfers knowledge about a certain field of study. A scientific meta-paradigm transfers knowledge concerning the transfer of knowledge.
[2] «Thinking is essentially the activity of operating with signs.» (Wittgenstein, 1934)
[3] Lorenz (1973) states that the principal difference between learning and evolution is the ability of a learning system to "learn from one's own errors". A system which learns is supposed to have such an ability while a system which "only" evolves does not. But is it really always the case?

BIBLIOGRAPHY

Adler, A. (1976). Connaissance de l'homme. Payot.

Amancio, D. R., Altmann, E. G., Rybski, D., Oliveira Jr, O. N., and Costa, L. d. F. (2013). Probing the statistical properties of unknown texts: application to the voynich manuscript. PloS one, 8(7):e67310.

Ambridge, B., Theakston, A. L., Lieven, E. V., and Tomasello, M. (2006). The distributed learning effect for children's acquisition of an abstract syntactic construction. Cognitive Development, 21(2):174–193.

Araujo, L. (2002). Part-of-speech tagging with evolutionary algorithms. In Computational Linguistics and Intelligent Text Processing, pages 230–239. Springer.

Aristotle (-335 BC). Poetics: On Comedy. Unknown.

Aristotle (342 BC). On Coming-to-be & Passing-away. At the Clarendon Press.

Atkinson, Q. D. and Gray, R. D. (2005). Curious parallels and curious connections—phylogenetic thinking in biology and historical linguistics. Systematic biology, 54(4):513–526.

Augustine, S. (1838). Confessions. Book I.

Aurenhammer, F. (1991). Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR), 23(3):345–405.

Aycinena, M., Kochenderfer, M. J., and Mulford, D. C. (2003). An evolutionary approach to natural language grammar induction. Final Paper, Stanford CS224N, June.

Baixeries, J., Elvevåg, B., and Ferrer-i Cancho, R. (2013). The evolution of the exponent of zipf's law in language ontogeny. PloS one, 8(3):e53227.

Bandura, A. and McClelland, D. C. (1977). Social learning theory.

Barrett, D. (2007). Waistland: A (R)evolutionary View of Our Weight and Fitness Crisis. WW Norton & Company.

Barrett, M. D. (1978). Lexical development and overextension in child language. Journal of child language, 5(02):205–219.

Bateson, G. (2006). Mind and nature: A necessary unity (advances in systems theory, complexity, and the human sciences).

Bee, H. L. and Boyd, D. R. (2000).
The developing child. Allyn and Bacon, Boston.

Bellegarda, J. R. (2005). Unsupervised, language-independent grapheme-to-phoneme conversion by latent analogy. Speech Communication, 46(2):140–152.

Bentley, P. (1999). Evolutionary design by computers. Morgan Kaufmann.

Best, K.-H. (2006). Quantitative linguistik: Eine annäherung. 3., stark überarbeitete und ergänzte auflage.

Blackmore, S. (2000). The meme machine. Oxford University Press.

Booker, L. B., Goldberg, D. E., and Holland, J. H. (1989). Classifier systems and genetic algorithms. Artificial intelligence, 40(1):235–282.

Borges, J. L. (1952). El idioma analítico de john wilkins. Otras inquisiciones, pages 158–159.

Braine, M. D. (1971). On two types of models of the internalization of grammars. The ontogenesis of grammar, pages 153–186.

Braine, M. D. and Bowerman, M. (1976). Children's first word combinations. Monographs of the society for research in child development, pages 1–104.

Brams, S. J. (2011). Game theory and the humanities: bridging two worlds. MIT Press.

Brighton, H., Kirby, S., and Smith, K. (2003). Situated cognition and the role of multi-agent models in explaining language structure. In Adaptive agents and multi-agent systems, pages 88–109. Springer.

Broca, P. (1861). Remarque sur le siège de la faculté du langage articulé, suivie d'une observation d'aphémie (perte de la parole). Bulletin de la Société anatomique de Paris, 36:330–356.

Brodsky, P., Waterfall, H., and Edelman, S. (2007). Characterizing motherese: On the computational structure of child-directed language. In Proceedings of the 29th Cognitive Science Society Conference, ed. DS McNamara & JG Trafton, pages 833–38.

Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479.

Brown, R. (1958). Words and things.

Brown, R. (1973). A first language: The early stages. Harvard U. Press.

Bruner, J. S. and Watson, R. (1983). Child's talk: Learning to use language. Oxford University Press, Oxford.

Bryant, E. F. (2009). The yoga sutras of patanjali.

Buber, M. (1937). I and thou. Clark, Edinburgh.

Campbell, D. T. (1960). Blind variation and selective retentions in creative thought as in other knowledge processes. Psychological review, 67(6):380.

Campbell, D. T. (1974). An essay on evolutionary epistemology. The philosophy of Karl Popper, pages 413–463.

Champollion, J. F. (1822). Observations sur l'obélisque Égyptien de l'Ile de Philae.

Chomsky, N. (1957). Syntactic structures. Mouton.

Chomsky, N. (1959). A review of bf skinner's verbal behavior. Language, 35(1):26–58.

Chomsky, N. (1995). The minimalist program, volume 28. Cambridge Univ Press.

Chomsky, N. (2002). Syntactic structures. Walter de Gruyter.

Christodoulopoulos, C., Goldwater, S., and Steedman, M. (2010). Two decades of unsupervised pos induction: How far have we come? In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 575–584. Association for Computational Linguistics.

Clark, A. (2010). Distributional learning of some context-free languages with a minimally adequate teacher. In Grammatical Inference: Theoretical Results and Applications, pages 24–37. Springer.

Clark, E. (1987). The principle of contrast: A constraint on language acquisition. Mechanisms of language acquisition, pages 1–33.

Clark, E. V. (2003). First Language Acquisition. Cambridge University Press.

Cohen, T., Schvaneveldt, R., and Widdows, D. (2010). Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256.

Cohen, T., Widdows, D., Schvaneveldt, R. W., Davies, P., and Rindflesch, T. C. (2012). Discovering discovery patterns with predication-based semantic indexing. Journal of biomedical informatics, 45(6):1049–1065.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3):273–297.

Cosmides, L. and Tooby, J. (1997). Evolutionary psychology: A primer. Evolutionary Psychology: a primer.

Currier, P. (1970–1976). Voynich ms. transcription alphabet; plans for computer studies; transcribed text of herbal A and B material; notes and observations. Unpublished communications to John H. Tiltman and M. D'Imperio, Damariscotta, Maine.

Darwin, C. (1859). The Origin of Species. J. Murray.

Darwin, C. and Bettany, G. T. (1890). Journal of researches into the natural history and geology of the countries visited during the voyage of HMS "Beagle" round the world: under the command of Capt. Fitz Roy, RN. Ward, Lock.

Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262. ACM.

Dawkins, R. (1976). The selfish gene. Oxford University Press, Oxford.

De Chardin, P. T., Wall, B., et al. (1965). The phenomenon of man, volume 383. Harper & Row, New York, NY, USA.

de Saussure, F. (1916). Cours de la linguistique générale.

DeCasper, A. J. and Spence, M. J. (1986). Prenatal maternal speech influences newborns' perception of speech sounds. Infant behavior and Development, 9(2):133–150.

Dehaene, S. and Changeux, J.-P. (1989). A simple model of prefrontal cortex function in delayed-response tasks. Journal of Cognitive Neuroscience, 1(3):244–261.

Dennett, D. C. (1995). Darwin's dangerous idea. The Sciences, 35(3):34–40.

Devescovi, A., Caselli, M. C., Marchione, D., Pasqualetti, P., Reilly, J., and Bates, E. (2005). A crosslinguistic study of the relationship between grammar and lexical development. Journal of Child Language, 32(04):759–786.

d'Imperio, M. E. (1978). The voynich manuscript: an elegant enigma. Technical report, DTIC Document.

Dubremetz, M. (2013). Vers une identification automatique du chiasme de mots. TALN-RÉCITAL 2013, page 150.

Dupont, P. (1994). Regular grammatical inference from positive and negative samples by genetic search: the gig method. In Grammatical Inference and Applications, pages 236–245. Springer.

Edelman, G. M. (1987). Neural Darwinism: The theory of neuronal group selection. Basic Books.

Elbers, L. and Ton, J. (1985). Play pen monologues: the interplay of words and babbles in the first words period. Journal of Child Language, 12(03):551–565.

Ellis, R. and Wells, G. (1980). Enabling factors in adult-child discourse. First Language, 1(1):46–62.

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99.

Erjavec, T. (2004). Multext-east version 3: Multilingual morphosyntactic specifications, lexicons and corpora. In LREC.

Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Thal, D. J., Pethick, S. J., Tomasello, M., Mervis, C. B., and Stiles, J. (1994). Variability in early communicative development. Monographs of the society for research in child development, pages i–185.

Ferguson, C. A.
Cohen, T., Schvaneveldt, R., and Widdows, D. (2010). Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256. bibliography Cohen, T., Widdows, D., Schvaneveldt, R. W., Davies, P., and Rindflesch, T. C. (2012). Discovering discovery patterns with predication-based semantic indexing. Journal of biomedical informatics, 45(6):1049–1065. Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3):273–297. Cosmides, L. and Tooby, J. (1997). Evolutionary psychology: A primer. Evolutionary Psychology: a primer. Currier, P. (1970). 1976." voynich ms. transcription alphabet; plans for computer studies; transcribed text of herbal a and b material; notes and observations.". Unpublished communications to John H. Tiltman and M. D’Imperio, Damariscotta, Maine. Darwin, C. (1859). The Origin of Species. J. Murray. Darwin, C. and Bettany, G. T. (1890). Journal of researches into the natural history and geology of the countries visited during the voyage of HMS" Beagle" round the world: under the command of Capt. Fitz Roy, RN. Ward, Lock. Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262. ACM. Dawkins, R. (1976). The selfish gene. Oxford University Press Oxford. De Chardin, P. T., Wall, B., et al. (1965). The phenomenon of man, volume 383. Harper & Row New York, NY, USA:. de Saussure, F. (1916). Cours de la linguistique générale. DeCasper, A. J. and Spence, M. J. (1986). Prenatal maternal speech influences newborns’ perception of speech sounds. Infant behavior and Development, 9(2):133–150. Dehaene, S. and Changeux, J.-P. (1989). A simple model of prefrontal cortex function in delayed-response tasks. Journal of Cognitive Neuroscience, 1(3):244–261. Dennett, D. C. (1995). Darwin’s dangerous idea. The Sciences, 35(3):34– 40. Devescovi, A., Caselli, M. C., Marchione, D., Pasqualetti, P., Reilly, J., and Bates, E. (2005). A crosslinguistic study of the relationship between grammar and lexical development. Journal of Child Language, 32(04):759–786. d’Imperio, M. E. (1978). The voynich manuscript: an elegant enigma. Technical report, DTIC Document. 327 328 bibliography Dubremetz, M. (2013). Vers une identification automatique du chiasme de mots. TALN-RÉCITAL 2013, page 150. Dupont, P. (1994). Regular grammatical inference from positive and negative samples by genetic search: the gig method. In Grammatical Inference and Applications, pages 236–245. Springer. Edelman, G. M. (1987). Neural Darwinism: The theory of neuronal group selection. Basic Books. Elbers, L. and Ton, J. (1985). Play pen monologues: the interplay of words and babbles in the first words period. Journal of Child Language, 12(03):551–565. Ellis, R. and Wells, G. (1980). Enabling factors in adult-child discourse. First Language, 1(1):46–62. Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99. Erjavec, T. (2004). Multext-east version 3: Multilingual morphosyntactic specifications, lexicons and corpora. In LREC. Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Thal, D. J., Pethick, S. J., Tomasello, M., Mervis, C. B., and Stiles, J. (1994). Variability in early communicative development. Monographs of the society for research in child development, pages i–185. Ferguson, C. A. 
Fernando, C., Szathmáry, E., and Husbands, P. (2012). Selectionist and evolutionary approaches to brain function: A critical appraisal. Frontiers in Computational Neuroscience, 6.
Ferrer-i-Cancho, R. and Elvevåg, B. (2010). Random texts do not exhibit the real Zipf's-law-like rank distribution. PLoS One, 5(3):e9411.
Fillmore, C. J., Kay, P., and O'Connor, M. C. (1988). Regularity and idiomaticity in grammatical constructions: The case of let alone. Language, pages 501–538.
Fisher, R. A. (1925). Statistical methods for research workers. Genesis Publishing Pvt Ltd.
Flake, G. W. (1998). The computational beauty of nature: Computer explorations of fractals, chaos, complex systems, and adaptation. MIT Press.
Floridi, L. (2011). The Philosophy of Information. Oxford University Press.
Fodor, J. A. (1983). The modularity of mind: An essay on faculty psychology. MIT Press.
Fogel, D. B. (1995). Phenotypes, genotypes, and operators in evolutionary computation. In Evolutionary Computation, 1995, IEEE International Conference on, volume 1, page 193. IEEE.
Fogel, L. J., Owens, A. J., and Walsh, M. J. (1966). Artificial intelligence through simulated evolution.
Foster, M. L. (2002). Symbolism: The foundation of culture. Companion Encyclopedia of Anthropology, page 366.
Fraisse, P. (1974). Psychologie du rythme. Presses Universitaires de France, Paris.
Frege, G. (1994). Über Sinn und Bedeutung. Wittgenstein Studien, 1(1).
Furrow, D., Nelson, K., and Benedict, H. (1979). Mothers' speech to children and syntactic development: Some simple relationships. Journal of Child Language, 6(3):423–442.
Galton, F. (1875). English men of science: Their nature and nurture. D. Appleton.
Gärdenfors, P. (2004). Conceptual spaces: The geometry of thought. MIT Press.
Gardner, H. (1985a). Frames of mind: The theory of multiple intelligences. Basic Books.
Gardner, H. (1985b). The mind's new science. Basic Books.
Gardner, H. (2011). Frames of mind: The theory of multiple intelligences. Basic Books.
Gertner, S., Greenbaum, C. W., Sadeh, A., Dolfin, Z., Sirota, L., and Ben-Nun, Y. (2002). Sleep–wake patterns in preterm infants and 6 month's home environment: Implications for early cognitive development. Early Human Development, 68(2):93–102.
Gödel, K. (1931). Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38(1):173–198.
Gold, E. M. (1967). Language identification in the limit. Information and Control, 10(5):447–474.
Goldberg, D. E. (1990). Genetic algorithms in search, optimization & machine learning. Addison-Wesley.
Goldberg, D. E. and Holland, J. H. (1988). Genetic algorithms and machine learning. Machine Learning, 3:95–99.
Gómez, R. L. (2011). Memory, sleep and generalization in language acquisition. Experience, Variation and Generalization: Learning a First Language, 7:261.
Grice, H. (1975). Logic and conversation. In P. Cole and J. Morgan (eds.), Syntax and Semantics, Volume 3: Speech Acts.
Guermeur, Y. and Monfrini, E. (2011). A quadratic loss multi-class SVM for which a radius–margin bound applies. Informatica, 22(1):73–96.
Haeckel, E. (1879). The evolution of man. London: Kegan Paul.
Hamilton, W. D. (1963). The evolution of altruistic behavior. American Naturalist, pages 354–356.
Harris, M. (2013). Language experience and early language development: From input to uptake. Psychology Press.
Harris, Z. S. (1954). Distributional structure. Word.
Hebb, D. O. (1964). The Organization of Behaviour: A Neuropsychological Theory. John Wiley and Sons.
Hesse, H. (1967). Das Glasperlenspiel: Versuch e. Lebensbeschreibung d. Magisters Ludi Josef Knecht samt Knechts hinterlassenen Schriften, volume 842. Suhrkamp.
Hodgins, G. (2014). Forensic investigations of the Voynich ms. In Voynich 100 Conference, www.voynich.nu/mon2012/index.html.
Hofmann, T., Schölkopf, B., and Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220.
Holland, J. H. (1975). Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. U Michigan Press.
Holland, J. H. (1992). Genetic algorithms. Scientific American, 267(1):66–72.
Holly Smith, B., Crummett, T. L., and Brandt, K. L. (1994). Ages of eruption of primate teeth: A compendium for aging individuals and comparing life histories. American Journal of Physical Anthropology, 37(S19):177–231.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558.
Householder, F. W. (1981). Apollonius Dyscolus: The Syntax of Apollonius Dyscolus, volume 23. John Benjamins Publishing.
Hromada, D. (2008). 23 comments to the Chomskian doctrine. Personal communication with D. Sportiche.
Hromada, D. (2010a). Quantitative intercultural comparison by means of parallel PageRanking of diverse national Wikipedias.
Hromada, D. D. (2009). Basen o jablku. Master's thesis, Faculty of Humanities, Charles University, Prague, Czech Republic.
Hromada, D. D. (2010b). Concepts of «invasivity» and «reversibility» and their relation to past, present and future techniques of neural imagery. Course work written for the Bordeaux team of Neural Imagery affiliated to Ecole Pratique des Hautes Etudes.
Hromada, D. D. (2010c). SMILED: Sourire naturel et sourire artificiel. De l'utilisation d'OpenCV pour le tracking, la reconnaissance des expressions faciales et la détection du sourire. Master's thesis, Ecole Pratique des Hautes Etudes, Paris, France.
Hromada, D. D. (2011). Initial experiments with multilingual extraction of rhetoric figures by means of Perl-compatible regular expressions. In RANLP Student Research Workshop, pages 85–90.
Hromada, D. D. (2012a). From age & gender-based taxonomy of Turing test scenarios towards attribution of legal status to meta-modular artificial autonomous agents. page 7.
Hromada, D. D. (2012b). Variations upon the theme of evolutionary language game. Written for prof. Vladimir Kvasnicka, downloadable at http://wizzion.com/papers/2012/.
Hromada, D. D. (2013). Random projection and geometrization of string distance metrics. In RANLP, pages 79–85.
Hromada, D. D. (2014a). Comparative study concerning the role of surface morphological features in the induction of part-of-speech categories. In Text, Speech and Dialogue, pages 46–52. Springer.
Hromada, D. D. (2014b). Conditions for cognitive plausibility of computational models of category induction. In Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 93–105. Springer.
Hromada, D. D. (2014c). Empiric introduction to light stochastic binarization. In Text, Speech and Dialogue, pages 37–45. Springer.
Hromada, D. D. (2014d). Geometrizacia ontologii - pripadova studia SNOMED. Written for doc. Mikulas Popper.
Hromada, D. D. (2015). Genetic optimization of semantic prototypes for multiclass document categorization. Submitted to the Elitech 2015 conference.
Hromada, D. D. (2016a). Can evolutionary computation help us to crib the Voynich manuscript? Submitted to the JADT 2016 conference.
Hromada, D. D. (2016b). Evolutionary induction of 4-schema microgrammars from CHILDES corpora. Submitted to the journal Evolutionary Computation.
Hromada, D. D. (2016c). Evolutionary induction of a lightweight morphosemantic classifier. Submitted to Computational Linguistics.
Hromada, D. D. (2016d). Evolutionary Models of Ontogeny of Linguistic Categories: Four Simulations. PhD thesis, Slovak Technical University and University Paris Lumieres.
Hromada, D. D. (2016e). Fast and frugal retrieval of linguistic universalia from CHILDES transcripts. Submitted to the JADT 2016 conference.
Hromada, D. D. (2016f). Narrative fostering of morality in artificial agents: Constructivism, machine learning and story-telling. In L'esprit au-delà du droit: Pour un dialogue entre les sciences cognitives et le droit. Mare et Martin.
Hromada, D. D. and Gaudiello, I. (2014). Introduction to moral induction model and its deployment in artificial agents. In Sociable Robots and the Future of Social Relations, pages 209–216. IOS Press.
Hromada, D. D., Tijus, C., Poitrenaud, S., and Nadel, J. (2010). Zygomatic smile detection: The semi-supervised Haar training of a fast and frugal system: A gift to the OpenCV community. In Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2010 IEEE RIVF International Conference on, pages 1–5. IEEE.
Huizinga, J. (1956). Homo Ludens: Vom Ursprung der Kultur im Spiel.
Imai, M. and Haryu, E. (2001). Learning proper nouns and common nouns without clues from syntax. Child Development, 72(3):787–802.
Jackendoff, R. (2002). Foundations of language: Brain, meaning, grammar, evolution. Oxford University Press.
Jakobson, R. (1960). Why "mama" and "papa". Essays in Honor of Heinz Werner.
Jiménez López, M. D. et al. (2000). Grammar systems: A formal-language-theoretic framework for linguistics and cultural evolution.
Johnson, K. (2004). Gold's theorem and cognitive science. Philosophy of Science, 71(4):571–592.
Jones, W. (1788). The third anniversary discourse, delivered 2 February 1786. Asiatick Researches, 1:415–431.
Jusczyk, P. W. and Aslin, R. N. (1995). Infants' detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29(1):1–23.
Jusczyk, P. W., Cutler, A., and Redanz, N. J. (1993). Infants' preference for the predominant stress patterns of English words. Child Development, 64(3):675–687.
Karmiloff, K. and Karmiloff-Smith, A. (2009). Pathways to language: From fetus to adolescent. Harvard University Press.
Karpathy, A. and Fei-Fei, L. (2014). Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306.
Karpathy, A. and Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.
Karypis, G. (2002). CLUTO: A clustering toolkit. Technical report, DTIC Document.
Kauffman, S. (1995). At home in the universe: The search for the laws of self-organization and complexity. Oxford University Press.
Kelemen, J. (2004). Miracles, colonies, and emergence. In Formal Languages and Applications, pages 323–333. Springer.
Kelemenová, A. and Csuhaj-Varjú, E. (1994). Languages of colonies. Theoretical Computer Science, 134(1):119–130.
Keller, B. and Lutz, R. (1997). Evolving stochastic context-free grammars from examples using a minimum description length principle. In 1997 Workshop on Automata Induction, Grammatical Inference and Language Acquisition. Citeseer.
Kennedy, G. and Churchill, R. (2005). The Voynich manuscript: The unsolved riddle of an extraordinary book which has defied interpretation for centuries. Orion Publishing Company.
Kennedy, J., Eberhart, R. C., and Shi, Y. (2001). Swarm intelligence. Morgan Kaufmann.
Keysers, C. and Perrett, D. I. (2004). Demystifying social cognition: A Hebbian perspective. Trends in Cognitive Sciences, 8(11):501–507.
Komenský, J. A., Okál, M., and Pšenák, J. (1991). Veľká didaktika: Didactica magna. Slovenské pedagogické nakladateľstvo.
Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection, volume 1. MIT Press.
Kuczaj, S. A. and Maratsos, M. P. (1975). What children can say before they will. Merrill-Palmer Quarterly of Behavior and Development, pages 89–111.
Kuhn, T. S. (2012). The structure of scientific revolutions. University of Chicago Press.
Küntay, A. and Slobin, D. I. (1996). Listening to a Turkish mother: Some puzzles for acquisition. Social Interaction, Social Context, and Language: Essays in Honor of Susan Ervin-Tripp, pages 265–286.
Küntay, A. and Slobin, D. I. (2002). Putting interaction back into child language: Examples from Turkish. Psychology of Language and Communication, 6(1).
Kvasnicka, V. and Pospichal, J. (1999). An emergence of coordinated communication in populations of agents. Artificial Life, 5(4):319–342.
Kvasnicka, V. and Pospichal, J. (2007). Evolúcia jazyka a univerzální darwinizmus. Myseľ, inteligencia a život.
Labov, W. and Labov, T. (1978). The phonetics of cat and mama. Language, pages 816–852.
Lakoff, G. (1990). Women, fire, and dangerous things: What categories reveal about the mind. University of Chicago Press.
Lama, D. et al. (2005). In the Buddha's words: An anthology of discourses from the Pali Canon. Simon and Schuster.
Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.
Landini, G. and Zandbergen, R. (1998). A well-kept secret of mediaeval science: The Voynich manuscript. Aesculapius, 18:77–82.
Lashley, K. (1950). In search of the engram. Symposia of the Society for Experimental Biology.
Lauer, F. and Guermeur, Y. (2011). MSVMpack: A multi-class support vector machine package. The Journal of Machine Learning Research, 12:2293–2296.
Li, W. (1992). Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6):1842–1845.
Lieven, E. V., Pine, J. M., and Baldwin, G. (1997). Lexically-based learning and early grammatical development. Journal of Child Language, 24(1):187–219.
Lorenz, K. (1973). Die Rückseite des Spiegels. R. Piper.
Lotka, A. J. (1925). Elements of physical biology.
MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. California, USA.
MacWhinney, B. (1987). The competition model. Mechanisms of Language Acquisition, pages 249–308.
MacWhinney, B. (2014). The CHILDES project: Tools for analyzing talk, Volume I: Transcription format and programs. Psychology Press.
MacWhinney, B. and Snow, C. (1985). The child language data exchange system. Journal of Child Language, 12(2):271–295.
MacWhinney, B. and Snow, C. (1991). CHILDES manual.
Maratsos, M. (1988). The acquisition of formal word classes. Categories and Processes in Language Acquisition, pages 31–44.
Marchman, V. A. and Bates, E. (1994). Continuity in lexical and morphological development: A test of the critical mass hypothesis. Journal of Child Language, 21(2):339–366.
Marcus, G. F. (1993). Negative evidence in language acquisition. Cognition, 46(1):53–85.
Markman, E. M. and Hutchinson, J. E. (1984). Children's sensitivity to constraints on word meaning: Taxonomic versus thematic relations. Cognitive Psychology, 16(1):1–27.
Maynard Smith, J. (1986). The problems of biology, volume 144. Oxford University Press, Oxford.
McAuley, J. D., Jones, M. R., Holub, S., Johnston, H. M., and Miller, N. S. (2006). The time of our lives: Life span development of timing and event tracking. Journal of Experimental Psychology: General, 135(3):348.
Mehler, J., Jusczyk, P., Lambertz, G., Halsted, N., Bertoncini, J., and Amiel-Tison, C. (1988). A precursor of language acquisition in young infants. Cognition, 29(2):143–178.
Menyuk, P., Liebergott, J., Schultz, M., Chesnick, M., and Ferrier, L. (1991). Patterns of early lexical and cognitive development in premature and full-term infants. Journal of Speech, Language, and Hearing Research, 34(1):88–94.
Miller, G. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81–97.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
Mink, J. and Blumenschine, R. (1981). Ratio of central nervous system to body metabolism in vertebrates: Its constancy and functional basis. Am J Physiol, 241(3):R203–R212.
Minsky, M. and Papert, S. (1969). Perceptrons. MIT Press.
Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of machine learning. MIT Press.
Morgan, J. L. and Saffran, J. R. (1995). Emerging integration of sequential and suprasegmental information in preverbal speech segmentation. Child Development, 66(4):911–936.
Morgan, T. H. (1916). A Critique of the Theory of Evolution. Princeton University Press.
Mouillot, D. and Lepretre, A. (2000). Introduction of relative abundance distribution (RAD) indices, estimated from the rank-frequency diagrams (RFD), to assess changes in community diversity. Environmental Monitoring and Assessment, 63(2):279–295.
Nelson, K. (1977). The syntagmatic-paradigmatic shift revisited: A review of research and theory. Psychological Bulletin, 84(1):93.
Nelson, K. (2006). Narratives from the crib. Harvard University Press.
Newbold, W. R. (1928). The Cipher of Roger Bacon.
Nowak, M. A., Plotkin, J. B., and Krakauer, D. C. (1999). The evolutionary language game. Journal of Theoretical Biology, 200(2):147–162.
Ofria, C. and Wilke, C. O. (2004). Avida: A software platform for research in computational evolutionary biology. Artificial Life, 10(2):191–229.
O'Neil, M. and Ryan, C. (2003). Grammatical evolution. Springer.
Pagel, M., Atkinson, Q. D., Calude, A. S., and Meade, A. (2013). Ultraconserved words point to deep language ancestry across Eurasia. Proceedings of the National Academy of Sciences, 110(21):8471–8476.
Páleš, E. (1994). Sapfo – parafrázovač slovenčiny. Veda, vydavateľstvo SAV.
Piaget, J. (1947). La psychologie de l'intelligence.
Piaget, J. (1965). The Moral Judgment of the Child. The Free Press.
Piaget, J. (1974). Introduction à l'épistémologie génétique. PUF, Paris.
Piatelli-Palmarini, M. (1980). Language and learning: The debate between Jean Piaget and Noam Chomsky.
Pine, J. M. and Lieven, E. V. (1997). Slot and frame patterns and the development of the determiner category. Applied Psycholinguistics, 18(2):123–138.
Pinker, S. (1994). The language instinct: The new science of language and mind, volume 7529. Penguin UK.
Pinker, S. (2000). Survival of the clearest. Nature, 404(6777):441–442.
Planck, M. (1926). Über die Begründung des zweiten Hauptsatzes der Thermodynamik. Sitzungsberichte der Preussischen Akademie der Wissenschaften.
Plato (380 BC). Republic.
Pohlheim, H. (1996). GEATbx: Genetic and evolutionary algorithm toolbox for use with Matlab. Documentation online at http://www.geatbx.com/docu/algindex.html (accessed May 2004).
Poincaré, H. (1908). L'invention mathématique.
Poincaré, H. and Magini, R. (1899). Les méthodes nouvelles de la mécanique céleste. Il Nuovo Cimento (1895-1900), 10(1):128–130.
Popper, K. R. (1972). Objective knowledge: An evolutionary approach. Clarendon Press, Oxford.
Price, G. R. et al. (1970). Selection and covariance. Nature, 227:520–521.
Provasi, J., Anderson, D. I., and Barbu-Roth, M. (2014). Rhythm perception, production, and synchronization during the perinatal period. Frontiers in Psychology, 5.
Ray, T. S. (1992). Evolution, ecology and optimization of digital organisms. Santa Fe Institute working paper.
Rechenberg, I. (1971). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Dr.-Ing. thesis, Technical University of Berlin, Department of Process Engineering.
Rizzolatti, G., Sinigaglia, C., and Anderson, F. T. (2008). Mirrors in the brain: How our minds share actions and emotions. Oxford University Press.
Roffwarg, H. P., Muzio, J. N., and Dement, W. C. (1966). Ontogenetic development of the human sleep-dream cycle. Science.
Rosch, E. (1999). Principles of categorization. Concepts: Core Readings, pages 189–206.
Rosch, E. and Mervis, C. B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7(4):573–605.
Rosenberg, A. and Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL, volume 7, pages 410–420.
Rudolph, G. (1994). Convergence analysis of canonical genetic algorithms. IEEE Transactions on Neural Networks, 5(1):96–101.
Rugg, G. (2004). An elegant hoax? A possible solution to the Voynich manuscript. Cryptologia, 28(1):31–46.
Sagae, K., Davis, E., Lavie, A., MacWhinney, B., and Wintner, S. (2007). High-accuracy annotation and parsing of CHILDES transcripts. In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pages 25–32. Association for Computational Linguistics.
Sahlgren, M. (2005). An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE, volume 5.
Salakhutdinov, R. and Hinton, G. (2009). Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978.
Samuel, A. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210.
Schinner, A. (2007). The Voynich manuscript: Evidence of the hoax hypothesis. Cryptologia, 31(2):95–107.
Schleicher, A. (1873). Die Darwinsche Theorie und die Sprachwissenschaft: Offenes Sendschreiben an Herrn Ernst Häckel, volume 2. Böhlau.
Schmidt, J. (1872). Die verwantschaftsverhältnisse der indogermanischen sprachen. Böhlau.
Schwartz, R. G. and Leonard, L. B. (1982). Do children pick and choose? An examination of phonological selection and avoidance in early lexical acquisition. Journal of Child Language, 9(2):319–336.
Schwartz, R. G. and Terrell, B. Y. (1983). The role of input frequency in lexical acquisition. Journal of Child Language, 10(1):57–64.
Sekaj, I. (2004). Robust parallel genetic algorithms with reinitialisation. In Parallel Problem Solving from Nature - PPSN VIII, pages 411–419. Springer.
Sekaj, I. (2005). Evolučné výpočty a ich využitie v praxi. Iris.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656.
Simonton, D. K. (1999). Creativity as blind variation and selective retention: Is the creative process Darwinian? Psychological Inquiry, 10(4):309–328.
Skinner, B. F. (1957). Verbal Behavior. Appleton-Century-Crofts.
Sklyarov, V. and Skliarova, I. (2014). Hamming weight counters and comparators based on embedded DSP blocks for implementation in FPGA. Advances in Electrical and Computer Engineering, 14(2):63–68.
Slobin, D. I. (1973). Cognitive prerequisites for the development of grammar. Studies of Child Language Development, 1:75–208.
Smith, T. C. and Witten, I. H. (1995). A genetic algorithm for the induction of natural language grammars. In Proc. of IJCAI-95 Workshop on New Approaches to Learning for Natural Language Processing, pages 17–24.
Sokol, J. (1998). Malá filosofie člověka: Slovník filosofických pojmů. Vyšehrad.
Solan, Z., Horn, D., Ruppin, E., and Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences of the United States of America, 102(33):11629–11634.
Sosík, P. and Štýbnar, L. (1997). Grammatical inference of colonies. In New Trends in Formal Languages, pages 236–246. Springer.
Spears, W. M., De Jong, K. A., Bäck, T., Fogel, D. B., and De Garis, H. (1993). An overview of evolutionary computation. In Machine Learning: ECML-93, pages 442–459. Springer.
Spencer, H. (1894). Education: Intellectual, moral, and physical. CW Bardeen.
Strong, L. C. (1945). Anthony Askham, the author of the Voynich manuscript. Science, 101(2633):608–609.
Suciu, A., Cobarzan, P., and Marton, K. (2011). The never ending problem of counting bits efficiently. In 10th RoEduNet International Conference, pages 1–4. IEEE.
Swadesh, M. (1952). Lexico-statistic dating of prehistoric ethnic contacts: With special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society, pages 452–463.
Tomasello, M. (2009). Constructing a language: A usage-based theory of language acquisition. Harvard University Press.
Tomasello, M., Akhtar, N., Dodson, K., and Rekau, L. (1997). Differential productivity in young children's use of nouns and verbs. Journal of Child Language, 24(2):373–387.
Tomita, M. (1982). Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, pages 105–108.
Tononi, G. (2004). An information integration theory of consciousness. BMC Neuroscience, 5(1):42.
Trevarthen, C. (1993). The self born in intersubjectivity: The psychology of an infant communicating.
Trivers, R. (1972). Parental investment and sexual selection.
Turing, A. M. (1939). Systems of logic based on ordinals. Proceedings of the London Mathematical Society, s2-45:161–228.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, pages 433–460.
Ventris, M. and Chadwick, J. (1953). Evidence for Greek dialect in the Mycenaean archives. The Journal of Hellenic Studies, 73:84–103.
Vygotsky, L. S. (1978). Mind in society: The development of higher mental processes.
Vygotsky, L. S. (1987). Thinking and speech. The Collected Works of L. S. Vygotsky, 1:39–285.
Wall, L. et al. (1994). The Perl programming language.
Watson, J. D., Crick, F. H., et al. (1953). Molecular structure of nucleic acids. Nature, 171(4356):737–738.
Werker, J. F. and Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1):49–63.
Wernicke, C. (1874). Der aphasische Symptomencomplex.
Widdows, D. and Cohen, T. (2014). Reasoning with vectors: A continuous model for fast robust inference. Logic Journal of IGPL, page jzu028.
Wiener, N. (1961). Cybernetics or Control and Communication in the Animal and the Machine, volume 25. MIT Press.
Wilson, E. O. (2000). Sociobiology: The new synthesis. Harvard University Press.
Wittgenstein, L. (1922). Tractatus logico-philosophicus. Kegan Paul.
Wittgenstein, L. (1934). The blue book.
Wittgenstein, L. (1953). Philosophical Investigations. Blackwell.
Wolff, J. G. (1988). Learning syntax and meanings through optimization and distributional analysis. Categories and Processes in Language Acquisition, 1(1).
Wright, S. (1932). The roles of mutation, inbreeding, crossbreeding, and selection in evolution, volume 1.
Zipf, G. K. (1949). Human behavior and the principle of least effort. Addison-Wesley.

CONTENTS

i  theses
1  initial thesis
   1.1  mind (DEF)
   1.2  to evolve (DEF)
2  hard thesis
   2.1  Evolution (DEF)
   2.2  Learning (DEF)
   2.3  Form (DEF)
   2.4  First condition of HT's validity (DEF)
   2.5  Second condition of HT's validity (DEF)
   2.6  Third condition of HT's validity (DEF)
   2.7  Fourth condition of HT's validity (DEF)
   2.8  Brain (DEF)
   2.9  2nd law of thermodynamics (DEF)
   2.10  Connectionist explanation of non-locality (TXT)
   2.11  Alternative explanation of non-locality (TXT)
3  soft thesis
   3.1  Evolutionary computation (DEF)
   3.2  Successful Simulation (DEF)
   3.3  Cognitive Plausibility (DEF)
4  softer thesis
   4.1  Natural language (DEF)
   4.2  Why natural language? (APH)
5  softest thesis
   5.1  Toddlerese (DEF)
   5.2  Child (DEF)
6  operational thesis
   6.1  Text (DEF)
7  summa i
   7.1  Trial and Error (DEF)

ii  paradigms
8  universal darwinism
   8.1  Biological evolution
   8.2  Evolutionary Psychology
   8.3  Memetics
   8.4  Evolutionary Epistemology
      8.4.1  Phylogeny
      8.4.2  Ontogeny
      8.4.3  Individual Creativity
      8.4.4  Genetic Epistemology
   8.5  Evolutionary Linguistics
      8.5.1  Ethnogeny (DEF)
   8.6  Neural and Mental Darwinism
   8.7  Evolutionary computation
      8.7.1  Genetic algorithms
         Fitness Proportional Selection (SRC)
         Fitness functions and fitness landscapes
         Canonical Genetic Algorithms
         Parallel Genetic Algorithms
      8.7.2  Evolutionary programming & evolutionary strategies
      8.7.3  Genetic programming
         Grammatical evolution
      8.7.4  Tierra
      8.7.5  Evolutionary Language Game
9  developmental psycholinguistics
   9.1  Language Development (DEF)
      9.1.1  Central dogma of DP (DEF)
   9.2  Development of Toddlerese
      9.2.1  Ontogeny of prosody, phonetics and phonology
      9.2.2  Canonical Babbling
      9.2.3  Ontogeny of lexicon and semantics
         The Principle of Contrast (DEF)
      9.2.4  Ontogeny of morphosyntax
         Compositionality (DEF)
         Pivot schema (DEF)
         The principle of precedence of the specific (DEF)
         Principle of distributed practice (DEF)
      9.2.5  Ontogeny of pragmatics
      9.2.6  Physiological and cognitive development
   9.3  Motherese
         Variation sets
   9.4  Language Acquisition Paradigms
      9.4.1  Classical
      9.4.2  Generativists and Nativists
         Panini's Grammar (APH)
         Shiva Sutras (SRC)
         Refutation of Gold's Theorem (APH)
      9.4.3  Empiricists and Constructivists
      9.4.4  Socio-pragmatic and Usage-based paradigms
         Format (DEF)
10  computational linguistics
   10.1  Quantitative and corpus linguistics
      10.1.1  Zipf's law
      10.1.2  Logistic law
   10.2  Formal Language Theory
      10.2.1  Basic tenets (DEF)
         Grammar and Rule (DEF)
      10.2.2  Chomsky-Schützenberger hierarchy (TXT)
      10.2.3  Grammar System Theory (TXT)
         Language Colony (DEF)
   10.3  Natural Language Processing
      10.3.1  Machine learning
         Evaluation
         Precision and Recall (DEF)
         V-measure (DEF)
   10.4  Semantic Vector Architectures
      10.4.1  Category Prototype (DEF)
      10.4.2  Hebb-Harris Analogy (APH)
      10.4.3  Bag-of-Terms
         TF-IDF
      10.4.4  Latent Semantic Analysis (TXT)
      10.4.5  Random Indexing (TXT)
      10.4.6  Light Stochastic Binarization
      10.4.7  Evolutionary Localization of Semantic Attractors
   10.5  Part-of-speech induction
      10.5.1  Non-evolutionary POS-i
      10.5.2  Evolutionary
   10.6  Grammar induction
      10.6.1  Existing non-evolutionary approaches
      10.6.2  Existing evolutionary approaches
11  summa ii

iii  observations
12  qualitative
   12.1  Method and data collection
      12.1.1  Biases
   12.2  Subject
   12.3  Linguistic environment
   12.4  Crying and Babbling
   12.5  First words
      12.5.1  NENE & taboo (APH)
   12.6  Repetitions and Replications
   12.7  First constructions
      12.7.1  First word combinations
      12.7.2  First pivot(s)
      12.7.3  First micro-grammars
   12.8  Mutations
      12.8.1  Context-free substitutions
      12.8.2  Context-sensitive substitution
         Context-sensitive substitutions (EXT)
   12.9  Case study of semantic mutations: The DING-DONG mystery (APH)
      12.9.1  First transpositions
         Context-sensitive metatheses (EXT)
   12.10  Crossovers
      12.10.1  Multilingual crossovers
         Intralexical crossovers
         Intraphrastic crossovers
         Of crossover and calques (APH)
   12.11  Monolingual crossovers
      12.11.1  Intralexical
      12.11.2  Interlexical
      12.11.3  Intraphrastic crossovers
         Of crossover and overgeneralizations (APH)
   12.12  Other phenomena
      12.12.1  Multilingual C-scheme Mismatch
      12.12.2  Compression of Information
13  quantitative
   13.1  Method
   13.2  Data
   13.3  Universals
      13.3.1  Letters
      13.3.2  N-grams
      13.3.3  Intrasubjective replications
         Intralocutory duplications
         Translocutory replications
      13.3.4  Intersubjective replications
   13.4  English-specific
      13.4.1  Utterance-level Constructions
      13.4.2  Pivot schemas
      13.4.3  Pivot instances
      13.4.4  Pivot grammars
14  summa iii
   14.1  Crossroads of thoughts
      14.1.1  The linguistic crossover principle
      14.1.2  Of crossovers and analogies (APH)
   14.2  Axes of analysis
   14.3  The source of variation
      14.3.1  Extending usage-based paradigm (TXT)
   14.4  From selection to replication
      14.4.1  The principle of exogenous selection (DEF)
      14.4.2  MPR Precept (APH)
         Love (DEF)

iv  simulations
15  breaking into unknown code
   15.1  Generic Introduction
   15.2  Abstract
   15.3  Introduction
      15.3.1  Pre-digital tentatives
      15.3.2  Post-digital tentatives
      15.3.3  Our position
      15.3.4  Primary Mapping
      15.3.5  Three Conjectures
   15.4  Method
      15.4.1  Calendar
      15.4.2  Cribbing
      15.4.3  Optimization
   15.5  Experiments
      15.5.1  Slavic crib
      15.5.2  Hebrew crib
   15.6  Conclusion
   15.7  Generic Conclusion
16  evolutionary localization of semantic prototypes
   16.1  Generic Introduction
   16.2  Introduction
      16.2.1  Geometrization of Categories
      16.2.2  Radical Dimensionality Reduction
   16.3  Genetic Localization of Semantic Prototypes
   16.4  Corpus and Training Parameters
   16.5  Evaluation and Results
   16.6  Conclusion
   16.7  Generic Conclusion
17  evolutionary induction of a lightweight morphosemantic classifier
   17.1  Generic Introduction
   17.2  Introduction
      17.2.1  From planes to prototypes
      17.2.2  From prototypes to constellations
      17.2.3  From constellations to lightweight classifiers
   17.3  Method
      17.3.1  Corpus
         Classes
         Pre-processing
      17.3.2  Algorithm
         Vector Space Preparation
         Evolutionary Optimization
         Parameters
      17.3.3  Evaluation
   17.4  Discussion of Results
   17.5  Conclusions
      17.5.1  Computational conclusion
      17.5.2  Psycholinguistic conclusion
   17.6  Generic Discussion
   17.7  Second Simulation Bibliography
18  evolutionary induction of 4-schema microgrammars from childes corpora
   18.1  General Introduction
   18.2  Introduction
      18.2.1  Two extremes
      18.2.2  Definitions
         G-Category (DEF)
         H∆-Category (DEF)
         N∆-Schema (DEF)
   18.3  Model
      18.3.1  Vector Space Preparation
      18.3.2  Bridging the Sub-symbolic and Symbolic realms
         Genotype
         Phenotype
      18.3.3  Fitness Function
      18.3.4  Evolutionary Strategy
      18.3.5  Evolution of both centroids and radii
      18.3.6  Pseudo-random initialization of 0th population
      18.3.7  Locus-constrained cross-over
      18.3.8  Re-focusing strategy
   18.4  Simulation
      18.4.1  Corpus
      18.4.2  Parameters
   18.5  Observations
      18.5.1  Diachronic observations
   18.6  Conclusion
   18.7  General Discussion

v  summa
19  summa summarum

bibliography

LIST OF FIGURES
Figure 1: Main notions of this dissertation.
Figure 2: Distinction between "connectionist" (a) and "alternative" (b) representations of the same data. It is evident that the latter allows for more structural variation than the former.
Figure 3: Cognitive Hexagram.
Figure 4: One-point and two-point crossovers. Figures reproduced from Morgan (1916).
Figure 5: Schleicher's Stammbaum of the family of Indo-European languages. Reproduced from Schleicher (1873).
Figure 6: Possible mechanism of replication of patterns of synaptic connections between neuronal groups. Reproduced from Fernando et al. (2012).
Figure 7: Basic genetic algorithm schema. Reproduced from Pohlheim (1996).
Figure 8: Possible fitness landscape for a problem with only one variable. Horizontal axis represents the gene's value, vertical axis represents fitness.
Figure 9: Different architectures of Parallel Genetic Algorithms. Reproduced from Sekaj (2004).
Figure 10: Sequence of steps constructing the program sqrt(x+5).
Figure 11: Sequence of transformations from genotype to phenotype in both grammatical-evolution and biological systems. Figure reproduced from O'Neil and Ryan (2003).
Figure 12: A case whereby mutual alignment of sound-meaning mappings can be useful. Reproduced from Kvasnicka and Pospichal (2007)'s reproduction of Pinker (2000).
Figure 13: Development of productive vocabulary in early (a) and late (b) toddlerese. Figures reproduced from Fenson et al. (1994).
Figure 14: Corpus of two-word utterances produced by the toddler Andrew. Reproduced from Braine and Bowerman (1976).
Figure 15: Mean length of utterances produced by English and Italian children of different ages (in months). Figures reproduced from Devescovi et al. (2005).
Figure 16: Some modalities of information exchange between mother and her child. Reproduced from Trevarthen (1993).
Figure 17: Logistic law in relation to historic and ontogenetic linguistic processes. Data taken from Best (2006).
Figure 18: Emergence of "miraculous" infinite generative capacity by means of interlock of two finite grammars. Figure reproduced from Kelemen (2004).
Figure 19: Comparison of reflective LSB (with I=2 iterations) and unreflective LSB (I=0) with Semantic Hashing and binarized Latent Semantic Analysis. Reproduced from Hromada (2014c).
Figure 20: Equivalence classes and production rules induced from English language samples. Fig. reproduced from Wolff (1988).
Figure 21: Equivalence classes and production rules induced from English language samples by the ADIOS algorithm. Reproduced from Solan et al. (2005).
Figure 22: Finite state automaton matching all strings over (1 + 0)* without an odd number of consecutive 0's after an odd number of consecutive 1's. Reproduced from Tomita (1982).
Figure 23: Grammars induced from nine different POS-tagged corpora. Reproduced from Aycinena et al. (2003).
Figure 24: Two simple grammars covering the sentence "the dog saw a cat". Fig. reproduced from Smith and Witten (1995).
Figure 25: First differentiation between the whole and its part (a) and its evolutionary explanation (b).
Figure 26: Drawing from folio f84r containing the primary mapping.
Figure 27: Evolution of individuals adapting labels in the Calendar to names listed in the Slavic crib.
Figure 28: Evolution of individuals adapting labels in the Calendar to names listed in the Hebrew cribs.
Figure 29: Retrieval and 20-class classification performance in 128-dimensional binary spaces. Non-LSB results are reproduced from Figure 6 of the study (Salakhutdinov and Hinton, 2009), plain LSB from (Hromada, 2014c).
Figure 30: Evolutionary optimization increases the precision of a multi-class classifier. Curves represent results averaged across diverse runs (R = 6*100 for CANONIC, R = 6 for MERGE1).
Figure 31: Centroidal tessellation of twelve data-points belonging to three distinct classes. Dots represent data-points, crosses are category prototypes and colors denote category membership. Black lines denote tessellation boundaries.
Figure 32: Data flow among main components of INDUCTOR. Lime color denotes components related to evolutionary optimization, royal blue color denotes components of the preliminary VSP phase.
Figure 33: Data flow among main components of the extended variant of INDUCTOR introducing a syntagmatic-paradigmatic feedback loop.

LIST OF TABLES

Table 1: Conceptual parallels between biological and linguistic evolution. Table partially reproduced from Atkinson and Gray (2005).
Table 2: Children avoid production of words with unknown characteristics. Reproduced from a table in Clark (2003) based on data in Schwartz and Leonard (1982).
Table 3: Words produced by at least half of children in the monthly sample. Reproduced from a table in Clark (2003) based on data from Fenson et al. (1994).
Table 4: Case of development of word|meaning mappings. Based on data in Barrett (1978).
Table 5: Utterances classified as tokens of the four major types of motherese. Reproduced from table 4.2 in (Bruner and Watson, 1983, pp. 79–80).
Table 7: Vectorial representations of three sentence-sized documents. Every distinct word yields a distinct column.
Table 8: Vectorial representations of sentence-sized documents D1 = "mama má emu" and D2 = "mama má mamu". Every distinct character trigram yields a distinct column.
Table 9: K-means clustering of tokens according to both suffixal and co-occurrence information. Table partially reproduced from Hromada (2014b).
Table 10: IM's productive lexicon before attainment of 18 months. Words in the brackets denote the most plausible meaning, as decoded by either father (F) or mother (M). Compare with Table 3.
Table 11: IM's seeding grammar: AUCH at the ultimate position.
Table 12: Seeding grammar extended: AUCH in the central position.
Table 13: Another AUCH-centered paradigm.
Table 14: Interlinguistic micro-grammar.
Table 15: Recapitulation of crossover types observed in IM's production.
Table 16: Activity of different speakers in two age groups.
Table 17: Repartition of languages in the studied corpus.
Table 18: 20 most frequent graphemes according to speakers and age groups.
Table 19: 20 most frequent bigrams according to speakers and age groups.
Table 20: 10 most frequent trigrams according to speakers and age groups.
Table 21: Duplicated expressions and numbers of child-originated and child-directed utterances in which they occur.
Table 22: Probability that an utterance shall contain at least one adjacently duplicated 2+gram.
Table 23: Most frequent translocutory 3+grams.
Table 24: Probability that both parts of an utterance couplet shall contain at least one identical 3+gram.
Table 25: Most frequent words replicated from child to mother (CHI_INIT) and mother to child (MOT_INIT).
Table 26: Basic statistics concerning the replication of 3+grams between mother and child.
Table 27: Distributions of occurrences of the marker for laughing in diverse subsets of the CHILDES corpus.
Table 28: Counts related to morphologically annotated English-language transcripts analyzed in this section.
Table 29: Most frequent utterance-level constructions produced by English-speaking mothers and children in 2 phases of their development.
Table 30: Correlations between distributions of frequencies of utterances.
Table 31: Number of distinct utterances in diverse datasets and entropies of their distributions.
Table 32: Thirty 8+grams with highest score_pivoteness.
Table 33: Ten CHILD-produced pivot7 schemas with highest contextual entropy (in shannons).
Table 34: CHILDES utterances most frequently instantiating some pivot7 schema.
Table 35: Pivot-instantiating CHILDES utterances pronounced by the biggest number of distinct children.
Table 36: Most popular instances of pivot "^that's X".
Table 37: Most popular instances of pivot "^I want X".
Table 38: Most popular instances of pivot "X little Y".
Table 39: Interphrastic crossover behind Abe's "boy can't eat his carrots".
Table 40: Fittest chromosomes which map reversed tokens in the Calendar onto names of the Slavic crib.
Table 41: Five classes of interest, their corresponding CHILDES part-of-speech tags, and some example word types which instantiate them.
Table 42: Parameters of simulation 2.
Table 43: Overall results of five different approaches. GA results have been averaged across diverse runs (R = 6*100 for CANONIC, R = 6 for MERGE1).
Table 44: MSVM2 training corpus confusion matrix.
Table 45: MSVM2 testing corpus confusion matrix.
Table 46: Training corpus confusion matrix produced by FITTEST(GA_MERGE1).
Table 47: Testing corpus confusion matrix produced by FITTEST(GA_MERGE1).
Table 48: Testing corpus tokens closest to prototypes of ACTION, SUBSTANCE and PROPERTY encoded in the FITTEST(GA_MERGE1) constellation, with Hamming distance H(token, prototype) and the tokens' CHILDES part-of-speech annotations. False positives are marked in bold font.
Table 49: Words of CorpusMini and hexadecimal representations of their potential hashes.
Table 50: A candidate genotype which could potentially be induced from the hypothetic CorpusMini.
Table 51: Parameters of diverse components of the INDUCTOR algorithm.

ACRONYMS

CL     Computational Linguistics
DP     Developmental Psycholinguistics
EC     Evolutionary Computing
EL     Evolutionary Linguistics
ES     Evolutionary Strategy
ET     Evolutionary Theory
FLT    Formal Language Theory
GA     Genetic Algorithm
GE     Genetic Epistemology
GS     Grammar System
GI     Grammar Induction | Grammar Inference
HT     Hard Thesis
LA     Language Acquisition
LD     Language Development
MDL    Minimal Description Length
MLU    Mean Length of Utterance
ND     Neural Darwinism
NLP    Natural Language Processing
OT     Operational Thesis
POS-i  Part-of-Speech Induction
POS-t  Part-of-Speech Tagging
ST     Soft Thesis
S2T    Softer Thesis
S3T    Softest Thesis
UD     Universal Darwinism
VD     Vocabulary Development

ACKNOWLEDGEMENTS

Ideas presented in this dissertation are the result of a crossover of multiple sources of intellectual and cultural influence. Some of them are enumerated in the bibliography and some are mentioned in the acknowledgments sections of my bachelor and master dissertations.
Their individual names being listed elsewhere, the influence of the teachers - affiliated with Charles University in Prague, the National University of Mongolia and the University of Nice Sophia-Antipolis - as well as of the researchers affiliated to the 3rd Section of Ecole Pratique des Hautes Etudes (be it in Paris, Dijon, Caen, Aix en Provence or Bordeaux) is not to be underestimated. Also not to be underestimated is the influence of EPHE's 4th section. Although the unpredictability of life hasn't allowed me to pursue the "philologic track" longer than one year, I remain in a state of reconnaissance profonde towards prof. G-J. Pinault, J.E.M. Houben, D. Petit, A. LeMarechal, Fanny Meunier and others for showing me how "muddy" the formalisms of any theory are in comparison with the gems of poetry and language which have been, were, had been or simply are still alive.

University Paris 8 Saint-Denis - renamed to Paris Lumières amidst the work on this dissertation - is also to be praised as an entity without whose support this dissertation would not have seen the light of day. On one side, its teachers: S. Peperkamp, who introduced me to both the beauties and the intricacies of psycholinguistic disputes, or M. B. Jover, whose courses of epistemology reignited in me a long-forgotten interest in philosophy. On the other side, dozens of other fonctionnaires who allowed me - sometimes slowly, sometimes fast, but always with success - to pursue my investigations of unknown epistemic territories.

However, EPHE and Paris 8 are just drops on the surface of a structure much greater and more ancient: Lutetia Parisiorum, the city of Paris. Verily, if there is a three-dimensional mineral structure which is to be acknowledged as a primary source of inspiration for this work, then it is the web which surrounds Notre-Dame. Deeply grounded and firmly represented in my hippocampus, Paris is indeed the city where the first drafts of this dissertation were brought to life and discussed in wine-lasting debates with Simon Carrignon, Adil elGhali, Fabien Ruggieri, Ilaria Gaudiello, Jean-Marc Thiebaut, Mary Rougeux, Yann Leger, Christophe Chavatte, Kechadi Lagha, Anne Ronsheim, Barbarka Jarkovska, Maurice & Florence Benayoun, Ivan Bigorgne, Geoffrey Tissier, Jeremy Gardent, my first student Pauline Vallies and second student Geoffrey Vantalon, Jarmila Mendelova, Louise Hearsum, Jitka Pelechova, Ophelie Monsoreau, Julienne Michele and Julie Rocton. Help coming from Kristina Poliakova, Julian Bonnyaud, Mikaela Barankova & Monique Girodroux, François Jodelet and Anh Nguyen, the stream of good books coming from the antiquaire Andre from the bookshop on the corner of rue d'Ulm and rue Claude Bernard, as well as the support of the bouquinistes Julien & Lue from Quai Voltaire, were quite vital during periods when the dissertation existed only in posse and not yet in esse.

But it would be unjust to praise Paris and not address some praise also à la République tout entière. For without the help of her institutions and her establishments - as diverse as CROUS, CAF, ANPE, Cité des Sciences, Société d'exploitation de la Tour Eiffel (!), Mairie de Paris or Campus France - it would be highly implausible that an ordinary Bratislava boy could ever dedicate years of his life to pure science. In this regard, the roles of the French Ministry of Foreign Affairs, of the Embassy of France in Slovakia and of people like Mme.
Monika Saganova are of particular importance: their assistance ultimately allowed me to cover a significant part of my material needs with the scholarship of the French government for doctoral studies under double supervision. It is also thanks to Michal Oravec and Zuzana Dideková that such a double supervision became reality. By a strange coincidence of events, and independently of each other, they both attracted my attention to the fact that in my own country of origin, Slovakia, there already exists a well-established, firm and intellectually rich tradition of cybernetics in general and of evolutionary computation in particular. Hence I met Mr. Ivan Sekaj, who was not only willing to take me under his wing but also re-introduced me to the education system of my own homeland, made me program my first genetic algorithm and always somehow succeeded in adapting his agenda to my needs. It is thanks to him that I had the opportunity to get in contact with other "wizards from Mlynska Dolina", including prof. V. Kvasnicka or M. Popper.

None of these meetings and encounters would have taken place, however, if it hadn't been for one man: professor Charles Tijus. This is so because it was mainly Charles - assisted by Francois Jouen and Joelle Provasi - who kept alive the curriculum "Cognition Humaine et Artificielle" at EPHE/Paris 8; it was Charles who guided the direction of my Master's thesis and who, besides this, also managed the complexities of the ChART laboratory and of the research platform Lutin, where the germs of this dissertation were conceived. But besides all this, it was Charles who convinced me that pursuing the path of science is worth the effort, only to subsequently give me practically absolute liberté in finding my own method of such a pursuit.

Last but not least, my ultimate "thank you" is dedicated to a woman who has transformed herself, during 6 years of doctoral studies, from a completely unknown féerie into a virtual acquaintance into my guest into my host into my tourist guide into my friend into my love into my muse into the mother of our daughter into my fiancée into my wife. It is thanks to You, Lucia, and thanks to the thousands of numinous adjustments You make, that our Iolanda Maitreya sleeps her green ideas peacefully and not furiously, that our house is fragrant with myriad essences, and that this dissertation can hereby be considered finished.

colophon

This document was typeset using the typographical look-and-feel classicthesis developed by André Miede. The style was inspired by Robert Bringhurst's seminal book on typography "The Elements of Typographic Style". classicthesis is available for both LaTeX and LyX: http://code.google.com/p/classicthesis/ Happy users of classicthesis usually send a real postcard to the author; a collection of postcards received so far is featured here: http://postcards.miede.de/

Final Version as of November 17, 2016 (classicthesis version 1).

DECLARATION

I declare that this Thesis is the fruit of my own work and that all citations and references to external sources are explicitly marked.

Daniel Devatman Hromada, November 17, 2016

Enrichir et raisonner sur des espaces sémantiques pour l'attribution de mots-clés

Adil El Ghali (1, 2), Daniel Hromada (1), Kaoutar El Ghali
(1) LUTIN UserLab, 30, avenue Corentin Cariou, 75930 Paris cedex 19
(2) IBM CAS France, 9 rue de Verdun, 94253 Gentilly
elghali@lutin-userlab.fr
ABSTRACT
Enriching and reasoning on semantic spaces for keyword extraction
This article presents a multi-modular hybrid system for the extraction of keywords from a corpus of scientific articles. The system is multi-modular because it integrates components executing transformations on 1) the morphosyntactic level (lemmatization and chunking) and 2) the semantic level (Reflective Random Indexing), as well as upon 3) more «pragmatic» aspects of the processed documents, modeled by production rules. The system is hybrid because it was able to address, without major modification, both tracks of the DEFT 2012 competition: the «reduced search-space» scenario of Track 1, whose objective was to map the content of a scientific article upon members of a «terminological list», as well as the more «real-life» scenario of Track 2, within which no such list was associated with the documents contained in the corpus. The system hereby presented obtained an F-score of 0.9488 for Track 1 and of 0.5874 for Track 2.

KEYWORDS: Keyword extraction, Semantic spaces, RRI, Bayesian network, Production rules, Chunking.

1 Introduction
The 2012 edition of the text-mining challenge DÉfi Fouille de Textes (DEFT) has as its theme the automatic identification of keywords indexing the content of articles published in scientific journals. Two tracks were proposed: in the first (Track 1) the keyword terminology is provided, while in the second (Track 2) the attribution of keywords has to be done without a terminology. To carry out this task we decided, in continuity with what we had realized in 2011 (El Ghali, 2011), to represent the meaning of the terms and documents of the corpus in semantic spaces built with the Reflective Random Indexing (RRI) variant. The choice of RRI, a variant of Random Indexing (RI) (Sahlgren, 2006), is motivated by the good properties of this method, inherited from RI and widely described in the literature (Cohen et al., 2010a). But one of its less known and less commented properties turned out to be particularly relevant for the problem posed in this edition of DEFT, namely the uniformity of the semantic space: the vectors constructed by RRI to represent the documents and the terms of the corpus are «comparable».

In the method we developed for this edition of DEFT, we wanted to answer two main questions:
1. What would be the contribution of shallow linguistic pre-processing to semantic spaces? And how could such pre-processing be compared to methods of semantic-space construction which capture elements of structure?
2. Can the scoring methods developed in the previous editions of DEFT be improved by using recent advances in artificial intelligence, notably rule-based reasoning and probabilistic graphs, encoding respectively general rules about the choice of keywords and uncertain information issued from the training corpus?

The first question imposed itself naturally, since a large part of the keywords provided for Track 1 are in fact multi-word expressions whose morphosyntactic and grammatical categories respect fairly simple rules. In order to handle keywords composed of several words, certain methods of representing texts in semantic spaces, such as BEAGLE (Jones and Mewhort, 2007), PSI (Cohen et al., 2009), or RRI with positional indexes (Widdows and Cohen, 2010), make it possible to encode information about word order. The second question was born from the fact that we had at our disposal information of different natures which could help to attribute keywords correctly: about semantics, about the distribution of keywords, about structure, about the journals from which the articles are issued, etc. This information could hardly be encoded in a single decision formalism. We therefore decided to define a decision procedure for the attribution of keywords which combines symbolic rules with Bayesian networks, by means of Probabilistic Production Rules (Aït-Kaci and Bonnard, 2011).

We chose to approach the two tracks of this year's challenge in an essentially identical manner; the same methods were used for both tracks. To do so, we constructed a terminology for Track 2. This terminology is a list of candidate keywords established using a semantic space and shallow linguistic pre-processing.

The article is organized as follows: we begin in Section 2 with an analysis of the corpus and of the information which can be extracted from it and which is useful for the keyword-attribution task. Then, in Section 3, we briefly recall the working principle of RRI, describe how the information issued from the linguistic pre-processing is incorporated into the semantic spaces, and also how the list of candidate keywords for Track 2 is constructed. In Section 4 we present the working principle of the decision procedure for the attribution of keywords. Finally, in Section 5, we detail the characteristics of each of the runs and discuss the results before concluding.

2 The Corpus

2.1 General statistics of the training corpus

2.1.1 Track 1
For Track 1, there are 140 documents in the training corpus. The documents come from 4 different journals, the identifier of the journal being encoded in the name of the XML file containing the article. The terminological list associated with the training corpus - i.e. the list containing all the unique terms chosen as a keyword for some document of the corpus - contains T_appr = 666 unique terms. The number of associated keywords is provided for each document of the training corpus as well as of the test corpus. In total, Σ_i N_appr,i = 754.
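As a sanity check, per-document keyword counts and the summary statistics reported below can be recomputed with a few lines of Python. This is a minimal sketch: the directory layout and the <motscles nb="..."> attribute are hypothetical stand-ins for the actual DEFT 2012 markup, which is not reproduced here.

```python
import statistics
from pathlib import Path
from xml.etree import ElementTree

# Hypothetical layout: one XML file per article; the number of keywords
# attributed to the article is assumed to sit in a <motscles nb="..."> tag.
counts = []
for xml_file in Path("deft2012/train_t1").glob("*.xml"):
    root = ElementTree.parse(xml_file).getroot()
    counts.append(int(root.find("motscles").get("nb")))

print("total  :", sum(counts))                    # 754 expected for Track 1
print("mean   : %.3f" % statistics.mean(counts))
print("median :", statistics.median(counts))
print("min/max:", min(counts), max(counts))
print("sd     : %.3f" % statistics.stdev(counts))
```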
On average, each article of the training corpus has:
mean(N_appr) = 5.386; median(N_appr) = 5; min(N_appr) = 1; max(N_appr) = 13; sd(N_appr) = 1.344

Given that Σ_i N_appr_i > T_appr, it is evident that some terms are defined as keywords for several articles: the 1 term – 1 article bijection principle does not apply. More precisely, in the training corpus, 604 keywords are associated with a single article, 46 with two, and 10 with three; four keywords (i.e. «identité», «interprétation», «enseignement de la traduction», «traduction») are each associated with four articles, while the term «humanitaire» is defined as a keyword for five articles and the term «mondialisation» for seven. Note also that among the 62 terms associated with more than one article, only 26 (i.e. 41.9%) are associated with articles belonging to more than one journal.

Preliminary frequency analyses also show that in 141 out of 740 cases, the keyword occurs neither in the body nor in the abstract of the article it is associated with. In other words, for more than 19% of the keywords, their occurrence frequency in the article is zero; it is thus more than evident that one has to go beyond «raw» frequencies if the keyword extraction system is to reach a precision above 80% (Figure 1 shows the occurrence frequencies of keywords in their associated documents).

The objective of Track 1 is therefore to design a system which, starting from the training corpus files containing D_appr × T_appr = 140 × 666 = 93,240 (document, term) couplings, is able to determine the couples actually established by the authors of the documents.

2.1.2 Track 2

The training corpus contains 142 documents. Contrary to Track 1, no terminological list is provided; the search space in which keyword candidates must be sought is therefore much larger. However, the number of keywords associated with each article is given. Thanks to these quantities, provided in a tag of the XML documents, we know without looking at the reference file that the distribution of the Σ_i N_appr_i = 763 associations between keywords and articles has the following properties:
mean(N_appr) = 5.411; median(N_appr) = 5; min(N_appr) = 3; max(N_appr) = 13; sd(N_appr) = 1.404

[Figure 1: About 19% (in red) of the training-corpus keywords do not appear in the documents to which they are attributed.]

The analysis of the reference file reveals that among the 681 terms covering the set of all keywords of the Track 2 training corpus, 627 are associated with a single article, 37 with two, 12 with three, two terms («humanitaire» and «didactique») with four articles, the terms «identité» and «culture» with five articles each, and the term «traduction» with eight documents. Since the information concerning the journal an article belongs to is available, we also know that among the 54 terms associated with more than one article, only 18 (i.e. 33.3%) are associated with more than one journal. The analysis of keyword frequencies in the associated articles yields results pointing in the same direction as those of Track 1: in 145 cases (19%), the keywords do not appear in the article with which they were associated!
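To make the descriptive statistics above concrete, here is a minimal Python sketch computing them from a list of per-document keyword counts; the example list, the absent_ratio helper and its input structures are hypothetical stand-ins for data parsed from the DEFT XML and reference files, not the actual pipeline:

from statistics import mean, median, stdev

# Hypothetical per-document keyword counts parsed from the corpus.
keyword_counts = [5, 4, 6, 5, 7, 3, 5]

print("sum    :", sum(keyword_counts))
print("mean   :", round(mean(keyword_counts), 3))
print("median :", median(keyword_counts))
print("min/max:", min(keyword_counts), max(keyword_counts))
print("sd     :", round(stdev(keyword_counts), 3))

def absent_ratio(docs, gold):
    """Share of gold keywords never occurring verbatim in their article.
    docs: {doc_id: text}, gold: {doc_id: [keywords]} (both hypothetical)."""
    absent = total = 0
    for doc_id, kws in gold.items():
        text = docs[doc_id].lower()
        for kw in kws:
            total += 1
            if kw.lower() not in text:
                absent += 1
    return absent / total  # around 0.19 on the DEFT 2012 training data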
2.2 General statistics of the test corpus

2.2.1 Track 1

The Track 1 test corpus contains D_test = 94 documents. Its terminological list contains 478 unique terms. Among these 478 candidate terms, 435 are associated with a single document, 34 with two different documents, four terms with three articles and four terms with four articles; the most successful keyword is «identité», itself associated with six articles. Among the 43 terms associated with more than one article, 20 (i.e. 46.5%) are associated with articles belonging to more than one journal. The distribution of the number of keywords associated with the articles of the Track 1 test corpus (Σ_i N_test_i = 537) has the following properties:
mean(N_test) = 5.712; median(N_test) = 5; min(N_test) = 1; max(N_test) = 12; sd(N_test) = 1.701

2.2.2 Track 2

The distribution of the Σ_i N_test_i = 484 keywords attributed to the 93 documents of the Track 2 test corpus is characterized by the following measures:
mean(N_test) = 5.204; median(N_test) = 5; min(N_test) = 2; max(N_test) = 10; sd(N_test) = 1.323
Consulting the reference files released after the end of the competitive phase of DEFT 2012 tells us that among the 35 terms associated with more than one article, only 10 (i.e. 28.6%) are associated with articles belonging to more than one journal.

2.3 What else can be learned from the corpus?

A quick pass through the training corpus and the terminology provided for Track 1 shows that, beyond frequencies, the keywords chosen by the authors follow a few rules:
– keywords differ from one another: authors only rarely use keywords that are very close to each other;
– they are quite often repeated in the introduction and the conclusion of the article;
– their morphosyntactic or grammatical category is very rarely verbal; the most frequent keywords are nouns (common or proper), adjectives or noun phrases.
Moreover, as one would expect, keywords are strongly semantically related to their document, as Figure 2 shows.

[Figure 2: Document–keyword similarities (min, max, mean) vs. document–terminology similarity (mean).]

3 Semantic spaces

Vector models of word meaning form a family of models which represent the semantic similarity between words as a function of the textual environments in which these words appear. The word co-occurrence distribution of the corpus is collected, analyzed and then transformed into a semantic space in which words are represented as vectors in a high-dimensional vector space. LSA (Landauer and Dumais, 1997), HAL (Lund and Burgess, 1996) and RI (Kanerva et al., 2000) are a few examples. These models rest on the distributional hypothesis of (Harris, 1968), which states that words appearing in similar contexts have similar meanings. Characterizing the unit of context is a problem common to all these methods, and its definition differs across models. For example, LSA builds a word–document matrix in which each cell a_ij contains the frequency of word i in context unit j. HAL defines a sliding window of n words which traverses the corpus word by word, and then builds a word–word matrix in which each cell a_ij contains the frequency with which word i co-occurs with word j within the previously defined window.
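As an illustration of the word-space idea just described, the following minimal Python sketch builds an LSA-style word–document frequency matrix over a toy corpus and compares word vectors with the cosine measure discussed below; the toy documents and all names are invented for the example:

import math
from collections import Counter

# Toy corpus: each document is a list of (already lemmatized) tokens.
docs = [["traduction", "culture", "identite"],
        ["traduction", "corpus", "semantique"],
        ["culture", "identite", "mondialisation"]]

vocab = sorted({w for d in docs for w in d})

# LSA-style word-document matrix: row = word, column = document,
# cell a_ij = frequency of word i in document j.
matrix = {w: [Counter(d)[w] for d in docs] for w in vocab}

def cosine(u, v):
    """Cosine of the angle between two vectors: the classical
    approximation of semantic similarity in word-space models."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(matrix["culture"], matrix["identite"]))    # high: shared contexts
print(cosine(matrix["culture"], matrix["semantique"]))  # low: disjoint contexts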
Different mathematical methods for extracting the meaning of concepts by reducing the dimensionality of the co-occurrence space are then applied to the frequency distribution stored in the word–document or word–word matrix. The primary objective of these mathematical treatments is to extract the «patterns» accounting for frequency variations and to eliminate what may be considered «noise». LSA employs a general method of linear decomposition of a matrix into independent components: singular value decomposition (SVD). In HAL, the dimensionality of the space is reduced by retaining a restricted number of principal components of the co-occurrence matrix. At the end of this dimensionality-reduction process, the similarity between two words can be computed in different ways. Classically, the cosine of the angle between the two vectors corresponding to two words, or two groups of words, is computed in order to approximate their semantic similarity.

3.1 Reflective Random Indexing

The semantic-space construction method we use is Reflective Random Indexing (RRI) (Cohen et al., 2010a), a recent method based on random projection which differs markedly from other semantic-space construction methods. Its particularities are (i) that it does not build a co-occurrence matrix, and (ii) that, contrary to other vector models of semantic representation, it does not require heavy statistical processing such as the SVD used by LSA. RRI is based on random projection (Vempala, 2004; Bingham and Mannila, 2001), which scales much better to large numbers of documents. The construction of a semantic space with RRI proceeds as follows:
– Create a matrix A(d × n) containing index vectors, where d is the number of documents or contexts and n the number of dimensions chosen by the experimenter. Index vectors are sparse, randomly generated vectors.
– Create a matrix B(t × n) containing term vectors, where t is the number of distinct terms in the corpus. Initialize all these vectors with zero values to start the construction of the semantic space.
– For every document of the corpus, each time a term τ appears in a document δ, add the index vector of δ to the term vector of τ. At the end of the process, terms appearing in similar contexts have accumulated similar index vectors.

The «reflective» aspect of RRI consists in replaying several cycles of the three steps of the algorithm, no longer starting from random vectors but from the index vectors obtained for the documents. These cycles smooth out the random aspect of the space, and the system generally converges after a small number of cycles.

3.1.1 Semantic Vectors

Several free implementations of RRI are available; we use the Semantic Vectors library (Widdows and Cohen, 2010), available at http://code.google.com/p/semanticvectors/.
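The three-step construction and its reflective cycles can be summarized in a short sketch. This is not the Semantic Vectors implementation, just a minimal Python rendering of the algorithm as described above (vector normalization and sparse storage are omitted for brevity):

import random
from collections import defaultdict

def random_index_vector(n=200, seeds=10):
    """Sparse ternary index vector: mostly zeros, a few +1/-1 entries."""
    v = [0.0] * n
    for i in random.sample(range(n), seeds):
        v[i] = random.choice((-1.0, 1.0))
    return v

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def train_rri(docs, n=200, cycles=2):
    """docs: {doc_id: [tokens]}. Returns (term_vectors, doc_vectors)."""
    doc_vecs = {d: random_index_vector(n) for d in docs}   # matrix A(d x n)
    for _ in range(cycles):
        # term vectors start at zero (matrix B(t x n)) and accumulate
        # the vector of every document in which the term occurs
        term_vecs = defaultdict(lambda: [0.0] * n)
        for d, tokens in docs.items():
            for t in tokens:
                term_vecs[t] = add(term_vecs[t], doc_vecs[d])
        # "reflective" step: rebuild document vectors from the term
        # vectors just obtained, smoothing out the initial randomness
        doc_vecs = {d: [0.0] * n for d in docs}
        for d, tokens in docs.items():
            for t in tokens:
                doc_vecs[d] = add(doc_vecs[d], term_vecs[t])
    return term_vecs, doc_vecs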
Semantic Vectors presents a number of advantages over other libraries implementing RRI. On the one hand, it offers an implementation of RRI based on positional indexes (Cohen et al., 2010a), which builds the semantic space not from the occurrences of a term in a document but from a sliding window in the manner of HAL; this version of RRI captures, beyond information about term semantics, structural information about term proximity. On the other hand, Semantic Vectors implements a number of similarity measures between groups of words, in particular (i) «quantum disjunction» (Cohen et al., 2010b), which builds a volume corresponding to several terms in the semantic space and computes the distance between this volume and other terms or documents of the space; and (ii) «tensor similarity», which takes an ordered sequence of terms as input and computes its similarity with other ordered sequences, thereby exploiting the order information provided by the positional indexes.

Semantic Vectors is used in numerous applications. We have used it in our DEFT participations since the 2009 edition. In tasks close to the one at hand, the library has been used to compare RRI with other semantic-space methods for discovering relations between terms in a corpus (Rangan, 2011).

3.2 Enriching semantic spaces with linguistic information

In the keyword attribution problem, a proportion of the terms used as keywords are groups of words. The semantics associated with a group of words in a semantic space is not as precise as the semantics associated with a single word: it includes components of those words as used in other contexts. To handle the semantics of such word groups, some methods for representing meaning in semantic spaces, such as BEAGLE (Jones and Mewhort, 2007), PSI (Cohen et al., 2009), or RRI with positional indexes (Cohen et al., 2010b; Widdows and Cohen, 2010), can encode word-order information. We wanted to test another method, based on a surface linguistic analysis of the text.

The principle of this method is to identify candidate word groups in the text through a chunking phase (Abney, 1991), and then to build equivalence classes of chunks which share a majority of identical words (after lemmatization) and which are semantically close, judged by the semantics, in a «classical» semantic space, of the words they contain. The corpus is then transformed by replacing all the chunks of an equivalence class by one representative of the class, and a new semantic space is built from this new corpus; in this space the chunk-class representatives are treated as words.

For the needs of Track 1, the chunker was trained to consider as chunks all the multi-word keywords of the provided terminology. For Track 2, the same chunker, together with the chunk-class construction procedure, is used to build a list of candidate keywords.
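A minimal sketch of the chunk-class construction may help fix ideas. The lemmatize and similar functions are placeholders for the actual lemmatizer and for a cosine test in a word-level semantic space; the overlap threshold and all names are illustrative, and the corpus rewriting is deliberately naive:

def chunk_signature(chunk, lemmatize):
    """Bag of lemmas of a chunk, e.g. 'espaces sémantiques' -> {espace, sémantique}."""
    return frozenset(lemmatize(w) for w in chunk.split())

def equivalence_classes(chunks, lemmatize, similar, min_overlap=0.5):
    """Group chunks sharing a majority of lemmas AND semantically close
    (judged by `similar`, e.g. cosine in a word-level RRI space)."""
    classes = []  # list of (representative, members)
    for c in chunks:
        sig = chunk_signature(c, lemmatize)
        for cls in classes:
            rep_sig = chunk_signature(cls[0], lemmatize)
            overlap = len(sig & rep_sig) / max(len(sig | rep_sig), 1)
            if overlap >= min_overlap and similar(c, cls[0]):
                cls[1].append(c)
                break
        else:
            classes.append((c, [c]))
    return classes

def rewrite_corpus(text, classes):
    """Replace every member of a class by the class representative, so
    the new semantic space treats each chunk class as a single 'word'."""
    for rep, members in classes:
        for m in members:
            text = text.replace(m, rep.replace(" ", "_"))
    return text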
4 Keyword assignment as a mixed decision procedure

4.1 A Bayesian network for keyword assignment

Analyzing a corpus of articles, we first seek to determine the sizes of the various keywords attached to a given article. In a second step, we try to establish the probabilities that these keywords belong to a pre-established list. For each document of the corpus we have the following information:
– the lengths of the abstract l and of the text L;
– the journal R in which the article appeared;
– the number of keywords n and their respective sizes n_1, ..., n_n (i.e. the number of words composing them);
– the similarities with the whole keyword lexicon (d_1, ..., d_N), N being the size of the terminology;
– the keywords (kw_1, ..., kw_n).

The goal is thus to find relations between the exogenous variables (l, L, R, n, d_1, ..., d_N) which allow us to predict the behavior of the endogenous variables (n_1, ..., n_n, kw_1, ..., kw_n). To this end, a suitable knowledge-modeling formalism is needed. Bayesian networks (Barber, 2012), being graphical models with underlying probabilistic representations, appear particularly well suited to our case.

A Bayesian network B is a pair (G, θ) where G is a directed acyclic graph whose nodes represent a set of random variables X = {X_1, ..., X_n}, and θ_i = [P(X_i | Pa(X_i))] is the matrix of conditional probabilities of node i given the state of its parents Pa(X_i). The interest of Bayesian networks is that their graphical and probabilistic structures support a modular representation of knowledge and an interpretation of the data that is both quantitative and qualitative: the graph of a Bayesian network schematically represents the relations between the variables of the system to be modeled, while the probability distributions quantify these relations.

The model we propose to build is a Bayesian network with discrete variables (the journal name R, the keywords kw_i, their number n, their sizes n_i) and continuous variables (the lengths of the abstract l and of the article L, and the similarities to the terminology). It is a mixed model, called a conditional Gaussian model, in which the distribution of the continuous variables conditioned on the discrete variables is a multivariate Gaussian. This implies that arcs may go from discrete nodes to continuous nodes, but not the other way around, except when the continuous nodes are observable (which is our case).

Note also that the number of variables n_1, ..., n_n and kw_1, ..., kw_n varies with the number of keywords n; since the number of nodes of a Bayesian network is fixed, we define n_1, ..., n_25 as the sizes of the keywords, with n_i = 0 if i > n, and kw_1, ..., kw_25 as the keywords, with kw_i = NULL if i > n.

To summarize, we have the following random variables, represented by the nodes of the Bayesian network we seek to build:
– R, the journal name (discrete variable with 4 possible values);
– l, the length of the abstract (continuous variable);
– L, the length of the article (continuous variable);
– n, the number of keywords (discrete variable with 25 possible values);
– n_1, ..., n_25, the sizes of the keywords (discrete variables with 11 possible values);
– d_1, ..., d_1062, the similarities to the set of keywords (continuous variables);
– kw_1, ..., kw_25, the keywords (discrete variables with 1062 possible values).

Observation of the distribution of documents across journals shows that it is similar in the training and test corpora, which implies that the bias introduced by this distribution will not impact the performance of the model to be built. The mean lengths of abstracts l and of articles L are of the same order of magnitude. These means are admittedly not identical in the training and test corpora, but they are distributed in the same way, i.e. the abstract (respectively article) lengths in the training corpus equal those of the test corpus up to a common factor. Note also that article lengths and abstract lengths are not distributed in the same way; this means that besides the obvious direct relation between these two variables, they probably share a common cause, which translates in the structure of the Bayesian network into the presence of a common parent.

The distributions of the number of words per article (respectively per abstract) can be approximated by Gaussian mixtures. These histograms are similar for the whole corpus and for the training corpus, which shows that the studied sample can be considered representative of the problem. However, the relative disparity observed between the test and training corpora will probably create a bias problem that must be taken into account during model construction. The histograms of the number of words per article (respectively per abstract) differ across journals; these variables are therefore directly linked to the identity of the journal. The resulting distributions have arbitrary shapes, yet we note that they can be approximated by Gaussian mixtures, which reinforces our choice of a conditional Gaussian model to represent these variables in a Bayesian network.

Observing the trend of the mean similarities to the keyword terminology across the different parts of the corpus, we note that it has the same shape (indeed almost the same curve) in all cases (whole corpus, training corpus, individual journals, ...). This allows us to suppose that keyword selection proceeds in strictly the same manner everywhere, so that the idea of building a mathematical model of it is perfectly coherent.

On the basis of these observations, we can propose a structure for the Bayesian network relating the variables of our problem. By convention, discrete variables are represented by square nodes, continuous variables by round nodes and observable variables by shaded nodes (Figure 3).

[Figure 3: Structure of the Bayesian network learned on the corpus.]
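As a toy illustration of the conditional Gaussian idea, the sketch below learns, for a single discrete-to-continuous arc such as R → l, one Gaussian per journal and evaluates the likelihood of an observed abstract length. The numbers are invented; this is a minimal sketch of the principle, not the model actually trained on the corpus:

import math
from collections import defaultdict

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit_conditional_gaussian(samples):
    """samples: list of (journal, abstract_length). Learns, for the arc
    R -> l of the network, one Gaussian per value of the discrete parent."""
    by_journal = defaultdict(list)
    for journal, length in samples:
        by_journal[journal].append(length)
    params = {}
    for journal, lengths in by_journal.items():
        mu = sum(lengths) / len(lengths)
        var = sum((x - mu) ** 2 for x in lengths) / max(len(lengths) - 1, 1)
        params[journal] = (mu, math.sqrt(var) or 1.0)
    return params

# Toy training data: (journal id, abstract length in words).
params = fit_conditional_gaussian([("J1", 120), ("J1", 135), ("J2", 80), ("J2", 95)])
print(gaussian_pdf(128, *params["J1"]))  # likelihood of l given R = J1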
4.2 Combining statistical decisions with rule-based reasoning

Recent work in artificial intelligence on combining statistical decision methods with production-rule reasoning, such as the Probabilistic Production Rules (PPR) of (Aït-Kaci and Bonnard, 2011), offers a framework for modeling a decision procedure which takes into account both what is learned by the Bayesian network described above and the symbolic knowledge encoded in rules about keyword choice, of which we gave examples in Section 2.3.

The decision system built on PPR works by computing a score for each keyword for a given document. This computation uses rules which may call upon the Bayesian network. For example, the regularity «keywords differ from one another» can be translated into the production rule «if two keywords are close, then increase the score of the one with the higher probability of being a keyword of the document, and decrease the other», which is written:

IF similarity(kw1, kw2) > threshold AND bnproba(kw1|doc) > bnproba(kw2|doc)
THEN increase-score(kw1, doc) AND decrease-score(kw2, doc)

The rule system we used contains about fifteen rules; lack of space prevents us from detailing them here.
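Rendered as executable pseudo-code, such a rule might look as follows; similarity, bnproba, THRESHOLD and delta are placeholders for the semantic-space cosine, the posterior given by the Bayesian network of Section 4.1, and tuned constants, none of which are specified in the original paper:

THRESHOLD = 0.8  # hypothetical similarity threshold

def apply_distinctness_rule(candidates, doc, similarity, bnproba, scores, delta=0.1):
    """Production rule 'keywords differ from one another': when two
    candidates are too close, boost the one the Bayesian network judges
    more probable for this document and penalize the other."""
    for i, kw1 in enumerate(candidates):
        for kw2 in candidates[i + 1:]:
            if similarity(kw1, kw2) > THRESHOLD:
                if bnproba(kw1, doc) > bnproba(kw2, doc):
                    scores[kw1] += delta
                    scores[kw2] -= delta
                else:
                    scores[kw2] += delta
                    scores[kw1] -= delta
    return scores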
5 Submitted runs

Table 1 summarizes the runs submitted by our team. The results are very satisfactory for all the approaches we used, the mean F-score over all participants being 0.3575 for Track 1 and 0.2045 for Track 2. Note that the first runs of both tracks (1.1 and 2.1), which constitute our baselines, already give correct results in relatively short times.

Run   Precision   Recall   F-score   Time (s)
1.1   0.4618      0.4618   0.4618    2
1.2   0.9479      0.9497   0.9483    7590
1.3   0.7486      0.7486   0.7486    –
2.1   0.2438      0.2438   0.2438    26
2.2   0.3471      0.3471   0.3471    269
2.3   0.5879      0.5867   0.5873    12700

Table 1 – Submitted results: performance and execution time

5.1 Track 1

5.1.1 Run 1.1 – baseline: RRI and k-NN

In this run, which constitutes our baseline, we built an RRI semantic space from all the documents of the corpus (training + test), a document being the concatenation of the abstract and the body of the article. Then, for each document d of the test corpus, we retained as keywords the k nearest neighbors of the document within the terminology, k being the number of keywords for document d. The vector of a keyword kw_i composed of the words w_1, ..., w_n is obtained by summing the vectors of the words it contains:

$\vec{kw}_i = \sum_{j=1}^{n} \vec{w}_j$   (1)

5.1.2 Run 1.2 – RRI(chunks), BN and rules

In this run, which obtained the best result, we built an «enriched» semantic space as described in Section 3.2, but in which a document was represented by four vectors: one for the abstract, one for the body of the article, and two vectors for the first and last paragraphs of the article (which we took as approximations of the introduction and the conclusion). We then learned the Bayesian network described in Section 4.1, using the document–keyword distances obtained in this space. Finally, we used the decision procedure described in Section 4.2 to assign a score to each keyword; the retained keywords are the k with the highest scores (k being the number of keywords for the document).

5.1.3 Run 1.3

In this run we combined the results of runs 1.1 and 1.2, giving a slight preference to candidate terms which are longer than other candidate terms. We thus combined, for example, the candidate terms of run 1.1: Catalogne; Narotzky; conflit; contexte; district industriel; femmes; production traductionnelle; production écrite; réseau with the candidate terms of run 1.2: Espagne; Narotzky; anthropologie économique; district industriel; féminisme; histoire; réseaux de production; économie politique; économie régionale to obtain the candidate list of run 1.3: district industriel; réseaux de production; économie politique; production traductionnelle; anthropologie économique; Narotzky; économie régionale; production écrite; féminisme.

The score of a candidate was computed by the formula:

$score = F_r \times (l - F_a)$   (2)

where F_r is the relative frequency of the candidate term in the analyzed article, F_a is the absolute frequency of the candidate term in all the articles of the corpus, and l is the number of characters of the candidate term.
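A minimal sketch of formula (2), with single-word tokens for simplicity (the actual system also scores multi-word candidates); the toy data are invented:

def run13_score(term, article_tokens, corpus_tokens):
    """Formula (2): relative frequency in the article times
    (term length in characters minus absolute corpus frequency),
    favouring long, article-specific candidate terms."""
    f_rel = article_tokens.count(term) / max(len(article_tokens), 1)
    f_abs = corpus_tokens.count(term)
    return f_rel * (len(term) - f_abs)

article = ["district", "industriel", "district", "réseau"]
corpus = article + ["district", "femmes", "réseau", "réseau"]
print(run13_score("district", article, corpus))  # longer & rarer -> higher score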
5.2 Track 2

5.2.1 Run 2.1 – baseline: RRI and k-NN

This run is identical to the first run of Track 1 (5.1.1); the terminology, obtained by the method described in Section 3.2, contains 3000 candidate keywords.

5.2.2 Run 2.2 – RRI(PositionalIndex), tensor similarity and k-NN

In this second run we used the same terminology as for 2.1, but the semantic space was built using RRI over positional indexes. The computation of keyword vectors uses the tensor operator of Semantic Vectors. The keywords retained for a document d are the k nearest neighbors of d within the terminology, k being the number of keywords for document d.

5.2.3 Run 2.3 – RRI(chunks), BN and rules

This run is identical to the second run of Track 1 described in 5.1.2; the terminology, obtained by the method described in Section 3.2 and augmented with the keywords of the training corpus, contains 3270 candidate keywords.

5.3 Discussion

Runs 1.2 and 2.3 obtain the best results, which supports our initial hypotheses. The official runs do not allow us to compare the performance of the spaces «enriched» with chunks against the RRI spaces with positional indexes; we therefore performed an additional run 2.2bis, with an «enriched» space and k-NN, obtaining an F-score of 0.4186, which is noticeably better than run 2.2.

Recall that for run 1.3 we combined the results of 1.1 and 1.2, giving more weight to long candidate terms (this rule not having been included in the rule system described in Section 4.2). Given that the resulting F-score (0.7486) lies halfway between the F-scores of 1.1 and 1.2, we cannot really conclude as to the pertinence of this rule.

Conclusion

In this article we presented a system for attributing keywords to scientific articles, based on semantic spaces built with RRI. We then tried to improve the performance of the system in two ways: (i) by enriching the semantic spaces with information produced by a surface linguistic analysis, and (ii) by defining a decision procedure based on a combination of Bayesian networks and rule-based systems. The results obtained show that both hypotheses paid off: they noticeably improve upon the results obtained by an RRI-only approach (which already achieves honorable results).

References

ABNEY, S. (1991). Parsing by chunks. In Principle-Based Parsing. Kluwer Academic Publishers.
AÏT-KACI, H. and BONNARD, P. (2011). Probabilistic production rules. Technical report, IBM.
BARBER, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press.
BINGHAM, E. and MANNILA, H. (2001). Random projection in dimensionality reduction: Applications to image and text data. In Knowledge Discovery and Data Mining, pages 245–250. ACM Press.
COHEN, T., SCHVANEVELDT, R. and RINDFLESCH, T. (2009). Predication-based semantic indexing: Permutations as a means to encode predications in semantic space. In Proceedings of the AMIA Annual Symposium, pages 114–118.
COHEN, T., SCHVANEVELDT, R. and WIDDOWS, D. (2010a). Reflective random indexing and indirect inference: A scalable method for the discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256.
COHEN, T., WIDDOWS, D., SCHVANEVELDT, R. and RINDFLESCH, T. (2010b). Logical leaps and quantum connectives: Forging paths through predication space. In Proceedings of the AAAI Fall 2010 Symposium on Quantum Informatics for Cognitive, Social and Semantic Processes (QI-2010).
EL GHALI, A. (2011). Expérimentations autour des espaces sémantiques hybrides. In Actes de l'atelier DEFT'2011, Montpellier.
HARRIS, Z. (1968). Mathematical Structures of Language. John Wiley and Sons, New York.
JONES, M. N. and MEWHORT, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114(1):1–37.
KANERVA, P., KRISTOFERSON, J. and HOLST, A. (2000). Random indexing of text samples for latent semantic analysis. In GLEITMAN, L. and JOSHI, A., editors, Proceedings of the 22nd Annual Conference of the Cognitive Science Society, Mahwah. Lawrence Erlbaum Associates.
LANDAUER, T. K. and DUMAIS, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2):211–240.
LUND, K. and BURGESS, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments & Computers, 28(2):203–208.
RANGAN, V. (2011). Discovery of related terms in a corpus using reflective random indexing. In Proceedings of the Workshop on Setting Standards for Searching Electronically Stored Information in Discovery Proceedings (DESI-4).
SAHLGREN, M. (2006). The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Doctoral dissertation, Department of Linguistics, Stockholm University.
VEMPALA, S. S. (2004). The Random Projection Method, volume 65 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society.
WIDDOWS, D. and COHEN, T. (2010). The Semantic Vectors package: New algorithms and public tools for distributional semantics.
In Proceedings of the Fourth IEEE International Conference on Semantic Computing (IEEE ICSC2010).

Parallel Democracy Model and Its First Implementations in the Cyberspace

Daniel Devatman Hromada*

Abstract

The parallel democracy model is a variant of the traditional participative democracy approach, aiming to provide a new method of convergence toward quasi-optimal solutions of diverse perennial political and social challenges. Within the framework of the model, such challenges are operationalized into variables with which the authority can associate possible values. The value assigned to a variable at a moment T of the system's history is called its active value. Every couple {variable, active value} represents a functional property of a given society, and the genomic vector of such properties can not only describe, but also determine, the functioning of the society in question. In the majority of occidental societies, the authority to assign values to variables is delegated to a parliament, which assigns values to variables by means of aggregated voting. The government or other institutional bodies subsequently execute actions according to the activated values. In almost all modern political systems, the value-to-variable assignment is done in sequential (serial) order, for example by voting for one law proposal after another during a parliamentary session or a referendum. Any reform of the system —an update of multiple variables— is a costly process, because a change of value for almost any variable requires a new vote which, in the majority of societies, involves the relocation of vote-givers to a vote-giving location during a period dedicated to voting. The progress of quasi-perennial storage systems and databases, in combination with communication networks, makes it possible to aggregate and store information about the numbers of votes related to a potentially infinite number of variables. Therefore the method by means of which values are activated and assigned to variables does not necessarily need to be sequential. Given the condition that the act of voting is identical to the incrementation of the chosen value stored on the medium, it is no longer necessary for the decision-makers to meet at one point in time in order to give their vote to the value they want to activate. Even the condition that they must meet in one place is weakened, since the vote-giving place can be purely virtual. The kyberia.sk and kyberia.cz domains are domains where the first tentatives to implement such a system were realized «in vivo» for a limited set of variables. The aggregated voting of already registered kyberia.cz users determines the number of votes which the registration application of a new user needs to obtain in order to be accepted. Kyberia.sk senators can decide what motto shall be displayed at the top of the page —the option which receives the biggest number of votes becomes the active title. Since both variables are internal constants of kyberia's engine, no human intervention is necessary after a value becomes the variable's active value, and the system automatically reconfigures its own code.

Keywords: Parallel democracy model. Participative democracy. Vote aggregation. Auto-configuration of social networks. Genome of political bodies.

* STU / EPHE / Paris 8, hromi@kyberia.sk. Originally published in Teoria Politica, Nuova Serie, Annali III, 2013: 165-180.
1. From perennial challenges to properties of political systems

One cannot speak about human society while ignoring the innate nature of human beings. And because the innate nature of human beings changes only slowly in the course of the evolution of the homo sapiens sapiens species, and because at least some features of this «nature of human beings» —be it loving, learning, laughing, looking or listening— seem to be constant and omnipresent among all human beings, it is not unimaginable that this very human nature causes, and shall cause, certain challenges to appear and reappear in any human and/or transhuman society imaginable and conceivable. We label as «perennial» such challenges as cannot be ignored by any human society. In order to avoid the useless metaphysical debates which could stem from such a definition, we specify that in the rest of this article the following definition shall be adopted: «a perennial challenge (PC) is a challenge implicitly or explicitly addressed by all documented societies of human history».

To be more concrete, we may consider questions of the form «shall X be Y in our polis?» as representatives of such challenges, no matter whether X means «death penalty, slavery, wine-drinking, meat-eating or prostitution» and no matter whether Y means «forbidden, permitted, or obligatory». Throughout the course of human history, such Xs and Ys were, in one form or another, represented in the minds of all the individuals forming a given society. The thing which changed, the structure which evolved, was only the weighted network of associations among such X & Y representations. One of the main objectives of the social sciences of the last century was to unveil the structure of such networks (Bourdieu, 1984; Lévi-Strauss, 1967; Saussure et al., 1995), i.e. what Xs were connected with what Ys, and to find the raison d'être of such connections in the underlying totality of the graph-like system. Political science, on the other hand, focuses attention upon a different question: «who is the source of authority?», «who holds the power?», «WHO associates values to X in this polis?». The objective of this article is to present the Parallel Democracy Model (PDM), within the scope of which the answer to this question is «everybody», and to present its first naive implementations within the framework of the virtual communities kyberia.sk and kyberia.cz.

2. From properties of political systems to typed variables and their values

A property of a political system is, within the theoretical framework of the Parallel Democracy Model (PDM), defined by a couple {typed variable, active value}. A typed variable can be defined as a «uniquely defined conceptual entity (the variable) and a set (called its type), consisting of all the values that the entity may take» (Floridi, 2011). An active value is the member of the set of possible values of the variable which is assigned to the variable at time T.

Intuitively, a variable can be imagined as a box, with the label on the box being the variable's name and the content of the box being the value of the variable. One can have many boxes with many different labels —one can have many variables. One can have boxes for shoes and one can have boxes for food— there are many types of variables. Since the set of possible sets is infinite, infinitely many types of variables can exist. Only some of them are of particular practical interest for PDM, more concretely:
1. boolean – has two members: {true, false};
2. integer – the set of all integers;
3. real – the set of all real numbers;
4. probability – the set of all real numbers from the interval <0, 1>;
5. text – the set of all possible strings of symbols;
6. legality – a set of three members: {permitted, obligatory, forbidden};
7. formula – the set of all possible mathematical formulae.

As was indicated above, every property of a political system, when operationalized into a variable, represents a challenge with which every society and every polis has to deal, in one way or another. One can easily imagine a variable $immigrants (every variable is denoted by a $ prefix) representing the number of immigrants which a given polis is ready to integrate during a certain $interval of time. The variable $immigrants would be of type integer if it is defined in absolute terms; it would be of type real or probability if it is defined relative to the size of the polis (i.e. 0.2, or 1.2 if the polis is ready to increase its population up to 120% of its original size). A boolean-typed variable can encode a property of the polis which can either exist or not, e.g. $has_basileus, in order to encode the possibility that the polis has (or has not) its βασιλεύς; $deathpenalty_exists, in order to encode whether the death penalty is legal (or not) in the polis; or $has_pdm, in order to denote the difference between a polis which is fully PDM-compliant and one which is not. Other variables already implemented in real-life scenarios will be mentioned in the following paragraphs in order to clarify our point.

To every variable, an «active» value is assigned in every moment of the variable's history. It is possible to have «array» variables which are associated with more than one active value at a given moment; for pedagogic reasons we shall not, however, deal with such cases in the current article, and if ever the need to assign two values simultaneously should emerge, we shall assign them to two distinct scalar variables. Thus, within the following framework, a variable always contains one and only one active value at a given moment T of its history. We specify that the value assigned to the variable at moment T is the active value, in contrast to the other members of the set specifying the variable's type, which are, at moment T, just «potentially activable values». Analogically to the box engraved with the label $shoe, which today contains my winter shoes and which could, possibly, sometime in the future contain my summer shoes, one can imagine the variable $has_basileus to which the value «false» is assigned today (i.e. «false» is the active value) in a polis which lacks its basileus, but to which, possibly, the value «true» shall be assigned tomorrow (i.e. «true» is a potentially activable value). The rationale behind this analogy is simple: not to forget that values are assigned to the variable in a mutually exclusive way. Simply stated, there shall always be one and only one pair of shoes in the box at any given moment, and whether they are the boots or the sandals will undoubtedly determine the extent to which I'll feel comfortable after I decide to put the content of the box on and walk out into the rain.
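For readers who think in code, here is a possible minimal Python rendering of a PDM typed variable with mutually exclusive assignment of the active value; the class layout is purely illustrative and is not taken from the kyberia engine:

class TypedVariable:
    def __init__(self, name, type_members, active=None):
        self.name = name
        self.type = set(type_members)  # all values the variable may take
        self.active = active           # exactly one active value at time T

    def assign(self, value):
        """Mutually exclusive assignment: the new active value replaces
        the old one, which becomes merely 'potentially activable' again."""
        if value not in self.type:
            raise ValueError(f"{value!r} is not in the type of {self.name}")
        self.active = value

has_basileus = TypedVariable("$has_basileus", {True, False}, active=False)
has_basileus.assign(True)  # the polis crowns a basileus tomorrow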
3. From properties represented by variables to systems represented by vectors of variables

We believe that a political system —be it a polis or a sultanate, a republic or an empire— can be described in terms of the set of its properties. As we have indicated above, properties are often closely related to a challenge with which a society has to deal, in one way or another. Such a challenge can almost always be conceived as a question of the form «what value Y shall be assigned to variable X?». For example, the very ancient challenge of «whether or not to accept aliens in one's polis» can be formalized as a variable $accept_immigrants of type boolean, or in more evolved systems as a variable of type integer or real (cf. above) which has the value 0 (i.e. no immigrants) as its limit option among multitudes of other options. The way each society deals with a given challenge yields a property of that society and can be operationalized into a variable. In every moment of the system's history, a certain value is assigned to that variable, and the sequence —or rather the vector— of such values can yield a description of the system in question.

We specify that a vector of length N is an ordered sequence of N values; therefore, any element of a vector can represent a variable. Let's imagine, for example, a simplistic vector of length 2 defined as a sequence of 2 boolean variables [$has_basileus, $has_parliament]. Under such an interpretation, the vector having values [1, 0] represents autocracy; the vector [1, 1] represents either a constitutional monarchy or a presidential democracy in which the rights of the president are so strong that he can even be considered to be the basileus; the vector [0, 1] can represent a parliamentary democracy with the basileus role non-existent or reduced to ceremonial purposes; and the vector [0, 0] can represent a system with neither basileus nor parliament, e.g. an anarchy, oligarchy, mediarchy etc. We do not pretend that such a simplistic vector composed of 2 boolean variables could be of much practical use; we present it only as a paedagogic example whose secondary effect is to suggest that even a highly general and abstract functional level —described, in the case of a modern society, in the Constitution— could possibly be addressed by our formalism.

It is not completely hors propos to imagine a research program aiming to 1) enumerate all existing or known political systems, 2) find invariants among them, 3) operationalize those invariants into variables with sets of possible values, and 4) subsequently describe every unique political system by a unique vector of values. The output of such a «herculean task», if ever finished, would be a set of vectors —a dataset— describing the properties of political systems during different moments of recorded history. It may be the case that a purely mathematical or topological study of such vector ensembles (i.e. vector spaces) would indicate, among other things, that only a very small part of the search space «of all possible configurations of a political system» has been explored in the course of human history, and that multitudes of theoretically stable political configurations are still to be discovered.
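The toy vector above can be rendered as a simple lookup from genomes to labels; a hypothetical sketch, with labels taken from the paedagogic example:

# Describing polities as vectors over the two boolean variables
# [$has_basileus, $has_parliament] from the example above.
GENOME_LABELS = {
    (1, 0): "autocracy",
    (1, 1): "constitutional monarchy / strong presidential democracy",
    (0, 1): "parliamentary democracy",
    (0, 0): "anarchy, oligarchy, mediarchy, ...",
}

def describe(genome):
    return GENOME_LABELS.get(tuple(genome), "unexplored configuration")

print(describe([0, 1]))  # -> parliamentary democracy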
4. From descriptive vectors to normative vectors

What has been said until now shall hardly be of big surprise to the expert in political theory. In one way or another, every political or historical theory from Aristotle to Toynbee addresses the same problem: to describe how, by what means and by what individual or social body the values of the variables determining the properties of a given society have been set, and what the most common (cor)relations between different variable values are. The turning point occurs when one realizes that the formal framework presented hereby —i.e. the framework which allows us to formalize political systems into vectors of variables— can be exploited not only for descriptive & explicative purposes, but that it can be normative as well. In other terms: vector-like representations of political systems can help us not only to understand the systems under study; they can allow us to «run» them in an unprecedented way. But before we explain how this turn from descriptivity to normativity can occur in the case of political science, let's take some inspiration from biology.

Less than a century after Darwin's theory of evolution suggested that there exists a material substrate of heredity, such a substrate was discovered by Watson & Crick, having the form of the DNA molecule. In modern science, this molecule is conceptualized as a genome, which can be defined as an ordered sequence of genes. A gene can be defined as a «locatable region of genomic sequence corresponding to a unit of inheritance» (Pearson, 2006). From the point of view of this article, a gene is simply a variable which can code different values; for example the gene $eye_color is of type {green, blue, brown, grey, ...}. The values of the different variables determine the biochemical «unfolding» of the developmental procedure whose output is an individual living being phenotypically expressed by different properties.

Ceteris paribus, one can imagine that a society —with its laws, functions, rituals, institutions etc.— is a phenotypical expression of a vector of values —the genome— which is normally implicitly encoded in artifacts, in books of laws, or as distributed holographic information in the brains of the members of the society under study. In order to form a functional political body, every society has to integrate 1) a certain set of institutions which «execute» certain actions according to the active values of the variables (e.g. the tax collector will behave according to the value set in the variable $tax_rate), and 2) a certain set of procedures which specify how and by whom the different elements of the «society's genome» are updated. Often, these procedures are self-referential, in the sense that not only is their very execution governed by values encoded in the society's genomic vector (i.e. input parameters), but the result of their action (i.e. the output) can be formalized as an update of a value (or a set of values) of the same initial genomic vector.

As an example of such a vector update determined by the values in the very same vector, we may take a micro-society encoded by a vector of length 2, composed of a variable $tax_rate_updator having the two possible values {basileus, parliament} as its type, and a variable $tax_rate having a real value from the interval <0, 1> as its type. Subsequently, a procedure update_tax_rate() can be defined which, when executed, consults the information source referenced by the active value of the variable $tax_rate_updator in order to set the value of the variable $tax_rate.
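A sketch of this self-referential update, under the assumption that the consulted sources can be modeled as simple functions; all names and numbers are illustrative:

genome = {"$tax_rate_updator": "parliament", "$tax_rate": 0.23}

def ask_basileus():
    return 0.07  # the monarch decrees a 7% tax

def ask_parliament():
    return 0.25  # result of an aggregated parliamentary vote

SOURCES = {"basileus": ask_basileus, "parliament": ask_parliament}

def update_tax_rate():
    """Consults the source referenced by the active value of
    $tax_rate_updator and assigns the result to $tax_rate."""
    updator = genome["$tax_rate_updator"]     # input parameter from the genome
    genome["$tax_rate"] = SOURCES[updator]()  # output: update of the same genome

update_tax_rate()
print(genome)  # {'$tax_rate_updator': 'parliament', '$tax_rate': 0.25}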
Notice that the execution of the procedure can be fully automatic (procedures are denoted by a () suffix), since the only thing which is in reality going on all the time is the assignment of values to variables. We hope that at this point it is evident to the reader that, in such a case, the vector [basileus, 0.07] would be the genome of a political system where only the monarch has the right to change the tax rate from 7% of one's income, while the genome [parliament, 0.23] would encode a society whereby only the parliament has the right to change the current 23% tax rate.

We hope this small example makes it somewhat clearer that the formalism presented hereby can be exploited not only for descriptive purposes. Verily, the objective is not to offer «another formal descriptive framework» for the social sciences, for this has already been done in multitudes of papers. The objective is to indicate that in a world where many executive procedures like collect_taxes() can be fully automatized by computer programs or other artificial agents serving as tax collectors, the vectors we speak about can determine the functioning of the society, and do so in the strongest possible sense. To take our analogy from the natural sciences somewhat further, we specify that the inspiration for what will be presented in the rest of this article does not come from descriptive sciences like genetics or biochemistry; on the contrary, our aim is inspired by the constructive aims of genetic engineering.

5. From Serial Model to Parallel Model

Who can prove that from the earliest human societies until our present situation, the functioning of social bodies —be it a tribe of Pygmies, the macroanthropos (Plato, 2009) of classical Athens, or the European Union— has not been determined by some genome-like vector of values? It is true that the representational medium of the genomic vector changes: from the pure wetware (i.e. a brain or set of brains) of pre-literate societies, through the pillars of Ashoka the Great, towards the multilingual norms of the Union, stored and backed up in parallel in dozens of books, digital corpora or even cities. It seems, however, that one thing hasn't really changed, and that is the method by means of which the variables of utmost importance are updated in almost all existing societies. We label this method the central dogma of the Serial Model (SM) and define it as follows: «within the Serial Model, the values of the variables which determine the functioning of the society are updated in serial order, one after another».

Caesar gives a list to his scribe and orders him: «You go to the Forum Romanum, and first you engrave the first law into the stone; afterwards thou shalt engrave the second». The parliament meets, discusses one proposal, then votes on it, and only if the proposal is voted for by the majority shall the genome of the state be modified; afterwards another proposal is discussed and voted on. The university's council meets and discusses the order of the day, point after point, vote after vote. The temporal preposition «after» is crucial here.

But certain deviations from the central method exist even in a world governed by SM. Mostly they are due to the mutual independence of the institutions to which the «authority to update certain parts of the vector» has been assigned.
A circulaire de ministère can be distributed at the same moment a new law is passed. A coup d'état can occur in a society if the president tends to assign to variable X a different value than his strongest general does, or even if they both update different variables with such values that the two resulting vectors (president-generated vs. general-generated) diverge to such a degree of orthogonality (note that orthogonality is a geometric term) that they cannot be considered consistent anymore. Or, in a somewhat extreme but pedagogically useful case, one can imagine a Muslim scholar articulating a fatwa obliging nudity at noon in the midst of a sultanate which has just accepted sharia law. Theoretically, even under SM, such updates of different parts of the vector can occur on the very same day, even at the very same moment, because different agents can modify the values of different variables contained within the genomic vector.

Traditionally, such cases were considered to be a «bug» of the political system, and an extraordinary amount of intellectual power was invested, in the course of human history, into bug-proofing different systems by adding new watchdog institutions, or by proposing new sets of variables pretending to be hierarchically superordinate to already existing ones, as is the case for the Constitution. The final result, however, is that the length of society's genomic vector —i.e. the number of variables to be set— grows, becoming less and less comprehensible for the common human being whom it was supposed to serve in its very beginning (Hobbes, 2011), hence bringing with itself ever more room for disharmony and (cor)ruption.

But what may seem to be a bug when interpreted through the prism of the level of abstraction (Floridi, 2011) of the Serial Model can turn out to be a feature when another level of abstraction is involved in the interpretation. Such is the case for the Parallel Model, whose central dogma can be stated as follows: «Within the Parallel Model, a new value can get assigned to any variable in any moment, independently of the moment of assignment of a value to any other variable. Theoretically, the values of all variables can be changed in the very same moment or in any other moment».

Seemingly tautological and therefore useless, the preceding definition can nonetheless lead to an unprecedented sort of «transvaluation of all values» (Nietzsche, 1969) in the political domain. While all transformations in the political domain —be they small-scale reforms or full-fledged social revolutions— have simply updated the values of a few variables, of certain parts of societies' genomic vectors, a possible transition from SM to PM is not a change of content. It is a change of form; more concretely, it is realized by a transformation of the form of the procedure of voting.
6. From Parallel Model to Parallel Democratic Model

We believe that a transition from SM to PM is possible because of 1) the development of information-storage media which can be accessed for viewing and updating independently of temporal constraints, 2) the development of communication networks which allow us to access or update the informational content stored on such media independently of spatial constraints, and 3) such an extensive presence of information & communication technologies (ICT) that, at the beginning of the so-called 3rd millennium, a critical mass of the inhabitants of planet Earth can access (e.g. google) or even update (e.g. wikipedia) certain pools of informational content. In its very essence, the genomic vector —i.e. an ordered sequence of variables describing and governing the functioning of a political body— is a piece of informational content. Hence it can be stored on information-storage media and accessed or updated by means of ICTs.

To make the genomic vector of one's own polis, or at least certain parts of it, accessible and updatable by all, or at least by the biggest possible number of independent human agents, is the goal of all those who strive for participative democracy. However, even the most radical proponents of participative democracy sometimes fail to realize that the way a society stores and aggregates information strongly influences the way it can function as a political body. We have already addressed the question of storage when stating that the genomic vector of a pre-literate society was stored in a distributed fashion in the brains of a critical mass of the members of such societies (older members often had the decisive word in case of a «data check-sum error»), and indicated that a completely new system of legal formulae and institutions could have emerged from religious rituals (Coulanges, 2010) only because of the advent of writing and, later, of the printing press.

Contrary to writing, press or television, which allow many to get into passive contact with information stemming from a unique source of content, modern ICTs allow many to get into active contact with the medium encoding the informational content (McLuhan, 1965). Thanks to ICTs, shared information can be not only viewed but also updated by anyone. What's more, multiple updates of multiple informational contents can be realized in the same moment. This is of crucial importance for the implementation of PM.

It is debatable to what extent one can have a legal system which swiftly adapts to the ever-accelerating transformations of the external world if one relies solely on a printing press, where every law becomes widely known only after 1) the authority has stated the law, 2) the statement of the law has been published by means of a costly process of book preparation and printing, and 3) the book has been distributed and has reached its target. It is, however, non-debatable that, in the end, the local lawyer whose practice could be substantially transformed from the very moment he receives the new collection of laws shall have few possibilities to influence the edition of the next volume of the book. He can view, but he cannot update. And even if he could update —for example because he is lucky, virtuous or corrupt enough to be a member of parliament— his overall contribution to society's welfare is more than doubtful, since even with the best will possible he shall be, more often than not, obliged to attribute values to variables which do not concern his domain of expertise.

Voting is the most fundamental form of opinion aggregation; it is implemented in many social bodies in order to assign a certain value to a certain variable or set of variables. In its most common, SM-consistent form, the act of voting requires the voting agent to cast his vote at a voting place during a temporal interval dedicated to voting.
A vote concerns either only one variable (in case of the most simplistic yes/no referendum) or a bundle of variables (in case of passing a complex law in the parliament). Subsequently, the votes are aggregated by a voting committee (in case of elections) or by an automatic vote-aggregating device (in case of parliaments), and according to the result of the aggregation, the variable concerned by the voting is assigned (or not) a new value. Only afterwards can the body of vote-givers proceed to another vote.

13 Coulanges, 2010.
14 McLuhan, 1965.

Let's now imagine a voting scenario for PM. One can imagine, for example, a tribe inhabiting a village located in an environment so hostile that in every moment of the existence of the village, at least two thirds of the adult men are patrolling at different spots on the circumference of the village. It happens from time to time that some warriors die in battle, and sometimes their chieftain dies as well. Given a security constraint that forbids the majority of men to meet at one spot and vote, thus leaving the perimeter of the village unprotected, what method could assure that the village shall always have a chieftain respected by the biggest number of his comrades?

One can imagine the following answer: to every man of the tribe, a distinct color is associated, be it the color that only the man himself can mix. It does not really matter whether the knowledge of the color's preparation was revealed to a given individual during a certain rite of passage or whether it was transferred to him by his father; what matters is that any adult member of the tribe can use a distinct color as his unique identification token. In the middle of the village, there is a group of totems. One of the most central totems is divided into sections, for example stripes colored in different colors. The old legend states that once a man is able to mix his own distinct color, the spirit of the village shall allow him to do two things. Firstly, he can paint his stripe on the totem, hence creating his own section. Secondly, if ever he meets a man worthy of his respect, he can paint one and only one line into the stripe colored in the same color as the tattoo on the forehead of such a respectable man. And if ever, after engraving such a line into the chosen section of the totem (let's say green), one finds out that the chosen section contains more colored lines than any other section, one's duty is to go and seek as many comrades as one can find in order to tell them that the village's new chieftain is a man with a green tattoo on his forehead...

In this example, the totem represents a variable. When considered as a set, the group of all colors of the different sections of the totem $chief represents the type of that variable. When taken individually, every colored section of the totem represents a possible value of the variable. Lines on different sections represent the votes which a given possible value has obtained, and the section which has obtained the biggest number of votes —i. e. the stripe with the biggest number of distinctly colored lines on it— represents the «active value» of the variable $chief. The act of drawing a line corresponds to the act of voting, and the act of counting the lines within all the sections, with the subsequent choice of the section which contains the maximum number of lines, can be interpreted as the aggregation of votes.
There are several crucial aspects to notice in the above «totem» scenario. Primo, in spite of the fact that the majority of voters never meet at the same spot at the same time, they succeed in aggregating their votes, because they use the surface of the totem as an information storage medium. Secundo, the aggregation can possibly be executed after the cast of every individual vote, and thus just one vote can overthrow the current chieftain. Tertio, the totem in itself does not change after a new chieftain is elected; no information is lost, and therefore a chieftain who has just lost his chieftain status can quite easily regain it by obtaining two fresh votes —one which will put him into a tie with the present chieftain and another which shall put him into the lead again, hence starting a sort of «cat & mouse» game between the two chieftains. Quarto, any man can express his respect towards many possible chieftains by drawing lines into multiple stripes: it is not forbidden to give a vote to more than one candidate. One can also express one's respect to one candidate today and to another tomorrow; one can change one's mind. Quinto, it is forbidden to give more than one vote to one and the same candidate: once the respect has been expressed by drawing a line, it cannot be reinforced. From this point of view, all candidates are equal. Sexto, since no line is ever deleted from the totem, the votes of vote-givers who have already passed away can influence the result of the aggregation process until the moment when the totem-variable $chief falls into oblivion, for example due to the rising influence of other totem-variables which demand the attention of the inhabitants of the village. Septimo, given that every voter has a unique identification token, no further knowledge about the attributes of the village (e. g. the size of its population) is necessary.

This being said, it is now time to present the combination of PM with ICT-sustained participative democracy which results in a Parallel Democracy Model: «Parallel Democracy Model (PDM) is a framework allowing auto-configuration and self-adaptation of social bodies according to the aggregated collective will of the individuals who compose these bodies (e. g. virtual avatars, in case virtual networks are considered as such social bodies)» 15.

PDM aims to address several fallacies inherent to the most common variant of the SM, known as «parliamentary democracy». In case of PDM, there is no need for individuals to meet in the same moment in order to influence the functioning of the society. However, they still have to meet in one place —which can be of purely virtual nature. Many different variables related to the functioning of the social body are presented in this place simultaneously (i. e. in parallel) and in perennial fashion (i. e. from eternity to eternity). In the most free variant of PDM, any individual is free not only to vote for one or more possible values of a variable (i. e. to add a line on a stripe), but also to add a new possible value to the variable (i. e. to color a section of the totem with a new stripe), hence extending its type, or even to add a new variable (i. e. to erect a new totem). The act of voting is operationalized as the incrementation of the vote counter associated with a possible value on the storage medium.
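To make these mechanics concrete, here is a minimal Python sketch of a PDM variable modelled as a set of perennial vote counters. It is an illustration only: the names (PDMVariable, add_value, vote, active_value) are ours and do not stem from any existing implementation.

    from collections import defaultdict

    class PDMVariable:
        """A 'totem': a variable whose possible values accumulate votes.

        Votes are never deleted; the «active value» is simply the possible
        value with the highest counter at the moment of aggregation.
        """

        def __init__(self, name):
            self.name = name
            self.votes = defaultdict(int)   # possible value -> vote counter
            self.cast = set()               # (voter, value) pairs already cast

        def add_value(self, value):
            """Paint a new stripe on the totem: extend the variable's type."""
            self.votes.setdefault(value, 0)

        def vote(self, voter, value):
            """Draw one line: at most one vote per voter per possible value."""
            if (voter, value) in self.cast:
                return          # Quinto: respect, once expressed, is final
            self.cast.add((voter, value))
            self.votes[value] += 1

        def active_value(self):
            """Aggregation: can be re-run after every single vote (Secundo)."""
            return max(self.votes, key=self.votes.get) if self.votes else None

    # The polis as a whole is a dictionary of such variables; any agent may
    # update any of them at any moment, independently of the others (PM).
    polis = {"$chief": PDMVariable("$chief")}
    polis["$chief"].add_value("green")
    polis["$chief"].vote("warrior_1", "green")
    print(polis["$chief"].active_value())     # -> 'green'

Note how the aggregation is a mere argmax over the counters, which is why it can indeed be executed after every individual vote and why a dethroned chieftain is only ever two fresh lines away from regaining his status.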
Among other features, the most extreme variant of PDM makes it even possible that the vote of a person already dead can influence political reforms to come.

7. Description of the first attempts to implement the Parallel Democratic Model

Kyberia.sk is a virtual community founded by the author of this article in the year 2001. During the following years it succeeded in changing from a community of hackers, artists and philosophers into a mainstream social network, nonetheless guarding its local nature and its complete economic and political autonomy from the surrounding real-life environment. In 2008 it won the prize for the best Slovak Internet community, and in 2009 it forked from the Slovak cyberspace into the Czech cyberspace: a parallel project was launched on the kyberia.cz domain, exploiting a somewhat more evolved variant of the initial engine, which had meanwhile become open source and was published 16 under the AGPL license.

15 Hromada, 2012.

From the very beginning, one of the academic objectives of the Kyberia project was to furnish a certain virtual «in vivo» incubator for experiments with community-modeling. One such attempt was realized in the year 2003, when a new version of kyberia introduced a feature called «K». K, which was originally meant to abbreviate the term «karma» and later «kredit», became a sort of currency which 1) is distributed on a daily basis and in a certain amount to every registered user of kyberia, and 2) can be transferred by its owner to another data node (i. e. a submission, forum, blog, user, whatever). Further extensions like the K-wallet were added in the subsequent version 2.3 of kyberia's engine, thus making kyberia's K-based transaction system very similar to a normal economic system. Since the economic aspects of kyberia are of minor importance within the scope of this article, let's just clarify that the act of giving a K to a given node is very similar to what was later implemented on facebook in the form of the «I like» button. What is of importance, however, is that version 2.3 of kyberia's engine was the first 1) to attempt an implementation of PDM in order to alleviate the administrative burden placed on the shoulders of kyberia's administrators, and 2) to exploit the K-giving system as a method for casting votes.

The variable which was chosen as the first one to be subjected to PDM is the variable $page_title, whose type is text and whose «activated value» can be seen by any visitor of the page at the top of the page, in the browser's title bar (as of 15/11/2012 the $page_title is assigned the value «Remember, remember, the velvet November»). Let's inspect more closely how the value of this variable is assigned. There is a certain specific region of kyberia.sk called «Agora», where only users who were granted the status of a «senator» can give K. Within this «nodeshell» there is another «nodeshell» called «system configure» 17, where the system looks for variables and their values; in terms of the «totem scenario» from part 6 of this paper, it can be illustrated as that part of the village where the totems are erected. And within this «node» there is a node «title content» 18 which, for the automatic scripts of kyberia's engine, represents the variable $title_content. Into this variable node, any senator can add his own «child node» whose content is a «possible future value» of $title_content.
The act of adding such a «child node» into the node representing the variable $title_content is similar to the act of drawing a new stripe on the totem; the only difference is due to the variable's type: the cardinality of the type of all possible text strings is much bigger (infinitely bigger, in fact) than the finite number of possible chieftains in the village example presented above.

16 https://github.com/Kyberia/Kyberia-bloodline.
17 http://kyberia.sk/id/5604218/.
18 http://kyberia.sk/id/5604239/.

What follows is quite simple: the senators simply give Ks to one or more nodes whose content represents a possible value. They mark their line on their stripe of interest. Subsequently, every night at 2:23 AM, an automatic procedure update_title() is executed which checks which child node of the «title content» variable has obtained the biggest number of Ks, takes its content and assigns it as the value of the variable $title_content internal to kyberia's code. Such an «active value» will then be visible to all visitors of the kyberia.sk domain in the top part of their web browser.

An ignorant novice may consider it a waste of time to have such a seemingly complex machinery in order to perform such a simple change as assigning a new value to the title of the website. But the fact is that the «behind the scenes» machinery is not that complex —just a simple cron script containing 40 lines of simple php code— and is very universal: ANY global parameter of kyberia's code —be it the number of Ks distributed to users on a daily basis, the K-cost of adding a new node, or the number of Ks one has to obtain from other senators in order to become a senator— can be easily integrated into PDM by simply adding it to the «system configure» node located in the Agora of kyberia.sk.
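The actual nightly procedure is, as just stated, a short PHP cron script internal to kyberia's engine; the following Python fragment merely sketches the aggregation logic it is described as performing, with an invented node structure.

    # Illustrative sketch of the nightly aggregation described above; the
    # real update_title() is a ~40-line PHP cron script inside kyberia's
    # engine, so the node structure and names here are assumptions.

    def update_title(title_content_node, config):
        """Activate the content of the child node with the most Ks."""
        children = title_content_node["children"]  # senator-submitted values
        if children:
            winner = max(children, key=lambda child: child["k_count"])
            config["$title_content"] = winner["content"]

    # As if executed by cron every night at 2:23 AM:
    node = {"children": [
        {"content": "Remember, remember, the velvet November", "k_count": 12},
        {"content": "some other proposed title", "k_count": 7},
    ]}
    config = {}
    update_title(node, config)
    print(config["$title_content"])   # the new «active value» of the variable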
Since the conservative operators of the Slovak kyberia feel a certain reluctance to integrate more variables into PDM, more extensive «in vivo» experimentation is pursued within the scope of the much smaller, nonetheless much more liberal, domain of kyberia.cz. There, not only the title of the page (as of 21/12/2012 the $page_title = «mèδεις ageôμετρèτος eisitô μου tèn stegèn»), but 3 other variables can be set as well, the most interesting among them being the variable PDM_CONSTANT_REGISTRATION_K, which indirectly addresses the challenge «shall immigrants be accepted into our polis?», already described in the initial parts of this article as a challenge which has to be addressed by any human society. As a sufficiently big community, kyberia also has to address this challenge. Both the Czech and the Slovak kyberias share the feature that there is only one way one can become their member: 1) one has to apply for registration, and 2) one's application has to obtain a sufficient number of approval votes from already registered users (as is the case for kyberia.cz) or from senators (kyberia.sk). The number of needed registration-approving votes is addressed by the variable PDM_CONSTANT_REGISTRATION_K. Currently, the «active value» of this variable is 3 within the scope of the kyberia.cz domain, meaning that the registration application of a new user shall be approved only after it has received at least 3 K-votes. If such is the case, an automatic register_user() procedure will execute the necessary database transactions transforming the user's registration application into a full-fledged user node; subsequently the user is informed by email that he can enter the domain.

It is possible that, if ever the size of kyberia.cz grows, more and more users will propose or vote for still higher and higher values of the above-mentioned variable, in order to somehow regulate the influx of possible immigrants. On the other hand, it can also happen that the variable shall be assigned the value «zero»; in such a case a registration application could be approved even if it hasn't received any K-vote. Such a case could be quite dangerous, however, since it could lead to an uncontrollable influx of alternative egos, which are a true problem for every virtual community and for which kyberia.sk has found a partially successful solution by setting the acceptance threshold to 5 senator approval votes.
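A sketch of the approval logic just described, in the same illustrative Python; the real register_user() performs database transactions inside kyberia's engine, so everything below, from the data structures to the mail stub, is an assumption.

    PDM_CONSTANT_REGISTRATION_K = 3   # the current «active value» on kyberia.cz

    def send_email(address, body):
        print(f"mail to {address}: {body}")     # stand-in for a real mailer

    def register_user(application):
        """Turn an application into a user node once enough K-votes arrived.

        Note the danger discussed above: with the threshold set to 0,
        every application would pass immediately, unvoted.
        """
        if application["k_votes"] >= PDM_CONSTANT_REGISTRATION_K:
            application["status"] = "user"      # full-fledged user node
            send_email(application["email"], "you can now enter the domain")
            return True
        return False                            # keep awaiting approval votes

    register_user({"k_votes": 3, "email": "novice@example.org",
                   "status": "applicant"})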
8. From PDM to political engineering

Let's look more closely at the above-mentioned variable determining how many votes are needed in order to approve the registration application of a new user. We believe that it can illustrate the importance of a good choice of a variable's value in relation to the survival of a community or a society. As was already mentioned, if the value is too low, anyone can easily become a member of the community. In case of virtual communities, a system like facebook, which puts almost no constraints on user selection, can easily become a playground for toxic egos, causing the overall quality of content to go down. In case of real-life societies, such completely open societies can easily become a haven for free-riders or outlaws. But if the constraint determining the acceptance of a new member is too strict —i. e. the number of votes needed is too high— the system can easily get into a situation where fewer and fewer immigrants succeed in getting approved. This can potentially lead to the death of the community or society, especially in case of a significant user outflux (e. g. «locking out» of kyberia and investing the computational resources of one's brain into the construction of a google+ identity).

We believe that in the history of humanity it was not uncommon to see highly advanced societies perish just because the $immigration_rate variable had been assigned a non-optimal value, or because it was wrongly balanced with another set of variables contained in the society's genomic vector, e. g. those variables which determined the immigrants' subsequent integration into the society. The problem is, of course, that in a situation where no honest man can pretend to know in advance the optimal value of even a single variable, it is practically impossible to attain any kind of optimality in the case of complex sets of variables. The problem, in its very essence, is that human beings are unable to agree on what «optimality» means in case of sociopolitical bodies. For a computer scientist it is evident that there exist certain problems for which we shall never know the solution, nor even know whether we shall ever know their solution 19.

In a world where the aims to attain «the common good» and to spread «human dignity» 20 have indirectly led to the biggest demographic and ecological disasters of recorded history, one would tend to adopt a sceptical attitude expressed by the belief that the problem of the global optimality of political bodies is an unsolvable one. Yet it was many times advised that instead of falling into scepticism, it is wiser to observe in amazement 21 the wisdom of Nature. Be it in the ontogeny of a human baby or in the phylogeny of species, Nature maybe does not find the global solutions to «life, the universe and everything», but it succeeds in discovering stunningly elegant and simple local optima by means of very simple heuristics like «trial and error» and evolution.

19 Turing, 1936.
20 Mirandola, 1486.
21 Plato, 1986: 155d.

We believe that the reason why Nature succeeds in doing so is that it unceasingly permutes and mutates diverse information-carrying vectors, and that it always finds new ways —new mutation operators— to do so. As all experts in the domain of evolutionary algorithms (EA) know, the very method combining the ideas of 1) information conservation, 2) information replication and 3) information mutation can offer us sufficiently satisfactory solutions for a stunningly wide range of problems.
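A toy conserve-replicate-mutate loop in Python makes the point: it tunes a single society-like variable towards a local optimum it does not know in advance. The fitness function is, of course, a made-up stand-in; no one can write the real one, which is precisely the point of the preceding paragraphs.

    import random

    def fitness(threshold):
        """Made-up 'community survival' score, secretly peaking at 5."""
        return -(threshold - 5) ** 2

    def evolve(generations=50, offspring=8):
        value = random.uniform(0, 20)              # conservation: current state
        for _ in range(generations):
            mutants = [value + random.gauss(0, 1)  # replication + mutation
                       for _ in range(offspring)]
            best = max(mutants, key=fitness)       # selection
            if fitness(best) > fitness(value):     # trial and error
                value = best
        return value

    print(round(evolve(), 2))   # lands near 5 without knowing it in advance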
This article tries to suggest that the evolution of different configurations of political bodies can not only be described in a terminology not so distant from the one used by experts in EA; we also indicate that the act of making explicit the variables which determine the functioning of a given society —as is the case in PDM— could accelerate the search for a locally optimal political configuration. In our opinion, the advantage lies in the ability of PDM's vote-aggregation to harness the «wisdom of crowds» 22 better than the «classical» crowd-sourcing algorithm located at the very core of the different variations of the Serial Model. We may, of course, be wrong in our conclusions, but our «in vivo» social experiment with the kyberia communities hasn't furnished us with any reason to support the belief that systems based on SM-aggregation should be uncritically accepted and PDM-like variants a priori excluded. Verily, we believe that the only obstacle to a wider expansion of PDM seems to be SM's strong «social inertia», and not a flaw inherent to PDM itself.

As there is no order without conservation, there is no evolution without mutation. If, in case of political bodies, mutation can be operationalized as a modification of a variable's value, then it follows that methods of opinion-aggregation can be interpreted as mutation operators whenever the execution of such a method results in an update of a variable included in the society's genomic vector. Simply stated: as members of different virtual communities, as citizens of a polis aiming to apply the principles of participative democracy, or simply as holders of a passport of the Union, we all have the possibility to contribute to the final output of a variable-updating operator. Whether we want it or not, we are all co-engineers of the political body which envelops us like a mother's womb. Our actions contribute to mutations of the vector which generates this matrix, and this matrix subsequently influences our future actions and choices.

In the majority of cases, this dialectic between the agent and his sociopoliticohistoricoecolognomical environment is implicit and hidden behind strata of constitutions, laws and institutions. The objective of the model hereby proposed is to make at least certain parts of this dialectic more explicit. Our hope is that by making things —values, variables, vectors, models— explicit, we make them accessible to conscious reflexion. By making them accessible to conscious reflexion, and by subsequently transforming these structures according to this very reflexion, we let consciousness co-construct our shared world, hoping that consciousness and reason shall help us to reduce to zero the probability of participating in the construction of a world about which it was already stated: «this is not the world I love» 23. Hopefully, by reducing the possibility of one day waking into such a world, we shall gradually raise the feeling that the corner of the universe we slowly learn to inhabit... is our home 24.

22 Surowiecki, 2005.
23 Lévi-Strauss, 2005.
24 Kauffman, 1996.

Reducing and raising, incrementing and decrementing, encoding and decoding, analysing but also uniting —such are the tools of a conceptual engineer. Since in this article we have employed these tools for the purposes of political science, we find it appropriate to end this excursion by proposing the following definition of Political Engineering: «Political engineering is the science and the art of adjusting the values of the variables which determine the functioning of a political body», and we terminate with the conclusion that it is left to the engineer's own choice whether (s)he wants this adjustment to be done in accordance with the external environment, or with internal intentions.

Bibliography

Bourdieu, P. (1984). Distinction: A Social Critique of the Judgement of Taste, Cambridge, MA, Harvard University Press.
Coulanges, F. de (2010). La cité antique: Étude sur le culte, le droit, les institutions de la Grèce et de Rome (1893), Cambridge, Cambridge University Press.
Floridi, L. (2011). The Philosophy of Information, Oxford, Oxford University Press.
Hobbes, T. (2011). Leviathan (1651), Empire Books.
Hromada, D. D. (2012). Initiation to Parallel Democracy Model. Presented at the Fabrique de la Loi, École des Sciences Politiques, Paris.
Kauffman, S. (1996). At Home in the Universe: The Search for the Laws of Self-Organization and Complexity, Oxford, Oxford University Press.
Lévi-Strauss, C. (1967). Structural Anthropology, New York, Doubleday Anchor Books.
Lévi-Strauss, C. (2005). Interview with Claude Lévi-Strauss, Télévision France 2: http://www.youtube.com/watch?v=bT8sFygU8fY.
McLuhan, M. (1965). The Gutenberg Galaxy: The Making of Typographic Man, Toronto, University of Toronto Press.
Mirandola, G. P. D. (1971). Oration on the Dignity of Man (1486), Gateway.
Nietzsche, F. (1969). Umwertung aller Werte, Band 1, Deutscher Taschenbuch Verlag.
Pearson, H. (2006). Genetics: What is a gene?, in «Nature», 441, 398-401.
Plato (1986). Theaetetus: Part I of The Being of the Beautiful, Chicago, University of Chicago Press.
Plato (2009). The Republic, Cambridge, Cambridge University Press.
Surowiecki, J. (2005). The Wisdom of Crowds, New York, Anchor.
Turing, A. (1936). On Computable Numbers, with an Application to the Entscheidungsproblem, in «Proceedings of the London Mathematical Society», s2-42, 230-265.
Foreword from the Congress Chairs

For the Turing year 2012, AISB (The Society for the Study of Artificial Intelligence and Simulation of Behaviour) and IACAP (The International Association for Computing and Philosophy) merged their annual symposia/conferences to form the AISB/IACAP World Congress. The congress took place 2–6 July 2012 at the University of Birmingham, UK. The Congress was inspired by a desire to honour Alan Turing, and by the broad and deep significance of Turing's work to AI, the philosophical ramifications of computing, and philosophy and computing more generally. The Congress was one of the events forming the Alan Turing Year.

The Congress consisted mainly of a number of collocated Symposia on specific research areas, together with six invited Plenary Talks. All papers other than the Plenaries were given within Symposia. This format is perfect for encouraging new dialogue and collaboration both within and between research areas. This volume forms the proceedings of one of the component symposia. We are most grateful to the organizers of the Symposium for their hard work in creating it, attracting papers, doing the necessary reviewing, defining an exciting programme for the symposium, and compiling this volume. We also thank them for their flexibility and patience concerning the complex matter of fitting all the symposia and other events into the Congress week.

John Barnden (Computer Science, University of Birmingham), Programme Co-Chair and AISB Vice-Chair
Anthony Beavers (University of Evansville, Indiana, USA), Programme Co-Chair and IACAP President
Manfred Kerber (Computer Science, University of Birmingham), Local Arrangements Chair

Foreword from the Workshop Chairs

2010 marked the 60th anniversary of the publication of Turing's paper in which he outlined his test for machine intelligence. Turing suggested that consideration of genuine machine thought should be replaced by use of a simple behaviour-based process in which a human interrogator converses blindly with a machine and another human. Although the precise nature of the test has been debated, the standard interpretation is that if, after five minutes of interaction, the interrogator cannot reliably tell which respondent is the human and which the machine, then the machine can be qualified as a 'thinking machine'. Through the years, this test has become synonymous with 'the benchmark' for Artificial Intelligence in popular culture.

There is both widespread dissatisfaction with the 'Turing test' and widespread need for intelligence testing that would allow us to direct AI research towards general intelligent systems and to measure success. There are a host of test beds and specific benchmarks in AI, but there is no agreement on what a general test should even look like. Such a test, however, would seem exceedingly useful for the direction of research and funding. A crucial feature of the desired intelligence is the ability to act successfully in an environment that cannot be fully predicted at design time, i.e. the aim is to produce systems that behave robustly in a complex changing environment rather than in virtual or controlled environments. The more complex and changing the environment, however, the harder it becomes to produce tests that allow any kind of benchmarking.
Intelligence testing is thus an area where philosophical analysis of the fundamental concepts can be useful for cutting-edge research. There has recently been a growing interest in simulating and testing in machines not just intelligence, but also other human mental phenomena, like qualia. The challenge is twofold: the creation of conscious artificial systems, and the understanding of what human consciousness is and how it might arise. The appeal of the Turing Test is that it takes an abstract inner process and renders it as an observable behaviour; in this way, in principle, it allows us to establish a criterion with which we can evaluate technological artefacts on the same level as humans. New advances in the cognitive sciences and consciousness studies suggest it may be useful to revisit this test, which has been done through a number of symposia and competitions; consolidated efforts were attempted in 2010 and 2011 at the AISB Conventions through the TCIT symposia. This year's symposium forms the consolidated effort of a larger group of researchers in the field of machine intelligence to revisit, debate, and reformulate (if possible) the Turing test into a comprehensive intelligence test that may more usefully be employed to evaluate 'machine intelligence' during the 21st century.

The Chairs: Vincent C. Müller (Anatolia College/ACT & University of Oxford) and Aladdin Ayesh (De Montfort University)

With the Support of: Mark Bishop (Goldsmiths, University of London), John Barnden (University of Birmingham), Alessio Plebe (University of Messina) and Pietro Perconti (University of Messina)

The Program Committee: Raul Arrabales (Carlos III University of Madrid), Antonio Chella (University of Palermo), Giuseppe Trautteur (University of Napoli Federico II), Rafal Rzepka (Hokkaido University) … plus the Organizers Listed Above

The website of our symposium is at http://www.pt-ai.org/turing-test

Cite as: Müller, Vincent C. and Ayesh, Aladdin (eds.) (2012), Revisiting Turing and his Test: Comprehensiveness, Qualia, and the Real World (AISB/IACAP Symposium) (Hove: AISB). Surname, Firstname (2012), 'Paper Title', in Vincent C. Müller and Aladdin Ayesh (eds.), Revisiting Turing and his Test: Comprehensiveness, Qualia, and the Real World (AISB/IACAP Symposium) (Hove: AISB), xx-xx.

Table of Contents

Foreword from the Congress Chairs 3
Foreword from the Workshop Chairs 4
Daniel Devatman Hromada: From Taxonomy of Turing Test-Consistent Scenarios Towards Attribution of Legal Status to Meta-modular Artificial Autonomous Agents 7
Michael Zillich: My Robot is Smarter than Your Robot: On the Need for a Total Turing Test for Robots 12
Adam Linson, Chris Dobbyn and Robin Laney: Interactive Intelligence: Behaviour-based AI, Musical HCI and the Turing Test 16
Javier Insa, Jose Hernandez-Orallo, Sergio España, David Dowe and M. Victoria Hernandez-Lloreda: The anYnt Project Intelligence Test (Demo) 20
Jose Hernandez-Orallo, Javier Insa, David Dowe and Bill Hibbard: Turing Machines and Recursive Turing Tests 28
Francesco Bianchini and Domenica Bruni: What Language for Turing Test in the Age of Qualia? 34
Paul Schweizer: Could there be a Turing Test for Qualia? 41
Antonio Chella and Riccardo Manzotti: Jazz and Machine Consciousness: Towards a New Turing Test 49
William York and Jerry Swan: Taking Turing Seriously (But Not Literally) 54
Hajo Greif: Laws of Form and the Force of Function: Variations on the Turing Test 60

From Taxonomy of Turing Test-Consistent Scenarios Towards Attribution of Legal Status to Meta-modular Artificial Autonomous Agents

Daniel Devatman Hromada (Slovak Technical University, Faculty of Electrical Engineering and Information Technology, Institute of Control and Industrial Informatics, Bratislava, Slovakia; email: hromi@kyberia.sk)

Abstract. The original Turing Test is modified in order to take into account the age/gender of a judge who evaluates the machine, and the age/gender of a human with whom the machine is compared during evaluation. This yields a basic taxonomy of Turing Test-consistent scenarios which is subsequently extended by taking into account the type of intelligence being evaluated. Consistently with the Theory of Multiple Intelligences, nine basic intelligence types are proposed, and an example of a possible scenario for the evaluation of emotional intelligence in early stages of development is given. It is suggested that specific intelligence types can subsequently be grouped into a hierarchy at the top of which is seated an Artificial Intelligence labelled as «meta-modular». Finally, it is proposed that such a meta-modular AI should be defined as an Artificial Autonomous Agent and should be given all the rights and responsibilities according to the age of the human counterparts in comparison with whom the AI under question has passed the Turing Test.

[Body of the paper, pp. 7–11 of the proceedings; only Table 2 is recoverable:]

Table 2. Clustering of basic intelligence types into basic intelligence clusters
ETT infix | Intelligence group | Subordinated intelligence types
[?] | Corporal group | Organic; Spatial; Somato-sexual
BA | Babbling group | Moral; Emotional; Linguistic
SE | Sensual group | Mathematico-logical; Musical; Visual
My Robot is Smarter than Your Robot - On the Need for a Total Turing Test for Robots

Michael Zillich 1

Abstract. In this position paper we argue for the need of a Turing-like test for robots.
While many robotic demonstrators show impressive but often very restricted abilities, it is very difficult to assess how intelligent such a robot can be considered to be. We thus propose a test, comprised of a (simulated) environment, a robot, a human tele-operator and a human interrogator, that allows one to assess whether a robot behaves as intelligently as a human tele-operator (using the same sensory input as the robot) with respect to a given task.

1 Vienna University of Technology, Austria, email: zillich@acin.tuwien.ac.at

1 INTRODUCTION

The Turing Test [35] considered the equivalent of a brain in a vat, namely an AI communicating with a human interrogator solely via written dialogue. Though this did not preclude the AI from having acquired the knowledge that it is supposed to display via other means, for example extended multi-sensory interactions within a complex dynamic environment, it did narrow down what is considered as relevant for the display of intelligence. Intelligence however encompasses more than language. Intelligence, in all its flavours, developed to provide a competitive advantage in coping with a world full of complex challenges, such as moving about, manipulating things (though not necessarily with hands), hiding, hunting, building shelter, caring for offspring, building social contacts, etc. In short, intelligence needs a whole world to be useful in, which prompted Harnad to propose the Total Turing Test [19], requiring responses to all senses, not just formatted linguistic input. Note that we do not make an argument here about the best approach to explain the emergence of intelligence (though we consider it likely that a comprehensive embodied perspective will help), but only about how to measure intelligence without limiting it to only a certain aspect.

The importance of considering all aspects of intelligence is also fully acknowledged in robotics, where agents situated in the real world are faced with a variety of tasks, such as navigation and map building, object retrieval, or human-robot interaction, which require various aspects of intelligence in order to be successfully carried out in spite of all the challenges of complex and dynamic scenes. So robotics can serve as a testbed for many aspects of intelligence. In fact it is the more basic of the above aspects of intelligence that still pose major difficulties. This is not to say that there was no progress over the years. In fact there are many impressive robot demonstrators now displaying individual skills in specific environments, such as bipedal walking in the Honda Asimo [6] or quadruped walking in the Boston Dynamics BigDog [32], learning to grasp [25, 33], navigation in the Google Driverless Car, or even preparing pancakes [11]. For many of these demonstrators, however, it is easy to see where the limitations lie, and typically the designers are quick to admit that this sensor placement or that choice of objects was a necessary compromise in order to concentrate on the actually interesting research questions at hand. This makes it difficult, however, to quantitatively compare the performance of robots. Which robot is smarter: the pancake-flipping robot of [11] 2, the beer-fetching PR2 3 or the pool-playing PR2 4? We will never know.
A lot of work goes into these demonstrators, to do several runs at conferences or fairs and shoot videos before they are shelved or dismantled again, but it is often not clear what was really learned in the end. This is a shame, because certainly some challenges were met with interesting solutions. But the limits of these solutions were not explored within the specific experimental setup of the demo. So what we argue for is a standardised, repeatable test for complete robotic systems. This should test robustness in basic "survival" skills, such as not falling down stairs, not running into mirrors or getting caught in cables, as well as advanced tasks, such as object search, learning how to grasp, or human-robot interaction including natural language understanding.

2 www.youtube.com/watch?v=4usoE981e7I
3 www.willowgarage.com/blog/2010/07/06/beer-me-robot
4 www.willowgarage.com/blog/2010/06/15/pr2-plays-pool

2 RELATED WORK

2.1 Robot Competitions

Tests are of course not new in the robotics community. There are many regular robot challenges which have been argued to serve as benchmarks [12], such as RoboCup [24] with its different challenges (Soccer, Rescue, @Home), the AAAI Mobile Robot Competitions [1], or challenges with an educational background like the US FIRST Robotics Competitions [8] or EUROBOT [3]. Furthermore there are specific targeted events such as the DARPA Grand Challenges 2004 and 2005 and the DARPA Urban Challenge 2007 [2]. While these events present the state of the art and highlight particularly strong teams, they only offer a snapshot at a particular point in time. And although these events typically provide a strict rule book, with clear requirements and descriptions of the scenarios, the experiments are not repeatable and the test arena will be dismantled after the event (with the exception of simulations, of course). So while offering the ultimate real-world test in a challenging and competitive setting, and thus providing very important impulses for robotics research, these tests are not suitable because a) they are not repeatable, b) rules keep changing to increase difficulty and maintain a challenging competition, and c) the outcomes depend a lot on factors related to the team (team size and funding, quality of team leadership) rather than the methods employed within the robot.

2.2 Robotic Benchmarks

The robotics community realised the need for repeatable quantitative benchmarks [15, 21, 26, 27], leading to a series of workshops, such as the Performance Metrics for Intelligent Systems (PerMIS), the Benchmarks in Robotics Research and the Good Experimental Methodology in Robotics series, and initiatives such as the EURON Benchmarking Activities [4] or the NIST Urban Search And Rescue (USAR) testbed [7]. Focusing on one enabling capability at a time, some benchmarks concentrate on path planning [10], obstacle avoidance [23], navigation and mapping [9, 13], visual servoing [14], grasping [18, 22] or social interaction [34, 20]. Taking whole robotic systems into account, [16] propose benchmarking biologically inspired robots based on pursuit/evasion behaviour. Also [29] test complete cognitive systems in a task requiring the robot to find feeders in a maze and compete with other robots.

2.3 Robot Simulators

Robotics has realised the importance of simulation environments early on, and a variety of simulators exist.
One example is Player/Stage [17], a robot middleware framework and 2D simulation environment intended mostly for navigation tasks, together with its extension to a full 3D environment, Gazebo [5], which uses a 3D physics engine to simulate realistic 3D interactions such as grasping and has recently been chosen as the simulation test bed for the DARPA Robotics Challenge for disaster robots. [28] is another full 3D simulator, used e.g. for the simulation of robotic soccer players. Some simulators, such as [30] and [36], are specialised for the precise simulation of robotic grasping. These simulators are valuable tools for debugging specific methods, but their potential as a common testbed to evaluate complete robotic systems in a set of standardised tasks has not been fully explored yet.

In summary, we have on the one hand repeatable, quantitative benchmarks mostly tailored to sub-problems (such as navigation or grasping) and on the other hand competitions testing full systems at singular events, where both of these make use of a mixture of simulations and data gathered in the real world.

3 THE TOTAL TURING TEST FOR ROBOTS

What has not fully emerged yet, however, is a comprehensive test suite for complete robotic systems, maintaining a clearly specified test environment plus supporting infrastructure for an extended period of time and allowing performance evaluation and comparison of different solutions as well as measurement of their evolution over time. What this test suite should assess is the overall fitness of a robotic system to cope with the real world and behave intelligently in the face of unforeseen events, incomplete information etc. Moreover the test should ideally convey its results in an easily accessible form also to an audience beyond the robotics research community, allowing other disciplines such as Cognitive Science and Philosophy as well as the general public to assess the progress of the field, beyond eye-catching but often shallow and misleading demos. Harnad's [19] Total Turing Test provides a fitting paradigm, requiring that "The candidate [the robot] must be able to do, in the real world of objects and people, everything that real people can do, in a way that is indistinguishable (to a person) from the way real people do it." "Everything" will of course have to be broken down into concrete tasks with increasing levels of difficulty. And the embodiment of the robot will place constraints on the things it can do in the real world, which has to be taken into account accordingly.

3.1 The Test

The test would consist of a given scene and a set of tasks to be performed by either an autonomous robot or a human tele-operating a robot (based on precisely the same sensor data the robot has available, such as perhaps only a laser ranger and bumpers). A human interrogator would assign tasks to the robot, and also place various obstacles that interfere with successful completion. If the human interrogator cannot distinguish the performance of the autonomous robot from the performance of the tele-operated robot, the autonomous robot can be said to be intelligent with respect to the given task. Concretely, the test would have to consist of a standardised environment with a defined set of tasks, as is e.g. common in the RoboCup@Home challenges (fetch an item, follow a user). The test suite would provide an API, e.g. based on the increasingly popular Robot Operating System (ROS) [31], allowing each robot to be connected to it with moderate effort.
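As a schematic illustration of this protocol, the following Python fragment simulates a batch of trials in which an interrogator attributes anonymized runs to either the robot or the tele-operator; all names and the toy behaviour model are our assumptions, not part of the proposed test suite.

    import random

    def execute(task, actor):
        """Toy stand-in for a (simulated) run; returns an anonymized trace."""
        low = 0.4 if actor == "robot" else 0.7    # assumed quality ranges
        return {"task": task, "quality": random.uniform(low, 1.0)}

    def interrogate(trace):
        """Toy interrogator: clumsy runs look 'robotic' to it."""
        return "robot" if trace["quality"] < 0.6 else "tele-operator"

    actors = random.choices(["robot", "tele-operator"], k=100)
    wrong = sum(a != interrogate(execute("fetch_item", a)) for a in actors)
    print(f"wrong attributions: {wrong}/100")
    # ~50 wrong attributions out of 100 would mean chance level, i.e. the
    # autonomous robot is indistinguishable from the tele-operator here.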
Various obstacles and events could be made to interfere with the execution of these tasks, such as cables lying on the floor, closed glass doors, or stubborn humans blocking the way. Different challenges will pose different problems for different robots: for the popular omnidirectional drives of holonomic bases such as the Willow Garage PR2, cables on the floor represent insurmountable obstacles, while other robots will have difficulties navigating in tight environments.

3.2 Simulation

A basic building block for such a test suite is an extension of available simulation systems to allow fully realistic simulation of all aspects of robotic behaviour. The simulation environment would have to provide photo-realistic rendering with accurate noise models (such as lens flares or the poor dynamic range found in typical CCD cameras), beyond the visually pleasing but much too "clean" rendering of available simulators. Also the physics simulation will have to be very realistic, which means that the simulation might not be able to run in real time. Real time, however, is not necessarily a requirement for a simulation as long as the computation times of the employed methods are scaled accordingly. Furthermore, the simulation would also need to contain humans, instructing the robot in natural language, handing over items or posing as dynamic obstacles for navigation.

Figure 1 shows a comparison of a robot simulated (and in this case tele-operated) in a state-of-the-art simulator (Gazebo) with the corresponding real robot carrying out the same task autonomously as part of a competition [37]. While the simulation could in this case provide a reasonably realistic physics simulation (leading to objects slipping out of the hand if not properly grasped) and simulation of sensors (generating e.g. problems for stereo reconstruction in low-texture areas), more detailed simulations will be needed to capture more aspects of the real world.

Figure 1. Comparison of (tele-operated) simulation and (autonomous) real robot in a fetch and carry task.

3.3 Tasks and Stages

The test would be set up in different tasks and stages. Note that we should not require a robot to do everything that real people can do (as originally formulated by Harnad). Robots are after all designed for certain tasks, requiring only a specific set of abilities (language understanding, being equipped with a gripper, the ability to traverse outdoor terrain, etc.), and we are interested in their capabilities related to these tasks. The constraints of a given robot configuration (such as the ability to understand language) then apply to the robot as well as to the human tele-operator. Stages would be set up with increasing difficulty, such that a robot can be said to be stage-1 safe for the fetch and carry task (all clean, static environment) but to fail stage 2 in 20% of cases (e.g. unforeseen obstacles, changing lighting). The final stages would be a real-world test in a mock-up constructed to follow the simulated world. While the simulation would be a piece of software available for download, the real-world test would be held as an annual competition much like RoboCup@Home, with rules and stages of difficulty according to the simulation.
Note that unlike in RoboCup@Home, these would remain fixed, rather than change with each year.

3.4 Evaluation

The test would then have two levels of evaluation.

Pass/fail test This evaluation would simply measure the percentage of runs in which the robot successfully performs a task (at a given stage). This would be an automated assessment and would allow developers to continuously monitor the progress of their system.

Intelligence test This would be the actual Total Turing Test, with human interrogators assessing whether a task was performed (successfully or not) by a robot or by a human tele-operator. The score would be related to the percentage of wrong attributions (i.e. how often robot and tele-operator were indistinguishable). Test runs with human tele-operators would be recorded once and stored for later comparison with submitted robot runs. The requirement of collecting statistics from several interrogators means that this test is more elaborate and would be performed at longer intervals, such as during annual competitions. This evaluation then makes it possible to assess the intelligence of a robot (with respect to a given task) in coping with the various difficulties posed by a real environment.

The setup of tasks and stages makes it possible to map the abilities of a given robot. Figure 2 shows the scores of a fictional robot. The robot is equipped with a laser ranger and camera and can thus perform the navigation tasks as well as follow a human, but it lacks an arm for carrying objects or opening doors, as well as the communication capabilities required for the human guidance task.

Figure 2. Example scores for a fictional robot equipped with a laser ranger and camera, but no arm and no language capabilities. Panels (a) and (b) show scores on the pass/fail test and the intelligence test respectively.

As can be seen, the robot can be considered stage-1 intelligent with respect to the random navigation task (driving around randomly without colliding or getting stuck), i.e. it is indistinguishable from a human tele-operator driving randomly in the perfect simulated environment. It also achieves perfect success rates in this simple setting. Performance in the real world under perfect conditions (stage 4) is slightly worse (the simulation could not capture all the eventualities of the real world, such as wheel friction). Performance with added difficulties (such as small obstacles on the floor) decreases, especially in the real-world condition. Performance drops in particular with respect to the tele-operator, and so it quickly becomes clear to the interrogators which is the robot and which the tele-operator: the robot makes increasingly "stupid mistakes", such as getting stuck when there is an obvious escape. Accordingly, the intelligence score drops quickly. The robot can also be said to be fairly stage-1 and stage-4 intelligent with respect to navigation and human following, and slightly less intelligent with respect to finding objects. In this respect, modern vacuum cleaning robots (the more advanced versions including navigation and mapping capabilities) can be considered intelligent with respect to the cleaning task, as their performance there will generally match that of a human tele-operating such a robot. For more advanced tasks including object recognition, grasping or dialogue, the intelligence of most robots will quickly degrade to 0 for any stage beyond 1.
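To make the two levels of evaluation concrete, here is a minimal sketch of how the scores could be computed from recorded runs. The text only specifies that the intelligence score is "related to" the percentage of wrong attributions, so the particular mapping below (chance-level confusion scoring 1.0) is one assumed choice, and the record fields are hypothetical.

```python
def pass_fail_score(runs):
    """Fraction of runs in which the task was completed at a given stage.

    Each run is a dict such as {"task": "fetch", "stage": 1, "success": True}.
    """
    return sum(r["success"] for r in runs) / len(runs)


def intelligence_score(attributions):
    """Turing-style score from interrogator judgements.

    Each attribution is a dict like {"judged": "robot", "actual": "human"}.
    If interrogators are wrong half the time, robot and tele-operator are
    indistinguishable and the score saturates at 1.0; if they are never
    wrong, the robot is trivially identifiable and the score is 0.0.
    """
    wrong = sum(a["judged"] != a["actual"] for a in attributions)
    return min(1.0, 2.0 * wrong / len(attributions))


# Example: 6 of 20 attributions wrong -> score 0.6
print(intelligence_score(
    [{"judged": "robot", "actual": "robot"}] * 14
    + [{"judged": "robot", "actual": "human"}] * 6))
```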
4 CONCLUSION

We proposed a test paradigm for intelligent robotic systems, inspired by Harnad's Total Turing Test, that goes beyond current benchmarks and robot competitions. This test would provide a pragmatic definition of intelligence for robots: the capability to perform as well as a tele-operating human at a given task. Moreover, test scores would be a good indicator of whether a robot is ready for the real world, i.e. is endowed with enough intelligence to overcome unforeseen obstacles and to avoid getting trapped in "stupid" situations.

There are, however, several technical and organisational challenges to be met. Running realistic experiments will require simulators of considerably improved fidelity, but these technologies are becoming increasingly available, thanks in part to developments in the gaming industry. Allowing researchers to simply plug in their systems will require a careful design of interfaces to ensure that all capabilities are adequately covered. The biggest challenge might actually be the definition of environments, tasks and stages. This will have to be a community effort and draw on the experiences of previous benchmarking efforts.

ACKNOWLEDGEMENTS

The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement No. 215181, CogX, and from the Austrian Science Fund (FWF) under project TRP 139-N23 InSitu.

REFERENCES

[1] AAAI Mobile Robot Competition, http://www.aaai.org/Conferences/AAAI/2007/aaai07robot.php.
[2] DARPA Grand Challenge, http://archive.darpa.mil/grandchallenge.
[3] Eurobot, http://www.eurobot.org.
[4] EURON Benchmarking Initiative, www.robot.uji.es/EURON/en/index.html.
[5] Gazebo 3D multi-robot simulator, http://gazebosim.org.
[6] Honda ASIMO, http://world.honda.com/ASIMO.
[7] NIST Urban Search And Rescue (USAR), http://www.nist.gov/el/isd/testarenas.cfm.
[8] US FIRST Robotics Competition, www.usfirst.org.
[9] Benjamin Balaguer, Stefano Carpin, and Stephen Balakirsky, 'Towards Quantitative Comparisons of Robot Algorithms: Experiences with SLAM in Simulation and Real World Systems', in IROS Workshop on Benchmarks in Robotics Research, (2007).
[10] J. Baltes, 'A benchmark suite for mobile robots', in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000), volume 2, pp. 1101–1106, IEEE, (2000).
[11] Michael Beetz, Ulrich Klank, Ingo Kresse, Alexis Maldonado, Lorenz Mösenlechner, Dejan Pangercic, Thomas Rühr, and Moritz Tenorth, 'Robotic Roommates Making Pancakes', in 11th IEEE-RAS International Conference on Humanoid Robots, (2011).
[12] S. Behnke, 'Robot competitions - Ideal benchmarks for robotics research', in Proc. of IROS 2006 Workshop on Benchmarks in Robotics Research, (2006).
[13] Simone Ceriani, Giulio Fontana, Alessandro Giusti, Daniele Marzorati, Matteo Matteucci, Davide Migliore, Davide Rizzi, Domenico G. Sorrenti, and Pierluigi Taddei, 'Rawseeds ground truth collection systems for indoor self-localization and mapping', Autonomous Robots, 27(4), 353–371, (2009).
[14] Enric Cervera, 'Cross-Platform Software for Benchmarks on Visual Servoing', in IROS Workshop on Benchmarks in Robotics Research, (2006).
[15] R. Dillmann, 'Benchmarks for Robotics Research', Technical report, EURON, (2004).
[16] Malachy Eaton, J. J. Collins, and Lucia Sheehan, 'Toward a benchmarking framework for research into bio-inspired hardware-software artefacts', Artificial Life and Robotics, 5(1), 40–45, (2001).
[17] Brian P. Gerkey, Richard T. Vaughan, and Andrew Howard, 'The Player/Stage Project: Tools for Multi-Robot and Distributed Sensor Systems', in International Conference on Advanced Robotics (ICAR), pp. 317–323, (2003).
[18] Gerhard Grunwald, Christoph Borst, and J. Marius Zöllner, 'Benchmarking dexterous dual-arm/hand robotic manipulation', in IROS Workshop on Performance Evaluation and Benchmarking for Intelligent Robots and Systems, (2008).
[19] S. Harnad, 'Other Bodies, Other Minds: A Machine Incarnation of an Old Philosophical Problem', Minds and Machines, 1, 43–54, (1991).
[20] Zachary Henkel, Robin Murphy, Vasant Srinivasan, and Cindy Bethel, 'A Proxemic-Based HRI Testbed', in Proceedings of the Performance Metrics for Intelligent Systems Workshop (PerMIS), (2012).
[21] I. Iossifidis, G. Lawitzky, S. Knoop, and R. Zöllner, 'Towards Benchmarking of Domestic Robotic Assistants', in Advances in Human Robot Interaction, eds., Erwin Prassler, Gisbert Lawitzky, Andreas Stopp, Gerhard Grunwald, Martin Hägele, Rüdiger Dillmann, and Ioannis Iossifidis, volume 14/2004 of Springer Tracts in Advanced Robotics (STAR), chapter 7, 97–135, Springer, (2005).
[22] R. Jäkel, S. R. Schmidt-Rohr, M. Lösch, and R. Dillmann, 'Hierarchical structuring of manipulation benchmarks in service robotics', in IROS Workshop on Performance Evaluation and Benchmarking for Intelligent Robots and Systems with Cognitive and Autonomy Capabilities, (2010).
[23] J. L. Jimenez, I. Rano, and I. Minguez, 'Advances in the Framework for Automatic Evaluation of Obstacle Avoidance Methods', in IROS Workshop on Benchmarks in Robotics Research, (2007).
[24] Hiroaki Kitano, Minoru Asada, Yasuo Kuniyoshi, Itsuki Noda, and Eiichi Osawa, 'RoboCup: The Robot World Cup Initiative', in IJCAI-95 Workshop on Entertainment and AI/ALife, (1995).
[25] D. Kraft, N. Pugeault, E. Baseski, M. Popovic, D. Kragic, S. Kalkan, F. Wörgötter, and N. Krüger, 'Birth of the object: Detection of objectness and extraction of object shape through object action complexes', International Journal of Humanoid Robotics, 5(2), 247–265, (2008).
[26] Raj Madhavan and Rolf Lakaemper, 'Benchmarking and Standardization of Intelligent Robotic Systems', Intelligence, (2009).
[27] Performance Evaluation and Benchmarking of Intelligent Systems, eds., Raj Madhavan, Edward Tunstel, and Elena Messina, Springer, 2009.
[28] O. Michel, 'Webots: Professional Mobile Robot Simulation', International Journal of Advanced Robotic Systems, 1(1), 39–42, (2004).
[29] Olivier Michel, Fabien Rohrer, and Yvan Bourquin, 'Rat's Life: A Cognitive Robotics Benchmark', European Robotics Symposium, 223–232, (2008).
[30] Andrew Miller and Peter K. Allen, 'Graspit!: A Versatile Simulator for Robotic Grasping', IEEE Robotics and Automation Magazine, 11(4), 110–122, (2004).
[31] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng, 'ROS: an open-source Robot Operating System', in ICRA Workshop on Open Source Software, (2009).
[32] Marc Raibert, Kevin Blankespoor, Gabriel Nelson, Rob Playter, and The BigDog Team, 'BigDog, the Rough-Terrain Quadruped Robot', in Proceedings of the 17th World Congress of the International Federation of Automatic Control, pp. 10822–10825, (2008).
[33] Ashutosh Saxena, Justin Driemeyer, and Andrew Y. Ng, 'Robotic Grasping of Novel Objects using Vision', The International Journal of Robotics Research, 27(2), 157–173, (2008).
[34] Katherine M. Tsui, Munjal Desai, and Holly A. Yanco, 'Towards Measuring the Quality of Interaction: Communication through Telepresence Robots', in Proceedings of the Performance Metrics for Intelligent Systems Workshop (PerMIS), (2012).
[35] Alan Turing, 'Computing Machinery and Intelligence', Mind, 59, 433–460, (1950).
[36] S. Ulbrich, D. Kappler, T. Asfour, N. Vahrenkamp, A. Bierbaum, M. Przybylski, and R. Dillmann, 'The OpenGRASP Benchmarking Suite: An Environment for the Comparative Analysis of Grasping and Dexterous Manipulation', in IEEE/RSJ International Conference on Intelligent Robots and Systems, (2011).
[37] Kai Zhou, Michael Zillich, and Markus Vincze, 'Mobile manipulation: Bring back the cereal box - Video proceedings of the 2011 CogX Spring School', in 8th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), pp. 873–873, IEEE, (2011).
Interactive Intelligence: Behaviour-based AI, Musical HCI and the Turing Test

Adam Linson, Chris Dobbyn and Robin Laney1

1 Faculty of Mathematics, Computing and Technology, Dept. of Computing, Open University, UK. Email: {a.linson, c.h.dobbyn, r.c.laney}@open.ac.uk

Abstract. The field of behaviour-based artificial intelligence (AI), with its roots in the robotics research of Rodney Brooks, is not predominantly tied to linguistic interaction in the sense of the classic Turing test (or, "imitation game"). Yet, it is worth noting, both are centred on a behavioural model of intelligence. Similarly, there is no intrinsic connection between musical AI and the language-based Turing test, though there have been many attempts to forge connections between them. Nonetheless, there are aspects of musical AI and the Turing test that can be considered in the context of non-language-based interactive environments, in particular when dealing with real-time musical AI, especially interactive improvisation software. This paper draws out the threads of intentional agency and human indistinguishability from Turing's original 1950 characterisation of AI. On the basis of this distinction, it considers different approaches to musical AI. In doing so, it highlights possibilities for non-hierarchical interplay between human and computer agents.

1 Introduction

The field of behaviour-based artificial intelligence (AI), with its roots in the robotics research of Rodney Brooks, is not predominantly tied to linguistic interaction in the sense of the classic Turing test (or, "imitation game" [24]). Yet, it is worth noting, both are centred on a behavioural model of intelligence. Similarly, there is no intrinsic connection between musical AI and the language-based Turing test, though there have been many attempts to forge connections between them. The primary approach to applying the Turing test to music is in the guise of so-called "discrimination tests", in which human- and computer-generated musical output are compared (for an extensive critical overview of how the Turing test has been applied to music, see [1]). Nonetheless, there are aspects of musical AI and the Turing test that can be considered in the context of non-language-based interactive environments, in particular when dealing with real-time musical AI, especially interactive improvisation software (see, for example, [23] and [8]). In this context, AI for non-hierarchical human-computer musical improvisation, such as George Lewis' Voyager [16], and Turing's imitation game are both examples of "an open-ended and performative interplay between [human and computer] agents that are not capable of dominating each other" [21].
2 Background

It is useful here to give some context to the Turing test itself. In its original incarnation, the test was proposed as a thought experiment to explain the concept of a thinking machine to a public uninitiated in such matters [24]. Rather than a litmus test of whether or not a machine could think (which is how the test is frequently understood), the test was in fact designed to help make sense of the concept of a machine that could think. Writing in 1950, Turing estimates "about fifty years' time" until the technology would be sufficient to pass a real version of the test, and states his belief "that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted". Thus his original proposal remained a theoretical formulation: in principle, a machine could be invented with the capacity to be mistaken for a human; if this goal were accomplished, a reasonable person should accept the machine as a thinking entity. He is very clear about the behaviourist underpinnings of the experiment:

May not machines carry out something which ought to be described as thinking but which is very different from what a man does? This objection is a very strong one, but at least we can say that if, nevertheless, a machine can be constructed to play the imitation game satisfactorily, we need not be troubled by this objection.

He goes on to describe the "imitation game" as one in which the machine should "try to provide answers that would naturally be given by a man". His ideas became the basis for what eventually emerged as the field of AI. As Turing emphasised, the thought experiment consisted of an abstract, "imaginable" machine that, under certain conditions to ensure a level playing field, would be indistinguishable from a human from the perspective of a human interrogator [24].

Presently, when the test is actually deployed in practice, it is easy to forget the essential role of the designer, especially given the fact that the computer "playing" the game is, to an extent, thrust into the spotlight. In a manner of speaking, the interactive computer takes centre stage, and attention is diverted from the underlying challenge set forth by Turing: to determine the specifications of the machine. Thus, one could say that in addition to being a test for a given machine, it is also a creative design challenge for those responsible for the machine. The stress is on design rather than implementation, as Turing explicitly suggests imagining that any proposed machine functions perfectly according to its specifications (see [24], p. 449). If the creative design challenge were fulfilled, the computer would behave convincingly as a human, perhaps hesitating when appropriate and occasionally refusing to answer or giving incorrect answers such as the ones Turing imagines [24]:

Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: (Pause about 30 seconds and then give as answer) 105621.
The implication of Turing's example is that the measure of success for those behind the machine lies in designing a system that is as stubborn and fallible as humans, rather than servile and (theoretically) infallible, like an adding machine.

3 Two threads unraveled

Two threads can be drawn out of Turing's behavioural account of intelligence that directly pertain to contemporary AI systems: the first concerns the kind of intentional agency suggested by his example answer, "count me out on this one"; the second concerns the particular capacities and limitations of human embodiment, such as the human inability to perform certain calculations in a fraction of a second and the human potential for error. More generally, the second thread has to do with the broadly construed linguistic, social, mental and physical consequences of human physiology. Indeed, current theories of mind from a variety of disciplines provide a means for considering these threads separately. In particular, relevant investigations that address these two threads, described in this context as intentional agency and human indistinguishability, can be found in psychology, philosophy and cognitive science.

3.1 Intentional agency

The first thread concerns the notion of intentional agency, considered here separately from the thread of human indistinguishability. Empirical developmental psychology suggests that the human predisposition to attribute intentional agency to both humans and nonhumans is present from infancy. Poulin-Dubois and Shultz chart childhood developmental stages over the first three years of life, from the initial ability to identify agency (distinguishing animate from inanimate objects) to the informed attribution of intentionality by inference of goal-directed behavior [22]. Csibra found that infants ascribed goal-directed behavior even to artificially animated inanimate objects, if the objects were secretly manipulated to display teleological actions such as obstacle avoidance [7]. Király et al. identify the source of an infant's interpretation of a teleological action: "if the abstract cues of goal-directedness are present, even very young infants are able to attribute goals to the actions of a wide range of entities even if these are unfamiliar objects lacking human features" [10].

It is important to note that in the above studies the infants were passive, remote observers, whereas the Turing test evaluates direct interaction. While the predisposition of infants suggests an important basis for such evaluation, more is needed to address interactivity. In another area of empirical psychology, a study of adults by Barrett and Johnson suggests that even a lack of apparent goals in a self-propelled (nonhuman) object can lead to the attribution of intentionality in an interactive context [2]. In particular, their test subjects used language normally reserved for humans and animals to describe the behaviour of artificially animated inanimate objects that appeared to exhibit resistance to direct control in the course of an interaction; when there was no resistance, they did not use such language. The authors of the study link the results of their controlled experiment to the anecdotal experience of frustration that arises during interactions with artifacts such as computers or vehicles that "refuse" to cooperate. In other words, in an interactive context, too much passivity by an artificial agent may negate any sense of its apparent intentionality.
This suggests that for an agent to remain apparently intentional during direct interaction, it must exhibit a degree of resistance along with the kind of adaptation to the environment that indicates its behaviour is being adjusted to attain a goal. These features appear to be accounted for in Turing's first example answer above: the answer is accommodating insofar as it is a direct response to the interrogator, but the show of resistance seems to enhance the sense of "intelligence". It is noteworthy that this particular thread, intentional agency, relates closely to Brooks' extension of intelligence to nonlinguistic, nonhuman intelligence, especially in relation to insect and other animal intelligence, which he has emulated in robotic form with his particular approach to AI (see [3]).

3.2 Human indistinguishability

The second thread, the idea that human capacities and limitations should be built into an AI system, strongly relates to many significant accounts of embodied, situated activity (see, for example, [9], [4] and [11]). These accounts focus on how the human body, brain, mind and environment fundamentally structure the process of cognition, which can be understood through observable behaviour. When dealing with AI, the focus on behaviour clearly ties back to Turing. These themes are also taken up in Brooks' behaviour-based AI approach, but, at least in his early research, he applies them primarily to nonhuman intelligence. In particular, he relates these themes to the kinds of adaptive behaviour described in the first thread.

The differing properties of the second thread come into sharper focus by returning to Turing's example, for a consideration of matters particular to humans. Although Turing's example of pausing and giving an incorrect answer is a clear example of a human limitation relative to a machine, it is possible to give an inverted example of human and machine competence that applies equally well. If the question posed to the machine were instead "Is it easy to walk from here to the nearest supermarket?", the machine's answer would depend on how its designers handled the notion of "easy to walk to". In this case, the machine must not only emulate humans' abstract cognitive limitations when solving arithmetical problems; it must also be able to respond according to human bodily limitations. One could easily imagine a failed machine calculation: the supermarket is at the end of a single straight road, with no turns, so the machine answers "yes, it is easy to walk to". But if the supermarket is very distant, or nearby but up a steep incline, then in order for the machine to give an answer that is indistinguishable from a human one, it must respond in a way that seems to share our embodied human limitations. Returning to the arithmetic example, as Doug Lenat points out, even some wrong answers are more human than others: "93 − 25 = 78 is more understandable than if the program pretends to get a wrong answer of 0 or −9998 for that subtraction problem" [14]. Although Lenat disputes the need for embodiment in AI (he prefers a central database of human common sense [13], which could likely address the "easy to walk to" example), it could be argued, following the above theoretical positions, that the set of humanlike wrong answers is ultimately determined by the "commonalities of our bodies and our bodily and social experience in the world" [11].
This second thread, which could also be characterised as the attempt to seem humanlike, is taken up in another nonlinguistic area of AI, namely musical AI. Some "intelligent" computer music composition and performance systems appear very close to achieving human indistinguishability in some respects, although this is not always their explicitly stated purpose. For example, Manfred Clynes describes a computer program that performs compositions by applying a single performer's manner of interpretation to previously unencountered material, across all instrumental voices [5]. He states that "our computer program plays music so that it is impossible to believe that no human performer is involved," which he qualifies by explaining the role of the human performer as a user of the software, who "instills the [musical performance] principles in the appropriate way". Taking an entirely different approach, David Cope argues that a Turing-like test for creativity would be more appropriate to his work than a Turing test for intelligence [6]. On the other hand, he has called his well-known project "Experiments in Musical Intelligence", and he also makes reference to "intelligent music composition". Furthermore, he states that his system generates "convincing" music in the style of a given composer (by training the system on a corpus of human-composed music), and one can infer that, in this context, "convincing" at least approximates the notion of human indistinguishability. With a more critical articulation, Pearce and Wiggins carefully differentiate between a test for what Cope calls "convincing" and a Turing test for intelligence [19]. As they point out, despite the resemblance of the two approaches, testing for intelligence is distinct from determining the "(non-)membership of a machine composition in a set of human composed pieces of music". They also note the significant difference between an interactive test and one involving passive observation.

4 Broadening the interactive horizon

One reason for isolating these two threads is to recast Turing's ideas in a wider social context, one that is better attuned to the contemporary social understanding of the role of technology research: namely, that it is primarily intended (or even expected) to enhance our lives. Outside the thought experiment, in the realm of practical application, one might redirect the resources for developing a successful Turing test candidate (e.g., for the Loebner Prize) and instead apply them toward a different kind of interactive system. This proposed system could be built so that it might be easily identified as a machine (even if occasionally mistaken for a human), which seemingly runs counter to the spirit of the Turing test. However, with an altered emphasis, one could imagine the primary function of such a machine as engaging humans in a continuous process of interaction, for a variety of purposes, including (but not limited to) stimulating human creativity and providing a realm for aesthetic exploration. One example of this kind of system is musical improvisation software that interacts with human performers in real time, in a mutually influential relationship between human and computer, such as Lewis' Voyager.
In his software design, the interaction model strongly resembles the way in which Turing describes a computer's behaviour: it is responsive, yet it does not always give the expected answer, and it might interrupt the human interlocutor or steer the interaction in a different direction (see [16]). In the case of an interactive improvising music system, the environment in which the human and computer interact is not verbal conversation but, rather, a culturally specific aesthetic context for collaborative music-making. In this sense, a musical improvisation is not an interrogation in the manner presented by Turing, yet "test" conversations and musical improvisations are both examples of free-ranging and open-ended human-computer interaction. Among other things, this kind of interaction can serve as a basis for philosophical enquiry and cognitive theory that is indeed very much in the spirit of Turing's 1950 paper [24] (see also [15] and [17]).

Adam Linson's Odessa is another intelligent musical system that is similarly rooted in freely improvised music (for a detailed description, see [18]). It borrows from Brooks' design approach in modelling the behaviour of an intentional agent, thus clearly taking up the first thread drawn out here. Significantly, it isolates this thread (intentional agency) for study by abstaining from a direct implementation of many of the available methods for human emulation (aimed at the second thread), thus resulting in transparently nonhuman musical behaviour. Nonetheless, initial empirical studies suggest that the system affords an engaging and stimulating human-computer musical interaction. As the system architecture (based on Brooks' subsumption architecture) is highly extensible, future iterations of the system may add techniques for approximating fine-grained qualities of human musicianship. In the meantime, however, further studies are planned with the existing prototype, with the aim of providing insights into aspects of human cognition as well as intelligent musical agent design.

5 Conclusion

Ultimately, whether an interactive computer system is dealing with an interrogator in the imitation game or musically improvising with a human, the system must be designed to "respond in lived real time to unexpected, real-world input" [17]. This responsiveness takes the form of what sociologist Andrew Pickering calls the "dance of agency", in which a reciprocal interplay of resistance and accommodation produces unpredictable emergent results over time [20]. This description of a sustained, continuous play of forces that "interactively stabilize" each other could be applied to freely improvised music, whether performed by humans exclusively or by humans and computers together. Pickering points out a concept similar to the process of interactive stabilisation, "heterogeneous engineering", elaborated in the work of his colleague John Law (see [12]); the latter, with its emphasis on productive output, is perhaps more appropriate to the musical context of free improvisation. Although these theoretical characterisations may seem abstract, they concretely pertain to the present topic in that they seek to address the "open-ended and performative interplay between agents that are not capable of dominating each other" [21], where the agents may include various combinations of humans, computers and other entities, and the interplay may include linguistic, musical, physical and other forms of interaction.
With particular relevance to the present context, Pickering applies his conceptual framework of agent interplay to the animal-like robots of Turing's contemporary, cybernetics pioneer Grey Walter, and to those of Brooks, designed and built decades later [21]. Returning to the main theme, following Brooks, "the dynamics of the interaction of the robot and its environment are primary determinants of the structure of its intelligence" [3]. Thus, independent of its human resemblance, an agent's ability to negotiate with an unstructured and highly dynamic musical, social or physical environment can be treated as a measure of intelligence closely aligned with what Turing thought to be discoverable with his proposed test.

REFERENCES

[1] C. Ariza, 'The interrogator as critic: The Turing test and the evaluation of generative music systems', Computer Music Journal, 33(2), 48–70, (2009).
[2] J. L. Barrett and A. H. Johnson, 'The role of control in attributing intentional agency to inanimate objects', Journal of Cognition and Culture, 3(3), 208–217, (2003).
[3] R. A. Brooks, Cambrian Intelligence: The Early History of the New AI, MIT Press, 1999.
[4] A. Clark, Being There: Putting Brain, Body, and World Together Again, MIT Press, 1997.
[5] M. Clynes, 'Generative principles of musical thought: Integration of microstructure with structure', Communication and Cognition AI, Journal for the Integrated Study of Artificial Intelligence, Cognitive Science and Applied Epistemology, 3(3), 185–223, (1986).
[6] D. Cope, Computer Models of Musical Creativity, MIT Press, 2005.
[7] G. Csibra, 'Goal attribution to inanimate agents by 6.5-month-old infants', Cognition, 107(2), 705–717, (2008).
[8] R. T. Dean, Hyperimprovisation: Computer-Interactive Sound Improvisation, AR Editions, Inc., 2003.
[9] H. Hendriks-Jansen, Catching Ourselves in the Act: Situated Activity, Interactive Emergence, Evolution, and Human Thought, MIT Press, 1996.
[10] I. Király, B. Jovanovic, W. Prinz, G. Aschersleben, and G. Gergely, 'The early origins of goal attribution in infancy', Consciousness and Cognition, 12(4), 752–769, (2003).
[11] G. Lakoff and M. Johnson, Philosophy in the Flesh: The Embodied Mind and Its Challenge to Western Thought, Basic Books, 1999.
[12] J. Law, 'On the social explanation of technical change: The case of the Portuguese maritime expansion', Technology and Culture, 28(2), 227–252, (1987).
[13] D. B. Lenat, 'Cyc: A large-scale investment in knowledge infrastructure', Communications of the ACM, 38(11), 33–38, (1995).
[14] D. B. Lenat, 'The voice of the turtle: Whatever happened to AI?', AI Magazine, 29(2), 11, (2008).
[15] G. Lewis, 'Interacting with latter-day musical automata', Contemporary Music Review, 18(3), 99–112, (1999).
[16] G. Lewis, 'Too many notes: Computers, complexity and culture in Voyager', Leonardo Music Journal, 33–39, (2000).
[17] G. Lewis, 'Improvising tomorrow's bodies: The politics of transduction', E-misférica, 4.2, (2007).
[18] A. Linson, C. Dobbyn, and R. Laney, 'Improvisation without representation: artificial intelligence and music', in Proceedings of Music, Mind, and Invention: Creativity at the Intersection of Music and Computation, (2012).
[19] M. Pearce and G. Wiggins, 'Towards a framework for the evaluation of machine compositions', in Proceedings of the AISB, pp. 22–32, (2001).
[20] A. Pickering, The Mangle of Practice: Time, Agency, and Science, University of Chicago Press, 1995.
[21] A. Pickering, The Cybernetic Brain: Sketches of Another Future, University of Chicago Press, 2010.
[22] D. Poulin-Dubois and T. R. Shultz, 'The development of the understanding of human behavior: From agency to intentionality', in Developing Theories of Mind, eds., Janet W. Astington, Paul L. Harris, and David R. Olson, 109–125, Cambridge University Press, (1988).
[23] R. Rowe, Machine Musicianship, MIT Press, 2001.
[24] A. M. Turing, 'Computing machinery and intelligence', Mind, 59(236), 433–460, (1950).

The ANYNT Project Intelligence Test Λone

Javier Insa-Cabrera1, José Hernández-Orallo2, David L. Dowe3, Sergio España4 and M. Victoria Hernández-Lloreda5

1 DSIC, Universitat Politècnica de València, Spain. email: jinsa@dsic.upv.es
2 DSIC, Universitat Politècnica de València, Spain. email: jorallo@dsic.upv.es
3 Clayton School of Information Technology, Monash University, Australia. email: david.dowe@monash.edu
4 PROS, Universitat Politècnica de València, Spain. email: sergio.espana@pros.upv.es
5 Universidad Complutense de Madrid, Spain. email: vhlloreda@psi.ucm.es
6 http://users.dsic.upv.es/proy/anynt/

Abstract. All tests in psychometrics, comparative psychology and cognition which have been put into practice either lack a mathematical (computational) foundation or lack the capability of being applied to any kind of system (humans, non-human animals, machines, hybrids, collectives, etc.); in fact, most of them lack both. In the past fifteen years, some efforts have been made to derive intelligence tests from formal intelligence definitions, or vice versa, grounded on computational concepts. However, some of these approaches have not been able to create universal tests (i.e., tests which can evaluate any kind of subject) and others have even failed to produce a feasible test. The ANYNT project was conceived to explore the possibility of defining formal, universal and anytime intelligence tests, with a feasible implementation in mind. This paper presents the basics of the theory behind the ANYNT project and describes one of the test prototypes that were developed in the project: test Λone.

Keywords: (machine) intelligence evaluation, universal tests, artificial intelligence, Solomonoff-Kolmogorov complexity.

1 INTRODUCTION

There are many examples of intelligence tests which work in practice. For instance, in psychometrics and comparative psychology, tests are used to evaluate the intelligence of a variety of subjects: children and adult Homo sapiens, other apes, cetaceans, etc. In artificial intelligence, we are well aware of some incarnations and variations of the Turing Test, such as the Loebner Prize or CAPTCHAs [32], which are also feasible and informative. However, they do not answer the pristine questions: what intelligence is and how it can be built. In the past fifteen years, some efforts have been made to derive intelligence tests from formal intelligence definitions, or vice versa, grounded on computational concepts. However, some of these approaches have not been able to create universal tests (i.e., tests which can evaluate any kind of subject) and others have even failed to produce a feasible test. The ANYNT project6 was conceived to explore the possibility of defining formal, universal and anytime intelligence tests, with a feasible implementation in mind.
In the ANYNT project we have been working on the design and implementation of a general intelligence test which can feasibly be applied to a wide range of subjects. More precisely, the goal of the project is to develop intelligence tests that are: (1) formal, by using notions from Algorithmic Information Theory (a.k.a. Kolmogorov Complexity) [24]; (2) universal, so that they are able to evaluate the general intelligence of any kind of system (human, non-human animal, machine or hybrid), each with an appropriate interface that fits its needs; and (3) anytime, so that the more time is available for the evaluation, the more reliable the measurement will be.

2 BACKGROUND

In this section, we present a short introduction to the area of Algorithmic Information Theory and the notions of Kolmogorov complexity, universal distributions, Levin's Kt complexity, and their relation to the notions of compression, the Minimum Message Length (MML) principle, prediction, and inductive inference. Then, we survey the approaches that have used these formal notions to give mathematical definitions of intelligence or to develop intelligence tests from them, starting from the compression-enhanced Turing tests, the C-test, and Legg and Hutter's definition of Universal Intelligence.

2.1 Kolmogorov complexity and universal distributions

Algorithmic Information Theory is a field in computer science that properly relates the notions of computation and information. The key idea is the notion of the Kolmogorov complexity of an object, which is defined as the length of the shortest program p that outputs a given string x on a machine U. Formally,

Definition 1 Kolmogorov Complexity

$$K_U(x) := \min_{p \,:\, U(p)=x} l(p)$$

where $l(p)$ denotes the length in bits of $p$ and $U(p)$ denotes the result of executing $p$ on $U$. For instance, if x = 1010101010101010 and U is the programming language Lisp, then $K_{Lisp}(x)$ is the length in bits of the shortest program in Lisp that outputs the string x. The relevance of the choice of U depends mostly on the size of x. Since any universal machine can emulate any other, it holds that for every two universal Turing machines U and V there is a constant $c(U, V)$, which depends only on U and V and not on x, such that for all x, $|K_U(x) - K_V(x)| \le c(U, V)$. The value of $c(U, V)$ is relatively small for sufficiently long x.

From Definition 1, we can define the universal probability for machine U as follows:

Definition 2 Universal Distribution
Given a prefix-free machine7 U, the universal probability of string x is defined as:

$$p_U(x) := 2^{-K_U(x)}$$

which gives higher probability to objects whose shortest description is small and lower probability to objects whose shortest description is large. Considering programs as hypotheses in the hypothesis language defined by the machine paves the way for the mathematical theory of inductive inference and prediction. This theory was developed by Solomonoff [28], formalising Occam's razor in a proper way for prediction, by stating that the prediction maximising the universal probability will eventually discover any regularity in the data. This is related to the notion of Minimum Message Length for inductive inference [34][35][1][33] and is also related to the notion of data compression.
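Since $K_U$ is uncomputable (a problem taken up immediately below), practical illustrations of these notions usually fall back on computable proxies. The sketch below uses off-the-shelf compression as a crude upper-bound stand-in for K and derives the corresponding "universal probability"; this is a standard expedient for illustration only, not the method used in the ANYNT project.

```python
import random
import zlib


def k_proxy(x: bytes) -> int:
    """Crude upper-bound proxy for K(x): compressed length in bits.
    The true Kolmogorov complexity is uncomputable."""
    return 8 * len(zlib.compress(x, 9))


def universal_prob_proxy(x: bytes) -> float:
    """Proxy for Definition 2's universal probability 2^(-K(x))."""
    return 2.0 ** (-k_proxy(x))


regular = b"10" * 500                                # long periodic string
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(1000))  # no pattern

# The periodic string compresses far better, so the proxy assigns it a
# shorter description and hence a higher universal probability, matching
# the intuition behind Definition 2.
print(k_proxy(regular) < k_proxy(noisy))                         # True
print(universal_prob_proxy(regular) > universal_prob_proxy(noisy))  # True
```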
One of the main problems of Algorithmic Information Theory is that Kolmogorov complexity is uncomputable. One popular solution to the problem of the computability of K for finite strings is to use a time-bounded or weighted version of Kolmogorov complexity (and, hence, of the universal distribution which is derived from it). One popular choice is Levin's Kt complexity [23][24]:

Definition 3 Levin's Kt Complexity

$$Kt_U(x) := \min_{p \,:\, U(p)=x} \left\{\, l(p) + \log time(U, p, x) \,\right\}$$

where $l(p)$ denotes the length in bits of $p$, $U(p)$ denotes the result of executing $p$ on $U$, and $time(U, p, x)$ denotes the time8 that U takes to produce x when executing p.

Finally, despite the uncomputability of K and the computational complexity of its approximations, there have been some efforts to use Algorithmic Information Theory to devise optimal search or learning strategies. Levin (or universal) search [23] is an iterative search algorithm for solving inversion problems based on Kt, which has inspired other general agent policies such as Hutter's AIXI, an agent that is able to adapt optimally9 in all environments where any other general-purpose agent can be optimal [17], and for which there is a working approximation [31][30].

7 For a convenient definition of the universal probability, we need the requirement of U being a prefix-free machine (see, e.g., [24] for details). Note also that even for prefix-free machines there are infinitely many other inputs to U that will output x, so pU(x) is a strict lower bound on the probability that U will output x (given a random input).
8 Here time does not refer to physical time but to computational time, i.e., computation steps taken by machine U. This is important, since the complexity of an object cannot depend on the speed of the machine where it is run.
9 Optimality has to be understood in an asymptotic way. First, because AIXI is uncomputable (although resource-bounded variants have been introduced and shown to be optimal in terms of time and space costs). Second, because it is based on a universal probability over a machine, and this choice determines a constant term which may be very important for small environments.
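Unlike K, Kt can be computed exactly by brute force once a (toy) reference machine is fixed. The machine below is invented purely for this illustration, and "time" is taken to be the number of emitted symbols; the particular Kt values it yields are, like any Kt values, relative to the chosen machine.

```python
import itertools
import math


def toy_machine(program: str) -> str:
    """A made-up reference machine for illustration: the first 4 bits of
    the program encode a repeat count n, the remaining bits a unit, and
    the output is the unit repeated n times."""
    if len(program) < 5:
        return ""
    n = int(program[:4], 2)
    return program[4:] * n


def kt(target: str, max_len: int = 16) -> float:
    """Brute-force Kt over the toy machine (Definition 3): the minimum of
    l(p) + log2(time) over all programs p that output `target`, taking
    the number of emitted symbols as the computation time."""
    best = math.inf
    for length in range(5, max_len + 1):
        for bits in itertools.product("01", repeat=length):
            program = "".join(bits)
            output = toy_machine(program)
            if output == target:
                steps = max(1, len(output))
                best = min(best, length + math.log2(steps))
    return best


# "100001" (6 bits) repeats the unit "01" eight times, so the 16-symbol
# target costs 6 + log2(16) = 10.0 on this machine.
print(kt("01" * 8))  # 10.0
```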
2.2 Developing mathematical definitions and tests of intelligence

Following ideas from A. M. Turing, R. J. Solomonoff, E. M. Gold, C. S. Wallace, M. Blum, G. Chaitin and others, between 1997 and 1998 some works were developed on enhancing or substituting the Turing Test [29] with inductive inference tests, using Solomonoff's prediction theory [28] and related notions, such as the Minimum Message Length (MML) principle. On the one hand, Dowe and Hajek [2][3][4] suggested the introduction of inductive inference problems in an induction-enhanced or compression-enhanced Turing Test (which they arguably called non-behavioural) in order to, among other things, completely dismiss Searle's Chinese room objection [27], and also because an inductive inference ability is a necessary (though possibly "not sufficient") requirement for intelligence. Quite simultaneously, similarly, and also independently, in [13][6] intelligence was defined as the ability to comprehend, giving a formal definition of the notion of comprehension as the identification of a "predominant" pattern in given evidence, derived from Solomonoff prediction theory concepts, Kolmogorov complexity and Levin's Kt.

The notion of comprehension was formalised by using the notion of a "projectible" pattern: a pattern that has no exceptions (no noise), and is thus able to explain every symbol in the given sequence (not only most of them). From these definitions, the basic idea was to construct a feasible test as a set of series whose shortest pattern has no alternative projectible patterns of similar complexity. That means that the "explanation" of the series had to be much more plausible than other plausible hypotheses. The main objective was to reduce the subjectivity of the test: first, because we need to choose one reference universal machine from an infinite set of possibilities; secondly, because, even after choosing one reference machine, two very different patterns could be consistent with the evidence, and if both have similar complexities, their probabilities will be close, making the series solution quite uncertain. With the constraints posed on patterns and series, both problems were not completely solved, but minimised.

k = 9 :  a, d, g, j, ...                    Answer: m
k = 12 : a, a, z, c, y, e, x, ...           Answer: g
k = 14 : c, a, b, d, b, c, c, e, c, d, ...  Answer: d

Figure 1. Examples of series of Kt complexity 9, 12, and 14 used in the C-test [6].

The definition was given as the result of a test, called the C-test [13], formed by computationally-obtained series of increasing complexity. The sequences were formatted and presented in a way quite similar to psychometric tests (see Figure 1) and, as a result, the test was administered to humans, showing a high correlation with the results of a classical psychometric (IQ) test on the same individuals. Nonetheless, the main goal was that the test could eventually be administered to other kinds of intelligent beings and systems. This was planned, but the work in [26] showed that machine learning programs could be specialised in such a way that they score reasonably well on some typical IQ tests. A more extensive treatment of this phenomenon and of the inadequacy of current IQ tests for evaluating machines can be found in [5]. This unexpected result confirmed that C-tests had important limitations and could not be considered universal in two ways: embracing the whole notion of intelligence (rather than perhaps only a part of it), and being applicable to any kind of subject (not only adult humans). The idea of extending these static tests to other factors, or of making them interactive and extensible to other kinds of subjects by the use of rewards (as in the area of reinforcement learning), was suggested in [7][8], but not fully developed into actual tests. An illustration of the classical view of an environment in reinforcement learning is seen in Figure 2, where an agent can interact through actions, rewards and observations.

Figure 2. Interaction with an Environment.

A few years later, Legg and Hutter (e.g. [21],[22]) followed the previous steps and, strongly influenced by Hutter's theory of AIXI optimal agents [16], gave a new definition of machine intelligence, dubbed "Universal10 Intelligence", also grounded in Kolmogorov complexity and Solomonoff's ("inductive inference" or) prediction theory. The key idea is that the intelligence of an agent is evaluated as some kind of sum (or weighted average) of performances in all possible environments (as in Figure 2).

10 The term "universal" here does not refer to the definition (or a derived test) being applicable to any kind of agent, but to the use of Solomonoff's universal distribution and to the view of the definition as an extremely general view of intelligence.
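Before turning to the definition itself, it may help to pin down the interaction protocol of Figure 2 in code form. The following is a generic sketch of the action/reward/observation loop; the interfaces are invented here for illustration and are not the ANYNT implementation.

```python
from abc import ABC, abstractmethod


class Environment(ABC):
    """One interaction step: the environment consumes an action and
    emits a reward and an observation (cf. Figure 2)."""

    @abstractmethod
    def step(self, action):
        """Return (reward, observation) after performing `action`."""


class Agent(ABC):
    @abstractmethod
    def act(self, reward, observation):
        """Choose the next action given the last reward and observation."""


def interact(agent, env, n_steps):
    """Run the action/reward/observation loop and collect the rewards."""
    rewards, reward, obs = [], 0.0, None
    for _ in range(n_steps):
        action = agent.act(reward, obs)
        reward, obs = env.step(action)
        rewards.append(reward)
    return rewards
```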
The definition based on the C-test can now be considered a static precursor of Legg and Hutter's work: the environment outputs no rewards, and the agent is not allowed to make an action until several observations have been seen (the inductive inference or prediction sequence). The point in favour of active environments (in contrast to passive environments) is that the former not only require inductive and predictive abilities to model the environment, but also some planning abilities to use this knowledge effectively through actions. Additionally, perception, selective attention and memory abilities must be fully developed. Not all of this is needed to score well in a C-test, for instance. While the C-test selects problems by (intrinsic) difficulty (which can be chosen to fit the level of intelligence of the evaluee), Legg and Hutter's approach selects problems by using a universal distribution, which gives more probability to simple environments. Legg and Hutter's definition, given an agent π, is as follows:

Definition 4 Universal Intelligence [22]

$$\Upsilon(\pi, U) = \sum_{\mu=1}^{\infty} p_U(\mu) \cdot E\left[ \sum_{i=1}^{\infty} r_i^{\mu,\pi} \right]$$

where µ is any environment coded on a universal machine U, π is the agent to be evaluated, and $r_i^{\mu,\pi}$ is the reward obtained by π in µ at interaction i. E is the expected reward in each environment, where environments are assigned a probability $p_U(\mu)$ using a universal distribution [28].

Definition 4, although very simple, captures one of the broadest definitions of intelligence: "the ability to adapt to a wide range of environments". However, this definition was not meant to be eventually converted into a test. In fact, there are three obvious problems with making it practical. First, we have two infinite sums in the definition: one is the sum over all environments, and the second is the sum over all interactions (the agent's life in each environment is infinite). And, finally, K is not computable. Additionally, we also have the dependence on the reference machine U. This dependence holds even though we consider an infinite number of environments: the universal distribution for a machine U could give the highest probabilities (0.5, 0.25, ...) to quite different environments than those given by another machine V. Despite all these problems, it could seem that just taking a random finite sample of environments, limiting the number of interactions or cycles of the agent with the environment, and using some computable variant of K would be sufficient to make it a practical test. However, on the one hand, this is not so easy, and, on the other hand, the definition has many other problems (some related and others unrelated). The realisation of these problems and the search for solutions in the quest for a practical intelligence test is the goal of the ANYNT project.
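The naive approximation just dismissed (a finite random sample of environments, a bounded number of interactions, and a computable stand-in for K) is at least easy to state in code, which makes the remaining difficulties easier to discuss. The sketch below is exactly that naive estimator; every interface in it is hypothetical.

```python
def estimate_upsilon(agent_factory, environments, complexity,
                     n_steps=100, n_episodes=10):
    """Naive Monte Carlo approximation of Definition 4.

    `environments` is a finite sample of environment objects exposing
    reset() and step(action) -> (reward, observation); `complexity(env)`
    is some computable stand-in for K(env), e.g. the length in bits of
    the environment's description. All interfaces here are hypothetical.
    """
    total = 0.0
    for env in environments:
        weight = 2.0 ** (-complexity(env))    # proxy for p_U(mu)
        returns = []
        for _ in range(n_episodes):
            agent = agent_factory()           # fresh agent per episode
            env.reset()
            reward, obs, acc = 0.0, None, 0.0
            for _ in range(n_steps):          # truncated infinite sum
                action = agent.act(reward, obs)
                reward, obs = env.step(action)
                acc += reward
            returns.append(acc)
        total += weight * (sum(returns) / len(returns))  # empirical E[.]
    return total
```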
3 ANYTIME UNIVERSAL TESTS

This section presents a summary of the theory in [11]. The reader is referred to that paper for further details.

3.1 On the difficulty of environments

The first issue concerns how to sample environments. Just using the universal distribution for this, as suggested by Legg and Hutter, will mean that very simple environments are output again and again; note that an environment µ with K(µ) = 1 will appear half of the time. Of course, repeated environments must be ruled out, but the sample would then almost become an enumeration from low to high K. This will still omit or underweight very complex environments, because their probability is so low. Furthermore, measuring rewards on very small environments will give very unstable results and be very dependent on the reference machine. And even ignoring this, it is not clear that an agent that solves all the problems of complexity lower than 20 bits, and none of those whose complexity is larger than 20 bits, is more intelligent than another agent that does reasonably well in every environment. This contrasts with the view of the C-test, which focuses on the issue of difficulty and does not make the probability of a problem appearing inversely related to this difficulty.

In any case, before going on, we need to clarify the notions of simple/easy and complex/difficult that are used here. For instance, just choosing an environment with high K does not ensure that the environment is indeed complex. As Figure 3 illustrates, the relation is unidirectional: given a low K, we can affirm that the environment will look simple; conversely, for an intuitively complex environment, K must necessarily be high.

Environment with high K  ⇐  Intuitively complex (difficult) environment
Environment with low K   ⇒  Intuitively simple (easy) environment

Figure 3. Relation between K and intuitive complexity.

Given this relation, only among environments with high K will we find complex environments, and, among the latter, not all will be difficult. From the agent's perspective, however, this is more extreme, since many environments with high K will contain difficult patterns that are never accessed through the agent's interactions. As a result, the environment will be probabilistically simple. Thus, giving most of the probability to environments with low K means that most of the intelligence measure will come from patterns that are extremely simple.

3.2 Selecting discriminative environments

Furthermore, many environments (either simple or complex) will be completely useless for evaluating intelligence, e.g., environments that stop interacting, environments with constant rewards, etc. If we are able to make a more accurate sample, we will be able to make a more efficient test procedure. The question here is to determine a non-arbitrary criterion for excluding some environments. For instance, Legg and Hutter's definition forces environments to interact infinitely, and since the description must be finite, there must be a pattern. This obviously includes environments such as "always output the same observation and reward"; in fact, such environments are not only possible but highly probable on many reference machines. Another pathological case is an environment that "outputs observations and rewards at random"; this, however, has high complexity if we assume deterministic environments. In both cases, the behaviour of any agent in these environments would be almost the same. In other words, they have no discriminative power and would therefore be useless for discriminating between agents. In an interactive environment, a clear requirement for an environment to be discriminative is that what the agent does must have consequences on rewards. Thus, we will restrict environments to be sensitive to agents' actions.
3.2 Selecting discriminative environments

Furthermore, many environments (either simple or complex) will be completely useless for evaluating intelligence, e.g., environments that stop interacting, environments with constant rewards, etc. If we are able to make a more accurate sample, we will be able to make a more efficient test procedure. The question here is to determine a non-arbitrary criterion for excluding some environments. For instance, Legg and Hutter's definition forces environments to interact infinitely, and since the description must be finite, there must be a pattern. This obviously includes environments such as "always output the same observation and reward". In fact, such environments are not only possible but highly probable on many reference machines. Another pathological case is an environment that "outputs observations and rewards at random" (which has high complexity if we assume deterministic environments). In both cases, the behaviour of any agent on these environments would be almost the same. In other words, they have no discriminative power, and would therefore be useless for discriminating between agents. In an interactive environment, a clear requirement for an environment to be discriminative is that what the agent does must have consequences on rewards. Thus, we will restrict environments to be sensitive to agents' actions. That means that a wrong action might lead the agent to a part of the environment from which it can never return (non-ergodic), but at least the actions taken by the agent can modify the rewards in that subenvironment. More precisely, we want an agent to be able to influence rewards at any point in any subenvironment. This does not imply ergodicity but reward sensitivity at any moment: the agent can never reach a point from which rewards are given independently of what it does (a dead-end).

3.3 Symmetric rewards and balanced environments

An important issue is how to estimate rewards. If we only use positive rewards, we find some problems. For example, an increase in the score may originate from really good behaviour in the environment or simply from the accumulation of rewards, since they are always positive. Instead, an average reward seems a better payoff function. Our proposal is to use symmetric rewards, which can range between −1 and 1:

Definition 5 Symmetric Rewards. We say an environment has symmetric rewards when $\forall i : -1 \leq r_i \leq 1$.

If we use symmetric rewards, we also expect environments to be symmetric or, more precisely, to be balanced in how they give rewards. This can be seen in the following way. In a reliable test, we would like many (if not all) environments to give an expected reward of 0 to random agents. This excludes both hostile and benevolent environments, i.e., environments where acting randomly will get more negative (respectively, positive) rewards than positive (respectively, negative) ones. In many cases it is not difficult to prove that a particular environment is balanced. Another approach is to set a reference machine that only generates balanced environments. Using this approach to rewards, we can use an average to estimate the results on each environment, namely:

Definition 6 Average Reward. Given an environment µ, with $n_i$ being the number of completed interactions, the average reward for agent π is defined as

$$v_\mu^\pi(n_i) = \frac{\sum_{i=1}^{n_i} r_i^{\mu,\pi}}{n_i}$$

Now we can calculate the expected value of the previous average (although the limit may not exist), denoted by $E(v_\mu^\pi)$, for an arbitrarily large value of $n_i$.
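Definitions 5 and 6 are straightforward to operationalise. The sketch below computes the average reward and adds an empirical balancedness check: run random agents and verify that their mean average-reward stays close to 0. The `env_factory` interface and the Monte Carlo check are assumptions of this illustration (the paper suggests proving balance, or constraining the reference machine, rather than testing for it):

```python
import random

def average_reward(rewards):
    """Definition 6: mean reward over the completed interactions."""
    return sum(rewards) / len(rewards)

def looks_balanced(env_factory, n_actions, episodes=500, steps=100, tol=0.05):
    """Empirical balance check: a random agent should earn an expected
    average reward of ~0 if rewards are symmetric in [-1, 1]."""
    means = []
    for _ in range(episodes):
        env = env_factory()
        rewards = [env.step(random.randrange(n_actions))[1]
                   for _ in range(steps)]
        means.append(average_reward(rewards))
    grand_mean = sum(means) / len(means)
    return abs(grand_mean) < tol, grand_mean

class CoinEnv:
    """Trivial balanced toy: step() returns (observation, reward)."""
    def step(self, action):
        return None, random.choice((-1.0, 1.0))

print(looks_balanced(CoinEnv, n_actions=2))   # (True, ~0.0)
```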
The test framework in [11] solves some of these issues (and many other problems with the measure): it uses a random finite sample of environments, it limits the number of interactions of the agent with the environment, it selects a discriminative set of environments, etc.

4 ENVIRONMENT CLASS

The previous theory, however, does not make the choice of an environment class; it just sets some constraints on the kind of environments that can be used. Consequently, one major open problem is to make this choice, i.e., to find a proper (unbiased) environment class which follows the constraints and, more difficult, which can be feasibly implemented. Once this environment class is identified, we can use it to generate environments to run any of the test variants. Additionally, it is necessary not only to determine the environment class, but also to fix the universal machine used to compute the Kolmogorov complexity of each environment, since the tests only use a (small) sample of environments and the sample probability is defined in terms of this complexity.

In the previous section we defined a set of properties required for making environments discriminative, namely that observations and rewards must be sensitive to the agent's actions and that environments must be balanced. Given these constraints, if we generate environments without restriction and then apply a post-processing sieve to select those which comply with all the constraints, we face a computationally very expensive or even incomputable problem. So the approach taken is to define an environment class that ensures these properties hold by construction. In any case, we have to be very careful, because we would not like to restrict the reference machine to comply with these properties at the cost of losing universality (i.e., the ability to emulate or include any computable function). And finally, we would like the environment class to be user-friendly for the kinds of systems we want to evaluate (humans, non-human animals and machines), but without any bias in favour of or against some of them. According to all this, we define a universal environment class from which we can effectively generate valid environments, calculate their complexity and consequently derive their probability.

4.1 On actions, observations and space

Back to Figure 2 again: actions are limited by a finite set of symbols A (e.g., {left, right, up, down}), rewards are taken from any subset R of rational numbers between −1 and 1, and observations are also limited by a finite set O of possibilities (e.g., the contents of an n × m grid of binary cells, or a set of light-emitting diodes, LEDs). We will use $a_i$, $r_i$ and $o_i$ to denote (respectively) action, reward and observation at interaction i. Apart from the behaviour of an environment, which may vary from very simple to very complex, we must first clarify the interface. How many actions are we going to allow? How many different observations? The very definition of environment makes both actions and observations finite sets of symbols. It is clear that the minimum number of actions has to be two, but no upper limit seems to be decidable a priori. The same happens with observations. Even choosing two for both, a sequence of interactions can be as rich as the expressiveness of a Turing machine allows.

Before getting into the details of the interface, we have to think about environments that can contain agents. This is not only the case in real life (where agents are known as inanimate or animate objects, animals being among the latter), but also a requirement for evolution and, hence, for intelligence as we know it. The existence of several interacting agents requires a space. The space is not necessarily a virtual or physical space; it can also be a set of common rules (or laws) that govern what the agents can perceive and what the agents can do. On top of this set of common rules, specific rules can be added for each agent. In the real world, this set of common rules is physics. All this has been extensively analysed in multi-agent systems (see e.g. [20] for a discussion). The good thing about thinking in terms of spaces is that a space entails the possible perceptions and actions: if we define a common space, many choices about observations and actions are already taken. A first (and common) idea for a space is a 2D grid. In a 2D grid, the observation is a picture of the grid with all the objects and agents inside. In a simple grid where agents and objects occupy cells, the typical actions are the movements left, right, up and down. Alternatively, of course, we could use a 3D space, since our world is 3D.
In fact, there are some results on intelligence testing (for animals or humans) with a 3D interface [25][36]. The problem of a 2D or 3D grid is that it is clearly biased in favour of humans and the many other animals which have hardwired abilities for orientation in this kind of space. Other kinds of animals, or handicapped people (e.g., blind people), might have difficulties in this type of space. Additionally, artificial intelligence agents would benefit greatly from hardwired functionalities for Euclidean distance and 2D movement, without any real improvement in their general intelligence. Instead, we propose a more general kind of space. A 2D grid is a graph with a very special topology, where concepts such as direction and adjacency hold. A generalisation is a graph where the cells are freely connected to some other cells with no particular predefined pattern. This suggests a (generally) dimensionless space. Connections between cells determine part or all of the possible actions, and observations and rewards can easily be shown graphically.

4.2 Definition of the environment class

After the previous discussion, we are ready to give the definition of the environment class. First we must define the space and objects, and from these the observations, actions and rewards. Before that, we have to define some constants that affect each environment. Namely, with $n_a = |A| \geq 2$ we denote the number of actions, with $n_c \geq 2$ the number of cells, and with $n_\omega$ the number of objects/agents (not including the agent which is to be evaluated and two special objects known as Good and Evil).

4.2.1 Space

The space is defined as a directed labelled graph of $n_c$ nodes (or vertices), where each node represents a cell. Nodes are numbered starting from 1, so cells are referred to as $C_1, C_2, \ldots, C_{n_c}$. From each cell we have $n_a$ outgoing arrows (or arcs), each of them denoted as $C_i \rightarrow_\alpha C_j$, meaning that action $\alpha \in A$ goes from $C_i$ to $C_j$. All the outgoing arrows from $C_i$ are denoted by $\hat{C}_i$. At least two of the outgoing arrows must lead to different cells. Formally, $\forall C_i : \exists r_1, r_2 \in \hat{C}_i$ such that $r_1 = C_i \rightarrow_{\alpha_m} C_j$ and $r_2 = C_i \rightarrow_{\alpha_n} C_k$ with $C_j \neq C_k$ and $\alpha_m \neq \alpha_n$. At least one of the outgoing arrows from a cell must lead to itself (typically denoted by $\alpha_1$, the first action). Formally, $\forall C_i : \exists r \in \hat{C}_i$ such that $r = C_i \rightarrow_{\alpha_1} C_i$. A path from $C_i$ to $C_m$ is a sequence of arrows $C_i \rightarrow C_j, C_j \rightarrow C_k, \ldots, C_l \rightarrow C_m$. The graph must be strongly connected, i.e., for every two cells $C_i, C_j$ there exists a path from $C_i$ to $C_j$.
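The constraints on the space are simple enough to enforce constructively. Below is a small sketch that generates a random space satisfying them; the rejection-sampling loop for strong connectivity is an assumption of this sketch, since the paper defines the class but does not fix a particular generation procedure:

```python
import random

def strongly_connected(arrows):
    """True iff every cell is reachable from cell 0 and cell 0 is
    reachable from every cell (standard forward/backward reachability)."""
    nc = len(arrows)
    def reaches_all(graph):
        seen, stack = {0}, [0]
        while stack:
            for dest in graph[stack.pop()]:
                if dest not in seen:
                    seen.add(dest)
                    stack.append(dest)
        return len(seen) == nc
    reverse = [[] for _ in range(nc)]
    for cell, row in enumerate(arrows):
        for dest in row:
            reverse[dest].append(cell)
    return reaches_all(arrows) and reaches_all(reverse)

def random_space(nc, na, rng=random):
    """arrows[c][a] = destination of action a taken in cell c.
    Enforces: action 0 is a self-loop; at least two outgoing arrows
    lead to different cells; the graph is strongly connected."""
    while True:
        arrows = [[c] + [rng.randrange(nc) for _ in range(na - 1)]
                  for c in range(nc)]
        if all(len(set(row)) >= 2 for row in arrows) \
                and strongly_connected(arrows):
            return arrows

print(random_space(nc=5, na=3))
```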
4.2.2 Objects

Cells can contain objects from a set of predefined objects Ω, with $n_\omega = |\Omega|$. Objects, denoted by $\omega_i$, can be animate or inanimate, but this can only be perceived through the rules each object has. An object is inanimate (for a period or indefinitely) when it performs action $\alpha_1$ repeatedly. Objects can perform actions following the space rules but, apart from these rules, they can have any behaviour, deterministic or not. Objects can be reactive and can be defined to act with different actions according to their observations. Objects perform one and only one action at each interaction of the environment (except for the special objects Good and Evil, which can perform several actions in a row). Apart from the evaluated agent π, as we have mentioned, there are two special objects called Good and Evil. Good and Evil must have the same behaviour. By the same behaviour we do not mean that they perform the same movements, but that they have the same logic or program behind them. Objects can share the same cell, except Good and Evil, which cannot be in the same cell. If their behaviour leads them to the same cell, then one (chosen randomly with equal probability) moves to the intended cell and the other remains at its original cell. Because of this, the environment becomes stochastic (non-deterministic). Objects are placed randomly in the cells at the initialisation of the environment. This is another source of stochastic behaviour.

4.2.3 Observations and Actions

The observation is a sequence of cell contents, with the cells ordered by their number. Each element in the sequence shows the presence or absence of each object, including the evaluated agent. Additionally, each cell which is reachable by an action includes the information of the action leading to it.

4.2.4 Rewards

Raw rewards are defined as a function of the position of the evaluated agent π and the positions of Good and Evil. For the rewards, we will work with the notion of trace and the notion of "cell reward", which we denote by $r(C_i)$. Initially, $r(C_i) = 0$

[...] complexities, and we analysed whether the obtained results correlated with the measure of difficulty. The results were clear, showing that the evaluation obtains the expected results in terms of the relation between expected reward and theoretical problem difficulty. It also showed reasonable differences with respect to other baseline algorithms (e.g., a random algorithm). All this supported the idea that the test and the environment class used are in the right direction for evaluating a specific kind of system. However, the main question was whether the approach was in the right direction in terms of constructing universal tests. In other words, it was still necessary to demonstrate whether the test serves to evaluate several kinds of systems and put their results on the same scale. In [18] we compared the results of two different systems (humans and AI algorithms) by using the prototype described in this paper and the interface for humans. We set both systems to interact with exactly the same environments. The results, not surprisingly, did not show the expected difference in intelligence between reinforcement learning algorithms and humans. This is explained by several reasons. One of them is that the environments were still relatively simple, and reinforcement learning algorithms could still capture and represent the whole state matrix for these problems with some partial success. Another reason is that the exercises were independent, so humans could not reuse what they were learning in some exercises for others, an issue where humans are supposed to be better than these simple reinforcement learning algorithms. Another possibility is the fact that the environments had very few agents, and the few agents that existed were not reactive. This makes the state space bounded, which is beneficial for Q-learning. Similarly, the environments had no noise. All these decisions were made on purpose to keep things simple, and also to be able to formally derive the complexity of the environments. In general, other explanations can be found as well, since the lack of other interactive agents can be seen as a lack of social behaviours, as we explored in subsequent works [12].
Of course, test Λ^one was just a first prototype, which does not incorporate many of the features of an anytime intelligence test and the measuring framework. In particular, the prototype is not anytime, so the test does not adapt its complexity to the subject being evaluated. Also, we made some simplifications to the environment class, causing objects to lose reactivity. Furthermore, it is very difficult to construct any kind of social behaviour by creating agents from scratch. These and other issues are being addressed in new prototypes, some of them under development.

6 CONCLUDING REMARKS

The ANYNT project aimed at exploring the possibility of formal, universal and feasible tests. As already said, test Λ^one is just one prototype that does not implement all the features of the theory of anytime universal tests. However, it is already very informative. For instance, the experimental results show that test Λ^one goes in the right direction, but it still fails to capture some components of intelligence that should put different kinds of individuals on the right scale. In defence of test Λ^one, we have to say that it is quite rare in the literature to find the same test applied to different kinds of individuals.^11 In fact, as argued in [5], relatively simple programs can get good scores on conventional IQ tests, while small children (with high potential intelligence) will surely fail. Similarly, illiterate people and most children would score very badly at the Turing Test, for instance. And humans are starting to struggle with many CAPTCHAs. All this means that many feasible and practical tests work because they are specialised for specific populations. As the diversity of subjects is enlarged, measuring intelligence becomes more difficult and less accurate. As a result, the mere possibility of constructing universal tests is still a hot question. While many may think that this question is irresoluble, we think that unless an answer to it is found, it will be very difficult (if not impossible) to assess the diversity of intelligent agents that are envisaged for the forthcoming decades. One way or another, there is clearly an ocean of scientific questions beyond the Turing Test.

^11 The only remarkable exceptions are the works in comparative psychology, such as [14][15], which are conscious of the difficulties of using the same test, with different interfaces, for different subjects.

ACKNOWLEDGEMENTS

This work was supported by the MEC projects EXPLORA-INGENIO TIN 2009-06078-E, CONSOLIDER-INGENIO 26706 and TIN 2010-21062-C02-02, and GVA project PROMETEO/2008/051. Javier Insa-Cabrera was sponsored by Spanish MEC-FPU grant AP2010-4389.

REFERENCES

[1] D. L. Dowe, 'Foreword re C. S. Wallace', The Computer Journal, 51(5), 523–560, Christopher Stewart Wallace (1933–2004) memorial special issue, (2008).
[2] D. L. Dowe and A. R. Hajek, 'A computational extension to the Turing Test', in Proceedings of the 4th Conference of the Australasian Cognitive Science Society, University of Newcastle, NSW, Australia, (1997).
[3] D. L. Dowe and A. R. Hajek, 'A computational extension to the Turing Test', Technical Report #97/322, Dept. of Computer Science, Monash University, Melbourne, Australia, 9pp, http://www.csse.monash.edu.au/publications/1997/tr-cs97-322abs.html, (1997).
[4] D. L. Dowe and A. R. Hajek, 'A non-behavioural, computational extension to the Turing Test', in International Conference on Computational Intelligence & Multimedia Applications (ICCIMA'98), Gippsland, Australia, pp.
101–106, (1998).
[5] D. L. Dowe and J. Hernandez-Orallo, 'IQ tests are not for machines, yet', Intelligence, 40(2), 77–81, (2012).
[6] J. Hernández-Orallo, 'Beyond the Turing Test', Journal of Logic, Language and Information, 9(4), 447–466, (2000).
[7] J. Hernández-Orallo, 'Constructive reinforcement learning', International Journal of Intelligent Systems, 15(3), 241–264, (2000).
[8] J. Hernández-Orallo, 'On the computational measurement of intelligence factors', in Performance Metrics for Intelligent Systems Workshop, ed., A. Meystel, pp. 1–8. National Institute of Standards and Technology, Gaithersburg, MD, U.S.A., (2000).
[9] J. Hernández-Orallo, 'A (hopefully) non-biased universal environment class for measuring intelligence of biological and artificial systems', in Artificial General Intelligence, 3rd International Conference AGI, Proceedings, eds., Marcus Hutter, Eric Baum, and Emanuel Kitzelmann, "Advances in Intelligent Systems Research" series, pp. 182–183. Atlantis Press, (2010).
[10] J. Hernández-Orallo, 'On evaluating agent performance in a fixed period of time', in Artificial General Intelligence, 3rd Intl Conf, ed., M. Hutter et al., pp. 25–30. Atlantis Press, (2010).
[11] J. Hernández-Orallo and D. L. Dowe, 'Measuring universal intelligence: Towards an anytime intelligence test', Artificial Intelligence, 174(18), 1508–1539, (2010).
[12] J. Hernández-Orallo, D. L. Dowe, S. España-Cubillo, M. V. Hernández-Lloreda, and J. Insa-Cabrera, 'On more realistic environment distributions for defining, evaluating and developing intelligence', in Artificial General Intelligence 2011, eds., J. Schmidhuber, K. R. Thórisson, and M. Looks, volume 6830 of LNAI, pp. 82–91. Springer, (2011).
[13] J. Hernández-Orallo and N. Minaya-Collado, 'A formal definition of intelligence based on an intensional variant of Kolmogorov complexity', in Proceedings of the International Symposium of Engineering of Intelligent Systems (EIS'98), pp. 146–163. ICSC Press, (1998).
[14] E. Herrmann, J. Call, M. V. Hernández-Lloreda, B. Hare, and M. Tomasello, 'Humans have evolved specialized skills of social cognition: The cultural intelligence hypothesis', Science, 317(5843), 1360–1366, (2007).
[15] E. Herrmann, M. V. Hernández-Lloreda, J. Call, B. Hare, and M. Tomasello, 'The structure of individual differences in the cognitive abilities of children and chimpanzees', Psychological Science, 21(1), 102, (2010).
[16] M. Hutter, Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability, Springer, 2005.
[17] M. Hutter, 'Universal algorithmic intelligence: A mathematical top→down approach', in Artificial General Intelligence, eds., B. Goertzel and C. Pennachin, Cognitive Technologies, 227–290, Springer, Berlin, (2007).
[18] J. Insa-Cabrera, D. L. Dowe, S. España-Cubillo, M. V. Hernández-Lloreda, and J. Hernández-Orallo, 'Comparing humans and AI agents', in Artificial General Intelligence 2011, eds., J. Schmidhuber, K. R. Thórisson, and M. Looks, volume 6830 of LNAI, pp. 122–132. Springer, (2011).
[19] J. Insa-Cabrera, D. L. Dowe, and J. Hernández-Orallo, 'Evaluating a reinforcement learning algorithm with a general intelligence test', in CAEPIA, Advances in Artificial Intelligence, volume 7023 of LNCS, pp. 1–11. Springer, (2011).
[20] D. Keil and D. Goldin, 'Indirect interaction in environments for multi-agent systems', Environments for Multi-Agent Systems II, 68–87, (2006).
Goldin, ‘Indirect interaction in environments for multi-agent systems’, Environments for Multi-Agent Systems II, 68–87, (2006). [21] S. Legg and M. Hutter, ‘A universal measure of intelligence for artificial agents’, in International Joint Conference on Artificial Intelligence, volume 19, p. 1509, (2005). [22] S. Legg and M. Hutter, ‘Universal intelligence: A definition of machine intelligence’, Minds and Machines, 17(4), 391–444, (2007). http://www.vetta.org/documents/UniversalIntelligence.pdf. [23] L. A. Levin, ‘Universal sequential search problems’, Problems of Information Transmission, 9(3), 265–266, (1973). [24] M. Li and P. Vitányi, An introduction to Kolmogorov complexity and its applications (3rd ed.), Springer-Verlag New York, Inc., 2008. [25] F. Neumann, A. Reichenberger, and M. Ziegler, ‘Variations of the turing test in the age of internet and virtual reality’, in Proceedings of the 32nd annual German conference on Advances in artificial intelligence, pp. 355–362. Springer-Verlag, (2009). [26] P. Sanghi and D. L. Dowe, ‘A computer program capable of passing IQ tests’, in Proc. 4th ICCS International Conference on Cognitive Science (ICCS’03), Sydney, Australia, pp. 570–575, (July 2003). [27] J. Searle, ‘Minds, brains, and programs’, Behavioral and Brain Sciences, 3(3), 417–457, (1980). [28] R. J. Solomonoff, ‘A formal theory of inductive inference. Part I’, Information and control, 7(1), 1–22, (1964). [29] A. M. Turing, ‘Computing machinery and intelligence’, Mind, 59, 433– 460, (1950). [30] J. Veness, K. S. Ng, M. Hutter, and D. Silver, ‘Reinforcement learning via AIXI approximation’, in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), pp. 605–611, (2010). [31] J. Veness, K.S. Ng, M. Hutter, W. Uther, and D. Silver, ‘A Monte Carlo AIXI Approximation’, Journal of Artificial Intelligence Research, 40(1), 95–142, (2011). [32] L. Von Ahn, M. Blum, and J. Langford, ‘Telling humans and computers apart automatically’, Communications of the ACM, 47(2), 56–60, (2004). [33] C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Ed. Springer-Verlag, 2005. [34] C. S. Wallace and D. M. Boulton, ‘A information measure for classification’, The Computer Journal, 11(2), 185–194, (1968). [35] C. S. Wallace and D. L. Dowe, ‘Minimum message length and Kolmogorov complexity’, Computer Journal, 42(4), 270–283, (1999). Special issue on Kolmogorov complexity. [36] D.A. Washburn and R.S. Astur, ‘Exploration of virtual mazes by rhesus monkeys ( macaca mulatta )’, Animal Cognition, 6(3), 161–168, (2003). [37] C.J.C.H. Watkins and P. Dayan, ‘Q-learning’, Mach. learning, 8(3), 279–292, (1992). AISB/IACAP 2012 Symposium: Revisiting Turing and his Test: Comprehensiveness, Qualia, and the Real World 27 Turing Machines and Recursive Turing Tests José Hernández-Orallo1 and Javier Insa-Cabrera2 and David L. Dowe3 and Bill Hibbard4 Abstract. The Turing Test, in its standard interpretation, has been dismissed by many as a practical intelligence test. In fact, it is questionable that the imitation game was meant by Turing himself to be used as a test for evaluating machines and measuring the progress of artificial intelligence. In the past fifteen years or so, an alternative approach to measuring machine intelligence has been consolidating. 
The key concept for this alternative approach is not the Turing Test but the Turing machine, and some theories built upon it, such as Solomonoff's theory of prediction, the MML principle, Kolmogorov complexity and algorithmic information theory. This presents an antagonistic view to the Turing Test: intelligence tests which are based on formal principles, are not anthropocentric, are meaningful computationally, and in which the abilities (or factors) being evaluated can be recognised and quantified. Recently, however, this computational view has been touching upon issues which are somewhat related to the Turing Test, namely that we may need other intelligent agents in the tests. Motivated by these issues (and others), this paper links these two antagonistic views by bringing some of the ideas around the Turing Test to the realm of Turing machines.

Keywords: Turing Test, Turing machines, intelligence, learning, imitation games, Solomonoff-Kolmogorov complexity.

^1 DSIC, Universitat Politècnica de València, Spain. email: jorallo@dsic.upv.es
^2 DSIC, Universitat Politècnica de València, Spain. email: jinsa@dsic.upv.es
^3 Clayton School of Information Technology, Monash University, Australia. email: david.dowe@monash.edu
^4 Space Science and Engineering Center, University of Wisconsin - Madison, USA. email: test@ssec.wisc.edu

1 INTRODUCTION

Humans have been evaluated by other humans in all periods of history. It was only in the 20th century, however, that psychometrics was established as a scientific discipline. Other animals have also been evaluated by humans, but certainly not in the context of psychometric tests; instead, comparative cognition is nowadays an important area of research where non-human animals are evaluated and compared. Machines —yet again differently— have also been evaluated by humans. However, no scientific discipline has been established for this. The Turing Test [31] is still the most popular test for machine intelligence, at least for philosophical and scientific discussions. The Turing Test, as a measurement instrument and not as a philosophical argument, is very different from the instruments other disciplines use to measure intelligence in a scientific way. The Turing Test resembles a much more customary (and non-scientific) assessment, which happens when humans interview or evaluate other humans (for whatever reason, including, e.g., personnel selection, sports^1 or other competitions). The most relevant (and controversial) feature of the Turing Test is that it takes humans as a touchstone to which machines should be compared. In fact, the comparison is not performed by an objective criterion but assessed by human judges, which is not without controversy. Another remarkable feature (and perhaps a less controversial one) is that the Turing Test is set on an intentionally restrictive interaction channel: a teletype conversation. Finally, there are some features of the Turing Test which make it more general than other kinds of intelligence tests. For instance, it is becoming increasingly better known that programs can do well at human IQ tests [32][8], because ordinary IQ tests only evaluate narrow abilities and assume that narrow abilities accurately reflect human abilities across a broad set of tasks, which may not hold for non-human populations. The Turing Test (and some formal intelligence measures we will review in the following section) can test broad sets of tasks. We must say that Turing cannot be blamed for all the controversy.
The purpose of Turing's imitation game [37] was to show that intelligence could be assessed and recognised in a behavioural way, without the need for directly measuring or recognising other physical or mental phenomena such as thinking, consciousness, etc. In Turing's view, intelligence can just be seen as a cognitive ability (or property) that some machines might have and others might not. In fact, the standard scientific view should converge to defining intelligence as an ability that some systems (humans, non-human animals, machines, and collectives thereof) might or might not have, or, more precisely, might have to a greater or lesser degree. This view has clearly been spread by the popularity of psychometrics and IQ tests.^2 While there have been many variants and extensions of the Turing Test (see [33] or [31] for an account of these), none of them (and none of the approaches in psychometrics and animal cognition, either) have provided a formal, mathematical definition of what intelligence is and how it can be measured.

^1 In many sports, to see how good a player is, we want competent judges but also appropriate team-mates and opponents. Good tournaments and competitions are largely designed so as to return (near) maximal expected information.

^2 In fact, the notion of consciousness and other phenomena is today better separated from intelligence than it was sixty years ago. They are now seen as related but different things. For instance, nobody doubts that a team of people can score well in a single IQ test (working together). In fact, the team, using a teletype communication as in the Turing Test, can dialogue, write poetry, make jokes, do complex mathematics and all these human things. They can even do these things continuously for days or weeks, while some of the particular individuals rest, eat, go to sleep, die, etc. Despite all of this happening on the other side of the teletype communication, the system is just regarded as one subject. So the fact that we can effectively measure the cognitive abilities of the team, or even make the team pass the Turing Test, does not lead us directly to statements such as 'the team has a mind' or 'the team is conscious'. At most, we say this in a figurative sense, as we use it for the collective consciousness of a company or country. In the end, the 'team of people' is one of the best arguments against Searle's Chinese room and a good reference whenever we are thinking about evaluating intelligence.

A different approach is based on one of the things that the Turing Test is usually criticised for: learning.^3 This alternative approach requires a proper definition of learning, and actual mechanisms for measuring learning ability. Interestingly, the answer to this is given by notions devised from Turing machines. In the 1960s, Ray Solomonoff 'solved' the problem of induction (and the related problems of prediction and learning) [36] by the use of Turing machines. This, jointly with the theory of inductive inference given by the Minimum Message Length (MML) principle [39, 40, 38, 5], algorithmic information theory [1], Kolmogorov complexity [25, 36] and compression theory, paved the way in the 1990s for a new approach for defining and measuring intelligence based on algorithmic information theory. This approach will be summarised in the next section.
While initially there was some connection to the Turing Test, this line of research has been evolving and consolidating over the past fifteen years (or more), cutting all the links to the Turing Test. It has provided important insights into what intelligence is and how it can be measured, and has given clues to the (re-)understanding of other areas where intelligence is defined and measured, such as psychometrics and animal cognition. An important milestone of this journey has been the recent realisation that (social) intelligence is the ability to perform well in an environment full of other agents of similar intelligence. This is a consequence of experiments which show that when performance is measured in environments where no other agents coexist, some important traits of intelligence are not fully recognised. A solution for this has been formalised as the so-called Darwin-Wallace distribution of environments (or tasks) [18]. The outcome of all this is that it is increasingly an issue whether intelligence might be needed to measure intelligence. But this is not because we might need intelligent judges as in the Turing Test; rather, we may need other intelligent agents to become part of the exercises or tasks an intelligence test should contain (as per footnote 1). This seems to take us back to the Turing Test, a point some of us deliberately abandoned more than fifteen years ago. Re-visiting the Turing Test now is necessarily very different, because of the technical companions, knowledge and results we have gathered during this journey (universal Turing machines, compression, universal distributions, Solomonoff-Kolmogorov complexity, MML, reinforcement learning, etc.).

The paper is organised as follows. Section 2 gives a short account of the past fifteen years of definitions and tests of machine intelligence based on (algorithmic) information theory. It also discusses some of the most recent outcomes and positions in this line, which have led to the notion of the Darwin-Wallace distribution and the need for including other intelligent agents in the tests, suggesting an inductive (or recursive, or iterative) test construction and definition. This is linked to the notion of a recursive Turing Test (see [32, sec. 5.1] for a first discussion of this). Section 3 analyses the base case by proposing several schemata for evaluating systems that are able to imitate Turing machines. Section 4 defines different ways of doing the recursive step, inspired by the Darwin-Wallace distribution, along with ideas for making this feasible. Section 5 briefly explores how all this might develop, and touches upon concepts such as universality in Turing machines and potential intelligence, as well as some suggestions as to how machine intelligence measurement might develop in the future.

^3 This can be taken as further evidence that Turing did not conceive the imitation test as an actual test for intelligence: the issue of machines being able to learn was seen by Turing as inherent to intelligence [37, section 7], and yet the Turing Test is not especially good at detecting learning ability during the test.

2 MACHINE INTELLIGENCE MEASUREMENT USING TURING MACHINES

There are, of course, many proposals for intelligence definitions and tests for machines which are not based on the Turing Test.
Some of them are related to psychometrics, some others may be related to other areas of cognitive science (including animal cognition), and some others originate from artificial intelligence (e.g., competitions running on specific tasks such as planning, robotics, games, reinforcement learning, ...). For an account of some of these, the reader can find a good survey in [26]. In this section, we will focus on approaches which use Turing machines (and hence computation) as the basic component for the definition of intelligence and the derivation of tests for machine intelligence.

Most views of intelligence in computer science rest on a notion of intelligence as a special kind of information processing. The nature of information, its actual content and the way in which patterns and structure can appear in it can only be explained in terms of algorithmic information theory. The Minimum Message Length (MML) principle [39, 40] and Solomonoff-Kolmogorov complexity [36, 25] capture the intuitive notion that there is structure —or redundancy— in data if and only if the data are compressible, with the relationship between MML and (two-part) Kolmogorov complexity articulated in [40][38, chap. 2][5, sec. 6]. While Kolmogorov [25] and Chaitin [1] were more concerned with the notions of randomness and the implications of all this for mathematics and computer science, Solomonoff [36] and Wallace [39] developed the theory with the aim of explaining how learning, prediction and inductive inference work. In fact, Solomonoff is said to have 'solved' the problem of induction [36] by the use of Turing machines. He was also the first to introduce the notions of the universal distribution (the distribution of strings output by a UTM fed random input) and the invariance theorem (which states that the Kolmogorov complexities of a string calculated with two different reference machines differ by at most a constant which is independent of the string).

Chaitin briefly mentioned in 1982 the potential relationship between algorithmic information theory and measuring intelligence [2], but actual proposals in this line did not start until the late 1990s. The first proposal was precisely introduced on top of a Turing Test, as a response to Searle's Chinese room [35], in which the subject was forced to learn. This induction-enhanced Turing Test [7][6] could then evaluate a general inductive ability. The importance was not that any kind of ability could be included in the Turing Test, but that this ability could be formalised in terms of MML and related ideas, such as (two-part) compression. Independently and near-simultaneously, a new intelligence test (the C-test) [19][12] was derived from sequence prediction problems which were generated by a universal distribution [36]. The difficulty of the exercises was mathematically derived from a variant of Kolmogorov complexity, and only exercises with a certain degree of difficulty were included and weighted accordingly. These exercises were very similar to those found in some IQ tests, but here they were created from computational principles. This work 'solved' the traditional subjectivity objection to the items in IQ tests, since the continuation of each sequence was derived from its shortest explanation. However, this test only measured one cognitive ability, and its presentation was too narrow to be a general test.
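As a loose illustration of what C-test-style items look like, the toy generator below produces letter sequences from three hand-written rules and asks for the continuation. This is only a caricature: in the real C-test, items and their difficulties are derived from a variant of Kolmogorov complexity on a reference machine, not from a hand-picked rule set.

```python
# Hypothetical rules mapping a prefix to its next symbol.
RULES = {
    "constant":  lambda seq: seq[-1],               # a b b b b ...
    "alternate": lambda seq: seq[-2],               # a b a b a ...
    "successor": lambda seq: chr(ord(seq[-1]) + 1), # a b c d e ...
}

def make_item(rule, seed="ab", length=6):
    seq = list(seed)
    while len(seq) < length:
        seq.append(RULES[rule](seq))
    return "".join(seq[:-1]), seq[-1]   # shown prefix, correct continuation

for rule in RULES:
    prefix, answer = make_item(rule)
    print(f"{prefix} , ? -> {answer}")  # e.g. abbbb , ? -> b
```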
Consequently, these ideas were extended to other cognitive abilities in [14] by the introduction of other 'factors', and the suggestion of using interactive tasks where "rewards and penalties could be used instead", as in reinforcement learning [13]. Similar ideas followed relating compression and intelligence. Compression tests were proposed as a test for artificial intelligence [30], arguing that "optimal text compression is a harder problem than artificial intelligence as defined by Turing's". Nonetheless, the fact that there is a connection between compression and intelligence does not mean that intelligence can just be defined as compression ability (see, e.g., [9] for a full discussion of this). Later, [27] proposed a notion which they referred to as a "universal intelligence measure" —universal because of its proposed use of a universal distribution for the weighting over environments. The innovation was mainly the use of a reinforcement learning setting, which implicitly accounted for the abilities not only of learning and prediction but also of planning. An interesting point for making this proposal popular was its conceptual simplicity: intelligence was just seen as average performance over a range of environments, where the environments were selected by a universal distribution.

While innovative, the universal intelligence measure [27] showed several shortcomings stopping it from being a viable test. Among the problems: it requires a summation over infinitely many environments; it requires a summation over infinite time within each environment; Kolmogorov complexity is typically not computable; disproportionate weight is put on simple environments (e.g., with $1 - 2^{-7} > 99\%$ of the weight put on environments of size less than 8 bits, as also pointed out by [21]); it is static and not adaptive; and it does not account for time or agent speed. Hernández-Orallo and Dowe [17] revisited this to give an intelligence test that does not have these shortcomings. It was presented as an anytime universal intelligence test. The term universal here was used to designate that the test could be applied to any kind of subject: machine, human, non-human animal or a community of these. The term anytime was used to indicate that the test could evaluate any agent speed, that it would adapt to the intelligence of the examinee, and that it could be interrupted at any time to give an intelligence score estimate; the longer the test runs, the more reliable the estimate (the average reward [16]).

Preliminary tests have since been done [23, 24, 28] to compare human agents with non-human AI agents. These tests seem to succeed in bringing theory to practice quite seamlessly and are useful to compare the abilities of systems of the same kind. However, there are some problems when comparing systems of different kinds, such as humans and AI algorithms, because the huge difference between the two (with current state-of-the-art technology) is not clearly appreciated. One explanation for this is that (human) intelligence is the result of adaptation to environments where the probability of other agents (of lower or similar intelligence) being around is very high. However, the probability of generating another agent of even a small degree of intelligence just by the use of a universal distribution is discouragingly remote.
Even in environments where other agents are included on purpose [15], it is not clear that these agents properly represent a rich 'social' environment. In [18], the so-called Darwin-Wallace distribution is introduced, where environments are generated using a universal distribution for multi-agent environments, and where the agents that populate each environment are also generated by a universal distribution. The probability of having interesting environments and agents is very low in this first 'generation'. However, if an intelligence test is administered to this population and only those with a certain level are preserved, we may get a second population whose agents have a slightly higher degree of intelligence. Iterating this process, we obtain successive levels of the Darwin-Wallace distribution, where evolution is solely driven (boosted) by a fitness function which is just measured by intelligence tests.

3 THE BASE CASE: THE TURING TEST FOR TURING MACHINES

A recursive approach can raise the odds of environments and tasks exhibiting the kind of behaviour which is attributed to more intelligent agents. This idea of recursive populations can be linked to the notion of a recursive Turing Test [32, sec. 5.1], where the agents which have succeeded at lower levels can be used for comparison at higher levels. However, there are many interpretations of this informal notion of a recursive Turing Test. The fundamental idea is to eliminate the human reference from the test using recursion —either as the subject that has to be imitated or as the judge which is used to tell the subjects apart. Before giving some (more precise) interpretations of a recursive version of the Turing Test, we need to start with the base case, as follows (we use TM and UTM for Turing machine and universal Turing machine respectively):

Definition 1 The imitation game for Turing machines^4 is defined as a tuple ⟨D, B, C, I⟩:

• The reference subject A is randomly taken as a TM using a distribution D.
• Subject B (the evaluee) tries to emulate A.
• The similarity between A and B is 'judged' by a criterion or judge C through some kind of interaction protocol I. The test returns this similarity.

An instance of the previous schema requires us to determine the distribution D, the similarity criterion C and, most especially, how the interaction I goes. In the classical Turing Test, we know that D is the human population, C is given by a human judge, and the interaction is an open teletype conversation.^5 Of course, other distributions for D could lead to other tests, such as, e.g., a canine test, taking D as a dog population, and judges as other dogs which have to tell which is the member of the species, or perhaps even how intelligent it is (for whatever purpose, e.g., mating or idle curiosity). More interestingly, one possible instance for Turing machines could go as follows. We can just take D as a universal distribution over a reference UTM U, so $p(A) = 2^{-K_U(A)}$, where $K_U(A)$ is the prefix-free Kolmogorov complexity of A relative to U. This means that simple reference subjects have higher probability than complex subjects. Interaction can go as follows. The 'interview' consists of questions given as random finite binary strings $s_1, s_2, \ldots$ drawn using a universal distribution over another reference UTM, V. The test starts with subjects A and B receiving string $s_1$ and giving two sequences $a_1$ and $b_1$ as respective answers.
Agent B will also receive what A has output immediately after this. Judge C is just a very simple function which compares whether $a_1$ and $b_1$ are equal. After one iteration, the system issues string $s_2$. After several iterations, the score (similarity) given to B is calculated as an aggregation of the times $a_i$ and $b_i$ have been equal. This can be seen as a formalisation of the Turing Test where it is a Turing machine that needs to be imitated, and the criterion for imitation is the similarity between the answers given by A and B to the same questions. If subject B cannot be told or instructed about the goal of the test (imitating A), then we can use rewards after each step, possibly concealing A's outputs from B as well.

^4 The use of Turing machines for the reference subject is relevant and not just a way to link two things by their name, Turing. Turing machines are required because we need to define formal distributions on them, and this cannot be done (at least theoretically) for humans, animals or 'agents'.

^5 This free teletype conversation may be problematic in many ways. Typically, the judge C wishes to steer the conversation in directions which will enable her to get (near-)maximal (expected) information (before the time-limit deadline of the test) about whether or not the evaluee B is from D. One tactic for a subject which is not from D (and not a good imitator either) is to distract the judge C and steer the conversation in directions which will give judge C (near-)minimal (expected) information.

This test might seem ridiculous at first sight. Some might argue that being able to imitate a randomly-chosen TM is not related to intelligence. However, two issues are important here. First, agent B does not know who A is in advance. Second, agent B tries to imitate A solely from its behaviour. This makes the previous version of the test very similar to the most abstract setting used for analysing what learning is, how complex it is and whether it can be solved. First, it is tantamount to Gold's language identification in the limit [11]. If subject B is able to identify A at some point, then it will start to score perfectly from that moment on. While Gold was interested in whether this could be done in general and for every possible A, here we are interested in how well B does this on average for a randomly-chosen A from a distribution. In fact, many simple TMs can be identified quite easily, such as those which output the same string independently of the input. Second, and following this averaging approach, Solomonoff's setting is also very similar. Solomonoff proved that B could get the best estimations of A's behaviour if B used a mixture of all consistent models, weighted inversely by 2 to the power of their Kolmogorov complexity. While this may give the best theoretical approach for prediction, and perhaps for "imitation", it does not properly "identify" A. Identification can only properly be claimed if we have one single model which is exactly A. This distinction between one vs. multiple models is explicit in the MML principle, which usually considers just one single model: the one with the shortest two-part message encoding of said model followed by the data given this model.
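Putting the pieces of Definition 1 together, a minimal sketch of the interaction protocol might look as follows, with Turing machines replaced by plain Python callables (an assumption made purely for illustration) and the judge C reduced to counting agreements:

```python
import random

def imitation_score(reference, evaluee, n_questions=100,
                    rng=random.Random(1)):
    """One run of the base-case imitation game: both subjects answer
    random binary questions; the judge counts agreements; the evaluee
    sees the reference's answer after each round."""
    history, agree = [], 0
    for _ in range(n_questions):
        s = "".join(rng.choice("01")
                    for _ in range(rng.randint(1, 8)))  # question s_i
        a, b = reference(s), evaluee(s, history)
        agree += (a == b)
        history.append((s, a))        # B learns from A's observed behaviour
    return agree / n_questions        # aggregated similarity

# A trivial reference machine (echo) and an evaluee that identifies it.
reference = lambda s: s
def evaluee(s, history):
    if all(ans == q for q, ans in history):  # hypothesis: A echoes its input
        return s
    return "0"                               # fallback guess

print(imitation_score(reference, evaluee))   # 1.0: A is identified at once
```

Once the evaluee's hypothesis matches the reference's behaviour, it scores perfectly from that point on, which is exactly the identification-in-the-limit behaviour mentioned above.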
There is already an intelligence test which corresponds to the previous instance of Definition 1: the C-test, mentioned above. The C-test measures how well an agent B is able to identify the pattern behind a series of sequences (each sequence generated by a different program, i.e., a different Turing machine). The C-test does not use a query-answer setting, but the principles are the same.

We can develop a slight modification of Definition 1 by considering that subject A also tries to imitate B. This might lead to easy convergence in many cases (for relatively intelligent A and B) and would not be very useful for comparing A and B effectively. A significant step forward is to consider instead that the goal of A is to produce outputs that cannot be imitated by B. While clearly different, this is related to some versions of Turing's imitation game, where one of the human subjects pretends to be a machine. While there might be some variants here to explore, if we restrict the size of the strings used for questions and answers to 1 (which makes agreeing and disagreeing equally likely), this is tantamount to the game known as 'matching pennies' (a binary version of rock-paper-scissors where the first player has to match the head or tail of the second player, and the second player has to disagree with the head or tail of the first). Interestingly, this game has also been proposed as an intelligence test in the form of Adversarial Sequence Prediction [20][22] and is related to the "elusive model paradox" [3, footnote 211][4, p. 455][5, sec. 7.5]. This instance makes it more explicit that the distribution D over the agents that the evaluee has to imitate or compete with is crucial. In the case of imitation, however, there might be non-intelligent Turing machines which are more difficult to imitate/identify than many intelligent Turing machines, and this difficulty seems to be related to the Kolmogorov complexity of the Turing machine. And linking difficulty to Kolmogorov complexity is what the C-test does.

But biological intelligence is frequently biased towards social environments, or at least towards environments where other agents may eventually be around. In fact, societies are usually built on common sense and common understanding, but in humans this might rest on an evolutionarily-acquired ability to imitate other humans, not other intelligent beings in general. Some neurobiological structures, such as mirror neurons, have been found in primates and other species; these may be responsible for understanding what others do and will do, and for learning new skills by imitation. Nonetheless, we must say that human unpredictability is frequently impressive, and its relation to intelligence is far from being understood. Interestingly, some of the first analyses of this issue [34][29] linked the problem with the competitive/adversarial scenario, which is equivalent to the matching pennies problem, where the intelligence of the peer is the most relevant feature (if not the only one) for assessing the difficulty of the game, as happens in most games. In fact, matching pennies is the purest and simplest such game, since it reduces the complexity of the 'environment' (rules of the game) to a minimum.
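The matching-pennies instance is easy to simulate. In the sketch below (an illustration, not the Adversarial Sequence Prediction framework of [20] itself), a frequency-counting matcher plays against two opponents: a biased one, whose regularity it exploits, and a uniformly random one, against which the expected score is zero:

```python
import random

def matching_pennies(matcher, opponent, rounds=2000):
    """Matcher scores +1 when the bits agree, -1 otherwise.
    Each player chooses from the other's full history of moves."""
    hist_m, hist_o, score = [], [], 0
    for _ in range(rounds):
        a = matcher(hist_o)          # tries to match the opponent
        b = opponent(hist_m)         # tries (or not) to evade
        score += 1 if a == b else -1
        hist_m.append(a)
        hist_o.append(b)
    return score / rounds            # average reward in [-1, 1]

freq_matcher = lambda hist: max("01", key=hist.count) if hist else "0"
biased = lambda hist: "1" if random.random() < 0.7 else "0"
uniform = lambda hist: random.choice("01")

print(matching_pennies(freq_matcher, biased))   # ~0.4: bias is exploited
print(matching_pennies(freq_matcher, uniform))  # ~0.0: nothing to exploit
```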
4 RECURSIVE TURING TESTS FOR TURING MACHINES

The previous section has shown that introducing agents (in this case, agent A) into a test setting requires a clear assessment of the distribution used for introducing them. A general expression of how to make a Turing Test for Turing machines recursive is as follows:

Definition 2 The recursive imitation game for Turing machines is defined as a tuple ⟨D, C, I⟩, where tests and distributions are obtained as follows:

1. Set $D_0 = D$ and i = 0.
2. For each agent B in a sufficiently large set of TMs:
3. Apply a sufficiently large set of instances of Definition 1 with parameters ⟨$D_i$, B, C, I⟩.
4. B's intelligence at degree i is averaged from this sample of imitation tests.
5. End for.
6. Set i = i + 1.
7. Calculate a new distribution $D_i$ where each TM has a probability which is directly related to its intelligence at level i − 1.
8. Go to 2.

This gives a sequence of distributions $D_i$. The previous approach is clearly uncomputable in general, and still intractable even if reasonable samples, heuristics and step limitations are used. A better approach to the problem would be some kind of propagation system, such as Elo's rating system for chess [10], which has already been suggested in some works and competitions in artificial intelligence. A combination of a soft universal distribution, where simple agents would have slightly higher probability, and a one-vs-one credit propagation system such as Elo's rating (or any other mechanism which returns maximal expected information with a minimum of pairings) could feasibly aim at a reasonably good estimate of the relative abilities of a big population of Turing machines, including some AI algorithms amongst them, as sketched after this section.

What would this rating mean? If we are using the imitation game, a high rating would show that the agent is able to imitate/identify other agents of lower rating well, and that it is a worse imitator/identifier than other agents with higher rating. However, there is no reason to think that the relations are transitive and anti-reflexive; e.g., it might even happen that an agent with a very low ranking would be able to imitate an agent with a very high ranking better than the other way round. One apparently good thing about this recursion and rating system is that the start-up distribution can be very important from the point of view of heuristics, but it might be less important for the final result. This is yet another way of escaping from the problems of using a universal distribution for environments or agents, where very simple things take almost all the probability —as per section 2. Using difficulty as in the C-test, making adaptive tests such as the anytime test, setting a minimum complexity value [21], or using hierarchies of environments [22] where "an agent's intelligence is measured as the ordinal of the most difficult set of environments it can pass", are all solutions for this. We have just seen another possible solution, where evaluees (or similar individuals) can take part in the tests.
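A minimal sketch of the Elo-style propagation mentioned above: ratings are updated after random 1-vs-1 pairings, with each imitation contest stubbed out by a coin flip biased by a hidden 'skill' (the stub, the K-factor and the pairing policy are all assumptions of this illustration):

```python
import random

def elo_update(r_a, r_b, score_a, k=16):
    """Standard Elo update: score_a is 1.0 if A out-imitated B,
    0.0 if B won, 0.5 for a draw."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    return (r_a + k * (score_a - expected_a),
            r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

skills = {name: random.random() for name in "ABCDEF"}   # hidden ground truth
ratings = {name: 1000.0 for name in skills}
for _ in range(5000):                                   # random 1-vs-1 pairings
    a, b = random.sample(sorted(skills), 2)
    win_prob_a = skills[a] / (skills[a] + skills[b])    # stub for a contest
    outcome = 1.0 if random.random() < win_prob_a else 0.0
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)

print(sorted(ratings, key=ratings.get, reverse=True))   # recovered ordering
print(sorted(skills, key=skills.get, reverse=True))     # true ordering
```

With enough pairings, the rating order tends to track the hidden skill order, which is the kind of relative-ability estimate the text aims at, without enumerating the whole population.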
5 DISCUSSION

The Turing Test, in some of its formulations, is a game where an agent tries to imitate another agent (or its species or population), which might (or might not) be cheating. If both agents are fair, and we do not consider any previous information about the agents (or their species or populations), then we have an imitation test for Turing machines. If one is cheating, we get closer to the adversarial case we have also seen. Instead of including agents arbitrarily, or assuming a priori that any agent has a given level of intelligence, a recursive approach is necessary. This is conceptually possible, as we have seen, although its feasible implementation needs to be carefully considered, possibly in terms of rankings after random 1-vs-1 comparisons.

This view of the (recursive) Turing Test in terms of Turing machines has allowed us to connect the Turing Test with fundamental issues in computer science and artificial intelligence, such as the problem of learning (as identification), Solomonoff's theory of prediction, the MML principle, game theory, etc. These connections extend to other disciplines such as (neuro-)biology, where the roles of imitation and adversarial prediction are fundamental, as in predator-prey games, mirror neurons, common coding theory, etc. In addition, this has shown that the line of research on intelligence tests derived from algorithmic information theory, and the recent Darwin-Wallace distribution, are closely related to this as well. This (again) links this line of research to the Turing Test, where humans have been replaced by Turing machines.

This sets up many avenues for research and discussion. For instance, the idea that the ability to imitate relates to intelligence can be understood in terms of the universality of a Turing machine, i.e., the ability of a Turing machine to emulate another. If a machine can emulate another, it can acquire all the properties of the latter, including intelligence. However, in this paper we have referred to the notion of 'imitation', which is different from the concept of a universal Turing machine, since a UTM is defined as a machine such that there is an input that turns it into any other pre-specified Turing machine. A machine which is able to imitate well is a good learner, one which can eventually identify any pattern in the input and use it to imitate the source. In fact, a good imitator is, potentially, very intelligent, since it can, in theory (and disregarding efficiency issues), act as any other very intelligent being just by observing its behaviour. Turing advocated learning machines in section 7 of the very same paper [37] where he introduced the Turing Test. Solomonoff taught us what learning machines should look like. We are still struggling to make them work in practice, and preparing to assess them.

ACKNOWLEDGEMENTS

This work was supported by the MEC projects EXPLORA-INGENIO TIN 2009-06078-E, CONSOLIDER-INGENIO 26706 and TIN 2010-21062-C02-02, and GVA project PROMETEO/2008/051. Javier Insa-Cabrera was sponsored by Spanish MEC-FPU grant AP2010-4389.

REFERENCES

[1] G. J. Chaitin, 'On the length of programs for computing finite sequences', Journal of the Association for Computing Machinery, 13, 547–569, (1966).
[2] G. J. Chaitin, 'Gödel's theorem and information', International Journal of Theoretical Physics, 21(12), 941–954, (1982).
[3] D. L. Dowe, 'Foreword re C. S. Wallace', Computer Journal, 51(5), 523–560, (September 2008). Christopher Stewart Wallace (1933–2004) memorial special issue.
[4] D. L. Dowe, 'Minimum Message Length and statistically consistent invariant (objective?) Bayesian probabilistic inference - from (medical) "evidence"', Social Epistemology, 22(4), 433–460, (October–December 2008).
[5] D. L. Dowe, 'MML, hybrid Bayesian network graphical models, statistical consistency, invariance and uniqueness', in Handbook of the Philosophy of Science - Volume 7: Philosophy of Statistics, eds., P. S. Bandyopadhyay and M. R. Forster, pp. 901–982. Elsevier, (2011).
Hajek, ‘A non-behavioural, computational extension to the Turing Test’, in Intl. Conf. on Computational Intelligence & Multimedia Applications (ICCIMA'98), Gippsland, Australia, pp. 101–106, (February 1998).
[7] D. L. Dowe and A. R. Hajek, ‘A computational extension to the Turing Test’, in Proceedings of the 4th Conference of the Australasian Cognitive Science Society, University of Newcastle, NSW, Australia, (September 1997).
[8] D. L. Dowe and J. Hernández-Orallo, ‘IQ tests are not for machines, yet’, Intelligence, 40(2), 77–81, (2012).
[9] D. L. Dowe, J. Hernández-Orallo, and P. K. Das, ‘Compression and intelligence: social environments and communication’, in Artificial General Intelligence, eds., J. Schmidhuber, K. R. Thórisson, and M. Looks, volume 6830, pp. 204–211. LNAI series, Springer, (2011).
[10] A. E. Elo, The Rating of Chessplayers, Past and Present, volume 3, Batsford, London, 1978.
[11] E. M. Gold, ‘Language identification in the limit’, Information and Control, 10(5), 447–474, (1967).
[12] J. Hernández-Orallo, ‘Beyond the Turing Test’, J. Logic, Language & Information, 9(4), 447–466, (2000).
[13] J. Hernández-Orallo, ‘Constructive reinforcement learning’, International Journal of Intelligent Systems, 15(3), 241–264, (2000).
[14] J. Hernández-Orallo, ‘On the computational measurement of intelligence factors’, in Performance Metrics for Intelligent Systems Workshop, ed., A. Meystel, pp. 1–8. National Institute of Standards and Technology, Gaithersburg, MD, U.S.A., (2000).
[15] J. Hernández-Orallo, ‘A (hopefully) non-biased universal environment class for measuring intelligence of biological and artificial systems’, in Artificial General Intelligence, 3rd Intl Conf, ed., M. Hutter et al., pp. 182–183. Atlantis Press, extended report at http://users.dsic.upv.es/proy/anynt/unbiased.pdf, (2010).
[16] J. Hernández-Orallo, ‘On evaluating agent performance in a fixed period of time’, in Artificial General Intelligence, 3rd Intl Conf, ed., M. Hutter et al., pp. 25–30. Atlantis Press, (2010).
[17] J. Hernández-Orallo and D. L. Dowe, ‘Measuring universal intelligence: Towards an anytime intelligence test’, Artificial Intelligence Journal, 174, 1508–1539, (2010).
[18] J. Hernández-Orallo, D. L. Dowe, S. España-Cubillo, M. V. Hernández-Lloreda, and J. Insa-Cabrera, ‘On more realistic environment distributions for defining, evaluating and developing intelligence’, in Artificial General Intelligence, eds., J. Schmidhuber, K. R. Thórisson, and M. Looks, volume 6830, pp. 82–91. LNAI, Springer, (2011).
[19] J. Hernández-Orallo and N. Minaya-Collado, ‘A formal definition of intelligence based on an intensional variant of Kolmogorov complexity’, in Proc. Intl Symposium of Engineering of Intelligent Systems (EIS'98), pp. 146–163. ICSC Press, (1998).
[20] B. Hibbard, ‘Adversarial sequence prediction’, in Artificial General Intelligence 2008: Proceedings of the First AGI Conference, pp. 399–403. IOS Press, (2008).
[21] B. Hibbard, ‘Bias and no free lunch in formal measures of intelligence’, Journal of Artificial General Intelligence, 1(1), 54–61, (2009).
[22] B. Hibbard, ‘Measuring agent intelligence via hierarchies of environments’, Artificial General Intelligence, 303–308, (2011).
[23] J. Insa-Cabrera, D. L. Dowe, S. España-Cubillo, M.
Victoria Hernández-Lloreda, and José Hernández-Orallo, ‘Comparing humans and AI agents’, in AGI: 4th Conference on Artificial General Intelligence - Lecture Notes in Artificial Intelligence (LNAI), volume 6830, pp. 122–132. Springer, (2011).
[24] J. Insa-Cabrera, D. L. Dowe, and José Hernández-Orallo, ‘Evaluating a reinforcement learning algorithm with a general intelligence test’, in CAEPIA - Lecture Notes in Artificial Intelligence (LNAI), volume 7023, pp. 1–11. Springer, (2011).
[25] A. N. Kolmogorov, ‘Three approaches to the quantitative definition of information’, Problems of Information Transmission, 1, 4–7, (1965).
[26] S. Legg and M. Hutter, ‘Tests of machine intelligence’, in 50 Years of Artificial Intelligence, pp. 232–242. Springer-Verlag, (2007).
[27] S. Legg and M. Hutter, ‘Universal intelligence: A definition of machine intelligence’, Minds and Machines, 17(4), 391–444, (November 2007).
[28] S. Legg and J. Veness, ‘An Approximation of the Universal Intelligence Measure’, in Proceedings of the Solomonoff 85th Memorial Conference. Springer, (2012).
[29] D. K. Lewis and J. Shelby-Richardson, ‘Scriven on human unpredictability’, Philosophical Studies: An International Journal for Philosophy in the Analytic Tradition, 17(5), 69–74, (October 1966).
[30] M. V. Mahoney, ‘Text compression as a test for artificial intelligence’, in Proceedings of the National Conference on Artificial Intelligence, AAAI, pp. 970–970, (1999).
[31] G. Oppy and D. L. Dowe, ‘The Turing Test’, in Stanford Encyclopedia of Philosophy, ed., Edward N. Zalta. Stanford University, (2011). http://plato.stanford.edu/entries/turing-test/.
[32] P. Sanghi and D. L. Dowe, ‘A computer program capable of passing IQ tests’, in 4th Intl. Conf. on Cognitive Science (ICCS'03), Sydney, pp. 570–575, (2003).
[33] A. P. Saygin, I. Cicekli, and V. Akman, ‘Turing test: 50 years later’, Minds and Machines, 10(4), 463–518, (2000).
[34] M. Scriven, ‘An essential unpredictability in human behavior’, in Scientific Psychology: Principles and Approaches, eds., B. B. Wolman and E. Nagel, 411–425, Basic Books (Perseus Books), (1965).
[35] J. R. Searle, ‘Minds, brains and programs’, Behavioural and Brain Sciences, 3, 417–457, (1980).
[36] R. J. Solomonoff, ‘A formal theory of inductive inference’, Information and Control, 7, 1–22, 224–254, (1964).
[37] A. M. Turing, ‘Computing machinery and intelligence’, Mind, 59, 433–460, (1950).
[38] C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Information Science and Statistics, Springer Verlag, May 2005. ISBN 0-387-23795-X.
[39] C. S. Wallace and D. M. Boulton, ‘An information measure for classification’, Computer Journal, 11(2), 185–194, (1968).
[40] C. S. Wallace and D. L. Dowe, ‘Minimum message length and Kolmogorov complexity’, Computer Journal, 42(4), 270–283, (1999).

What language for Turing Test in the age of qualia?

Francesco Bianchini (Dept. of Philosophy, University of Bologna, francesco.bianchini5@unibo.it) and Domenica Bruni (Dept. of Cognitive Science, University of Messina, dbruni@unime.it)

Abstract. What is the most relevant legacy of Turing for the epistemology of Artificial Intelligence (AI) and cognitive science? Of course, we could see it in the ideas set out in his well-known article of 1950, Computing Machinery and Intelligence. But how could his imitation game, and its subsequent evolution into what we know as the Turing Test, still be so relevant?
What we want to argue is that the core of the imitation game, as a method for evaluating research on intelligent artifacts, lies not specifically in (natural) language capability as a way of showing the presence of intelligence in a certain entity, but in the interaction between human beings and machines. Human-computer interaction is a particular field of information science for many important practical respects, but interaction between human beings and machines is the deepest sense of Turing's ideas on the evaluation of intelligent behavior and entities, within and beyond its connection with natural language. And from this point of view it could be methodologically and epistemologically useful for further research in every discipline involving machines and artificial artifacts, especially as concerns the very current subject of consciousness and qualia. In what follows we will try to argue for such a perspective by showing some fields in which interaction, in connection with different sorts of language, could be of interest in the spirit of Turing's 1950 article.

1 TURING, LANGUAGE AND INTERACTION

One of the most interesting ideas of Turing was a language-based test for proving the intelligence, or the intelligent behavior, of a program - in Turing's terms, of a machine showing an autonomous and self-produced intelligent behavior. Actually, Turing never spoke about a test, but just about an imitation game, using the concept of imitation as an intuitive one. This is a typical way of thinking for Turing, though, who had provided a method for catching the notion of computable function in a mechanical way through a set of intuitive concepts about fifteen years before [24]. As in the case of computation theory, Turing's aim in the 1950 article was to deal with a very notable subject in the easiest and most straightforward manner, avoiding involvement with more complex and specific theoretical structures based on field-dependent notions. In the case of the imitation game, the combination of the notion of "imitation" and of the use of natural language allowed Turing to express a paradigmatic method for evaluating artificial products, but it gave rise as well to an endless debate over the last sixty years about the suitability of this kind of testing of artificial intelligence. Leaving aside the problem concerning the correct interpretation of the notion of "imitation", we may first ask whether the role of language in the test is fundamental, or whether it is just connected to the spirit of the period in which Turing wrote his paper, that is, to the then-current behaviorist paradigm in psychology and to the centrality of natural language in the philosophy of the twentieth century. In other terms, why did Turing choose natural language in order to build a general frame for evaluating the intelligence of artificial, programmed artifacts? Is such a way of thinking (and researching) still useful? And, if so, what can we say about it in relation to further research in this field? As we said, the choice of natural language had the purpose of putting the matter in an intuitive manner. We human beings usually ascribe intelligence to other human beings through linguistic conversations, mostly carried out in a question-answer form.
Besides, Turing himself asserts in the 1950 article that such a method «has the advantage of drawing a fairly sharp line between the physical and the intellectual capacities of a man» [26]. This is the ordinary explanation of Turing's choice. But it is also true that, in a certain sense, the very first enunciation of the imitation game appears in a previous work by Turing where, ending his exposition on machine intelligence, he speaks about a «little experiment» regarding the possibility of a chess game between two human beings (A and C), and between a human being (A) and a paper machine worked by a human being (B). Turing asserts that if «two rooms are used with some arrangement for communicating moves, and a game is played between C and either A or the paper machine […] C may find it quite difficult to tell which he is playing. (This is a rather idealized form of an experiment I have actually done.)» [25]. Such a brief sketch of the imitation game in the 1948 paper is not surprising, because that paper is a sort of first draft of Turing's ideas of the 1950 paper, and it is even more considerable for some of its remarks, for example on self-organizing machines or on the possibility of machine learning. Moreover, it is not surprising that Turing speaks about machines referring to them as paper machines, namely just by their logical, abstract structure. It is another recurrent theme in Turing, one that recalls the human computor of the 1936 paper. What is interesting is the fact that the first, short outline of the imitation game is not based on language, but on a subject that is more early-artificial-intelligence-like, that is, the game of chess. So, (natural) language is not necessary for the imitation game from Turing's point of view, and yet the ordinary explanation of Turing's choice of language is still valid within such a framework. In other terms, Turing was aware not only that there are other domains in which a machine can apply itself autonomously - a trivial fact - but also that such domains are as suitable as natural language for the imitation game. Nevertheless, he chose natural language as paradigmatic. What conclusions can we draw from such remarks? Probably two. First, Turing was pretty definitely aware that the evaluation of artificial intelligence (AI) products, in a broad sense, would be a very difficult subject, maybe the most fundamental as regards the epistemology of AI and cognitive science, even if, obviously, he did not use such terms in 1950. Secondly, that the choice of language and the role of language in the imitation game are even more subtle than popular culture and the AI tradition usually assert. As a matter of fact, he did not speak about natural language in general but of a "question-answer method", a method that involves communication, not just language processing or production. So, from this point of view it seems that, for Turing, natural language processing or production are just some peculiar human cognitive abilities among many others, and are not basic for testing intelligence. What is basic for such a task is communication or, to use another, more inclusive term, interaction. But a specification is needed.
We are not maintaining that the capability of using language is not a cognitive feature, but rather that in Turing's view interaction is the best way to detect intelligence, and language interaction, by means of the question-answer method, is perhaps the most intuitive form of interaction for human beings. No interaction is tantamount to no possibility of identifying intelligence, and for such a purpose one of the two poles of the interaction must be a human being (see footnote 3). Furthermore, the «question and answer method seems to be suitable for introducing almost any one of the fields of human endeavour that we wish to include» [26] and, leaving aside the above-mentioned point concerning Turing's explicit request to penalize in no way machines or human beings for their unshared features, we could consider this as the main aim of Turing, namely generalizing the testing of intelligence. Of course, such an aim anticipates one of the mainstreams of the AI that was soon to arise (see footnote 4), but it has an even wider range. Turing was not speaking, indeed, about problem solving, but trying to formulate a criterion and a method to show and identify machine intelligent behavior in different-field interaction with human beings. So, language communication seems to become both a lowest common denominator for every field in which it is possible to test intelligence and, at the same time, a way to carve out a single field or domain for testing intelligence from the point of view of interaction. We will now consider a few of them, in order to investigate and discuss whether they could be relevant for the qualia problem.

Footnote 3: A similar way of thinking seems to be suggested, as regards specifically natural language, by an old thought experiment formulated by Putnam, in which he imagines a human being learning by heart a passage in a language he does not know and then repeating it in a sort of stream of consciousness. If a telepath, knowing that particular language, could perceive the stream of consciousness of the human being who has memorized the passage, the telepath could think the human being knows that language, even though it is not so. What is lacking in the scene described in the thought experiment? A real interaction. As a matter of fact, the conclusion of Putnam himself is that: «the understanding, then, does not reside in the words themselves, nor even in the appropriateness of the whole sequence of words and sentences. It lies, rather, in the fact that an understanding speaker can do things with the words and sentences he utters (or thinks in his head) besides just utter them. He can answer questions, for example […].» [19]. And this appears to be very close to what Turing thought more than twenty years before.

Footnote 4: For example, consider the target of building a General Problem Solver pursued at length by Newell, Shaw and Simon [15, 16].

2 LANGUAGE TRANSLATION AS CULTURAL INTERACTION

A first field in which language and interaction are involved is language translation. We know that machine translation has been a very difficult target of computer science and AI from their origins up to the present day. The reason is that translation usually concerns two different natural languages, two tongues, and it is not merely an act of substitution. On the contrary, translation involves many different levels of language: syntactic and semantic levels, but also cultural and stylistic levels, which are very context-dependent. It is very difficult for a machine to find the correct word or expression to render in a specific language what is said in another language.
Many different approaches in this field, especially from computational linguistics, are available to address the problem of a good translation. But in any case, it is an operation that still remains improvable. As a matter of fact, if we consider some machine translation tools like Google Translate, there are generally syntactic and semantic problems in every product of such tools, even if, maybe, the latter are larger than the former. So, how can we test intelligence in this field concerning language? Or, in other terms, what could be a real test for detecting intelligence as regards translation? A mere improvement of the tools may not be satisfying. We could indeed think that, with the improvement of machine translation tools, we would have better and better outcomes in this field, but what we want is not a collection of excellent texts from the point of view of translation. What we want is a sort of justification of the word choices made in the empirical activity of translation. If we could have a program that is able to justify its choice of words and expressions in the act of translation, we could consider the problem of a randomly good choice of a word or an expression to be evaded. In a dialogue written as a personal tribute to Alan Turing, Douglas Hofstadter underlines a similar view. Inspired by the two little snippets of Turing's 1950 article [26], Hofstadter builds a (fictitious) conversation between a human being and a machine in order to show the falsity of simplistic interpretations of the Turing Test, which he summarizes in the following way: «even if some AI program passed the full Turing Test, it might still be nothing but a patchwork of simple-minded tricks, as lacking in understanding or semantics as is a cash register or an automobile transmission» [10]. In his dialogue, Hofstadter tries to expand the flavor of the second Turing snippet, where Mr Pickwick is compared to a winter's day [26]. Hofstadter's conversation has translation as its main topic, in particular poetry translation. Hofstadter wants to show how complex such a subject is, and how unlikely it is that a program could have a conversation of that type with a human being, and thus pass the Turing Test. By reversing the perspective, we can consider translation one of the language fields in which, in the future, it could be fruitful to test machine intelligence. But we are not merely referring to machine translation. We want to suggest that a conversation on the subject of a translation could be a target for a machine. Translation by itself, indeed, concerns many cultural aspects, as we said before, and the understanding and justification of which term or expression is suitable in a specific context of a specific language could be a very interesting challenge for a program; it would imply the program's knowledge of the cultural context of a specific language, and therefore the implementation of mechanisms for representing and handling two different language contexts. In Hofstadter's dialogue, much attention is devoted to the problem from a poetic point of view. We can get a flavour of the general issues involved by considering an extract from the dialogue, which is between two entities, a Dull Rigid Human and an Ace Mechanical Translator: «DRH: Well, of course, being an advanced AI program, you engaged in a highly optimized heuristic search. AMT: For want of a better term, I suppose you could put it that way.
The constraints I found myself under in my search were, of course, both semantic and phonetic. Semantically, the problem was to find some phrase whose evoked imagery was sufficiently close to, or at least reminiscent of, the imagery evoked by croupir dans ton lit. Phonetically, the problem was a little trickier to explain. Since the line just above ended with "stir", I needed an "ur" sound at the end of line 6. But I didn't want to abandon the idea of hyphenating right at that point. This meant that I needed two lines that matched this template: Instead of …ur…ing …… bed where the first two ellipses stand for consonants (or consonant clusters), and the third one for "in" or "in your" or something of the sort. Thus, I was seeking gerunds like "lurking", "working", "hurting", "flirting", "curbing", "squirming", "bursting", and so on — actually, a rather rich space of phonetic possibilities. DRH: Surely you must have, within your vast data bases, a thorough and accurate hyphenation routine, and so you must have known that the hyphenations you propose — "lur-king", "squir-ming", "bur-sting", and so forth — are all illegal… AMT: I wish you would not refer to my knowledge as "your vast data bases". I mean, why should that quaint, old-fashioned term apply to me any more than to you? But leaving that quibble aside, yes, of course, I knew that, strictly speaking, such hyphenations violate the official syllable boundaries in the eyes of rigid language mavens like that old fogey William Safire. But I said to myself, "Hey, if you're going to be so sassy as to hyphenate a word across a line-break, then why not go whole hog and hyphenate in a sassy spot inside the word?"» [10]. Poetry involves metrical structures, rhymes, assonances, alliterations and many other figures of speech [10]. But these constitute constraints that are easily mechanizable, by means of the appropriate set of data bases; in fact, a machine could be faster than a human being in finding, for example, every word rhyming with a given one (a minimal sketch of such a constraint search is given at the end of this section). So the problem is not whether we have to consider poetry or prose translation, and their differences, but that of catching the cultural and personal flavor of the text's author, whether within a figure-of-speech scheme or not. Poetry just has some further, but mechanizable, constraints. So, what remains outside such constraints? Is it the traditional idea of an intentionality of terms? We do not think so. The notion of intentionality seems always to involve a first-person, subjective point of view that is undetectable in a machine, as a long debate over the last thirty years seems to show. But if we consider the natural development of the intentionality problem, that of qualia (as subjective conscious experiences that we are able to express with words), maybe we could have a better problem and find a better field of investigation in considering translation as a sort of qualia communication. In other terms, a good terminological choice and a good justification of such a choice could be a suitable method for testing intelligence, even in its capability to express and understand qualia. And this could be a consequence of the fact that, generally speaking, translation is a sort of communication, a communication of contents from one particular language to another particular language; and in the end a context interaction.
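To make the point concrete, here is a minimal sketch (our illustration, not Hofstadter's) of one such mechanizable constraint: searching a word list for gerunds that fit the AMT's «…ur…ing» template, using a vowel-plus-r spelling as a crude proxy for the "ur" sound. The word list here is merely a stand-in; any dictionary file would do.

import re

# Crude spelling proxy for an "ur"-like sound followed by "-ing":
# some vowel, then "r", somewhere before the final "ing".
TEMPLATE = re.compile(r"^[a-z]*[eiou]r[a-z]*ing$")

def matching_gerunds(words):
    # Return the words fitting the template: one phonetic constraint
    # of the kind the AMT claims to search through mechanically.
    return [w for w in words if TEMPLATE.match(w)]

words = ["lurking", "working", "hurting", "flirting", "curbing",
         "squirming", "bursting", "singing", "reading"]
print(matching_gerunds(words))
# prints the seven "ur"-sounding gerunds, but not "singing" or "reading"

A real system would of course use a pronunciation dictionary rather than spelling, but the example shows why such constraints, unlike the cultural and personal flavor of a text, are easily mechanizable.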
3 INTERACTION BETWEEN MODEL AND REALITY

Another field in which the notion of interaction could be relevant from the point of view of the Turing Test is that of scientific discovery. In the long development of machine learning, some researchers implemented programs that are able to carry out generalizations from data structures within a specific scientific domain, namely scientific laws (see footnote 5). Even though they are very specific laws, they are (scientific) laws in all respects. Such programs were based on logical methods and, indeed, they could only arrive at a generalization from data structures; they were not able to obtain their outcomes from experimental conditions. More recently, other artificial artifacts have been built in order to fill this gap. For example, ADAM [8] is a robot programmed for producing outcomes in genetics with the possibility of autonomously managing real experiments. It has a logic-based knowledge base that is a model of metabolism, but it is also able to plan and run experiments to confirm or disconfirm hypotheses within a research task. In particular, it can set up experimental conditions and situations with a high level of resource optimization for investigating gene expression and associating one or more genes to one protein. The outcome is a (very specific but real) scientific law, or a set of them. We could say that ADAM is a theoretical and practical machine. It formulates a number of hypotheses about gene expression using its knowledge bases, which include all that we already know about gene expression from a biological point of view. It does the experiments to confirm or disconfirm every hypothesis, and then it carries out a statistical analysis to evaluate the results. So, is ADAM a perfect scientist, an autonomous intelligent artifact in the domain of science?

Figure 1. Diagram of the hypothesis generation-experimentation cycle for the production of new scientific knowledge, on which ADAM is based (from [21]).

Footnote 5: For example GOLEM. For some of its outcomes, see [14]; for a discussion see [5].

Of course, it is true that its outcomes are original in some cases; and it is also true that its creators, its programmers, do not see in it a substitute for scientists, but only an assistant for human scientists, even though a very efficient one, at least at the current phase of research, as happens in other fields like chess playing and music. What does ADAM lack to become a scientist? We could say that it lacks the possibility of controlling or verifying its outcomes from different points of view, for example from an interdisciplinary perspective. But this seems a merely practical limit, surmountable with a lot of additional scientific knowledge from different domains, given that ADAM has the concrete possibility of doing experiments. Yet, as regards this specific aspect, what is the reach of ADAM - or of other programs devoted to scientific discovery, like EVE, specialized in the pharmaceutical field - in conducting experiments? Or, which is the same thing, how far could it get in formulating hypotheses? It all seems to depend on its capacity for interaction with the real world. And so we could say that in order to answer the question of whether ADAM or other similar artificial artifacts are intelligent, we have to consider not only the originality of their outcomes, but also their creativity in hypothesis formulation, a task that is strictly dependent on their practical interaction with the real world.
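The cycle of Figure 1 can be rendered schematically as follows. This is our own abstraction of the loop described above, not code from the ADAM project, and all the callables (generate, run_experiment, consistent) as well as the background knowledge are placeholders for the corresponding components of a real robot scientist.

def discovery_cycle(background, generate, run_experiment, consistent, rounds=10):
    # Hypothesize -> experiment -> evaluate -> update knowledge, repeatedly.
    knowledge = set(background)
    for _ in range(rounds):
        for hypothesis in generate(knowledge):
            result = run_experiment(hypothesis)   # interaction with the real world
            if consistent(hypothesis, result):    # e.g. a statistical test
                knowledge.add(hypothesis)         # a new, very specific, law
    return knowledge

The decisive step, on the view defended here, is run_experiment: it is the one place where the model leaves its closed knowledge base and interacts with reality.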
Is this a violation of what Turing said we should not consider in order to establish whether a machine is intelligent, namely its "physical" difference from human beings? We think not. We think that interaction between a model of reality and reality itself, from a scientific point of view, is the most important aspect of scientific discovery, and that it could in the future be one of the ways in which to evaluate the results of artificial artifacts and their intelligence. As a matter of fact, science and scientific discovery take place in a domain in which knowledge and methods are widely structured, and the invention of new hypotheses and theories could reveal itself to be a task of combining previous knowledge, even knowledge expressed in some symbolic language, more than a creation from nothing. And the capability to operate such a combination could be the subjective perspective, the first-person point of view, of future machines.

4 EMOTION INTERACTING: THE CASE OF LOVE

Another field in which the notion of interaction could be relevant from the point of view of the Turing Test is that of the emotions, their role in the interaction with the environment, and the language used to transmit them. Emotions are cognitive phenomena. It is not possible to characterize them as irrational dispositions; rather, they provide us with all the necessary information about the world around us. Emotions are a way of relating to the environment and to other individuals. Emotions are probably a necessary condition for our mental life [2, 6]. They show us our radical dependence on the natural and social environment. One of the most significant cognitive emotions is love. Since antiquity, philosophers have considered love a crucial issue in their studies. Modern-day psychologists have discussed its dynamics and dysfunctions. However, it has rarely been investigated as a genuine human cognitive phenomenon. In its most common sense, love has been considered in poetry, philosophy, and literature as being something universal, but at the same time as a radically subjective feeling. This ambiguity is the reason why love is such a complicated subject matter. Now, we want to argue that love, by virtue of its rational character, can be studied in a scientific way. According to the philosophical tradition, human beings are rational animals. However, the same rationality that guides us in many circumstances sometimes creates difficult puzzles. Feelings and emotions, like love, are fortunately able to offer an efficient reason for action. Even if what "love" is defies definition, it remains a crucial experience in the ordinary life of human beings. It participates in the construction of human nature and in the construction of an individual's identity. This is shown by the universality of the feeling of love across cultures. It is rather complicated to offer a precise definition of "love", because its features include emotional states such as tenderness, commitment, passion, desire, jealousy, and sexuality. Love modifies people's way of thinking and acting, and it is characterized by a series of physical symptoms. In fact, love has often been considered a type of mental illness. How many kinds of love are there? In what relation do they stand? Over the past decades many classifications of love have been proposed. Social psychologists such as Berscheid and Walster [1], for example, in their cognitive theory of emotion, propose two stages of love.
The first has to do with a state of physiological arousal, and is caused by the presence of positive emotions, like sexual arousal, satisfaction, and gratification, or by negative emotions, such as fear, frustration, or being rejected. The second stage of love is called "tagging", i.e., the person defines this particular physiological arousal as a "passion" or "love". A different approach is taken by Lee [12] and Hendrick [7, 9]. Their interest is to identify the many ways we have of classifying love. They focus their attention on love styles, identifying six of them: Eros, Ludus, Mania, Pragma, Storge and Agape. Eros (passionate love) is the passionate love which gives central importance to the sexual and physical appearance of the partner; Ludus (game-playing love) is a type of love exercised as a game that does not lead to a stable, lasting relationship; Mania (possessive, dependent love) is a very emotional type of love which is identified with the stereotype of romantic love; Pragma (logical love) concerns the fact that lovers have a concrete and pragmatic sense of the relationship, using romance to satisfy their particular needs and dictating its terms; Storge (friendship-based love) is a style in which the feeling of love toward the other grows very slowly. Finally, it is possible to speak of Agape (all-giving selfless love), characterized by a selfless, spiritual and generous love, something rarely experienced in the lifetime of individuals. Robert Sternberg [20] offers a graphical representation of love called the "triangular theory". The name stems from the fact that the identified components are the vertices of a triangle. The work of the Yale psychologist deviates from previous taxonomies, in other words from the previous attempts made to offer a catalogue of the existing types of love. The psychological elements identified by Sternberg to articulate feelings of love are three: intimacy, passion, decision/commitment. The different forms of love that one may encounter in everyday life would result from a combination of each of these elements or the lack of them. Again, in the study and analysis of the feeling of love we encounter a list of types of love: non-love, liking, infatuation, empty love, romantic love, companionate love, fatuous love, consummate love. Philosophers, fleeing from any kind of taxonomy, approach the feeling of love cautiously, surveying it and perhaps even fearing it. Love seems to have something in common with the deepest of mysteries, i.e. the end of life. It leads us to question, as death does, the reality around us as well as ourselves, in the hope that something precious and important does not pass us by. But love is also the guardian of an evil secret which, when revealed, consists in the nonexistence of the love object, in that it is nothing but a projection of our own desires. Love is, according to Arthur Schopenhauer, a sequence of actions performed by those who know perfectly well that there is a betrayal in it, in that it does nothing else but carry out the painful event in which life consists. Thus love, too, has its veil of Maya, and once it is torn down, what remains? What remains is the instinctive imperative of the sexual reproduction of the species. Human nature has for Harry G. Frankfurt [4] two fundamental characteristics: rationality and the capacity to love.
Reason and love are the regulatory authorities that guide the choices to be made, providing the motivation to do what we do and constraining it by creating a space which circumscribes or outlines the area in which we can act. On one hand, the ability to reflect and think about ourselves leads to a sort of paralysis: the ability to reflect, indeed, offers the tools to achieve our desires but, at the same time, is often an impediment to their satisfaction, leading to an inner split. On the other, the ability to love unites all our fragments, structuring and directing them towards a definite end. Love, therefore, seems to be involved in the integration processes of personal identity. In The Origin of Species [3] Charles Darwin assigned great importance to sexual selection, arguing that language, in its gradual development, was subject to sexual selection, and recognizing in it the features of an adaptation that we could call unusual (such as intelligence or morality). The dispute that has followed concerning language and its origins has ignited the minds of many scholars and fueled the debate about whether language is innate or is, on the contrary, a product of learning. Noam Chomsky has vigorously fought this battle against the tenets of social science, maintaining that language depends on an innate genetic ability. Verbal language is a communication system far more complex than other modes of communication. There are strong referential concepts expressed through language that are capable of building worlds. Such features have been the main cause of the perception of language, within the community of scholars, as something mysterious, something that appeared suddenly in the course of our history. For a long time arguments concerning the evolution of language were banned, and the idea that such a phenomenon could be investigated and argued about according to the processes that drive the evolution of the natural world was considered to be of no help in understanding the complex nature of language. Chomsky was one of the main protagonists of this theoretical trend. According to Chomsky, the complex nature of language is such that it can be understood only through a formal and abstract approach such as the paradigm of generative grammar. This theoretical position rules out the possibility of a piecemeal approach to the study of language and the ability to use the theory of evolution to get close to understanding it. Steven Pinker and Paul Bloom, two well-known pupils of Chomsky, in an article entitled "Natural Language and Natural Selection", renewed the debate on the origin of language, stating that it is precisely the theory of evolution that provides the key to explaining the complexity of language. A fascinating hypothesis on language as a biological adaptation is the one that considers it an important feature in courtship. Precisely for this reason it would have been subject to sexual selection [13]. A good part of courtship has a verbal nature: promises, confessions, stories, statements, requests for dates are all linguistic phenomena. In order to woo, to find the right words, the right tone of voice and the appropriate arguments, you need to employ language. Even the young mathematician Alan Turing utilized the courtship form to create his imitation game, with the aim of finding an answer to a simple - but only in appearance - question ("can machines think?").
Turing formulated and proposed a way to establish this by means of a game that has three protagonists: a man, a woman and an interrogator. The man and the woman are together in one room; in another place is the interrogator, and communication is allowed through the use of a teleprinter. The ultimate goal of the interrogator is to identify whether on the other side there is a man or a woman. The interesting part concerns what would happen if, in the man's place, a computer were put that could simulate the communicative capabilities of a human being. As we mentioned before, the thing that Turing emphasizes in this context is that the only point of contact between human being and machine is linguistic communication. If the computer is capable of expressing a wide range of linguistic behavior appropriate to the specific circumstances, it can be considered intelligent. Among the behaviors to be exhibited, Turing includes kindness, the use of appropriate words, and autobiographical information. The importance of transferring autobiographical information to whoever stands in front of us, thus coating the conversation with a personal and private patina, the expression of shared interests, the use of kindness and humor: these are all ingredients typically found in the courtship rituals of human beings. It is significant that a way of demonstrating the presence of a real human being passed through a linguistic courtship, a mode of expression that reveals the complex nature of language and the presence within it of cognitive abilities. Turing asks: "Can machines think?", and we might answer: "Maybe, if they could get a date on a Saturday evening". To conclude, in the case of a very particular phenomenon such as love, one of the most intangible emotions, Turing pushes us to consider the role of language as fundamental. But love is a very concrete emotion as well, because of its first-person perspective. Nevertheless, in order to communicate it, we human beings too are compelled to express it in words as best we can, and at the same time we have only language for understanding the emotion of love in other entities (human beings, of course), together with every real possibility of making mistakes and deceiving ourselves. And so, if we admit the reality of this emotion also from a high-level cognitive point of view, one that involves intelligence and rationality, we have two consequences. The first is that only interaction reveals love; the second is that only natural language interaction, made of all the complex concepts that create a bridge between our feelings and those of another human being, reveals the qualia of the entity involved in a love exchange. Probably that is why Turing wanders through that subject in his imitation game. And probably the understanding of this kind of interaction could be, in the future, a real challenge for artificial artifacts provided with "qualia-detecting sensors", which cannot be so very different from qualia themselves.

5 A TURING TEST FOR THE HUMAN (BEING) BRAIN

A last way in which we could see interaction (connected to language) as relevant for testing intelligence in machines needs two perspective reversals. The first one concerns the use of Turing-Test-like methods to establish the presence of (a certain level of) consciousness in unresponsive brain-damage patients.
As a matter of fact, such patients are not able to use natural language to communicate as human beings usually do. So researchers try to find signs of communication that are different from language, like blinks of an eyelid, eye-tracking, simple command following, or responses to pain, and they try at the same time to understand whether these are intentional or automatic [22]. In such cases, neurologists are looking for signs of intelligence, namely of the capability of intentionally using cognitive faculties, through a behavioral method that overturns Turing's. In the case of machines and the Turing Test, the natural language faculty is the evidence of the presence of intelligence in machines; in the case of unresponsive brain-damage patients, scientists assume that the patients were able to communicate through natural language before the damage, and so that they were and are intelligent, because intelligence is a human trait. Thus, they look for bodily signs in order to establish a communication that is barred through the usual means. This is even more relevant if we consider vegetative-state patients, who are not able to perform any bodily movement. In recent years, some researchers have supposed that it is possible to establish communication with vegetative-state patients, a communication that would also show a certain level of consciousness, by means of typical neuroimaging techniques like fMRI and PET [17] (for a general presentation and discussion see also [18, 23]). In short, through such experiments they observed that some vegetative-state patients, unable to carry out any bodily response, showed brain activation very similar to that of healthy human beings when they were asked, through auditory instructions, to imagine themselves walking through their house or playing tennis. Even though the interpretation of such outcomes is controversial, because of problems regarding neuroimaging methodology and the very nature of conscious activity, if we accept them they would perhaps prove the presence of a certain level of consciousness in this kind of patient, namely the presence of consciousness in mental activities. They would prove, thus, the presence of intentionality in the patient's response, and not only of cognitive processes or activities, which could be just cognitive "islands" of mental functioning [11]. Such experimental outcomes could be very useful for building new techniques and tools of brain-computer interaction for people who are no longer able to communicate by natural language and bodily movements, even though many problems still have to be solved, from a theoretical and epistemological point of view, as regards the methodology and the interpretation of such results [23]. Is it a real communication? Are those responses a sign of awareness? Could those responses be real answers to external requests? Yet, what is important for our argument is the possibility of back-transferring these outcomes to machines, and this is the second reversal we mentioned before. As a matter of fact, these experiments are based on the assumption that human beings too are machines and that communication is interaction between mechanical parts, also in the case of subjective, phenomenal experiences, which are evoked by means of language but lack external signs. So, the challenging question is: is it possible to find a parallel in machines?
Is it possible to re-create in artificial artifacts this kind of communication, which is not behavioral but is still mechanical and detectable inside machines - virtual or concrete mechanisms - and is simultaneously a sign of consciousness and awareness in the sense of qualia? Is this sort of (non-natural-language) communication, if any, a way in which we could find qualia in programs or robots? Is it the sort of interaction that could lead us to the feelings of machines?

REFERENCES

[1] E. Berscheid, E. Walster, Interpersonal Attraction, Addison-Wesley, Boston, Mass., 1978.
[2] A.R. Damasio, Descartes' Error: Emotion, Reason, and the Human Brain, Putnam Publishing, New York, 1994.
[3] C. Darwin, On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life, Murray, London, 1859.
[4] H.G. Frankfurt, The Reasons of Love, Princeton University Press, Princeton, 2004.
[5] D. Gillies, Artificial Intelligence and Scientific Method, Oxford University Press, Oxford, 1996.
[6] P. Griffith, What Emotions Really Are. The Problem of Psychological Categories, Chicago University Press, Chicago, 1997.
[7] C. Hendrick, S. Hendrick, ‘A Theory and a Method of Love’, Journal of Personality and Social Psychology, 50, 392–402, (1986).
[8] R.D. King, J. Rowland, W. Aubrey, M. Liakata, M. Markham, L.N. Soldatova, K.E. Whelan, A. Clare, M. Young, A. Sparkes, S.G. Oliver, P. Pir, ‘The Robot Scientist ADAM’, Computer, 42, 8, 46–54, (2009).
[9] C. Hendrick, S. Hendrick, Romantic Love, Sage, California, 1992.
[10] D.R. Hofstadter, Le Ton beau de Marot, Basic Books, New York, 1997.
[11] S. Laureys, ‘The neural correlate of (un)awareness: lessons from the vegetative state’, Trends in Cognitive Sciences, 9, 12, 556–559, (2005).
[12] J. Lee, The Colors of Love, Prentice-Hall, Englewood Cliffs, 1976.
[13] G.F. Miller, The Mating Mind. How Sexual Choice Shaped the Evolution of Human Nature, Anchor Books, London, 2001.
[14] S. Muggleton, R.D. King, M.J.E. Sternberg, ‘Protein secondary structure prediction using logic-based machine learning’, Protein Engineering, 5, 7, 647–657, (1992).
[15] A. Newell, J.C. Shaw, H.A. Simon, ‘Report on a general problem-solving program’, Proceedings of the International Conference on Information Processing, pp. 256–264, (1959).
[16] A. Newell, H.A. Simon, Human Problem Solving, Prentice-Hall, Englewood Cliffs, NJ, 1972.
[17] A.M. Owen, N.D. Schiff, S. Laureys, ‘The assessment of conscious awareness in the vegetative state’, in S. Laureys, G. Tononi (eds.), The Neurology of Consciousness, Elsevier, pp. 163–172, 2009.
[18] A.M. Owen, N.D. Schiff, S. Laureys, ‘A new era of coma and consciousness science’, Progress in Brain Research, 177, 399–411, (2009).
[19] H. Putnam, Mind, Language and Reality. Philosophical Papers, Vol. 2, Cambridge University Press, Cambridge, 1975.
[20] R. Sternberg, ‘A Triangular Theory of Love’, Psychological Review, 93, 119–135, (1986).
[21] A. Sparkes, W. Aubrey, E. Byrne, A. Clare, M.N. Khan, M. Liakata, M. Markham, J. Rowland, L.N. Soldatova, K.E. Whelan, M. Young, R.D. King, ‘Towards Robot Scientists for autonomous scientific discovery’, Automated Experimentation, 2:1, (2010).
[22] J.F. Stins, ‘Establishing consciousness in non-communicative patients: A modern day version of the Turing Test’, Consciousness and Cognition, 18, 1, 187–192, (2009).
[23] J.F. Stins, S. Laureys, ‘Thought translation, tennis and Turing tests in the vegetative state’, Phenomenology and Cognitive Science, 8, 361–370, (2009).
[24] A.M. Turing, ‘On Computable Numbers, with an Application to the Entscheidungsproblem’, Proceedings of the London Mathematical Society, 42, 230–265, (1936); reprinted in: J. Copeland (ed.), The Essential Turing, Oxford University Press, Oxford, pp. 58–90, 2004.
[25] A.M. Turing, ‘Intelligent Machinery’, internal report of the National Physical Laboratory, (1948); reprinted in: J. Copeland (ed.), The Essential Turing, Oxford University Press, Oxford, pp. 410–432, 2004.
[26] A.M. Turing, ‘Computing Machinery and Intelligence’, Mind, 59, 433–460, (1950).

Could There be a Turing Test for Qualia?

Paul Schweizer (Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, EH8 9AD, UK)

Abstract. The paper examines the possibility of a Turing test designed to answer the question of whether a computational artefact is a genuine subject of conscious experience. Even given the severe epistemological difficulties surrounding the 'other minds problem' in philosophy, we nonetheless generally believe that other human beings are conscious. Hence Turing attempts to defend his original test (2T) in terms of operational parity with the evidence at our disposal in the case of attributing understanding and consciousness to other humans. Following this same line of reasoning, I argue that the conversation-based 2T is far too weak, and that we must scale up to the full linguistic and robotic standards of the Total Turing Test (3T). Within this framework, I deploy Block's distinction between Phenomenal-consciousness and Access-consciousness to argue that passing the 3T could at most provide a sufficient condition for concluding that the robot enjoys the latter but not the former. However, I then propose a variation on the 3T, adopting Dennett's method of 'heterophenomenology', to rigorously probe the robot's purported 'inner' qualitative experiences. If the robot could pass such a prolonged and intensive Qualia 3T (Q3T), then the purely behavioural evidence would seem to attain genuine parity with the human case. Although success at the Q3T would not supply definitive proof that the robot was genuinely a subject of Phenomenal-consciousness, given that the external evidence would now be equivalent with the human case, apparently the only grounds for denying qualia would be an appeal to difference of internal structure, either physical-physiological or functional-computational. In turn, both of these avenues are briefly examined.

1 INTRODUCTION

According to the computational paradigm which underpins cognitive science, Strong AI and various allied positions in the philosophy of mind, computation (of one sort or another) is held to provide the scientific key to explaining mentality in general and, ultimately, to reproducing it artificially. The paradigm maintains that cognitive processes are essentially computational processes, and hence that intelligence in the natural world arises when a material system implements the appropriate kind of computational formalism. So this broadly Computational Theory of Mind (CTM) holds that the mental states, properties and contents sustained by human beings are fundamentally computational in nature, and that computation, at least in principle, opens the possibility of creating artificial minds with comparable states, properties and contents.
Traditionally there are two basic features that are held to be essential to minds and which decisively distinguish mental from non-mental systems. One is representational content: mental states can be about external objects and states of affairs. The other is conscious experience: roughly and as a first approximation, there is something it is like to be a mind, to be a particular mental subject. As a case in point, there is something it is like for me to be consciously aware of typing this text into my desktop computer. Additionally, various states of my mind are concurrently directed towards a number of different external objects and states of affairs, such as the letters that appear on my monitor. In stark contrast, the table supporting my desktop computer is not a mental system: there are no states of the table that are properly about anything, and there is nothing it is like to be the table. And just as the notion of a mind can hardly be applied to a system with no representational states, so too many would claim that a system entirely devoid of conscious experience cannot be a mind. Hence if the project of Strong AI is to be successful at its ultimate goal of producing a system that truly counts as an artificially engendered locus of mentality, then it would seem necessary that this computational artefact be fully conscious in a manner comparable to human beings.

2 CONSCIOUSNESS AND THE ORIGINAL TURING TEST

In 1950 Turing [1] famously proposed an answer to the question 'Can a machine think?' in terms of what has since become universally referred to as the 'Turing test' (2T). In the test, a human interrogator can pose questions to the remaining two players, where the goal of the game is for the questioner to determine which of the two respondents is the computer. If, after a set amount of time, the questioner guesses correctly, then the machine loses the game, and if the questioner is wrong then the machine wins. Turing claimed, as a basic theoretical point, that any machine that could win the game a suitable number of times has passed the test and should be judged to be intelligent, in the sense that its behavioral performance has been demonstrated to be indistinguishable from that of a human being. In his prescient and ground-breaking article, Turing explicitly considers the application of his test to the question of machine consciousness. This is in section (4) of the paper, where he considers the anticipated 'Argument from Consciousness' objection to the validity of his proposed standard for answering the question 'Can a machine think?'. The objection is that, as per the above, consciousness is a necessary precondition for genuine thinking and mentality, and that a machine might fool its interlocutor and pass the purely behavioural 2T, and yet remain completely devoid of internal conscious experience. Hence merely passing the 2T does not provide a sufficient condition for concluding that the system in question possesses the characteristics required for intelligence and bona fide thinking. Hence the 2T is inherently defective. Turing's defensive strategy is to invoke the well-known and severe epistemological difficulties surrounding the very same question regarding our fellow human beings. This is the other minds problem: how do you know that other people actually have a conscious inner life like your own, and that you are not the only conscious being in the universe?
As Turing humorously notes, this type of 'solipsistic' view (although more accurately characterized as a form of other-minds skepticism, rather than full-blown solipsism), while logically impeccable, tends to make communication difficult, and rather than continually arguing over the point, it is usual to simply adopt the polite convention that everyone is conscious. Turing notes that on its most extreme construal, the only way that one could be sure that a machine or another human being is conscious and hence genuinely thinking is to be the machine or the human and feel oneself thinking. In other words, one would have to gain first-person access to what it's like to be the agent in question. And since this is not an empirical option, in ascribing conscious thought all we have to go on is behaviour. Hence Turing attempts to justify his behavioural test for concluding that a machine can think, and ipso facto has conscious experience, by claiming parity with the evidence at our disposal in the case of other humans. He therefore presents his anticipated objector with the following dichotomy: either be guilty of an inconsistency by accepting the behavioural standard in the case of humans but not computers, or maintain consistency by rejecting it in both cases and embracing solipsism. He concludes that most consistent proponents of the argument from consciousness would choose to abandon their objection and accept his test rather than be forced into the solipsistic position. However, it is worth applying some critical scrutiny to Turing's reasoning at this early juncture. Basically, he seems to be running epistemological issues together with semantical and/or factive questions which should properly be kept separate. One issue is what we mean by saying that a system has a mind, i.e. what essential traits and properties we are thereby ascribing to it; another is how we can know that a given system actually satisfies this description. Turing's behaviouristic methodology has a strong tendency to collapse these two themes, but it is important to note that they are conceptually distinct. In the argument from consciousness, the point is that we mean something substantive, something more than just verbal stimulus-response patterns, when we attribute mentality to a system. In this case the claim is that we mean that the system in question has conscious experience, and that this property is required for any agent to be accurately described in mental terms. So one could potentially hold that consciousness is what is meant by the term, and that:
(1) other human beings are in fact conscious;
(2) the computer is in fact unconscious even though it passes the 2T.
This could be the objective state of affairs that genuinely obtains in the world, and this is completely independent of whether we can know, with certainty, that premises (1) and (2) are actually true. Although epistemological and factive issues are intimately related and together inform our general practices and goals of inquiry, we could nonetheless still be correct in our assertions without being able to prove them. And if one held that consciousness was essential to genuine mentality, then one could seemingly deny that any purely behaviouristic standard was sufficient to test for whether a system had or was a mind. In the case of other human beings, we certainly take behaviour as evidence that they are conscious, but the evidence could in principle overwhelmingly support a false conclusion, in both directions. For example, someone could be in a comatose state where they showed no evidence of being conscious because they could make no bodily responses.
But in itself this would not entail the absence of consciousness: such a person might still be aware of what was going on and perhaps be able to report, retrospectively, on past events once out of their coma. And again, maybe some people really are zombies, or sleepwalkers, and exhibit all the appropriate external signs of consciousness while there is nothing going on inside; such possibilities cannot be ruled out a priori.

Historically, there has been disagreement regarding the proper interpretation of Turing's position regarding the intended import of his test. Some have claimed that the 2T is proposed as an operational definition of intelligence, thinking, etc. (e.g. Block [2], French [3]), and as such it has immediate and fundamental faults. However, in the current discussion I will adopt a weaker reading and interpret the test as purporting to furnish an empirically specifiable criterion for when intelligence can be legitimately ascribed to an artefact. On this reading, the main role of behaviour is inductive or evidential rather than constitutive, and so behavioural tests for mentality provide neither a necessary condition nor a reductive definition. At most, all that is warranted is a positive ascription of intelligence or mentality, if the test is adequate and the system passes. In the case of Turing's 1950 proposal, the adequacy of the test is defended almost entirely in terms of parity of input/output performance with human beings, and hence alleges to employ the same operational standards that we tacitly adopt when ascribing conscious thought processes to our fellow creatures. Thus the issue would appear to hinge upon the degree of evidence a successful 2T performance provides for a positive conclusion in the case of a computational artefact (i.e. for the negation of (2) above), and how this compares to the total body of evidence that we have in support of our belief in the truth of (1). We will only be guilty of an inconsistency or employing a double standard if the two are on a par and we nonetheless dogmatically still insist on the truth of both (1) and (2). But if it turns out to be the case that our evidence for (1) is significantly better than for the negation of (2), then we are not forced into Turing's dichotomy. And with the purely conversational 2T, there is clearly very little parity with the human case. We rely on far more than simply verbal behaviour in arriving at the polite convention that other human beings are conscious. In addition to conversational data, we lean very heavily on their bodily actions involving perception of the spatial environment, navigation, physical interaction, verbal and other modes of response to communally accessible non-verbal stimuli in the shared physical surroundings, etc. So the purely conversational standards of the 2T are not nearly enough to support a claim of operational parity with humans.

In light of the foregoing observations, in order to move towards evidential equivalence in terms of observable behaviour, it is necessary to break out of the closed syntactic bubble of the 2T and scale up to a full linguistic and robotic version of the test. But before exploring this vastly strengthened variation as a potential test for the presence of conscious experience in computational artefacts, in the next section I will briefly examine the notion of consciousness itself, since we first need to attain some clarification regarding the phenomenon in question before we go looking for it in robots.
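Before turning to the notion of consciousness itself, it may help to fix ideas on just how little the purely behavioural 2T actually measures. The following is a minimal Python sketch of the scoring protocol described in section 2; the judge/player interfaces (ask, answer, identify_machine) are hypothetical names introduced purely for illustration, not part of any cited proposal.

    # Minimal sketch of the 2T scoring protocol described in section 2.
    # The judge/player interfaces used here are hypothetical illustrations.
    import random

    def run_2t(judge, human, machine, n_sessions=100, n_questions=10):
        """Return the machine's win rate over repeated imitation games."""
        wins = 0
        for _ in range(n_sessions):
            # Hide the players behind randomly assigned labels A and B.
            labels = ["A", "B"]
            random.shuffle(labels)
            players = dict(zip(labels, [human, machine]))
            transcript = {label: [] for label in players}
            for _ in range(n_questions):
                question = judge.ask()
                for label, player in players.items():
                    transcript[label].append((question, player.answer(question)))
            # The machine wins a session exactly when the judge guesses wrong.
            guess = judge.identify_machine(transcript)  # returns "A" or "B"
            if players[guess] is not machine:
                wins += 1
        return wins / n_sessions

The point to note is that nothing in this loop constrains what goes on inside the respondents; only transcripts of question/answer pairs are ever scored, and this is precisely the feature that the argument from consciousness exploits.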
3 TWO TYPES OF CONSCIOUSNESS

Even in the familiar human case, consciousness is a notoriously elusive phenomenon, and is quite difficult to characterize rigorously. In addition, the word 'consciousness' is not used in a uniform and univocal manner, but rather appears to have different meanings in different contexts of use and across diverse academic communities. Block [4] provides a potentially illuminating philosophical analysis of the distinction and possible relationship between two common uses of the word, arguing that it connotes a number of different concepts and denotes a number of different phenomena. He attempts to clarify the issue by distinguishing two basic and distinct forms of consciousness that are often conflated: Phenomenal or P-consciousness and Access or A-consciousness. P-consciousness is experience: what makes a state phenomenally conscious is that there is something it is like to be in that state. More controversially, Block holds that P-conscious properties, as such, are distinct from any cognitive, intentional or functional property. The notoriously difficult explanatory gap problem in philosophical theorizing concerns P-consciousness, e.g. how is it possible that appeal to a physical brain process could explain what it is like to see something as red?

So we must take care to distinguish this type of purely qualitative, Phenomenal consciousness from Access consciousness, the latter of which Block sees as an information processing correlate of P-consciousness. A-conscious states and structures are those which are directly available for control of speech, reasoning and action. Hence Block's rendition of A-consciousness is similar to Baars' [5] notion that conscious representations are those that are broadcast in a global workspace. The functional/computational approach holds that the level of analysis relevant for understanding the mind is one that allows for multiple realization, so that in principle the same mental states and phenomena can occur in vastly different types of physical systems which implement the same abstract functional or computational structure. As a consequence, a staunch adherent of the functional-computational approach is committed to the view that the same conscious states must be preserved across widely diverse types of physical implementation. In contrast, a biologically oriented approach holds that details of the particular physical/physiological realization matter in the case of conscious states. Block says that if P = A, then the information processing side is right, while if the biological nature of experience is crucial then we can expect that P and A will diverge.

A crude difference between the two in terms of overall characterization is that P-consciousness content is qualitative while A-consciousness content is representational. A-conscious states are necessarily transitive or intentionally directed: they are always states of consciousness of something. However, P-conscious states need not be. On Block's account, the paradigm P-conscious states are the qualia associated with sensations, while the paradigm A-conscious states are propositional attitudes. He maintains that the A-type is nonetheless a genuine form of consciousness, and tends to be what people in cognitive neuroscience have in mind, while philosophers are traditionally more concerned with qualia and P-consciousness, as in the hard problem and the explanatory gap. In turn, this difference in meaning can lead to mutual misunderstanding. In the following discussion I will examine the consequences of the distinction between these two types of consciousness on the prospects of a Turing test for consciousness in artefacts.
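The access notion lends itself naturally to a computational gloss. The toy sketch below assumes a drastically simplified Baars-style global workspace (all class and function names are hypothetical illustrations, not drawn from Baars [5] or Block [4]). It renders A-consciousness as sheer availability: whatever wins the broadcast competition is thereby poised for the control of speech, reasoning and action, while nothing in the mechanism speaks to whether there is anything it is like to be the system.

    # A toy Baars-style global workspace: A-consciousness as availability.
    # All names are hypothetical; nothing here purports to produce qualia.
    from dataclasses import dataclass, field

    @dataclass
    class Representation:
        content: str        # e.g. "red object at left"
        salience: float     # competition weight for broadcast

    @dataclass
    class GlobalWorkspace:
        consumers: list = field(default_factory=list)  # speech, motor, ...

        def broadcast(self, candidates):
            # Only the most salient representation becomes A-conscious:
            # it alone is made globally available to every consumer process.
            winner = max(candidates, key=lambda r: r.salience)
            for consumer in self.consumers:
                consumer(winner)
            return winner

    def speech(r): print(f"say: I see {r.content}")
    def motor(r):  print(f"act: orient toward {r.content}")

    ws = GlobalWorkspace(consumers=[speech, motor])
    ws.broadcast([Representation("red object at left", 0.9),
                  Representation("faint hum", 0.2)])

The design point is exactly Block's: such a mechanism specifies what a representation can do (be accessed for report and control), not what it is like.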
4 THE TOTAL TURING TEST

In order to attain operational parity with the evidence at our command in the case of human beings, a Turing test for even basic linguistic understanding and intelligence, let alone conscious experience, must go far beyond Turing's original proposal. The conversational 2T relies solely on verbal input/output patterns, and these alone are not sufficient to evince a correct interpretation of the manipulated strings. Language is primarily about extra-linguistic entities and states of affairs, and there is nothing in a cunningly designed program for pure syntax manipulation which allows it to break free of this closed loop of symbols and demonstrate a proper correlation between word and object. When it comes to judging human language users in normal contexts, we rely on a far richer domain of evidence. Even when the primary focus of investigation is language proficiency and comprehension, sheer linguistic input/output data is not enough. Turing's original test is not a sufficient condition for concluding that the computer genuinely understands or refers to anything with the strings of symbols it manipulates, precisely because the machine has no relations and interactions with the objects and states of affairs in the real world that its words are supposed to be about. To illustrate the point: if the computer has no eyes, no hands, no mouth, and has never seen or eaten anything, then it is not talking about hamburgers when its program generates the string 'h-a-m-b-u-r-g-e-r' inside a closed loop of syntax. In sharp contrast, our talk of hamburgers is intimately connected to nonverbal transactions with the objects themselves. There are language entry rules, which take us from nonverbal stimuli to appropriate linguistic behaviours. When given the visual stimulus of being presented with a pizza, a taco and a kebab, we can produce the salient utterance "Those particular foodstuffs are not hamburgers". And there are language exit rules, which take us from words to appropriate nonverbal actions. For example, we can follow complex verbal instructions and produce the indicated patterns of behaviour, such as finding the nearest Burger King on the basis of a description of its location in spoken English. Mastery of both of these types of rules is essential for deeming that a human agent understands natural language and is using expressions in a correct and referential manner - and the hapless 2T computer lacks both.2 And when it comes to testing for conscious experience, we again need these basic additional dimensions of perception and action in the real world as an essential precondition.

The fundamental limitations of mere conversational performance naturally suggest a strengthening of the 2T, later named the Total Turing Test (3T) by Harnad [7], wherein the repertoire of relevant behaviour is expanded to include the full range of intelligent human activities. This will require that the computational procedures respond to and control not simply a teletype system for written inputs and outputs, but rather a well crafted artificial body. Thus in the 3T the scrutinized artefact is a robot, and the data to be tested coincide with the full spectrum of behaviours of which human beings are normally capable. In order to succeed, the 3T candidate must be able to do, in the real world of objects and people, everything that intelligent people can do. Thus Harnad expresses a widely held view when he claims that the 3T is "...no less (nor more) exacting a test of having a mind than the means we already use with one another...
[and, echoing Turing] there is no stronger test, short of being the candidate". And, as noted above, the latter state of affairs is not an empirical option: one cannot be the candidate being examined.3

Since the 3T requires the ability to perceive and act in the real world, and since A-conscious states and structures are those which are directly available for control of speech, reasoning and action, it would seem to follow that the successful 3T robot must be A-conscious. For example, in order to pass the test, the robot would have to behave in an appropriate manner in any number of different scenarios such as the following. The robot is handed a silver platter on which a banana, a boiled egg, a teapot and a hamburger are laid out. The robot is asked to pick up the piece of fruit and throw it out the window. Clearly the robot could not perform the indicated action unless it had direct information processing access to the identity of the salient object, its spatial location, the movements of its own mechanical arm, the location and geometrical properties of the window, etc. Such transitive, intentionally directed A-conscious states are plainly required for the robot to pass the test.

But does it follow that the successful 3T robot is P-conscious? It seems not, since on the face of it there appears to be no reason why the robot could not pass the test relying on A-consciousness alone. All that is being tested is its executive control of the cognitive processes enabling it to reason correctly and perform appropriate verbal and bodily actions in response to a myriad of linguistic and perceptual inputs. These abilities are demonstrated solely through its external behaviour, and so far, there seems to be no reason for P-conscious states to be invoked. So even if the 3T provides an adequate standard for intelligence and linguistic understanding in the actual world, the A-conscious robot could conceivably pass the 3T while at the same time there is nothing it is like to be the 3T robot passing the test. We are now bordering on issues involved in demarcating the 'easy' from the 'hard' problems of consciousness, which, if pursued at this point, would be moving in a direction not immediately relevant to the topic at hand. So rather than exploring arguments relating to this deeper theme, I will simply contend that passing the 3T provides a sufficient condition for Block's version of A-consciousness, but not for P-consciousness, since it could presumably be passed by an artefact devoid of qualia. Many critics of Block's basic type of view (including Searle [9] and Burge [10]) argue that if there can be states that are A-conscious but not P-conscious, then they are not genuinely conscious at all. Instead, A-'consciousness' is mere access, and is a form of consciousness only to the extent that it is parasitic upon P-conscious states.

2 Shieber [6] provides a valiant and intriguing rehabilitation/defense of the 2T, but it nonetheless still neglects crucial data, such as mastery of language exit and entry rules. Ultimately Shieber's rehabilitation in terms of interactive proof requires acceptance of the notion that conversational input/response patterns alone are sufficient, which premise I would deny for the reasons given. The program is still operating within a closed syntactic bubble.

3 See Schweizer [8] for an argument to the effect that even the combined linguistic and robotic 3T is still too weak as a definitive behavioural test of artificial intelligence.
So we could potentially have a 3T for A-consciousness, but then the pivotal question arises: is A-consciousness without associated qualitative presentations really a form of consciousness? Again, I will not delve into this deeper and controversial issue in the present discussion, but simply maintain that the successful 3T robot does at least exhibit the type of A-awareness that people in, e.g., cognitive neuroscience tend to call consciousness. But as stated earlier, 'consciousness' is a multifaceted term, and there are also good reasons for not calling mere A-awareness without qualia a full-fledged form of consciousness. For example, someone who was drugged or talking in their sleep could conceivably pass the 2T while still 'unconscious', that is, A-'conscious' but not P-conscious. And a human sleepwalker might even be able to pass the verbal and robotic 3T while 'unconscious' (again A-'conscious' but not P-conscious). What this seems to indicate is that only A-'consciousness' can be positively ascertained by behaviour. But there is an element of definitiveness here, since it seems plausible to say that an agent could not pass the 3T without being A-'conscious', at least in the minimal sense of A-awareness. If the robot were warned 'mind the banana peel' and it was not A-aware of the treacherous object in question on the ground before it, emitting the frequencies of electromagnetic radiation appropriate for 'banana-yellow', then it would not deliberately step over the object, but rather would slip and fall and fail the test.

5 A TOTAL TURING TEST FOR QUALIA

In the remainder of the paper I will not pursue the controversial issue as to whether associated P-consciousness is a necessary condition for concluding that the A-awareness of the successful 3T robot is genuinely a form of consciousness at all. Instead, I will explore an intensification of the standard 3T intended to probe more rigorously for evidential support of the presence of P-conscious states. This Total Turing Test for qualia (Q3T) is a more focused scrutiny of the successful 3T robot which emphasizes rigorous and extended verbal and descriptive probing into the qualitative aspects of the robot's purported internal experiences. So the Q3T involves unremitting questioning and verbal analysis of the robot's qualitative inner experiences, in reaction to a virtually limitless variety of salient external stimuli, such as paintings, sunsets, musical performances, tastes, textures, smells, pleasures and pains, emotive reactions...

Turing suggests a precursor version of this strategy in his 1950 discussion of the argument from consciousness, where he observes that the question of machine consciousness could be addressed by a sustained viva voce, in which the artefact is asked questions directly concerning its aesthetic and other types of qualitative reactions and judgements in response to open-ended questioning by the interrogator. Turing provides a conjectural illustration of the method in the guise of a 'sonnet writing' programme being quizzed by a human judge.

Interrogator: In the first line of your sonnet which reads "Shall I compare thee to a summer's day," would not "a spring day" do as well or better?
Witness: It wouldn't scan.
Interrogator: How about "a winter's day," that would scan all right.
Witness: Yes, but nobody wants to be compared to a winter's day.
Interrogator: Would you say Mr. Pickwick reminded you of Christmas?
Witness: In a way.
Interrogator: Yet Christmas is a winter's day, and I do not think Mr. Pickwick would mind the comparison.
Witness: I don't think you're serious. By a winter's day one means a typical winter's day, rather than a special one like Christmas.

And so on.... The above sample dialogue serves as a good reminder of just how difficult the original 2T really is (and consequently why it hasn't yet been passed). However, this conjectured scenario is still confined to a merely conversational setting of verbal inputs and verbal outputs, and hence falls far short of the behavioural potential of a full 3T edition, as well as the concomitant evidential standards applicable in the human case.

Plebe and Perconti [11] put forward a strengthened adaptation of a 2T-style viva voce, where the pivotal difference is that, in addition to merely linguistic inputs, the computer must now give appropriate and testable conversational reactions to uploaded images. This is an interesting and important augmentation of the original 2T, since the inputs are no longer strictly linguistic, and the test is aimed at evaluating verbally plausible responses to stimuli that, to us at least, have a phenomenal aspect. As an example of the method, Plebe and Perconti supply an excerpt from a hypothetical conversation.

Interrogator: Do you want to look at a picture of me?
Machine: Oh yes, thanks, let's upload that.
<.... uploading>
Machine: Mmmh, I see several people here, who are you?
Interrogator: Try to guess.
Machine: Well, I know you are blond and shy, so I would guess the second from the left.

This appears to be an order of magnitude jump over the purely verbal 2T, and accordingly its standards of satisfaction are even more futuristic. However, in terms of the ultimate goal of providing a test, the passing of which constitutes a sufficient condition for the presence of genuine conscious experience in an artefact, it should be noted that the inputs, at a crucial level of analysis, remain purely syntactic and non-qualitative, in that the uploaded image must take the form of a digital file. Hence this could at most provide evidence of some sort of (proto) A-awareness in terms of salient data extraction and attendant linguistic conversion from a digital source, where the phenomenal aspects produced in humans by the original (pre-digitalized) image are systematically corroborated by the computer's linguistic outputs when responding to the inputted code. Although a major step forward in terms of expanding the input repertoire under investigation, as well as possessing the virtue of being closer to the limits of practicality in the nearer term future, this proposed new qualia 2T still falls short of the full linguistic and robotic Q3T. In particular it tests, in a relatively limited manner, only one sensory modality, and in principle there is no reason why this method of scrutiny should be restricted to the intake of photographic images represented in digital form. Hence a natural progression would be to test a computer on uploaded audio files as well. However, this expanded 2T format is still essentially passive in nature, where the neat and tidy uploaded files are hand fed into the computer by the human interrogator, and the outputs are confined to mere verbal response. Active perception of and reaction to distal objects in the real world arena are critically absent from this test, and so it fails to provide anything like evidential parity with the human case.
And given the fact that the selected non-linguistic inputs take the form of digitalized representations of possible visual (and/or auditory) stimuli, there is still no reason to think that there is anything it is like to be the 2T computer processing the uploaded encoding of an image of, say, a vivid red rose. But elevated to a full 3T arena of shared external stimuli and attendant discussion and analysis, the positive evidence of a victorious computational artefact would become exceptionally strong indeed.

So the extended Q3T is based on a methodology akin to Dennett's [12] 'heterophenomenology': given the robot's presumed success at the standard Total Turing Test, we count this as behavioural evidence sufficient to warrant the application of the intentional stance, wherein the robot is treated as a rational agent harbouring beliefs, desires and various other mental states exhibiting intentionality, and whose actions can be explained and predicted on the basis of the content of these states. Accordingly, the robot's salient sonic emissions are interpreted as natural language utterances asserting various propositions and expressing assorted contents. For the reasons delineated above in section 4, I would argue that this interpretive step and application of the intentional stance to a mere artefact is not evidentially warranted in a limited 2T type of setting, and hence in the latter case the syntactic tokens emitted by the computer cannot be given the semantic value that we would attribute to utterances produced by our fellow human beings. However, following Dennett, the gathering of scientifically rigorous empirical evidence proceeds with neutrality regarding the possible conscious experience of the robot.

In this extended Q3T context, the artefact is queried not about uploaded images as 'qualitative' input data, but rather about its own subjective reactions and attendant introspective states. The robot is not hand fed selected digital files, but rather is presented with the same external stimuli as its human interlocutor, the latter of whom can then interrogate the robot regarding its purported qualitative experiences in response to the shared stimuli. Hence the human and the robot can both observe the same Dürer and the same Dali paintings, listen to the same Vivaldi and Schubert concertos, attend the same wine tasting event, eat the same brand of hamburger... And the robot must discourse on its subjective impressions and reactions like a true aesthete, where this discourse can be calibrated against the interrogator's own qualitative reactions to the shared stimuli. As an example of the method, an excerpt from a hypothetical Q3T interaction might look something like the following.

Interrogator: Isn't the sunset over the Pacific magnificent?
Robot: Yes, the colours are sublime. Let's go to the top of that rise for a better view... Excellent, now we have the full panorama.
Interrogator: What impression does it make on you?
Robot: As the glowing orange orb of the sun begins to touch the calm blue line of the horizon, yet another day is passing. Now the radiant disc begins to sink into the sea and the background crimson intensifies and grows warmer. It gives me a feeling of sadness, but at the same time a glimpse of ineffable beauty... Ouch!! Why the hell did you pinch me?
Interrogator: Just wanted to bring you back down to earth - you were starting to wax a bit too lyrical.
Robot: Thanks very much for your concern.
Could you please pass me the bottle of delicious chilled Chardonnay - I want to savour another glass along with the last rays of the setting sun.
Interrogator: Here you go.
Robot: Arrrgh, that tastes disgusting! - what happened to the wine?
Interrogator: Uhh, I just stirred in a little marmite when you weren't looking - wanted to see how you'd react. This is a Q3T, after all...

Even though a merely A-conscious robot could conceivably pass the verbal and robotic 3T while at the same time there is nothing it is like to be the robot passing the test, in this more focussed version of the 3T the robot would at least have to be able to go on at endless length talking about what it's like. And this talk must be in response to an open ended range of different combinations of sensory inputs, which are shared and monitored by the human judge. Such a test would be both subtle and extremely demanding, and it would be nothing short of remarkable if it could not detect a fake. And presumably a human sleepwalker who could pass a normal 3T as above would nonetheless fail this type of penetrating Q3T (or else wake up in the middle!), and it would be precisely on the grounds of such failure that we would infer that the human was actually asleep and not genuinely P-conscious of what was going on. If sufficiently rigorous and extended, this would provide extremely powerful inductive evidence, and indeed to pass the Q3T the robot would have to attain full evidential parity with the human case, in terms of externally manifested behaviour.

6 BEYOND BEHAVIOUR

So on what grounds might one consistently deny qualitative states and P-consciousness in the case of the successful Q3T robot and yet grant it in the case of a behaviourally indistinguishable human? The two most plausible considerations that suggest themselves are both based on an appeal to essential differences of internal structure, either physical/physiological or functional/computational. Concerning the latter case, many versions of CTM (the computational theory of mind) focus solely on the functional analysis of propositional attitude states such as belief and desire, and simply ignore other aspects of the mind, most notably consciousness and qualitative experience. However others, such as Lycan [13], try to extend the reach of Strong AI and the computational paradigm, and contend that conscious states arise via the implementation of the appropriate computational formalism. Let us denote this extension of the basic CTM framework to cover conscious experience as 'CTM+'. A proponent of a version of CTM+ might hold that qualitative experiences arise in virtue of the particular functional and information processing structure of the human brand of cognitive architecture, and hence that, even though the robot is indistinguishable in terms of input/output profiles, nonetheless its internal processing structure is sufficiently different from ours to block the inference to P-consciousness. So the non-identity of abstract functional or computational structure might be taken to undermine the claim that bare behavioural equivalence provides a sufficient condition for the presence of internal conscious phenomena. At this juncture, the proponent of artificial consciousness might appeal to a version of Van Gulick's [14] reply to 'missing qualia' style objections.
When aimed against functionalism, the missing qualia arguments generally assume a deviant realization of the very same abstract computational procedures underlying human cognition - a realization functionally equivalent to ours in all respects - and the position being supported is that consciousness is to be equated with states of the biological brain, rather than with any arbitrary physical state playing the same functional role as a conscious brain process. For example, in Block's [15] well known 'Chinese Nation' scenario, we are asked to imagine a case where each person in China plays the role of a neuron in the human brain, and for some (rather brief) span of time the entire nation cooperates to implement the same computational procedures as a conscious human brain. The rather compelling 'common sense' conclusion is that even though the entire Chinese population may implement the same computational structure as a conscious brain, there are nonetheless no purely qualitative conscious states in this scenario outside the conscious Chinese individuals involved. And this is then taken as a counterexample to purely functionalist theories of consciousness.

Van Gulick's counter-strategy is to claim that the missing qualia argument begs the question at issue. How do we know, a priori, that the very same functional role could be played by arbitrary physical states that were unconscious? The anti-functionalist seems to beg the question by assuming that such deviant realizations are possible in the first place. At this point, the burden of proof may then rest on the functionalist to try and establish that there are in fact functional roles in the human cognitive system that could only be filled by conscious processing states. Indeed, this strategy seems more interesting than the more dogmatic functionalist line that isomorphism of abstract functional role alone guarantees the consciousness of any physical state that happens to implement it. So to pursue this strategy, Van Gulick examines the psychological roles played by phenomenal states in humans and identifies various cognitive abilities which seem to require both conscious and self-conscious awareness, e.g. abilities which involve reflexive and meta-cognitive levels of representation. These include things like planning a future course of action, control of plan execution, and acquiring new non-habitual task behaviours. These and related features of human psychological organization seem to require a conscious self-model. In this manner, conscious experience appears to play a unique functional role in the integration and dissemination of information throughout the brain.

In turn, the proponent of artificial consciousness might plausibly claim that the successful Q3T robot must possess analogous processing structures in order to evince the equivalent behavioural profiles when passing the test. So even though the processing structure might not be identical to that of human cognitive architecture, it must nonetheless have the same basic cognitive abilities as humans in order to pass the Q3T, and if these processing roles in humans require phenomenal states, then the robot must enjoy them as well. However, it is relevant to note that Van Gulick's analysis seems to blur Block's distinction between P-consciousness and A-consciousness, and an obvious rejoinder at this point would be that all of the above processing roles in both humans and robots could in principle take place with only the latter and not the former.
Even meta-cognitive and 'conscious' self-models could be accounted for merely in terms of A-awareness. And this brings us back to the same claim as in the standard 3T scenario - that even the success of the Q3T robot could conceivably be explained without invoking P-consciousness per se, and so it still fails as a sufficient condition for attributing full blown qualia to computational artefacts.

7 MATTER AND CONSCIOUSNESS

Hence functional/computational considerations seem too weak to ground a positive conclusion, and this naturally leads to the question of the physical/physiological status of qualia. If even meta-cognitive and 'conscious' self-models in humans could in principle be accounted for merely in terms of A-awareness, then how and why do humans have purely qualitative experience? One possible answer could be that P-conscious states are essentially physically based phenomena, and hence result from or supervene upon the particular structure and causal powers of the actual central nervous system. And this perspective is reinforced by what I would argue (on the following independent grounds) is the fundamental inability of abstract functional role to provide an adequate theoretical foundation for qualitative experience.

Unlike computational formalisms, conscious states are inherently non-abstract; they are actual, occurrent phenomena extended in physical time. Given multiple realizability as a hallmark of the theory, CTM+ is committed to the result that qualitatively identical conscious states are maintained across widely different kinds of physical realization. And this is tantamount to the claim that an actual, substantive and invariant qualitative phenomenon is preserved over radically diverse real systems, while at the same time, no internal physical regularities need to be preserved. But then there is no actual, occurrent factor which could serve as the causal substrate or supervenience base for the substantive and invariant phenomenon of internal conscious experience. The advocate of CTM+ cannot rejoin that it is formal role which supplies this basis, since formal role is abstract, and such abstract features can only be instantiated via actual properties; they do not have the power to produce them. The only (possible) non-abstract effects that instantiated formalisms are required to preserve must be specified in terms of their input/output profiles, and thus internal experiences, qua actual events, are in principle omitted. So (as I have also argued elsewhere: see Schweizer [16, 17]) it would appear that the non-abstract, occurrent nature of conscious states entails that they must depend upon intrinsic properties of the brain as a proper subsystem of the actual world (on the crucial assumption of physicalism as one's basic metaphysical stance - obviously other choices, such as some variety of dualism, are theoretical alternatives). It is worth noting that from this it does not follow that other types of physical subsystem could not share the relevant intrinsic properties and hence also support conscious states. It only follows that they would have this power in virtue of their intrinsic physical properties, and not in virtue of being interpretable as implementing the same abstract computational procedure.

8 CONCLUSION

We know by direct first person access that the human central nervous system is capable of sustaining the rich and varied field of qualitative presentations associated with our normal cognitive activities.
And it certainly seems as if these presentations play a vital role in our mental lives. However, given the above critical observation regarding Van Gulick's position, viz., that all of the salient processing roles in both humans and robots could in principle take place strictly in terms of A-awareness without P-consciousness, it seems that P-conscious states are not actually necessary for explaining observable human behaviour and the attendant cognitive processes. In this respect, qualia are rendered functionally epiphenomenal, since purely qualitative states per se are not strictly required for a functional/computational account of human mentality. However, this is not to say that they are physically epiphenomenal as well, since it does not thereby follow that this aspect of physical/physiological structure does not in fact play a causal role in the particular human implementation of this functional cognitive architecture. Hence it becomes a purely contingent truth that humans have associated P-conscious experience. And this should not be too surprising a conclusion, on the view that the human mind is the product of a long course of exceedingly happenstance biological evolution. On such a view, perhaps natural selection has simply recruited this available biological resource to play vital functional roles, which in principle could have instead been played by P-unconscious but A-aware states in a different type of realization. And in this case, P-conscious states in humans are thus a form of 'phenomenal overkill', and nature has simply been an opportunist in exploiting biological vehicles that happened to be on hand, to play a role that could have been played by a more streamlined and less rich type of state, but where a 'cheaper' alternative was simply not available at the critical point in time. Evolution and natural selection are severely curtailed in this respect, since the basic ingredients and materials available to work with are a result of random mutation on existing precursor structures present in the organism(s) in question. And perhaps human computer scientists and engineers, not limited by what happens to get thrown up by random genetic mutations, have designed the successful Q3T robot utilizing a cheaper, artificial alternative to the overly rich biological structures sustained in humans. So in the case of the robot, it would remain an open question whether or not the physical substrate underlying the artefact's cognitive processes had the requisite causal powers or intrinsic natural characteristics to sustain P-conscious states. Mere behavioural evidence on its own would not be sufficient to adjudicate, and an independent standard or criterion would be required.4

So if P-conscious states are thought to be essentially physically based, for the reasons given above, and if the robot's Q3T success could in principle be explained through appeal to mere A-aware states on their own, then it follows that the non-identity of the artefact's physical structure would allow one to consistently extend Turing's polite convention to one's conspecifics and yet withhold it from the Q3T robot.
4 This highlights one of the intrinsic limitations of the Turing test approach to such questions, since the test is designed as an imitation game, and humans are the ersatz target. Hence the Q3T robot is designed to behave as if it had subjective, qualitative inner experiences indistinguishable from those of a human. However, if human qualia are the products of our particular internal structure (either physical-physiological or functional-computational), and if the robot is significantly different in this respect, then the possibility is open that the robot might be P-conscious and yet fail the test, simply because its resulting qualitative experiences are significantly different from ours. And indeed, a possibility in the reverse direction is that the robot might even pass the test and sustain an entirely different phenomenology, but where this internal difference is not manifested in its external behaviour.

REFERENCES
[1] A. Turing, 'Computing machinery and intelligence', Mind 59: 433-460, (1950).
[2] N. Block, 'Psychologism and behaviorism', Philosophical Review 90: 5-43, (1981).
[3] R. French, 'The Turing test: the first 50 years', Trends in Cognitive Sciences 4: 115-122, (2000).
[4] N. Block, 'On a confusion about a function of consciousness', Behavioral and Brain Sciences 18: 227-247, (1995).
[5] B. Baars, A Cognitive Theory of Consciousness, Cambridge University Press, (1988).
[6] S. Shieber, 'The Turing test as interactive proof', Nous 41: 33-60, (2007).
[7] S. Harnad, 'Other bodies, other minds: a machine incarnation of an old philosophical problem', Minds and Machines 1: 43-54, (1991).
[8] P. Schweizer, 'The externalist foundations of a truly total Turing test', Minds and Machines, DOI 10.1007/s11023-012-9272-4, (2012).
[9] J. Searle, The Rediscovery of the Mind, MIT Press, (1992).
[10] T. Burge, 'Two kinds of consciousness', in N. Block et al. (eds), The Nature of Consciousness: Philosophical Debates, MIT Press, (1997).
[11] A. Plebe and P. Perconti, 'Qualia Turing test: Designing a test for the phenomenal mind', in Proceedings of the First International Symposium Towards a Comprehensive Intelligence Test (TCIT), Reconsidering the Turing Test for the 21st Century, 16-19, (2010).
[12] D. Dennett, Consciousness Explained, Back Bay Books, (1992).
[13] W. G. Lycan, Consciousness, MIT Press, (1987).
[14] R. Van Gulick, 'Understanding the phenomenal mind: Are we all just armadillos?', in M. Davies and G. Humphreys (eds), Consciousness: Psychological and Philosophical Essays, Blackwell, (1993).
[15] N. Block, 'Troubles with functionalism', in C. W. Savage (ed), Perception and Cognition, University of Minnesota Press, (1978).
[16] P. Schweizer, Minds and Machines 12: 143-144, (2002).
[17] P. Schweizer, 'Physical instantiation and the propositional attitudes', Cognitive Computation, DOI 10.1007/s12559-012-9134-7, (2012).

Jazz and Machine Consciousness: Towards a New Turing Test

Antonio Chella1 and Riccardo Manzotti2
1 University of Palermo, Italy, email: antonio.chella@unipa.it
2 IULM University, Milan, Italy, email: riccardo.manzotti@iulm.it

Abstract. A form of Turing test is proposed, based on the capability of an agent to produce jazz improvisations at the same level as an expert jazz musician.

1 INTRODUCTION

The 'Essay in the Style of Douglas Hofstadter' [19], related to the system EMI by David Cope [11] [12], evokes a novel and different perspective on the Turing test. The main focus of the test should be creativity instead of linguistic capabilities: can a computer be so creative that its creations are indistinguishable from those of a human being? According to Sternberg [36], creativity is the ability to produce something that is new and appropriate. The result of a creative process is not reducible to some sort of deterministic reasoning; no specific chain of processing steps seems to account for creative activity, which appears rather as an emergent, holistic result [25].
Therefore, a creative agent should be able to generate novel artifacts not by following preprogrammed instructions, but by means of a real creative act. The problem of creativity has been widely debated in the field of automatic music composition. The previously cited EMI by David Cope, subject of the Hofstadter essay, produces impressive results: even for an experienced listener it is difficult to distinguish musical compositions created by such programs from those created by a human composer. There is no doubt that these systems capture some main aspects of the creative process, at least in music. However, one may wonder whether an agent can actually be creative without being conscious. In this regard, Damasio [14] suggests a close connection between consciousness and creativity. Cope himself in his recent book [13] discusses the relationship between consciousness and creativity. Although he does not take a clear position on this matter, he seems to favor the view according to which consciousness is not necessary for the creative process. In fact, Cope asks whether a creative agent needs to be aware of creating something, and whether it needs to experience the results of its own creations.

The argument from consciousness is typically adopted [3] to support the thesis that an artificial agent can never be conscious and therefore can never be really creative. But recently, there has been a growing interest in machine consciousness [8] [9], i.e., the study of consciousness through the design and implementation of conscious artificial systems. This interest is motivated by the belief that this new approach, based on the construction of conscious artifacts, can shed new light on the many critical aspects that affect the mainstream studies of consciousness in philosophy and neuroscience. Creativity is just one of these critical issues.

The relationship between consciousness and creativity is difficult and complex. On the one side, some authors claim the need for awareness of the creative act. On the other side, it is suspected that many cognitive processes that are necessary for the creative act may happen in the absence of consciousness. However, it is undeniable that consciousness is closely linked with the broader, less predictable and less automatic forms of cognition, like creativity. In addition, we could distinguish between the mere production of new combinations and the aware creation of new content: if the wind were to create (like monkeys on a keyboard) a melody indistinguishable from the "Va Pensiero" by Giuseppe Verdi, would it be a creative act? Many authors would debate this question [15].

In the following, we discuss some of the main features required for a conscious agent, such as embodiment, situatedness, emotions and the capability to have conscious experience. These features will be discussed with reference to musical expression, and in particular to a specific form of creative musical expression, namely jazz improvisation. Musical expression seems to be the form of artistic expression that, more than most others, is able to immediately produce conscious experience without filters. Moreover, differently from olfactory or tactile experiences, musical experience is a kind of structured experience. According to Johnson-Laird [20], jazz improvisation is a specific form of expertise of great interest for the study of the mind.
Furthermore, jazz is a particularly interesting case study in relation to creativity. Creativity in a jazz musician is very different from typical models of creativity. In fact, the creative process is often studied with regard to the production of new abstract ideas, as for example the creation of a new mathematical theory after weeks of great concentration. On the contrary, jazz improvisation is a form of immediate and continuous lively creation, closely connected with the external world made up of musical instruments, people, moving bodies, environments, audience and the other musicians.

2 CREATIVITY

There are at least two aspects of creativity that are worth distinguishing from the beginning: syntactic and semantic creativity. The first is the capability to recombine a set of symbols according to various styles. In this sense, if we have enough patience and time, a random generator will create all the books of the literary world (but without understanding their meaning). The second aspect is the capability to generate new meaning that will then be dressed in appropriate symbols. These two aspects correspond, to a good approximation, to the etymological difference between the terms intelligence and intuition. Intelligence is often defined as the ability to find novel connections between different entities, but intuition should be able to do something more, i.e., to bring in something that was previously unavailable. In short, the syntactic manipulation of symbols may occur without consciousness, but creativity does not seem to be possible without consciousness.

Machine consciousness is not only a technological challenge, but a novel field of research with both scientific and technological issues, such as the relationship between information and meaning, the ability for an autonomous agent to choose its own goals and objectives, the sense of self for a robot, the capability to integrate information into a coherent whole, and the nature of experience. Among these issues there is the capability, for an artificial agent, to create and to experience its own creations.

A common objection to machine consciousness emphasizes the fact that biological entities may have unique characteristics that cannot be reproduced in artifacts. If this objection is true, machine consciousness may not be feasible. However, this contrast between biological and artificial entities has often been over exaggerated, especially in relation to the problems of consciousness. So far, nobody has been able to satisfactorily prove that biological entities have characteristics that cannot be reproduced in artificial entities with respect to consciousness. In fact, at a meeting on machine consciousness in 2001 at Cold Spring Harbor Laboratories, the conclusion from Koch [23] was that no known natural law prevents the existence of subjective experience in artifacts. On the other hand, living beings are subject to the laws of physics, and yet are conscious, able to be creative and to have experience.

The contrast between classic AI (focused on the manipulation of syntactic symbols) and machine consciousness (open to considering the semantic and phenomenal aspects of the mind) holds in all its strength in the case of creativity. Is artistic improvisation - jazz improvisation in particular - a conscious process? This is an open question.
The musicologist Gunther Schuller [33] emphasizes the fact that jazz improvisation affects consciousness at all levels, from the minimal to the highest one. It is a very particular kind of creative process. Jazz improvisation has peculiar features that set it apart from traditional classical improvisation [29]: as part of Western classical music, improvisation is a kind of real time composition with the same rules and patterns as classical composition. On the contrary, jazz improvisation is based on a specific set of patterns and elements. The melody, the rhythm (the swing), the chord progressions are some of the issues that need to be analyzed and studied with stylistic and aesthetic criteria different from those of Western classical music [10].

3 EMBODIMENT

Embodiment does not simply mean that an agent must have a physical body, but also, and above all, that different cognitive functions are carried out by means of aspects of the body. The aspect of corporeality seems to be fundamental to musical performance, and not only to jazz improvisation. In this regard, Sundberg & Verrillo [38] analyzed the complex feedback that the body of a player receives during a live performance. In fact, auditory feedback is not sufficient to explain the characteristics of a performance. The movement of the hands on the instrument, the touch and the force needed to play the instrument, the vibrations of the instrument propagated through the fingers of the player, the vibration of the air perceived by the player's body - these are all examples of feedback guiding the musician during a performance. The player receives at least two types of bodily feedback: through the receptors of the skin and through the receptors of the tendons and muscles. Todd [39] posited a third feedback channel through the vestibular apparatus. Making music is essentially a body activity [26].

Embodiment is fundamental to jazz improvisation: can an agent without a body, such as software like EMI that runs on a mainframe, be able to improvise? Apparently not, because it would miss the bodily feedback channels described above. And, in fact, the results obtained by EMI in its Improvisation version are modest and based on ad hoc solutions. The same problem arises for consciousness: can software that runs on a mainframe be conscious? It does not seem that embodiment is a sufficient condition for consciousness, but it may be a necessary condition. Basically, a cognitive entity must be embodied in a physical entity. However, it is necessary to reflect deeply on the concept of embodiment. Trivially, a cognitive agent cannot exist without a body; even AI expert systems are embodied in a computer, which is a physical entity. On the other hand, having a body is not enough for an agent to be non-trivially embodied: the Honda ASIMO robot (http://asimo.honda.com), considered the state of the art of today's robotic technology, is an impressive humanoid robot, but its performances are essentially based on a standard controller in which the behaviors are almost completely and carefully defined in advance by its designers. In addition, biology gives us many examples of animals, such as cockroaches, whose morphology is complex and allows them to survive without sophisticated cognitive abilities. The notion of embodiment is therefore much deeper and more complex than we usually think. It is not enough that an agent have a body equipped with sophisticated sensors and actuators; other conditions must be met.
The concept of embodiment requires the ability to appreciate and process the different kinds of feedback from the body, just like an artist during a live musical performance.

4 SITUATEDNESS

In addition to having a body, an agent is part of an environment, i.e., it is situated. An artist, during a jam session, is typically situated in a group in which she has a continuous exchange of information. The artist receives and provides continuous feedback to and from the other players of the group, and sometimes even the audience, in the case of live performances. The classical view, often theorized in textbooks of jazz improvisation [10], suggests that during a session the player follows her own musical path, largely made up of a suitable sequence of previously learned patterns. This is a partial view of effective jazz improvisation. Undoubtedly, the musician has a repertoire of musical patterns, but she is also able to deviate from this path depending on the feedback she receives from other musicians or the audience, for example from suggestions from the rhythm section or signals of appreciation from the listeners.

Cognitive scientists (see, e.g., [20]) typically model jazz improvisation processes by means of Chomsky formal grammars. This kind of model appears problematic because it does not explain the complexity of the interaction between the player, the rest of the group and the audience. A more accurate model should take into account the main results from behavior-based robotics [5]. According to this approach, a musician may use a repertoire of behaviors that are activated according to the input she receives and according to an appropriate priority based on her musical sensibility; a minimal sketch of such an arbitration scheme is given below. Interesting experiments in this direction have recently been described in the literature. Roboser [27] is an autonomous robot that moves autonomously in an environment and generates sound events in real time according to its internal state and to the sensory input it receives from the environment. EyesWeb [6] is a complex system that analyzes body movements and gestures, with particular reference to emotional connotations, in order to generate sound and music in real time accordingly, and also to suitably control robots. Continuator [28] is a system based on a methodology similar to EMI but, differently from it, able to learn and communicate in real time with the musician. For example, the musician plays musical phrases and the system learns the style of the musician and continues and completes the phrases by interacting with the musician.

However, the concept of a situated agent, like the concept of embodiment, is a complex and articulated one. An effective situated agent should develop in tight integration with its surrounding environment so that, like a living being, its body structure and cognition are the result of a continuous and constant interaction with the external environment. A true situated agent is an agent that absorbs from its surroundings, changes according to it and, in turn, changes the environment itself. A similar process occurs in the course of jazz improvisation: the musicians improvise on the basis of their musical and life experiences accumulated and absorbed over the years. The improvisation is then based on present interaction and also, in the case of a jazz group, on past interactions with the rest of the group.
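The behavior-based picture just described can be made concrete. The following toy Python sketch shows priority-based arbitration over a repertoire of improvising behaviors; the behavior names, trigger predicates and priority rule are all hypothetical illustrations, not drawn from [5] or from any of the systems cited above.

    # Toy priority-based arbitration over a repertoire of behaviors,
    # echoing arbitration schemes from behavior-based robotics.
    # All behaviors and triggers are hypothetical illustrations.

    def pattern_behavior(ctx):  return "play next phrase from learned patterns"
    def respond_behavior(ctx):  return "imitate and vary the rhythm section's figure"
    def crowd_behavior(ctx):    return "build intensity for the audience"

    # (priority, trigger predicate, behavior) - higher priority wins.
    REPERTOIRE = [
        (1, lambda ctx: True,                                 pattern_behavior),
        (2, lambda ctx: ctx.get("band_cue", False),           respond_behavior),
        (3, lambda ctx: ctx.get("audience_energy", 0) > 0.8,  crowd_behavior),
    ]

    def improvise_step(ctx):
        # Select the highest-priority behavior whose trigger fires; the
        # default pattern-following behavior always fires as a fallback.
        active = [(p, b) for p, t, b in REPERTOIRE if t(ctx)]
        _, behavior = max(active, key=lambda pb: pb[0])
        return behavior(ctx)

    print(improvise_step({"band_cue": True, "audience_energy": 0.9}))

The design choice worth noting is that the learned pattern repertoire is only the lowest-priority default, while feedback from the band and the audience can pre-empt it at any step, which is precisely the corrective the authors propose to the grammar-based view.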
Improvisation is modified on the basis of suggestions received from other musicians and the audience, and in turn changes the performances of the other group musicians. A good jazz improvisation is an activity that requires a deeply situated agent.

5 EMOTIONS

Many scholars consider emotions as a basic element for consciousness. Damasio [14] believes that emotions form a sort of proto-consciousness upon which higher forms of consciousness are developed. In turn, consciousness, according to this frame of reference, is intimately related with creativity. The relationships between emotions and music have been widely analyzed in the literature, suggesting a variety of computational models describing the main mechanisms underlying the evocation of emotions while listening to music [21] [22].
In the case of a live performance as a jazz improvisation, the link between music and emotions is a deep one: during a Finally, the more complex problem for consciousness is: how can a physical system like an agent able to improvise jazz to produce something similar to our subjective experience? During a jam session, the sound waves generated by the musical instruments strike our ears and we experience a sax solo accompanied by bass, drums and piano. At sunset, our retinas are struck by rays of light and we have the experience of a symphony of colors. We swallow molecules of various kinds and, therefore, we feel the taste of a delicious wine. It is well known that Galileo Galilei suggested that smells, tastes, colors and sounds do not exist outside the body of a conscious subject (the living animal). Thus experience would be created by the subject in some unknown way. AISB/IACAP 2012 Symposium: Revisiting Turing and his Test: Comprehensiveness, Qualia, and the Real World 51 A possible hypothesis concerns the separation between the domain of experience, namely, the subjective content, and the domain of objective physical events. The claim is that physical reality can be adequately described only by the quantitative point of view in a third person perspective while ignoring any qualitative aspects. After all, in a physics textbook there are many mathematical equations that describe a purely quantitative reality. There is room for quality content, feelings or emotions. Explaining these qualitative contents is the hard problem of consciousness [7]. Yet scholars as Strawson [37] questioned the validity of such a distinction as well as the degree of real understanding of the nature of the physical world. Whether the mental world is a special construct generated by some feature of the nervous systems of mammals, is still an open question. It is fair to stress that there is neither empirical evidence nor theoretical arguments supporting such a view. In the lack of a better theory, we could also take into consideration the idea inspired by externalism [31] [32] according to which the physical world comprehends also those features that we usually attribute to the mental domain. A physicalist must be held that if something is real, and we assume consciousness is real, it has to be physical. Hence, in principle, a device can envisage it. In the case of artificial agents for jazz improvisation, how is it possible to overcome the distinction between function and experience? Such a typical agent is made up by a set of interconnected modules, each operating in a certain way. How the operation of some or all of the interconnected modules should generate conscious experience? However, the same question could be transferred to the activity of neurons. Each neuron, taken alone, does not work differently from a software module or a chip. But it could remains a possibility: it is not the problem of the physical world, but of our theories of the physical world. Artificial agents are part of the same physical world that produce consciousness in human subjects, so they may exploit the same properties and characteristics that are relevant for conscious experience. In this regard, Tononi [41] proposed a theory supported by results from neuroscience, according to which the degree of conscious experience is related to the amount of integrated information. 
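To convey the informational flavor of "integration" (and only that), here is a toy computation of total correlation: the amount by which the joint state of a system carries less uncertainty than its parts taken separately. This simplified proxy is our illustrative choice; Tononi's actual measure Phi involves causal perturbations and a minimization over bipartitions of the system [1] [2]:

```python
# Toy illustration: "the whole carries more information than its parts."
# Total correlation TC = H(X1) + H(X2) - H(X1, X2) for a two-unit system.
from math import log2

def H(dist):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    out = {}
    for state, p in joint.items():
        out[state[axis]] = out.get(state[axis], 0.0) + p
    return out

# Two binary units that tend to agree (an "integrated" pair).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
tc = H(marginal(joint, 0)) + H(marginal(joint, 1)) - H(joint)
print(f"total correlation = {tc:.3f} bits")  # positive: the units cohere

# For independent units the same quantity vanishes.
indep = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
print(H(marginal(indep, 0)) + H(marginal(indep, 1)) - H(indep))  # 0.0
```

For the correlated pair the quantity is positive, while for independent units it vanishes: informally, integration is what the whole "knows" beyond its parts.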
According to this framework, the primary task of the brain is to integrate information and, notably, this process is the same whether it takes place in humans or in artifacts such as agents for jazz improvisation. According to this theory, conscious experience has two main characteristics. On the one hand, conscious experience is differentiated, because the potential set of different conscious states is huge. On the other hand, conscious experience is integrated: in fact, a conscious state is experienced as a single entity. Therefore, the substrate of conscious experience must be an integrated entity able to differentiate among a big set of different states, and whose informational state is greater than the sum of the informational states of its component sub-entities [1] [2].

Building on this theory, Koch and Tononi [24] propose a potential new Turing test based on the integration of information: artificial systems should be able to mimic the human being not in language skills (as in the classic version of the Turing test), but rather in the ability to integrate information from different sources. Therefore, an artificial agent aware of its jazz improvisation should be able to integrate over time the information generated by its own played instrument and by the instruments of its band, as well as information from the body, i.e., the feedback from skin receptors, the receptors of the tendons and muscles and, possibly, the vestibular apparatus. Furthermore, it should also be able to integrate information related to emotions. Some of the early studies based on suitable neural networks for music generation [40] are promising steps towards implementing an information-integration agent.

However, we must emphasize that the implementation of a true information-integration system is a real technological challenge. In fact, the typical engineering technique for building an artifact is essentially based on the principle of divide et impera, which involves designing a complex system through its decomposition into smaller, easier subsystems. Each subsystem then communicates with the other subsystems through well-defined interfaces, so that the interaction between the subsystems happens in a very controlled way. Tononi's theory instead requires maximum interaction between the subsystems in order to allow effective integration. Therefore, new techniques are required to design effective conscious agents.

Information integration theory has raised heated debates in the scientific community. It could represent a first step towards a theoretically well-founded approach to machine consciousness. The idea of being able to find the consciousness equations which, like Maxwell's equations in physics, are able to explain consciousness in living beings and in artifacts, would be a kind of ultimate goal for scholars of consciousness.

7 CONCLUSIONS

The list of problems related to machine consciousness that have not been properly treated is long: the sensorimotor experience in improvisation, the sense of time in musical performance, the problem of the meaning of a musical phrase, the generation of musical mental images, and so on. These are all issues of great importance for the creation of a conscious agent for jazz improvisation, although some of them may overlap in part with the arguments discussed above. Although classical AI has achieved impressive results, and Cope's program EMI is a great example, so far these issues have been addressed only partially.
In this article we have discussed the main issues to be addressed in order to design and build an artificial agent that can perform a jazz improvisation. Physicality, situatedness, the capacity for emotions, and some form of experience are all problems inherent in the problem of consciousness. A new Turing test might be based on the ability to distinguish a jazz improvisation produced by an artificial agent, perhaps one able to integrate information in Tononi's sense, from an improvisation produced by an expert jazz musician. As should be clear, this is a very broad subject that significantly extends the traditional mind-brain problem. Machine consciousness is, at the same time, a theoretical and a technological challenge that forces us to deal with old problems through new and innovative approaches. It is possible, and we hope, that research on artificial consciousness will push researchers to re-examine many threads left hanging in Artificial Intelligence and cognitive science. "Could consciousness be a theoretical time bomb, ticking away in the belly of AI? Who can say?" (Haugeland [18], p. 247).

REFERENCES

[1] D. Balduzzi and G. Tononi, 'Integrated information in discrete dynamical systems: Motivation and theoretical framework', PLoS Computational Biology, 4, e1000091, (2008).
[2] D. Balduzzi and G. Tononi, 'Qualia: The geometry of integrated information', PLoS Computational Biology, 5, e1000462, (2009).
[3] M. Boden, The Creative Mind: Myths and Mechanisms, Second Edition, Routledge, London, 2004.
[4] C. Breazeal, Designing Sociable Robots, MIT Press, Cambridge, MA, 2002.
[5] R. Brooks, Cambrian Intelligence: The Early History of the New AI, MIT Press, Cambridge, MA, 1999.
[6] A. Camurri, S. Hashimoto, M. Ricchetti, A. Ricci, K. Suzuki, R. Trocca and G. Volpe, 'EyesWeb: Toward Gesture and Affect Recognition in Interactive Dance and Music Systems', Computer Music Journal, 24, 57 – 69, (2000).
[7] D. Chalmers, The Conscious Mind: In Search of a Fundamental Theory, Oxford University Press, Oxford, 1996.
[8] A. Chella and R. Manzotti (eds.), Artificial Consciousness, Imprint Academic, Exeter, UK, 2007.
[9] A. Chella and R. Manzotti, 'Machine Consciousness: A Manifesto for Robotics', International Journal of Machine Consciousness, 1, 33 – 51, (2009).
[10] J. Coker, Improvising Jazz, Simon & Schuster, New York, NY, 1964.
[11] D. Cope, 'Computer Modeling of Musical Intelligence in EMI', Computer Music Journal, 16, 69 – 83, (1992).
[12] D. Cope, Virtual Music, MIT Press, Cambridge, MA, 2001.
[13] D. Cope, Computer Models of Musical Creativity, MIT Press, Cambridge, MA, 2005.
[14] A. Damasio, The Feeling of What Happens: Body and Emotion in the Making of Consciousness, Houghton Mifflin Harcourt, 1999.
[15] A. Danto, 'The Transfiguration of the Commonplace', The Journal of Aesthetics and Art Criticism, 33, 139 – 148, (1974).
[16] J.-M. Fellous and M. A. Arbib, Who Needs Emotions?: The Brain Meets the Robot, Oxford University Press, Oxford, UK, 2005.
[17] A. Gabrielsson and P.N. Juslin, 'Emotional Expression in Music Performance: Between the Performer's Intention and the Listener's Experience', Psychology of Music, 24, 68 – 91, (1996).
[18] J. Haugeland, Artificial Intelligence: The Very Idea, MIT Press, Bradford Books, Cambridge, MA, 1985.
[19] D. Hofstadter, 'Essay in the Style of Douglas Hofstadter', AI Magazine, Fall, 82 – 88, (2009).
[20] P.N. Johnson-Laird, 'Jazz Improvisation: A Theory at the Computational Level', in: Representing Musical Structure, P. Howell, R. West and I. Cross (eds.), Academic Press, London, 1991.
[21] P.N. Juslin and J.A. Sloboda (eds.), Handbook of Music and Emotion: Theory, Research, Applications, Oxford University Press, Oxford, UK, 2010.
[22] P.N. Juslin and D. Västfjäll, 'Emotional responses to music: The need to consider underlying mechanisms', Behavioral and Brain Sciences, 31, 559 – 621, (2008).
[23] C. Koch, 'Final Report of the Workshop Can a Machine be Conscious', The Banbury Center, Cold Spring Harbor Laboratory, http://theswartzfoundation.com/abstracts/2001_summary.asp (last access 12/09/2011).
[24] C. Koch and G. Tononi, 'Can Machines Be Conscious?', IEEE Spectrum, June, 47 – 51, (2008).
[25] A. Koestler, The Act of Creation, Hutchinson, London, 1964.
[26] J.W. Krueger, 'Enacting Musical Experience', Journal of Consciousness Studies, 16, 98 – 123, (2009).
[27] J. Manzolli and P.F.M.J. Verschure, 'Roboser: A Real-World Composition System', Computer Music Journal, 29, 55 – 74, (2005).
[28] F. Pachet, 'Beyond the Cybernetic Jam Fantasy: The Continuator', IEEE Computer Graphics and Applications, January/February, 2 – 6, (2004).
[29] J. Pressing, 'Improvisation: Methods and Models', in: Generative Processes in Music: The Psychology of Performance, Improvisation, and Composition, J. Sloboda (ed.), Oxford University Press, Oxford, UK, 1988.
[30] P. Robbins and M. Aydede (eds.), The Cambridge Handbook of Situated Cognition, Cambridge University Press, Cambridge, 2009.
[31] T. Rockwell, Neither Brain nor Ghost, MIT Press, Cambridge, MA, 2005.
[32] M. Rowlands, Externalism: Putting Mind and World Back Together Again, McGill-Queen's University Press, Montreal and Kingston, 2003.
[33] G. Schuller, 'Foreword', in: Improvising Jazz, J. Coker, Simon & Schuster, New York, NY, 1964.
[34] J.R. Searle, 'Minds, brains, and programs', Behavioral and Brain Sciences, 3, 417 – 457, (1980).
[35] A. Seth, 'The Strength of Weak Artificial Consciousness', International Journal of Machine Consciousness, 1, 71 – 82, (2009).
[36] R.J. Sternberg (ed.), Handbook of Creativity, Cambridge University Press, Cambridge, 1999.
[37] G. Strawson, 'Does physicalism entail panpsychism?', Journal of Consciousness Studies, 13, 3 – 31, (2006).
[38] J. Sundberg and R.T. Verrillo, 'Somatosensory Feedback in Musical Performance' (Editorial), Music Perception: An Interdisciplinary Journal, 9, 277 – 280, (1992).
[39] N.P. McAngus Todd, 'Vestibular Feedback in Musical Performance: Response to «Somatosensory Feedback in Musical Performance»', Music Perception: An Interdisciplinary Journal, 10, 379 – 382, (1993).
[40] P.M. Todd and D. Gareth Loy (eds.), Music and Connectionism, MIT Press, Cambridge, MA, 1991.
[41] G. Tononi, 'An Information Integration Theory of Consciousness', BMC Neuroscience, 5, (2004).

Taking Turing Seriously (But Not Literally)

William York (Indiana University, United States, wwyork@indiana.edu) and Jerry Swan (University of Stirling, Scotland, jsw@cs.stir.ac.uk)

Abstract. Results from present-day instantiations of the Turing test, most notably the annual Loebner Prize competition, have fueled the perception that the test is on the verge of being passed. With this perception comes the misleading implication that computers are nearing human-level intelligence. As currently instantiated, the test encourages an adversarial relationship between contestant and judge.
We suggest that the underlying purpose of Turing's test would be better served if the prevailing focus on trickery and deception were replaced by an emphasis on transparency and collaborative interaction. We discuss particular examples from the family of Fluid Concepts architectures, primarily Copycat and Metacat, showing how a modified version of the Turing test (described here as a "modified Feigenbaum test") has served as a useful means for evaluating cognitive-modeling research and how it can suggest future directions for such work.

1 INTRODUCTION: THE TURING TEST IN LETTER AND SPIRIT

The method of "postulating" what we want has many advantages; they are the same as the advantages of theft over honest toil.
– Bertrand Russell, Introduction to Mathematical Philosophy

Interrogator: Yet Christmas is a Winter's day, and I do not think Mr. Pickwick would mind the comparison.
Respondent: LOL
– Pace Alan Turing, "Computing Machinery and Intelligence"

If Alan Turing were alive today, what would he think about the Turing test? Would he still consider his imitation game to be an effective means of gauging machine intelligence, given what we now know about the Eliza effect, chatbots, and the increasingly vacuous nature of interpersonal communication in the age of texting and instant messaging? One can only speculate, but we suspect he would find current instantiations of his eponymous test, most notably the annual Loebner Prize competition, to be disappointingly literal-minded.

Before going further, it will help to recall Turing's famous prediction about the test from 1950:

I believe that in about fifty years' time it will be possible, to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning ([22], p. 442).

The Loebner Prize competition adheres closely to the outward form—or letter—of this imitation game, right down to the five-minute interaction period and (at least for the ultimate Grand Prize) the 70-percent threshold. (Of course, the year 2000 came and went without this prediction coming to pass, but that is not at issue here.) However, it is questionable how faithful the competition is to the underlying purpose—or spirit—of the game, which is, after all, to assess whether a given program or artifact should be deemed intelligent, at least relative to human beings (see [5] for more discussion of the distinction between human-like intelligence and other forms of intelligence in relation to the Turing test). More generally, we might say that the broader purpose of the test is to assess progress in AI, or at least that subset of AI that is concerned with modeling human intelligence. Alas, this purpose gets obscured when the emphasis turns from pursuing this long-term goal to simply "beating the test." Perhaps this shift in emphasis is an inevitable consequence of using a behavioral test: "If we don't want that," one might argue, "then let us have another test." Indeed, suggestions have been offered for modifying the Turing test (cf. [6], [7], [3]), but we still see value in the basic idea behind the test—that of using observable "behavior" to infer underlying mechanisms and processes.

1.1 Priorities and payoffs

The letter–spirit distinction comes down to a question of research priorities, of short-term versus long-term payoffs. In the short term, the emphasis on beating the test has brought programs close to "passing the Turing test" in its Loebner Prize instantiation.
Brian Christian, who participated in the 2009 competition as a confederate (i.e., one of the humans the contestant programs are judged against) and described the experience in his recent book The Most Human Human, admitted to a sense of urgency upon learning that "at the 2008 contest..., the top program came up shy of [passing] by just a single vote" ([1], p. 4). Yet in delving deeper into the subject, Christian realized the superficiality—the (near) triumph of "pure technique"—that was responsible for much of this success. But it is not clear that the Loebner Prize has steered researchers toward any sizable long-term payoffs in understanding human intelligence. After witnessing the first Loebner Prize competition in 1991, Stuart Shieber [20] concluded, "What is needed is not more work on solving the Turing Test, as promoted by Loebner, but more work on the basic issues involved in understanding intelligent behavior. The parlor games can be saved for later" (p. 77). This conclusion seems as valid today as it was two decades ago.

1.2 Communication, transparency, and the Turing test

The question, then, is whether we might better capture the spirit of Turing's test through other, less literal-minded means. Our answer is not only that we can, but that we must. The alternative is to risk trivializing the test by equating "intelligence" with the ability to mimic the sort of context-neutral conversation that has increasingly come to pass for "communication." Christian points out that "the Turing test is, at bottom, about the act of communication" ([1], p. 13). Yet given the two-way nature of communication, it can be hard to disentangle progress in one area (AI) from deterioration in others. As Jaron Lanier recently put it,

You can't tell if a machine has gotten smarter or if you've just lowered your standards of intelligence to such a degree that the machine seems smart. If you can have a conversation with a simulated person presented by an AI program, can you tell how far you've let your sense of personhood degrade in order to make the illusion work for you? ([13], p. 32).

In short, the Turing test's reliance on purely verbal behavior renders it susceptible to tricks and illusions that its creator could not have reasonably anticipated. Methodologies such as statistical machine learning, while valuable as computational and engineering tools, are nonetheless better suited to modeling human banality than they are to modeling human intelligence. Additionally, the test, as currently instantiated, encourages an adversarial approach between contestant and judge that does as much to obscure and inflate progress in AI as it does to provide an accurate measuring stick. It is our contention that a test that better meets Turing's original intent should instead be driven by the joint aims of collaboration and transparency.

2 INTELLIGENCE, TRICKERY, AND THE LOEBNER PRIZE

Does deception presuppose intelligence on the part of the deceiver? In proposing his imitation game, Turing wagered—at least implicitly—that the two were inseparable. Surely, a certain amount of cunning and intelligence is required on the part of humans who excel at deceiving others.
The flip side of the coin is that a degree of gullibility is required on the part of the person(s) being deceived. Things get more complicated when the deception is "perpetrated" by a technological artifact as opposed to a willfully deceptive human. To quote Shieber once again, "[I]t has been known since Weizenbaum's surprising experiences with ELIZA that a test based on fooling people is confoundingly simple to pass" (p. 72; cf. [24]). The gist of Weizenbaum's realization is that our interactions with computer programs often tell us less about the inner workings of the programs themselves than they do about our tendency to project meaning and intention onto artifacts, even when we know we should know better.

2.1 The parallel case of art forgery

For another perspective on the distinction between genuine accomplishment and mere trickery, let us consider the parallel case of art forgery. Is it possible to distinguish between a genuine artist and a mere faker? It is tempting to reply that in order to be a good faker—one good enough to fool the experts—one must necessarily be a good artist to begin with. But this sort of argument is too simplistic, as it equates artistry with technical skill and prowess, meanwhile ignoring originality, artistic vision, and other qualities that are essential to genuine artistry (cf. [14], [2]). In particular, the ability of a skilled art forger to create a series of works in the style of, say, Matisse does not necessarily imply insight into the underlying artistic or expressive vision of Matisse—the vision responsible for giving rise to those works in the first place. As philosopher Matthew Kieran succinctly puts it, "There is all the difference in the world between a painting that genuinely reveals qualities of mind to us and one which blindly apes their outward show" ([11], p. 21).

Russell's famous quote about postulation equating to theft helps us relate an AI methodology to the artistry–forgery distinction. Russell's statement can be paraphrased as follows: merely saying that there exists a function (e.g., sqrt()) with some property (e.g., sqrt(x)*sqrt(x) = x for all x >= 0) does not tell us very much about how to generate the actual sqrt() function. Similarly, the ability to reproduce a small number of values of x that meet this specification does not imply insight into the underlying mechanisms involved, relative to which the existence of these specific values is essentially a side effect. A key issue here is the small number of values: since contemporary versions of the Turing test are generally highly time-constrained, it is even more imperative that the test involve a deep probe into the possible behaviors of the respondent.

2.2 Thematic variability in art and in computation

Many of the Loebner Prize entrants (e.g., [23]) have adopted the methodologies of corpus linguistics and machine learning, so let us reframe the issue of thematic variability in these terms. We might abstractly consider the statistical machine-learning approach to the Turing test as being concerned with the induction of a generative grammar. In short, the ability to induce an algorithm that reproduces some themed collection of original works does not in itself imply that any underlying sensibilities that motivated those works can be effectively approximated by that algorithm. One way of measuring the "work capacity" of an algorithm is to employ the Kolmogorov complexity measure [21], which is essentially the size of the shortest possible functionally identical algorithm.
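For orientation (the text uses the notion informally), the standard definition can be stated in a line: relative to a fixed universal machine $U$,

$$K_U(x) \;=\; \min \{\, |p| \;:\; U(p) = x \,\}$$

that is, the length of the shortest program that makes $U$ output $x$. By the invariance theorem the choice of $U$ matters only up to an additive constant, and $K$ is not computable in general, so in practice it can only be upper-bounded or approximated.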
In the induction case, algorithms with the lowest Kolmogorov complexity will tend to be those that exhibit very little variability—in the limiting case, generating only instances from the original collection. This would be analogous to a forger who could only produce exact copies of another artist's works, rather than works "in the style of" said artist—the latter being the stock-in-trade of infamous art forgers Han van Meegeren [25] and Elmyr de Hory [10]. In contrast, programs from the family of Fluid Concepts architectures (see 4.1 below) possess relational and generative models that are domain-specific. For example, the Letter Spirit architecture [19] is specifically concerned with exploring the thematic variability of a given font style. Given Letter Spirit's (relatively) sophisticated representation of the "basis elements" and "recombination mechanisms" of form, it might reasonably be expected to have a high Kolmogorov complexity. The thematic variations generated by Letter Spirit are therefore not easily approximated by domain-agnostic data-mining approaches.

2.3 Depth, shallowness, and the Turing test

The artistry–forgery distinction is useful insofar as it offers another perspective on the issue of depth versus shallowness—an issue that is crucial in any analysis of the Turing test. Just as the skilled art forger is adept at using trickery to simulate "authenticity"—for example, by artificially aging a painting through various techniques such as baking or varnishing ([10], [25])—analogous forms of trickery tend to find their way into the Loebner Prize competition: timely pop-culture references, intentional typos and misspellings, strategic changes of subject, and so on (cf. [20], [1]). Yet these surface-level tricks have as much to do with the genuine modeling of intelligence as coating the surface of a painting with antique varnish has to do with bona fide artistry. Much like the art forger's relationship with the art world, the relationship between contestant programs and judges in the Loebner Prize is essentially adversarial, not collaborative. The adversarial nature of these contestant–judge interactions, we feel, is a driving force in the divergence of the Turing test, in its current instantiations, from the spirit in which it was originally conceived.

3 SOME VARIATIONS ON THE TURING TEST

The idea of proposing modifications to the Turing test is not a new one. In this section, we look at two such proposals—Stevan Harnad's "Total Turing Test" (and the accompanying hierarchy of Turing tests he outlines) and Edward Feigenbaum's eponymous variation on the Turing test—before discussing how they relate to our own, described below as a "modified Feigenbaum test."

3.1 The Total Turing Test

Harnad ([6], [7]) has outlined a detailed hierarchy of possible Turing tests, with Turing's own version occupying the second of five rungs on this hypothetical ladder. Harnad refers to this as the T2, or "pen-pal," level, given the strict focus on verbal (i.e., written or typed) output. Directly below this level is the t1 test (where "t" stands for "toy," not "Turing"). Harnad observed, a decade ago, that "all of the actual mind-modelling research efforts to date are still only at the t1 level, and will continue to be so for the foreseeable future: Cognitive Science has not even entered the TT hierarchy yet" ([7], §9). This is still the case today.
Just as the t1 test draws on "subtotal fragments" of T2, T2 stands in a similar relation to T3, the Total Turing Test. This test requires not just pen-pal behavior, but robotic (i.e., embodied) behavior as well. A machine that passed the Total Turing Test would be functionally (though not microscopically) indistinguishable from a human being. (The T4 and T5 levels, which make even greater demands, are not relevant for our purposes.) Clearly, there are fewer degrees of freedom—and hence less room for deception—as we climb the rungs on Harnad's ladder, particularly from T2 to T3. However, given the current state of the art, the T3 can only be considered an extremely distant goal at this point. It may be that the T2, or pen-pal, test could only be convincingly "passed"—over an arbitrarily long period of time, as Harnad stipulates, and not just the five-minute period suggested by Turing and adhered to in the Loebner Prize competition—by a system that could move around and interact with other people and things in the real world as we do. It may even be that certain phenomena that are still being modeled and tested at the t1 level—even seemingly abstract and purely "cognitive" ones such as analogy-making and categorization—are ultimately grounded in embodiment and sensorimotor capacities as well (cf. [12]), which would imply fundamental limitations for much current research. Unfortunately, such questions must be set aside for the time being, as they are beyond the scope of this paper.

3.2 The Feigenbaum test

The Feigenbaum test [3] was proposed in order to test the quality of reasoning in specialized domains—primarily scientific or otherwise technical domains such as astrophysics, computer science, and medicine. The confederate in the Feigenbaum test is not merely an ordinary human being, but an "elite scientist" and member of the U.S. National Academy of Sciences. The judge, who is also an Academy member and an expert in the domain in question, interacts with the confederate and the contestant (i.e., the program). Feigenbaum elaborates, "The judge poses problems, asks questions, asks for explanations, theories, and so on—as one might do with a colleague" ([3], p. 36). No time period is stipulated, but as with the Turing test, "the challenge will be considered met if the computational intelligence 'wins' one out of three disciplinary judging contests, that is, one of the three judges is not able to choose reliably between human and computer performer" (ibid.).

3.3 A modified Feigenbaum test

Feigenbaum's emphasis on knowledge-intensive technical domains is in keeping with his longtime work in the area of expert systems. This aspect of his test is incidental, even irrelevant, to our purposes. In fact, we go one step further with our "modified Feigenbaum test" and remove the need for an additional contestant beyond the program. Rather, the judge "interacts" directly with the program for an arbitrarily long period of time and evaluates the program's behavior directly—and qualitatively—on the basis of this interaction. (No pretense is made of the program passing for human, which would be premature and naive in any case.) What is relevant about the Feigenbaum test for our purposes is its emphasis on focused, sustained interaction between judge and program within a suitably subtle domain. Our modified Feigenbaum test stresses a similar type of interaction, though the domain—while still constrained—is far less specialized or knowledge-intensive than, say, astrophysics or medicine.
In fact, the domain we discuss below—letter-string analogies—was originally chosen as an arena for modeling cognition because of its balance of generality and tractability [9]. In other words, the cognitive processes involved in thinking and otherwise "operating" within the domain are intended to be more or less general and domain-independent. At the same time, the restriction of the domain, in terms of the entities and relationships that make it up, is meant to ensure tractability and plausibility—in contrast to dealing (or pretending to deal) with complex real-world knowledge of a sort that can scarcely be attributed to a computer program (e.g., knowledge of medicine, the solar system, etc.). In the following section, we argue on behalf of this approach and show how research carried out under this ongoing program represents an example of how one can take the idea of Turing's test seriously without taking its specifications literally.

4 TAKING TURING SERIOUSLY: AN ALTERNATIVE APPROACH

In an essay entitled "On the Seeming Paradox of Mechanizing Creativity," Hofstadter [8] relates Myhill's [17] three classes of mathematical logic to categories of behavior. The most inclusive category, the productive, is the one that is of central interest to us here. While no finite collection of rules suffices to generate all members of a productive set P (and no x ∉ P), a more expansive and/or sophisticated set of generative rules (i.e., creative processes) can approximate P with unbounded accuracy. In order to emphasize the role of such "unbounded creativity" in the evaluation of intelligence, we describe a modified Feigenbaum test restricted to the microdomain of letter-string analogies. An example of such a problem is, "If abc changes to abd, how would you change pxqxrx in 'the same way'?" (or simply abc → abd; pxqxrx → ???). Problems in this domain have been the subject of extensive study [9], resulting in the creation of the well-known Copycat model [16] and its successor, Metacat [15]. Before describing this test, however, we briefly discuss these programs' architectures in general terms.

4.1 Copycat, Metacat, and Fluid Concepts architectures

Copycat's architecture consists of three main components, all of which are common to the more general Fluid Concepts architectural scheme. These components are the Workspace, which is essentially the program's working memory; the Slipnet, a conceptual network with variably weighted links between concepts (essentially a long-term memory); and the Coderack, home to a variety of agent-like codelets, which perform specific tasks in (simulated) parallel, without the guidance of an executive controller. For example, given the problem abc → abd; iijjkk → ???, these tasks would range from identifying groups (e.g., the jj in iijjkk) to proposing bridges between items in different letter-strings (e.g., the b in abc and the jj in iijjkk) to proposing rules to describe the change in the initial pair of strings (i.e., the change from abc to abd). (See [16] for an in-depth discussion of codelet types and functions in Copycat.)
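As a rough illustration of the control regime just described (and only of that; the class and codelet names below are hypothetical, and the real Copycat involves far more machinery), a Coderack can be thought of as a pool of small tasks chosen stochastically by urgency, each of which may post further tasks:

```python
# A minimal, hypothetical sketch of a Coderack-style control loop: codelets
# are drawn with probability proportional to their urgency, and a running
# codelet may post follow-up codelets. Not Copycat's actual implementation.
import random

class Coderack:
    def __init__(self):
        self.codelets = []  # (urgency, function) pairs

    def post(self, urgency, fn):
        self.codelets.append((urgency, fn))

    def step(self):
        # Roulette-wheel selection biased by urgency.
        total = sum(u for u, _ in self.codelets)
        r, acc = random.uniform(0, total), 0.0
        for i, (u, fn) in enumerate(self.codelets):
            acc += u
            if r <= acc:
                self.codelets.pop(i)
                return fn(self)

def scan_for_groups(rack):
    print("looking for sameness groups, e.g. jj in iijjkk")
    rack.post(5, propose_bridge)  # follow-up work at higher urgency

def propose_bridge(rack):
    print("proposing a bridge, e.g. between b in abc and jj in iijjkk")

rack = Coderack()
rack.post(1, scan_for_groups)
while rack.codelets:
    rack.step()
```

The essential design point is the absence of an executive controller: which codelet runs next is a biased random draw, so perception emerges from many small, competing actions rather than from a fixed pipeline.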
Building on Copycat, Metacat incorporates some additional components that are not present in its predecessor's architecture, most notably the Episodic Memory and the Temporal Trace. As the program's name suggests, the emphasis in Metacat is on metacognition, which can broadly be defined as the process of monitoring, or thinking about, one's own thought processes. What this means for Metacat is an ability to monitor, via the Temporal Trace, events that take place en route to answering a given letter-string problem, such as detecting a "snag" (e.g., trying to find the successor to z, which leads to a snag because the alphabet does not "circle around" in this domain) or noticing a key idea. Metacat also keeps track of its answers to previous problems, as well as its responses on previous runs of the same problem, both via the Episodic Memory. As a result, it is able to be "reminded" of previous problems (and answers) based on the problem at hand. Finally, it is able to compare and contrast two answers at the user's prompting (see Section 4.3 below).

Philosophically speaking, Fluid Concepts architectures are predicated upon the conviction that it is possible to "know everything about" the entities and relationships in a given microdomain. In other words, there is no propositional fact about domain entities and processes (or the effect of the latter on the former) that is not in principle accessible to inspection or introspection. In Copycat, the domain entities range from permanent "atomic" elements (primarily, the 26 letters of the alphabet) to temporary, composite ones, such as the letter strings that make up a given problem (abc, iijjkk, pxqxrx, etc.); the groups within letter strings that are perceived during the course of a run (e.g., the ii, jj, and kk in iijjkk); and the bonds that are formed between such groups. The relationships include concepts such as same, opposite, successor, predecessor, and so on. A key aspect of the Fluid Concepts architecture is that it affords an exploration of the space of instantiations of those entities and relationships in a (largely) non-stochastic fashion—that is, in a manner that is predominantly directed by the nature of the relationships themselves. In contrast, the contextual pressures that give rise to some subtle yet low-frequency solutions are unlikely to have a referent within a statistical machine-learning model built from a corpus of Copycat answers, since outliers are not readily captured by gross mechanisms such as sequences of transition probabilities.

4.2 An example from the Copycat microdomain

To many observers, a letter-string analogy problem such as the aforementioned abc → abd; iijjkk → ??? might appear trivial at first glance. (Such problems may seem to bear a strong resemblance to the kinds of problems one might find on an IQ test. However, an important difference worth noting is that the problems in the Copycat domain are not conceived of as having "correct" or "incorrect" answers, though in many cases there are clearly "better" and "worse" ones. Rather, the answers are open to discussion, and the existence of subtle differences between the various answers to a given problem is an important aspect of the microdomain.) Yet upon closer inspection, one can come to appreciate the surprising subtleties involved in making sense of even a relatively basic problem like this one. Consider the following (non-exhaustive) list of potential answers to the above problem:

• iijjll – To arrive at this seemingly basic answer requires at least three non-trivial insights: (1) seeing iijjkk as a sequence of three sameness groups—ii, jj, and kk—not as a sequence of individual letters; (2) seeing the group kk as playing the same role in iijjkk that the letter c does in abc; and (3) seeing the change from c to d in terms of successorship and not merely as a change from the letter c to the letter d. The latter point may seem trivial, but it is not a given, and as we will see, there are other possible interpretations.

• iijjkl – This uninspiring answer results from simply changing the letter category of the rightmost letter in iijjkk to its successor, as opposed to the letter category of the rightmost group.
• iijjkd – This answer results from the literal-minded strategy of simply changing the last letter in the string to d, all the while ignoring the other relationships among the various groups and letter categories.

• iijjdd – This semi-literal, semi-abstract answer falls somewhere in between iijjll and iijjkl. On the one hand, it reflects a failure to perceive the change from c to d in the initial string in terms of successorship, instead treating it as a mere replacement of the letter c with the letter d. On the other hand, it does signal a recognition that the concept group is important, as it at least involves carrying out the change from k to d in the target string over to both ks and not just the rightmost one. This answer has a "humorous" quality to it, unlike iijjkl or iijjkd, due to its mixture of insight and confusion.

This incomplete catalog of answers hints at the range of issues that can arise in examining a single problem in the letter-string analogy domain. Copycat itself is able to come up with all of the aforementioned answers (along with a few others), as illustrated in Table 1, which reveals iijjll to be the program's "preferred choice" according to the two available measures. These measures are (1) the relative frequency with which each answer is given and (2) the average "final temperature" associated with each answer. Roughly speaking, the temperature—which can range from 0 to 100—indicates the program's moment-to-moment "happiness" with its perception of the problem during a run, with a lower temperature corresponding to a more positive evaluation.

Table 1. Copycat's performance over 1000 runs on the problem abc → abd; iijjkk → ???. Adapted from [16].

Answer   Frequency   Average Final Temperature
iijjll   810         27
iijjkl   165         47
iijjdd     9         32
iikkll     9         46
iijkll     3         43
iijjkd     3         65
ijkkll     1         43
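The gap between the literal-minded answer iijjkd and the group-aware answer iijjll can be made concrete in a few lines. The sketch below is a hypothetical illustration of the two rule interpretations discussed above, not of Copycat's actual stochastic mechanism:

```python
import itertools

def literal_change(target):
    # "Replace the last letter with d": the literal-minded reading.
    return target[:-1] + "d"

def successor(letter):
    return chr(ord(letter) + 1)

def group_aware_change(target):
    # Parse the string into runs of identical letters (sameness groups),
    # then apply successorship to the rightmost *group*.
    groups = ["".join(run) for _, run in itertools.groupby(target)]
    groups[-1] = successor(groups[-1][0]) * len(groups[-1])
    return "".join(groups)

print(literal_change("iijjkk"))      # -> iijjkd
print(group_aware_change("iijjkk"))  # -> iijjll
```

Even this toy contrast shows why "change the last letter to d" and "take the successor of the rightmost group" are genuinely different readings of the same initial change.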
4.3 The modified Feigenbaum test: from Copycat to Metacat

One limitation of Copycat is its inability to "say" anything about the answers it gives beyond what appears in its Workspace during the course of a run. While aggregate statistics such as those illustrated in Table 1 can offer some insight into its performance, the program is not amenable to genuine Feigenbaum-testing, primarily because it doesn't have the capacity to summarize its viewpoint. To the extent that it can be Feigenbaum-tested, it can only be so in response to what might be termed first-order questions (e.g., abc → abd; iijjkk → ???). It cannot answer second-order questions (i.e., questions about questions), let alone questions about its answers to questions about questions. In contrast, Metacat allows us to ask increasingly sophisticated questions of it, and thus can be said to allow for the sort of modified Feigenbaum-testing described in Section 3.3.

One can "interact" with the program in a variety of ways: by posing new problems; by inputting an answer to a problem and running the program in "justify mode," asking it to evaluate and make sense of the answer; and by having it compare two answers to one another. In doing the latter, the program summarizes its "viewpoint" with one of a set of canned (but non-arbitrary) English descriptions. For example, the preferred answer might be "based on a richer set of ideas," "more abstract," or "more coherent." The program also attempts to "explain" how the two answers are similar to each other and how they differ. For example, consider the program's summary of the comparison between iijjll and iijjdd in response to the aforementioned problem:

The only essential difference between the answer iijjdd and the answer iijjll to the problem abc → abd; iijjkk → ??? is that the change from abc to abd is viewed in a more literal way for the answer iijjdd than it is in the case of iijjll. Both answers rely on seeing two strings (abc and iijjkk in both cases) as groups of the same type going in the same direction. All in all, I'd say iijjll is the better answer, since it involves seeing the change from abc to abd in a more abstract way.

It should be emphasized that the specific form of the verbal output is extremely unsophisticated relative to the capabilities of the underlying architecture, indicating that it is possible to exhibit depth of insight while treating text generation as essentially a side effect. This contrasts sharply with contemporary approaches to the Turing test. For the sake of contrast, here is the program's comparison between the answers iijjll and abd, which illustrates some of the program's limitations in clumsily (and, of course, unintentionally) humorous fashion:

The only essential difference between the answer abd and the answer iijjll to the problem abc → abd; iijjkk → ??? is that the change from abc to abd is viewed in a completely different way for the answer abd than it is in the case of iijjll. Both answers rely on seeing two strings (abc and iijjkk in both cases) as groups of the same type going in the same direction. All in all, I'd say abd is really terrible and iijjll is very good.

Apart from the thin veneer of human agency that results from Metacat's text generation, the program's accomplishments—and just as importantly, its failures—become transparent through interaction.

4.4 Looking ahead

In order for it to actually pass an "unrestricted modified Feigenbaum test" in the letter-string analogy domain, what other questions might we conceivably require Metacat to answer? Here are some suggestions:

1. Problems that involve more holistic processing of letter strings. There are certain letter strings that humans seem to have little trouble processing, but that are beyond Metacat's grasp—for example, the string oooaaoobboooccoo in the problem abc → abd; oooaaoobboooccoo → ???. How are we so effortlessly able to "tune out" the o's in oooaaoobboooccoo? What would it take for a Metacat-style program to be able to do likewise?

2. Meta-level questions about sequences of answers. For example, "How is the relationship between answer A and answer B different from that between C and D?" Such questions could be answered using the declarative information that Metacat already has; all that would seem to be required is the ability to pose the question.
3. Questions pertaining to concepts about analogy-making in general, such as mapping, role, theme, slippage, pressure, pattern, and concept. Metacat deals implicitly with all of these ideas, but it doesn't have explicit knowledge or understanding of them.

4. An ability to characterize problems in terms of "the issues they are about," with the ultimate goal of having a program that is able to create new problems of its own—which would certainly lead to a richer, more interesting exchange between the program and the human interacting with it. Some work in this area was done in the Phaeaco Fluid Concepts architecture [4], but the issue requires further investigation.

5. Questions of the form, "Why is answer A more humorous (or stranger, or more elegant, etc.) than answer B?" Metacat has implicit notions, however primitive, of concepts such as succinctness, coherence, and abstractness, which figure into its answer comparisons. These notions pertain to aesthetic judgment insofar as we tend to find things that are succinct, coherent, and reasonably abstract to be more pleasing than things that are prolix, incoherent, and either overly literal or overly abstract. Judgments involving humor often take into account such factors, too, among many others. Metacat's ability—however rudimentary—to employ criteria such as abstractness and coherence in its answer evaluations could be seen as an early step toward understanding how these kinds of qualitative judgments might emerge from simpler processes. On the other hand, for adjectives such as "humorous," which presuppose the possession of emotional or affective states, it is not at all clear what additional mechanisms might be required, though some elementary possibilities are outlined in [18].

6. A rudimentary sense of the "personality traits" associated with certain patterns of answers. In other words, just as Metacat is able to compare two answers with one another, a meta-Metacat might be able to compare two sets of answers—and, correspondingly, two answerers—with one another. For example, a series of literal-minded or short-sighted answers might yield a perception of the answerer as being dense, while a series of sharp, insightful answers punctuated by the occasional obvious clunker might yield a picture of an eccentric smart-aleck.

Ultimately, however, the particulars of Copycat, Metacat, and the letter-string analogy domain are not so important in and of themselves. The programs merely serve as an example of a kind of approach to modeling cognitive phenomena, just as the domain itself serves as a controlled arena for carrying out such modeling. To meet the genuine intent of the Turing test, we must be able to partake in the sort of arbitrarily detailed and subtle discourse described above in any domain. As the foregoing list shows, however, there is much that remains to be done, even—to stick with our example—within the tiny domain in which Copycat and Metacat operate. It is unclear how far a disembodied computer program, even an advanced successor to these two models, can go toward modeling socially and/or culturally grounded phenomena such as personality, humor, and aesthetic judgment, to name a few of the more obvious challenges involved in achieving the kind of discourse that our "test" ultimately calls for.
At the same time, it is unlikely that such discourse lies remotely within the capabilities of any of the current generation of Loebner Prize contenders, nor does it even seem to be a goal of such contenders.

5 CONCLUSION

We have argued that the Turing test would more profitably be considered as a sequence of modified Feigenbaum tests, in which the questioner and respondent are to collaborate in an attempt to extract maximum subtlety from a succession of arbitrarily detailed domains. In addition, we have explored a parallel between the "domain-agnostic" approach of statistical machine learning and that of artistic forgery, in turn arguing that by requesting successive variations on an original theme, a critic may successfully distinguish mere surface-level imitations from those that arise via the meta-mechanisms constitutive of genuine creativity and intelligence. From the perspective we have argued for, Metacat and the letter-string-analogy domain can be viewed as a kind of Drosophila for the Turing test, with the search for missing mechanisms directly motivated by the specific types of questions we might conceivably ask of the program.

ACKNOWLEDGEMENTS

We would like to thank Vincent Müller and Aladdin Ayesh for their hard work in organizing this symposium, along with the anonymous referees who reviewed and commented on the paper. We would also like to acknowledge the generous support of Indiana University's Center for Research on Concepts and Cognition.

REFERENCES

[1] B. Christian, The Most Human Human, Doubleday, New York, 2011.
[2] D. Dutton, 'Artistic crimes', British Journal of Aesthetics, 19, 302–314, (1979).
[3] E. A. Feigenbaum, 'Some challenges and grand challenges for computational intelligence', Journal of the ACM, 50(1), 32–40, (2003).
[4] H. Foundalis, Phaeaco: A cognitive architecture inspired by Bongard's problems, Doctoral dissertation, Indiana Univ., Bloomington, 2006.
[5] R. French, 'Subcognition and the limits of the Turing test', Mind, 99, 53–65, (1990).
[6] S. Harnad, 'The Turing test is not a trick: Turing indistinguishability is a scientific criterion', SIGART Bulletin, 3(4), 9–10, (1992).
[7] S. Harnad, 'Minds, machines and Turing: the indistinguishability of indistinguishables', Journal of Logic, Language, and Information, 9(4), 425–445, (2000).
[8] D. R. Hofstadter, Metamagical Themas: Questing for the Essence of Mind and Pattern, Basic Books, New York, 1986.
[9] D. R. Hofstadter, Fluid Concepts and Creative Analogies, Basic Books, New York, 1995.
[10] C. Irving, Fake! The story of Elmyr de Hory, the greatest art forger of our time, McGraw-Hill, New York, 1969.
[11] M. Kieran, Revealing Art, Routledge, London, 2005.
[12] B. Kokinov, V. Feldman, and I. Vankov, 'Is analogical mapping embodied?', in New Frontiers in Analogy Research, eds., B. Kokinov, K. Holyoak, and D. Gentner, New Bulgarian Univ. Press, Sofia, Bulgaria, (2009).
[13] J. Lanier, You Are Not a Gadget, Alfred A. Knopf, New York, 2010.
[14] A. Lessing, 'What is wrong with a forgery?', Journal of Aesthetics and Art Criticism, 23(4), 461–471, (1965).
[15] J. Marshall, Metacat: A self-watching cognitive architecture for analogy-making and high-level perception, Doctoral dissertation, Indiana Univ., Bloomington, 1999.
[16] M. Mitchell, Analogy-Making as Perception: A Computer Model, MIT Press, Cambridge, Mass., 1993.
[17] J. Myhill, 'Some philosophical implications of mathematical logic', Review of Metaphysics, 6, 165–198, (1952).
[18] R. Picard, Affective Computing, MIT Press, Cambridge, Mass., 1997.
[19] J. Rehling, Letter Spirit (part two): Modeling creativity in a visual domain, Doctoral dissertation, Indiana Univ., Bloomington, 2001.
[20] S. Shieber, 'Lessons from a restricted Turing test', Communications of the ACM, 37(6), 70–78, (1994).
[21] R. J. Solomonoff, 'A formal theory of inductive inference, pt. 1', Information and Control, 7(1), 1–22, (1964).
[22] A. Turing, 'Computing machinery and intelligence', Mind, 59, 433–460, (1950).
[23] R. Wallace, 'The anatomy of A.L.I.C.E.', in Parsing the Turing Test, eds., R. Epstein, G. Roberts, and G. Beber, 1–57, Springer, Heidelberg, (2009).
[24] J. Weizenbaum, Computer Power and Human Reason, Freeman, San Francisco, 1976.
[25] H. Werness, 'Han van Meegeren fecit', in The Forger's Art, ed., D. Dutton, 1–57, Univ. of California Press, Berkeley, (1983).

Laws of Form and the Force of Function: Variations on the Turing Test

Hajo Greif (University of Klagenfurt, Austria, hajo.greif@aau.at)

Abstract. This paper commences from the critical observation that the Turing Test (TT) might not be best read as providing a definition or a genuine test of intelligence by proxy of a simulation of conversational behaviour. Firstly, the idea of a machine producing likenesses of this kind served a different purpose in Turing, namely providing a demonstrative simulation to elucidate the force and scope of his computational method, whose primary theoretical import lies within the realm of mathematics rather than cognitive modelling. Secondly, it is argued that a certain bias in Turing's computational reasoning towards formalism and methodological individualism contributed to systematically unwarranted interpretations of the role of the TT as a simulation of cognitive processes. On the basis of the conceptual distinction in biology between structural homology and functional analogy, a view towards alternate versions of the TT is presented that could function as investigative simulations into the emergence of communicative patterns oriented towards shared goals. Unlike the original TT, the purpose of these alternate versions would be co-ordinative rather than deceptive. On this level, genuine functional analogies between human and machine behaviour could arise in quasi-evolutionary fashion.

1 A Turing Test of What?

While the basic character of the Turing Test (henceforth TT) as a simulation of human conversational behaviour remains largely unquestioned in the sprawling debates it has triggered, there are a number of diverging interpretations as to whether and to what extent it provides a definition, or part of a definition, of intelligence in general, or whether it amounts to the design of an experimental arrangement for assessing the possibility of machine intelligence in particular. It thus remains undecided what role, if any, there is for the TT to play in cognitive inquiries. I will follow James H. Moor [13] and other authors [21, 2] in their analysis that, contrary to seemingly popular perception, the TT neither provides a definition nor an empirical criterion of the named kind. Nor was it intended to do so. At least at one point in Alan M. Turing's mostly rather informal musings on machine intelligence, he explicitly dismisses the idea of a definition, and he attenuates the idea of an empirical criterion of machine intelligence:

I don't really see that we need to agree on a definition [of thinking] at all.
The important thing is to try to draw a line between the properties of a brain, or of a man, that we want to discuss, and those that we don't. To take an extreme case, we are not interested in the fact that the brain has the consistency of cold porridge. We don't want to say 'This machine's quite hard, so it isn't a brain, and so it can't think.' I would like to suggest a particular kind of test that one might apply to a machine. You might call it a test to see whether the machine thinks, but it would be better to avoid begging the question, and say that the machines that pass are (let's say) 'Grade A' machines. [...] (Turing in a BBC radio broadcast of January 10th, 1952, quoted after [3, p. 494 f])

Turing then goes on to introduce a version of what has come to be known, perhaps a bit unfortunately, as the Turing Test, but was originally introduced as the "imitation game". In place of the articulation of definitions of intelligence or the establishment of robust empirical criteria for intelligence, we find much less ambitious, and arguably more playful, claims. One purpose of the test was to develop a thought-experimental, inductive approach to identifying those properties shared between the human brain and a machine which would actually matter to asking the question of whether men or machines alike can think: What is the common ground human beings and machines would have to share in order to also share a set of cognitive traits? It was not a matter of course in Turing's day that there could possibly be any such common ground, as cognition was mostly considered essentially tied to (biological or other) human nature. (In [1, p. 168 f], Margaret Boden notices that the thought that machines could possibly think was not even a "heresy" up to the early 20th century, as that claim would have been all but incomprehensible.) In many respects, the TT was one very instructive and imaginative means of raising the question whether the physical constitution of different systems, whether cold-porridge-like or electric-circuitry-like, makes a principled difference between a system with and a system without cognitive abilities. Turing resorted to machine simulations of behaviours that would normally be considered expressions of human intelligence in order to demonstrate that the lines of demarcation between the human and the mechanical realm are less than stable.

The TT is however not sufficient as a means for answering the questions it first helped to raise, nor was it so intended. Turing's primary aim for the TT was one demonstration, among others, of the force and scope of what he introduced as the "computational method" (which will be briefly explained in section 2). Notably, the computational method has a systematically rooted bias towards, firstly, considering a system's logical form over its possible functions and, secondly, towards methodological individualism. I will use Turing's mathematical theory of morphogenesis and, respectively, the distinction between the concepts of structural homology and functional analogy in biology as the background for discussing the implications of this twofold bias (in section 3). On the basis of this discussion, a tentative reassessment of the potentials and limits of the
If there is a systematic investigative role to play in cognitive inquiries for modified variants of the TT, these would have to focus on possible functions to be shared between humans and machines, and they would have to focus on shared environments of interaction rather than individual behaviours.

2 The Paradigm of Computation

Whether intentionally or not, Turing's reasoning contributed to breaking the ground for the functionalist arguments that prevail in much of the contemporary philosophies of biology and mind: An analysis is possible of the operations present within a machine or an organism that systematically abstracts from their respective physical nature. A set of operations identical on a specified level of description can be accomplished in a variety of physical arrangements. Any inference from the observable behavioural traits of a machine simulating human communicative behaviour, as in the TT, to an identity of underlying structural features would appear unwarranted. Turing's work was concerned with the possibilities of devising a common logical form for abstractly describing the operations in question. His various endeavours, from morphogenesis via (proto-)neuronal networks to the simulation of human conversational behaviour, can be subsumed under the objective of exploring what his "computational method" could achieve across a variety of empirical fields and under a variety of modalities. Simulations of conversational behaviours that had hitherto been considered an exclusively human domain constituted but one of these fields, investigated under one modality. Turing's computational method is derived from his answer to a logico-mathematical problem, David Hilbert's "Entscheidungsproblem" (the decision problem) in predicate logic, as presented in [8]. This problem amounts to the question whether, within the confines of a logical calculus, there is an unequivocal, well-defined and finite, hence at least in principle executable, procedure for deciding on the truth of a proposition stated in that calculus. After Kurt Gödel's demonstration that arithmetic is incomplete and that its consistency cannot be proven within the confines of arithmetic proper [7], the question of deciding on the truth of arithmetical propositions from within that same axiomatic system had to be recast as a question of deciding on the internal provability of such propositions. The – negative – answer to this reformulated problem was given by Turing [18] (and, a little earlier, by a slightly different method, by Alonzo Church). Turing's path towards that answer was based on Gödel's elegant solution to the former two problems, namely a translation into arithmetical forms of the logical operations required for deciding on the provability of a proposition within the system of arithmetical axioms. Accordingly, the method of further investigation was to examine the calculability of the arithmetical forms so generated. To decide on the calculability of the problem in turn, Turing introduced the notion of computability.
A mathematical problem is considered computable if the process of its solution can be broken down into a set of exact elementary instructions by which one will arrive at a determinate solution in a finite number of steps, and which could be accomplished, at least in principle, by human "computers".3 Even complex problems should thus become reducible to a set of basic operations. The fulfilment of the required routines demands an ability to apply a set of rules and, arguably, some mental discipline, but these routines are not normally considered part of the most typical or complex properties of human thought – and they can be mechanised, in a more direct, material sense, by an appropriately constructed and programmed machine. Hence, Turing's notion of "mechanical" was of a fairly abstract kind. It referred to a highly standardised and routinised method of solving mathematical problems, namely the computational method proper. This method could be equally applied by human, mechanical or digital "computers", or by any other system capable of following the required routines. Given this description of computability, the primary aim of Turing's models of phenomena such as morphogenesis, the organisation of the nervous system or the simulation of human conversation lies in finding out whether, how and to what extent their specific structural or behavioural patterns can be formally described in computational terms – and thus within the realm of mathematics. A successful application of the computational method to the widest variety of phenomena would have implications on higher epistemological or arguably even metaphysical levels, but, being possible implications, these are not contained within the mathematical theory.

3 I am following B. Jack Copeland [4] here on his definition of computability, as he makes a considerable effort at spelling out what notion of computability Turing was using in [18]. He thus hopes to stem the often-lamented flood of loose and misleading uses of that term in many areas of science.

3 The Relevance of Form and Function

The design of Turing's computational method intuitively suggests, but does not entail, that the phenomena in question are chiefly considered in their computationally modellable form. Turing focuses on the formal patterns of organic growth, on the formal patterns of neuronal organisation and re-organisation in learning, and on the logical forms of human conversation. The possible or actual functions of these formally described patterns, in terms of the purposes they do or may serve, are not systematically considered. A second informal implication of Turing's computational approach lies in his focus on the behaviour of isolated, individual systems – hence not on the organism in its environment, but on the human brain as a device with input and output functions.4 Such a focus on self-contained, individual entities was arguably guided by a methodological presupposition informed by the systematic goals of Turing's research: The original topics of his inquiry were the properties of elementary recursive operations within a calculus. Hence, any empirical test of the force and scope of the computational method, that is, any test of what can be accomplished by means of such elementary recursive operations, would naturally but not necessarily commence in the same fashion. In order to get a clearer view of this twofold bias, it might be worthwhile to take a closer look at the paradigm of Turing's computational method.
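Before turning to that paradigm, the notion of computability introduced above can be given a concrete shape. What follows is a minimal sketch written for illustration only – the machine, its rule table and its unary encoding are our inventions, not taken from Turing's papers. It computes the successor of a unary numeral by exactly the kind of exact elementary instructions described above:

```python
def run(tape, rules, state="start", blank="_"):
    """Follow (state, symbol) -> (state, symbol, move) rules until 'halt'."""
    cells, head = dict(enumerate(tape)), 0
    while state != "halt":
        symbol = cells.get(head, blank)
        state, written, move = rules[(state, symbol)]
        cells[head] = written
        head += {"R": 1, "L": -1}[move]
    return "".join(cells[i] for i in sorted(cells)).strip(blank)

# One elementary instruction per (state, symbol) pair: scan rightwards
# over the marks of a unary numeral, append one more mark, then halt.
successor_rules = {
    ("start", "1"): ("start", "1", "R"),
    ("start", "_"): ("halt", "1", "R"),
}

print(run("111", successor_rules))  # '1111', i.e. 3 + 1 in unary
```

Any system capable of looking up a rule, writing a symbol and moving the head – a disciplined human "computer" no less than an electronic machine – can execute such a table, which is precisely the fairly abstract sense of "mechanical" discussed above.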
That paradigm, in terms of elaboration, rigour and systematicity, is not to be found in his playful and informal imitation game approach to computer simulations of conversational behaviour. Instead, it is to be found in his mathematical theory of morphogenesis [20]. This inquiry was guided by Sir D'Arcy Thompson's On Growth and Form [17], influential in its time, and it was directed at identifying the basic chemical reactions involved in generating organic patterns, from an animal's growth to the grown animal's anatomy, from the dappledness or stripedness of furs to the arrangement of a sunflower's florets and the phyllotactic ordering of leaves on a plant's twigs. The generation of such patterns was modelled in rigorously formal-mathematical fashion. The resulting model was impartial to the actual biochemical realisation of pattern formation. It would only provide some cues as to what concrete reactants, termed "morphogens" by Turing, one should look out for.

4 For this observation, see, for example, [9, p. 85].

Less obviously but similarly important, Turing chose not to inquire into any adaptive function, in Darwinian terms, of the patterns so produced. These patterns may or may not serve an adaptive function, and what that function amounts to is of secondary concern at best. Explaining the generation of their form does not contribute to explaining that form's function, nor does it depend on that function. In this respect, too, Turing's thoughts appear to be in line with, if not explicitly endorsing, D'Arcy Thompson's skeptical view of the relevance of adaptation by natural selection in evolution. The formative processes in organisms are considered at least partly autonomous from Darwinian mechanisms. Whether the florets of a sunflower are patterned on a Fibonacci series, as they in fact are, or whether they are laid out in grid-like fashion, as they possibly cannot be according to the mathematical laws of form expounded by Turing, is unlikely to make a difference in terms of selective advantage. In turn however, natural selection may not offer a path to a grid-like pattern in the first place, while enabling, but arguably not determining, the Fibonacci pattern. In like fashion, the cognitive abilities of human beings or other animals would not in the first place be considered as adaptive abilities, defined in relation to challenges posed by their environments, but in their mathematically modellable form. Turing's bias towards form over function, in conjunction with his methodological individualism, created a difficulty in systematically grasping a relation that might look straightforward or even obvious to the contemporary reader, who is likely to be familiar with the role of populations and environments in evolution, and who might also be familiar with philosophical concepts of functions: the analogy of functions across different, phylogenetically distant species. In Turing's notion of decoupling logical form from physical structure, the seeds of the concept of functional analogy appear to be sown, however without growing to a degree of maturity that would prevent the premature conclusions often drawn from Turing's presentation of the TT. It is the condition of observable similarity in behaviour that has been prone to misguide both proponents and critics of the TT.
One cannot straightforwardly deduce a similarity of kind – in this case, being in command of a shared form of intelligence – from a similarity in appearance. A relation of proximity in kind could only be firmly established on the grounds of a relation of common descent, that is, from being part of the same biological population or from being assembled according to a common design or Bauplan. This is the ultimate skeptical resource for the AI critic who will never accept some computer's or robot's trait as the same as or equivalent to a human one. However convincing it may look to the unprejudiced observer, any similarity will be dismissed as a feat of semi-scientific gimmickry. Even a 1:1 replica of a human being, down to artificial neurones and artificial muscles made of high-tech carbon-based fibres, is unlikely to convince him or her. What the skeptic is asking for is a structural homology to lie at the foundation of observable similarities. In the biological discipline of morphology, the distinction between analogies and homologies was first systematically applied by Richard Owen, who defined it as follows:

"ANALOGUE." – A part or organ in one animal which has the same function as another part or organ in a different animal. "HOMOLOGUE." – The same organ in different animals under every variety of form and function. [15, p. 7, capitalisation in original]

This distinction was put on an evolutionary footing by Charles Darwin, who gave a paradigmatic example of homology himself when he asked: "What can be more curious than that the hand of a man, formed for grasping, that of a mole for digging, the leg of the horse, the paddle of the porpoise, and the wing of the bat, should all be constructed on the same pattern, and should include the same bones, in the same relative positions?" [5, p. 434] – where the reference of "the same" for patterns, bones and relative positions is fixed by their common ancestral derivation rather than, as for Owen and other Natural Philosophers of his time, by abstract archetypes. In contrast, an analogy of function of traits or behaviours amounts to a similarity or sameness of the purpose which a certain trait or behaviour serves, but which, firstly, may be realised in phenotypically variant form and which, secondly, does not have to be derived from a relation of common descent. For example, consider the function of vision in different species, which is realised in a variety of eye designs made from different tissues, and which is established along a variety of lines of descent. The most basic common purpose of vision for organisms is navigation within their respective environments. This purpose is shared by camera-based vision in robots, which arguably have an aetiology very different from any natural organism. Conversely, the same navigational purpose is served by echolocation in bats, which functions in an entirely different physical medium and under entirely different environmental circumstances, namely the absence of light. There are no principled limitations as to how a kind of function is realised and by what means it is transmitted. The way in which either variable is fixed depends on the properties of the (biological or technological) population and of the environment in question. In terms of determining its content, a function is fixed by the relation between an organism's constitution and the properties of the environment in which it finds itself, and thus by what it has to accomplish in relation to organic and environmental variables in order to prevail.
This very relation may be identical even where the constitution of organisms and the properties of the environment are at variance between different species. Perceiving spatial arrangements in order to locomote under different lighting conditions would be a case in point. In terms of the method by which a function is fixed, a history of differential reproduction of variant traits that are exposed to the variables of the environment in which some population finds itself will determine the functional structure of those traits. If an organism is endowed with a reproducible trait whose effects keep in balance those environmental variables which are essential to the organism's further existence and reproduction, and if this happens in a population of reproducing organisms with sufficient frequency (which does not even have to be extremely high), the effects of that trait will be its functions.5 Along the lines of this argument, an analogy of function is possible between different lines of descent, provided that the environmental challenges for various phylogenetically remote populations are similar. There are no a priori criteria by which to rule out the possibility that properties of systems with a common descent from engineering processes may be functionally analogous to the traits and behaviours of organisms. In turn, similarity in appearance is at most a secondary consequence of functional analogy. Although such similarity is fairly probable to occur, as in the phenomenon of convergent evolution, it is never a necessary consequence of functional analogy. The similarity that is required to hold between different kinds of systems lies in the tasks for whose fulfilment their respective traits are selected. Structural homology, on the other hand, requires neither a similarity of tasks nor a similarity of appearance, but a common line of descent from which some trait hails, whatever function it may have acquired later along that line, and whatever observable similarity it may bear to its predecessor. In terms of providing criteria of similarity that go beyond what can be observed on the phenotypical level, functional analogy trumps structural homology.

5 This is the case for aetiological theories of function, as pioneered by [23] and elaborated by [11].

4 The Turing Test as Demonstrative vs. Investigative Simulation

On the grounds of the above argument, the apparent under-definition of the epistemic role of the TT owes to an insufficient understanding of the possibilities and limitations of functional analogy in the AI debates: It is either confounded with homological relations, which, as there are no common lines of descent between human beings and computers, results in the TT being rejected out of hand as a test for any possible cognitive ability of the latter. Or analogous functions are considered coextensive with a set of phenotypical traits similar, qua simulation, to those of human beings. Either way, it shows that inferences to possible cognitive functions of the traits in question are not warranted by phenotypical similarity. Unless an analogy of function can be achieved, the charge of gimmickry against the TT cannot be safely defused. If however such an analogy can be achieved, the test itself would not deliver the evidence necessary for properly assessing that analogy, nor would it provide much in the way of a suggestion how that analogy could be traced.
One might be tempted to put the blame for this insufficient understanding of functional analogy on Turing himself – but that might be an act of historical injustice. Firstly, he did not claim functional analogies to be achieved by his simulations. Secondly, some of the linkages between the formal-mathematical models which he developed and more recent concepts of evolution that comprise the role of populations and environments in shaping organic functions were not in reach of his well-circumscribed theory of computation. They were not even firmly in place at the time of his writing. Much of contemporary evolutionary reasoning owes to the Modern Synthesis in evolutionary biology, which was only in the process of becoming the majority view among biologists towards the end of Turing's life.6 With the benefit of hindsight however, and with the clarification of matters that it allows, is there any role left for the TT to be played in inquiries into human cognition – which have to concern, first and foremost, the functions of human cognition? Could it still function as a simulation of serious scientific value? Or, trying to capture Turing's ultimate, trans-mathematical objective more precisely and restating the opening question of this paper: Could the TT still help to identify the common ground human beings and machines would have to share in order to also share a set of cognitive traits? For modified forms of that test at least, the answer might be positive. First of all, one should be clear about what kind of simulation the TT is supposed to be. If my reconstruction of Turing's proximate aims is valid, the imitation game was intended as a demonstrative simulation of the force and scope of the computational method, with no systematic cognitive intent. By many of its interpreters and critics however, it was repurposed as an investigative simulation that, at a minimum, tests for some of the behavioural cues by which people normally discern signals of human intelligence in communication, or that, on a maximal account, tests for the cognitive capacities of machines proper. The notions of demonstrative and investigative simulations are distinguished in an intuitive, prima facie fashion in [16, p. 7 f], but may not always be as clearly discernible as one might hope. Demonstrative simulations mostly serve a didactic purpose, in reproducing some well-known behaviours of their subject matter or "target" in a different medium, so as to allow manipulations of those behaviours' variables that are analogous to operations on the target proper. The purpose of flight simulators, for example, lies in giving pilots a realistic impression of the experience of flying an airplane. Events within the flight simulation call for operations on the simulation's controls that are, in their effects on that simulation, analogous to the effects of the same operations in the flight that is being simulated. The physical or functional structure of an airplane does not have to be reproduced for this purpose, nor, of course, the physical effects of handling or mishandling an in-flight routine. Only an instructive simile thereof is required. I hope to have shown that this situation is similar to what we encounter in the TT, as originally conceived. No functional analogy between simulation and target is required at all, while the choice and systematic role of observable similarities is contingent on the didactic purpose of the simulation.
An investigative simulation, on the other hand, aims at reproducing a selection of the behaviours of the target system in a fashion that allows for, or contributes to, an explanation of those behaviours' effects. In a subset of cases, the explanation of the target's functions is included, too. Here, a faithful mapping of the variables of the simulation's behaviours, and their transformations, upon the variables and transformations on the target's side is of paramount importance. No phenomenal similarity is required, and a mere analogy of effects is not sufficient, as that analogy might be coincidental. Instead, some aspects of the internal, causal or functional, structure of the target system will need to be systematically grasped. To this purpose, an investigative simulation is guided by a theory concerning the target system, while the range of its behaviours is not exhausted by that theory: Novel empirical insights are supposed to grow from such simulations, in a manner partly analogous to experimental practice.7 I hope to have shown that this is what the TT might seem to aim at, but does not achieve, as there is no underlying theory of the cognitive traits that appear to be simulated by proxy of imitating human conversational behaviour. An alternative proposal for an investigative role of the TT along the lines suggested above would lie in creating analogues of some of the cognitive functions of communicative behaviour. Doing so would not necessarily require a detailed reproduction of all or even most underlying cognitive traits of human beings. Although such a reproduction would be a legitimate endeavour taken by itself, though probably a daunting one, it would remain confined to the same individualistic bias that marked Turing's own approach. A less individualistic, and perhaps more practicable, approach might take supra-individual patterns of communicative interaction and their functions rather than individual minds as its target. One function of human communication, it may be assumed, lies in the co-ordination of actions directed at shared tasks. If this is so, a modified TT-style simulation would aim at producing, in evolutionary fashion, 'generations' of communicative patterns to be tried and tested in interaction with human counterparts. The general method would be similar to evolutionary robotics,8 but, firstly, placed on a higher level of behavioural complexity and, secondly, directly incorporating the behaviour of human communicators. In order to allow for some such quasi-evolutionary process to occur, there should be a reward neither for the machine passing the TT nor for the human counterpart revealing the machine's nature. Instead, failures of the machine to effectively communicate with its human counterpart, in relation to a given task, would be punished by non-reproduction, in the next 'generation', of the mechanism responsible for the communicative pattern, replacing it with a slightly (and perhaps randomly) variant form of that mechanism. In this fashion, an adaptive function could be established for the mechanism in question over the course of time.

6 For historical accounts of the Modern Synthesis, see, for example, [10, 6].
7 For this argument on the epistemic role of computer simulations, see [22].
8 For a paradigmatic description of the research programme of evolutionary robotics, see [14].
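To make the proposed selection regime more tangible, here is a minimal sketch of such a quasi-evolutionary loop, written under strong simplifying assumptions of ours: a "communicative pattern" is reduced to a vector of parameters, and the success score – which in the envisaged setup would emerge from co-ordinated interaction with human counterparts on a shared task – is replaced by a toy stand-in function.

```python
import random

def task_success(pattern):
    # Hypothetical stand-in for the shared human-machine task; in the
    # proposed setup this would be measured from joint interaction,
    # not computed from a known target.
    target = [0.7, 0.2, 0.9]
    return -sum((p - t) ** 2 for p, t in zip(pattern, target))

def mutate(pattern, sigma=0.05):
    """Reproduce a pattern in slightly (and randomly) variant form."""
    return [p + random.gauss(0.0, sigma) for p in pattern]

# A 'generation' of communicative patterns.
population = [[random.random() for _ in range(3)] for _ in range(20)]

for generation in range(50):
    ranked = sorted(population, key=task_success, reverse=True)
    survivors = ranked[: len(ranked) // 2]   # failures are not reproduced
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(len(ranked) - len(survivors))]

print(max(task_success(p) for p in population))
```

Note that nothing in the loop rewards passing for human or unmasking the machine; a pattern persists only insofar as it serves the shared task, which is how an adaptive function could accrue to it over the generations.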
Turing indeed hints at such a possibility when briefly discussing the "child machine" towards the end of [19, pp. 455–460] – a discussion that, in his essay, appears somewhat detached from the imitation game proper. For such patterns to evolve, the setup of the TT as a game of imitation and deception might have to be left behind – if only because imitation and deception, although certainly part of human communication, are not likely to constitute its foundation. Even on a fairly pessimistic view of human nature, they are parasitic on the adaptive functions of communication, which are more likely to be cooperative.9 Under this provision, humans and machines would be endowed with the task of trying to solve a cognitive or practical problem in co-ordinated, perhaps collaborative, fashion. In such a situation, the machine, intriguingly, would be conceived of neither as an instrument of human problem-solving nor as an autonomous agent that acts beyond human control. It would rather be embedded in a shared environment of interaction and communication that poses one and the same set of challenges to human and machine actors, with at least partly similar conditions of success. If that success is best achieved in an arrangement of symmetrical collaboration, the mechanisms of selection of behavioural patterns, the behavioural tasks and the price of failure would be comparable between human beings and machines. By means of this modified and repurposed TT, some of the functions of human communication could be systematically elucidated by means of an investigative simulation. That simulation would establish functional analogies between human and machine behaviour in quasi-evolutionary fashion.

9 For an account of the evolution of co-operative functions, see, for example, [12, ch. 2].

5 Conclusion

It might look like an irony that, while the author of the TT contributed to developing the notion of decoupling the function of a system from its physical structure, the very notion of functional analogy – by which, on the analysis presented in this paper, the common ground that human beings and machines would have to share in order to indicate what cognitive traits they may share can ultimately, at least in theory, be identified – did not enter that same author's focus. As indicated in section 4 above, putting the blame on Turing himself would be an act of historical injustice. At the same time however, my observations about the formalistic and individualistic biases built into Turing's computational method do nothing to belittle the merits of that method as such, as its practical implementations first allowed for devising computational models and simulations of a variety of functional patterns in a different medium, and as its theoretical implications invited systematic investigations into the physical underdetermination of functions in general. In some respects, it might have taken those biases to enter this realm in the first place.

References

[1] Margaret A. Boden, Mind as Machine: A History of Cognitive Science, Oxford University Press, Oxford, 2006.
[2] B. Jack Copeland, 'The Turing Test', Minds and Machines, 10, 519–539, (2000).
[3] The Essential Turing, ed., B. Jack Copeland, Oxford University Press, Oxford, 2004.
[4] B. Jack Copeland, 'The Church-Turing Thesis', in The Stanford Encyclopedia of Philosophy, The Metaphysics Research Lab, Stanford, spring 2009 edn., (2009).
[5] Charles Darwin, On The Origin of Species by Means of Natural Selection. Or the Preservation of Favoured Races in the Struggle for Life, John Murray, London, 1 edn., 1859.
[6] David J. Depew and Bruce H. Weber, Darwinism Evolving. Systems Dynamics and the Genealogy of Natural Selection, MIT Press, Cambridge/London, 1995.
[7] Kurt Gödel, 'Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I', Monatshefte für Mathematik, 38, 173–198, (1931).
[8] David Hilbert and Wilhelm Ackermann, Grundzüge der theoretischen Logik, J. Springer, Berlin, 1928.
[9] Andrew Hodges, 'What Did Alan Turing Mean by "Machine"?', in The Mechanical Mind in History, eds., Philip Husbands, Owen Holland, and Michael Wheeler, 75–90, MIT Press, Cambridge/London, (2008).
[10] Ernst Mayr, One Long Argument. Charles Darwin and the Genesis of Modern Evolutionary Thought, Harvard University Press, Cambridge, 1991.
[11] Ruth Garrett Millikan, Language, Thought, and Other Biological Categories, MIT Press, Cambridge/London, 1984.
[12] Ruth Garrett Millikan, Varieties of Meaning, MIT Press, Cambridge/London, 2004.
[13] James H. Moor, 'An Analysis of the Turing Test', Philosophical Studies, 30, 249–257, (1976).
[14] Stefano Nolfi and Dario Floreano, Evolutionary Robotics: The Biology, Intelligence and Technology of Self-Organizing Machines, MIT Press, Cambridge/London, 2000.
[15] Richard Owen, On the Archetype and Homologies of the Vertebrate Skeleton, John van Voorst, London, 1848.
[16] Philosophical Perspectives in Artificial Intelligence, ed., Martin Ringle, Humanities Press, Atlantic Highlands, 1979.
[17] D'Arcy Wentworth Thompson, On Growth and Form, Cambridge University Press, Cambridge, 2 edn., 1942.
[18] Alan M. Turing, 'On Computable Numbers, with an Application to the Entscheidungsproblem', Proceedings of the London Mathematical Society, s2-42, 230–265, (1936).
[19] Alan M. Turing, 'Computing Machinery and Intelligence', Mind, 59, 433–460, (1950).
[20] Alan M. Turing, 'The Chemical Basis of Morphogenesis', Philosophical Transactions of the Royal Society, B, 237, 37–72, (1952).
[21] Blay Whitby, 'The Turing Test: AI's Biggest Blind Alley?', in Machines and Thought, eds., Peter Millican and Andy Clark, volume 1 of The Legacy of Alan Turing, 53–62, Clarendon Press, Oxford, (1996).
[22] Eric B. Winsberg, Science in the Age of Computer Simulation, University of Chicago Press, Chicago, 2010.
[23] Larry Wright, 'Functions', Philosophical Review, 82, 139–168, (1973).

Battle of signifiants

Abstract

Memetics is a young science which is still establishing its relations with other sciences – we shall glance over a few of them. Since a huge number of the entities called memes also belong to the set of linguistic signs, the exchange between linguistics and memetics can be expected to be fecund. We shall therefore use basic linguistic terminology, especially the lexicological notion of the morpheme, defined as the «smallest unit having a meaning», as a source of metaphors which could be helpful in the interpretation of phenomena that could be labeled as «memetic».
By using this new theoretical framework, and by focusing only upon the relations among a few «allomemes» used for expressing happiness or amusement – for example emoticons like :-) and :) , or the morpheme «lol» – we shall try to offer the reader an illustrative example of our «back to the signs» method, which could possibly lead to the formalization of memetics, and thus to its firm establishment at the point of contact between the natural and the human sciences. We shall also show how the formal properties of a meme-sign – for example its length, or the number of possible modalities of its expression – influence its fitness, and thus its distribution in the population of hosts.

1. Introduction

"A scholar is just a library's way of making another library."
maxim of Daniel Dennett

1.1 Memetics

The inspiration for memetic science came from biology. The book The Selfish Gene, in which R. Dawkins introduced his meme concept as "a unit of cultural transmission or a unit of imitation" (Dawkins, 1976), was, in the first place, a book about the (socio)biological hypothesis that the basic components of the hereditary code – genes – «have their own interests» in the process of evolution. This «selfish interest of a gene» can sometimes be contrary to the interests of the other genes within the code, or even to the interest of the «hosting» entity as a whole. The «selfish» meme was thus created as an analogy to the «selfish» gene. Both memes and genes belong among the replicators – replicators being entities which have a tendency to make copies of themselves. Genes are molecular structures stored within the cellular nucleus; memes are – for the internalist school of memetics – somewhat vaguely defined «information structures» stored within the brain's synaptic networks, or are – for the school of externalists – «externalized» within material artifacts. Genes replicate by the processes of DNA/RNA translation and transcription; memes do it by completely different means – by the process of imitation. There are many other metaphoric parallels between genetics and memetics – the science of heredity was and will remain a great terminological and methodological inspiration. Some speak even of «memetic engineering»; others, inspired by virology, use the term «viruses of the mind» when speaking about memes like terrorism, drug addiction or... ideologies and religions. In nature, errors during the process of replication can be observed. These errors – mutations – lead to different properties of the replicated entities. If replication takes place within a system with limited resources which are essential for replication, the accumulation of differences caused by mutations will logically result in competition → selection → Darwinian evolution. Since the resources essential for meme replication – in other words, the «brain potential» of an individual or a population – are finite, the emergence of a second form of evolution is an inescapable consequence of the existence of these «brain-based» replicators. This «second form» of Darwinian evolution is, for some memeticians, a term synonymous with the cultural evolution known from anthropology. Memes are supposed to be for cultural evolution what genes are for the good old biological evolution of natural species. Memes are the basic units of cultural evolution. The ever-growing complexity of culture is a result of evolutionary forces acting upon memes.
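The chain competition → selection → Darwinian evolution described above can be condensed into a small simulation. The sketch below is ours, with arbitrary toy parameters: replicators are copied, with occasional errors, into a pool of fixed size, so any variant that replicates more reliably must crowd the others out.

```python
import random

POOL_SIZE = 200        # the finite "brain potential" of the population
MUTATION_RATE = 0.01   # probability of a copying error per replication

def fitness(variant):
    # Arbitrary toy criterion standing in for whatever makes a meme
    # more likely to be copied; not an empirical measurement.
    return 1.0 + 0.1 * variant.count("a")

population = ["ab"] * POOL_SIZE
for step in range(100):
    # Reproduction biased by fitness; the pool size stays fixed, so
    # every new copy of one variant displaces a copy of another.
    weights = [fitness(v) for v in population]
    population = random.choices(population, weights=weights, k=POOL_SIZE)
    # Copying errors (mutations) introduce new variants.
    population = [v + "a" if random.random() < MUTATION_RATE else v
                  for v in population]

print(max(set(population), key=population.count))  # the prevailing variant
```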
Whether memetics will become a firm science, or will lose its momentum and disappear like some bizarre postmodern pseudoscientific discipline, will strongly depend on progress within psychology and the cognitive sciences. If these disciplines are able to furnish memetics with some evidence of the material basis of replicators within the brain, memetics will flourish. While there is still-growing evidence that the entities within human memory are not passive aggregates of data, but active entities trying to get affirmed by repetitive expression,1 either in the form of an internal obsessive idea or an externalized habit (Hromada, 2007), the existence of memes within brains can be proven only indirectly. As one of the memetic pioneers puts it: «...instead of language based on a concrete mechanism of information storage, we must settle for an abstract representation of the information brainstored. Thus, memory abstractions form the basis for memetic evolution theory.» (Lynch, 1998) «The memetics movement split almost immediately into those who wanted to stick to Dawkins' definition of a meme as "a unit of information in the brain," and those who wanted to redefine it as observable cultural artifacts and behaviours.» (http://en.wikipedia.org/wiki/Memetics#Internalists_and_externalists). Since the only way to prove the existence of memes within the mind is introspection, the endeavours of psychology-oriented internalists, or intrapersonal memeticians, are doomed to battle with deep methodological problems. On the other side, the endeavours of sociology-oriented externalists, or interpersonal memeticians, can be based upon a huge amount of solid empirical data which can be analysed by well-known hypothesis-testing methods. The attitude of the interpersonal memetician is not to ask much about the events occurring within the brains of individuals, simply because these events are determined by too many variables. For an interpersonal memetician, a «meme» exists only if it is articulated, externalized, expressed into the world. By the act of expression, ephemeral brain content becomes an empirical object – a phoneme studied by the phonetician, a grapheme in a book, a character in a database or an artifact on a shelf. After measuring the populations and distributions of the different types of these objects, he tries to bring to light the dynamics inherent in these populations. In the ideal case, the dynamics of a memetic system can be explained by the properties of the system alone, reference to human beings as the hosts of the «memes» being only secondary. For such a scientist, it is the «will of the meme» which often determines the will of a man; much less often is it the other way around. This is the stance which will be applied in this article.

1 For an interpersonal memetician, a meme is conceived, in the first place, as an entity replicated by imitation among subjects. Intrapersonal memetics, on the contrary, studies memetic phenomena within one subject only, and thus the only way such a meme can be replicated within, or by, one subject is repetition. Studies of repetitive behaviour, like the circular reactions of babies (Piaget, 1961), lead to the establishment of intrapersonal memetics.

1.2 Linguistics

«For the sign always escapes, in a certain measure, the individual or social will – that is its essential characteristic» (Saussure, 1962).
If, within the preceding sentence, we commuted the word «sign» with the word «meme», we would obtain one of the central postulates of memetics. This citation, given by the founder of modern structuralism to his students almost a century ago, can be found within the chapter where Saussure tries to establish a new science called semiology, with the definition: «a science which studies the life of signs in the midst of social life». If we allowed ourselves to reduce memes to signs, would it not also be a beautiful definition of memetics? «A sign, broadly defined, designates an element X which represents another element Y, or serves as a substitute for it» (Niklas-Salminen, 1997). Accepting this definition, we may say that every «articulated meme» is a sign in the sense that it refers to the existence of the «brain structure» which gave birth to its articulation into the empirical world. Element X is an object or a set of objects within the world – a word, a picture, an observed behaviour – while element Y is a mental representation. The existence of this mental representation is the causa efficiens of the creation of element X. Since we could not have element X without element Y, the fact of X refers to the fact of Y.2 In such a way, memes as described in part 1.1 can always be conceived as signs. Thus, we can and we will find fruitful resemblances between semiology and memetics. Saussure himself was a linguist, in fact one of the best linguists of all time, and he developed the notion of a semiological science for one principal reason: he needed to situate the nascent linguistics somewhere. He situated it within semiology; linguistics thus became a «branch of semiology» dealing with linguistic signs. At the beginning of the 21st century, linguistics is, without any doubt, the most evolved semiological discipline. And more – it is a science whose object is phenomena produced by man, and thus a human science, but its methods are as rigorous as those of physics and its symbolic system is formalized – at least phonetics is, without any doubt, also a natural science. Within this article we will try to show that the highly evolved science of linguistics can play the role of a wise old sister for pubescent memetics. Just as linguistics reduces the wide field of interest of semiology to the field of linguistic signs alone, the «linguistic memetics» of this article will specialize only in «linguistic memes». The linguist analyzes language in relation to humans – he asks, for example, what functions language offers to people, and finds at least six answers (Jakobson, 1963; Yaguello, 1981) – while the memetician asks what language does for itself. Language can be understood as a huge, socially distributed complex of memes – S. Blackmore proposes the term «memeplex» (Blackmore, 1999). As we said earlier, a meme is a structure which copies itself from brain to brain. It can either be a rule (of behaviour), or the matter upon which this rule is applied.

2 The element X is in semiology called the signifier (le signifiant); the element Y – the mental representation, the concept – is called the signified (le signifié). «For the Post-Structuralists, the signifier is now the dominant unit and can be considered as analogous to the meme.» (Gatherer, 1997). Even though we do not consider ourselves to be post-structuralists only, this attitude will be applied in this article – the meme is for us much more identified with the signifier X than with the obscure signified Y.
In the current state of our knowledge, we do not dare to make any concrete observations or hypotheses when it comes to the grammatical rules of syntax. But since the attitude of the memetician is to expect that the «organization of brain structures» is imitated from «outside» rather than generated from «within», it can be expected that grammatical memetics could come into conflict with the generative grammarians of N. Chomsky. The other set of rules known to linguists, however – the so-called phonological rules – can serve as a wonderful source of data. Phonological rules are induced from corpora which mirror the language used within a given society during given periods of time. By comparing corpora from different times or different locations, one can observe changes within the speech habits of individuals – changes of accent, disappearances or additions of phonemes etc. Now, knowing that a habit is a meme, we can say, using the terms of memetics: looking at the corpora created and the rules induced by phonologists, one can observe the rise of new memes and their battle with the old ones for the time and energy of brains. One can observe the rise of a memetic infection, and its fall.

1.3 Lexical statistics and memetic dynamics

As we stated earlier, language is a memeplex3 consisting of two classes of memetic structures: memes acting as rules of grammar, and memes acting as the contents upon which these rules are applied – phonemes, morphemes and words. That grammatical rules and phonological habits are memes spread by imitation is obvious to anyone who has ever spent time doing fieldwork with babies. We hope to persuade the others during our future experiments, when our methods are more evolved and our grants higher than zero. In this textual experiment we will focus not on meme-rules, but on meme-contents.4 Moreover, since our corpus is textual and not auditory, it would be useless to focus our attention on the phonetic layer. We aim to find the basic units of memetic exchange, but to say that phonemes are the basic atoms of linguistic memetic exchange means, for us, taking the analysis too far. We are persuaded that the better candidates for memes lie at higher levels – memes are either morphemes or words.5 «The morpheme is defined as the smallest meaningful unit of the language» (Niklas-Salminen, 1997). The morpheme is sense embodied; it is a semantic domain condensed and expressed; it is the smallest possible signifier X for a signified Y. If we take, for example, the word «lover», the morpheme «lov» refers to the semantic domain related to acts of love, while the grammatical morpheme «er» refers to a semantic domain which could be described as «agent of the related action». Because the morpheme is, ex vi termini, a basic atom of linguistic exchange which has both a «signifier» and a «signified» side, it can possibly serve as a candidate for a meme.

3 The thesis that language is truly a memeplex – in other words, an entity which creates its own mechanisms catalysing its own replication – seems all the more true in light of the existence of institutions like, for example, CREDIF – Centre de recherche et d'études pour la diffusion du français.
4 Knowing that a content can also be described by a rule – by a rule of production of the given content – we do not see any major difference between the description of rules and the description of contents.
5 In fact there is one more unit which could possibly be a meme – the syllable. We leave it out as a potential candidate for a meme for the same reason as the phoneme: the syllable does not have a sense. Even though we will focus on signifiers and not signifieds, we cannot, and do not want to, pretend that the semantic level does not exist. It exists, and it is the raison d'être of language.
The problem with morphemes is that they are artificial constructs of linguists, similarly as atoms – however useful this notion is – are artificial constructs of physicists (and memes artificial constructs of memeticians ;). Normal humans, uninfected by linguistics, have a natural tendency to perceive the «word» as the basic meaning-carrying unit. Even though it is in fact not at all easy to define what a word is, and the ways of defining it are not at all evident (Niklas-Salminen, 1997), people would most probably divide a sentence into words, not morphemes, if asked to analyse the sentence into simpler elements which are not phonemes – even in the case of languages like French, where no accent or pause marks the difference between words and morphemes. And their answer as to why they did it like that could be a simple «because we felt it». The notion of the word as the «basic unit of something» is simply deeply rooted in man's mind. So it could also be a good candidate for a meme. We leave the question of whether it is the word or rather the morpheme which should be identified with the «meme» undecided within the scope of this article. The word is conceived as a lexical unit by the science called lexicology. This science, also relatively new but in any case in a more evolved state than memetics, could offer its fruits to that branch of memetics which deals with the evolution and distribution of memeplexes composed of linguistic signs. Here are a few lexicological terms whose application within memetics could be more than fruitful:

Word availability («disponibilité») – a «variable which depends not only on the subject's knowledge of the vocabulary of a given language, but also on the conditions under which it is expressed» (Niklas-Salminen, 1997). This notion, and the related studies of aphasics, could be useful especially in the sphere of intrapersonal, cognitive memetics. If we understand that the first essential step of memetic replication is expression, which necessarily begins with a word «coming to one's mind or mouth», we see that «word availability» can be identified with «the word's/meme's own tendency to get expressed and thus replicated».

Lexicon – the totality of all the words used within a given language. Virtually infinite. Potentially identifiable with the «memetic pool» within the frame of linguistic memetics.

Word frequency – the number of occurrences of a given word within a corpus. The concept of word frequency is the core term of lexical statistics. Lexical statistics «is the application of statistical methods to the description of a vocabulary» (Niklas-Salminen, 1997).
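Since word frequency is the operational core of everything that follows, a minimal sketch of the computation may be useful. The toy corpus is our invention, standing in for the real message database introduced in the next section.

```python
from collections import Counter
import re

# Reduce a corpus to word frequencies: tokenize, normalize case, count.
corpus = "the meme spreads and the meme mutates and the host imitates"
tokens = re.findall(r"\w+", corpus.lower())
frequencies = Counter(tokens)

print(frequencies.most_common(3))
# [('the', 3), ('meme', 2), ('and', 2)]
```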
It is important to notice that lexical statistics is a quantitative science, and thus mathematically formalizable. It is the discipline which aims to describe and compare textual corpora in terms of word frequencies, but these corpora are very often either artificial aggregates of «as much data as possible», like the Trésor de la langue française, or corpora composed of the artistic production of one author. With the progress of digital communication, the amount of real-life textual data grows fabulously – data in the form of emails, mobile phone messages, discussion forum submissions etc. We can, of course, analyse these data by the methods of lexical statistics, but our goal is different. We do not want to describe a vocabulary, nor to find out whether this or that man is the real or fictitious author of this or that book. We want to analyze the rates of change of meme frequencies within a corpus describing a real social system. We aim to discover the inherent dynamics of this system. Afterwards, we will be allowed to formulate hypotheses and generalize conclusions about what we have found. And last but not least, we want to create models which will allow us to predict the future state of a given social system.

2. The Experiment

The memes themselves are like fractals—they can apply to content as fine-grained as words, lines and study locations or as general as complete discourses, complete drawings and complete research articles. (Dirlam, 2003)

2.1 The corpus – the battlefield

Our corpus is our jewel. It is the database of a virtual community present on the domain http://kyberia.sk, the author of this article being its founder and senator. We have at our disposal all the data of this system from the day of its creation in 2001 until July 2007. This database has two principal sections: a section of «data nodes» – forums, articles, blogs, friendship commentaries etc. – of which there are 2,266,901 in total, created by 8,517 users; and a second section comprising the 3,385,647 messages exchanged among 3,870 users. Being only at the beginning of our research, we have decided to leave aside the much more complex6 section composed of data nodes, in order to focus on the less complex but (from the quantitative point of view) richer set – the messages exchanged among the users. Kyberia.sk, rated as «the first successful community server in Slovakia» (http://pocitace.sme.sk/clanok.asp?cl=3509119), is above all a dense social network. The relations among the users are far from being only virtual – it is not a set of anonymous persons, but a true human mini-society, a veritable «temporary autonomous zone» (Hakim Bey, 1984). The fact that, in order to become part of the domain, the community has to accept the registration request of the «novice» makes kyberia a «semi-autonomous» zone, which leads to the creation of a collective identity – «we are part of the tribe of kyberia». There is, evidently, a great number of methodological approaches that one could use, and infinitely many phenomena that one could study, with respect to the community of kyberia.

6 It is thanks to the complexity and multicontextuality of this system of data nodes that we allowed ourselves to propose the precursor of the first law of internal memetics: «The probability of the repeated articulation of a sign is inversely proportional to the period of time which has elapsed since the last articulation of that sign, and this independently of context» (Hromada, 2007). We have devoted other articles to the ethical problems of our research.
The lexicologist will find there the emergence of new words and affixes; the anthropologist will find, perhaps, an ever stronger tendency towards endogamy; the sociologist, knowing the composition of the community, could draw from it general hypotheses about the human society in general in which kyberia is immersed. He could doubtless do so, as the social network of Kyberia is a mirror of urban Slovak-Czech society, and the database of kyberia is a mirror of this social network. We will use this textual database as a sample of empirical data. To this sample we will apply the methods of lexical statistics, and we will try to interpret our results in memetic terms.

2.2 :-) & :) – The actors

Globally, one can predict that if two words are used in exactly the same contexts, one of them will tend to disappear or to change its meaning. (Niklas-Salminen, 1997)

In the current state of our knowledge, we cannot say whether the meme, as the content of a linguistic exchange, is to be identified with the word or with the morpheme; we will therefore focus on monomorphemic linguistic signs, which are morphemes and words at once. Even though our sample contains millions of messages created by thousands of users over more than 4 years, it is preferable to choose for our first analyses linguistic signs whose frequencies of use are rather high, for we want to apply statistical methods, and the more data we have, the more truth these methods will bring to our results. Since the community of kyberia is composed of users speaking several languages, we must either take this fact into account and never forget it while interpreting our results, or simply concentrate on universal linguistic signs which go beyond the domain of any individual language. As we agree with the old maxim «Simplicity is the ultimate sophistication» (Da Vinci), we have chosen the second path, which leads us naturally to the set of linguistic signs7 called «emoticons». An emoticon is defined as «a representation, in typographic characters, of an emotion» (http://fr.wikipedia.org/wiki/%C3%89motic%C3%B4ne). During their 25 years of existence, emoticons have managed to invade the brains of billions of inhabitants of the Earth. They are thus memes par excellence, well adapted to the milieu of human society thanks to their resemblance to the human face, and their extreme efficiency makes them worthy of our attention. We do not want to mix apples and oranges in our analyses, so we will look only at the «smileys» signifying «the emotion of amusement, laughter, happiness».
Emoticons such as :-( , ;-( or :( will be a priori excluded from our analyses, since their signified – in other words, the emotion they express, the inner state of the host – is different. In this first analysis we will even exclude emoticons like ;-) , ;) or ;-] , whose high frequency of use, in comparison with the rest of the Internet, is a particularity of kyberia. Although such an analysis would be fruitful, we are not sure whether, in the cases of ;-) and :-) for example, we are dealing with emoticons coding the same emotion – having the same function. In short, we are not sure whether or not they are synonymous emoticons. Our doubt stems from the difference between the characters coding the eyes: while the colon in :) looks «normal» to everybody, the semicolon in ;) often carries the connotation of «refined banter». In brief – it is possible that they are not synonyms. And our first step will be the analysis of synonyms – we define synonyms as two different signifiers whose semantic domains overlap, intersect, are almost identical. They serve the same function8 in the language and in life – they can be commuted without a change of meaning. From the memetic point of view, we say that synonyms are allomemes9 if they are present in the same host. The host, in memetics, is a human being whose brain contains the neural representation of a meme and who spreads the meme through his activity, voluntary or involuntary. We call expression the act of the host which transforms the mental representation into an empirical object, perceptible by other potential hosts. The emoticon :-) , which we will also call the «classical smiley», and the emoticon :) , which we will also call the «occidental smiley», are synonyms – they refer to the same semantic domain, to the same intention of a speaker who wants to use them to express positive emotions or to employ a «light» register. They are allomemes if they are present in the brain of one and the same speaker. How can we know whether they are really present there, without being obliged to crack open the skull of the latter? Simply. By observing the behaviour resulting from the expression of a meme by a subject, we can be sure that the brain of this subject contains an internal representation of this meme. If somebody has written :) anywhere and at any time in the past, and if we know it, we perceive him as a host of the occidental smiley – that is our methodological simplification.10 If, in another situation, the same subject expressed another meme, we would consider him a host of the latter as well.

7 We leave aside the terminological discussions of whether emoticons are in fact motivated linguistic signs or symbols. Their primordial forms – with the colon for the eyes, the dash for the nose and the parenthesis for the mouth – were, without doubt, strongly motivated. In other words, at the moment of their creation there was «a relation of formal resemblance between the form of the representing object and that of the represented object» (Niklas-Salminen, 1997); they were thus symbols. One of the aims of this essay is to show how the formal resemblance between signifier and signified withers away, increasing arbitrariness, thanks to the forces of memetic evolution.
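The methodological simplification just stated – whoever has ever written a given smiley is counted as its host – is straightforward to operationalize. The sketch below is ours; the message records and field names are hypothetical stand-ins for the kyberia message table.

```python
import re

# A user counts as a host of a smiley variant as soon as one of his
# messages contains it.
CLASSICAL = re.compile(r":-\)")   # :-) , the "classical" smiley
OCCIDENTAL = re.compile(r":\)")   # :)  , the "occidental" smiley
# (":-)" does not contain the substring ":)", so the two patterns
# cannot fire on the same token)

messages = [
    {"author": "alice", "text": "nice one :)"},
    {"author": "bob",   "text": "hehe :-) indeed"},
    {"author": "alice", "text": "see you :)"},
]

hosts = {"classical": set(), "occidental": set()}
for message in messages:
    if CLASSICAL.search(message["text"]):
        hosts["classical"].add(message["author"])
    if OCCIDENTAL.search(message["text"]):
        hosts["occidental"].add(message["author"])

print(hosts)  # {'classical': {'bob'}, 'occidental': {'alice'}}
```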
Today it is believed that the brain of a healthy subject can harbour a theoretically unlimited number of memes. If we know that a subject is the host of two words which play the same role in the subject's community (i.e. they are synonyms), these two words stand in an allomemic relation to each other for that subject. For a given subject[11] and a given period of time, we can measure the number of uses of these words in our corpus: their frequencies. In the case of synonyms, the allomeme whose frequency of use is higher is called dominant, the other recessive. If I know that the western smiley was expressed 23 times by subject X during the year 2007, while the classical smiley was expressed 42 times by the same subject over the same period, then the classical smiley is dominant over the western one.

[8] "The meaning of a word is its use" (Benveniste, 1974); "If we had to name anything which is the life of the sign, we should have to say that it was its use" (Wittgenstein, 1958).

[9] Allomeme is a neologism motivated by the term "allele", familiar to geneticists: "An allele is a given variant of a gene within a species. All alleles of one gene occupy the same locus (location) on the same chromosome" (http://fr.wikipedia.org/wiki/All%C3%A8le). Inspired by this definition, we define allomemes as "variants existing as neural representations in a human brain which represent the same intention or semantic domain and which are expressed through the act of expression by their host". Our notion is close to Boyd and Richerson's "cultural variant"; but whereas for them a "cultural variant" is a shorthand for "the information stored in human brains ... once people replace the concepts of folk psychology with scientific and reliable concepts" (Boyd & Richerson, 2005), the allomeme refers to expressed behaviour. Moreover, the concept of the allomeme becomes useful only where memes stand in opposition: it will aid our analyses only where there are at least two allomemic expressions of one semantic locus or one intention.

[10] He counts as a host even if he wrote the word by accident, for instance by making a mistake. It can indeed happen that he reproduces the same mistake, and what was at first only a mistake becomes a habit. In that case we speak of an involuntary memetic mutation, the kind which once lay at the root of almost all cultural diversity.

[11] Or for a set of subjects.

In Drawing 1 we see the first visualization of the kyberia data. Each column X represents the diachronic activity of one user; each row Y represents a period of time (one may say that it describes the community from the synchronic point of view). If during period Y user X more often used the classical smiley, we place turquoise at position X,Y; if the western smiley was his dominant allomeme, we place violet there; if he used neither of the two allomemes, we leave the position black. An image describing the changes of meme frequencies and distributions over time we call a memogram.

Drawing 1: A memogram of the kyberia community. The X coordinate represents a user, the Y coordinate a period of time; the colour at position X,Y represents the dominant allomeme of user X during period Y: violet for the dominance of the western smiley :) , turquoise for the classical smiley.
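The construction just described can be restated as a short program; the sketch below uses hypothetical per-user, per-period counts (the real analysis would read them from the kyberia database) and gives ties to the western smiley, a detail the text leaves open:

```python
# A minimal memogram sketch over invented counts -- not the original analysis code.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# counts[user][period] = (classical ":-)" count, western ":)" count); hypothetical data
counts = {
    "userA": {0: (5, 1), 1: (3, 4), 2: (0, 6)},
    "userB": {0: (0, 2), 1: (0, 0), 2: (1, 5)},
}

users = sorted(counts)
periods = sorted({p for u in counts for p in counts[u]})

# 0 = neither allomeme used (black), 1 = classical dominant (turquoise), 2 = western dominant (violet)
matrix = np.zeros((len(periods), len(users)), dtype=int)
for x, u in enumerate(users):
    for y, p in enumerate(periods):
        classical, western = counts[u].get(p, (0, 0))
        if classical == western == 0:
            matrix[y, x] = 0
        elif classical > western:
            matrix[y, x] = 1
        else:
            matrix[y, x] = 2  # ties go to the western smiley here (an assumption)

plt.imshow(matrix, cmap=ListedColormap(["black", "turquoise", "violet"]),
           vmin=0, vmax=2, aspect="auto")
plt.xlabel("user X"); plt.ylabel("period Y")
plt.show()
```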
We may notice a set of columns on the right which are entirely violet: a large number of users, roughly six tenths of all of them, for whom the classical smiley was never the dominant allomeme.

Drawing 2: A memogram visualizing the same data as Drawing 1, but sorted differently. A violet pixel represents dominance of the western smiley, a turquoise one dominance of the classical smiley.

In Drawing 2 we see the same data ordered differently. We may notice a small set of entirely violet columns, a small slope on the right, representing the users who never used the classical smiley at all: roughly one tenth of all of them. Let us now look at Drawing 3. Each red point represents the ratio Y between the population of active hosts[12] of the western smiley and the population of active hosts of the classical smiley during week X.

[12] An active host of a meme, for a given period, is a host who actually expressed the meme during that period. By contrast, a latent host is one who, as far as we know, expressed the meme before, but did not express it during the period under observation.

Drawing 3: The Y coordinate of each red point represents the number of people who expressed the western smiley during week X, divided by the number of people who expressed the classical smiley during the same week X.

We may notice that this ratio grows gradually over time. Applying one of the simplest methods of statistical analysis, the method of least squares (http://en.wikipedia.org/wiki/Least_squares), we computed the regression coefficient β = 0.027. Roughly speaking, abstracting away from the chaotic oscillations, the ratio between :) and :-) grows by 0.027 with every passing week.
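The regression itself takes only a few lines; in the sketch below the weekly ratios are invented placeholders standing in for the measured series behind Drawing 3:

```python
# Ordinary least-squares fit of the weekly western/classical host ratio -- illustrative data.
import numpy as np

weeks = np.arange(10)                                                  # week index X
ratio = np.array([3.0, 3.1, 2.9, 3.2, 3.3, 3.2, 3.5, 3.4, 3.6, 3.7])  # invented ratios Y

beta, intercept = np.polyfit(weeks, ratio, 1)  # slope and intercept of the least-squares line
print(f"regression coefficient beta = {beta:.3f} per week")
```

On the kyberia series this slope came out as β = 0.027.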
This gradual growth may be produced by:

- a decrease in the number of active hosts of the meme :-) ;
- an increase in the number of active hosts of the meme :) ;
- a combination of the preceding factors.

Drawings 1 and 2 show that there are many more users who have never written the classical smiley than users who have never written the western one. Could this fact be linked, to some extent, to the phenomenon of linear change we observe here? We maintain that it can, and we shall try to shed some light on this link. We claim that both phenomena result from the fact that the western smiley is a more stable form than the classical smiley. A mathematician of chaos would say that :) is a stronger attractor than :-) ; Darwinists might say that the "fitness" of the western smiley is higher than that of the classical one. In other words, it is more probable that someone will move from the classical smiley towards other allomemes, including the western smiley, than that someone will move from the western smiley towards other allomemes. We see at least two reasons for this alleged "stability":

1. Stability due to Boyd & Richerson's "frequency-based bias":[13] in Drawing 3 we see that, from the very beginning of the community's existence, the frequency of the western smiley was at least 3 times higher than that of the classical one. If the "frequency-based bias" algorithm influences human behaviour in some way, this initial difference in frequencies may have led to its own autocatalysis. If man really has a tendency to imitate the memes most common around him, the fact that :) was four times more widespread at the community's very beginning has the consequence that a new or undecided user adopts this variant, and that a user of :-) is subject to stronger pressure to change his habit than a user of :) .

[13] "Frequency-based bias: the use of commonness or rarity of a cultural variant as a basis for choice. For example, the most advantageous variant is often likely to be the commonest. If so, a conformity bias is an easy way to acquire the correct variant" (Boyd & Richerson, 2005).

2. Stability due to the properties of the sign/meme itself: the western smiley consists of two characters, the classical smiley of three. The host must therefore invest more energy to write :-) than to write :) . What is more, the probability of making a mistake while typing it is 3/2 times higher for :-) (three keystrokes instead of two, under a uniform per-keystroke error rate), so the host must also invest more energy into possible corrections. If the character which makes the difference (the nose, the dash) brought some advantage with it, this investment could make sense: "Speakers make this effort only insofar as it pays off" (Yaguello, 1991). But this effort does not pay off. The dash carries no new information. The colon informs the interlocutor that the string of characters to follow, if they are not alphanumeric, will be an emoticon; one may say that the colon plays an almost grammatical role. The parenthesis supplies the semantic content of the emoticon: is it an expression of sadness or of happiness? But the dash tells us nothing. It is merely a wreck of the primary motivation. It is redundant. It does not pay off: it will either find its own new distinctive signification, or disappear. Such are the laws of linguistics, and such are also the laws of memetics.
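The two mechanisms can be combined into a toy model; in the sketch below, every parameter (population size, initial shares, the chance that a dropped nose sticks as a habit) is invented, so it illustrates the direction of the argument rather than the kyberia data:

```python
# A toy model of the two stability mechanisms -- all parameters are invented.
import random

random.seed(1)
hosts = [":)"] * 600 + [":-)"] * 200  # initial 3:1 advantage of the western smiley
NOSE_LOSS = 0.01                      # chance that a typo dropping the nose sticks as a habit

for step in range(50_000):
    i = random.randrange(len(hosts))
    # frequency-based bias: copying a randomly chosen member favours the commoner variant
    hosts[i] = random.choice(hosts)
    # involuntary memetic mutation (cf. note 10): the three-keystroke form loses its nose
    if hosts[i] == ":-)" and random.random() < NOSE_LOSS:
        hosts[i] = ":)"

share = hosts.count(":)") / len(hosts)
print(f"western share after drift: {share:.2f}")
```

Both forces push in the same direction, so the classical smiley's share can only erode; removing the mutation line leaves pure conformity drift, under which either variant may still fixate by chance.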
Even though we do not wish to deprive the first explanation of the honour that legitimately belongs to it, we permit ourselves to focus our dear reader's attention on the second one. It is, in a certain sense, fundamentally different: while the explanation by "frequency-based bias" must appeal to the initial distribution of allomemes in the community, in the second case we need appeal only to the properties of the sign itself. Here is the first demonstration of how a strictly formal (and objective) property of a sign, the composition of the signifier, influences the number of infected hosts.

Whichever explanation is chosen, it should be noted that we have never resorted to explanations by a speaker's own will. Looking closely at Drawings 1 & 2, we notice cases where a user changed his habit at a given moment and never went back to the old habit again. This may well be a user who managed, by his own reason, to master the influence of the memetic forces around him. But such cases are exceptional and hard to attain. Far more often we watch a man succumb to the cultural forces around him. In fact, it may well be impossible for a human being not to succumb to them. Phonology, with its studies of the propagation and transformation of languages, dialects, accents and of the rules underlying them, gives us an abundance of examples. One cannot say no to an accent when one is immersed in it; one can hardly say no to a dictatorship of religion or ideology when everybody around believes in it. Interpersonal memetics thus studies social forces and their effects on the behaviour of human beings while abstracting from human interests. It is the interests and the properties of memes and their complexes that count, and that win in the long run. Memetics asks:

1. What are the causes[14] of the change from an old habit/rule/word X to a new habit/rule/word Y?
2. What are the causes for staying with the habit/rule/word Y?

[14] Cause and reason are not the same thing. Reasons are reflected upon; causes are blind, they follow natural law. Reasons pertain rather to logical and semantic causality, causes to physical causality. Reasons have their reasons; social phenomena have their causes.

And it tries to find the answers by studying the properties of the habits/rules/words themselves. It tries to determine the "fitness" of a given meme for a given environment and, from this knowledge, it tries to predict the future state of that environment.

2.3 The predator and the prey

"How many people are actually 'laughing out loud' when they send LOL?" (Crystal, 2001)

Drawing 4: The curves show the evolution of the monthly populations of active hosts of the memes "lol" (turquoise) and ":-)" (violet) in the kyberia.sk community. The red lines mark the principal "communication crises": whenever all the curves drop, we may speak of a communication crisis.

Its name is "lol". Nobody is sure what this signifier means. Wikipedia gives interpretations such as "laughing out loud", "laugh out loud", "lots of luck" or "lots of love". The author of this article long believed that "lol" meant "lots of laugh". No matter: lol is powerful even without a fixed meaning. For phoneticians it is merely a syllable beginning and ending with the same approximant "l" (a consonant that is Liquid, fLuid, Ludic), voiced by the close-mid back rounded vowel "o", a sound a little darker and less marked than "a" or "i", yet still very powerful. A signifier freed of its signified; an attractive form without content. It came into the world as a written form. In a certain sense it was lucky: before the Internet no such word existed in English, nor in German, nor in French. There was a hole, and it filled it. Only the word "l'Olanda" (the Netherlands in Italian) came close, but Italian has never been a language that shaped the evolution of the Internet. Let us now look at Drawing 4 to see its history in the kyberia.sk community. We may notice that during the first year the meme "lol" was less widespread than the classical smiley.
But the gap narrowed after the "first communication crisis",[15] and the population sizes became almost equal after the second crisis. Then we see a duel lasting several months, during which the classical smiley more or less managed to maintain a small lead. But after the very small third crisis, "lol" finally achieved a larger population than that of the classical smiley. The rise of the "lol" population is much steeper than that of :-) after the first "great traditional interruption of kyberia's functioning", designated as the fourth crisis, and likewise after the fifth and the sixth. We may therefore generalize and say that the meme "lol" spread much more effectively during communication crises. In other words, the fitness[16] of the meme "lol" during a communication crisis is higher than that of the meme ":-)". What might be the reasons for this phenomenon? We permit ourselves to offer a simple and potentially falsifiable hypothesis:[17] while the classical smiley and other emoticons can be expressed only in written form, the meme "lol" also has at its disposal another form, another modality of expression: the modality of speech. Indeed, one can observe an invasion of "lol" and its derived forms (cf. Appendix 1) into spoken language. A kyberia communication crisis affects only the functioning of the web system; the members of the community are therefore forced to resort to other means of communication if they want to exchange their memes. And not only do they want to exchange their memes: it is the very nature of memes to push their hosts towards their expression. It is the very nature of the human brain to want to express its contents, and if one modality of expression cannot be used, another one is used. When the kyberia software is down, memes which can be expressed only in writing are severely handicapped. Not only is proliferation into new hosts no longer possible; their power weakens within their past hosts. They will quite simply be forgotten. In other words: the more the frequency of a meme in the external world decreases → the less entrenched its neural representation in the hosts' brains[18] → the less available the word → the smaller the probability of a future expression → the further the frequency decreases. In short: an auto-inhibition. Other memes which do not suffer from this handicap and which fulfil more or less the same function soon come to take its place. The exchange of memes continued to exist during every crisis; it merely changed form: the graver the software crisis, the further the barycentre of communication shifted towards spoken communication.

[15] We call a "communication crisis" a period in the history of the system during which the total number of memes exchanged among the members of the system is reduced. In the case of kyberia's history, communication crises were often caused by server problems, by transitions to new versions of the software, or by other events rooted largely outside the kyberia system. In contrast to a communication crisis, the normal state of the kyberia system is characterized by growth of the total population, caused by the influx of new users.

[16] "The overall survival and proliferation rate of a meme m can be expressed as the meme fitness F(m), which measures the average number of memes at moment t divided by the average number of memes at the previous time step or 'generation' t − 1." (Principia Cybernetica Web, http://pespmc1.vub.ac.be/MEMEFITN.html)

[17] I thank my friend Lubos Iskra, whose idea it was to explain this phenomenon by the handicap of the written modality.

[18] A faithful reader might object that a decrease in the external frequency of a meme does not automatically lead to a weakening of the associated neural representation, provided that the internal frequency stays high, for example thanks to meditation. The objection is entirely pertinent within the framework of an intrapersonal memetics, but it lies beyond the scope of this article, which deals with questions raised by interpersonal memetics. We thank our reader for her exceptional understanding.
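Note 16's definition can be restated compactly; writing N(m, t) for the number of copies of meme m present at generation t, the fitness reads:

```latex
% Restatement of note 16: N(m,t) = number of copies of meme m at generation t
F(m) = \frac{N(m,\,t)}{N(m,\,t-1)}
```

On this reading, F(m) > 1 over a given period means the meme is proliferating and F(m) < 1 that it is receding; the claim above is that during a communication crisis the fitness of "lol" exceeds that of ":-)".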
This spoken communication naturally favoured the proliferation of "lol", which is a short and strong signifier; its fitness is truly great. After all, who would say "colon, dash, parenthesis" when meeting friends (and potential hosts) in a bar? Every personal encounter potentially served as an incubator of "lol". We are thus looking at the consequences of the simple fact that the meme "lol" can be expressed by the mouth, whereas the meme :-) can be expressed only by the activity of the hands. Here is the second demonstration of how a strictly formal (and objective) property of a sign itself, the composition of the signifier, influences the number of infected hosts, the success of a meme as a meme. We shall therefore end our little excursion with the first postulate of our little theory:

The fitness of a meme is proportional to the number of expression modalities through which that meme can be expressed. The consequences are most visible when one or several of the expression modalities are restricted.

3. Extroduction

Vaecitryaḿ prákrtadharmah samánaḿ na bhaviśyati: Diversity, not identity, is the law of nature. (Anandamurti, 1961)

We are building our terminology, we are formulating our first postulates; we are in the process of establishing a science. First an empirical science, since it is from empirical data that we draw our hypotheses. Then a formalized and mathematized science. A human science, for the object of its interest is man and his activity. A social science, for the object of its interest is man within social life. But also a natural science, insofar as we aspire to predict the future. Imagine the trio of tetragrams BRHM, ALLA and JHVH, and a human community incapable of articulating the consonant H. Ceteris paribus, we could predict one aspect of the future state of this community knowing only:

- the first postulate of our theory;
- the properties of the observed memes: JHVH contains two H's, BRHM one, ALLA none;
- the properties of the "environment": the hosts cannot produce glottal fricatives.

We would then know that, over a sufficiently long horizon, it is the form ALLA that will succeed in infecting the greatest number of host brains within this population; BRHM will come second, and JHVH will lose. All this because the speakers cannot pronounce an H.
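Under these three assumptions the prediction can even be computed; the sketch below implements a crude replicator dynamic in which every H multiplies a form's fitness by an arbitrary illustrative penalty of 0.5:

```python
# Toy replicator dynamics for the tetragram example -- the 0.5 penalty per H is invented.
PENALTY_PER_H = 0.5
memes = {"BRHM": 1, "ALLA": 0, "JHVH": 2}            # number of H's in each form

fitness = {m: PENALTY_PER_H ** h for m, h in memes.items()}
share = {m: 1 / 3 for m in memes}                    # equal initial shares

for generation in range(50):                         # shares grow with relative fitness
    total = sum(share[m] * fitness[m] for m in memes)
    share = {m: share[m] * fitness[m] / total for m in memes}

print(sorted(share.items(), key=lambda kv: -kv[1]))  # ALLA first, then BRHM, then JHVH
```

Whatever penalty between 0 and 1 is chosen, the ordering ALLA > BRHM > JHVH is forced by the H-counts alone.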
To draw any connection between this simplified example and the real world would be pushing things too far. "Ceteris paribus" is never attained in the real world, and certainly not in human life: so many variables, so many relations, so much chaos! The science being born here will reach its goal only on condition that its goal be a modest one. To hope that anyone could predict the future of humanity is a madman's dream! But if we remain modest, we may perhaps discover, or even build, islands of order within the chaos of data. Modest like the linguist who, after having looked at his corpus for years, says: "Ladies and gentlemen, the difference between closed O and open O is disappearing; one of them will therefore vanish entirely within a horizon of 23 years", we permit ourselves to state a banal piece of knowledge: "Ladies and gentlemen, the two experiments we have just carried out show that the classical smiley is dying."

We are only at the beginning of our story, and so we find it very hard to accept death, even the death of signs: our priority is life rather than death. We therefore propose to save the classical smiley by giving it a well-defined signified. Which one? Here is the answer, in the form of a metalinguistic definition: ":-) is for us an emoticon of sincere happiness; it is an attempt to describe the true state of our face at the moment we type it on our keyboard. Expressing it has cost us more energy than expressing a cheap ';)' or an aggressive 'lol', and we know well where we have invested that energy." If we follow our definition rigorously, we shall refill a wounded form with meaning. The classical smiley will not die for us. And if we succeed in spreading this definition among other magnificent beings, it will not die for them either. That day will be not only a day of rebirth for our old friend :-) , but also a great day for applied memetics, the true birth of "memetic engineering".

We cannot content ourselves with studying the past. It is against the test of the present and of the near future that we must confront our results. (Asimov, 1993)
References

Anandamurti, Sh. (1961). Ananda Sutram. http://en.wikipedia.org/wiki/Ananda_Sutram
Asimov, I. (1993). L'aube de la fondation / Forward the Foundation. Nightfall Inc.
Blackmore, S. (1999). The Meme Machine. Oxford: Oxford University Press.
Crystal, D. (2001). Language and the Internet. Cambridge: Cambridge University Press.
Dawkins, R. (1976). The Selfish Gene. Oxford: Oxford University Press.
de Saussure, F. (1972). Cours de linguistique générale. Paris: Éditions Payot.
Dirlam, D. K. (2003). Competing Memes Analysis. Journal of Memetics - Evolutionary Models of Information Transmission, 7.
Gatherer, D. (1997). Macromemetics: Towards a Framework for the Re-unification of Philosophy. Journal of Memetics - Evolutionary Models of Information Transmission, 1.
Hakim Bey (1984). The Temporary Autonomous Zone, Ontological Anarchy, Poetic Terrorism. http://www.hermetic.com/bey/taz_cont.html
Hromada, D. (2007). Moja prva rozprava o metode. http://node.nel.edu/?node_id=6823
Jakobson, R. (1963). Essais de linguistique générale. Paris.
Lynch, A. (1998). Units, Events and Dynamics in Memetic Evolution. Journal of Memetics - Evolutionary Models of Information Transmission, 2.
Niklas-Salminen, A. (1997). La lexicologie. Paris: Armand Colin.
Piaget, J. (1962). La psychologie de l'intelligence. Paris: Colin.
Richerson, P. J. and Boyd, R. (2005). Not by Genes Alone: How Culture Transformed Human Evolution. Chicago; London: University of Chicago Press.
Wittgenstein, L. (1958). The Blue and Brown Books (notes dictated to Cambridge students in 1933-35).
Yaguello, M. (1991). Alice au pays du langage. Paris: Éditions du Seuil.

Appendix 1 - Forms containing the trigram "lol", in upper or lower case. Inflected forms of Slovak words and existing names are in italics; particularly amusing forms are in bold.

Form        Expressions   Hosts   Comment
lol         21984         828
LOL         5009          426
lolo        1596          228
lola        201           87
lolitka     107           61
lolitky     107           60
megalol     197           49
LoL         289           46
Lol         90            45
lolita      74            45
LOLO        67            37
lolek       39            31      a character from the Polish cartoon Lolek & Bolek
lolovia     37            22
lolko       39            21
lolitku     25            19
lolu        23            17
lolka       39            17
lolitu      23            16
lolik       56            15
lolitiek    21            15
MEGALOL     48            15
lolino      54            15
Lola        19            15
lolol       34            14
loll        24            14
lololol     18            14
Lolo        22            14
lolitkou    13            13
lolity      15            13
megaLOL     19            12
lololo      19            12
lolinko     36            12
lole        23            12
Lolita      14            12
instalol    13            10
lolom       15            10

(The list continues with several hundred rarer forms, each attested by fewer than ten hosts, down to hapax legomena; among the glossed curiosities are skloly ‹skola = school›, pondelol ‹pondelok = Monday›, mohlol ‹mohol = he could›, oklolo ‹okolo = around›, zavlolat ‹zavolat = to call›, evlolucia ‹evolution› and trilologia ‹a trilology›.)
1 loluju 1 1 jololos 1 1 pololeziac 1 1 lolain 1 1 najdlolzitejsie 1 1 lolosh 1 1 aaaaaaaaaaaaaa aaaaaaaaaaaaa LOL 1 1 BROKOLOLO LOLOLICUU UUU 1 1 filologickou 1 1 nestretlol 1 1 LOLOLOLOO OOOOOL 1 1 psilolo 1 1 loloolololololo gopeeed 1 1 lolica 1 1 Lolipope 1 1 lolipap 2 1 lolinek 3 1 monglolsky 1 1 psychoLOLgie 1 1 LOLsosovu 1 1 nevadilolo 1 1 lol___jo 1 1 lolmao 1 1 lollloll 1 1 Skontrloluj 1 1 pololelegal 1 1 lolllll 1 1 pololeviej 1 1 lolitjek 1 1 lolpozdrav 1 1 trilolololobit 1 1 sololuitoch 1 1 neodoslalololol ololo 1 1 lolisko 1 1 marcelolm 1 1 lolaaaaaaaaaaa aa 1 1 pololegalni 1 1 sotalol 1 1 lolinovat 1 1 lolbot 1 1 lolololooooooo ol 1 1 ZLOl 1 1 pololegalnej 1 1 lolololololololll llll 1 1 olalola 1 1 dlolezite 1 1 lolvlastne 1 1 filologie 1 1 LOLako 1 1 Pololezmo 1 1 lol__se 1 1 ohlolit 1 1 neLOLuj 1 1 butylolakron 1 1 LOLyPOP 1 1 lolingonthefloo rlaughing 1 1 lolofisku 2 1 vyjdeLOLOLO LOLOLOLOL OLOLOLOLO LOLOLOLOL OLOLOLOL 1 1 lolacafe 1 1 vizulol 2 1 lollllllllllllllllllll llllllllllllllllllllll lllllllllllllllll 1 1 lolco 1 1 haloloo 1 1 lOOLLLol 1 1 lolitami 1 1 lolposh 1 1 lols 1 1 Lolovi 1 1 felolem 2 1 chcelol 1 1 sedelol 1 1 urlologii 1 1 elolvastam 1 1 trilolobeatom 1 1 elol 1 1 lolcetung 1 1 metoprolol 1 1 naklolko 1 1 rychlolahko 1 1 philology 2 1 hyperlol 1 1 mololko 1 1 kaloly 1 1 Mao Tse tung? JADT’ 18 PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON STATISTICAL ANALYSIS OF TEXTUAL DATA JADT’ 18 PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON STATISTICAL ANALYSIS OF TEXTUAL DATA (Rome, 12-15 June 2018) Vol. I UniversItalia 2018 PROPRIETÀ LETTERARIA RISERVATA Copyright 2018 - UniversItalia - Roma ISBN 978-88-3293-137-2 A norma della legge sul diritto d’autore e del codice civile è vietata la riproduzione di questo libro o di parte di esso con qualsiasi mezzo, elettronico, meccanico, per mezzo di fotocopie, microfilm, registra-tori o altro. Le fotocopie per uso personale del lettore possono tuttavia essere effettuate, ma solo nei limiti del 15% del volume e dietro pagamento alla SIAE del compenso previsto dall’art. 68, commi 4 e 5 della legge 22 aprile 1941 n. 633. Ogni riproduzione per finalità diverse da quelle per uso personale deve essere autorizzata specificatamente dagli autori o dall’editore. Program Committee Ramón Álvarez Esteban: Univ. of León, E Valérie Beaudouin: Telecom ParisTech, F Mónica Bécue: Poly. Univ. of Catalunya, E Sergio Bolasco: Sapienza Univ. of Rome, I Isabella Chiari: Sapienza Univ. of Rome, I François Daoust, UQÀM, Montreal, CDN Anne Dister, FUSL, Bruxelles / UCL, Louvain, B Jules Duchastel: UQÀM, Montreal, CDN Serge Fleury: Univ. Paris 3, F Cédrick Fairon: UCL, Louvain, B Luca Giuliano: Sapienza Univ. of Rome, I Serge Heiden, ENS, Lyon, F Domenica Fioredistella Iezzi, Univ. of Tor Vergata, I Margareta Kastberg, Univ. of Franche Comté, F Ludovic Lebart: CNRS / ENST, Paris, F Jean-Marc Leblanc: Univ. of Créteil, F Alain Lelu: Univ. of Franche Comté, F Dominique Longrée, Univ. of Liège, B Véronique Magri: Univ. of Nice Sophia-Antipolis, F Pascal Marchand: Univ. of Toulouse, F William Martinez: Univ. of Lisboa, P Damon Mayaffre: CNRS, Nice, F Sylvie Mellet: CNRS, Nice, F Michelangelo Misuraca: Univ. of Calabria, I Denis Monière: Univ. of Montréal, CDN Bénédicte Pincemin: CNRS, Lyon, F Céline Poudat: Univ. of Nice Sophia-Antipolis, F Pierre Retinaud: Univ. of Tolouse, F André Salem: Univ. Paris 3, F Monique Slodzian: Inalco, F Arjuna Tuzzi: Univ. 
of Padua, I Mathieu Valette: Inalco, F Organising Committee Domenica Fioredistella Iezzi: Univ. of Tor Vergata, I Sergio Bolasco: Sapienza Univ. of Rome, I Livia Celardo: Sapienza Univ. of Rome, I Isabella Chiari: Sapienza Univ. of Rome, I Francesca della Ratta: ISTAT, I Fiorenza Deriu: Sapienza Univ. of Rome, I Francesca Dolcetti: Sapienza Univ. of Rome, I Andrea Fronzetti Colladon: Univ. of Tor Vergata, I Francesca Greco: Sapienza Univ. of Rome, I Isabella Mingo: Sapienza Univ. of Rome, I Michelangelo Misuraca: Univ. of Calabria, I Arjuna Tuzzi: Univ. of Padua, I Maurizio Vichi: Sapienza Univ. of Rome, I Francesco Zarelli: ISTAT, I Local Organisation Francesco Alò, Giulia Giacco, Paolo Meoli, Vittorio Palermo, Viola Talucci Table of contents Introduction ............................................................................................................... XVII Acknowledgements ....................................................................................................XIX Invited Speakers GERMAN KRUSZEWSKI Memorize or generalize? Searching for a compositional RNN in a haystack Adam Liška ......................................................................................................... XXIII BING LIU Scaling-up Sentiment Analysis through Continuous Learning .................. XXIV PASCAL MARCHAND La textométrie comme outil d’expertise : application à la négociation de crise. ................................................................ XXV GEORGE K. MIKROS Author Identification Combining Various Author Profiles. Towards a Blended Authorship Attribution Methodology ............................................................. XXVI ROBERTO NAVIGLI From text to concepts and back: going multilingual with BabelNet in a step or two ....................................................................... XXVII Contributors MOTASEM ALRAHABI1, CHIARA MAINARDI1 Identification automatique de l’ironie et des formes apparentées dans un corpus de controverses théâtrales ........................................................................... 1 MOHAMMAD ALSADHAN, SASCHA DIWERSY, AGATA JACKIEWICZ, GIANCARLO LUXARDO Migrants et réfugiés : dynamique de la nomination de l'étranger ................... 10 R. ALVAREZ-ESTEBAN, M. BÉCUE-BERTAUT, B. KOSTOV, F. HUSSON, J-A SÁNCHEZ-ESPIGARES Xplortext, a R package. Multidimensional statistics for textual data science . 19 ELENA, AMBROSETTI, ELEONORA MUSSINO, VALENTINA TALUCCI L'evoluzione delle norme: analisi testuale delle politiche sull'immigrazione in Italia ........................................................................................................................... 26 VIII JADT’ 18 MASSIMO ARIA, CORRADO CUCCURULLO A bibliometric meta-review of performance measurement, appraisal, management research ............................................................................................. 35 LAURA ASCONE Textual Analysis of Extremist Propaganda and Counter-Narrative: a quantiquali investigation ................................................................................................... 44 LAURA ASCONE, LUCIE GIANOLA Analyse de données textuelles appliquée à des problématiques de sécurité et d'enquête judiciaire ................................................................................................. 52 SIMONA BALBI, MICHELANGELO MISURACA, MARIA SPANO A two-step strategy for improving categorisation of short texts ..................... 
60 CHRISTINE BARATS, ANNE DISTER, PHILIPPE GAMBETTE, JEAN-MARC LEBLANC, MARIE PERES Appeler à signer une pétition en ligne : caractéristiques linguistiques des appels ........................................................................................................................ 68 MANUEL BARBERA, CARLA MARELLO Newsgroup e lessicografia: dai NUNC al VoDIM .............................................. 76 IGNAZIA BARTHOLINI Techniques for detecting the normalized violence in the perception of refugee / asylum seekers between lexical analysis and factorial analysis...................... 83 PATRIZIA BERTINI MALGARINI, MARCO BIFFI, UGO VIGNUZZI Dal corpus al dizionario: prime riflessioni lessicografiche sul Vocabolario storico della cucina italiana postunitaria (VoSCIP) ............................................ 90 MARCO BIFFI Strumenti informatico-linguistici per la realizzazione di un dizionario dell’italiano postunitario ........................................................................................ 99 ANNICK FARINA, RICCARDO BILLERO Comparaison de corpus de langue « naturelle » et de langue « de traduction » : les bases de données textuelles LBC, un outil essentiel pour la création de fiches lexicographiques bilingues........................................................................ 108 FELICE BISOGNI, STEFANO PIRROTTA Il rapporto tra famiglie di anziani non autosufficienti e servizi territoriali: un'analisi dei dati esploratoria con l'Analisi Emozionale del Testo (AET) .... 117 ANTONELLA BITETTO, LUIGI BOLLANI Esperienza di analisi testuale di documentazione clinica e di flussi informativi sanitari, di utilità nella ricerca epidemiologica e per indagare la qualità dell'assistenza......................................................................................................... 126 GUIDO BONINO, DAVIDE PULIZZOTTO, PAOLO TRIPODI Exploring the history of American philosophy in a computer-assisted framework .............................................................................................................. 134 JADT’ 18 IX MARC-ANDRE BOUCHARD, SYLVIA KASPARIAN La classification hiérarchique descendante pour l’analyse des représentations sociales dans une pétition antibilinguisme au Nouveau-Brunswick, Canada .................................................................................................................... 142 LIVIA CELARDO, RITA VALLEROTONDA, DANIELE DE SANTIS,CLAUDIO SCARICI, ANTONIO LEVA Analysing occupational safety culture through mass media monitoring..... 150 BARBARA CORDELLA, FRANCESCA GRECO, PAOLO MEOLI,VITTORIO PALERMO, MASSIMO GRASSO Is the educational culture in Italian Universities effective? A case study ...... 157 MICHELE A. CORTELAZZO, GEORGE K. MIKROS, ARJUNA TUZZI Profiling Elena Ferrante: a Look Beyond Novels .............................................. 165 FABRIZIO DE FAUSTI, MASSIMO DE CUBELLIS, DIEGO ZARDETTO1 Word Embeddings: a Powerful Tool for Innovative Statistics at Istat .......... 174 Gibbons A. (1985). Algorithmic Graph Theory. Cambridge University Press. . 182 VIVIANA DE GIORGI, CHIARA GNESI Analisi di dati d’impresa disponibili online: un esempio di data science tratto dalla realtà economica dei siti di e-commerce ................................................... 183 ALESSANDRO CAPEZZUOLI, FRANCESCA DELLA RATTA, STEFANIA MACCHIA,MANUELA MURGIA, MONICA SCANNAPIECO, DIEGO ZARDETTO The use of textual sources in Istat: an overview ................................................ 
192 FRANCESCA DELLA RATTA, GABRIELLA FAZZI, MARIA ELENA PONTECORVO, CARLO VACCARI, ANTONINO VIRGILLITO Twitter e la statistica ufficiale: il dibattito sul mercato del lavoro ................. 200 SAMI DIAF Gauging An Author’s Mood Using Hidden Markov Chains ......................... 209 MARC DOUGUET Les hémistiches répétés ........................................................................................ 215 FRANCESCA DRAGOTTO, SONIA MELCHIORRE «Mangiata dall’orco e tradita dalle donne». Vecchi e nuovi media raccontano la vicenda di Asia Argento, tra storytelling e Speech Hate ............................. 223 CRISTIANO FELACO, ANNA PAROLA Il cosa e il come del processo narrativo. L’uso combinato della Text Analysis e Network Text Analysis al servizio della precarietà lavorativa ....................... 233 ANA NORA FELDMAN Hablando de crisis: las comunicaciones del Fondo Monetario Internacional 242 VALERIA FIASCO Brexit in the Italian and the British press: a bilingual corpus-driven analysis ...................................................................... 250 VIVIANA FINI, GIUSEPPE LUCIO GAETA, SERGIO SALVATORE Textual analysis to promote innovation within public policy evaluation .... 259 X JADT’ 18 ALESSIA FORCINITI, SIMONA BALBI A proposal for Cross-Language Analysis: violence against women and the Web ................................................................ 268 BEATRICE FRACCHIOLLA, OLINKA SOLENE DE ROGER La verbalisation des émotions ............................................................................. 276 LUISA FRANCHINA, FRANCESCA GRECO, ANDREA LUCARIELLO, ANGELO SOCAL, LAURA TEODONNO Improving Collection Process for Social Media Intelligence: A Case Study . 285 ANDREA FRONZETTI COLLADON, JOHANNE SAINT-CHARLES, PIERRE MONGEAU The impact of language homophily and similarity of social position on employees’ digital communication ..................................................................... 293 MATTEO GERLI Looking Through the Lens of Social Sciences: The European Union in the EUFunded Research Projects Reporting .................................................................. 300 LUCIE GIANOLA, MATHIEU VALETTE Spécialisation générique et discursive d’une unité lexical L’exemple de joggeuse dans la presse quotidienne régionale ................................................... 312 PETER A. GLOOR, JOAO MARCOS DE OLIVEIRA, DETLEF SCHODER The Transparency Engine – A Better Way to Deal with Fake News .............. 319 FRANCESCA GRECO, LEONARDO ALAIMO, LIVIA CELARDO Brexit and Twitter: The voice of people.............................................................. 327 FRANCESCA GRECO, GIULIO DE FELICE, OMAR GELO A text mining on clinical transcripts of good and poor outcome psychotherapies ..................................................................................................... 335 FRANCESCA GRECO, DARIO MASCHIETTI, ALESSANDRO POLLI DOMINIO: A Modular and Scalable Tool for the Open Source Intelligence 343 LEONIE GRÖN, ANN BERTELS, KRIS HEYLEN Is training worth the trouble? A PoS tagging experiment with Dutch clinical records..................................................................................................................... 351 FRANCE GUERIN-PACE, ELODIE BARIL Les outils de la statistique textuelle pour analyser les corpus de données d’enquêtes de la statistique publique .......................... 
359 SERGE HEIDEN Annotation-based Digital Text Corpora Analysis within the TXM Platform 367 DANIEL HENKEL Quantifying Translation : an analysis of the conditional perfect in EnglishFrench comparable-parallel corpus..................................................................... 375 DANIEL DEVATMAN HROMADA Extraction of lexical repetitive expressions from complete works of William Shakespeare ............................................................................................................ 384 JADT’ 18 XI OLIVIER KRAIF, JULIE SORBA Spécificités des expressions spatiales et temporelles dans quatre sous-genres romanesques (policier, science-fiction, historique et littérature générale) .... 392 CYRIL LABBE, DOMINIQUE LABBE Les phrases de Marcel Proust .............................................................................. 400 LUDOVICA LANINI, MARÍA CARLOTA NICOLÁS MARTÍNEZ Verso un dizionario corpus-based del lessico dei beni culturali: procedure di estrazione del lemmario ....................................................................................... 411 DANIELA LARICCHIUTA, FRANCESCA GRECO, FABRIZIO PIRAS, BARBARA CORDELLA, DEBORA CUTULI, ELEONORA PICERNI, FRANCESCA ASSOGNA, CARLO LAI, GIANFRANCO SPALLETTA, LAURA PETROSINI “The grief that doesn’t speak”: Text Mining and Brain Structure 419 GEVISA LA ROCCA, CIRUS RINALDI Icone gay: tra processi di normalizzazione e di resistenza. Ricostruire la semantica degli hashtag........................................................................................ 428 LUDOVIC LEBART Looking for topics: a brief review......................................................................... 436 GAËL LEJEUNE, LICHAO ZHU Analyse Diachronique de Corpus : le cas du poker .......................................... 444 JULIEN LONGHI, ANDRE SALEM Approche textométrique des variations du sens ............................................... 452 LAURENT VANNI1, DAMON MAYAFFRE, DOMINIQUE LONGREE ADT et deep learning, regards croisés. Phrases-clefs, motifs et nouveaux observables ............................................................................................................. 459 LUCIE LOUBERE Déconstruction et reconstruction de corpus... À la recherche de la pertinence et du contexte ......................................................................................................... 467 HEBA METWALLY L’apport du corpus-maquette à la mise en évidence des niveaux descriptifs de la chronologie du sens. Essai sur une Série Textuelle Chronologique du Monde diplomatique (1990-2008). ....................................................................................... 474 JUN MIAO, ANDRE SALEM Séries textuelles homogènes................................................................................. 491 SILVIO MIGLIORI, ANDREA QUINTILIANI, DANIELA ALDERUCCIO, FIORENZO AMBROSINO, ANTONIO COLAVINCENZO, MARIALUISA MONGELLI, SAMUELE PIERATTINI, GIOVANNI PONTI SERGIO BOLASCO, FRANCESCO BAIOCCHI, GIOVANNI DE GASPERIS TaLTaC in ENEAGRID Infrastructure................................................................ 501 ISABELLA MINGO, MARIELLA NOCENZI The dimensions of Gender in the International Review of Sociology. A lexicometric approach to the analysis of the publications in the last twenty years ........................................................................................................................ 
ADIEL MITTMANN, ALCKMAR LUIZ DOS SANTOS
The Rhythm of Epic Verse in Portuguese From the 16th to the 21st Century

DENIS MONIERE, DOMINIQUE LABBE
Le vocabulaire des campagnes électorales

CYRIELLE MONTRICHARD
Faire émerger les traces d'une pratique imitative dans la presse de tranchées à l'aide des outils textométriques

ALBERT MORALES MORENO
Evolución diacrónica de la terminología y la fraseología jurídico-administrativa en los Estatutos de autonomía de Catalunya de 1932, 1979 y 2006

CEDRIC MOREAU
Comment penser la recherche d'un signe pour une plateforme multilingue et multimodale français écrit / langue des signes française ?

JEAN MOSCAROLA, BORIS MOSCAROLA
ADT et visualisation, pour une nouvelle lecture des corpus. Les débats de 2ème tour des Présidentielles (1974-2017)

MAURIZIO NALDI
A conversation analysis of interactions in personal finance forums

STEFANO NOBILE
Analisi testuale, rumore semantico e peculiarità morfosintattiche: problemi e strategie di pretrattamento di corpora speciali

DANIEL PELISSIER
L'individu dans le(s) groupe(s) : focus group et partitionnement du corpus

BENEDICTE PINCEMIN, CELINE GUILLOT-BARBANCE, ALEXEI LAVRENTIEV
Using the First Axis of a Correspondence Analysis as an Analytical Tool. Application to Establish and Define an Orality Gradient for Genres of Medieval French Texts

CELINE POUDAT
Explorer les désaccords dans les fils de discussion du Wikipédia francophone

MATTHIEU QUIGNARD, SERGE HEIDEN, FREDERIC LANDRAGIN, MATTHIEU DECORDE
Textometric Exploitation of Coreference-annotated Corpora with TXM: Methodological Choices and First Outcomes

PIERRE RATINAUD
Amélioration de la précision et de la vitesse de l'algorithme de classification de la méthode Reinert dans IRaMuTeQ

LUISA REVELLI
Il parametro della frequenza tra paradossi e antinomie: il caso dell'italiano scolastico

PIERGIORGIO RICCI
How Twitter emotional sentiments mirror on the Bitcoin transaction network

CHANTAL RICHARD, SYLVIA KASPARIAN
Analyse de contenu versus méthode Reinert : l'analyse comparée d'un corpus bilingue de discours acadiens et loyalistes du N.-B., Canada

VALENTINA RIZZOLI, ARJUNA TUZZI
Bridge over the ocean: Histories of social psychology in Europe and North America. An analysis of chronological corpora

LOUIS ROMPRE, ISMAÏL BISKRI
Les « itemsets fréquents » comme descripteurs de documents textuels
CORINNE ROSSARI, LJILJANA DOLAMIC, ANNALENA HÜTSCH, CLAUDIA RICCI, DENNIS WANDEL
Discursive Functions of French Epistemic Adverbs: What can Correspondence Analysis tell us about Genre and Diachronic Variation?

VANESSA RUSSO, MARA MARETTI, LARA FONTANELLA, ALICE TONTODIMAMMA
Misleading information in online propaganda networks

ELIANA SANANDRES, CAMILO MADARIAGA, RAIMUNDO ABELLO
Topic modeling of Twitter conversations

FRANCESCO SANTELLI, GIANCARLO RAGOZINI, MARCO MUSELLA
What volunteers do? A textual analysis of voluntary activities in the Italian context

S. SANTILLI, S. SBALCHIERO, L. NOTA, S. SORESI
A longitudinal textual analysis of abstracts presented at the Italian Association for Vocational Guidance and Career Counseling conferences from 2002 to 2017

JACQUES SAVOY
A la poursuite d'Elena Ferrante

JACQUES SAVOY
Regroupement d'auteurs dans la littérature du XIXe siècle

STEFANO SBALCHIERO, ARJUNA TUZZI
What's Old and New? Discovering Topics in the American Journal of Sociology

NILS SCHAETTI, JACQUES SAVOY
Comparison of Neural Models for Gender Profiling

LIONEL SHEN
Segments répétés appliqués à l'extraction de connaissances trilingues

SANDRO STANCAMPIANO
Misurare, Monitorare e Governare le città con i Big Data

FADILA TALEB, MARYVONNE HOLZEM
Exploration textométrique d'un corpus de motifs juridiques dans le droit international des transports

JAMES M. TEASDALE
The Framing of the Migrant: Re-imagining a Fractured Methodology in the Context of the British Media

MARJORIE TENDERO, CECILE BAZART
Results from two complementary textual analysis software (Iramuteq and Tropes) to analyze social representation of contaminated brownfields

MATTEO TESTI, ANDREA MERCURI, FRANCESCO PUGLIESE
Multilingual Sentiment Analysis

JUAN MARTÍNEZ TORVISCO
A linguistic analysis of the image of immigrants' gender in Spanish newspapers

FRANCESCO URZÌ
Lo strano caso delle frequenze zero nei testi legislativi euroistituzionali

SYLVIE VANDAELE
Les traductions françaises de The Origin of Species : pistes lexicométriques

PIERRE WAVRESKY, MATTHIEU DUBOYS DE LABARRE, JEAN-LOUP LECOEUR
Circuits courts en agriculture : utilisation de la textométrie dans le traitement d'une enquête sur 2 marchés

MARIA ZIMINA, NICOLAS BALLIER
On the phraseology of spoken French: initial salience, prominence and lexicogrammatical recurrence in a prosodic-syntactic treebank Rhapsodie
Abstracts

FILIPPO CHIARELLO, GUALTIERO FANTONI, ANDREA BONACCORSI, SILVIA FARERI
What kind of contributions does research provide? Mapping issue based statements in research abstracts

FILIPPO CHIARELLO, GIACOMO OSSOLA, GUALTIERO FANTONI, ANDREA BONACCORSI, ANDREA CIMINO, FELICE DELL'ORLETTA
Technical sentiment analysis: predicting the success of new products using social media

FIORENZA DERIU, DOMENICA FIOREDISTELLA IEZZI
Citizens and neighbourhood life: mapping population sentiment in Italian cities

FRANCESCA DI CARLO, ROSY INNARELLA, BRIZIO LEONARDO TOMMASI
Vax network: profiling influential nodes with social network analysis on twitter

DAVIDE DONNA
Alteryx

VALERIO FICCADENTI, ROY CERQUETI, MARCEL AUSLOOS
Complexity of US President Speeches

PETER A. GLOOR
Measuring the Dynamics of Social Networks with Condor

IOLANDA MAGGIO, DOMENICA FIOREDISTELLA IEZZI, MATTEO FATIGHENTI
"BIG DATA" Words Trend Analysis using the multidimensional analysis of texts

MARIO MASTRANGELO
Itinerari turistici, network analysis e text mining

MARIA FRANCESCA ROMANO, GUIDO REY, ANTONELLA BALDASSARINI, PASQUALE PAVONE
Text Mining per l'analisi qualitativa e quantitativa dei dati amministrativi utilizzati dalla Pubblica Amministrazione

ALESSANDRO CESARE ROSA
Taglio cesareo e Vbac in Italia al tempo dei Big Data: una proposta di ulteriore contributo informativo

Introduction

The International Conference on the Statistical Analysis of Textual Data (JADT, Journées d'Analyse statistique des Données Textuelles) has reached its 14th edition. It was held for the third time in Rome, from 12 to 15 June 2018, organized by the DII (Department of Enterprise Engineering "Mario Lucertini") at Tor Vergata University of Rome and the DSS (Department of Statistical Sciences) at Sapienza University of Rome. This biennial conference has steadily gained importance since its first edition in Barcelona (1992), through the editions of Montpellier (1994), Rome (1996), Nice (1998), Lausanne (2000), Saint-Malo (2002), Louvain-la-Neuve (2004), Besançon (2006), Lyon (2008), Rome (2010), Liège (2012), Paris (2014) and Nice (2016). Every two years, the JADT conference has presented the state of the art concerning theories, problems, methods, algorithms, software and applications in several domains, all sharing a quantitative approach to the study of lexical, textual, pragmatic or discursive features of information expressed in natural language.
The proceedings of the 2018 conference collect 113 contributions by 243 scholars from 15 countries. The papers are open to all scholars and researchers working in the field of textual data analysis, ranging from lexicography to the analysis of political discourse, from information retrieval to marketing research, from computational linguistics to sociolinguistics, and from text mining to content analysis. The invited speakers focused on the central topics of the conference, discussing open and new themes such as machine learning algorithms for profiling social media users, new multilingual approaches, textometry, and authorship. The proceedings are ordered alphabetically by the surname of the first author of each contribution. This edition introduced several innovations with respect to the past. In a round table, we discussed the past, present and future of statistical analysis of textual data and text mining methods, examining the points of view of both universities and enterprises. The papers, which underwent a review process with two and sometimes three reviewers, are at most six pages long: the idea is that they are not yet in their final version, and that the exchange with other scholars during the conference will lead to improvements. For the first time, a selection of extended papers presented at the JADT conference will be published, after a further reviewing process, in a book published by Springer and in several special issues of acknowledged journals (Advances in Data Analysis and Classification, International Review of Sociology, Italian Journal of Applied Statistics, Social Indicators Research, RPC Rivista di Psicologia Clinica). Enhancing the papers discussed during the JADT conference will allow the scholarly community to keep its network of contacts active and its exchanges lively.

D. Fioredistella Iezzi, Livia Celardo, Michelangelo Misuraca

Acknowledgements

We express our gratitude to the 56 reviewers who offered their assistance in selecting and anonymously reviewing the papers of this volume: Massimo Aria, Barbara Baldazzi, Nadia Battisti, Valérie Beaudouin, Sergio Bolasco, Etienne Brunet, Mónica Bécue, Isabella Chiari, Livia Celardo, Michele Cortelazzo, Pasquale Del Vecchio, Francesca Della Ratta, Fiorenza Deriu, Anne Dister, Francesca Dolcetti, Annick Farina, Serge Fleury, Andrea Fronzetti, Luca Giuliano, Peter Gloor, Francesca Greco, Francesca Grippa, Serge Heiden, D. Fioredistella Iezzi, Antonio Iovanella, Sylvia Kasparian, Margareta Kastberg, Dominique Labbé, Ludovica Lanini, Alexei Lavrentev, Ludovic Lebart, Jean-Marc Leblanc, Alain Lelu, Dominique Longrée, Véronique Magri, Pascal Marchand, Damon Mayaffre, Sylvie Mellet, Silvia Micheli, Michelangelo Misuraca, Denis Monière, Gianluca Murgia, Pasquale Pavone, Bénédicte Pincemin, Céline Poudat, Pierre Ratinaud, Piergiorgio Ricci, Maria Francesca Romano, Johanne Saint-Charles, André Salem, Massimiliano Schiraldi, Max Silberztein, Maria Spano, Arjuna Tuzzi, Mathieu Valette, Ramón Álvarez Esteban. JADT2018 was held under the patronage of ISTAT (Istituto Nazionale di Statistica – National Institute of Statistics). We are also very grateful to the following sponsors: ISTAT, Le Sphinx, The Information Lab, Master in Data Science at Tor Vergata University, Prisma.
As regards the organisation of the conference, we would like to thank all the members of the local organising team: Francesco Alò, Silvia Castellan, Giulia Giacco, Paolo Meoli, Vittorio Palermo, Viola Talucci. Special thanks go to Livia Celardo, Isabella Chiari, Andrea Fronzetti Colladon, Francesca Della Ratta, Fiorenza Deriu, Francesca Dolcetti and Francesca Greco for the organisation of the special tracks on Official Statistics, Linguistics, Applications in the social and psychological domains, and Social Network and Semantic Analysis.

Invited Speakers

Memorize or generalize? Searching for a compositional RNN in a haystack
Adam Liška, German Kruszewski
Facebook – germank@fb.com

Abstract
Machine learning systems have made rapid progress in the past few years, as evidenced by the remarkable feats they have accomplished in fields as diverse as computer vision and reinforcement learning. Yet, impressive as these achievements are, they rely on learning algorithms that require orders of magnitude more data than a human learner would. This disparity could be rooted in many different factors. In this talk, we will draw on the hypothesis that compositional learning, that is, the ability to recombine previously acquired skills and knowledge to solve new problems, could be one important element of fast and efficient learning (Lake et al., 2017). In this direction, we will discuss our ongoing efforts towards building systems that can learn in compositional ways. Concretely, we will present a simple benchmark based on function composition to measure the compositionality of learning systems, and use it to draw insights into whether current learning systems learn, or can learn, in a compositional manner.
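As an illustration of what a function-composition benchmark of this kind can look like, here is a minimal sketch under our own assumptions: the atomic string operations and the train/test split are hypothetical and are not the authors' actual benchmark.

```python
# Hypothetical function-composition benchmark: a learner that has seen the
# atomic functions is tested on unseen two-step compositions of them.
import itertools

# Atomic "skills": simple string transformations.
ATOMS = {
    "reverse": lambda s: s[::-1],
    "double":  lambda s: s + s,
    "drop1":   lambda s: s[1:],
}

def compose(names):
    """Return the function applying the named atoms left to right."""
    def f(s):
        for n in names:
            s = ATOMS[n](s)
        return s
    return f

def make_split():
    """Train on atomic tasks, test on unseen 2-step compositions."""
    train = [((n,), ATOMS[n]) for n in ATOMS]
    test = [(names, compose(names))
            for names in itertools.permutations(ATOMS, 2)]
    return train, test

train, test = make_split()
for names, f in test:
    print(names, f("abc"))  # gold outputs a compositional learner must match
```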
Scaling-up Sentiment Analysis through Continuous Learning
Bing Liu
University of Illinois at Chicago – liub@uic.edu

Abstract
Sentiment analysis (SA), or opinion mining, is the computational study of people's opinions, sentiments, emotions and evaluations. Owing to numerous research challenges and almost unlimited applications, SA has been a very active research area in natural language processing and text mining. In this talk, I will first give a brief introduction to SA, and then discuss some major difficulties of the current technologies when one wants to perform sentiment analysis in a large number of domains, e.g., for all products sold in a large retail store. To tackle this scaling-up problem, I will describe our recent work on lifelong machine learning (LML, or lifelong learning), which tries to enable the machine to learn like humans: learning continuously, retaining or accumulating the knowledge learned in the past, and using that knowledge to help future learning and problem solving. This paradigm is well suited to SA and can help scale SA up to a large number of domains with little manual involvement.

La textométrie comme outil d'expertise : application à la négociation de crise
Pascal Marchand
Université de Toulouse – pascal.marchand@iut-tlse3.fr

Résumé
To assess the relevance of textometric practice in field settings and as an expertise tool, we study real exchanges involving the negotiators of police intervention forces in contexts of barricaded subjects, hostage takings, terrorism, or high-risk suicidal intent. We approach negotiation through the dynamics of lexical choice, seeking to map the lexicon, cluster text segments, and compare the profiles of speakers and situations. We thus propose to answer the following questions: Are there recurrent themes across crises? Is there a lexical chronology of a crisis? How are emotions managed? What is specific to "radicalised" situations? Objectifying the exchanges and bringing out formal sequences can then support diagnosis, with the aim of deriving concrete elements for debriefing and for formalising the practices of professional negotiators.

Author Identification Combining Various Author Profiles. Towards a Blended Authorship Attribution Methodology
George K. Mikros
National and Kapodistrian University of Athens – gmikros@gmail.com

Abstract
The aim of this presentation is to describe a new method for attributing texts to their real authors using combined author profiles, modern computational stylistic methods based on shallow text features (n-grams), and machine learning algorithms. Until recently, authorship attribution and author profiling were considered similar methods, with nearly identical feature sets and classification algorithms but different aims: the former identifies the author's identity, while the latter detects the author's characteristics such as gender, age or psychological profile. The two methods have been used independently, for different research aims and in different real-life tasks. In this talk, however, we will present a unified methodological framework in which standard authorship attribution and author profiling are combined, so that we can approach more effectively open or semi-open authorship attribution problems: a category, known as authorship verification, that is particularly difficult to tackle with present computational stylistic methods. More specifically, we will present preliminary research results from applying this blended methodology to a real semi-open authorship problem, the Ferrante authorship case. Using a corpus of 40 modern Italian literary authors compiled by Arjuna Tuzzi and Michele Cortelazzo of the University of Padua (Tuzzi & Cortelazzo, under review), we will explore the dynamics of author profiling by gender, age and region, and the various ways the extracted profiles can be combined to infer the identity of the real author behind Ferrante's books. Moreover, we will extend this methodology and validate its usefulness on social media texts using the English Blog Corpus (Argamon, Koppel, Pennebaker, & Schler, 2007). Using simulated authorship attribution scenarios (one in which the real author is included in the training data, and one in which the real author is missing from the training corpus), we will further evaluate the proposed blended methodology, which opens exciting new possibilities for investigating author identities in both closed and open authorship attribution tasks.
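The shallow-feature approach described here can be approximated with a generic character n-gram baseline. The following sketch is not Mikros's system; the texts and labels are placeholders, and the pipeline simply illustrates how n-gram features feed a standard classifier.

```python
# Generic character-n-gram authorship baseline (illustrative, not the
# blended methodology itself): tf-idf over character 2-3-grams plus a
# logistic-regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["first text by author A", "first text by author B",
         "second text by author A", "second text by author B"]
authors = ["A", "B", "A", "B"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),  # shallow features
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, authors)
print(clf.predict(["a disputed text"]))        # closed-set attribution
print(clf.predict_proba(["a disputed text"]))  # soft, profile-style scores
```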
From text to concepts and back: going multilingual with BabelNet in a step or two
Roberto Navigli
Sapienza University of Rome – roberto.navigli@uniroma1.it

Abstract
Multilinguality is a key feature of today's Web, and it is the feature that we leverage and exploit in our research work at the Sapienza University of Rome's Linguistic Computing Laboratory, which I will overview and showcase in this talk. I will describe the most recent developments of the BabelNet technology. I will introduce BabelNet Live, the largest continuously updated multilingual encyclopedic dictionary, and then discuss a range of cutting-edge industrial use cases implemented by Babelscape, our Sapienza startup company, including multilingual interpretation of terms, multilingual concept and entity extraction from text, and cross-lingual text similarity.

Contributors

Identification automatique de l'ironie et des formes apparentées dans un corpus de controverses théâtrales
Motasem Alrahabi (1), Chiara Mainardi (2)
(1) Université Paris-Sorbonne Abu Dhabi – motasem.alrahabi@gmail.com
(2) Université Sorbonne Nouvelle – chiara.mainardi@univ-paris3.fr

Abstract
This paper presents the results of an automatic analysis of a corpus of French texts about theatre debates (16th to 19th centuries). The purpose of this study is to highlight the important role of different forms of irony in the theatre controversy and to reveal the standpoint of authors and established authorities towards theatre performances. Despite the difficulty of the task, our research shows encouraging results. This unprecedented comparison of texts in which authors condemn or approve of the theatre enables a broader understanding of the authors' positions, arguments and rhetorical strategies relating to the theatre controversies.

Résumé
This article presents the results of our automatic analysis of a corpus of debates on the theatre (16th to 19th century). The aim of the study is to illustrate the important role played by the different forms of irony in the polemic around the theatre, and to bring out the position of the authors, or of the ancient authorities they cite, towards stage performances. The results obtained are encouraging despite the difficulty of the task; they allow us to compare, in an unprecedented way, the texts of the theatre's defenders with those of its detractors, and to better understand certain arguments and authorial strategies in the field of the controversy.

Keywords: irony, theatre, linguistic markers, semantic annotation, rule-based system.

1. Introduction
We propose an automatic analysis of a French-language corpus gathering debates on the theatre from the mid-16th century to the 1840s. Our aim is to illustrate the important role played by expressions of irony in the polemic around the theatre, and to bring out the position of the authors, or of the ancient authorities they cite, towards stage performances. We first present the linguistic resources we developed, the annotation tool used, and the corpus; we then comment on the results of the automatic analysis and, before concluding, we discuss the perspectives of this ongoing project.

2. Premises on irony
Irony is a linguistic device used to convey a message that is directly or indirectly opposed to what is literally said.
Widely studied in philosophy, rhetoric and linguistics (Berrendonner, Sperber and Wilson, Kerbrat-Orecchioni, Ducrot, Grice...), irony is a heterogeneous concept that is extremely difficult to define, owing to its many forms and to the complexity of the phenomena at play. Irony works through cues left by the speaker for the hearer, allowing the latter to grasp the speaker's intentions via more or less marked parallelisms, contradictions, exaggerations and hyperboles. These cues, often pragmatic or extralinguistic, are more or less evident, hence the importance of taking into account the context (referent, speaker, hearer...), shared knowledge, and social and cultural norms. The present study is a first step towards the automatic detection of the field of irony in our corpus. Aware of the difficulty of the task and of the lack of linguistic resources suited to our corpus and our goals, we adopted a symbolic approach, building on earlier work on the automatic annotation of enunciative modalities (Riguet and Alrahabi, 2017). Used in argumentative strategies, these observable markers help to express, or to report, irony and related phenomena (sarcasm, mockery, satire, derision...). Example: "De sorte qu'on ne peut mieux définir la Comédie, qu'une « assemblée de railleurs où personne ne se connait, et où chacun rit des défauts qui les rendent tous également coupables et ridicules »." [Lelevel, 1694] The markers used are mainly verbs such as se moquer, ironiser, parodier... By inspecting part of the corpus, we then enriched these resources with nouns, adjectives and adverbs, and classified the markers into sub-categories according to different semantic nuances: 1) ironie, dérision, se moquer, sarcastique, parodier...; 2) chicaner, taquiner, narguer...; 3) faire rire, comique, pitre, grotesque, idiot...; 4) mordant, piquant, pinçant, aigre...; 5) mépriser, dénigrer, sous-estimer, vilipender...; 6) calomnier, hypocrisie, ruse, malice...; etc. In all, we collected around 70 linguistic markers.

3. Methodology and technical choices
The automatic detection of irony is a difficult task, notably because of the multitude of linguistic means that express irony, or its related forms, often in subtle ways. Various computational studies have addressed the automatic detection of these phenomena (Joshi et al., 2016): rule-based approaches, statistical approaches and deep-learning approaches. In the present project we used Excom2 (Alrahabi, 2010), a rule-based annotation tool that gave us control over the annotation process and allowed us to progressively improve the relevance of the linguistic resources. For the system, the presence of an irony marker in a sentence triggers the associated rules, which explore the context and check for the presence or absence of complementary markers. In the following sentence, the presence of the adverb moqueusement in the context of a speech marker allows Excom2 to assign the label "Ironie" to the passage: « Il lui faut, dit-on moqueusement, cinq épithètes ! » [Corpus OBVIL]
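A minimal sketch of this marker-plus-context rule mechanism is given below. It is our own simplification in Python, not the Excom2 engine, and the marker lists are illustrative fragments of the categories described above.

```python
# Toy rule-based annotator: a sentence is tagged "Ironie" when an irony
# marker co-occurs with a complementary marker (here a speech marker)
# in the same sentence, mirroring the trigger-plus-context rule pattern.
import re

IRONY_MARKERS = {"moqueusement", "ironie", "railler", "parodier"}
SPEECH_MARKERS = {"dit-on", "dit", "répond", "écrit"}

def annotate(sentence):
    words = set(re.findall(r"[\w-]+", sentence.lower()))
    if words & IRONY_MARKERS and words & SPEECH_MARKERS:
        return "Ironie"   # the rule fires: irony marker + context marker
    return None

print(annotate("Il lui faut, dit-on moqueusement, cinq épithètes !"))
```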
Rules in Excom2 can be ordered by priority and can take the results of other rules as input. Before the annotation step, the tool segments the texts into sections, paragraphs and sentences. For irony, we created 8 rules associated with the different linguistic markers.

4. Corpus
The present study draws on texts with theatrophile or theatrophobe argumentation, as well as on texts adopting a "measured" strategy, which consists in denouncing the excesses of the stage in order then to convince the reader to preserve the intrinsic usefulness of the theatre. These three types of texts often follow a roundabout logic that can disconcert the reader: beneath the unfolding of the chapters one sometimes finds implicit connections, a very widespread use of irony, and negative sentences that markedly affect content detection. With its reprises, used to reiterate an argument or, on the contrary, to turn it against the adversary, this corpus of theatre controversies lends itself well to digital analysis. The corpus gathers 59 texts (about one million words) written in French from the mid-16th century to the 1840s (see the French bibliography constituting the full corpus of the Haine du Théâtre project: http://obvil.paris-sorbonne.fr/corpus/hainetheatre/bibliographie_querelle-france/). The texts were previously digitized and edited within the Labex OBVIL of Paris IV-Sorbonne and are freely available online, as part of the corpus of "La Haine du théâtre", a project directed within the Labex OBVIL by François Lecercle and Clotilde Thouret (Lecercle et al., 2016), http://obvil.paris-sorbonne.fr/projets/la-haine-du-theatre.

5. Evaluation
A first phase of tests on a sample of the corpus was needed to stabilize the identification and disambiguation rules. To assess the quality of the resulting annotations, we focused first on precision. We annotated another part of the corpus with Excom2 (7 documents, 215,675 words) and obtained 416 annotations. We then asked a person familiar with the works of this period to judge the system's outputs according to an annotation guide. For each annotation, the evaluator had to choose between "Correct", "Incorrect" and "I don't know". The evaluation criterion was: does the author of the text allude to irony in the sentence at hand? We obtained a precision of 93.9%.

6. Difficulties encountered
We ran into several difficulties. At the lexical level, few changes were made to our markers; one example is the word satire, which occurs under two spellings, satire (88 occurrences) and satyre (68 occurrences). In modern French the latter denotes the demigod companion of Dionysus or Bacchus, but in some texts that have not yet been modernized, written in 16th- or 17th-century French, it denotes "satire" in the broader sense. On the other hand, some markers are polysemous and generate noise, such as ridicule (437 occurrences, the most frequent marker), plaisanter (176 occurrences) and comique (131 occurrences). Example [Rousseau, 1758]: "Le ridicule est l'arme favorite du vice. C'est par elle qu'en attaquant dans le fond des cœurs le respect qu'on doit à la vertu, il éteint enfin l'amour qu'on lui porte."
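The precision figure above can be reproduced mechanically from the evaluator's judgements; the counts below are illustrative only, and the treatment of "I don't know" answers (excluded from the denominator) is our assumption, as the paper does not state how they were counted.

```python
# Precision over human judgements of the 416 system annotations
# (counts are invented placeholders, not the paper's actual tallies).
judgements = ["Correct"] * 390 + ["Incorrect"] * 25 + ["Je ne sais pas"] * 1

correct = judgements.count("Correct")
incorrect = judgements.count("Incorrect")
precision = correct / (correct + incorrect)   # "don't know" left out
print(f"precision = {precision:.3f}")
```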
Regarding 17th- and 18th-century syntax, we observed considerable complexity: sentences are sometimes very long (five lines or more), and punctuation is unstable, with several commas, semicolons and so on following one another within a single sentence. Moreover, the authors of our corpus use intricate turns of phrase. Very often these sibylline sentences are ironic, all the more so when they are interrogative.

7. Interpretation of the results
In the study of the debates on the theatre, expressions of irony are a fruitful way into the corpus. We first observe that, across the centuries covered by the Haine du Théâtre project (16th to 19th), the use of irony stays between 0.20% and 0.30% (1,265 annotations in total). We then analysed the irony markers by studying their relative presence per century, keeping only those whose share within a given century exceeds 5%.

Figure 1: Irony markers in the HdT corpus, weighted by century.

A considerable drop occurs in the 17th century. While it would be premature to draw hasty conclusions, we can already note that it is probably due to the assertion of religion, of the classical order and of state authoritarianism, which was insinuating itself into the minds of the writers of that period. From the 17th to the 19th century, by contrast, the values of these markers increase fairly steadily. In general, irony is used in the corpus as an ethical and stylistic device, which makes the authors quite effective in elaborating their vision of the quarrel. Whether theatrophobe or theatrophile, they can play with the nuances of the irony markers, hide a double meaning in their sentences, and say figuratively the opposite of what they communicate literally. For instance, we find a considerable presence of the lemma mépris in the 17th and 18th centuries, mainly reflecting a use of irony as a mechanism regulating social life. Notably, Conti and Voisin use an inoffensive humour against the excesses of the art and put propriety first: "Ceux qui vont aux Spectacles, non par hasard, mais de propos délibéré, et avec tant d'ardeur, qu'ils abandonnent l'Eglise par un mépris insupportable pour y aller, où ils passent tout le jour à regarder ces femmes infâmes, auront-ils l'impudence de dire qu'ils ne les voient pas pour les désirer" [Conti 1667, Voisin 1671]. Hypocrisie begins to be used in the 17th century, and its use declines over time (down to 1% in the 19th). This lemma essentially appears in sentences where irony is but "another name for misfortune" (Martin, 2009), a way of reinforcing the author's point of view: « L'hypocrisie est un vice privilégié, qui ferme la bouche à tout le monde, et qui jouit en repos d'une impunité souveraine. » [Coustel 1694]
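The per-century weighting used for Figure 1 can be sketched as follows; the counts, sub-corpus sizes and data structures are invented placeholders, not the corpus figures.

```python
# Per-century weighting of irony markers: counts are normalized by the
# size of each century's sub-corpus, and only markers whose share within
# a century exceeds 5% are kept (toy data).
counts = {                 # marker -> {century: occurrences}
    "ridicule": {"17e": 120, "18e": 210},
    "mépris":   {"17e": 90,  "18e": 60},
}
corpus_size = {"17e": 400_000, "18e": 350_000}   # tokens per century

for century, size in corpus_size.items():
    total = sum(c[century] for c in counts.values())
    for marker, c in counts.items():
        share = c[century] / total
        if share > 0.05:                    # 5% threshold within the century
            print(century, marker,
                  f"weighted={c[century] / size:.6f}", f"share={share:.2f}")
```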
Very widespread in the corpus is the use of irony as a satirical echo. The lemma calomnier, present in the texts from the 17th to the 19th century, is an example: "[…] cessez de calomnier vos contemporains selon l'usage immémorial de ceux qui profèrent de vaines paroles." [Senancour 1825]

Figure 2: Weighted values of the irony annotation in the corpus.

These first results allowed us to draw very interesting comparisons between the texts of the theatre's defenders and those of its detractors. Starting from the ironic expressions correctly identified as such, we counted occurrences and compiled statistics for each author of the annotated corpus. In relative terms, the authors who use irony markers the most belong to the "Rousseau quarrel" (mid-18th century). This remains to be analysed in perspective, but it can be linked to the use of irony in the 18th century, as several writings on Voltaire attest (Loriot, 2015). D'Alembert's words are telling in this respect and shed light on the role of irony [Alembert, 1759]: "Si la satire et l'injure n'étaient pas aujourd'hui le ton favori de la critique, elle serait plus honorable à ceux qui l'exercent, et plus utile à ceux qui en sont l'objet." The linguistic markers detected for this period belong to the semantic sphere of the ridiculous, satire, farce and the comic. (Note that the marker ironie and all its variants have only 11 occurrences in the corpus!) Other verbal markers, such as se moquer and plaisanter, appear in this quarrel and are shared with the writings of the previous controversy, dating from the mid-17th century. The irony values of the latter, whose theatrophobe representatives include Conti and Nicole, are however lower (0.06 vs. 0.17). Besides these verbal markers, we can cite noun categories such as le ridicule and le faire rire. In the same period Aubignac, an author of the offensive-defensive strategy, starts from a critique of the theatre to arrive at its defence. He draws on the usual markers of the 17th century, takes up his colleagues' words in his own sentences in order then to refute them, and resorts more specifically to ironic markers such as railler and idiot. Contemporary with Aubignac, the quarrel between Caffaro and Bossuet yields surprising results: while Caffaro uses few irony-related markers (0.05), Bossuet leads his contemporaries (0.27). Like the other authors, Bossuet draws on the markers of the comic and the ridiculous, as well as the verb plaisanter. Nevertheless, his results also contain words from the category of "piquant" markers [Bossuet, 1694]: "Il ne faut pas s'étonner que l'église ait improuvé en général tout ce genre de plaisirs [les spectacles…] à cause que communément, ainsi que nous l'avons remarqué, par sa bonté et par sa prudence, elle épargne la multitude dans les censures publiques : néanmoins parmi ces défenses, elle jette toujours des traits piquants contre ces sortes de spectacles, pour en détourner tous les fidèles."
We thus understand that, in judging the theatre incompatible with Christian morality, Bossuet favours a lively, biting style, backing the Church while denigrating the theatre's defenders. Research on the strategies of the theatre quarrel, while questioning each author's argumentative modalities and circumstantial aims, also reveals recurring ideas about how the theatre is regarded. The various texts share a number of commonplaces, such as the idea of perversion, temporal inflation, and economic and political arguments.

8. Discussion and perspectives
In this article we presented a rule-based approach to the automatic detection of irony and related forms in a corpus of debates on the theatre (16th to 19th century). The method we adopted provided abundant material and quantitative data for delimiting the object of study. Given the particularity of the linguistic phenomenon studied and the simplicity of our surface-analysis approach, we consider these first results very encouraging (93.9% precision), and they deserve to be pursued in order to extract the greatest benefit in terms of exploitation and precision. We plan to compute recall for the annotation and to identify the sources of the annotated segments (the speakers). One of our goals is also to annotate negative sentences and to analyse their association with irony (Mainardi et al., 2015), which would open up new lines of research in the digital humanities.

References
Alrahabi, M. (2010). EXCOM-2 : plateforme d'annotation automatique de catégories sémantiques. Applications à la catégorisation des citations en français et en arabe. PhD thesis, Université Paris-Sorbonne.
Joshi, A., Bhattacharyya, P., Carman, M. J. (2016). Automatic Sarcasm Detection: A Survey. ACM Comput. Surv., V, N, Article A (January 2016).
Lecercle, F., Mainardi, C., Thouret, C. (2016). Pour une exploration numérique des polémiques sur le théâtre. RHLF, n°116/4, dir. Didier Alexandre, Littérature et humanités numériques, PUF.
Loriot, C. (2015). Rire et sourire dans l'opéra-comique en France aux 18ème et 19ème siècles. Lyon, Symétrie.
Mainardi, C., Sellami, Z., Jolivet, V. (2015). A Semantic Exploration Method Based on an Ontology of 17th Century Texts on Theatre: la Haine du Théâtre. First International Workshop on Semantic Web for Cultural Heritage (SW4CH 2015), New Trends in Databases and Information Systems, 539, pp. 468-476, Communications in Computer and Information Science.
Martin, L. (2009). Le rire est une arme. L'humour et la satire dans la stratégie argumentative du Canard enchaîné. A contrario, 2009/2 (n° 12), 26-45.
Riguet, M., Alrahabi, M. (2017). Pour une analyse automatique du Jugement Critique : les citations modalisées dans le discours littéraire du XIXe siècle. DHQ: Digital Humanities Quarterly, 2017.

Migrants et réfugiés : dynamique de la nomination de l'étranger
Mohammad Alsadhan, Sascha Diwersy, Agata Jackiewicz, Giancarlo Luxardo
Praxiling UMR 5267 (Univ Paul Valéry Montpellier 3, CNRS)
muhammad.alsadhan@univ-montp3.fr, sascha.diwersy@univ-montp3.fr, agata.jackiewicz@univ-montp3.fr, giancarlo.luxardo@univ-montp3.fr

Abstract
The migrant crisis experienced by Europe in recent years gave rise to intense debates, both in the media and in politics.
We address here the question of the nomination used for the newcomers, which we propose to study by comparing two French nominalizations, migrant and réfugié. Using their combinatory profiles, we seek to highlight the contrast between the two terms and the changes in their semantics and axiological charge. To do so, we rely on a large corpus of texts covering a three-year period: the French parliamentary debates of the Assemblée Nationale. The comparative study of the combinatory profiles of the two terms shows that they have both shared and unshared collocates, and that their profiles tend, overall, to converge.

Résumé
In recent years, the migration crisis in Europe has sparked heated political and media debates. We are interested here in the question of how the newcomers are named, which we propose to study by comparing the two nominalizations migrant and réfugié. Starting from their combinatory profiles, we seek to bring out the contrast between these two terms and the changes in their semantics and axiological charge. To this end, we rely on a corpus covering a period of about three years: the debates of the Assemblée Nationale. The comparative study of the combinatory profiles associated with the two terms shows both shared and unshared collocates, and that their profiles tend, overall, to converge.

Keywords: political discourse, cooccurrences, diachronic data and hierarchical clustering, curve clustering.

1. Introduction
In 2015 the European Union experienced a massive arrival of non-European foreigners, which gave rise to formulas such as "crise migratoire" (migration crisis) and "crise des réfugiés" (refugee crisis). Against a background of sharply divided public opinion, this crisis led to contrasting political positions in each country concerned and to compromises that were hard to reach. The political and media debates focused first on the care of the victims and on the right of asylum to be granted to the newcomers, as well as on the fight against illegal smuggling networks, with "pro-immigration" and "anti-immigration" positions. But since the phenomenon is partly explained by the ongoing conflicts to the south and east of Europe, the question of how to designate the people concerned was raised. Whereas until then the "migrants" had mainly been seen as driven by economic prospects, it was observed that part of these people should instead be called "réfugiés" (refugees) or "demandeurs d'asile" (asylum seekers). Other terms, such as "clandestins", were also used. We seek here to question the dynamics of the nomination used in political debates. Starting from a corpus of parliamentary debates, we apply several classification procedures based on the diachronic nature of the data.

2. The corpus of parliamentary debates
We hypothesize that the discourses around the migration crisis use the two terms migrant and réfugié partly interchangeably, and partly in contexts where only one of the two is possible. We propose to bring out this distinction between several discursive uses through the neighbourhood of the two terms, and to assess its variation first across political discourse and then over time.
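The "neighbourhood" comparison announced here can be sketched as a simple windowed collocate count, a crude stand-in for the dependency-based profiles used later in the paper; the speech turn below is a toy example.

```python
# Windowed collocates of a target term: count the tokens occurring within
# +/- `width` positions of each occurrence of the target (toy data).
from collections import Counter

def window_collocates(tokens, target, width=5):
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - width), i + width + 1
            counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

turn = ("nous devons accueillir le réfugié et le migrant "
        "dans une situation de crise").split()   # stand-in speech turn
print(window_collocates(turn, "migrant"))
print(window_collocates(turn, "réfugié"))
```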
The corpus treated below is built from the transcriptions of the public-session debates of the Assemblée Nationale from January 2014 to February 2017 (corresponding to the end of the 14th legislature). The textual data, published in XML and freely available at data.assemblee-nationale.fr, amount to about 28.6 million word tokens. They were transformed and enriched with linguistic annotations following the methodology described by Diwersy et al. (2018). Numerous metadata are defined on this corpus, but in what follows we concentrate on the date (month-year) associated with a basic structural unit corresponding to a speech turn (one deputy's intervention).

3. Chronological analysis
The evolution of the semantics of the terms migrant and réfugié can be studied by combining methods involving: (i) the frequencies of the two lemmas in the corpus; (ii) their collocational profiles, which can reveal specific semantic fields; (iii) the variation over time of the similarity of these collocational profiles, together with the characterization of each collocate's contribution to the evolution of the similarity scores obtained.

Figure 1: Relative frequencies of the two lemmas per quarter.

The evolution of the relative frequencies of the two lemmas per quarter is shown in Figure 1. It reveals parallel frequency curves, with a peak in the use of both terms around September 2015. The rank correlation between the two frequency series, measured by Kendall's tau, is significant here (about 0.74, for a p-value of 0.0005). In what follows, the time unit is the quarter, yielding analyses over 13 quarters for the period covered. To produce a more precise periodization, we implemented an approach combining dependency-relation annotations, the creation of lexicograms representing the quarterly collocational profiles of the two terms (ranked by the score of Fisher's exact test), and contiguity-constrained hierarchical agglomerative clustering (CAHC), cf. (Diwersy and Luxardo, 2016; Gries and Hilpert, 2008). A CAHC can be built in two ways:
- for each lemma, by computing the similarity between two successive quarters using the Pearson product-moment correlation coefficient;
- by computing the variation of the similarity between the vectors representing the collocational profiles of the two lemmas, using the standard deviation accumulated over two successive quarters.
The first method shows that the largest variations occur in the initial quarters, up to the peak of the crisis. The second method, which allows the comparison of the two terms to be shown in a single graph, is represented in Figure 2.

Figure 2: Contiguity-constrained clustering of the quarters, with the similarity curve superimposed.

This hierarchical clustering reveals seven stages (shown as seven grey zones). The evolution of the similarity score is shown by a curve superimposed on the dendrogram, confirming an overall growth from 0 to 0.2 (with a peak at 0.6). The transition from one period to the next is marked by an increase up to P03 (the third quarter of 2015, following the peak of the crisis), but with a decline from P03 to P05 and from P06 to P07.

Figure 3.
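Two of the computations just described can be sketched with toy series: Kendall's rank correlation between the two quarterly frequency series, and an agglomerative clustering in which only adjacent quarters may merge, standing in for the contiguity-constrained CAHC. The frequency values below are invented.

```python
# Kendall's tau between two quarterly series, and contiguity-constrained
# agglomerative clustering of the quarters (toy data).
import numpy as np
from scipy.stats import kendalltau
from sklearn.cluster import AgglomerativeClustering

migrant = np.array([1.0, 1.2, 2.5, 6.0, 3.0, 2.0, 1.5])   # freq per quarter
refugie = np.array([0.8, 1.0, 2.0, 7.0, 3.5, 2.5, 1.2])

tau, p = kendalltau(migrant, refugie)
print(f"Kendall tau = {tau:.2f}, p = {p:.4f}")

# Contiguity: quarter i may only join quarters i-1 and i+1.
n = len(migrant)
connectivity = np.eye(n, k=1) + np.eye(n, k=-1)
X = np.column_stack([migrant, refugie])
labels = AgglomerativeClustering(
    n_clusters=3, connectivity=connectivity, linkage="ward").fit_predict(X)
print(labels)   # contiguous runs of quarters = periods
```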
4. Evolution of the combinatory profiles and discursive orientations
This section aims to spell out the linguistic factors behind the statistical tendencies established in the previous part: on the one hand, the semantic differences between migrant and réfugié as they show up in their differential profiles; on the other, the main points concerning their distributional similarity. The differential profiles consist of the collocates exclusive to each of the two nouns, which therefore never contribute to the similarity of their combinatory profiles. Table 1 gives an overview restricted to the most salient collocates, those located in the first decile of the collocational inventories in terms of association score.

Table 1: Differential profiles formed by the first decile of the collocates exclusive to migrant and to réfugié.
migrant
- downstream dependencies: epithets: irrégulier, illégal, clandestin, âgé; noun complements: Calais, Calaisis, situation
- upstream dependencies (governing terms): as direct object of: dissuader, entasser, refouler, secourir; as noun complement of: retour, langue, déferlement, réadmission
réfugié
- downstream dependencies: epithets: politique, palestinien, afghan, vietnamien, irakien, cambodgien, persécuté, réinstallé; noun complement: guerre
- upstream dependencies (governing terms): as subject of: affluer; as noun complement of: statut, protection, (Haut-)Commissariat, qualité, relocalisation, distinction, concubin, défi
- coordination with: apatride, bénéficiaire, déplacé, migrant

Among the salient collocates of the noun réfugié, we first note the strong presence of a series of terms (statut, qualité; (Haut-)Commissariat; protection; apatride) that refer to the framework of international-law provisions imposing on the authorities a duty of assistance towards people whose departure from their usual place of residence is considered forced by an existential threat. (Lower deciles, not documented here, contain further collocates of the same series, such as statutaire or conventionnel.) Categorizing a person by means of the term réfugié therefore carries legal, administrative and political stakes, whose scope can be regulated, on the one hand, by explicit paradigmatic contrasts with other terms within a coordination (cf. the collocates apatride, bénéficiaire, déplacé and migrant) and, on the other, by secondary categorizations expressed by nominal expansions (epithets or noun complements) characterizing the causes of the forced departure. Through the modifiers of réfugié involving a causal relation (politique, persécuté; (de) guerre), a paradigm is built up, and ultimately a hierarchy of potentially legitimate or illegitimate causes, and of responses to their consequences. (This sub-categorization often goes hand in hand with a modalization of categorial membership, expressed by the epithet véritable, which, together with vrai and authentique, forms a series of collocates, belonging to the category of enclosures, exclusive to réfugié but represented at lower ranks of the cooccurrence inventory.) Alongside these modifiers, which directly denote the cause of the forced departure, we find a whole series of ethnonymic adjectives (palestinien, afghan, vietnamien, irakien, cambodgien) that denote it indirectly, relying on shared knowledge of the troubled history of these countries. This discursive environment shows that the word réfugié presents itself as the nomination of a legal status and is integrated into a positively oriented argumentation.
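How such a differential profile can be derived is sketched below: collocates are scored with Fisher's exact test, as in the lexicograms described earlier, and collocates attested for only one of the two nouns form its exclusive profile. All counts are toy values, not the corpus figures.

```python
# Differential profiles: Fisher-exact association scoring plus a set
# difference between the two nouns' collocate inventories (toy data).
from scipy.stats import fisher_exact

def assoc(k_pair, k_colloc, k_noun, n):
    """p-value of Fisher's exact test on the 2x2 collocation table."""
    table = [[k_pair, k_colloc - k_pair],
             [k_noun - k_pair, n - k_colloc - k_noun + k_pair]]
    return fisher_exact(table, alternative="greater")[1]

profiles = {"migrant": {"clandestin": 40, "afflux": 25},
            "réfugié": {"statut": 90, "afflux": 30}}

exclusive = {noun: set(p) - set(profiles[other])
             for noun, other, p in [("migrant", "réfugié", profiles["migrant"]),
                                    ("réfugié", "migrant", profiles["réfugié"])]}
print(exclusive)   # {'migrant': {'clandestin'}, 'réfugié': {'statut'}}
print(assoc(40, 60, 500, 100_000))   # score for one (noun, collocate) pair
```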
The collocates of migrant reveal a quite different semantic profile, in that this term puts at the centre of attention the question of (non-)conformity with legal provisions imposed on people whose stay in a territory other than their place of origin is considered the result of a displacement driven by utilitarian, and first of all economic, considerations. It is to this semantic dimension that the salient series of collocates irrégulier, illégal, clandestin and situation relates in the differential profile of migrant (situation contrasting, from this point of view, with statut and qualité, collocates exclusive to réfugié). Having inherited the aspectual traits of the -ant participle from which it is derived by conversion, the noun migrant presents the momentary stay of the person so qualified at a given place as one episode in an unaccomplished series of displacements. (Réfugié, by contrast, results from the nominalization of a past participle and is associated with the representation of a single, accomplished episode of displacement, viewed from its origin.) Through collocates such as dissuader, refouler and retour, this stay and these displacements are characterized as depending both on the will of the people on the move and on the benevolence or refusal of the authorities who potentially control them. Should we see here the inferential motivation of the negative evaluation conveyed by a term such as déferlement, in contrast to its axiologically more neutral variants afflux, flux and arrivée, which all belong to the shared collocates of migrant and réfugié? To better characterize the shared collocates that contribute most to the evolution of the distributional similarity of the two nouns, we applied the classification method proposed by Trevisani & Tuzzi (2016) to the time series of the products of each collocate's normalized association scores; these products are the terms of the sums (dot products) that constitute the similarity indices used.
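The decomposition just described can be made concrete: if each noun's profile is an L2-normalized vector of association scores, the quarterly similarity is their dot product, and each shared collocate contributes one term of the sum. The scores below are invented.

```python
# Per-collocate contribution to the dot-product similarity of the two
# nouns' normalized association-score vectors (toy data).
import numpy as np

collocates = ["accueillir", "accueil", "crise", "afflux", "millier"]
migrant = np.array([3.0, 2.0, 5.0, 4.0, 6.0])   # association scores
refugie = np.array([2.5, 1.0, 6.0, 3.0, 5.0])

m = migrant / np.linalg.norm(migrant)
r = refugie / np.linalg.norm(refugie)
contributions = m * r                  # one term per shared collocate
print(dict(zip(collocates, contributions.round(3))))
print("similarity =", contributions.sum().round(3))   # the dot product
```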
D’un point de vue sémantique, ces 5 collocatifs, qui, à différents moments de la série chronologique analysée, occupent les premiers rangs en termes de contribution aux scores de similarités respectifs, forment tout un condensé de la trame discursive impliquant les noms migrant et réfugié au cours de la période étudiée, avec :  millier et afflux, qui renvoient à une affluence perçue comme massive ;  crise, qui caractérise ce processus comme ayant atteint un point culminant à fort potentiel de déstabilisation ;  ainsi que accueillir et accueil qui se rapportent à la prise en charge des conséquences immédiates du processus concerné. Facteurs distributionnels de premier ordre, ces collocatifs placent migrant et réfugié dans un rapport paradigmatique associé à plusieurs dimensions sémantiques, qui, en vue des orientations argumentatives fortement divergentes instaurées par les deux noms (cf. supra), fait de leur choix un véritable enjeu discursif. 5. Conclusion et perspectives Les prolongements de cette étude exploratoire sont nombreux. En partant de Wihtol De Wenden (2016), il nous semble possible de construire un modèle d’analyse comportant cinq catégories qui sont autant de facettes du phénomène migratoire actuel : (i) origines et causes des migrations, (ii) profils des migrants, (iii) situation des migrants, (iv) gouvernance des migrations, (v) mobilité et restrictions migratoires. L’application de cette grille de lecture aux collocations impliquant les termes réfugié et migrant (ou encore leurs équivalents), peut s’avérer une piste de recherche prometteuse qui permet de donner aux résultats de l’analyse linguistique que nous venons d’effectuer une dimension transdisciplinaire, comme c’est par exemple le cas pour la différence entre facteurs « push » (poussant les individus à partir de leur pays) et « pull » (incitant les individus à venir dans un pays spécifique) établie par Wihtol de Wenden, différence qui se reflète dans la divergence fondamentale de l’orientation argumentative des programmes de sens propres aux noms étudiés, en ce que réfugié implique la notion de départ forcé alors que migrant évoque l’idée d’un déplacement volontaire. Si la figure du réfugié ou du migrant est essentiellement une construction politique (Wihtol De Wenden, 2016, p. 50) – ce que confirme d’ailleurs le profil collocationnel du terme correspondant tel qu’il se manifeste dans le corpus de discours parlementaire analysé - les différents (et nombreux) profils des personnes en déplacement peuvent être étudiés à partir des témoignages qu’elles livrent à propos de leur expérience migratoire. C’est l’objet d’une enquête menée auprès de Syriens arrivés en France depuis 2012, 18 JADT’ 18 qui se situe dans le prolongement du présent article et qui comporte à ce stade un volet uniquement qualitatif, dont les résultats préliminaires (Alsadhan et Richard, 2018) montrent que, lorsque le choix se présente, c’est bien le vocable réfugié qui est privilégié en tant qu’auto-désignant. Références Alsadhan, M., Richard A. (2018, à paraître). La réception des réfugiés Syriens du discours médiatico-politique identitaire français, in Sandré M., Richard A. & Hailon F. : Le discours politique identitaire face aux migrations, No 8 de la revue Studii de lingvistica. Diwersy, S., Luxardo, G. (2016). Mettre en évidence le temps lexical dans un corpus de grandes dimensions : l’exemple des débats du Parlement européen, in Mayaffre D., Poudat C., Vanni L., Magri V. & Follette P. (éds.) 
Diwersy, S., Frontini, F. and Luxardo, G. (2018, forthcoming). The Parliamentary Debates as a Resource for the Textometric Study of the French Political Discourse. In Proceedings of the ParlaCLARIN workshop, 11th Language Resources and Evaluation Conference (LREC 2018).
Gries, S. T. and Hilpert, M. (2008). The identification of stages in diachronic data: variability-based neighbour clustering. Corpora, 3(1): 59-81.
Trevisani, M. and Tuzzi, A. (2016). Analisi di dati testuali cronologici in corpora diacronici: effetti della normalizzazione sul curve clustering. In Mayaffre D., Poudat C., Vanni L., Magri V. and Follette P. (eds), JADT 2016 : Actes des 13es Journées internationales d'Analyse statistique des Données Textuelles, Nice. URL: http://lexicometrica.univ-paris3.fr/jadt/jadt2016/01ACTES/82630/82630.pdf.
Wihtol de Wenden, C. (2016). Migrations. Une nouvelle donne. Éditions de la Maison des sciences de l'homme, Paris.

Xplortext, an R package. Multidimensional statistics for textual data science

R. Alvarez-Esteban1, M. Bécue-Bertaut2, B. Kostov3, F. Husson4, J.-A. Sánchez-Espigares2
1 Universidad de León – ramon.alvarez@unileon.es
2 Universitat Politècnica de Catalunya – monica.becue@upc.edu; josep.a.sanchez@upc.edu
3 Institut d'Investigacions Biomèdiques August Pi i Sunyer – belchin3541@gmail.com
4 Agrocampus Ouest – husson@agrocampus-ouest.fr

Abstract
We present the package Xplortext for textual data science, which provides classical and novel features for textual analysis. Starting from the corpus encoded into a lexical table, aggregated or not, several problems are dealt with: revealing both document and word structures and their mutual relationships by applying correspondence analysis (CA); comparing the structures of several corpora by using multiple factor analysis for contingency tables (MFACT); uncovering complex relationships between words and contextual variables via CA for a simple or a multiple generalized aggregate lexical table (CA-GALT and MFA-GALT); clustering documents by means of a hierarchical clustering algorithm (HCA); and evaluating the evolution of the vocabulary along time by means of a chronologically constrained hierarchical clustering algorithm (CCHCA).
Keywords: Xplortext, R package, textual data, contextual data, correspondence analysis, multiple factor analysis for contingency tables, generalized aggregate lexical table, hierarchical clustering, contiguity-constrained hierarchical clustering, labeled tree.

1. Introduction
R offers numerous tools for textual data science. However, among them, multidimensional statistics is not as well represented as it should be. Xplortext, a new R package, intends to fill this gap. Its features are based on the exploratory approach to texts, in the line of the works by Benzécri (1981) and Lebart et al. (1998). The fundamental choice behind the design of Xplortext is to offer classical and novel textual analysis methods based on multidimensional statistics in a single package. The main issues considered were:
• classical multidimensional statistical methods, among which CA remains the core method;
• novel methods, favoring those able to jointly analyze textual and contextual data, so as to know not only who says what (taking here the title of a paper by Lebart) but also why he/she is saying it;
• numerous graphical outputs providing great flexibility in choosing the elements to be represented;
• specific methods for dealing with chronological corpora.

2. Example
The political speech corpus used as an example consists of 11 documents of about 10,000 occurrences each. These are the "investiture speeches" delivered by 6 Spanish presidential candidates from 1979 to 2011: Suárez (1979), Calvo-Sotelo (1981), González (1982, 1986, 1989 and 1993), Aznar (1996 and 2000), Zapatero (2004 and 2008) and Rajoy (2011).

3. Encoding the textual data and basic statistics
Xplortext takes advantage of functions of the R package tm to import the corpus. Mainly, plain text files (typically .txt) and spreadsheet-like files (.csv, .xls) are considered. By default, plain text and CSV files are assumed to use the native local encoding: usually latin1 on Windows, and utf8 on Mac or Linux. The encoding of the file can be given in the R read command; if necessary, the corpus can be saved in a known encoding beforehand. In any format, one row corresponds to one document. The text to analyze can fill one or several columns; the remaining columns provide information about the documents and are automatically imported as contextual (quantitative and/or qualitative) variables. Textual and contextual data must be located in the same file. Conversion to lower/upper case, number removal and punctuation removal are managed by Xplortext depending on the arguments of the TextData function. Stopwords can be taken into account using the lists provided either by Xplortext (issued from tm) or by the user. The importing step ends with the encoding of the corpus into a documents × words table (lexical table) and, possibly, a documents × repeated-segments table (segmental table). Another option is to ask for an aggregate lexical table according to the categories of a variable. Then, elementary indicators, such as the corpus and vocabulary sizes, are computed, and the word and repeated-segment indices are listed and represented by a histogram visualizing their frequencies (Fig. 1). Classical summaries of the contextual variables are given.

Figure 1: Most frequent words and repeated segments
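As a minimal sketch of this import step, the call below assumes a hypothetical CSV file speeches.csv with one row per speech, the text in a column named text and the contextual variables in the remaining columns; the argument names follow our reading of the Xplortext documentation and may differ across package versions (see ?TextData).

library(Xplortext)

speeches <- read.csv("speeches.csv", fileEncoding = "UTF-8")  # one row per document

# Build the documents-by-words lexical table (and, optionally, the
# documents-by-repeated-segments table); lower-casing, number removal and
# stopword filtering are handled by TextData itself.
td <- TextData(speeches, var.text = "text",
               lower = TRUE, remov.number = TRUE,
               stop.word.tm = TRUE, idiom = "es",   # Spanish stopword list (assumed argument)
               segment = TRUE)

summary(td)  # corpus size, vocabulary size, most frequent words and segments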
4. Correspondence analysis as a core method
Correspondence analysis (CA) is a core method in Xplortext, revealing both document and word structures and their mutual relationships.

4.1. CA and the content and form of a corpus
Both the content and the form of a corpus show up in CA results. In fact, content is better captured when placed back into the form, since "form is the substance that rises to the surface", in the words of Victor Hugo. Figure 2 shows the factor maps issued from a CA performed on the documents × words table. The trajectory of the speeches is revealed, bringing out the existence of three temporal poles. The represented words are the most contributive ones and have to be read along the trajectory. In this way, they clearly illustrate the three poles and allow us to capture the meaning of the evolution. Note that the confidence ellipses around the documents are very narrow.

Figure 2: Documents and the most contributive words on the first CA plane

4.2. Multiple factor analysis for contingency tables
When dealing with a multiple contingency table (a juxtaposition of several contingency tables), multiple factor analysis for contingency tables (MFACT; Bécue-Bertaut and Pagès, 2004; Bécue-Bertaut and Pagès, 2008), an extension of CA, turns out to be useful. Very different aims can be pursued: for example, comparing the document structures issued either from using different thresholds on word frequency (10, 20, 30 or 50; 4 lexical tables) or from keeping or discarding the tool words or the stopwords (2 lexical tables). MFACT offers a large number of graphical and numerical results, either similar to those of any principal component method (such as PCA or CA) or specific to the comparison of the structures induced on the rows by the groups of columns. Among the latter, the representation of the groups provides a synthetic tool: each group is represented by a single point, revealing the global dissimilarities between the group structures (Fig. 3).

Figure 3: Synthetic representation of the groups as issued from MFACT

4.3. Generalized aggregate lexical tables
Correspondence analysis on a generalized aggregate lexical table (CA-GALT; Bécue-Bertaut and Pagès, 2015; Bécue-Bertaut, Kostov and Pagès, 2014) deals with two paired tables (a frequency table and a contextual-variables table) observed on the same statistical units. In textual analysis, the frequency table is a lexical table and the statistical units are the documents. This method can be seen as a canonical correspondence analysis (CCA; ter Braak, 1986) approach to texts. It enables the study of the relationships between contextual variables and words while untangling the respective influences of the variables/categories on the lexical choices, so as to avoid spurious relationships. MFA-GALT (multiple factor analysis for a series of generalized aggregate lexical tables; Kostov, 2015) deals with several paired tables, possibly defined on several sets of statistical units, while the set of variables is common to all the contextual tables. In textual analysis, MFA-GALT compares the relationships between words and variables across these paired tables. A favored application concerns surveys answered in different languages by several samples, with the open-ended and closed questions common to all of them.
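A correspondingly minimal sketch of the CA step, reusing the TextData object td from the previous sketch: LexCA is the package's CA function, but the exact options shown here are assumptions to be checked against the package documentation.

# Correspondence analysis on the documents-by-words lexical table
res.ca <- LexCA(td, ncp = 5, graph = FALSE)

summary(res.ca)   # eigenvalues, document and word coordinates and contributions
plot(res.ca)      # first factorial plane (documents and words)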
5. Clustering algorithms
A classical hierarchical clustering algorithm (HCA) is included in Xplortext. Clustering starts from the document coordinates on the CA dimensions. An exhaustive description of the clusters is provided, extracting their characteristic words and looking for the differentiated behavior of the variables in the clusters. The number of clusters is derived from the hierarchical tree structure, and an automatic suggestion is made. A chronologically constrained hierarchical clustering algorithm (CCHCA) is also offered, in which only chronologically contiguous nodes can be grouped. Further, the tree is described by the chronological words, defined as follows: the characteristic words of each node are identified, but each word is finally associated with only one node, the one that it best characterizes. These words are used to label the nodes (Fig. 4). Although the tree could be used to determine clusters, its main role is to allow the evolution of the speeches and their vocabulary to be captured through a descending reading of the labels and nodes of the tree.

Figure 4: Labeled chronological tree

6. Work in progress
The following features will be included in the near future:
• Chronological clustering (Legendre et al., 1985) was proposed to divide a chronological series of species counts (operated at different moments) into homogeneous temporal parts. The same aggregation criterion as in chronologically constrained clustering is used, but a test is performed before aggregating two nodes to ensure their homogeneity. If homogeneity does not hold, the aggregation is not performed; as a result, the series may be divided into non-connected sub-series. This clustering method has been applied with benefit to the chronological series of words of a chronological corpus, allowing the corpus to be divided into non-connected homogeneous parts (Bécue-Bertaut et al., 2014).
• Regularized CA (Josse et al., 2016) allows a low-rank structure to be recovered from noisy data, such as textual data, by using regularization schemes via a simple parametric bootstrap algorithm.

7. Conclusion
Xplortext is published on CRAN. Bécue-Bertaut et al. (2018) present a series of applications of this package through several examples whose results are interpreted in detail. The corresponding datasets and scripts are published on the website http://xplortext.org.

References
Bécue-Bertaut, M. et al. (2018). Analyse textuelle avec R. Presses Universitaires de Rennes (PUR), Rennes.
Bécue-Bertaut, M., Kostov, B., Morin, A. and Naro, G. (2014). Rhetorical strategy in forensic closing speeches: multidimensional statistics-based methodology. Journal of Classification, 31: 85-106.
Bécue-Bertaut, M. and Pagès, J. (2004). A principal axes method for comparing multiple contingency tables: MFACT. Computational Statistics and Data Analysis, 45: 481-503.
Bécue-Bertaut, M. and Pagès, J. (2008). Multiple factor analysis and clustering of a mixture of quantitative, categorical and frequency data. Computational Statistics and Data Analysis, 52: 3255-3268.
Bécue-Bertaut, M. and Pagès, J. (2015). Correspondence analysis of textual data involving contextual information: CA-GALT on principal components. Advances in Data Analysis and Classification, 9: 125-142.
Bécue-Bertaut, M., Pagès, J. and Kostov, B. (2014). Untangling the influence of several contextual variables on the respondents' lexical choices: a statistical approach. SORT - Statistics and Operations Research Transactions, 38: 285-302.
Benzécri, J.-P. (1981). Pratique de l'Analyse des Données. Tome III. Linguistique & Lexicologie. Dunod, Paris.
Josse, J., Sardy, S. and Wager, S. (2016). denoiseR: a package for low rank matrix estimation. arXiv: 1602.01206.
Kostov, B. (2015). A principal component method to analyse disconnected frequency tables by means of contextual information (doctoral dissertation). Retrieved from http://upcommons.upc.edu/handle/2117/95759.
Lebart, L., Salem, A. and Berry, L. (1998). Exploring Textual Data. Kluwer.
Legendre, P., Dallot, S. and Legendre, L. (1985). Succession of species within a community: chronological clustering, with applications to marine and freshwater zooplankton. American Naturalist, 125: 257-288.
ter Braak, C. J. F. (1986). Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology, 67: 1167-1179.

The evolution of norms: a textual analysis of immigration policies in Italy

Elena Ambrosetti1, Eleonora Mussino2, Valentina Talucci3
1 Associate Professor, Sapienza Università di Roma
2 Associate Professor, Stockholm University
3 Researcher, ISTAT

1. Introduction
In Southern European countries, migration policies tend to privilege issues related to the entry of immigrants (e.g. regular and irregular entries, regularizations and family reunification) over aspects related to integration (Pastore, 2004; Solé, 2004). This imbalance in policy action can be attributed to the will of immigration countries to control flows, block unauthorized entries and determine the number and composition of migrants. Migration policies directly regulate the outcome of entry (or non-entry) into the destination country and subsequently orient the paths by which the foreigners admitted to Italy are inserted into its economic, social and cultural fabric. By studying immigration policies from Italian unification to the present day, we can analyse how institutional language has been transformed over the years and across legislatures, tracing various aspects of international migration in Italy. This topic is particularly important because the choice of a type of language may influence the opinions and attitudes of the Italian population towards foreigners.

2. Migration policies in Italy
Although Italy became a country of immigration in the 1970s, it adopted its first piece of immigration legislation only in 1986, following its 1975 adhesion to Convention 143 of the International Labour Organization (ILO) and the increase in immigration flows during the 1980s. Law 943/1986 (the Foschi law) concerned first of all the status of workers, and also included family reunification and access to basic welfare (Colombo and Sciortino, 2004). The law was addressed to non-EU workers, with the aim of placing them on an equal footing with Italian and European Union workers (Nascimbene, 1988; Colombo and Sciortino, 2004). It also introduced a regularization for non-EU workers already present on the territory without regular documents. In February 1990, law 39/1990 (the Martelli law) was approved by the Italian Parliament under the pressure of increasing arrivals after the fall of the Iron Curtain and the imminent ratification of the Schengen Treaty (ratified in 1993 and entered into force in 1997). Unlike the earlier Foschi law, it addressed all categories of migrants, not only workers, and is therefore considered the first comprehensive law on migration.
Nevertheless, it is remembered mainly for the regularization of about 218,000 irregular migrants. Other significant aspects covered by the Martelli law include: the introduction of a visa requirement, with a consequent tightening of border controls that made it much harder to enter Italy; the annual planning of quotas of non-EU workers through the so-called Decreto Flussi; political asylum; and, finally, stricter conditions for obtaining and renewing residence permits. In 1995, law 489/1995 (the Dini law) was issued: it contained further restrictive measures for border control, a new regularization for irregular foreign workers, and the regulation of seasonal worker flows. Unlike the restrictive measures, which were never implemented because they were deemed contrary to the Constitution, the regularization was the real success of the Dini decree, with 248,000 foreigners regularized. In 1997, with the entry into force of the Schengen agreement, alignment with the common visa policy was introduced into the Italian legal system. Still on the subject of EU legislation, law 209/1998 ratified the Treaty of Amsterdam, which entered into force in Italy that year. In the same year the government approved the Consolidated Act on immigration and on the condition of foreigners, legislative decree 286/1998 (the Turco-Napolitano law). The aim of the law was to break with the past and move towards a structural, long-term management of the migration phenomenon. The law rested on four pillars (Zincone and Caponio, 2004):
1. Prevention of and fight against irregular immigration, notably through the introduction of the immediate expulsion of irregular migrants and of temporary detention centres for undocumented immigrants awaiting expulsion;
2. Labour migration: new arrivals of foreign workers are regulated by annual quotas set each year by the Ministry of Labour; a sponsor mechanism is introduced, whereby an Italian citizen or a resident foreigner guarantees the entry of a foreigner without a work contract;
3. Promotion of the integration of migrants already resident in Italy: creation of the National Fund for integration, dedicated to financing multicultural activities and anti-discrimination actions; introduction of the long-term residence permit (carta di soggiorno) for migrants resident in Italy for at least 5 years;
4. Granting of fundamental human rights, such as basic health care, to irregular migrants.
The Turco-Napolitano law carried out the regularization of 217,000 foreigners. In 2002 the Bossi-Fini law was introduced, with the purpose of amending the 1998 Consolidated Act in a restrictive direction. More specifically, the law modified the first two pillars. The new rules adopted a series of measures aimed at discouraging the permanent settlement of migrants, among which: the abolition of the sponsor system, the reduction of the period of validity of the residence permit, and the linkage of the validity of the residence permit to a work contract (the "contratto di soggiorno").
Moreover, a more repressive policy towards irregular migrants was adopted, including forced repatriation, more systematic police controls (including the patrolling of the Italian coasts), and the detention of those who remained on Italian territory longer than allowed by their residence permit (over-stayers). In line with the previous laws, law 189/2002 regularized 634,728 immigrants, the largest regularization ever adopted in Europe up to that time (Zincone, 2006). After 2002, few changes were made to migration legislation; they concern in particular measures to combat irregular immigration, regularizations of irregular migrants present on Italian territory, and the transposition of EU directives entailing changes to existing rules. The acquisition of citizenship by birth (ius sanguinis) and by residence (ius soli) was initially regulated by law 555/1912. The conditions were very restrictive: citizenship was granted only to the child of an Italian man and, under specific conditions, to the child of an Italian woman. Law 123/1983 introduced into Italian legislation the acquisition of citizenship by marriage and reformed acquisition by birth, granting the right of citizenship equally to the child of an Italian mother or father. The acquisition of Italian citizenship was further reformed by law 91/1992, which reserved particular rights to European citizens compared with non-Europeans. Citizenship by marriage was reformed in 2009 (law 94 of 15 June), extending the required period of residence in Italy from six months to two years from the date of marriage. In recent years there have been several attempts to introduce new citizenship legislation aimed at simplifying and shortening the acquisition of citizenship for second-generation migrants (born in Italy). As a first result, art. 33 of decree 69/2013 simplified the procedure for acquiring citizenship for foreigners born in Italy. Nevertheless, a new law on the matter is still lacking to date. Italian migration legislation has constantly been characterized by the lack of an active entry policy and by the continual attempt to slow down and hinder the legal and social rooting of the foreign population on Italian territory. The continual recourse to ex-post instruments such as regularizations, the use of quotas as a way of surfacing foreign workers already present on Italian territory rather than as a channel for the entry of new workers, and the strong emphasis placed by the political class and the media on the fight against illegal immigration are emblematic examples of how the migration phenomenon in Italy is dealt with in terms of containment and control rather than openness and integration. The foreign presence is still mostly regarded as transitory, and is perceived and managed as a response to contextual emergency events.

3. Data and methods
The textual data used for this work are all the normative chapters contained in the laws on migration approved in Italy from 1912 to 2014. The proposed methodology belongs to Content Analysis carried out through automated data-processing techniques.
It is performed by applying a set of routines, supported by specific software, in this case TaLTAC2 (Trattamento automatico Lessico testuale per l'Analisi del contenuto), which make it possible to partly or fully automate the exploration, description and processing of large amounts of data; in this way, sets of unstructured texts are transformed into sets of structured texts. Besides describing the contents of the text, the corpus can be analysed according to one or more variables available on the fragments, such as the year and the government majority (1). The extraction of distinctive information, identified through a p-value test, yields for each explanatory variable a list of keywords over- or under-represented with respect to a reference model. Moreover, through lexical correspondence analysis we look for a pattern that systematically relates the lemmas and the identified dimensions to the characteristics associated with each law.

(1) The majorities are: Casa delle libertà, centre-right: Berlusconi II government, XIV legislature (30 May 2001 - 27 April 2006); centre-right coalition: Berlusconi IV government, XVI legislature (29 April 2008 - 23 December 2012); grand coalition: XVII legislature, Letta and Renzi governments, centre-left and Alternativa popolare; independent: Dini government (17 January 1995 - 17 May 1996), technocratic government; independents: Monti government (16 November 2011 - 27 April 2013), technocratic government, XVI legislature; Liberal: Giolitti government (1911-1914), UL - PR - PDC - PD - UECI - CC, centre-right; L'Unione: centre-left, XV legislature (28 April 2006 - 6 February 2008), Prodi II government; Pentapartito: political coalition DC - PSI - PSDI - PRI - PLI, IX legislature; Quadripartito: political coalition DC - PSI - PSDI - PLI, X legislature; Ulivo: centre-left, XIII legislature.

4. Results
The laws were analysed as a single corpus, which satisfies the standard minimum-size criteria required for robust analyses. A first lexicometric analysis of the text, consisting of 150,714 occurrences and 8,113 graphic forms, is reassuring about its adequate extension: the proportion of distinct words over the total occurrences (V/N*100 = 5.383) departs considerably from the 20% threshold, thus meeting the minimum statistical-significance requirement for a corpus (Bolasco, 1999). Surprisingly, the level of sophistication of the language is not particularly high, as shown by the percentage of hapax (V1/V*100 = 28.350%) and by Zipf's coefficient a (1.325). Looking at the vocabulary, the first non-empty word is comma (1,529 occurrences), followed by numero (1,160) and articolo (1,066). The other theme words, i.e. the nouns appearing most frequently in the text, are straniero, decreto, Stato, disposizioni, ingresso, territorio and soggiorno. We then compared our vocabulary with the "lexicon of governmental programmatic discourse" (Bolasco, 1999) to determine how distinctive the language of our corpus is even with respect to a technical-legislative vocabulary. From this comparison we obtained a deviation score ("scarto") indicating how far each form is over- (positive) or under-represented (negative) with respect to the Bolasco (1999) reference model; the higher the deviation, the more distinctive the form is for the analysed text, i.e. the more it characterizes it.
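The indicators reported above are straightforward to compute from a tokenised corpus. The following self-contained base-R sketch shows the computation on a toy token vector; with the paper's figures (N = 150,714 occurrences and V = 8,113 graphic forms) the first ratio gives 5.383, and with V1 hapaxes the second gives the 28.350% reported above.

# Toy tokenised corpus standing in for the 150,714 occurrences of the laws
tokens <- c("comma", "numero", "articolo", "comma", "straniero", "decreto",
            "comma", "stato", "numero", "soggiorno", "ingresso", "territorio")

freq <- table(tokens)
N  <- length(tokens)   # occurrences (tokens)
V  <- length(freq)     # distinct graphic forms (types)
V1 <- sum(freq == 1)   # hapax legomena

c(types_over_tokens_pct = V / N * 100,   # V/N*100 (5.383 for the full corpus)
  hapax_pct             = V1 / V * 100)  # V1/V*100 (28.350 for the full corpus)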
Leaving aside the keywords tied to strictly legal vocabulary (such as decreto, lettera), which already emerged from the frequency hierarchy, the other main dimensions of the text can be analysed: besides the word straniero, the first dimension that emerges is that of the border (ingresso, territorio, frontiera, accesso, durata) and of the exercise of rights (regolamento, autorizzazione, disposizioni). But the largest dimension is the criminal one (pena, delitti, reato, reati, tribunale, sentenza, condanna, violazione, esecuzione). It is striking, on the other hand, that the under-represented words are governo, politica, pubblico and parlamento, i.e. precisely those linked to the legislative dimension. Starting from the hypothesis that the language changed over time, we carried out an analysis of specificities (see Table 1). When a word is over-represented we speak of a characteristic form (or positive specificity); conversely, when it is under-represented we speak of negative specificity. Forms without specificity in a given group are called banal, while those that are specific to no group are considered to belong to the basic vocabulary of the corpus (Bolasco, 1999).

Table 1: Positive specificities by year of legislation. One column per year (1912, 1986, 1990, 1992, 1995, 1998, 2000, 2002, 2004, 2007, 2008, 2009, 2010, 2011), each listing the word forms most over-represented in the laws of that year; for instance, cittadinanza heads the 1912 column and lavoro the 1986 one.
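The authors compute specificities with TaLTAC2; as an analogue, the sketch below implements the hypergeometric specificity score of the lexicometric tradition (Lafon's model), which is one standard way to obtain the positive and negative specificities behind Table 1. All counts in the example call are hypothetical.

specificity <- function(k, n, K, N) {
  # k: occurrences of the word in the sub-corpus (one year of legislation)
  # n: size of the sub-corpus;  K: occurrences in the whole corpus;  N: corpus size
  p_over  <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)  # P(X >= k)
  p_under <- phyper(k, K, N - K, n)                          # P(X <= k)
  if (p_over < p_under) -log10(p_over) else log10(p_under)   # > 0 over-, < 0 under-represented
}

# e.g. a word with 40 of its 120 corpus occurrences in a 5,000-token law
# (counts hypothetical; N is the real corpus size)
specificity(k = 40, n = 5000, K = 120, N = 150714)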
The specificities obtained by analysing the evolution of the language over time show that writing about migration began with citizenship and residence, progressively introduced concepts connected to work and to being a non-EU national, and eventually reached themes of integration on the one hand and of criminalization of the foreigner on the other. The lexical panorama has become richer over time but also more "extreme". This "extremization" could be the result of the different coalitions/majorities, and thus tied not only to a temporal dimension but even more to a political one; for this reason it is worth analysing the two dimensions together.

5. Lexical dimensions
The lexical correspondence analysis (2) was carried out on the first 50 lemmas extracted from the comparison between the verb lemmas of our vocabulary and those of the "lexicon of governmental programmatic discourse". Through correspondence analysis we summarized the diversity of the lexicon used in the different laws with respect to the year and the governing coalition. The first two factorial axes, projected in Figure 1, account for 46% of the explained variability. The first dimension, represented by the first factor, is temporal: with the exception of 1992 and 2007, all the laws approved after 2002 stand in opposition to the earlier ones. The second axis is characterized by the opposition of the Liberal party (Giolitti government, 1911-1914) and the Quadripartito to the other government majorities. The coordinates allow us to project the classes and the graphic forms onto the plane, and their positions allow us to identify and interpret the profiles according to the proximity of the points.

Figure 1: Lexical dimensions, representation of the first factorial plane

Looking at the quadrants in more detail, we can note that in the first, where the Quadripartito and the years 1992 and 2010 are located, the graphic forms that characterize the space refer to the cultural dimension: lingua and conoscenza. The graphic forms projected in the second quadrant, characterized by the years from 2002 onwards and by the Casa delle libertà, the grand coalition and the independents of the Monti technocratic government, mainly express aspects linked to crime (e.g. violazioni, delitti, reato, pena) and to the judiciary (e.g. norme, tribunale, giudice, esecuzione). Straddling the first and second quadrants we also find the centre-right coalition. In the third quadrant we find the years 1986, 1990, 1995, 1998 and 2007, together with the Dini technocratic government, L'Unione and the Pentapartito. The graphic forms in this quadrant identify the characteristics of residence permits: carta, durata, status, temporanea. The border dimension characterizes the fourth quadrant: territorio, frontiera, legale, autorizzazione. Straddling these two dimensions are the worlds of work and of associations, which are an integral part of the migratory path in Italy; it is thus not surprising that they characterize both the third and the fourth quadrant.

(2) Carried out with the SPAD program, specifically with the CORBIT method.
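The correspondence analysis itself was run in SPAD; an analogous first factorial plane can be obtained in R with the CA function of FactoMineR. In the sketch below, the lemma-by-law contingency table lex is a random placeholder for the real table crossing the 50 selected verb lemmas with the laws.

library(FactoMineR)

set.seed(2)
years <- c(1912, 1986, 1990, 1992, 1995, 1998, 2000,
           2002, 2004, 2007, 2008, 2009, 2010, 2011)
lex <- matrix(rpois(50 * 14, lambda = 5), nrow = 50,
              dimnames = list(paste0("lemma", 1:50), paste0("law", years)))

res.ca <- CA(lex, graph = FALSE)
res.ca$eig[1:2, ]          # variance explained by the first two axes (46% in the paper)
plot(res.ca, axes = c(1, 2))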
6. Conclusions
The aim of this work was to explore the legislative panorama on migration from a statistical perspective, in order to extract its characteristics and peculiarities. In this perspective, the linguistic and temporal differences, and above all those between the various executives, represent an interesting pool of information for investigating the semantic evolution of the norms. Although descriptive, this work is particularly important because the choice of a type of language may influence the opinions and attitudes of the Italian population towards foreigners. Our results show that the lexical panorama of Italian immigration legislation from 1912 to 2014 changed considerably. First, the specificities obtained by analysing the evolution of the language over time show that initially, when Italy was a country of emigration, migration legislation was characterized by themes such as citizenship and residence. From the 1980s, with the increase of inward migration flows, concepts connected to work and to being a non-EU national were progressively introduced. At the end of the 1990s, following the sharp increase in arrivals of foreigners in Italy, integration and family reunification entered the discourse. Finally, from the 2000s onwards the process of "criminalization" of the foreigner began, and themes such as security, push-back (respingimento) and removal (allontanamento) entered the specific vocabulary. Second, the factorial correspondence analysis confirmed that from 2002 (the Bossi-Fini law) there was a clear change in the language used in immigration legislation, which is increasingly characterized by themes linked to security and legality. The language used was certainly also influenced by other factors not considered here, such as the transposition of European immigration policies, the international geopolitical situation, and the increase in Islamist terrorist attacks since the attacks in the United States on 11 September 2001. With this work we have outlined a lexical panorama that has changed direction, orienting itself more and more towards themes of regulation and containment (expulsion, removal of irregular migrants). It confirms a negative approach to migration regardless of the government majority.

References
Bolasco, S. (1999). Analisi multidimensionale dei dati. Carocci, Roma.
Colombo, A. and Sciortino, G. (2004). Alcuni problemi di lungo periodo delle politiche migratorie italiane. Le Istituzioni del Federalismo, 5: 763-788.
Nascimbene, B. (1988). Lo Straniero nel diritto italiano. Giuffrè Editore, Milano.
Pastore, F. (2004). A community out of balance: nationality law and migration politics in the history of post-unification Italy. Journal of Modern Italian Studies, 9(1): 27-48.
Solé, C. (2004). Immigration policies in southern Europe. Journal of Ethnic and Migration Studies, 30(6): 1209-1221.
Zincone, G. and Caponio, T. (2004). Immigrant and immigration policy-making: the case of Italy. IMISCOE Working Paper Country Report. IMISCOE, Amsterdam.
Zincone, G. (2006). The making of policies: immigration and immigrants in Italy. Journal of Ethnic and Migration Studies, 32(3): 347-375.

A bibliometric meta-review of performance measurement, appraisal, management research

Massimo Aria1, Corrado Cuccurullo2
1 University of Naples Federico II – aria@unina.it
2 University of Campania L. Vanvitelli – corrado.cuccurullo@unicampania.it

Abstract
Performance measurement, appraisal, and management have become one of the most prominent and relevant research issues in management studies.
The emphasis on empirical contributions has resulted in voluminous and fragmented research streams. Synthesizing the research literature is therefore relevant for effectively using the existing knowledge base, advancing a line of research, and providing evidence-based insights. In this paper, we propose a bibliometric meta-review that offers a different knowledge base for a future research agenda, with implications also for teaching and practice. We analyze the performance management literature through a bibliometric analysis of the reviews recently published (2000-2017) in the scientific journals of domains such as Management, Business and Operations. The main purpose is to map and understand the intellectual structure of the field through co-citation analysis.

Keywords: science mapping, content analysis, bibliometrix, performance measurement.

1. Introduction
Performance measurement, appraisal, and management have become one of the most prominent and relevant research issues in management studies. They are an ongoing topic of conferences, books and journal articles, as well as of professional and popular grey literature. Research on these topics has been conducted in different sectors and for various organizations, including public and professional ones. While the number of academic publications on these topics is increasing at a rapid pace, the emphasis on empirical contributions has resulted in voluminous and fragmented research streams that hamper the ability to accumulate knowledge and actively collect evidence from previous research papers. Literature reviews are thus increasingly assuming a crucial role in synthesizing past research findings, in order to effectively use the existing knowledge base, advance a line of research, and provide evidence-based insight into the practice of exercising and sustaining professional judgment and expertise. Among the different qualitative and quantitative approaches to reviewing, bibliometrics has the potential to introduce a systematic, transparent, and reproducible review process based on the statistical measurement of science, scientists, and scientific activity. In this paper, we propose a bibliometric "review of reviews" (meta-review) that offers a different knowledge base for a future research agenda, with implications also for teaching and practice. The goal of this article is to find a path through, and take stock of, the existing knowledge in performance measurement, appraisal, and management research.

2. Research synthesis on performance measurement, appraisal and management
2.1 Overcoming semantic ambiguity
"Performance" is a complex concept and can be seen from different angles. It is a multi-dimensional construct, the measurement of which varies depending on a variety of factors. For example, it is important to determine whether the measurement objective is to assess performance outcomes or behavior, at the organizational or individual level, in financial or multidimensional terms (e.g. the balanced scorecard framework), as an intermediate or final consequence of a managerial action. In very general terms, performance is the contribution (the result, and the way the result is achieved) that an entity (individual, group of individuals, organizational unit, organization, program, or public policy) provides through its action towards achieving the aims and objectives, and the satisfaction of the needs, for which the organization was formed.
While measurement concerns performance indicators, and appraisal is the process of evaluating the performance of individuals and teams, performance management is a systematic process for improving organizational performance by developing the performance of individuals and teams. It is a means of getting better results by understanding and managing performance within an agreed framework of planned goals, standards and competency requirements.

2.2 The need for a meta-review
In this work we analyze the performance management literature through a bibliometric analysis of the literature reviews recently published (2000-2017) in the scientific journals of domains such as Management, Business and Operations. The main purpose is to map and understand the intellectual structure of this recent and evolving macro-topic through co-citation analysis, highlighting its internal clusters. The main contribution is a better understanding of the state of the art in terms of gaps, divergences, commonalities and tendencies along which the field is developing. We thus provide a map for scholars positioning their future research work, and for teachers introducing so vast a topic to students. This field of research is well suited to a bibliometric meta-review for the following reasons:
1. There is little consensus among scholars. For example, Franco-Santos et al. (2007) counted 17 different definitions of business performance measurement system, while Taticchi et al. (2010) counted almost 25 different frameworks.
2. The field is deeply multidisciplinary. The most widely cited authors come from a variety of disciplinary backgrounds, such as accounting, strategy, operations management and research, and human resources. The diversity of the scholars' backgrounds brings different research questions, theoretical bases and methodological approaches. The functional silos through which research on performance management is developing prevent a coherent and agreed body of knowledge from forming. Understanding in depth the intellectual structure of the field and its evolution is a relevant challenge for researchers.
3. There is a community of dedicated scholars around the world who share the same agenda (cohesion in dominant issues) but use divergent theoretical approaches and methods.
4. The field is still relatively immature. As it is relatively young, its limited professionalization is not surprising. In addition, there is no reference journal comparable to what the Strategic Management Journal is for strategy scholars. Here our study can be contributive, showing the gaps in the literature and providing some guidelines for researchers.
5. Commonly accepted performance management practices do not exist (Richard et al., 2009). In many contexts performance management is dysfunctional, although this problem has been known for more than 50 years (Ridgway, 1956). We still lack robust empirical and theoretical analyses of performance management frameworks and methodologies. Empirical investigations of the performance impact of frameworks, including the most widespread balanced scorecard, have failed to offer uncontroversial findings (Banker et al., 2000; Ittner et al., 2003; Neely et al., 2004). Some authors call for further, longitudinal studies to understand the social influences and implications, but they do not show which paths to follow.
6. Some publications have assumed seminal roles in the evolution of the scientific field. These articles, owing to their impact, are accelerating factors in the development of the field (Berry and Parasuraman, 1993).
It is therefore important to identify the most influential performance management articles published between 1991 and 2010, in order to better understand the state of the art and discover the linkages among authors.
7. The field has an extended spectrum and an increased intensity of research, but a large part of it also confirms the incompleteness and inconsistency of results. There are still various open issues and unsolved problems. This depends on the fragmentation of the field, on the different disciplinary memberships of researchers and on their cultural contexts. This diversity implies the use of different theories and methods and therefore also the emergence of different dominant themes.
8. A profound and rapid evolution is taking place. Not only has research shifted from financial performance to multidimensional performance, but a shift of scholars' attention from organizational to individual performance is under way. Moreover, another significant shift is ongoing: while earlier research was often normative, founded on economic rationality, more recent research is more analytical and explanatory (Cuccurullo et al., 2016).
The overwhelming volume and variety of new information, conceptual developments, and data are the milieu where bibliometrics becomes useful, by providing a structured, more objective and reliable analysis presenting the "big picture" of extant research.

3. Methods
Our bibliometric meta-review is a quantitative research synthesis of the reviews published on the same topic, which we conducted with bibliometrix (Aria and Cuccurullo, 2017), a tool developed in the R language that follows a classic logical bibliometric workflow.

3.1 Data collection
For data retrieval, we used the Social Science Citation Index (Indexes=SCI-EXPANDED, SSCI) of Clarivate Analytics Web of Science, the database of scientific knowledge most used by management scholars (Zupic and Čater, 2015). Our search terms were (TS=(("performance manag*") OR ("performance measur*") OR ("performance apprais*"))). We applied our search keywords to the timespan 2000-2017 and filtered the findings by language (English) and document type (Review), which yielded 783 reviews. We then refined the search by category (Management, Business, or Operations Research & Management Science) and obtained 167 reviews. Finally, we kept only the reviews published in the most authoritative journals, those ranked 3, 4 or 4* by ABS 2015: this excluded 31 journals, corresponding to 50 reviews. Our final dataset comprises 117 reviews.

3.2 Data analysis
Our effort at delineating the intellectual structure of the discipline involves author co-citation analysis (ACA), a bibliometric technique that uses a matrix of co-citation frequencies between authors as its input. This matrix is the basis for various types of analyses. The ability of ACA to reveal patterns of association between authors based on their co-citation frequencies makes it a promising methodology for understanding the evolution of an academic discipline. Authors working in a stream of research often cite one another as well as draw on common sources of knowledge. Further, their works are likely to be frequently co-cited (i.e., cited together) by other authors working on intellectually similar themes. The citations of seminal authors provide a basis for unraveling the complex patterns of association that exist among them, as well as for tracing the changes in intellectual currents taking place over time.
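The workflow just described can be restated compactly with the bibliometrix package cited above (Aria and Cuccurullo, 2017). In the sketch below, the file name is hypothetical and stands for the Web of Science export of the 117 reviews; the function calls follow the package's documented interface.

library(bibliometrix)

# Import the Web of Science export into a bibliographic data frame
M <- convert2df("wos_reviews.bib", dbsource = "isi", format = "bibtex")

results <- biblioAnalysis(M)   # descriptive indicators (in the spirit of Tables 1-3)
summary(results, k = 10)

# Co-citation network of cited references, the basis of Figure 1
NetMatrix <- biblioNetwork(M, analysis = "co-citation",
                           network = "references", sep = ";")
networkPlot(NetMatrix, n = 50, type = "fruchterman",
            Title = "Co-citation network of cited references")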
4. Findings
4.1 Descriptive analysis
Our dataset includes 117 reviews published in 46 journals since 2000 (Tables 1 and 3). They received 105 citations on average (Table 2). They show a fluctuating growth that reaches a peak every 5 years.

Table 1: Main information about the data
Articles: 117
Sources (journals, books, etc.): 46
Keywords Plus: 770; author's keywords: 383
Period: 2000-2017
Average citations per article: 105.1
Authors: 297
Authors of single-authored articles: 10
Co-authors per article: 2.65
Collaboration index: 2.79

Table 2: Top manuscripts per citations (TC = total citations; TC/year = citations per year)
1. Bharadwaj A.S. (2000), MIS Q.: TC 1280, TC/year 71.1
2. Diamantopoulos A. and Siguaw J.A. (2006), Brit. J. Manage.: TC 588, TC/year 49.0
3. Melo M.T. et al. (2009), Eur. J. Oper. Res.: TC 587, TC/year 65.2
4. Zhou P. et al. (2008), Eur. J. Oper. Res.: TC 429, TC/year 42.9
5. Wright P.M. and Boswell W.R. (2002), J. Manage.: TC 379, TC/year 23.7
6. Wright P.M. et al. (2005), Pers. Psychol.: TC 347, TC/year 26.7
7. Zacharatos A. et al. (2005), J. Appl. Psychol.: TC 305, TC/year 23.5
8. Adams R. et al. (2006), Int. J. Manag. Rev.: TC 302, TC/year 25.2
9. Gibson C. and Vermeulen F. (2003), Adm. Sci. Q.: TC 291, TC/year 19.4
10. Cardoen B. et al. (2010), Eur. J. Oper. Res.: TC 288, TC/year 36.0

Table 3: Most relevant sources (number of review articles per journal). J. of Management ranks first with 11 articles; it is followed by tied pairs of journals at ranks 2 (Int. J. of Operations & Production Management and Int. J. of Production Economics), 4 (European J. of Operational Research and Int. J. of Management Reviews), 6 (Int. J. of Human Resource Management and Int. J. of Production Research) and 8 (J. of Business Ethics and Strategic Management J.); six journals share rank 10 with 3 articles each: British J. of Management, J. of Applied Psychology, J. of Management Studies, Management Accounting Research, Omega-Int. J. of Management Science and Supply Chain Management.

4.2 Co-citation network and cluster analysis
The objective of our paper is to identify the intellectual structure of the performance measurement and management field. More specifically, our goals are to (1) delineate the subfields that constitute the intellectual structure of the field; (2) determine the relationships, if any, between the subfields; (3) identify authors who play a pivotal role in bridging two or more conceptual domains of research; and (4) graphically map the intellectual structure in a network space in order to visualize the spatial distances between intellectual themes. In extreme synthesis, Figure 1 shows:
1. A first cluster (red bubbles) represented by works concerning systems of multidimensional performance measurement and evaluation. At its centre we find prominent authors who contributed specific frameworks, such as the balanced scorecard (Kaplan and Norton, 1992, 1996) and the performance prism (Neely et al., 1995). Next to them we find the contribution of Ittner et al. (2003) on one of the great problems of multidimensional measurement: the balance between subjectivity and objectivity. Also central in the cluster is performance system design (Neely et al., 2000). At the upper and lower extremes of the cluster we find two further issues of multidimensional performance systems: strategic alignment (Chenhall, 2005) and guidelines for implementing systems (Bititci et al., 1997).
2. A second cluster (blue bubbles) concerns the currently prevailing perspective in the study of performance measurement and management: the strategic one. In particular, Figure 1 highlights the bridging contributions of two cornerstones of the resource-based view (Barney, 1991; Wernerfelt, 1984).
3. In front of this cluster, in the upper-left part of the map, we find another one (violet bubbles) that deals with theories such as agency theory (Eisenhardt, 1989; Jensen and Meckling, 1976), which constitute the main lens of investigation (Carpenter, 2003), and psychology (Kahneman, 1979).
4. Two other neighbouring clusters, located in the lower-left part of the map, concern human resources. The first (green bubbles) includes almost exclusively works published in Academy of Management journals; their preferred theme is perceptions of organizational performance (Delaney and Huselid, 1996). The second concerns participation in the appraisal process from a psychological perspective (Cawley et al., 1998; Keeping and Levy, 2000).
5. One last cluster is isolated and concerns operations research studies on performance measurement.

Figure 1: Co-citation network of cited references

References
Aria, M. & Cuccurullo, C. (2017). bibliometrix: an R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4): 959-975.
Barney, J. (1991). Firm resources and sustained competitive advantage. Journal of Management, 17(1): 99-120.
Bititci, U. S., Carrie, A. S., & McDevitt, L. (1997). Integrated performance measurement systems: a development guide. International Journal of Operations & Production Management, 17(5): 522-534.
Cawley, B. D., Keeping, L. M., & Levy, P. E. (1998). Participation in the performance appraisal process and employee reactions: a meta-analytic review of field investigations. Journal of Applied Psychology, 83(4): 615.
Chenhall, R. H. (2005). Integrative strategic performance measurement systems, strategic alignment of manufacturing, learning and strategic outcomes: an exploratory study. Accounting, Organizations and Society, 30(5): 395-422.
Cuccurullo, C., Aria, M., & Sarto, F. (2016). Foundations and trends in performance management: a twenty-five years bibliometric analysis in business and public administration domains. Scientometrics.
Delaney, J. T., & Huselid, M. A. (1996). The impact of human resource management practices on perceptions of organizational performance. Academy of Management Journal, 39(4): 949-969.
Eisenhardt, K. M. (1989). Agency theory: an assessment and review. Academy of Management Review, 14(1): 57-74.
Ittner, C. D., Larcker, D. F., & Meyer, M. W. (2003). Subjectivity and the weighting of performance measures: evidence from a balanced scorecard. The Accounting Review, 78(3): 725-758.
Jensen, M. C., & Meckling, W. H. (1976). Theory of the firm: managerial behavior, agency costs and ownership structure. Journal of Financial Economics, 3(4): 305-360.
Kaplan, R. S., & Norton, D. P. (1992). The balanced scorecard: measures that drive performance. Harvard Business Review, 70(1): 71-79.
Kaplan, R. S., & Norton, D. P. (1996). Using the balanced scorecard as a strategic management system. Harvard Business Review, 74(1): 75-85.
Keeping, L. M., & Levy, P. E. (2000). Performance appraisal reactions: measurement, modeling, and method bias. Journal of Applied Psychology, 85(5): 708.
Neely, A. (2005). The evolution of performance measurement research: developments in the last decade and a research agenda for the next. International Journal of Operations & Production Management, 25(12): 1264-1277.
Neely, A., Gregory, M., & Platts, K. (1995). Performance measurement system design: a literature review and research agenda. International Journal of Operations & Production Management, 15(4): 80-116.
Neely, A., Mills, J., Platts, K., Richards, H., Gregory, M., Bourne, M., & Kennerley, M. (2000).
Performance measurement system design: developing and testing a process-based approach. International Journal of Operations & Production Management, 20(10): 1119-1145.
Wernerfelt, B. (1984). A resource-based view of the firm. Strategic Management Journal, 5(2): 171-180.

Textual Analysis of Extremist Propaganda and Counter-Narrative: a quanti-quali investigation

Laura Ascone
Université de Cergy-Pontoise – laura.ascone@etu.u-cergy.fr

Abstract
This paper investigates the rhetorical strategies of jihadist propaganda and counter-narrative in English and French. Since jihadist propaganda aims both at persuading the Islamic State's sympathisers and at threatening its enemies, attention was focused on the way threat and persuasion are verbalised. As far as jihadist propaganda is concerned, the study was conducted on the Islamic State's two official online magazines: Dabiq, published in English, and Dar al-Islam, published in French. As for the counter-narrative, the corpus was composed of the articles published on the main English and French governmental websites. Combining quantitative and qualitative approaches allowed us to examine the general characteristics as well as the specificities of both jihadist propaganda and counter-narrative. The software Tropes was used to analyse the corpora from a semantic-pragmatic perspective. The statistical validity of the results was then verified and synthesised with the software packages Iramuteq and R. This study revealed that the rhetorical strategies vary both between jihadist propaganda and counter-narrative and between French and English.

Keywords: jihadist propaganda, counter-narrative, discourse analysis, threat, persuasion.

1. Introduction
The recent terrorist attacks by Daesh in Western countries have led researchers and experts to examine the islamisation of radicalism (Roy, 2016). Different studies have been conducted on the psychosociological contexts that may lead someone to adhere to the jihadist ideology (Benslama, 2016; Khosrokhavar, 2014), as well as on the role played by the Internet in the radicalisation process (Von Behr, 2013). Yet, even though terrorism would not exist without communication (McLuhan, 1978), the rhetorical strategies of jihadist propaganda have been neglected and remain unexplored. This research investigates the rhetorical strategies of both jihadist propaganda and counter-narrative published on the Internet in English and French. More precisely, the analysis focuses on the way threat and persuasion are expressed in jihadist discourse, as well as on the way the French government and international institutions face and counter jihadist propaganda. From a linguistic perspective, threat and persuasion are complex speech acts; therefore, pragmatics and, more specifically, Searle's (1969) speech act theory constituted the basis of this study. As far as jihadist propaganda is concerned, the analysis was conducted on the Islamic State's two official online magazines: Dabiq, published in English, and Dar al-Islam, published in French. As for the counter-narrative, the corpus was composed of the articles published on the main French and English institutional websites such as stopdjihadism.fr or counterjihadreport.com. The fact that jihadist propaganda and counter-narrative address different readerships led us to hypothesise that differences in both content and form might be identified between the two magazines, as well as among the different governmental websites.
Combining quantitative and qualitative approaches (Garric and Longhi, 2012; Rastier, 2011), that is, lexicometry and textometry for the quantitative approach, and the interpretation of the text in the light of the ideology behind it for the qualitative one, made it possible to examine the general characteristics as well as the specificities of both jihadist propaganda and counter-narrative. Following Marchand's (2014) work, the software Tropes was used to analyse the corpora from a semantic-pragmatic perspective. The results were then investigated qualitatively, and their statistical validity verified with the software packages Iramuteq and R. The combination of these two approaches made it possible to overcome the limitations imposed both by the software's automatic analysis and by the subjectivity of qualitative interpretation. By comparing the rhetorical strategies used in jihadist propaganda (Huyghe, 2011) and in counter-narrative, the aim of this research was to identify the linguistic differences between these two discourses and these two languages, in order to determine the rhetorical strategies that might prove efficient in countering jihadist propaganda. After the rhetorical pattern of jihadist propaganda has been presented, the linguistic characteristics of English and French counter-narratives will be examined. The jihadist and governmental rhetorical strategies will then be contrasted.

2. Corpus and methodology
2.1. Jihadist propaganda
The analysis of the rhetorical strategies in jihadist propaganda was conducted on Daesh's official online magazines Dabiq, published in English, and Dar al-Islam, published in French. Since these two magazines address a readership that has already adhered to the jihadist ideology, their goal is both to reinforce the reader's adhesion and to incite him/her to act in the name of the jihadist ideology. The reader is thus incited to adopt the behaviour a good Muslim should have, and to take revenge on those whom Daesh presents as responsible for the Muslims' humiliation, that is, the West. As far as Dabiq is concerned, the corpus investigated was composed of all the articles published in the first fourteen issues (i.e. 377,450 words). As for Dar al-Islam, the analysis was conducted on the first nine issues (229,762 words). To analyse the rhetorical strategies used in jihadist propaganda, a quanti-qualitative approach was adopted (Garric and Longhi, 2012; Rastier, 2011). More precisely, this iterative approach was composed of five stages. A first qualitative analysis of the jihadist ideology, the radicalisation process, and the linguistic characteristics of hate speech and propagandistic discourse was essential to the understanding of jihadist discourse as well as to the formulation of our first hypotheses. The second stage corresponded to a quantitative analysis whose goal was to verify the validity of these hypotheses: the corpus was examined with the software Tropes, which makes it possible to investigate a text from a semantic perspective. More precisely, based on a pre-established lexicon, the software identifies the themes tackled in the text and shows how these themes are linked to one another. The most frequent themes in both magazines are religion and conflict. However, in order to study the way threat and persuasion are expressed in the two corpora, a deeper qualitative analysis was conducted on the theme sentiment for the French corpus and feeling for the English one (third stage).
In other terms, the quantitative analysis constituted the basis for a qualitative study, which was then conducted only on the expressions conveying feelings. The nine issues of the French magazine count 318 sentiment-expressions, whereas Dabiq counts 705 feeling-expressions; because of this difference in corpus size, a normalisation was applied in order to make the results comparable. Then, a quantitative analysis was conducted with the software Iramuteq, an interface to the software R which performs statistical analyses of textual data based on Reinert's classification method. This way, it was possible to test the hypotheses and results issued by the qualitative study (fourth stage). Furthermore, a qualitative manual analysis of the first issue of both Dabiq and Dar al-Islam made it possible to identify the propositions conveying threat, persuasion, obligation, prohibition, and reward that had not been detected by the software Iramuteq. This way, it was possible to provide a lexicon specific to the corpus under investigation, covering items that the software had missed because of the special features of jihadist discourse (fifth stage). The combination and alternation of quantitative and qualitative approaches made it possible to examine Daesh's discourse in relation to the context in which it is produced (Valette and Rastier, 2006).

2.2. Counter-narrative
The analysis of the rhetorical strategies in French and English counter-narratives was conducted on the main governmental and institutional websites. The French corpus was composed of the articles published on www.stop-djihadisme.gouv.fr (the platform created after the first terrorist attacks in France in 2015), www.interieur.gouv.fr (the website of the Ministry of the Interior), and www.cpdsi.fr (the website of the Centre de Prévention contre les Dérives Sectaires liées à l'Islam). The corpus counts 115,950 words. As far as the English corpus is concerned, it was composed of the articles published on www.counterjihadreport.com (a news-aggregating website), www.consilium.europa.eu (the website of the European Council and of the Council of the European Union), www.ec.europa.eu (the website of the European Commission), and on the Radicalisation Awareness Network (a specific section of the website of the European Commission). The corpus counts 116,000 words. In order to conduct comparable analyses, the same quanti-qualitative approach was adopted. The qualitative analysis of the geopolitical context and of the different campaigns used to face and counter jihadist radicalisation was essential to the understanding of both French and English counter-narratives (first stage). Then, a quantitative analysis was conducted with the software Tropes, which made it possible to identify the most frequent themes. The themes religion and droit ("law") were the most present in the French corpus, whereas the themes education and communication were the most frequent in the English one (second stage). The third stage corresponded to the qualitative analysis that was conducted on the category sentiment for the French corpus (292 propositions) and feeling for the English one (370 propositions). A normalisation was then applied to compare jihadist and governmental discourse. A second quantitative analysis was then conducted with the software packages Iramuteq and R to test the results issued by the qualitative study (fourth stage).
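The normalisation step can be made concrete with the counts reported above. The following is a minimal sketch, assuming the rates are expressed per 10,000 words; the exact basis of the normalisation is not stated in the paper, so the per-10,000 convention is an illustrative choice.

```python
# Hypothetical illustration of the normalisation step: the raw counts of
# emotion expressions come from corpora of different sizes, so we rescale
# them to a common basis (here, a rate per 10,000 words).

corpora = {
    "Dar al-Islam (sentiment)": (318, 229_762),  # expressions, corpus size in words
    "Dabiq (feeling)":          (705, 377_450),
}

for name, (expressions, words) in corpora.items():
    rate = expressions / words * 10_000
    print(f"{name}: {rate:.1f} expressions per 10,000 words")

# Dar al-Islam: ~13.8 vs Dabiq: ~18.7, so the raw gap (318 vs 705)
# narrows considerably once corpus size is taken into account.
```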
The results of the analyses on jihadist propaganda and counter-narrative were then contrasted in order to compare the rhetorical strategies used in the two discourses.

3. The rhetorical strategies used in French and English jihadist propaganda
The quantitative analysis conducted with the software Tropes, and the qualitative study conducted on the categories sentiment and feeling, revealed the components of jihadist discourse. The propaganda of the Islamic State is based on five key concepts: threat, persuasion, reward, obligation, and prohibition. The assessment of inter-judge agreement was necessary to determine these five concepts as well as to categorise the different propositions selected by Tropes as objectively as possible. Each category was examined from both a quantitative perspective (i.e., its identification and distribution in the two magazines, Dabiq and Dar al-Islam, using the software packages Tropes and Iramuteq, and the corpus analysis toolkit AntConc) and a qualitative one (i.e., analysing each concept in relation to the context in which it was produced). Yet, these five concepts are not independent from one another. Rather, they are strongly linked.

Figure 1: the rhetorical pattern of jihadist propaganda

Figure 1 shows the rhetorical pattern of jihadist discourse. Since Dabiq and Dar al-Islam aim at manipulating the reader's behaviour, jihadist propaganda is based on obligations and prohibitions. Rewards, as well as guilty feelings towards the Muslims living in the Middle East, aim at leading the reader to respect these prescriptions. Not respecting them would mean facing negative consequences. Threat may then be expressed against the members of the Islamic State themselves and, more generally, against any Muslim. Obligations are also exploited to impose on the readership a hostile and violent attitude against Western countries, which is justified by the feeling of victimisation. Fighting against the Muslims' enemy is presented by jihadists as a heroic and valorising action and, therefore, a persuasive one. Furthermore, attractive factors are not only rewards for the reader's obedience; they are sometimes presented as independent of the reader's behaviour. In other terms, persuasion is presented as a positive and valorising act that, contrary to rewards, does not depend on whether or not the reader respects the imposed prescriptions. The sentence "Jihad is necessary to obtain Allah's forgiveness", for instance, presents an obligation ("it is necessary") and a reward that will be granted if the obligation is respected ("to obtain Allah's forgiveness"). However, this sentence expresses more than an obligation and a reward. Jihad, which is interpreted as attractive by jihadists, tends to be associated with terrorist attacks and, consequently, will be perceived as threatening by Western countries. Furthermore, this sentence implies that if the obligation is not respected, the individual will not obtain Allah's forgiveness. In other terms, this sentence indirectly expresses a threat against the readership too.

4. The rhetorical strategies used in French and English counter-narratives
The large number of Daesh's sympathisers and foreign fighters shows that the communicative and rhetorical strategies adopted in Daesh's propaganda have an important, persuasive impact on the readership.
On the contrary, the counter-narrative produced by the different governments to face and counter jihadist propaganda has been criticised for not being as efficient as jihadist propaganda. In the French corpus, 292 propositions conveying sentiment ("feeling") were identified, whereas 370 propositions conveying feelings were identified in the English one. The frequency of the five categories (i.e. of the propositions conveying threat, persuasion, reward, obligation, and prohibition) was calculated in the French and English corpora. The reward category is the only one that was more present in the French corpus than in the English one. Contrary to the Islamic State's propaganda, the propositions conveying rewards and prohibitions are almost absent in both French and English counter-narratives. What these two discourses have in common, on the other hand, is the high frequency of propositions conveying threat (Example 1).

1. "Terrorist groups will continue to exploit the refugee crisis in their propaganda, seeking to portray Western mistreatment of Muslims, and inciting fear by alleging that their supporters are being smuggled in amongst genuine refugees." (RAN website)

As Example 1 shows, threat tends to be associated with the other (i.e., the Islamic State), which implies that Western countries are presented as victims of the Islamic State. In the English corpus, 355 occurrences of the word victim(s) were identified. The corpus analysis toolkit AntConc showed that the most frequent collocate of this term is the word terrorism (57 co-occurrences). On the contrary, the French corpus, where the word victime/s occurs only 70 times, presents only 2 co-occurrences with the term terrorisme. Rather, French counter-narrative tends to talk about rescuing and helping victims (secours/aide aux victimes). Furthermore, differences were identified between websites in the same language. Figure 2 shows the under- and overuse of the most representative terms in two French governmental websites: stopdjihadisme and CPDSI. More precisely, based on a Chi2 dependence test, the graph shows the words that are significantly associated or "anti-associated" with the two websites. The figure revealed that the CPDSI website focuses more on the religious dimension. The words islam, jihad and jihadiste ("jihadist") are significantly associated with this sub-corpus. This implies that jihad and jihadiste are presented and interpreted as religious terms. On the contrary, the website of the stopdjihadisme campaign is characterised by an overuse of the words terroriste ("terrorist"), terrorisme ("terrorism"), Syrie ("Syria"), radicalisation ("radicalisation"), Irak ("Iraq"), français ("French"), and France ("France"). The overuse of these specific terms shows that the campaign and, consequently, its website focus more on the geopolitical dimension, where the radicalisation process is presented in relation to terrorism rather than to Islam.

Figure 2: under- and overuse of some key terms in French counter-narrative
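The kind of Chi2 dependence test behind Figure 2 can be sketched as a standard test of independence on a 2x2 contingency table of token counts. All counts in the sketch below are invented placeholders, not the paper's data, and scipy's chi2_contingency is used as a stand-in for the computation Iramuteq performs internally.

```python
# Does the word "jihad" occur significantly more often in one sub-corpus
# than in the other? Rows: the two sub-corpora; columns: occurrences of
# the target word vs. all other tokens. Counts are hypothetical.
from scipy.stats import chi2_contingency

observed = [
    [120, 115_830],  # CPDSI sub-corpus (hypothetical counts)
    [35,  115_965],  # stopdjihadisme sub-corpus (hypothetical counts)
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}")
# A small p-value indicates the term is significantly associated with
# (overused in) one of the two sub-corpora.
```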
5. Conclusion
This comparative analysis revealed that jihadist discourse and counter-narrative present both similarities and differences. As far as the differences are concerned, the frequency of the propositions conveying threats, persuasion, prohibitions, obligations, and rewards varied between these two discourses: they were more frequent in counter-narrative than in jihadist propaganda. The Islamic State's propaganda aims at reinforcing the reader's adhesion to the jihadist ideology, and at inciting him/her to act against its enemies in the name of that ideology. On the contrary, counter-narrative does not aim at reinforcing an ideology; rather, it aims at countering jihadist radicalisation. This difference was confirmed by the variation of the category frequencies between jihadist propaganda and counter-narrative. Despite this crucial difference, similarities between these two discourses were identified. More precisely, both discourses present the respective speakers' communities as victims of the other and, consequently, incite the readership to fight, whether violently or not, against the enemy. As far as the methodology is concerned, the procedures adopted made it possible to investigate the general and special features of both jihadist and governmental discourses. The results obtained in the quantitative analysis constituted the starting point for a qualitative analysis, which made it possible to identify the features that had not been detected by the software as well as to refine Tropes's pre-established lexicon.

References
Angenot, M. (2008). Dialogue de sourds. Traité de rhétorique antilogique. Paris: Mille et une nuits.
Benslama, F. (2016). Un furieux désir de sacrifice : le surmusulman. Paris: Editions du Seuil.
Garric, N., & Longhi, J. (2012). L'analyse de corpus face à l'hétérogénéité des données : d'une difficulté méthodologique à une nécessité épistémologique. Langage, (3): 3-11.
Huyghe, F.-B. (2011). Terrorismes : violence et propagande. Paris: Gallimard.
Khosrokhavar, F. (2014). Radicalisation. Paris: Editions de la maison des sciences de l'homme.
Marchand, P. (2014). Analyse avec IRaMuTeQ de dialogues en situation de négociation de crise : le cas Mohammed Merah. Communication présentée aux 12es Journées Internationales d'Analyse statistique des Données Textuelles, Paris.
McLuhan, M. (1978). The brain and the media: The "Western" hemisphere. Journal of Communication, 28(4): 54-60.
Rastier, F. (2011). La mesure et le grain : sémantique de corpus. Paris: Champion; diff. Slatkine.
Roy, O. (2016). Le djihad et la mort. Paris: Le Seuil.
Searle, J. (1969). Speech acts: an essay in the philosophy of language. London: Cambridge University Press.
Valette, M., & Rastier, F. (2006). Prévenir le racisme et la xénophobie : propositions de linguistes. Langues modernes, 100(2), 68.
Von Behr, I. (2013). Radicalisation in the digital era: the use of the Internet in 15 cases of terrorism and extremism.

Textual Data Analysis Applied to Security and Criminal Investigation Issues

Laura Ascone¹, Lucie Gianola¹
¹ AGORA, Université de Cergy-Pontoise – laura.ascone@etu.u-cergy.fr, lucie.gianola@u-cergy.fr

Abstract
This presentation investigates two cases of textual analysis applied to security contexts: the analysis of the rhetorical strategies adopted in the Islamic State's official online magazines, Dabiq, published in English, and Dar al-Islam, published in French; and the use of methods for the automatic extraction of named entities, together with the design of a textual exploration software tool for criminal analysis.
Keywords: textual data analysis, radicalisation, criminal analysis

1. Introduction
The rise of security concerns linked to the terrorist acts perpetrated around the world since the beginning of the 21st century has pushed researchers and public and social actors to look for new ways of analysing this phenomenon. In France, the humanities and social sciences have taken up the question, as shown by the organisation of several workshops on the topic ("Nouvelles figures de la radicalisation", Toulouse, April 2017; "Les SHS face à la menace", Cergy, September 2017; "Des sciences sociales en état d'urgence : islam et crise politique", Paris, December 2017). In this article we present two studies related to these security concerns: a study of Daesh's rhetoric from the point of view of its appeal to emotions in the magazines Dabiq (English) and Dar al-Islam (French), and a collaboration between the Pôle Judiciaire de la Gendarmerie Nationale (PJGN) and the Université de Cergy-Pontoise aimed at providing criminal-analyst teams with new tools for the textual analysis of judicial case files. The phenomenon of jihadist radicalisation has led researchers and practitioners to examine the psychosociological reasons underlying adhesion to the jihadist ideology (Khosrokhavar, 2014), as well as the strategies adopted by the extremist group to spread its propaganda messages (Lombardi, 2015). However, although they play a crucial role in jihadist propaganda, the rhetorical strategies aimed at threatening or persuading the various readers remain unexplored. The first part of this study therefore presents a quanti-qualitative analysis of the rhetorical pattern and of the emotions on which jihadist propaganda is based. In line with Marchand's (2014) work, the software packages Iramuteq and Tropes were used to study the corpus from a quantitative point of view. The results of this quantitative analysis then constituted the starting point of a qualitative analysis of the excerpts expressing emotions, in order to examine in more detail the rhetorical strategies of jihadist propaganda. The case of the analysis of judicial case files confronts us with a typical information-extraction problem involving the automatic recognition of named entities: our research work notably consists in designing the foundations of an ad hoc textual navigation tool. Although the needs of criminal analysts are similar to those of other application domains (voice-of-the-customer analysis, biomedical natural language processing, etc.), the context of the judicial investigation imposes new precision constraints on the extraction and on the way results are made available to the expert, that is, to the criminal analyst.
The social and institutional need for new approaches to documents of varied origins in judicial and security contexts allows us to demonstrate, through these two case studies, the relevance of already proven textual data analysis methods.

2. Description of jihadist rhetoric: the case of the magazines Dabiq and Dar al-Islam
2.1. Corpus and methodology
This research was conducted on Daesh's two magazines: Dabiq, published in English, and Dar al-Islam, published in French. Dabiq addresses Daesh's non-Arabic-speaking sympathisers, whereas Dar al-Islam, which is not a translation of Dabiq, addresses an exclusively French-speaking readership. This distinction led us to hypothesise that the two magazines differ in their content as well as in the form of the message they carry. However, both address a readership that has already adhered to the Islamist ideology. Their objective is therefore not to persuade the reader to approach Islamism, but to reinforce his or her adhesion and to lead him or her to act in the name of this ideology. To analyse the rhetorical strategies of jihadist discourse, a quanti-qualitative approach was adopted (Rastier, 2011). More precisely, this iterative approach consisted of four stages. A first qualitative analysis of the jihadist ideology, of the radicalisation process and of the linguistic characteristics of hate speech was essential to the understanding of jihadist discourse and to the formulation of the first hypotheses. The second stage corresponds to a quantitative analysis that made it possible to verify these hypotheses: the corpus was examined with the software Tropes (Ghiglione et al., 1998), which analyses a text from a semantic-pragmatic point of view on the basis of a pre-established lexicon, and identifies the most recurrent themes in the corpus as well as the way these themes are linked to one another. In order to analyse how jihadist discourse manages to persuade and threaten its various readers (Giro, 2014), a qualitative analysis was conducted on the themes sentiment, for the French corpus, and feeling, for the English corpus (third stage). In other words, the quantitative analysis constituted the starting point for a qualitative study, which was thus conducted on the utterances expressing emotions and feelings (Caffi and Janney, 1994). Finally, a last quantitative analysis was conducted with the software Iramuteq (Ratinaud and Marchand, 2012) which, based on the Reinert method, makes it possible, for example, to determine the under- and overuse of certain terms within the different corpora (fourth stage). The combination of qualitative and quantitative approaches made it possible to examine jihadist discourse in relation to the context in which it was produced (Valette and Rastier, 2006).

2.2. Results
The analysis of the utterances expressing emotions and feelings in Daesh's two official magazines made it possible to determine the rhetorical pattern on which jihadist propaganda is built. Since the objective of Dabiq and Dar al-Islam is to manipulate the reader's behaviour, Daesh's propaganda is founded on the imposition of obligations and prohibitions. The granting of rewards as well as the feeling of guilt aim to lead the reader to respect these prescriptions.
Conversely, any Muslim who does not respect these prescriptions will suffer negative consequences: he will be judged an apostate and will therefore be considered an enemy. Here we find the threat expressed by Daesh against Muslims themselves. Obligations are also exploited to impose on the reader a violent action against the West, justified and fuelled by the feeling of victimisation. Fighting the enemy is presented as a heroic and valorising action. By taking part in the fight against the West, the reader will have the impression of becoming a hero fighting in the name of a just and noble cause (De Bonis, 2015), and of seeing his weaknesses disappear (Rumman, Suliman et al., 2016). Moreover, by quoting Quranic verses concerning the victory of the Muslims, the author assures the reader that the Muslim community will be victorious over the enemy; the following excerpt is an example: "Allah will punish them by your hands, will cover them with ignominy, will give you victory over them and will heal the breasts of a believing people" (Dar al-Islam, no. 8). Victory over the enemy is perceived by jihadists as persuasive. However, this utterance, perceived as persuasive by jihadists, will be perceived as threatening by the West. Likewise, jihad, which is interpreted as persuasive by the members of the jihadist group since it gives access to Paradise, tends to be associated with terrorist attacks and thus to be perceived as threatening by Westerners. This double interpretation matches the definition of Perelman and Olbrechts-Tyteca (1988), who propose "to call persuasive an argumentation that claims validity only for a particular audience" (p. 36). Although Dabiq and Dar al-Islam present the same rhetorical pattern, their content varies considerably. This study revealed, for example, that the French magazine focuses its discourse on the figure of the other (i.e., of the enemy), whereas the English magazine focuses on the figure of the Muslim and, more particularly, on the conduct a good Muslim should adopt.

3. Textual analysis of judicial case files
Within an investigation team, the work of criminal analysts consists in reading and synthesising the documents of a case file (witness interviews, telephone and banking data, expert reports, etc.) in order to provide investigators and magistrates with a more global view of the information collected, by means of representation diagrams and syntheses (Rossy, 2011). Their intervention is required in complex cases such as cold cases or cases involving large networks, and provides investigators with new leads. At present, analysts rely on optical character recognition software, standard office tools (word processor, spreadsheet), and IBM's graphical representation software Analyst's Notebook. This tooling does not spare them a precise and time-consuming reading of the case file, aimed among other things at manually spotting and extracting the information relevant to the investigation, grouped into different types of entities which, once extracted, are arranged into graphical representations (chronological or relational).

3.1. Working corpus
The working corpus made available to us by the PJGN is a complete judicial case file, tried and solved, concerning a homicide.
Like any judicial case file, the dossier gathers a variety of documents: expert reports, investigation reports, records of interviews of witnesses and suspects, itemised telephone bills, banking data, photo boards, etc. We chose to focus our work on the sub-corpus composed of the interviews of witnesses and of persons held in custody. This choice was made as we became acquainted with the corpus and the domain, interviews representing the densest and least accessible mass of information in a case file: the number of interviews (in our case, 370 interviews for about 600,000 words) and their lack of structure hinder their processing with standard tools, unlike, for example, telephone data, which can be imported as such into Analyst's Notebook, or other data collected by the gendarmerie in structured forms.

3.2. Automatic named-entity detection
The notion of entity in criminal analysis corresponds to the notion of named entity (NE) in information extraction: a mono-referential linguistic unit that has the capacity to refer to a unique referent (Nouvel et al., 2015). In general, five types of entities are of interest to criminal analysts: persons, places, dates and times, vehicles, and telephone numbers. We set out to apply proven NE detection techniques to the documents of judicial case files, varying the approaches so as to best meet the constraints of each entity type. Two functionalities of the UNITEX software (Paumier, 2016) were put to work: the editing of grammars for the detection of dates, the use of a lexicon for the detection of cities, and the combination of a first-name lexicon and rules for person names. Telephone numbers, for their part, are detected with a regular expression. As things stand, we are therefore able to detect:
• normalised dates: "le 10 janvier 2017", "l'an deux mille dix-sept, le dix janvier", "le 10/01/2017";
• first and last names of persons: "Blanche Rivière", "Petit Noémie", "Michel E. Dupont";
• more than 36,000 cities listed in a lexicon [1].

[1] Available (January 2018) at: http://sql.sh/736-base-donnees-villes-francaises

The development of an approach for detecting vehicles remains open, because their mentions in the corpus combine several types of information:
• kind of vehicle: motorbike, scooter, van, car, etc.;
• make;
• mention of the model or of a body type (4X4, city car, saloon, estate, etc.);
• colours and distinctive signs (rust, livery, model year, etc.).
The delimitation of a vehicle mention cannot be reduced to the combination of a make and a model, as the following two examples from the corpus show:
• "Il s'agit d'un petit modèle comme une TWINGO pour vous donner le volume. Il était de couleur orangé. Il est petit car il a un petit coffre."
• "M. X. m'a cependant parlé d'un véhicule 4X4 conduit par un individu qui avait un fusil."
Vehicle detection will therefore lead us to consider a more complex detection approach than those already in place.
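As an illustration of the kind of patterns involved, the sketch below gives deliberately simplified Python regular expressions for two of the entity types. The authors' actual detection relies on UNITEX grammars and lexicons, so these patterns are illustrative assumptions, not their rules.

```python
# Simplified stand-ins for two detection rules: numeric and letter-month
# dates introduced by "le", and ten-digit French telephone numbers with
# optional separators.
import re

DATE = re.compile(
    r"\ble\s+\d{1,2}(?:/\d{2}/\d{4}|\s+"
    r"(?:janvier|février|mars|avril|mai|juin|juillet|août|"
    r"septembre|octobre|novembre|décembre)\s+\d{4})\b",
    re.IGNORECASE)

PHONE = re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b")

text = ("Entendu le 10 janvier 2017. Le témoin, joignable au 06 12 34 56 78, "
        "confirme les faits le 10/01/2017.")
print(DATE.findall(text))   # ['le 10 janvier 2017', 'le 10/01/2017']
print(PHONE.findall(text))  # ['06 12 34 56 78']
```

Spelled-out dates such as "l'an deux mille dix-sept, le dix janvier" are exactly what motivates grammar-based detection rather than plain regular expressions.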
3.3. Textual data analysis and criminal analysis: one and the same problem?
While automatic named-entity detection in the context of criminal analysis in the gendarmerie constitutes a standard NLP task, its potential contributions cannot be confined to purely technical aspects. The working methodology of criminal analysis rests on human interpretation for the production of hypotheses, and in this respect we relate it to textual data analysis (ADT) as defined by Ho-Dinh (2017): "With ADT, we place ourselves on the contrary in a perspective of knowledge construction, through the human interpretation of the results obtained thanks to computing tools for calculation and visualisation. Computing power thus comes in support of data exploration and mining. This fundamental difference makes it possible to produce qualitative, and not only quantitative, knowledge about the data." The continuation of our work is therefore oriented not only towards improving entity-detection results and introducing statistical approaches (TF-IDF, document clustering, etc.), but also towards developing a dedicated textual exploration interface, taking into account the specificities of the textual genre of the judicial case file (sorting the text according to its nature: header text, civil-status information) and allowing efficient navigation between detected entities, statistical measures and the original text. The methodology of criminal analysis and the practices of the profession might have to be revised accordingly, implying a phase of training criminal analysts in textometric methods.

4. Conclusion
We believe we have raised theoretical and technical perspectives for textual data analysis in the judicial and security domains, pertaining to discourse analysis as well as to NLP and textometry. In the case of Daesh's propaganda, the analysis and understanding of jihadist discourse could contribute to the formulation of a counter-discourse capable of facing and countering jihadist propaganda. Concerning textual analysis practices in criminal analysis, we hope that the implementation of automation techniques and of a textual exploration tool will make it possible to rethink the way information is accessed in criminal analysis, and will be the first step of a broader reflection on the collection and circulation of information and documents in the judicial process. These two case studies illustrate the relevance of humanities and social sciences approaches in the security and judicial context, which until now has mostly drawn on expertise from the so-called "hard" sciences (forensic medicine, biology, chemistry, computer science, etc.), grouped under the label of "forensic sciences". We hope that such contributions will strengthen ties and open the way to other projects associating judicial and defence institutions with researchers in the humanities and social sciences.

References
Caffi, C., & Janney, R. W. (1994). Toward a pragmatics of emotive communication. Journal of Pragmatics, 22(3), 325-373.
De Bonis, M. (2015). La strategia della paura. Limes, 11.
Ghiglione, R., Landré, A., Bromberg, M., & Molette, P. (1998). L'analyse automatique des contenus. Paris: Dunod.
Giro, M. (2015). Parigi: il branco di lupi, lo Stato Islamico e quello che possiamo fare. Limes.
Ho Dinh, O. (2017).
Caractérisation différentielle de forums de discussion sur le VIH en vietnamien et en français. Doctoral thesis, Inalco, Paris.
Marchand, P. (2014). Analyse avec Iramuteq de dialogues en situation de négociation de crise : le cas Mohammed Merah. Actes des 12èmes Journées internationales d'Analyse statistique des Données Textuelles (JADT), Paris, pp. 457-471.
Nouvel, D., Ehrmann, M., & Rosset, S. (2015). Les entités nommées pour le traitement automatique des langues. ISTE Editions.
Paumier, S. (2016). Unitex 3.1 user manual. http://www-igm.univ-mlv.fr/unitex
Perelman, C., & Olbrechts-Tyteca, L. (1988, 5th ed.). Traité de l'argumentation. Bruxelles: Editions de l'Université de Bruxelles.
Rastier, F. (2011). La mesure et le grain : sémantique de corpus. Paris: Champion; diff. Slatkine.
Ratinaud, P., & Marchand, P. (2012). Application de la méthode ALCESTE à de "gros" corpus et stabilité des "mondes lexicaux" : analyse du "CableGate" avec IraMuTeQ. Actes des 11èmes Journées internationales d'Analyse statistique des Données Textuelles (JADT), Liège, 13-15 June, pp. 835-844.
Rossy, Q. (2011). Méthodes de visualisation en analyse criminelle : approche générale de conception des schémas relationnels et développement d'un catalogue de patterns. Doctoral thesis, Université de Lausanne, Faculté de droit et des sciences criminelles.
Rumman, A., Suliman, M., et al. (2016). The Secret of Attraction: ISIS Propaganda and Recruitment. Translated by Ward, W. J., et al. Amman: Friedrich-Ebert-Stiftung.
Valette, M., & Rastier, F. (2006). Prévenir le racisme et la xénophobie : propositions de linguistes. Langues modernes, 100(2), 68.

A two-step strategy for improving categorisation of short texts

Simona Balbi¹, Michelangelo Misuraca², Maria Spano¹
¹ Università di Napoli Federico II – simona.balbi@unina.it, maria.spano@unina.it
² Università della Calabria – michelangelo.misuraca@unical.it

Abstract
Text categorisation allows a collection of documents to be organised with respect to their content. When we consider short texts, e.g. posts and comments shared on social media, this task is harder to achieve because we have few significant terms. Referring to higher-level structures that represent the concepts, or topics, occurring in the collection can improve the effectiveness of the procedure. In this paper, we propose a novel two-step strategy for text categorisation, in the frame of feature extraction. Concepts are identified by using network analysis tools, namely community detection algorithms. It is then possible to organise the document collection with respect to the different concepts and to describe the groups of documents with respect to terms. A case study about Pope Francis on Twitter is presented to show the effectiveness of our proposal.

Keywords: short texts, text categorisation, textual network, community detection

1. Introduction
The ever-increasing popularity of the Internet, together with the amazing progress of computer technology, has led to a tremendous growth in the availability of electronic documents. There is therefore great interest in developing statistical tools for the effective and efficient extraction of information on the Web, in a so-called Text Mining perspective. The most common reference model for representing documents in Text Mining is the so-called vector space model: a document is a vector in the (extremely sparse) space spanned by the terms. Documents are usually coded as bags-of-words, i.e. as unordered sets of terms, disregarding grammatical and syntactic roles.
The focus is on the presence/absence of a term in a document, and on its characterisation and discrimination power. In the knowledge discovery process, the core of most procedures is related to dimensionality reduction, via feature selection and/or feature extraction. Statistical tools enable effective feature extraction. One of the most interesting tasks in Text Mining is text categorisation, which consists in organising a collection of documents by grouping them with respect to their content. Here we propose a novel two-step strategy designed for the text categorisation of short documents, e.g. posts and comments shared on social media, where the task is harder to achieve because we have few significant terms. The basic idea is that textual data can be processed at different levels: in a feature extraction frame, we can consider single terms, or subsets of terms identifying different concepts. Concepts are identified by using network analysis tools, namely community detection algorithms. It is then possible to organise the document collection with respect to the different concepts and to describe the groups of documents with respect to terms. The effectiveness of our proposal is shown by analysing a set of tweets about Pope Francis, posted in November 2017.

2. Background and related work
The bag-of-words encoding is characterised by high dimensionality and an inherent data sparsity. According to Aggarwal and Yu (2000), the performances of text categorisation algorithms decline dramatically because of these aspects. A prior dimensionality reduction is therefore highly desirable. In pre-processing, feature selection and/or feature extraction are often used before applying any further analysis. Via feature selection, only a subset of the original vocabulary is considered, according to some criteria. Several feature selection techniques are reported in the literature, such as term strength (Yang, 1995), information gain (Yang and Pedersen, 1997), the Chi-squared statistic (Galavotti et al., 2000), and entropy-based ranking (Dash and Liu, 2000). Feature extraction (also known as feature reduction) is a process for extracting a set of new features from the original vocabulary by applying some functional mapping. Common feature reduction techniques include lexical correspondence analysis (Lebart et al., 1998) and latent semantic indexing (Deerwester et al., 1990). These techniques achieve dimensionality reduction by transforming the original terms into fewer linear combinations spanning lower-dimensional spaces; these combinations may not have a clear meaning, and the results are sometimes difficult to interpret. To cope with this limitation, here we consider a different viewpoint. Both feature selection and feature extraction are basically founded on the analysis of a documents x terms matrix, in which the generic element is the frequency of a term in a document, or another related weight representing the importance of the term. It is possible to recover part of the context of use of each term by constructing a terms x terms co-occurrence matrix. In general, each element of this latter matrix is the number of times two terms co-occur in the corpus. This particular data structure can be represented as a network, where each term is a vertex and each element of the matrix different from 0 is an edge.
The problem of reducing the original dimensionality and performing a feature extraction can then be seen as a community detection problem: terms used together define a concept, as in latent semantic indexing or correspondence analysis, but without any algebraic transformation. Differently from the approaches previously described, this method preserves the original meaning of the terms and allows a better readability of the results. A community in a network is a set of nodes which are densely interconnected and sparsely connected to other parts of the network (Wasserman and Faust, 1994). There is no universally accepted definition of a community, but it is well known that most real-world networks display community structures. When we consider networks of terms, communities of densely interconnected terms can be interpreted as topics. From a theoretical point of view, community detection is not very different from clustering, and many algorithms have been proposed. Traditional approaches are based on hierarchical or partitional clustering (e.g. Scott, 2000; Hlaoui and Wang, 2004). The most popular algorithm is the one proposed by Girvan and Newman (2004). The method is historically important because it marked the beginning of a new era in the field of community detection by introducing the notion of "modularity". Originally introduced to define a stopping criterion, modularity (nowadays referred to as Girvan and Newman's modularity) has rapidly become an essential element of many community detection methods, such as fast-greedy (Clauset et al., 2004), label propagation (Raghavan et al., 2007), and leading eigenvector (Newman, 2006). It measures the difference between the observed fraction of edges that fall within the given communities and the fraction expected under a random distribution of edges. For a comprehensive review of the community detection literature, see Fortunato (2010).

3. Problem definition and proposed method
Text categorisation groups the documents of a collection with respect to their textual content. When we consider short texts, this task is more difficult to achieve because we have few significant terms for characterising the different groups. The identification of high-level structures representing the concepts/topics occurring in the collection can improve the effectiveness of the grouping procedure. In this paper, a two-step strategy for improving the automatic organisation of a collection of documents is proposed. Let T = {d_1, ..., d_n} be a set of n document vectors in a p-dimensional term space, represented by a terms x documents matrix whose generic element t_ij is the number of occurrences of term i in document j (i = 1, ..., p; j = 1, ..., n). For the purpose of our analysis, we are only interested in whether term i occurs in document j or not. We therefore consider a binary matrix B, whose generic element b_ij is equal to 1 if term i occurs at least once in document j, and 0 otherwise. From the matrix B we derive the terms x terms co-occurrence matrix A by the product A = B B^T. The generic element a_ii' is the number of documents in which term i and term i' co-occur (i ≠ i'). An element a_ii on the principal diagonal represents the total number of documents in the collection containing term i. A is an undirected weighted adjacency matrix that can be used to analyse the relations existing among the different terms.
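The construction of B and A can be made concrete in a few lines. The following is a minimal sketch on an invented three-tweet toy corpus; scikit-learn's CountVectorizer is used here purely for convenience and is not part of the authors' toolchain.

```python
# Minimal sketch of the matrix construction just described: binarise the
# terms x documents matrix B, then obtain the terms x terms co-occurrence
# matrix as A = B B^T. The three toy "tweets" are invented for illustration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["pope visits bangladesh",
        "pope speech on religion",
        "religion and culture speech"]

# binary=True gives b_ij = 1 iff the term occurs at least once in the
# document; sklearn returns documents x terms, so we transpose to get B.
vectorizer = CountVectorizer(binary=True)
B = vectorizer.fit_transform(docs).T.toarray()   # terms x documents

A = B @ B.T   # a_ii' = number of documents where terms i and i' co-occur
terms = vectorizer.get_feature_names_out()
print(dict(zip(terms, A.diagonal())))  # a_ii = document frequency of term i
```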
In order to detect the groups of terms defining concepts, we perform a community detection on the matrix A: each community can be seen as a concept/topic occurring in the collection. As said above, the greedy algorithm is based on the optimisation of a quality function known as modularity. Suppose the vertices are divided into communities such that vertex/term i belongs to community c_i. The modularity Q is defined as

Q = \frac{1}{2h} \sum_{i,i'} \left( a_{ii'} - \frac{\kappa_i \kappa_{i'}}{2h} \right) s(c_i, c_{i'})

where h is the total number of edges in the network, κ_i is the degree of term i, and the function s(c_i, c_{i'}) is 1 if c_i = c_{i'} and 0 otherwise. In practice, a value above about 0.3 is a good indicator of an interesting community structure in a network. The greedy algorithm falls in the general family of agglomerative hierarchical clustering methods. Starting from a state in which each term is the sole member of one of K concepts, the algorithm repeatedly joins concepts together in pairs, choosing at each step the join that results in the greatest increase in modularity. At the end of the detection process we obtain a terms x concepts matrix C, a complete disjunctive table whose element c_ik (k = 1, ..., K) is 1 if term i belongs to community k and 0 otherwise. The text categorisation is then performed with a clustering algorithm on the documents x concepts matrix T* = (T^T C) D_K^{-1}, where D_K is the diagonal matrix of the column marginal distribution of C. Each cell of T* contains the proportion of the terms of a concept occurring in a document.

4. A case study
Twitter is one of the most popular, and worldwide leading, social networking services. It can be seen as a blend of instant messaging, microblogging and texting, with brief content and a very broad audience. The embryonic idea was developed by considering the exchange of texts similar to the Short Message Service within a small group of users. As of the third quarter of 2017, it had 330 million monthly active users, with close to 500 million tweets sent daily (source: Twitter, Statista). Our aim is to categorise a set of tweets generated by the same hashtags with respect to the different concepts expressed in the collection itself.

4.1. Data description and pre-processing
Using the Twitter Archiver add-on for Google Sheets [1], we collected 24,588 tweets about Pope Francis, published between November 10th and December 7th, 2017. We used the hashtag #papafrancesco in the query, without any restriction on the language of the tweets. Moreover, we did not filter out the so-called retweets, so that some texts are replicated in the corpus. The pre-processing was performed in two steps. First, we stripped URLs, usernames, hashtags, emoticons and RT prefixes, and we normalised the tweets by removing special characters and any separators other than blanks. Second, on the 23,915 cleaned tweets we performed lemmatisation and grammatical tagging. The terms contained in tweets written in languages other than Italian were treated as noise. In the analysis we consider only nouns, because of their content-bearing role. Moreover, we delete from the vocabulary the terms occurring fewer than 10 times. We thus obtain a documents x terms matrix T with 23,915 rows and 1,603 columns, and the corresponding terms x terms co-occurrence matrix A.

4.2. Concept identification and categorisation process
We perform the community detection procedure on A in order to identify the concepts.
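Continuing the toy example, the whole chain of Section 3, from community detection on A to the Ward clustering of documents on T*, can be sketched as follows. networkx's greedy modularity maximisation stands in for the fast-greedy algorithm, scipy provides the hierarchical clustering, and all data are the invented toy matrices from the previous snippet; this is an illustration, not the authors' code.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from scipy.cluster.hierarchy import fcluster, linkage

# 1. Community detection on the co-occurrence graph (an edge wherever
#    a_ii' > 0; the case study additionally applies a threshold of 30).
p = A.shape[0]
G = nx.Graph()
for i in range(p):
    for j in range(i + 1, p):
        if A[i, j] > 0:
            G.add_edge(i, j, weight=int(A[i, j]))
communities = list(greedy_modularity_communities(G, weight="weight"))
# Values above ~0.3 indicate interesting community structure.
print("Q =", nx.algorithms.community.modularity(G, communities, weight="weight"))

# 2. Terms x concepts indicator matrix C: c_ik = 1 iff term i is in community k.
K = len(communities)
C = np.zeros((p, K))
for k, members in enumerate(communities):
    for i in members:
        C[i, k] = 1.0

# 3. Documents x concepts matrix T* = (B^T C) D_K^-1: each cell holds the
#    proportion of a concept's terms occurring in a document (the binary
#    toy matrix B plays the role of T here).
T_star = (B.T @ C) / C.sum(axis=0)

# 4. Ward hierarchical clustering of the documents on T*.
Z = linkage(T_star, method="ward")
print(fcluster(Z, t=2, criterion="maxclust"))  # e.g. cut into 2 clusters
```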
For better highlighting the relations among the terms, we fixed a threshold of 30 on the co-occurrence values, deleting isolated terms. The greedy algorithm detected 38 different concepts. The high value of the modularity measure (Q = 0.648) supports the effectiveness of the procedure. In Table 1 we list, as an example, the terms belonging to some of the detected concepts.

[1] https://chrome.google.com/webstore/detail/twitter-archiver/pkanpfekacaojdncfgbjadedbggbbphi

Table 1 – Concepts detected in the collection with corresponding terms
Concept 2: scienza, sperimentazione, accanimento, responsabilità, malato, cura, eutanasia, …
Concept 7: bangladesh, religione, viaggio, cultura, myanmar, discorso, buddista, monaco, …
Concept 10: aborto, perversione, febbraio, don, pieri, colonizzazione, crimine, mafia
Concept 19: pensiero, figlio, papà, cecilia, moser, monte
Concept 23: dramática, miedo, josé, experimentan, condición, maría, marcada, incertidumbre
Concept 27: giornatamondialedeipoveri, aula, giovanni, paolo, preparazione, pranzo

It is interesting to note that the algorithm identifies the concepts not written in Italian (e.g., concept 23 contains Spanish terms) and the concepts not related to Pope Francis (e.g., concept 19 refers to a popular reality show). By selecting only the terms belonging to the different communities, we obtain a 19,799 x 38 matrix T*. On this matrix we perform a hierarchical clustering based on the Ward criterion. Figure 1 shows the histogram of the level indices obtained by the clustering. The indices represent the loss of inter-class inertia caused by each aggregation. The maximum gap in the distribution suggests considering a partition into 37 clusters.

Figure 1 – Histogram of the level indices calculated on the dendrogram's nodes

Because of the unsupervised nature of the approach, the quality of the results can be investigated only by looking at the clusters' composition. Due to the limit of 140 characters, each tweet can express one to three concepts at most. In Table 2 we can see the concepts occurring in the different clusters. The order of the concepts represents their importance in terms of statistical significance. The preliminary results seem to be very promising, but a deeper investigation has to be carried out in order to validate the proposal.

Table 2 – Clusters' size and composition
Cluster  Tweets  Concepts      Cluster  Tweets  Concepts      Cluster  Tweets  Concepts
1        120     6             14       8210    4, 7          27       51      30
2        506     15, 6, 9      15       536     1             28       150     36
3        95      9, 15         16       1348    32            29       163     37
4        62      12            17       1379    13            30       41      21
5        179     29            18       677     3             31       51      28
6        93      14            19       2699    2             32       102     22, 4
7        79      16            20       666     8, 7          33       71      26, 22
8        160     10            21       48      24, 20, 13    34       42      17, 11
9        445     5             22       155     20, 4, 24     35       288     11, 34
10       304     19, 18        23       242     38            36       125     34, 11
11       36      18            24       55      25            37       42      23, 11
12       66      31            25       71      33            Total    19799
13       335     27            26       107     35

5. Final remarks
The proposed strategy aims at categorising the documents of a collection by detecting high-level structures, i.e. concepts, as subsets of terms. The terms belonging to each concept are retained in the process and can be used for characterising the identified groups of documents. The tools are provided by network analysis, namely community detection algorithms. The strategy is suitable when dealing with short texts. Future developments of this work are devoted to setting the co-occurrence threshold automatically in the community detection step, and to evaluating alternative similarity indices for measuring the strength of the relations among terms.

References
Aggarwal, C. C. and Yu, P. S. (2000).
Finding generalized projected clusters in high dimensional spaces. Proceedings of SIGMOD'00, pp. 70-81.
Clauset, A., Newman, M. E. and Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111.
Dash, M. and Liu, H. (2000). Feature selection for clustering. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 110-121.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407.
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3): 75-174.
Galavotti, L., Sebastiani, F. and Simi, M. (2000). Feature selection and negative evidence in automated text categorization. Proceedings of KDD-00.
Hlaoui, A. and Wang, S. (2004). A direct approach to graph clustering. Neural Networks and Computational Intelligence: 158-163.
Lebart, L., Salem, A. and Berry, L. (1998). Exploring Textual Data. Springer Netherlands.
Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23): 8577-8582.
Newman, M. E. and Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2): 026113.
Raghavan, U. N., Albert, R. and Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3): 036106.
Scott, J. (2000). Social Network Analysis: a handbook. Sage, London.
Wasserman, S. and Faust, K. (1994). Social Network Analysis. Cambridge University Press.
Yang, Y. (1995). Noise reduction in a statistical approach to text categorization. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 256-263.
Yang, Y. and Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Proceedings of ICML-97, pp. 412-420.

Calling on people to sign an online petition: linguistic characteristics of the calls

Christine Barats¹, Anne Dister², Philippe Gambette³, Jean-Marc Leblanc¹, Marie Peres¹
¹ Université Paris-Est, CEDITEC (EA 3119), Créteil, France – christine.barats@parisdescartes.fr, jean-marc.leblanc@u-pec.fr, marie.leblanc@u-pec.fr
² Université Saint-Louis - Bruxelles, Belgique – anne.dister@usaintlouis.be
³ Université Paris-Est, LIGM (UMR 8049), Champs-sur-Marne, France – gambette@u-pem.fr

Abstract
The analysis of the 12,522 petition texts of an online petition platform makes it possible to examine their linguistic characteristics. The use of statistical textual analysis tools brings to light several regularities in the ways people are called on to sign. We focus on lexical regularities, on forms of address, and on the modes of involvement of the signatories.

Keywords: statistical textual analysis, online petition, petition texts

1. Introduction
Online petition platforms extend and modify the act of petitioning (Contamin, 2001).
Following the line of research on the impact of online participation platforms on forms of digital writing and political engagement (Boure and Bousquet, 2011; Mabi, 2016; Badouard, 2017; Contamin, 2017), we propose to examine the characteristics of the call texts of a digital petition platform. The corpus we analysed comes from one of the main French-speaking online petition sites (lapetition.be). It comprises more than 12,500 petitions, which collected a total of 3.25 million signatures over the period from 31 October 2006 to 12 February 2015. The site offers 9 rubrics among which the petition initiator must classify his or her petition: Art et culture; Droits de l'Homme; Environnement, nature et écologie; Humour/Insolite; Loisirs; Politique; Protection animalière; Social; Autres. As we have shown elsewhere (Barats et al., 2016) and as recalled in Figure 1, the rubrics vary considerably both in the number of petitions (Figure 1) and in the length of the call texts, the number of signatures, and the number and volume of comments left by signatories. The choice of rubric is up to the petition's initiator and reflects an interpretation that varies from one initiator to another, but it gives rise to regularities internal to each rubric, which emerge from automated classifications of the corpus. In this article we concentrate exclusively on the texts of the calls, with particular attention to their incipit, in order to observe the lexical and syntactic regularities that characterise call texts across the whole corpus, but also by contrasting the rubrics. The 12,522 texts constitute a corpus of 2.6 million words.

Figure 1 – Distribution of the number of petitions per rubric: Humour/Insolite 397; Art et culture 652; Loisirs 795; Environnement, nature et écologie 1034; Protection animalière 1378; Droits de l'Homme 1738; Social 1806; Politique 2276; Autres 2446.

2. The most frequent words in the call texts
In order to identify the presence or absence of forms common to the call texts, we examined the openings of the calls, independently of the rubrics. The distribution of the first word of the calls does not follow a power law (the usual Zipf law), as the curve decreases more slowly. The openings of call texts thus display a distinctive frequent vocabulary. The 20 forms of this list appear in first position in more than half of the petition texts: nous, pour, bonjour, le, la, je, les, monsieur, pétition, l, il, a, depuis, non, en, cette, si, madame, contre, suite. If we now turn to the 200 most frequent forms in the call texts as a whole, we observe that the first conjugated verbs are est, sont, ont, soit, peut, demandons, faut, doit, avons, sommes, demande, sera, and the first lexical words pétition, enfants, pays, personnes, vie, Belgique, France, temps, animaux, monsieur, monde, place, projet, jour, droit, loi, politique, mois, travail, ville, ministre, gouvernement, citoyens, cas, Bruxelles, justice, président, lieu, site, chiens, situation, rue.
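The census of opening words can be reproduced in a few lines. The following is a toy sketch on four invented call texts (the actual corpus has 12,522 texts, and the authors' counts come from their own textometric tooling):

```python
# Take the first token of each call text and rank the openers by frequency.
from collections import Counter

calls = [
    "Nous demandons le retrait du projet.",
    "Bonjour, je lance cette pétition contre la fermeture.",
    "Pour la sauvegarde du parc, signez !",
    "Nous, citoyens, réclamons justice.",
]

first_words = Counter(call.split()[0].strip(",.!?").lower() for call in calls)
for word, count in first_words.most_common():
    print(word, count)
# e.g. nous 2, bonjour 1, pour 1 -- the paper then checks whether this
# rank/frequency distribution follows a Zipf-like power law.
```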
As figure 2 shows, ten forms appear both among the 30 most frequent words of the calls (stop words excluded) and among the 30 most frequent in first position: nous, pour, je, pétition, non, contre, j, vous, on, notre. Conversely, some words that appear with high frequency in first position are absent from the 200 most frequent words overall, or rank very low: bonjour (545), monsieur (313), madame (141), chers (111), stop (82), signez (80), mesdames (73), appel (60), voila (53), marre (45), messieurs (41), cher (40), voici (40), lettre (36), voilà (30), trop (30), oui (29), sauvons (24), test (23), aidez (22), salut (18). These include forms typical of direct address: bonjour, salut, madame and mesdames, monsieur and messieurs, or chers. The presence of bonjour or salut reflects the diversity of modes of address, which correspond to different registers and sometimes unexpected formulations. Online accessibility lowers the barrier to launching a petition: our corpus spans a continuum running from the most serious petitions — those echoed in the press, which would probably have existed without the online platform and which are signed by tens or hundreds of people — to very confidential, "just for fun" petitions, whose call text is very short and which collect few signatures. Bonjour appears with greater frequency in the rubric "Loisirs". The form test, for its part, reveals certain difficulties tied to the platform: the point is to test whether a petition can be put online, and the call text then consists of this single word.

Figure 2 – Shared-frequency chain visualization (Lechevrel and Gambette, 2016) of the 30 most frequent words, stop words excluded, in first position and across the petition texts.

Two presentatives (voici: 40 occurrences, voila/voilà: 83 occurrences) are frequently attested in first position of the petition calls, particularly in the rubrics "Loisirs" and "Humour". The enunciative value of these two forms differs somewhat. In many uses voilà is a mark of orality that introduces the message without fundamentally modifying its content, while remaining a presentative ("Voilà je suis une très grande fan du destin de Lisa", "Voilà les Tokyo Hôtel refont des tournées"…). Other uses are the product of deliberation ("Voilà, j'ai décidé de faire une pétition", "Voilà, je fais cette pétition") or have a concluding value ("Voilà pourquoi il faut avoir peur de l'avenir"). This last configuration is more frequent when voilà occupies a position other than first in the sentence ("Voilà le problème", "voilà pourquoi j'ai décidé de"…). A second category of use, in which voici and voilà take on the same values, with voici the more frequent, concerns temporal marking ("Voilà quelques années que l'on demande l'autorisation de porter des shorts", "Voici 22 mois que je suis papa").
Finally, voici like voilà (in much smaller proportions for the latter) acts as a presentative in a large number of uses ("Voilà le but de ma pétition", "voilà ma propre pétition", "voici une histoire comme tant d'autres", "voici une pétition à faire suivre", "voici le lien de ma pétition"…). With the imperative verbs signez, aidez and sauvons, the petition's initiator goes straight to the point: the aim is to push signatories to act through the act of petitioning. Stop, marre, trop and oui belong to the same movement: act, put an end to, encourage, and so on. To this list we add pour, the second most frequent word in first position. Together with contre, it is very clearly a characteristic mark of the petitioning posture: one opposes, one supports. In most rubrics, texts beginning with non or contre are half as numerous as those beginning with oui or pour, except in the rubric "Environnement", where they are more numerous. Our investigations will continue using the corpus annotation features offered by TextObserver, in order to take better account of the various contexts in which these forms occur and thereby strengthen their disambiguation. Imperative verbs are an interesting indicator of signatory involvement, which is also found in the use of the pronouns nous, vous and je, to which we now turn.

3. The involvement of signatories and petition initiators
The pronoun nous is heavily mobilized in our corpus: the most frequent word at the beginning of the calls, it is also the most used pronoun in the corpus as a whole. This nous is meant to mobilize: from the petition text onward, it includes the future signatories in the act of petitioning. Extracting the 10 most specific cooccurrents of the pronoun nous in first position, with the TextObserver tool (Barats et al., 2013), yields, in decreasing order of specificity: demandons, voulons, souhaitons, soussignés, citoyens, soutenons, réclamons, opposons, déclarons, appris. This pronoun very often introduces a demand or a denunciation, sometimes elements of context (cf. appris). One obviously cannot rule out that some of these nous refer only to the petition's initiators, without including the signatories. Nevertheless, the presence of the cooccurrents citoyens and soussignés, and our returns to the texts, show that the vast majority of nous include the signatories. A more detailed study is under way to quantify the different cases more precisely. A query by rubric confirms the quantitative importance of this inclusive nous, particularly in the rubrics "Environnement", "Politique" and "Social", as figure 3(a) shows.

Figure 3 – Number of petitions, per rubric, whose call text contains j', je or nous (a), and median word count of petition texts that do or do not contain these pronouns (b).

The pronoun je ranks fourth among the most frequent words at the beginning of the texts, and it is the third most mobilized pronoun across all the texts, after nous and vous.
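The specificity extraction described above is performed with TextObserver; a hedged stand-in is sketched below using Dunning's log-likelihood ratio, a common specificity measure (not necessarily the one TextObserver implements), comparing the vocabulary of petitions opening with nous against the rest of the corpus. The `texts` variable is again a placeholder.

```python
# Hedged sketch: rank cooccurrents of first-position "nous" by log-likelihood.
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table."""
    def h(*ks):  # sum of k*log(k/N) over the cells
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

texts = ["Nous demandons la fermeture ...", "Pour la sauvegarde ..."]  # placeholder
nous_first = [t for t in texts if t.split() and t.split()[0].lower() == "nous"]
others = [t for t in texts if t not in nous_first]

sub = Counter(w.lower() for t in nous_first for w in t.split())
ref = Counter(w.lower() for t in others for w in t.split())
n_sub, n_ref = sum(sub.values()), sum(ref.values())

scores = {w: llr(c, n_sub - c, ref[w], n_ref - ref[w]) for w, c in sub.items()}
for w, s in sorted(scores.items(), key=lambda x: -x[1])[:10]:
    print(w, round(s, 1))  # the 10 most specific forms of the "nous" subcorpus
```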
It is not rare for the two pronouns nous and je/j' to be used in the same call text, the initiator moving from personal experience to the mobilization of signatories, as in the following petition entitled "Contre la fermeture du Delhaize d'Herstal" (pet 14595): "Je trouve ça honteux de fermer un magasin qui est récompensé du meilleur rapport clients-Personnel! Il est temps de se serrer les coudes et de se battre jusqu'au bout! Ne nous laissons pas faire!!!!!".

Figure 4 – Percentage of petition texts that do or do not point to a URL (a) and that mention facebook (b), per rubric.

One way of moving from individual involvement to collective mobilization is to refer to other spaces where information is relayed on the web, which translates into the presence of URLs, sometimes targeting social networks. 11% of the calls contain URLs. The effect of the rubrics is confirmed: "Protection animalière" and "Environnement" contain the largest share of URLs (17%), as figure 4(a) shows. To probe this result further, we looked at the presence of the social network Facebook: 1.6% of petition texts point to it, as figure 4(b) shows. The rubric "Protection animalière" relies most on relays via Facebook pages, confirming a specific, transmedia mode of mobilization (Barats et al., 2016). The rubric "Politique" relies least on Facebook. Note, however, that the most-signed petition, on the unity of Belgium, launched in August 2007, proposed — once the petition closed — to gather on a website the photos of one of the demonstrations organized in November 2007. Petition texts thus reflect the articulation of different web devices in the dynamics of petitioning, which a strictly quantitative approach captures only partially. Looking at figure 3(a), one may be struck by the relatively large number, in each rubric, of petitions in which neither of these two pronouns appears, which may be the sign of less involving, more impersonal petitions. Indeed, fewer than 15% of these texts without nous or je/j' contain the pronoun vous. On closer inspection, the petition texts without nous or je/j' are, in every rubric, much shorter than those that include nous and/or je/j', as figure 3(b) shows.
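A minimal sketch of the kind of URL and Facebook detection underlying the percentages in figure 4; the regular expression and the sample texts are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch: share of petition texts containing a URL or mentioning facebook.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

texts = ["Signez ici : http://example.org/petition",
         "Voir notre page facebook pour plus d'infos"]  # placeholder data

with_url = sum(1 for t in texts if URL_RE.search(t))
with_fb = sum(1 for t in texts if "facebook" in t.lower())
print(f"URL: {100 * with_url / len(texts):.1f}%"
      f"  facebook: {100 * with_fb / len(texts):.1f}%")
```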
4. Conclusions and perspectives
Our analysis of the first words of petition call texts shows that the vocabulary used in this position displays more regularities tied to the specificities of petitioning than the texts taken as a whole. It allows us to identify several linguistic characteristics that sometimes vary across rubrics (personal pronouns, forms of address, URLs, etc.). The textometric approach sometimes reaches its limits, as with the ambiguity of nous, which may or may not include the petition's promoters or signatories, or with the positive or negative polarity of prepositions and verbs, which does not suffice to determine whether the petition expresses a demand or a denunciation. This work is a first step towards a systematic examination of other markers that serve to involve signatories, such as imperative verbs or determiners, with a view to relating them to the number of signatories and, possibly, to recommendations for drafting online petition texts.

References
Badouard R. (2017). Le désenchantement de l'internet. Désinformation, rumeur et propagande. Paris, FYP éditions.
Barats C., Leblanc J.-M. and Fiala P. (2013). Approches textométriques du web : corpus et outils. In Barats C., editor, Manuel d'analyse du Web en sciences humaines et sociales. Paris, Armand Colin.
Barats C., Dister A., Gambette Ph., Leblanc J.-M. and Peres M. (2016). Analyser des pétitions en ligne : potentialités et limites d'un dispositif d'études pluridisciplinaires. JADT 2016, Nice. http://lexicometrica.univ-paris3.fr/jadt/jadt2016/01-ACTES/83043/83043.pdf
Boure R. and Bousquet F. (2011). La construction polyphonique des pétitions en ligne. Le cas des appels contre le débat sur l'identité nationale. Questions de Communication, vol. 20: 293-316.
Contamin J.-G. (2001). Contribution à une sociologie des usages pluriels des formes de mobilisation : l'exemple de la pétition en France. Thèse de doctorat, Université Paris 1.
Contamin J.-G., Léonard T. and Soubiran T. (2017). Les transformations des comportements politiques au prisme de l'e-pétitionnement. Potentialités et limites d'un dispositif d'étude pluridisciplinaire. Réseaux, vol. 204(4): 97-131.
Lechevrel N. and Gambette P. (2016). Une approche textométrique pour étudier la transmission des savoirs biologiques au XIXe siècle. Nouvelles perspectives en sciences sociales, vol. 12(1): 221-253.
Mabi C. (2016). Analyser les dispositifs participatifs par leur design. In Barats C., editor, Manuel d'analyse du Web en sciences humaines et sociales. Paris, Armand Colin.

Newsgroups and lexicography: from the NUNC to the VoDIM

Manuel Barbera, Carla Marello
Università degli Studi di Torino – b.manuel@inrete.it; carla.marello@unito.it

Abstract
The VoDIM (Vocabolario dinamico dell'italiano moderno – Dynamic dictionary of modern Italian) represents a new development in recent Italian lexicography. In this paper we argue that the NUNC corpora (www.corpora.unito.it), which contain texts from newsgroups downloaded at the beginning of the twenty-first century, display aspects of "written-spoken" Italian. The NUNC can offer instances of new meanings of "old" words and of new collocational contexts. We discuss several examples taken from the corpora, such as the internationalism Umwelt, the collocation assolutamente sì, and the abbreviation clima for climatizzatore ('air conditioning').

Keywords: VoDIM, NUNC, lexicography, Italian

(Authorship note: §§ 2 and 3 are due to Manuel Barbera, §§ 4 and 5 to Carla Marello, and § 1 is to be ascribed to both, although the work was of course conceived together and both authors take full responsibility for it.)
1. Introduction
The VoDIM (Vocabolario dinamico dell'italiano moderno), a project led by the Accademia della Crusca (cf. http://www.accademiadellacrusca.it/it/eventi/crusca-torna-vocabolariolesicografia-dinamica-dellitaliano-post-unitario) involving eight research groups from as many Italian universities, among them the Turin group, will be an online, corpus-based dictionary of post-unification Italian, drawing also on other dictionaries acquired in digital form — such as the Tommaseo-Bellini, the fifth Crusca and the Battaglia — and designed to be queryable also as a "variable corpus" defined by the user. The corpora on which it rests therefore become essential. A first base reference corpus (whose results are not yet public: http://dizionariodinamico.it/prin2012crusca/dictionary) was produced under PRIN 2012 by the Crusca itself (in collaboration with the Universities of Catania, Firenze, Genova, Milano, Napoli, Piemonte Orientale, Tuscia and with the CNR), but it is of course insufficient on its own.

2. The NUNC
One corpus with which we suggest completing it is the NUNC-IT. The NUNC (homepage: http://www.bmanuel.org/projects/ng-HOME.html), conceived by Manuel Barbera (at bmanuel.org) and maintained by the same Turin group that participates in the VoDIM, are properly a multilingual suite of corpora intended to document the textual genre "newsgroup" at the beginning of the third millennium. Many versions have been implemented (including topic-specific ones), all reachable from the homepage. The result is not yet fully satisfactory; still, some use can already be made of them, as demonstrated by several contributions in Barbera et al. 2007 and in Costantino et al. 2009, to cite only the first uses of ten years ago.

A newsgroup is a freely accessible, cost-free online forum, available on the Internet, which takes the form of written texts, the posts, sent to an "electronic notice board" maintained on a network of servers (the newsservers that make up UseNet). The group's users can download, read and reply to posts, building chains (threads) of exchanges. Newsgroups are organized in a precise taxonomy, i.e. a system of thematic frames called "hierarchies", on a geographic-national and/or thematic basis. The advantages of this textual base for corpus linguistics are numerous and have been treated in Barbera, 2007 and Barbera and Marello, 2009; what interests us here in the first place is that they present an absolutely contemporary Umgangssprache, real and highly varied in registers and topics. As far as the VoDIM is concerned, many items — neologisms, technical terms, loanwords, etc. — are not attested in the Crusca base corpus, so the NUNC could prove a useful reservoir of contexts.

3. A case study: Umwelt
Consider for instance a technical loanword, the term Umwelt. Introduced (in German) by the biologist Jakob Johann baron von Uexküll (Estonian, but from a Baltic German family) as early as the title of his important work of 1909 (Umwelt und Innenwelt der Tiere), it soon entered the philosophical
tradition (starting from a 1914 review by Max Scheler): used by Heidegger in a course of 1929-30, it then became common currency (among others) in French with Gilles Deleuze, Maurice Merleau-Ponty and Jacques Lacan, and in Italian with Giorgio Agamben. But it is used above all in biology texts, naturally, and then in semiotics, where it was spread in the Sixties by Thomas Albert Sebeok (born Sebők Tamás) and lies at the basis of modern biosemiotics (cf. Kull, 2001). In the NUNC the term is repeatedly attested. (Quotations from the corpus are reproduced verbatim throughout: in particular, the tokenization of punctuation and apostrophes, all the "typing errors", and the orthographic idiosyncrasies typical of the genre are retained.)

Per Gadamer comprendere l ' esistenza - e qui c'è ancora Heidegger - significa prima di tutto pre-comprenderla , in quanto la comprendiamo con un linguaggio che non scegliamo , ma che , trascendentalmente , definisce già la realtà in cui ci muoviamo : l'Um-Welt , da un lato , e dall ' altro lato , il Mit-welt . Ma , Gadamer cerca di andare alla radice del movimento del pensiero del soggetto e tale origine sta nell ' esigenza di comprendere e farsi comprendere , cioè nel muoversi nell ' Umwelt e nel Mitwelt . Il fatto è che per Gadamer l ' Altro è visibile solo con gli " occhi nostri ", ciò con ciò che " siamo ", con la nostra " identità ", il nuovo si dà solo nel familiare . E in un certo senso è così . L ' altro è ciò che mi disturba che mi inquieta perchè non riesco a ridurlo al mio mondo : è un'eccedenza .

The preceding passage is an example of the technical-philosophical use of the term, which does not differ much from what one would find by excerpting the texts (and translations) of that tradition. More interesting is the following example:

Anche in Italia il consumo di televisione è vertiginosamente aumentato : […] . Oltre a due effetti di rilevanza individuale : - la caduta verticale della capacità di fissare l ' attenzione per più di un certo tempo ( se a un buon insegnante occorre anche un ' ora per sviluppare un dato argomento , gli spazi televisivi obbligati in novanta secondi troncano quello stesso argomento in modo irreparabile ) e - la perdita di interesse per la lettura - aspetti che coinvolgono per mimetismo inconscio ( vale a dire per l ' inconscio occupazione degli spazi mentali ad opera non solo delle immagini ma dell ' intera atmosfera televisiva che foggia l ' Umwelt dell ' uomo moderno ) anche persone che fruiscono della TV per tempi ben sotto la media - l ' esposizione allo " sbarramento " delle immagni televisive ha due rilevanti effetti sociali : - il conformismo applicato e - l ' ignoranza generalizzata . […]

This is a transferred use, clearly outside the "technical" fields in which the term spread. Lexicographically this is particularly relevant because it documents the migration of the loanword beyond its original domain, securing its passage into common — if cultivated, or relatively so — usage. For this type of evidence the NUNC can prove particularly useful.

4. Below and beyond the graphic word
Besides dating the appearance of particular lexemes or particular senses, the VoDIM also aims to attest the appearance of clippings and word combinations: the NUNC, in fact, present incipient uses that passed from speech into this early-millennium form of writing.
From the point of view of the statistical frequency of such uses, the data extracted from the NUNC corpora present difficulties due to the phenomenon of quoting, but they constitute a rich mine of first attestations: see, for example, the study by Onesti and Squartini, 2007 on the set phrase tutta una serie di, or Valle, 2006 on the early penetration of (more or less Italianized) anglicisms. As for clippings in particular, we provided a substantial collection of them in Allora and Marello, 2008. One example among many is clima as a clipping of climatizzatore; Marello had already devoted a short article to it (in the Corriere del Ticino of 23 September 2005) and had noted its presence in several posts from 2002 in NUNC-Motori. See the thread excerpt below, in which a casual conce for concessionario ('dealer') also appears; the latter has not achieved registration in dictionaries, unlike clima, whose date of first attestation is, according to the Zingarelli dictionary, 2000. (As noted above, corpus quotations are reproduced verbatim, including the errors present in the source: greater tolerance of careless typing, and the open acceptance of certain graphic-orthographic traits, are typical of this kind of computer-mediated communication.)

Qualcuno e' in grado di dirmi quanti grammi (olio/gas?) servono per la ricarica del clima per un CRD del 2002? Una spesa approssimativa? Grazie
Ciao a tutti, scusate se mi intrometto, ma oggi dopo giorni di dubbio ho chiamato il conce per lo stesso motivo di Massimo, 30 km per sentire un po' di aria fresca con il clima impostato a 5 gradi e macchina lasciata prima all'ombra

Beyond the graphic word, it can for example be interesting to document the uses of assolutamente sì (the subject of an article in the Corriere del Ticino of 21 January 2004): no fewer than 103 are found in the general NUNC. Here are two examples:

Ma ti senti tanto tanto tanto depressa ??? Ci dobbiamo preoccupare ? [>]… Oggi un pò meno , però devo dire che ho passato veramente dei brutti momenti. L ' importante è riprendersi , no ? Assolutamente sì ! Riprendersi e ripartire subito !

tu sei un troll ? […] No , perché il flame occasionale non fa di una persona un troll - werted è un troll ? Assolutamente sì , perché attua flame , insulti e provocazioni in modo sistematico e con offese che vanno oltre l ' ambito dello sfottò sportivo . In più utilizza tutte le tecniche tipiche del trollaggio , dal morphing al faking al flooding .

The same inquiry can be made for anche no, observing that in the vast majority of contexts it occurs as ma anche no.

5. Conclusions
A further factor that makes the NUNC valuable for the usage-minded linguist and lexicographer is their dialogic character, glimpsed above all in the examples presented in § 4. It is a pervasive phenomenon in the NUNC, usually realized in newsgroups as quoting (cf. Barbera, 2011 and Marello, 2007). Computationally this admittedly creates some problems (not yet fully solved), since the phenomenon of repeated text, if left unchecked, inevitably undermines the statistical picture, defeating any simple quantitative use of the corpora; textually, however, it is a phenomenon of great importance, especially when it can be exploited, as in the NUNC, through the possibility of widening contexts up to 2,000 words.
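To make the quoting problem concrete, here is a hedged sketch of quote-aware counting: lines beginning with ">" (the usual UseNet quoting convention) are dropped before counting a collocation such as assolutamente sì, so repeated quoted text does not inflate the frequencies. The NUNC themselves use a more elaborate treatment; the posts below are invented placeholders.

```python
# Hedged sketch: count a collocation only in the unquoted part of each post.
posts = ["> Ma ti senti depressa?\nAssolutamente sì! Riprendersi subito!",
         "Assolutamente sì, perché attua flame in modo sistematico."]  # placeholder

def unquoted(post):
    """Drop '>'-prefixed lines, i.e. text quoted from earlier posts."""
    return "\n".join(line for line in post.splitlines()
                     if not line.lstrip().startswith(">"))

target = "assolutamente sì"
count = sum(unquoted(p).lower().count(target) for p in posts)
print(count)  # -> 2: the quoted first line is excluded from the count
```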
The capacity of newsgroups to fix eminently oral uses in writing, transferring the fluidity of orality to a special type of writing and thus forming a sort of bridge between the two media, can prove particularly important for the VoDIM, precisely because the NUNC corpora record emerging tendencies in the Italian language. We dwelt on the diamesic peculiarity of this particular type of "written-spoken" language in Barbera and Marello, 2009, but here we cannot fail to stress the opportunity it could offer the VoDIM. The NUNC, as we said, are not yet perfect: the prototypes put online are only betas, but the will to perfect them is there, and it cannot be excluded that the VoDIM will provide the right occasion to do so.

Bibliography
Allora A. and Marello C. (2008). "Ricarica clima". Accorciamenti nella lingua dei newsgroup. In Cresti E., editor, Atti del IX Congresso della Società Internazionale di Linguistica e Filologia Italiana (SILFI): "Prospettive nello studio del lessico italiano" (Firenze, 14-17 giugno 2006). Cesati: vol. II, pp. 533-538.
Barbera M. (2007). Per la storia di un gruppo di ricerca. Tra bmanuel.org e corpora.unito.it. In Barbera M., Corino E. and Onesti C., editors, Corpora e linguistica in Rete. Guerra Edizioni: pp. 3-20.
Barbera M. (2011). Une introduction au NUNC : histoire de la création d'un corpus. In Ferrari A. and Lala L., editors, Variétés syntaxiques dans la variété des textes online en italien : aspects micro- et macrostructuraux. Université de Nancy II: pp. 9-36.
Barbera M. and Marello C. (2009). Tra scritto-parlato, Umgangssprache e comunicazione in rete: i corpora NUNC. In Antonini A. and Stefanelli S., editors, Per Giovanni Nencioni. Convegno internazionale di studi. Pisa - Firenze, 4-5 Maggio 2009. Le Lettere: pp. 157-186. Reprinted in Barbera M., Quanto più la relazione è bella: saggi di storia della lingua italiana 1999-2014. Bmanuel.org - Youcanprint, 2015: pp. 157-182.
Costantino M., Marello C. and Onesti C. (2009). La cucina discussa in rete. Analisi di gruppi di discussione italiani relativi alla cucina. In Robustelli C. and Frosini G., editors, Atti del convegno ASLI 2007 "Storia della lingua e storia della cucina. Parola e cibo: due linguaggi per la storia della società italiana". Modena, 20-22 settembre 2007. Cesati: pp. 717-727.
Kull K. (2001). Jakob von Uexküll: An introduction. Semiotica, vol. 134 (1/4): pp. 1-59.
Marello C. (2007). Does Newsgroups "Quoting" Kill or Enhance Other Types of Anaphors? In Korzen I. and Lundquist L., editors, Comparing Anaphors between Sentences, Texts and Languages. Samfundslitteratur Press: pp. 145-157.
Onesti C. and Squartini M. (2007). "Tutta una serie di". Lo studio di un pattern sintagmatico e del suo statuto grammaticale. In Barbera M., Corino E. and Onesti C., editors, Corpora e linguistica in Rete. Guerra Edizioni: pp. 271-284.
Valle L. (2006). Varietà diafasiche e forestierismi nell'italiano nei gruppi di discussione in rete. In López Díaz M. and Montes López M., editors, Perspectives fonctionnelles : emprunts, économie et variations dans les langues. S.I.L.F. 2004. XXVIII Colloque de la Société internationale de linguistique fonctionnelle, tenu à Saint-Jacques-de-Compostelle et à Lugo du 20 au 26 septembre 2004. Editorial Axac: pp. 371-374.
Zingarelli N. (2017). Lo Zingarelli 2017. Vocabolario della lingua italiana. Ed. Mario Cannella and Beata Lazzarini. Zanichelli.
Techniques for detecting normalized violence in the perception of refugees/asylum seekers, between lexical analysis and factor analysis

Ignazia Bartholini
Univ. of Palermo – ignazia.bartholini@unipa.it

Abstract
The theme of gender violence takes on a particular form when linked to the phenomenon of forced migration, where historical-cultural variants of a neo-patriarchal nature intersect with the cultural-religious orthodoxies the newcomers often carry with them. Studying gender violence in the context of globalized migration allows us to highlight three biases that mark Western discourse, concerning: the way its phenomenology is conceived as pre-modern (a); the detachment of the interpretation of violence from policies of intervention and counteraction (b); the treatment of gender asymmetries and of sexist representations and practices in the Mediterranean host society as residual (c). The factorial structure of the questionnaire was then investigated through Principal Components Analysis (PCA) followed by Oblimin rotation of the factorial axes, since a relation between the dimensions of the questionnaire was assumed. The reliability of the scales was verified with Cronbach's alpha coefficient.

Keywords: gender violence, forced migrations, sexist representation

1. Introduction
Over the last two decades, the field of border and migration management has been characterized by the increasing interrelatedness of discourses about control practices and about humanitarian issues (Walters 2011; Fassin 2010). Today, European policies seek to incorporate strategies to support forced migrants as key instruments for the protection of refugees (Moro 2012). Forced migration, which can also be addressed through the lens of gender (Hans 2008), is grafted onto a broader field of research, which includes welfare strategies, social representations and intercultural dynamics. According to the UNHCR, gender-based violence refers to "any act of gender-based violence that results in, or is likely to result in, physical, sexual or psychological harm or suffering to women, including threats of such acts, coercion or arbitrary deprivation of liberty, whether occurring in public or private life" (UNHCR 2008: 201). It can take, among others, the form of "rape, forced impregnation, forced abortion, trafficking, sexual slavery, and the intentional spread of sexually transmitted infections, including HIV/AIDS" (UNHCR 2008: 7, 10).
Violence occurs not only during the migratory journey at the hands of other refugees, but is also inflicted by public officers, government employees and aid-agency staff (Ferris 2007; Freedman 2015).

2. The numbers of the phenomenon
According to data of the Italian Ministry of Internal Affairs, between 2015 and 2016, 154,719 migrants disembarked in Italy, of whom 82,136 were asylum seekers. From January to March 2016, 9,307 migrants disembarked in Italy. Currently, migrants come mostly from Gambia, Senegal, Mali, Guinea, Ivory Coast, Morocco, Somalia, Sudan and Cameroon (source: ANSA). In January 2016 asylum seekers numbered 7,505, mostly from Pakistan (1,510), Nigeria (1,306), Afghanistan (665) and Gambia (625). Among these, 6,739 were men, 766 women, 292 unaccompanied minors and 199 minors. 6,507 requests have been reviewed so far, with the following outcomes: 190 people (3%) were granted refugee status, 698 (11%) obtained a subsidiary permit, 1,352 (21%) were granted humanitarian protection, and 4,266 (66%) were denied (source: Italian Ministry of Internal Affairs). In 2017 alone, 21,478 refugees/asylum seekers of 21 different nationalities transited through the Trapani-Milo hotspot, managed by the NGO "Badia Grande", one of the partners of the project "Provide" (source: Ministry of Interior). These include 16,010 men, 3,177 women and 2,291 children, divided into 1,787 males and 504 females. Last year, two researchers from the University of Palermo submitted a questionnaire of 36 items to 465 women temporarily hosted at the Trapani-Milo hotspot in Sicily.

3. Objectives of the research
The core question of the research concerns the identification of the subjective dimensions of violence from the side of the victims and of the operators, as well as the problems involved in building multicultural social constructions of violence. For this purpose, the research investigates a specific articulation of "migratory violence", which entails cultural specificities and contextual conditions, such as the journey and the time spent in reception facilities. In order to highlight topics and problems related to the social construction of gender violence, attention is paid to the victims' point of view concerning 'normalized' procedural violence, including through operational definitions of victims' first-reception treatment in institutional arenas. Furthermore, gender relations are shaped by the whole migration experience, and this leads to various forms of direct, indirect and structural violence; forms of gender-based violence are seen not only among refugees. Finally, refugees and asylum seekers may suffer structural violence in the form of social exclusion and discrimination (Jaji 2009; Crisp, Morris and Refstie 2012), secondary victimization (Pinelli 2011; Tognetti 2016), labour exploitation (Coin 2004), forced prostitution (Naggujja et al. 2014; Krause-Vilmar 2011) and sexual abuse (Crisp, Morris and Refstie 2012). The migratory violence to which women — as well as minors and LGBT people — are subjected therefore becomes a particular key for reading and interpreting intra- and intercultural gender relations. The first objective of the research was to assess the perception of the violence suffered by the women of the sample before and during the journey to the coast of Sicily.
The second objective was to identify effective interventions to reduce migrants' exposure to different types of violence and threat, to encourage access to physical and psychological services, to assist the victims of violence with integration, to support safe and culturally appropriate instruments, to provide support for families and stable settlement in the host country, and to promote concerted actions for reducing inequalities in access to resources.

4. Methodology
A1. Once the intersection of ethnicity, gender and socioeconomic status had been explored, an internalist perspective was employed, based on the analysis of narrative devices, i.e. the reports of the conversations that victims of migratory violence conduct with experts (linguistic and intercultural mediators, social workers, psychologists and lawyers, but also doctors and police officers) or with members of the third sector.
A2. Definitions of lived or experienced violence were collected through interviews with refugees and with operators of first- and second-reception centres who have particular acquaintance with the phenomenon.
Subsequently, the factorial structure of the questionnaire was investigated through Principal Components Analysis (PCA) followed by Oblimin rotation of the factorial axes, since a relation between the three dimensions of the questionnaire was assumed: a. daily life before the trip; b. gender dynamics and relationships among family members; c. normalized violence. The reliability of the scales was verified with Cronbach's alpha coefficient. In order to verify the hypothesis that there are statistically significant differences in the mean scores of the different dimensions, analyses of variance were carried out. Multivariate analysis of variance, together with a lexical analysis, allowed us to select: 1. the keywords present in the corpus of the questionnaire, using frequency indexes; 2. the meta-information contained within the text units; 3. the context units, through specific data arrays for content analysis. The paper we propose to present will describe the results of the research and the methodological affordances of the text-analysis tools used by the researchers involved.

5. Some research results
To identify the vulnerabilities of migrants, it was necessary to find appropriate instruments of analysis capable of capturing the needs of the victims of violence, in order to deal with them in a respectful, sensitive, professional and non-discriminatory manner. The women explained the need to receive a proper degree of assistance and stronger support and protection. The keywords most frequently used by the migrants were: protection, fear, opportunity, work, life. The content analysis, and the context units involved through specific data, describe the necessity of acknowledging that the women/asylum seekers may be victimized by other men even after their arrival in the reception centre, and the opportunity of putting in place specific procedures to prevent, identify and respond to the different forms of proximity gender-based violence.
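A hedged sketch of the PCA-plus-Oblimin and Cronbach's alpha workflow described in the methodology, assuming a 465 x 36 response matrix; the random data and the use of the `factor_analyzer` package (one common Python route to oblique rotation, installable with `pip install factor_analyzer`) are assumptions, not the authors' actual pipeline.

```python
# Hedged sketch: scale reliability and oblimin-rotated factor structure.
import numpy as np
from factor_analyzer import FactorAnalyzer  # assumed available

def cronbach_alpha(items):
    """items: (n_respondents, n_items) matrix of questionnaire scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()      # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)       # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

rng = np.random.default_rng(0)
X = rng.normal(size=(465, 36))  # placeholder for the real 465x36 responses
print("alpha:", round(cronbach_alpha(X), 3))  # near 0 on random data

# Three factors, mirroring the questionnaire's three assumed dimensions,
# with an oblique (Oblimin) rotation that allows correlated factors.
fa = FactorAnalyzer(n_factors=3, rotation="oblimin")
fa.fit(X)
print(fa.loadings_[:5])  # loadings of the first five items on the 3 factors
```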
6. Conclusion
The problems that refugees face require humanitarian responses and effective interventions (Dal Lago 1999; Colombo 2012; Camarrone 2016), such as the reduction of exposure to different types of violence and threat in the post-migration phase and access to physical and psychological services (Shamir 2005; Ambrosini 2010; Bartholini 2017). From this perspective, the Mediterranean represents a peculiar field of analysis of that normalized violence — procedural and proximal — which denies refugees/asylum seekers, minors and LGBT people the possibility of considering themselves as rights holders and as subjects of equal dignity and value. Moreover, the results of the content analysis show the necessity of stronger integration, with support strategies built on appropriate cultural and social practices, and of providing adequate support for families in a stable settlement in the host countries (Balibar 2012). Lastly, the research highlights the need for concerted action to reduce inequalities in access to resources (Robinson et al. 2006). Persecution related to gender violence may give rise to claims for international protection (Gilbert 2009). The Council of Europe Convention on preventing and combating violence against women (Istanbul Convention of 2011) and Directive 2012/29/EU, in establishing minimum standards on the rights, support and protection of victims, contribute to the obligation to "ensure access for victims and their family members to general victim support and specialist support, in accordance with their needs". Although member states are stepping up their work to streamline a gender understanding into public decision-making, policy and operations, this effort is not always reflected in asylum procedures.

References
Ambrosini M. (2010). Richiesti e respinti. L'immigrazione in Italia. Come e perché. Milano: il Saggiatore.
Balibar E. (2012). Strangers as enemies. Walls all over the world, and how to tear them down. Mondi Migranti, vol. 6, n. 1: 7-25. DOI: 10.3280/MM2012001001
Bartholini I. (2017). Migrations: A Global Welfare Challenge: Policies, Practices and Contemporary Vulnerabilities (with F. Pattaro Amaral, A. Silvera Samiento, R. Di Rosa). Edition Corunamerica, Barranquilla (Colombia), pp. 1-196.
Camarrone D. (2016). Hotspot di Lampedusa, la sindaca chiede al Ministero dell'interno una verifica urgente delle procedure UE. Diritti e frontiere, 8 gennaio 2016, http://dirittiefrontiere.blogspot.it/2016/01/la-verita-sul-sistema-hotspot.html
Coin F. (2004). Gli immigrati, il lavoro, la casa. Franco Angeli: Milano.
Colombo A. (2012). Fuori controllo? Miti e realtà dell'immigrazione in Italia. Bologna: Il Mulino.
Convenzione di Dublino (1990), http://www.camera.it/_bicamerali/schengen/fonti/convdubl.htm
Crisp J., Morris T. and Refstie H. (2012). Displacement in urban areas: new challenges, new partnerships. Disasters, 36(1): S23-S42.
Dal Lago A. (1999). Non Persone. L'esclusione dei migranti in una società globale. Milano: Feltrinelli.
Fassin D. (2010). La raison humanitaire. Une histoire morale du temps présent. Gallimard-Seuil-Hautes Études: Paris.
Gilbert L. (2009). Immigration as Local Politics: Re-Bordering Immigration and Multiculturalism through Deterrence and Incapacitation. International Journal of Urban and Regional Research, vol. 33, n. 1: 26-42. DOI: 10.1111/j.1468-2427.2009.00838.x
Jaji R. (2009). Refugee woman and the experiences of local integration in Nairobi, Kenya. University of Bayreuth: Bayreuth.
Krause-Vilmar J. (2011).
The Living Ain't Easy, Urban Refugees in Kampala. UN Report.
Ministero dell'Interno (2015). Rapporto sulla protezione internazionale in Italia 2015, http://www.interno.gov.it/sites/default/files/t31ede-rapp_prot_int_2015__rapporto.pdf
Naggujja Y. et al. (2014). From The Frying Pan to the Fire: Psychosocial Challenges Faced By Vulnerable Refugee Women and Girls in Kampala. Report of the Refugee Law Project.
Osti G. and Ventura F., editors (2012). Vivere da Stranieri in Aree Fragili. Napoli: Liguori.
Palidda S., editor (2011). Il discorso ambiguo sulle migrazioni. Messina: Mesogea.
Pinelli B. (2011). Attraversando il Mediterraneo. Il sistema campo in Italia: violenza e soggettività nelle esperienze delle donne. Lares, 77: 159-180.
Regolamento (CE) n. 343/2003 (Dublino II), http://eur-lex.europa.eu/legal-content/IT/TXT/?uri=URISERV%3Al33153
Regolamento UE n. 604/2013 (Dublino III), http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2013:180:0031:0059:IT:PDF
Robinson D. and Reeve K. (2006). Neighbourhood Experiences of New Immigration. Reflections from the Evidence Base. York: Joseph Rowntree Foundation.
Shamir R. (2005). Without borders? Notes on globalization as a mobility regime. Sociological Theory, vol. 23, n. 2: 197-217. DOI: 10.1111/j.0735-2751.2005.00250.x
Tognetti M. (2016). Donne e processi migratori fra continuità e cambiamento. ParadoXa, X(3): 69-88.
Walters W. (2011). Foucault and Frontiers: Notes on the Birth of the Humanitarian Border. In Bröckling U. (ed.), Governmentality: Current Issues and Future Challenges. Routledge: London.

From corpus to dictionary: first lexicographic reflections on the Vocabolario storico della cucina italiana postunitaria (VoSCIP)

Patrizia Bertini Malgarini¹, Marco Biffi², Ugo Vignuzzi³
¹ LUMSA – p.bertini@lumsa.it
² Università degli Studi di Firenze – marco.biffi@unifi.it
³ Sapienza Università di Roma – ugo.vignuzzi@uniroma1.it

Abstract
The Vocabolario storico della cucina italiana postunitaria (VoSCIP) is a historical dictionary of the language of cooking, a language that also played a considerable role in the shaping of a national linguistic model after the Unification of Italy. The dictionary is based on a representative corpus (currently 42 texts), but by its nature it is a work in progress, open and progressively growing. The first sample entries (such as cappelletti, anolini, tagliatelle, bagnomaria), presented at various conferences and in several articles, were based on a restricted corpus (28 texts) and brought to light some critical issues that made a further methodological reflection necessary. The aim of our paper is to discuss some aspects of these investigations and of this methodological reflection: a) the structure of the entry in differentiated form ("light" and "complex"); b) the treatment of the collocations emerging from the statistical analysis tools of the corpus; c) the lemmatization of compound words in the face of the morphological variation emerging from the diachronic depth of the corpus; d) the correct balance between the examples quoted in the entry and the possibility of a direct interrelation with the database.
Keywords: language of cooking, special languages, corpus linguistics, lexicography, vocabulary, Italian, historical dictionary

1. The VoSCIP
The "Vocabolario storico della cucina italiana postunitaria" (VoSCIP) was conceived to document the constitution and consolidation of a unitary culture and language of gastronomy in Italy after Unification. This need is well known to all specialists in the field (linguists, food historians, sociologists, etc.), and in our specific case it arose from a precise research perspective: examining the ways and means by which a "common" gastronomic Italian established itself, starting from Pellegrino Artusi and from the archetypal model of his enormously successful La scienza in cucina e l'arte di mangiar bene. The project "L'Italiano in cucina. Per un Vocabolario storico della lingua italiana della gastronomia" has been taken up by the Accademia della Crusca, which has included it among the studies contributing to its strategic project dedicated to the compilation of a dictionary of post-unification Italian. For the realization of the VoSCIP, the first step was to fix a representative corpus of texts, in which a pivotal role naturally falls to the Scienza in cucina: a corpus which, for reasons of practical feasibility, we decided to extend up to the Second World War and its immediate aftermath, with the hoped-for prospect of later moving the terminus ad quem to the present day (including, besides the printed texts after 1950, the various forms of "transmitted" production, from recipe collections on the web to blogs and social media). The main reference corpus currently comprises over a hundred volumes published between the end of the eighteenth century (we return shortly to the reasons for pushing back the terminus post quem) and 1950. The texts were selected using the main bibliographies of Italian gastronomic production of the period (precious, first of all, the one by Alberto Capatti accompanying the 2010 Rizzoli edition of Artusi's Scienza). We necessarily also had to take account of practical factors, first of all the availability of the works and above all their accessibility through the Academia Barilla, with which the Accademia della Crusca has signed a specific agreement for this purpose. At present, forty texts have been digitally acquired and marked up (XML/TEI).
Before proceeding, a due (and already announced) clarification on the terminus post quem: even though our primary objective is, as we said, to collect and describe the language of the post-unification Italian gastronomic tradition, in order to better document the origins of this Italian of the kitchen (especially as regards phraseology, i.e. first of all multiword units and collocations, but also proverbial sayings, idioms, etc.), we decided to include some of the most significant texts from between the late eighteenth and early nineteenth centuries, starting with the two redactions of the Apicio moderno and with Vincenzo Corrado's Cuoco galante. To the same end, we are also systematically excerpting everything pertinent to the semantic field of food in the Italian lexicographic tradition, starting from the five impressions of the Vocabolario degli Accademici della Crusca, from the Tommaseo-Bellini, from the Giorgini-Broglio, and above all from Alfredo Panzini's Dizionario moderno (first edition 1905). The interest of this dictionary, which offers a true panorama of Italian life and culture between the late nineteenth and the twentieth century, lies from our point of view precisely in the space given to new words, which already in the first edition Panzini catalogued as "scientific, technical, medical, philosophical, [foreign words, neologisms, sport words,] fashion, theatre, kitchen" words. Indispensable in the lexical field of food (as is well known) is the diatopic dimension, for which the VoSCIP will be able to draw on the important results of twentieth-century geolinguistic research, first of all the linguistic atlases: the AIS and the ALI, but also the ASLEF, the ALEPO, the ALT, the ALLI, and the precious materials being published for the ALS (among which at least the paradigmatic volume Ruffino 1995 should be mentioned). To verify the feasibility of our project we produced some pilot entries: we started from tagliatella, followed by agnelotto, cappelletto and anolino; in a quite different field we have very recently drafted the entry bagnomaria. Precisely the drafting of these entries, and in particular of the last one, bagnomaria, revealed some critical weaknesses of the entry model originally devised and made it necessary to rethink it in a way that fully exploits the resources of computer-aided (or computerized) lexicography and of the multimedia tools available today.

2. The database
The corpus texts were given a light XML/TEI markup, aimed above all at lexicographic purposes. To date, 42 texts covering the chronological span uniformly have been acquired, collated and marked up. The header provides the indications of author, title, place of publication, publisher, year, text type and diamesic indication, so that they can serve as the basis for filtering specific subcorpora. Within the text, the pages of each volume were marked (so that the transcriptions can be linked, case by case, to the facsimile reproduction of the original), as were any figures and the passages in languages other than Italian (so that they can be excluded from the lexicographer's queries). No markup was envisaged for foreign loanwords, which, like all other lexemes, must be properly analysed by the lexicographer in each of their contexts.
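A minimal sketch of the light XML/TEI header just described, built with the Python standard library. The field names follow the paper's list (author, title, place, publisher, year, text type, diamesic indication), but the concrete TEI mapping and the sample values are assumptions, not the project's actual schema.

```python
# Hedged sketch: emit a minimal TEI-like header for one corpus text.
import xml.etree.ElementTree as ET

def tei_header(author, title, place, publisher, year, text_type, diamesic):
    hdr = ET.Element("teiHeader")
    file_desc = ET.SubElement(hdr, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(title_stmt, "author").text = author
    pub = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub, "pubPlace").text = place
    ET.SubElement(pub, "publisher").text = publisher
    ET.SubElement(pub, "date").text = str(year)
    profile = ET.SubElement(hdr, "profileDesc")
    ET.SubElement(profile, "textClass").text = f"{text_type}; {diamesic}"
    return hdr

# Illustrative values only (the Artusi volume is in the corpus; the rest
# of the metadata here is stated as an example, not taken from the project).
hdr = tei_header("Pellegrino Artusi",
                 "La scienza in cucina e l'arte di mangiar bene",
                 "Firenze", "Landi", 1891, "ricettario", "scritto")
print(ET.tostring(hdr, encoding="unicode"))
```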
In a second phase of the markup of the first 42 texts, currently under way, the marking of the individual recipes and of their titles is also planned. The primary purpose of this markup is to obtain an open list of the recipes present in the corpus, which can then be compared with one another using dedicated algorithms operating on the forms present in the title. In this way it will be possible to trace a diachronic line for individual recipes and to follow the evolution of the language they contain. As for the computational treatment, it should be borne in mind that the database is an example of hybrid textuality, both in relation to the philological acquisition of the text and its queryability, and as regards the applicability of automatic lemmatization procedures. Since these are nineteenth- and twentieth-century texts, the chances of good results from the computational-linguistic tools developed nationally and internationally decrease progressively as one moves away from the present day back towards 1861, but even for nineteenth- and early twentieth-century texts the guarantees are sufficient. Given the particular nature of the database, its chronology and its lexicographic purpose, in balancing the management of resources it was therefore preferred not to invest in controlled lemmatization, which would have required corrective measures for nineteenth- and early twentieth-century language both in the machine dictionaries and in the machine morphologies currently in circulation (mostly anglophone-based, with all the limits this entails, and even in the best cases calibrated on recent written Italian; cf. Biffi 2016). The database (currently being tested in its beta version) can thus be consulted with a form-based search engine, enhanced by tools (wildcards, fuzzy search) that facilitate the identification of formal, morphological and grapho-phonetic variants, and by an automatic lemmatization based on existing machine morphologies (hence calibrated on contemporary written Italian, but still sufficiently functional for retrieving variant forms in nineteenth- and twentieth-century texts, especially for lexicographic purposes). The query platform provides specific functions for distance searches and collocations, and access to statistical data, both in tabular form and as heatmaps and tag clouds. With these characteristics, the database can moreover be fully homogeneous with those gravitating around the project Corpus di riferimento per un nuovo vocabolario dell'italiano moderno e contemporaneo. Fonti documentarie, retrodatazioni, innovazioni, funded under PRIN 2012 and coordinated by Claudio Marazzini, thus offering ample room for dialogue with the lexicographic tools connected to it.
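The wildcard and fuzzy search just mentioned can be approximated with the Python standard library, as in the hedged sketch below; `difflib` and `fnmatch` are stand-ins for the platform's own matching, and the form index is an invented example.

```python
# Hedged sketch: wildcard and fuzzy retrieval of variant forms.
import difflib
import fnmatch

forms = ["tagliatelle", "tagliatele", "taglierini",
         "bagnomaria", "bagno-maria", "bagno maria"]  # placeholder form index

# Wildcard search ("caratteri jolly"): * also matches the empty string,
# so all three graphic variants of bagnomaria are returned.
print(fnmatch.filter(forms, "bagno*maria"))

# Fuzzy search for graphic/morphological variants of a queried form.
print(difflib.get_close_matches("tagliatella", forms, n=3, cutoff=0.8))
```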
3. Entry structure and electronic dictionary
The structure devised for the entries naturally reflects the characteristics of historical dictionaries. Its architecture is as follows:

LEMMA + grammatical category
0.1. Forms attested in the text corpus (with all variants). The lemmatized form for the main entry is the one most widespread in current usage, following Tullio De Mauro's GRADIT, Grande dizionario italiano dell'uso, with its updates.
0.2. Essential etymological note.
0.3. First attestation in the corpus.
0.3.1. Numerical indication of frequency (for each form; in the indication of occurrences, the second figure, preceded by the sign +, refers to forms present in any indexes).
0.4. Geographical distribution of the variants. For now, only data from the AIS and the ALI are given; a footnote adds the comparison with the forms recorded by Touring Club Italiano 1931.
0.5. Linguistic/merceological notes (foreign loanwords; Italianisms in other languages). The bibliography for now concerns only the 'linguistic notes' and, as regards Italianisms in other languages, the DIFIT (available electronically at http://www.italianismi.org/difit-elettronico).
0.6. Summary of the senses.
0.7. Multiword units and phrases proper (with the first attestation in the corpus).
0.8. Cross-references (above all 'hyperlemmas', or, if one prefers, 'general' linking entries).
0.9. Lexicographic correspondences (= matches in dictionaries and in online lexicographic corpora), distinguishing etymological dictionaries (including the LEI) from descriptive ones (in chronological order, starting from the Tommaseo-Bellini).
1. First definition — Contexts
1.1. Subordinate definition — Contexts
1.2. Subordinate definition — Contexts
[...]
2. Second definition — Contexts
[...]

The entry recalls, with appropriate adaptations, that of the TLIO, Tesoro della Lingua Italiana delle origini, of the Istituto dell'Opera del Vocabolario Italiano of the CNR in Florence. The first experiments, on which for example the last sample entry, bagnomaria, is based (drawn from an initial version of the corpus limited to 28 texts), showed however that this structure risks being too heavy for the project to remain actually feasible. The emerging "dimensional" limits (clearly visible in Bertini Malgarini and Vignuzzi 2017) are tied above all to the richness of the examples and to the length of the quotations from other lexicographic tools. Both limits, however, can be addressed by increasing the interaction with the other connected and connectable tools: first of all by providing for a deep interaction between the textual database and the dictionary, both in the drafting of the entry and in its publication. In this way it will be possible to limit the number of examples quoted and then refer to a complete dossier of occurrences through the link with the digital corpus. With a view to creating open access to the text database, the question of the public use of copyrighted texts must be addressed. The issue had already been tackled within the PRIN 2008 group "Il portale della TV, la TV dei portali", and at the project's final conference Marina Pietrangelo — a researcher at ITTIG (Istituto di Teoria e Tecniche dell'Informazione Giuridica), specifically invited to speak on the legal use of audiovisual materials in research corpora — responded with a substantial green light, provided for by the law in the case of projects with exclusively research purposes and no economic exploitation (Pietrangelo 2017).
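To make the entry architecture above concrete, here is a hypothetical data structure mirroring fields 0.1-0.9 plus the numbered senses; the names, types and the sample values are illustrative assumptions, not the project's internal model.

```python
# Hedged sketch: a possible in-memory shape for a VoSCIP-style entry.
from dataclasses import dataclass, field

@dataclass
class Sense:
    definition: str
    contexts: list[str] = field(default_factory=list)
    subsenses: list["Sense"] = field(default_factory=list)

@dataclass
class Entry:
    lemma: str
    pos: str                                   # grammatical category
    attested_forms: dict[str, int]             # 0.1 / 0.3.1: variant -> frequency
    etymology: str                             # 0.2
    first_attestation: str                     # 0.3
    geo_distribution: str                      # 0.4 (AIS / ALI data)
    notes: str                                 # 0.5
    senses: list[Sense] = field(default_factory=list)            # 0.6 + 1., 2., ...
    phrasemes: list[str] = field(default_factory=list)           # 0.7
    cross_refs: list[str] = field(default_factory=list)          # 0.8
    lexicographic_refs: list[str] = field(default_factory=list)  # 0.9

# Illustrative instance only; figures and wording are placeholders.
e = Entry("bagnomaria", "s.m.", {"bagnomaria": 12, "bagno-maria": 3},
          "(placeholder etymological note)", "(placeholder first attestation)",
          "", "", [Sense("cooking technique: gentle heating in a water bath")])
print(e.lemma, list(e.attested_forms))
```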
References to the other dictionaries are likewise to be implemented through links to the electronic versions currently available online (for example the Tommaseo-Bellini in Tommaseo online; the editions of the Vocabolario degli Accademici della Crusca in Lessicografia della Crusca in rete; and finally the post-unification dictionary being built within the PRIN 2015 project 'Vocabolario dinamico dell'italiano post-unitario', coordinated by Claudio Marazzini). Procedures are still under study for handling the collocations emerging from the statistical tools used to analyse the corpus, and for lemmatizing compound words in the face of the morphological polymorphy produced by the corpus's diachronic depth. Within a true lexicographic workstation, all these tools will be integrated into a back-office system which, through stages of progressive evaluation and control, will lead to the direct online publication of each entry. Finally, precisely the potential interaction/integration with the future 'Vocabolario dinamico dell'italiano post-unitario' mentioned above has led the research group to design a variable lexicographic record: alongside the in-depth entry of the historical dictionary there will be an entry structured according to the specifications of a synchronic dictionary, for those headwords that still belong to present-day Italian usage, together with field-calibration tools that expert and non-expert users will be able to adjust so as to obtain a personalized entry each time. During the discussion an 'exemplary' entry of the VoSCIP will be presented and discussed, also with regard to the selection and organization of the lexicographic material and to its publication (online and in print).

References
Bertini Malgarini, P. and Vignuzzi, U. (2017). Bagnomaria nel Vocabolario storico della cucina italiana postunitaria (VoSCIP): http://permariag.wixsite.com/permariagrossmann/vignuzzi.
Biffi, M. (2016). Progettare il corpus per il vocabolario postunitario. In Marazzini, C. and Maconi, L. (eds), L'italiano elettronico. Vocabolari, corpora, archivi testuali e sonori. Accademia della Crusca, pp. 259-280.
Pietrangelo, M. (2016). Per un uso legale degli audiovisivi in corpora di ricerca. In Alfieri, G., Biffi, M. et al. (eds), Il portale della TV. La tv dei portali. Bonanno, pp. 171-185.
Ruffino, G. (1995). I pani di Pasqua in Sicilia. Un saggio di geografia linguistica e etnografica. Centro di Studi Filologici e Linguistici Siciliani.
Touring Club Italiano (1931). Guida gastronomica d'Italia. Touring Club Italiano [anastatic reprint 2003].

Tools
AIS = Jaberg, K. and Jud, J. (1928-1940). Sprach- und Sachatlas Italiens und der Südschweiz. Ringier, 8 vols. (It. transl. 1987: AIS. Atlante linguistico ed etnografico dell'Italia e della Svizzera meridionale, Unicopli). Also online as NavigAIS.
ALEPO = Telmon, T. and Canobbio, S. (1984-). Atlante linguistico ed etnografico del Piemonte occidentale.
ALI = Bartoli, M. G. et al. (1995-). Atlante Linguistico Italiano. Istituto Poligrafico e Zecca dello Stato.
ALLI = Moretti, G. et al. (1982-). Atlante Linguistico dei Laghi Italiani.
ALS = Ruffino, G. (1995-). Atlante Linguistico della Sicilia.
ALT = Giacomelli, G. (2000). Atlante Lessicale Toscano. LEXIS (on CD-ROM); now online as ALT-WEB.
ASLEF = Pellegrini, G. B. et al. (1972-). Atlante Storico-Linguistico-Etnografico Friulano.
Istituto di glottologia e fonetica dell'Università; Istituto di filologia romanza della Facoltà di lingue e letterature straniere dell'Università.
DIFIT = Stammerjohann, H. (2008). Dizionario di italianismi in francese, inglese e tedesco. Accademia della Crusca. Also online.
GRADIT = De Mauro, T. (2007). Grande Dizionario Italiano dell'Uso. UTET.
LEI = Pfister, M. and Schweickard, W. (1979-). Lessico Etimologico Italiano, published on behalf of the Commissione per la Filologia romanza. Reichert.
Lessicografia della Crusca in rete = Accademia della Crusca (2004). Lessicografia della Crusca in rete.
TLIO = Opera del Vocabolario Italiano (1997-). Tesoro della lingua italiana delle origini.
Tommaseo-Bellini = Tommaseo, N. and Bellini, B. (1861-1879). Dizionario della lingua italiana. Società L'Unione Tipografico-Editrice.
Tommaseo online = Accademia della Crusca (2015). Tommaseo online.

Strumenti informatico-linguistici per la realizzazione di un dizionario dell'italiano postunitario [Computational-linguistic tools for the creation of a dictionary of post-unification Italian]
Marco Biffi
Università degli Studi di Firenze – marco.biffi@unifi.it

Abstract
The paper focuses on some general problems concerning representative corpora for the compilation of dictionaries. It starts from the concrete case of the Vocabolario dell'italiano post-unitario which, owing to its hybrid nature, offers a complete view of the critical issues of both synchronic and historical lexicography. The concept of the Banca linguistica ('language bank') is therefore introduced: a platform bringing together different types of corpora, a meta-search engine over the existing databases, and tools for accessing the existing electronic dictionaries. A final section is devoted to the concept of the 'quantum relativity' of computational-linguistic data.

Summary
This contribution focuses on some general problems concerning the construction of representative corpora for dictionary making, starting from the concrete case of the Vocabolario dell'italiano post-unitario which, because of its hybrid nature, offers a complete picture of the critical issues of both synchronic and historical lexicography. The concept of the Banca linguistica is therefore introduced, a platform bringing together different types of corpora, a meta-search engine for querying the existing databases, and systems of integration with the existing electronic dictionaries. Finally, the concept of the 'quantum relativity' of the data that can be extracted from computational-linguistic research is discussed.

Keywords: corpus linguistics, Italian, synchronic dictionary, historical dictionary, electronic text, balancing, meta-search engine, language bank, quantum relativity, linguistic informatics, computational linguistics

1. Introduction
In this contribution I shall try to bring into focus some general problems concerning the construction of tools for dictionary making, starting from a specific case: the project for a 'hybrid' dictionary, at once historical and synchronic, on which a national research group coordinated by Claudio Marazzini is working. The project, whose final goal is a dictionary of post-unification Italian gathering the national linguistic heritage of the official language of the State from 1861 to the present, began with a first phase funded under PRIN 2012, Corpus di riferimento per un Nuovo Vocabolario dell'Italiano moderno e contemporaneo.
Fonti documentarie, retrodatazioni, innovazioni; it was then able to continue with a second round of funding under PRIN 2015, Vocabolario dinamico dell'italiano postunitario. Numerous Italian universities took part in the two projects: Piemonte Orientale, Milan, Genoa, Florence, Viterbo, Naples, Catania (the corpus project also involved the Istituto di Teoria e Tecniche dell'Informazione Giuridica, ITTIG, of the CNR in Florence; the dynamic-dictionary project also involves the University of Turin); the Accademia della Crusca collaborated as an external partner, and for the Accademia the post-unification dictionary is one of its three current strategic projects, alongside the Vocabolario dantesco and the Osservatorio degli italianismi nel Mondo (OIM). While the dynamics of using corpora for historical dictionaries are well known, especially after the experience of the TLIO, Tesoro della lingua italiana delle origini, of the Istituto dell'Opera del Vocabolario Italiano of the CNR in Florence, much less thought has been given to the practical implications of building a synchronic dictionary on a representative corpus; and entirely new is the case of a hybrid tool such as the post-unification dictionary, in which the critical issues of historical and synchronic computational lexicography mix, showing that one must rather move in the direction of articulated tools.

2. Critical issues in the physiognomy of a representative corpus of post-unification Italian
A first problem to be faced for a corpus designed to support a dictionary is its size. A quick check of the reference corpora for other European languages (in particular English and German, which have paid greater attention to this issue) shows that both the British National Corpus (10% of which consists of transcriptions of spoken English; cf. Cresti and Panunzi 2013: 36-37) and the DWDS-Kerncorpus (twentieth-century texts of five types: literature, 25%; newspapers, 25%; scientific prose, 20%; guides, recipe books and similar texts, 20%; transcribed spoken language, 10%; cf. Klein 2013: 18-19) have a size of about 100 million words. This was the size that, in the first decade of this century, defined standard-sized corpora (cf. Chiari 2007: 45, according to the table given there); indeed, 100 million words was the threshold dividing standard corpora from large ones. Taking into account the computational and methodological progress of recent years, some corrective is certainly in order; and such correctives do exist for both English and German, because the balanced corpora are flanked by thesauri. The BNC has recently been flanked by the Bank of English (a monitor corpus, in Sinclair's terminology, of complete texts totalling 650 million words; cf. Cresti and Panunzi 2013: 36-37); the Kerncorpus has been joined by several modern newspaper corpora (post-1995) and other smaller text collections, for a total of 2.6 billion words (and, on the diachronic side as well, the picture is being completed with the Deutsche Textarchiv, under construction since 2005 and now nearing completion, which gathers 1,500 carefully chosen books, usually first editions, and newspaper volumes from the period between 1650 and 1900; cf. Klein 2013: 18-19).
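For concreteness, genre quotas of the kind used for the DWDS-Kerncorpus translate directly into target word counts. A trivial worked sketch for a hypothetical 100-million-word balanced corpus, with the quota labels following Klein 2013 as quoted above:

```python
# Genre shares as reported for the DWDS-Kerncorpus (cf. Klein 2013).
quotas = {
    "literature": 0.25,
    "newspapers": 0.25,
    "scientific prose": 0.20,
    "guides, recipe books and similar": 0.20,
    "transcribed speech": 0.10,
}
corpus_size = 100_000_000  # words

for genre, share in quotas.items():
    print(f"{genre}: {int(share * corpus_size):,} words")
# literature and newspapers get 25,000,000 words each, and so on.
```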
As regards the collection of texts, the importance of what has been called the 'dimensional parabola of corpora' (Biffi 2016: 262) has already been stressed.

[Figure 1: the dimensional parabola of corpora.]

The analytic-geometric representation of this parabola shows the relationship between the language across the centuries (here, Italian) and the possibility of representing it with a corpus on the order of 100,000 words (kilowords), of millions of words (megawords), or of billions of words (gigawords). The possibility of building large corpora decreases the further back in time one goes, while it increases dramatically for the language of the present day, with sizes that are by now very large indeed: certainly not corresponding to everything produced in a given language, which is obviously impossible, but tending to come very close to it. The small size of corpora of the Italian of the past, as the curve underlines, is not only due to the objective fact that fewer texts survive from the past, but, decisively, to the fact that it is much harder to assemble texts of the past in electronic form so that they can be queried effectively. The difficulties stem from the limits of all the computational tools involved in building electronic corpora, which paradoxically converge in shaping this curve: the effectiveness of OCR (automatic optical character recognition), the effectiveness of machine morphologies for lemmatization, and the effectiveness of the search engines that are readily available at low cost; they therefore affect the processes of text acquisition, text processing, and querying/queryability alike (Biffi 2016: 263-267). For the past, the effects of the parabola make the problem of genuine representativeness of the reference corpus manageable. Indeed the TLIO, which covers the period from the origins to 1375, can start from a corpus gathering a substantial collection of vernacular texts of the period, ranging across the diatopic and diaphasic axes (and hence also guaranteeing great representativeness in diastratic terms). It rests on very solid foundations even though its size, on the corpus measurement scale, is not particularly large. Its small size has in fact made it possible to neutralize the 'parabolic' effects on the acquisition and processing of the electronic text (the texts, drawn from the main critical editions, could be carefully collated), as well as on the effectiveness of machine morphologies (the corpus was in practice lemmatized manually, with the help of semi-automatic procedures). The possibility of designing and building a custom lemma-based and form-based search engine then definitively removed the problems of querying and queryability. But it is clear that even moving up only slightly in the chronology, precisely because of the 'parabolic' effect, the problems increase dramatically.
As regards machine morphologies, for example, it would be advisable to recalibrate them according to the diachronic variation of morphological and morphosyntactic structures along the time axis (and experiments are being made: for instance on the morphology of Leonardo's language, in a project funded by the Biblioteca Leonardiana in Vinci, whose linguistic side I am curating); but the process is long and has never been tackled systematically, either methodologically or pragmatically. The reason, and this holds for every aspect of computational linguistics and, more generally, of what I prefer to call linguistic informatics, is the general tendency to work in small monads rather than to build a system, pooling skills and tools so as to broaden and refine the available technologies and make them ever more powerful. So today we have various tools, partly overlapping, partly complementary, but nothing truly shareable and improvable in an open-source fashion, which would allow efforts to be concentrated on what is really missing or weak. The 'piece' represented by machine morphologies is particularly significant: building a diachronic corpus for a historical dictionary means providing the first bricks for recalibrating the existing machine morphologies to the period under consideration; yet in no case has anyone thought of using these corpora of the past as a starting point for improving lemmatization procedures, which in turn would enhance lexicographic possibilities in a virtuous circle destined to refine the tools available to the scientific community. To return to the specifics of the dictionary of post-unification Italian: its hybrid character places it in a particularly delicate position because, insofar as it is diachronic, from 1861 to 2000, it suffers from the computational limits discussed above (even though, for example, optical character recognition gives excellent results in this chronological segment). But it becomes decidedly synchronic in the period 2000-2014, when we can create an enormous massive corpus (on the gigaword scale) with relative ease, simply by drawing on the web through data-crawling programs (web crawlers, or spiders), as is well demonstrated by RIDIRE (www.ridire.it, directed by Emanuela Cresti), a corpus of 1.3 billion words built with a controlled crawler that allowed a 'balancing' based on semantic domains (architecture and design, figurative arts, cinema, cooking, literature and theatre, fashion, music, religion, sport) and functional domains (administration and legislation, economy and business, news); a toy sketch of such balancing appears below.

3. From the representative corpus to the 'Banca linguistica'
From a theoretical point of view, the best choice for the reference corpus of the dictionary of post-unification Italian would be a balanced corpus on the megaword scale covering 1861-2014, flanked by a massive corpus on the gigaword scale for 2000-2014: a result which, as we have seen, is now achievable. The research group, however, started from a pre-existing situation of completed projects and studies already under way with solid results, so it was decided to make the most of the group members' experience, recovering all the materials each could bring to the project as a dowry, to be then expanded and consolidated with specific expertise.
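Returning for a moment to the domain-balanced crawling described above, such balancing can be sketched as sampling from domain-tagged documents until per-domain word targets are met. A minimal sketch under invented data; the domains and targets are illustrative, not RIDIRE's real parameters:

```python
import random
from collections import defaultdict

def balance_by_domain(docs, targets, seed=0):
    """docs: (doc_id, domain, n_words) tuples from a crawl;
    targets: desired word count per semantic domain."""
    random.seed(seed)
    pools = defaultdict(list)
    for doc in docs:
        pools[doc[1]].append(doc)
    sample = []
    for domain, target in targets.items():
        pool = pools[domain]
        random.shuffle(pool)
        filled = 0
        for doc in pool:
            if filled >= target:
                break
            sample.append(doc)
            filled += doc[2]
    return sample

docs = [("d1", "cucina", 1200), ("d2", "cucina", 900), ("d3", "sport", 2000)]
print(balance_by_domain(docs, {"cucina": 1500, "sport": 1800}))
```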
Coverage is therefore 'leopard-spotted', and it is consequently necessary to make maximal use, even for the chronological zone from 1861 to 2000, of a massive approach, which leads inevitably to the 'language bank', the thesaurus, from which a balanced corpus can then be extracted (or more than one, dynamically, also according to the needs of the editor drafting a given entry).

[Figure 2: the 'Banca linguistica' platform.]

The 'Banca linguistica' can be a platform making available various sub-corpora, in which all materials are gathered with semantic markup allowing subsequent balancing, with a 'central corpus' that will be the primary working basis for the lexicographer of the post-unification dictionary but will have to be continually calibrated against the data emerging from the consultation of the massive contemporary corpus and of the diachronic sub-corpora it contains. The platform will also have to communicate with the electronic dictionaries available for the period from 1861 to today: for the diachronic side, Tommaseo Online and the electronic version of the fifth edition of the Vocabolario degli Accademici della Crusca included in Lessicografia della Crusca in rete (in the hope that the agreement signed in September 2017 between UTET and the Accademia della Crusca for the digitization of the Grande Dizionario della Lingua Italiana bears fruit quickly); the synchronic dictionaries available online (the Sabatini Coletti, the De Mauro, the Treccani, and whatever else becomes available); and all the corpora of Italian present on the web, including the precious electronic archives of the main national newspapers (Biffi 2016: 272-273). It should not be forgotten that the panorama of Italian corpora is fairly broad (for a general picture see Cresti and Panunzi 2013, though the census needs to be refined). What has been lacking, however, naturally enough, is an organic policy of system building: we therefore have an extreme heterogeneity of tools, platforms and encodings (fortunately, at least for encoding, the centrifugal force has in recent years been progressively contained by the increasingly frequent, if not universal, recourse to XML/TEI), which forces the researcher to connect n times, on n platforms, with n different philosophies and n different engines, in order to carry out a comprehensive search. A meta-engine therefore becomes essential. A beta version of a meta-engine for Italian corpora has been built by the University of Florence research unit of the PRIN 2012 group, directed by me (www.metaricerche.it). As the section of the portal entitled 'Il metamotore' explains: 'The tools identified were classified according to the possible levels of integration: freely searchable corpora; corpora freely searchable after registration; corpora to be downloaded. A feasibility study was then drawn up to define a series of procedures for analysing the source tools, determining their level of integrability (which also depends on being able to interact with the technical staff of the individual database, following a "strategic" agreement on content sharing) and identifying the procedures to follow for each level. The architecture of the system, the reference technology and the query interface were then defined, at least for a first prototype version of the platform.'
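Architecturally, a meta-engine of this kind fans one query out to heterogeneous backends and merges the hits with their provenance. A minimal sketch with hypothetical adapters; the actual interfaces of www.metaricerche.it are not documented here, so everything below is an assumption:

```python
from concurrent.futures import ThreadPoolExecutor

class CorpusAdapter:
    """One adapter per database; each hides that platform's query syntax."""
    name = "base"
    def search(self, query: str) -> list[dict]:
        raise NotImplementedError

class LITAdapter(CorpusAdapter):
    name = "LIT (Lessico Italiano Televisivo)"
    def search(self, query):
        # Hypothetical: a real adapter would call the database's endpoint.
        return [{"source": self.name, "hit": f"...{query}..."}]

class NewspaperAdapter(CorpusAdapter):
    name = "newspaper archive"
    def search(self, query):
        return [{"source": self.name, "hit": f"...{query}..."}]

def meta_search(query: str, adapters: list[CorpusAdapter]) -> list[dict]:
    """Fan the query out in parallel and merge hit lists with provenance."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda a: a.search(query), adapters)
    return [hit for hits in results for hit in hits]

print(meta_search("lingua", [LITAdapter(), NewspaperAdapter()]))
```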
The beta version integrates 8 databases, chosen as samples of the main types of integration level:
- Maximum level (an agreement was reached with the technical staff managing the database): LIR (Lessico dell'Italiano Radiofonico), LIS (Lessico dell'Italiano Scritto) and LIT (Lessico Italiano Televisivo), Accademia della Crusca.
- Basic level (the database was integrated in a window, pending greater interoperability): MIDIA (Morfologia dell'Italiano in DIAcronia), Università Roma Tre; CorDIC (Corpora Didattici Italiani di Confronto), Laboratorio Linguistico Italiano, Università degli Studi di Firenze.
- Minimum level (the database was integrated in a window with no possibility of greater interoperability): the archives of the newspapers «Corriere della Sera» and «La Repubblica».
If this tool can be strengthened to the point of bringing together in its result list all the textual databases currently available for Italian, the 'Banca linguistica' of the editor of the post-unification dictionary will offer centralized access to all existing corpora, to be integrated, modulated and balanced with the corpus assembled by the PRIN research group, with the massive corpus of contemporary Italian, and with the electronic lexicographic tools. Some critical issues remain which, if removed, would allow a further strengthening of the 'Banca linguistica'; they can only be recalled briefly here, point by point.
a) Most texts (for example recent literary ones) cannot be organized into searchable corpora because of copyright difficulties.
b) Diachronic corpus collections, with rare exceptions (such as the CEOD, Corpus Epistolare Ottocentesco Digitale), favour the high-register literary tradition. Representative samples of post-unification Italian already exist, such as the DIACORIS (25 million occurrences), but the gaps concerning special languages still need to be filled (as was attempted within the PRIN 2012 project).
c) It remains to be investigated how much material for the diachronic sections can be recovered (more or less automatically) from the web, thanks above all to the massive presence of nineteenth-century texts gathered in digital libraries such as Google Books and Archive.

4. Linguistic informatics and quantum relativity
If the starting point for compiling a dictionary is no longer a homogeneous, purpose-built reference corpus but a 'language bank' in which one must manage non-homogeneous, exogenous materials, it is worth recalling in this final section the importance of the 'quantum' implications of linguistic informatics (Biffi 2018: 545-549). When consulting databases (including in this category not only corpora but also the electronic editions of dictionaries), it is not hard to run into diffractions in the quantitative results (and hence in the qualitative ones, insofar as gaps may arise in the search for particular contexts). These are certainly explained in part by human errors introduced in the various phases of database construction (by the modern digital copyists, by the programmers, by the design), but also by the concurrence of precise, identifiable factors.
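Tokenization alone is enough to produce such diffractions: two defensible tokenizers disagree on the count for the same string. A minimal demonstration:

```python
import re

text = "dell'acqua, po' di sale; l'olio d'oliva."

tokens_a = text.split()                                 # whitespace only
tokens_b = re.findall(r"[a-zàèéìíòóù]+", text.lower())  # apostrophes and punctuation as boundaries

print(len(tokens_a), tokens_a)  # 6 tokens
print(len(tokens_b), tokens_b)  # 9 tokens: same text, two different 'measurements'
```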
In the contribution cited (Biffi 2018), the diffractions concern the number of lemmas returned by the three electronic versions of the 1612 Vocabolario degli Accademici della Crusca, and they can be traced back to differences in tokenization, different approaches to restoring the entries' intrinsic database structure, and different priorities in the restitution of the electronic text. In other databases the diffraction factors will probably be different, but one will always have to take into account the characteristics and architecture of the database, as well as of the search tools applied to it. Just as the exact sciences, from Heisenberg onwards, must allow for the indeterminacy introduced by the measuring instrument, so when consulting databases it is well to remember that the characteristics of the instrument of knowledge (here, the database) perturb the result of the search, forcing us into an inevitable 'quantum' approximation; a perturbation that can nonetheless be mastered, since the causes of diffraction can be reconstructed and the final result corrected, as happens with quantum mechanics wherever it must replace classical mechanics. To obtain scientific results from a database, then, one must know the characteristics of the instrument thoroughly and allow for its 'quantum' variability when reading the data. And when reading and handling the results it is necessary to be aware not only of which instrument was used, but also of the specific search procedures applied; in other words, one must constantly bear in mind the philological context of the computational search, exactly as one takes account of the variants in the apparatus when consulting a critical edition of a text.

References
Biffi, M. (2016). Progettare il corpus per il vocabolario postunitario. In Marazzini, C. and Maconi, L. (eds), L'italiano elettronico. Vocabolari, corpora, archivi testuali e sonori. Accademia della Crusca, pp. 259-280.
Biffi, M. (2018). Tra fiorentino aureo e fiorentino cinquecentesco. Per uno studio della lingua dei lessicografi. In Belloni, G. and Trovato, P. (eds), La Crusca e i testi. Lessicografia, tecniche editoriali e collezionismo librario intorno al Vocabolario del 1612. libreriauniversitaria.it, pp. 543-560.
Chiari, I. (2007). Introduzione alla linguistica computazionale. Laterza.
Cresti, E. and Panunzi, A. (2013). Introduzione ai corpora dell'italiano. Il Mulino.

Comparaison de corpus de langue « naturelle » et de langue « de traduction » : les bases de données textuelles LBC, un outil essentiel pour la création de fiches lexicographiques bilingues [Comparing 'natural language' and 'translation language' corpora: the LBC textual databases, an essential tool for creating bilingual lexicographic entries]
Annick Farina, Riccardo Billero
Università degli Studi di Firenze – annickfarina@unifi.it; riccardo.billlero@gmail.com

Abstract
The aim of this paper is to describe the work done to exploit the LBC database for the purpose of translation analysis, as a resource for editing the bilingual lexical sections of our dictionaries of Cultural Heritage (in nine languages). This database, made up of nine corresponding corpora, contains texts whose subject is cultural heritage, ranging from technical texts on art history to books on art appreciation, such as tour guides and travel books highlighting Italian art and culture. We will illustrate the different queries with the SketchEngine LBC French corpus, currently made up of 3,000,000 words.
Our particular interest here is in queries that not only guide translators' lexical choices but also precede the selection of bilingual quotations (from our Italian/French parallel corpus), and on which we rely for editing an optional element of the entry called 'translation notes'. For this we draw both on the work on 'universals of translation' described by Baker (1993) and on studies aimed at improving Translation Quality Assessment (TQA). We will show how a targeted consultation of the different corpora and sub-corpora that the database allows us to distinguish ('natural language' vs 'translation', 'technical texts' vs 'popularization texts' or 'literary texts') can help us identify approximations or translation errors, so as to build quality comparative lexicographic information.

Keywords: electronic lexicography, multilingual lexical resources, corpus linguistics

Résumé
This article describes our work on the LBC database as regards the analysis of translations as resources for drafting the bilingual part of our Heritage dictionaries (in the project's nine languages). The database contains distinct corpora for nine languages, made up of texts all related to the Italian heritage: technical texts from the various artistic fields, works of art criticism or art history, tourist guides, travel narratives, etc. We will illustrate various queries of the French corpus (currently about 3 million words) in SketchEngine. In particular, we focus on searches that guide us not only towards the selection of translation equivalents for certain terms but also precede the selection of bilingual citations (drawn from our future Italian/French parallel corpus), and on which we rely for drafting an optional element of the entry called 'translation notes'. In doing so we draw both on work on 'translation universals' (Baker 1993) and on studies aiming to improve translation quality (TQA: Translation Quality Assessment). We will show how a targeted consultation of the various corpora and sub-corpora that the database allows us to distinguish ('natural language' vs 'translated' texts, 'technical' vs 'popularizing' vs 'literary' texts) can help us detect approximations or translation errors, helping us build quality comparative lexicographic information.

Keywords: lexicography, multilingual lexical resources, linguistic corpora

1. Introduction
One of the main goals of the Lessico dei Beni Culturali project is to build monolingual dictionaries for nine different languages, geared to a precise use linked to a particular object: the description (and the translation of descriptions) of the Tuscan heritage, mainly in popularizing texts (tourist guides, museum websites, etc.). To this end we have built textual databases, which we keep extending as a work in progress and which will serve various tasks, from the creation of headword lists to the drafting of monolingual lexicographic/terminological entries and of translation records linking the headword lists of the different languages to one another (for a description of these databases cf. Billero et al. 2017).
It is the use of these textual databases for drafting bilingual translation entries that we illustrate here [Footnote 1: For the use of our databases in building the monolingual dictionaries, see the article by Nicolás and Lanini in this volume. We are currently building the headword lists for the various languages following the model they defined for Italian. The bilingual link between these lists will only become possible once our parallel databases have been built.], basing ourselves on the analysis of various SketchEngine queries (mainly statistical and contextual) of our LBC French corpus, currently about three million words. In particular we compare data from several comparable sub-corpora of 'natural language' texts and 'translated' texts. We also propose a first comparison of results from a subset of the Italian corpus with a subset containing the French translations of the same texts, which for the moment constitute fragmentary material, because we are still inserting texts with the aim of creating parallel databases of translations from Italian into all the project's languages. We will show how a targeted consultation of the different corpora and sub-corpora that the database allows us to distinguish (Italian 'natural language' vs French 'natural language', French 'translated' vs French 'natural language', French 'specialized texts' vs French 'popularization' vs French 'literary') can help us detect approximations or translation errors, helping us to build quality comparative lexicographic information.

2. Comparing 'natural language' and 'translated' corpora: a perspective halfway between descriptive and prescriptive translation studies
Drawing on analyses that do not regard the language of translation as a 'third code' (Frawley 1984), we consider that translated texts fit perfectly well within a single textual database of a given language, alongside 'natural language' texts. However, following proposals for using translation corpora for didactic purposes, both in language teaching and in translator training, it seems necessary to us to offer the possibility of consulting the database through distinct sub-corpora grouping texts of the two types, and to define evaluation criteria for the translated texts to be added to the database, by building separate corpora of translated texts in all the project's languages. These corpora are useful to us as translation-memory tools for working on the bilingual part of our lexicographic entries, in a perspective that is more prescriptive than descriptive.
As will be shown by our comparison of results from our LBC French database ('natural language' and 'translated') with a corpus of nearly 100,000 words, not yet integrated into the database and composed of 'popularizing' works translated into French (mainly tourist guides to Tuscany and museum websites), some of the texts of interest to us display characteristics assimilable to 'translationese' and could only distort database queries intended to attest forms or structures typical of French as it is written and spoken by the majority of its speakers without interference from another language.

2.1 Descriptive and prescriptive information in the LBC dictionaries: universals and deviations
Following Baker (1993), we start from the assumption that there exist translation universals, which will serve as a canvas for illustrating the various types of queries carried out on our sub-corpora and for comparing the results obtained. It is on these universals that we rely to provide the descriptive part of the detailed comparative lexicographic information in the bilingual part of our dictionaries. This information corresponds first of all to the observation of the parallel corpora, which provide attestations of translations of the lemmas (words or collocations) described by the dictionary, appearing in bilingual citations within the bilingual part of the entry. We analyse in particular:
- simplification (chiefly, in our corpus, the choice of hyperonyms to translate certain more specific terms), which gives rise in our dictionaries to an added piece of semantic information accompanying the proposed equivalent: the distinctive features of the lemma that are not rendered by the equivalent are indicated, with or without parentheses, after it (e.g. tavola translated as peinture (sur bois) and tavoletta as (petite) peinture (sur bois));
- levelling (failure to respect register, for example choosing technical terms rather than general-language words, and vice versa). All entries carry a usage label. Where a translation involves a change of register, the change is noted in the 'translation note' section, or appears in the section reserved for distinctive semantic indicators when several translations of the same lemma are possible with or without loss of register. This is the case, for example, of Italian tondo (unmarked) as against médaillon (unmarked) and the Italianism tondo (a technical term used mainly by art historians).
Baker also analyses explicitation, which is particularly frequent in the texts of interest to us because it is almost systematically employed when an Italianism is used, in particular for realia, which receive special treatment in our dictionaries (cf. Farina 2014, 2016).
It would be possible to search systematically for this type of data in our corpus by extracting all the occurrences of 'type de' or 'sorte de', or the elements given in parentheses (a minimal sketch of such an extraction follows below), but we have deliberately set this category aside: it is too strongly tied to the object described by our texts and to stylistic choices shared by the authors of 'natural language' texts and the translators within our database, and would therefore not allow us to illustrate, through a comparison of the two types of resources, linguistic constraints tied to translation operations [Footnote 2: The abundant use of Italianisms is a dominant feature of the tourist guides analysed, attributable to their authors' desire to give these texts a 'touch of Italianness' (Farina 2014: 61).]. We have also set aside 'normalization' or 'conservatism', which suits our material poorly, as it leaves little room for variation or exploration on the lexical and stylistic levels. Unlike Baker (1993: 243), who defines translation universals as 'features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems', we have adopted a rather prescriptive, or better didactic, perspective, taking into account the interference phenomena (influence of the source language on the target language) that are frequent in translation operations involving two closely related languages such as Italian and French, and in texts whose quality is far from homogeneous. Interference is in fact, in our view, at the root not only of many cases of simplification and of the levelling deviations found in our comparisons, but also of other manifestations assimilable to losses resulting from the translation operation, indeed to translation errors or inaccuracies. The TQA (Translation Quality Assessment) model and, in particular, the various types of quality metrics that can guide the translator towards improved fluency and accuracy can serve as our reference here (cf. 'Multidimensional Quality Metrics', Uszkoreit et al. 2013). These analyses lead us towards a position that may seem to run counter to a descriptive exploitation of corpora such as Baker's. In fact, it presents itself as an accompaniment allowing the user of our dictionaries to make choices, on the basis of a descriptive exploitation of the resources consulted, as already described, and of statistical data resulting from frequency analyses such as those presented below. The editor of the lexicographic entries may moreover decide, where appropriate, and where our analyses of these data lead us to identify errors or deviations that could be reduced, not to propose as an equivalent a form that appears in the database (while still recording it in the part of the entry providing statistics on the equivalents found), or to draft the optional 'translation note' section of our bilingual entries, advising users in their choices by explaining why certain forms may be preferable to others.
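The systematic extraction just mentioned is easy to prototype as a concordance search for explicitation markers. A minimal sketch over an invented snippet; the pattern and the sentence are illustrative only:

```python
import re

corpus = ("Le palazzo, sorte de palais urbain, domine la place. "
          "On y admire un tondo (type de médaillon sculpté).")

# Explicitation markers: 'type de', 'sorte de', and parenthetical glosses.
pattern = re.compile(r"\b(?:type|sorte)\s+de\s+[^.,;)]*|\([^)]*\)")

for match in pattern.finditer(corpus):
    print(match.group())
# -> 'sorte de palais urbain' and '(type de médaillon sculpté)'
```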
3. Natural language vs translated language: observing the corpus
Differences in the frequency of words or collocations found in comparable corpora containing French 'natural language' texts and texts translated into French can enable us to detect forms chosen under the influence of the source language.

3.1 Zero frequency in natural-language texts
We compared the word list of the LBC sub-corpus of popularizing texts written in French, containing 270,000 words, with a corpus (not yet integrated into the database) of texts of the same type but in translation, of 93,000 words, drawing up a list of the words present exclusively in the 'translated' sub-corpus.
- Errors. Most of the forms encountered are assimilable to errors: missing accents (cloitre), influence of Italian spelling on French (baroche), excessive orthographic 'Frenchification' (Caliari), use of a French translation where usage requires the Italian form (Sainte-Réparate designates in French the saint or the cathedral of Nice, but not the church of Santa Reparata in Florence; the French form is attested nowhere in the LBC database) or the reverse (Giove is never used in Italian in our corpus, where it is translated as Jupiter), and use of words that have nothing to do with the description of the Florentine heritage, probably because they correspond to a sense of the source word that applies to other contexts (coursive in a description of the Duomo of Florence, or panonceau to refer to the panels of the Gates of Paradise). Errors of this kind do not give rise to targeted information within the dictionaries, except when the error is very frequent (e.g. panonceau, present in several sources with a total of 8 occurrences, but not coursive, which has only one attestation).
- Levelling. One can distinguish forms corresponding to a 'pragmatic' or stylistic difference between French and Italian, which do not interest us from a lexicographic point of view, such as the use of mentionnons in several translated texts, found in none of the texts of the full database, or certain passé simple forms (décora, succéda) that are not used in popularizing texts in 'natural' French. These are forms reflecting different norms for the text types in the corpus: a more thorough analysis would probably reveal little or no attested use of 'nous' in tourist guides, infrequent use of the passé simple compared with the passé composé or the present, and so on. What interests us much more in this comparison is detecting forms which, while perfectly 'correct' in French, can be considered out of context with respect to attested usage in the same type of context in natural language. The difference in the use of an unattested word can produce the effect of an 'anachronism' (a difference in synchronic frequency of use). This is the case, for example, of the adjective grand-ducal and the past participle paraphé, whose Italian equivalents are more frequent in today's language than their literal French translations are. The register deviation can also apply in cases of a difference in 'technicality'.
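Computationally, the zero-frequency check of §3.1 is a vocabulary set difference. A minimal sketch, where two tiny strings stand in for the real 270,000- and 93,000-word sub-corpora:

```python
import re

def vocabulary(text: str) -> set[str]:
    return set(re.findall(r"[a-zàâçéèêëîïôûùüÿœ-]+", text.lower()))

natural = "le cloître du couvent est mentionné dans le guide"
translated = "le cloitre du couvent ; nous mentionnons le panonceau"

only_in_translation = vocabulary(translated) - vocabulary(natural)
print(sorted(only_in_translation))
# -> ['cloitre', 'mentionnons', 'nous', 'panonceau']
```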
The adjective autographe, present in several popularizing translated sources, is absent from the texts of the same type in our natural-language corpus, though a few occurrences appear in more specialized texts of the general corpus. The register difference will give rise to differentiated labelling of the source-language lemma and its attested translation.

3.2 Frequency differences between source texts and target texts
To illustrate simplification phenomena, we queried two sub-corpora of our LBC database consisting of 51 lives from G. Vasari's Le vite de' più eccellenti pittori, scultori e architettori (1568) and of their French translations (the Leclanché-Weiss translation, 1900). Since we cannot yet rely on statistics from the parallel translation databases (for a description of these databases cf. Zotti 2017), we concentrated on high-frequency French words, comparing their frequency with that of the closest Italian word (same sense, same distinctive features). This enabled us to detect frequency gaps that will prompt a more thorough study aimed at defining analogical networks in the two languages, giving us the possibility of proposing translation links that avoid a loss of precision. Tableau, for example, has a frequency of 2,232 per million words in our French sub-corpus, whereas quadro has a frequency of 793 per million words in the Italian sub-corpus containing the same texts in the original language. A large number of hyponyms of quadro are in fact translated by tableau in French. Even if this loss is probably compensated by the addition of distinctive features accompanying the word, we maintain that the translator could only gain in precision if we proposed other forms to render the sense of these various hyponyms.
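The per-million figures quoted for tableau and quadro come from the usual normalization of raw counts by sub-corpus size. A worked sketch; the raw counts and corpus sizes below are invented to reproduce the published rates, since the actual counts are not given:

```python
def per_million(raw_count: int, corpus_words: int) -> float:
    return raw_count / corpus_words * 1_000_000

# Illustrative values only: with ~0.9M-word sub-corpora, these raw counts
# would yield the rates reported in the text.
print(round(per_million(2009, 900_000)))  # ~2232 for 'tableau' (French)
print(round(per_million(714, 900_000)))   # ~793 for 'quadro' (Italian)
```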
4. Conclusion
The comparison of results concerning the frequency of forms within the LBC corpus has allowed us to illustrate the use of different sub-corpora to orient the information, both descriptive and normative, that we wish to provide in the bilingual part of our LBC dictionaries. 'We will consider, even if it remains to be demonstrated [...], that an over- or under-representation of a given linguistic phenomenon may correspond to a violation of the usage constraint [...] and that a good translation should tend towards a homogenization between the original language and the translated language.' (Loock et al. 2013: n.p.) The application of methods aimed at verifying translation quality, and the creation of tools based on critical analyses of existing translations, comparing them in particular with productions that do not pass through the mediation of another language, should allow an optimization of the naturalness and precision of translated texts, an essential objective for the dissemination of quality information.

Bibliography
Baker, M. (1993). Corpus Linguistics and Translation Studies. Implications and Applications. In Baker, M. et al. (eds), Text and Technology. Amsterdam/Philadelphia, Benjamins, pp. 233-250.
Billero, R. and Nicolás Martínez, M. C. (2017). Nuove risorse per la ricerca del lessico del patrimonio culturale: corpora multilingue LBC. CHIMERA Romance Corpora and Linguistic Studies, Vol. 4, No. 2, pp. 203-216. ISSN 2386-2629.
Farina, A. (2014). Descrivere e tradurre il patrimonio gastronomico italiano: le proposte del Lessico plurilingue dei Beni Culturali. In Chessa, F. and De Giovanni, C. (eds), La terminologia dell'agroalimentare. Milan, Franco Angeli, pp. 55-66.
Farina, A. (2016). Le portail lexicographique du Lessico plurilingue dei Beni Culturali, outil pour le professionnel, instrument de divulgation du savoir patrimonial et atelier didactique. Publif@rum, vol. 24. http://publifarum.farum.it/ezine_articles.php?id=335
Frawley, W. (1984). Prolegomenon to a theory of translation. In Frawley, W. (ed.), Translation: Literary, Linguistic and Philosophical Perspectives. Newark, University of Delaware Press, pp. 159-175.
Johansson, S. and Hofland, K. (1994). Towards an English-Norwegian parallel corpus. In Fries, U. et al. (eds), Creating and Using English Language Corpora. Amsterdam, Rodopi, pp. 25-37.
Loock, R. (2016). La Traductologie de corpus. Villeneuve-d'Ascq, Presses Universitaires du Septentrion.
Loock, R., Mariaule, M. and Oster, C. (2013). Traductologie de corpus et qualité : étude de cas. Tralogy, Session 5 - Assessing Quality in MT / Mesure de la qualité en TA. http://lodel.irevues.inist.fr/tralogy/index.php?id=188
Uszkoreit, H., Burchardt, A. and Lommel, A. (2013). A New Model of Translation Quality Assessment. Tralogy, Session 5 - Assessing Quality in MT / Mesure de la qualité en TA. http://lodel.irevues.inist.fr/tralogy/index.php?id=319
Zotti, V. (2017). L'integrazione di corpora paralleli di traduzione alla descrizione lessicografica della lingua dell'arte: l'esempio delle traduzioni francesi delle Vite di Vasari. In Zotti, V. and Pano, A. (eds), Informatica Umanistica. Risorse e strumenti per lo studio del lessico dei beni culturali. Firenze University Press.

Il rapporto tra famiglie di anziani non autosufficienti e servizi territoriali: un'analisi dei dati esploratoria con l'Analisi Emozionale del Testo (AET) [The relationship between families of non-self-sufficient elderly people and community services: an exploratory data analysis with Emotional Text Analysis (ETA)]
Felice Bisogni (1), Stefano Pirrotta (2)
(1) Associazione GAP - SPS Scuola di Psicoterapia Psicoanalitica - felice.bisogni@gmail.com
(2) Associazione GAP - SPS Scuola di Psicoterapia Psicoanalitica - stefanopirrotta@gmail.com

Abstract
In this paper the authors present a study commissioned by a local authority to explore the relationship between non-self-sufficient elderly people, their family members, and the community-based assistance services they use. The exploratory data analysis, conducted with Emotional Text Analysis (ETA) (Carli and Paniccia, 2002), was used to identify emotional and cultural factors related to the experience of assisting and being assisted at home and within community-based services. The ETA was carried out on a text corpus assembled by transcribing 45 audio-recorded interviews with non-self-sufficient elderly people and their family members, patients of general practitioners and/or users of community-based services (home-based and semi-residential). The interviews were processed with the T-Lab statistical software (Lancia, 2004), and ETA was applied to produce a cluster analysis. Four clusters of dense words emerged, related to one another on three factorial axes. From the factorial axes emerges an emotional representation of old age as a continuous alert tied to the risk of dying, and as a depressive prescription to survive linked to the expectation of being assisted within one's own family by virtue of 'blood ties'. Reciprocal control and contentiousness, and the desire to transgress the obligation of caregiving and of being cared for, are some of the relevant emotions emerging from the ETA.
The research results also show an emerging demand for a new assistance model, founded on the possibility of talking, playing and having fun with others. Finally, there emerges a demand for services that not only deal with medical problems but also provide psychological support and training to families, to develop relational competences and to build reliable relationships outside the family. In the conclusions of the paper, some considerations are highlighted regarding the relationships between the clusters on the factorial axes and between the clusters and the illustrative variables.

Abstract (Italian)
In this article the authors present a study, conducted with the Emotional Text Analysis (ETA) methodology (Carli and Paniccia, 2002) and commissioned by a local authority, aimed at exploring the emotional factors that organize the experience of the relationship between a group of non-self-sufficient elderly people and their family members and some local social and health services. The ETA was carried out on a text corpus assembled by transcribing 45 audio-recorded interviews with non-self-sufficient elderly people and their family members who use general-practice services and/or local social services (home-based or semi-residential). The interviews were processed with the T-Lab statistical software (Lancia, 2004) and ETA was applied to produce a cluster analysis. The analysis yielded 4 clusters of 'dense words' (Carli and Paniccia, 2002), related to one another on 3 factorial axes, which represent the shared emotional way in which the interviewees speak of their expectations of the services. The interpretation of the data revealed a relationship between family and elderly person in crisis over sharing desire and pleasure in being together. An emotional representation of old age emerges as a continuous alert in the face of the risk of dying and as a depressive prescription to survive, linked to the expectation of being assisted within one's own family by virtue of 'blood ties'. Against this stands the desire to transgress the family obligation to assist and to be assisted. The research results reveal a demand for new assistance models founded on the possibility of talking, playing and having fun, and a demand for services addressed not exclusively to medical problems but also to offering families psychological support and training, so as to develop relational competences and reliable relationships outside the family. The conclusions highlight some considerations concerning the relationship between the clusters on the factorial axes and between the clusters and the illustrative variables.

Keywords: Emotional Text Analysis (ETA), assistance, elders, family, community-based services.

1. Introduction
There are about 2.5 million non-self-sufficient elderly people in Italy. According to the most recent ISTAT projections (2017), the percentage of individuals aged 65 and over will grow by more than 10 percentage points by 2050, coming to make up 34% of the Italian population. The presence of a non-self-sufficient elderly person in the family will increasingly become a common experience for Italian families. Several studies have shown that organizing family assistance for non-self-sufficient elderly people entails significant emotional problems (Haley, 2003). A recent study
Un recente studio JADT’ 18 119 ha analizzato il testo di 26 interviste a familiari di anziani non autosufficienti con esperienza di assistenza da parte di un badante (Paniccia, Giovagnoli, Caputo, 2015). Dall’analisi del testo, condotta tramite la metodologia AET (Carli, Paniccia, 2002), è emerso come i sistemi di relazione familiari entrino in crisi contestualmente all’inattività e alla malattia dell’anziano. L'autrice afferma che la domanda delle famiglie ai servizi sia quella di non essere emarginate con il loro problemi entro il solo contesto familiare, per altro in cambiamento. “Sul piano della ricerca - afferma Paniccia - va sviluppata la differenza, proposta anche dagli intervistati, tra esplorazione dei vissuti degli anziani assistiti da un lato, degli altri membri della famiglia dall’altro”. In quest’ottica, la ricerca-intervento proposta risponde a questo invito, esplorando il vissuto e le attese di un gruppo di anziani non autosufficienti e loro familiari nei confronti di alcuni servizi territoriali. 2. Il progetto di ricerca-intervento psicosociale Il progetto di ricerca-intervento è stato realizzato dagli autori per conto dell'Associazione GAP, un’organizzazione che si occupa di ricerca e intervento psicosociale nell'ambito della disabilità. Il committente è stato un ente locale interessato a coinvolgere anziani non autosufficienti e loro familiari nella costruzione di nuovi modelli di assistenza coerenti con la domanda delle famiglie stesse. L'ente locale intendeva sviluppare un'offerta di servizi d'assistenza innovativi a fronte di cambiamenti sociali e culturali che stanno profondamente modificando l’organizzazione tradizionale della famiglia. Famiglia in passato maggiormente attrezzata al proprio interno per provvedere all'assistenza degli anziani. In tale contesto la ricerca intervento psicosociale è stato proposta come strumento di esplorazione del rapporto tra servizi d'assistenza rivolti agli anziani presenti nel territorio di competenza dell'ente committente e famiglie che a tali servizi si rivolgono. In tale contesto GAP a un gruppo di familiari e anziani non autosufficienti. Tutte le interviste sono state audio-registrate e trascritte in modo da ottenere il testo su cui è stata poi applicata l'Analisi Emozionale del Testo. In questa sede presentiamo i risultati dell'Analisi Emozionale del Testo applicata al testo prodotto trascrivendo 45 interviste a familiari e anziani non autosufficienti. 2.1. La raccolta dei dati Le interviste sono state realizzate a 45 familiari e anziani non autosufficienti in carico ai servizi di medicina generale o ai servizi di centro diurno per anziani fragili partner del progetto. Di questi circa il 60 % usufruivano di servizi di medicina generale insieme al servizio di centro diurno per anziani fragili. Il restante 40% utilizzava esclusivamente i servizi di medicina generale. Sono state realizzate 25 interviste ad anziani e 20 interviste a loro 120 JADT’ 18 familiari. Le interviste sono state trattate in un unico corpus e per questo in analisi è stata inserita la variabile illustrativa “ruolo dell’intervistato”, differenziando le interviste ad anziani da quelle a familiari. L'età media degli anziani intervistati è di 79 anni, mentre l'età media dei famigliari è di 60 anni. Gli intervistati sono stati scelti in ordine al criterio di coinvolgere nella ricerca chi ponesse ai servizi partner problemi complessi che i servizi stessi sentivano di avere difficoltà a prendere in carico. 
This was done on the hypothesis that the interviewees could later take part in a psychosocial intervention based on feeding back the research results and discussing them critically, so as to contribute to designing assistance models more in line with the problems experienced. Interviewees were offered an open, unstructured interview with a single prompt question followed by an invitation to say whatever came to mind. The prompt was the following: "Within a research-intervention project, we are interested in exploring the relationship between assistance services, the elderly and the families who turn to those services. In particular, we are interested in the point of view of family members and of the elderly. We add that we are also interviewing a group of general practitioners and of social-health service workers. We are interested in your experience; we would like to listen to you and gather what you have to say." The interviewers introduced themselves as professional psychologists, members of an association interested in building services for ageing and non-self-sufficiency. Interviewees were told that the research results would be shared with all those concerned in order to understand what initiatives to develop.

3. Methodology
Emotional Text Analysis (Carli, Paniccia, 2002) is a tool of psychosocial research-intervention, developed to explore the ways in which social groups emotionally and collectively symbolize a context or a theme, and how these symbolizations organize the group's behaviour. This methodology, founded on the principle of knowing in order to intervene, involves a process of exploration, analysis and critical discussion of the "local culture" shared within a given context, in relation to the theme under investigation. The use of ETA entails de-structuring the narrative process and the connections that constitute the intentional sense of the discourse in the text under analysis. The methodological approach rests on identifying groups of interrelated words that convey emotional meanings more than others: so-called "dense words". Operationally, we carried out the statistical and computational processing with the T-lab software (Lancia, 2004), choosing the unsupervised Thematic Analysis of Elementary Contexts strategy. The interviews were assembled into a single corpus of 14,053 tokens and 4,121 types, with 230 hapaxes. As for lexical richness, the TTR (Type/Token Ratio) is 0.293. We grouped the occurrences of "dense words" into lexemes; 856 of these were identified in the corpus and entered into the analysis. The number of classified elementary contexts is 1,423 (99.58% of the total of 1,429).
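As a minimal illustration, lexical-richness figures of this kind can be derived from a tokenized corpus along the following lines (a Python sketch; the tokenization itself is assumed to have been done upstream, e.g. by T-lab):

```python
from collections import Counter

def lexical_stats(tokens):
    """Basic lexical-richness measures for a tokenized corpus."""
    counts = Counter(tokens)
    n_tokens = len(tokens)                  # occurrences (running words)
    n_types = len(counts)                   # distinct forms
    n_hapax = sum(1 for c in counts.values() if c == 1)
    ttr = n_types / n_tokens                # Type/Token Ratio
    return n_tokens, n_types, n_hapax, ttr

# For the corpus above: 14,053 tokens and 4,121 types give TTR = 4121/14053 ≈ 0.293.
```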
The data processing carried out by the software involves the following steps: a) construction of a context units × lexical units data table (up to 150,000 rows × 3,000 columns), with presence/absence values; b) TF-IDF normalization and scaling of the row vectors to unit length (Euclidean norm); c) clustering of the context units (measure: cosine coefficient; method: bisecting K-means); d) pruning of the obtained partitions and, for each of them: e) construction of a lexical units × clusters contingency table; f) chi-square test applied to all the intersections of the contingency table; g) correspondence analysis of the lexical units × clusters contingency table. The statistical analysis identified several clusters corresponding to groupings of co-occurring words. The clusters are those with significant recurrence within the text, and they represent the most transversal dimensions characterizing the local culture under exploration.
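A minimal sketch of steps a) to f), using scikit-learn and SciPy as stand-ins for T-lab (the `segments` list, the cluster count and all names are illustrative placeholders, not the actual implementation):

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.cluster import BisectingKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

segments = ["...", "..."]   # elementary contexts (one string each); placeholder
K = 4                       # number of clusters retained by the analysis

# a)-b): presence/absence data table, TF-IDF weighting, unit-length rows.
vectorizer = TfidfVectorizer(binary=True, norm="l2")
X = vectorizer.fit_transform(segments)

# c): bisecting K-means; on unit-length rows, Euclidean K-means orders the
# segments in the same way as the cosine coefficient would.
labels = BisectingKMeans(n_clusters=K, random_state=0).fit_predict(X)

# e): contingency table of lexical units x clusters (presence counts).
presence = (X > 0).toarray().astype(int)
table = np.stack([presence[labels == k].sum(axis=0) for k in range(K)], axis=1)

# f): chi-square of each lexical unit's presence/absence against the partition
# (sparsely attested words may need pooling to keep expected counts positive).
sizes = np.bincount(labels, minlength=K)
for j, word in enumerate(vectorizer.get_feature_names_out()):
    observed = np.stack([table[j], sizes - table[j]])   # 2 x K table
    chi2 = chi2_contingency(observed)[0]
```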
L'assistenza viene inscritta entro il vincolo obbligante dell’essere una famiglia (etimologicamente da famulo, colui che serve, che si prende cura): emerge l’attesa che il ruolo famigliare implichi il dovere di occuparsi di chi non riesce a vivere da solo, preoccupandosi di garantire la sopravvivenza e occupandosi di bisogni inderogabili come mangiare e dormire. Emerge una rappresentazione infantilizzante dell’anziano che sollecita l'instaurarsi di rapporti di dipendenza e accudimento. In tale contesto la quotidianità, deprivata di desideri ed obbiettivi, sembra scorrere in modo depressivo in attesa di morire, con il rischio di una chiusura depressiva all'interno della famiglia. L'anzianità sembra identificata con la figura del vecchio morente che non ha più nulla da dare o da chiedere alla vita. L'unico riferimento alla vitalità entro il cluster è quello connesso a parole come nipoti e telefonare: laddove si allenta l'obbligo dell’assistenza sembra farsi spazio la possibilità di un rapporto piacevole e gratificante. JADT’ 18 123 4.2. Cluster 2: ricerca di servizi e domanda alla psicologia In questo cluster è rappresentato il processo di ricerca di servizi di assistenza. Si cercano centri, contesti estranei alla famiglia, che aiutino ad occuparsi dei problemi della persona non autosufficiente. Da un lato si guarda alla sua soggettività, dall'altro si rappresenta una ricerca affannosa di servizi fondata sull'angoscia di trovare soluzioni. La non autosufficienza è rappresentata come malattia. Ciò comporta un vissuto di urgenza e pericolo e la fantasia di dover contrastare qualcosa che mette a rischio la sopravvivenza. Su questo si chiama in causa il dottore, in ipotesi il medico di base, cui viene attribuita una competenza utile. Allo stesso tempo è chiamata in causa la psicologia cui viene richiesto un intervento di supporto. Si evoca in tal modo una prospettiva di intervento alternativa alla cura. Si chiede di essere aiutati a prepararsi e di essere accompagnati, di parlare con qualcuno poiché ci si sente impreparati, confusi.. A questo proposito i famigliari sembrano portatori di una domanda di ascolto e consulenza fondata sul parlare. Agli enti locali e del privato sociale gli intervistati si propongono come clienti, viene domandata l'articolazione di un'offerta di servizi, valorizzando dispositivi d'intervento di gruppo. 4.3. Cluster 1: funzione di controllo delegata alla badante e paura del cambiamento Il cluster è presente in percentuale statisticamente maggiore entro il testo delle interviste ai familiari (39%). Gli intervistati parlano del problema che vivono, situato nella casa, un contesto chiuso che offre riparo e che al contempo costringe. Da un lato si cercano vie di uscita e d'altro lato c'è difficoltà a lasciare, ad allontanarsi da rapporti protettivi e vincolanti. Viene rappresentato un contrasto tra queste emozioni e il vivere: emerge un sentimento di vita contrastata, per dirla con Canguilhem (1998). In tale contesto si è presi dalla fantasia di abbandonare: emerge l'emozionalità della colpa. Ciò avviene entro un contesto in cui la non autosufficienza viene trattata quale bisogno esclusivamente fattuale e pressante, 24 ore su 24. L'invecchiamento è rappresentato come evento che non lascia tregua, che tormenta e angoscia. In tale contesto si chiede l’intervento della badante per ripristinare il controllo, fare ordine. La badante è rappresentata come una necessità motivata dal bisogno. 
L'assistenza all’anziano è qualcosa a cui ci si sente costretti o da cui liberarsi, tertium non datur. Ma in questo cluster vediamo come vivendo l'invecchiamento come bisogno continuo e prescrivendo l'assistenza si generi colpa. Colpa connessa all’impotenza per il non riuscire a rapportarsi ai cambiamenti con cui la non autosufficienza confronta. 124 JADT’ 18 4.4. Cluster 4: domanda di costruzione di contesti dove parlare, giocare, apprendere. In questo cluster gli intervistati esprimono una domanda di contesti e rapporti fondati sull'apprendimento, il gioco e sulla parola. Emergono desideri e si riconoscono risorse che evocano la possibilità di trovare motivi per cui valga la pena vivere. Emerge una rappresentazione della vecchiaia caratterizzata da vitalità e desiderio di trasgredire. Si allenta la prescrittività dell'obbligo della sopravvivenza: la vecchiaia è anche creatività, possibilità di smarcarsi dagli obblighi rituali della vita sociale. Il riconoscimento del limite del tempo, l'avvicinarsi della fine, motiva la ricerca di esperienze piacevoli che diano senso alla vita. Si evoca il divertimento come obbiettivo alternativo al controllo e alla sorveglianza senza obbiettivi. Sottolineiamo come la domanda divertimento implichi il riconoscimento di una verità non scontata: che si è ancora vivi fino a cinque minuti prima di morire. 5. Conclusioni Per concludere proponiamo alcune considerazioni sul rapporto tra i cluster sui tre assi fattoriali. Ricordiamo che il cluster 3 e il cluster 2 sono in rapporto su polarità opposte del primo fattore, il cluster 1 e il cluster 4 sono in rapporto su polarità opposte del secondo fattore, mentre il cluster 1 e il cluster 3 sono in rapporto sul terzo fattore. Sul primo fattore emerge come la dimensione motivazionale che sostiene la domanda di servizi da parte della famiglia sia il desiderio di uscire dall’obbligo familiare. È il vissuto di obbligo e l’incapacità di condividere entro i rapporti desiderio ed interessi che spinge la famiglia in un'affannosa ricerca di interlocutori e professionisti esterni. Sul secondo fattore emergono diverse modalità di rapportarsi al problema della non autosufficienza. Su di un polo del fattore (cluster 1) la fattualizzazione dell'invecchiamento come bisogno continuo di assistenza che mette in pericolo la sopravvivenza mostra come i problemi associabili alla non autosufficienza non siano esplorati. Tali problemi sembrano piuttosto presunti dal familiare in modo autoreferenziale. L'emozionalità della colpa e la fantasia irrealizzabile di ristabilire il controllo su una situazione in cambiamento vissuta come persecutoria sono corollari di tale autorefenzialità sottesa dall'incompetenza a utilizzare i rapporti familiari come contesto di confronto e scambio sui problemi e sul da farsi. D'altro lato, sull'altro polo del secondo fattore il riconoscimento di limiti, quali ad esempio il tempo limitato della vita e l'ineluttabilità della fine, sembra fare spazio al riconoscimento del desiderio degli anziani di divertirsi anche concedendosi qualche trasgressione, come alternativa a sopravvivere in modo controllante e mortifero. Infine il terzo fattore suggerisce una relazione tra la dinamica di autorefenzialità dei rapporti familiari e la domanda di servizi emergente JADT’ 18 125 entro la cultura in analisi, a cui si chiede non soltanto di curare ma anche di aiutare la famiglia a sviluppare competenze e confrontarsi sui propri problemi. 
The research results suggest a demand for accompanying services that support the family, understood as the context of relationships between the non-self-sufficient person and his or her relatives, in recognizing desires and goals around which to organize care and cohabitation in the most pleasant, vital and enjoyable way possible.

References
Carli R., Paniccia R.M. (2002). L'analisi emozionale del testo. Franco Angeli, Roma.
Haley W.E. (2003). Family caregivers of elderly patients with cancer: understanding and minimizing the burden of care. The Journal of Supportive Oncology, 1(4 Suppl 2): 25-29.
ISTAT (2017). Demografia in cifre. Istituto Nazionale di Statistica, Roma - www.demo.istat.it.
Lancia F. (2004). Strumenti per l'analisi dei testi. Franco Angeli, Roma.
Paniccia R.M., Giovagnoli F., Caputo A. (2015). In-home elder care. The case of Italy: the badante. Rivista di Psicologia Clinica, (2): 60-83.

An experience of textual analysis of clinical documentation and of health information flows, useful for epidemiological research and for investigating the quality of care
Antonella Bitetto1, Luigi Bollani2
1 Azienda Socio Sanitaria Territoriale di Monza - a.bitetto@asst-monza.it
2 Università di Torino - luigi.bollani@unito.it

Abstract
This study is motivated by the now wide availability of clinical documentation stored in electronic form, used to track the patient's health status along the care path or to send information on the activities carried out to other institutions for administrative purposes. The diffusion of these methods makes available many biomedical collections of electronic data, easily accessible at low cost, that can be used for research purposes in observational epidemiological studies, in analogy with what was historically done in studies based on the review of medical records. However, since these collections are not organized according to specific survey schemes, they sometimes do not allow the index events to be discriminated with the necessary reliability between one source and another. It has long been believed that critical re-reading of the texts can partially remedy these informative shortcomings, with the aim of mapping, where possible, the words or segments contained in the texts onto statistically analysable categories. The recent transfer of these collections from paper to electronic form opens the possibility of carrying out this process automatically, reducing its time and costs and perhaps increasing its reliability. We propose to address the problem by presenting study criteria and an example of analysis based on an empirical experience, consistent with the needs of a biomedical context.

Keywords: textual analysis; electronic health data; medical thesaurus; analysis of lexical correspondences; emergency psychiatry
1. Introduction
The progressive dematerialization of clinical documentation (outpatient specialist assessments, Emergency Department reports, diagnostic test reports) and the implementation of health data flows for legal-administrative purposes (payment of services provided, updating of registries, INPS records, etc.) have made available information that can also be used for purposes other than those for which the data were collected. Computerized health data (EHR, electronic health records) are generally divided into: a) structured (e.g. recorded using controlled clinical terminologies such as the International Classification of Diseases, 10th revision (ICD10), or the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT)); b) semi-structured (e.g. laboratory tests and prescription information), following schemes that vary with locally adopted conventions; c) unstructured (e.g. clinical free text); and d) binary (e.g. image files such as X-rays and CT scans). The systematic character of these data collections, mostly organized by individual entities, makes them particularly valuable for several purposes of epidemiological research based on observational designs, both in the field of quality of care and in more classical epidemiology, which studies disease risks and outcomes (Mitchell et al., 1994). On the other hand, since these collections are organized for purposes other than quality monitoring or scientific research, they often need to be "treated" before they can be analysed with statistical methods. In the past this was done by having subject-matter experts re-read the medical records. Today there is a growing reliance on automatic text-analysis methods, which guarantee better standardization and review (Denaxas et al., 2017).
As an example, automatic text analysis of information flows and of electronic clinical documentation has made it possible to investigate fundamental therapeutic and safety areas such as the quality of nursing care and the occurrence of adverse events, including, among many others, domestic accidents, allergic reactions and drug side effects (Ehrenberg and Ehnfors, 1999; Coloma et al., 2011; Migliardi et al., 2004). Numerous classical epidemiological studies have also been produced, mostly concerning highly prevalent chronic diseases such as cardiovascular disease, diabetes or asthma, both abroad and in Italy (Gini et al., 2016; Vaona et al., 2017), in some cases highlighting unexpressed care needs or complications due to delays or inappropriate treatment (Persell et al., 2009; Ho et al., 2012). Some research has focused on mental disorders, the medical area chosen for the text-analysis experience presented below. In this field, electronic clinical documentation makes it possible to obtain low-cost information on broad population segments, including case series that would otherwise be difficult to recruit: this is the case of subjects in a prodromal phase at high risk of developing psychosis (Fusar-Poli et al., 2017) or of self-harming subjects (Zanus et al., 2017).

2. Methods
The classification of not-yet-studied corpora into statistically analysable categories is a controversial topic, but also a challenge that, in our view, justifies in-depth investigation of the methodological procedures to adopt. Below we propose a method for treating unstructured psychiatric medical texts, following criteria already partly used in previous work (Bitetto et al., 2017).

2.1. Corpus
The information comes from the reports of psychiatric consultations carried out at the Emergency Department of a large Lombard university hospital (1,250 accredited beds). The corpus is monolingual, in Italian, and is composed of short texts written by the psychiatrist on duty at the end of the emergency consultation. The reports are verified and stored by the hospital information service, certified ISO 9001/2015, which provided the corpus in anonymized form. We analysed 1,721 reports covering the period 01/01/2012 - 31/12/2012.

2.2. Linguistic filtering pre-treatment
The corpus underwent a linguistic filtering pre-treatment. From the 177,349 words in the original reports we removed punctuation, numbers, pronouns, articles, prepositions, proper nouns (including drug names) and words occurring fewer than 10 times. The result was a list of 1,679 distinct words, which was manually reviewed by an expert to select the terms able to describe mental-health problems/needs according to the structural model used by the HoNOS scale (Wing et al., 1998; Lora et al., 2001). This is a model for assessing mental-health status organized around problems rather than diagnoses, which are rarely reported in Emergency Department notes.
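The filtering pre-treatment just described can be sketched as follows, assuming a spaCy Italian pipeline as a stand-in for the tools actually used (model name, token handling and thresholds are illustrative):

```python
from collections import Counter
import spacy

nlp = spacy.load("it_core_news_sm")   # any Italian pipeline with POS tagging

# POS categories to drop: punctuation, numbers, pronouns, articles/determiners,
# prepositions and proper nouns (the latter also cover drug brand names).
DROP = {"PUNCT", "NUM", "PRON", "DET", "ADP", "PROPN"}

reports = ["...", "..."]              # the ED consultation reports (placeholder)
tokens = [t.text.lower() for doc in nlp.pipe(reports) for t in doc
          if t.pos_ not in DROP]

freq = Counter(tokens)
candidates = sorted(w for w, c in freq.items() if c >= 10)  # min frequency 10
# 'candidates' is the word list then reviewed manually against the HoNOS areas.
```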
The model distinguishes 12 "problems", summarized by the following items: H1 - overactive, aggressive behaviour; H2 - deliberate self-harm; H3 - problems linked to alcohol or drug use; H4 - cognitive problems; H5 - physical illness problems; H6 - problems linked to hallucinations and delusions; H7 - problems linked to depressed mood; H8 - other problems from other psychic symptoms; H9 - problems with significant relationships; H10 - problems with activities of daily living; H11 - problems with living conditions; H12 - problems with occupational and recreational activities. In this way a thesaurus was created, composed of 214 short phrases and 81 single words traceable to 11 clinical categories (H10 was excluded, given the lack of phrases traceable to it). The thesaurus also included words and acronyms identifying visits connected with "refusal of care". The text-filtering procedure based on the thesaurus (taking care not to include contexts in which the key word is negated) made it possible to reclassify 1,629 reports, which form the basis of the analysis.

2.3. Statistical analysis
Each report was examined for the presence/absence of every key word or phrase, so as to introduce, for each word, a binary coding over the whole set of texts. This coding was then extended to the items of the HoNOS classification by assessing, in each report, the presence of each item, determined by the presence of at least one key word associated with it (absence of the item being determined, conversely, by the absence of any associated key word). To represent the association between the items across the reports, a correspondence analysis (Benzécri, 1973) was then performed on the texts × HoNOS items table (to which the concept of refusal/interruption of care was added); moreover, in order to appreciate the relationships between words and the behaviours/problems expressed by the classification, the key words and phrases were added as supplementary points.
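To make the thesaurus-based binary coding concrete, here is a minimal sketch; the thesaurus entries and the negation cues are invented illustrations, not the 214-phrase thesaurus actually built:

```python
import re

# Invented mini-thesaurus: HoNOS item -> key words/phrases. The real thesaurus
# maps 214 short phrases and 81 single words onto 11 clinical categories.
THESAURUS = {
    "H1": ["aggressivo", "agitazione psicomotoria"],
    "H2": ["autolesionismo", "ideazione suicidaria"],
    "H7": ["umore depresso", "flessione del tono dell'umore"],
}
NEGATION_CUES = ("non ", "nega ", "assenza di ")   # illustrative only

def item_present(text, keys, window=25):
    """1 if any key occurs outside an (approximately detected) negated context."""
    low = text.lower()
    for key in keys:
        for m in re.finditer(re.escape(key), low):
            left = low[max(0, m.start() - window):m.start()]
            if not any(neg in left for neg in NEGATION_CUES):
                return 1
    return 0

def encode(report):
    """Binary presence/absence HoNOS vector for one report."""
    return {item: item_present(report, keys) for item, keys in THESAURUS.items()}
```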
3. Results
Table 1 shows the frequency distribution of the problem areas described and reclassified according to the HoNOS criteria.

Table 1 - HoNOS items and percentage of reports in which the behaviour/problem occurs

Item        H1     H2     H3     H4    H5     H6     H7     H8     H9    H11   H12   Refusal of care
% present   30.82  15.22  12.22  7.18  20.32  18.35  32.72  59.55  5.10  1.23  7.31  18.97

As expected, the reports describe above all the clinical manifestations of distress, with detailed descriptions of the observed symptoms, rather than environmental factors (H9, H11, H12). Among the symptoms, the most frequent are depressed mood (H7) and the class gathering all unspecified clinical manifestations, "other psychic symptoms" (H8). Descriptions of organic problems (physical symptoms, H5) are also very frequent, as expected, since psychiatric emergencies are managed in the general Emergency Department, where requests for opinions on visits linked to physical problems are more common than in a second-level outpatient clinic. The occurrence of violent and overactive behaviour (H1), one of the most typical emergencies in psychiatry, is also very high.

Figure 1 - Left: joint representation of the first 8 HoNOS items (psychic and physical symptoms); right: behavioural symptoms (H1, H2, H3), psychic symptoms (H6, H7, H8) and precipitating environmental factors (H9, H11, H12)

The left-hand plot of Figure 1 shows the results of the correspondence analysis of the symptom categories, the problem area most frequently found in the texts. The first factorial plane, shown in the plot, explains 34.17% of the total variance. Along dimension 1 (the horizontal axis), the symptom categories split into two groups. On the right we find problems linked to depressed mood (H7), close to other symptoms (H8), of which, as already noted, anxiety represents the largest area, and to physical symptoms (H5), confirming the probable psychosomatic origin of part of them. In the same grouping lie deliberately self-harming and suicidal behaviours (H2), which according to the literature are often associated with depression. At high values on the first axis, by contrast, are grouped psychotic symptoms (H6) and agitated behaviour (H1), in relation with refusal of care, with which they are indeed often associated. Problems linked to alcohol and drug abuse (H3) and those due to cognitive problems of neurological origin (H4) turn out instead to be independent of the other symptom categories, occupying the extremes of dimension 2 (the vertical axis). The same analysis is shown in Figure 2, which also projects the relevant words of the thesaurus.

Figure 2 - Joint representation of the first 8 HoNOS items, with supplementary representation of the key words/phrases used to identify the items

Returning to the right-hand plot of Figure 1, a second correspondence analysis was conducted on the psychic and behavioural symptom categories together with the precipitating environmental factors. Here the first factorial plane explains 30.33% of the total variance. The distribution of psychic symptoms along the horizontal axis confirms, as expected, the results of the analysis of the first subset of categories. Here one can note the tendency of alcohol- and drug-abuse problems (H3) to move towards the centre of the plot, near the other-symptoms category (H8), to which some of their manifestations may be related. As for the environmental factors, the data show a relationship between work problems (H12), symptoms of the depressive spectrum (H7) and deliberately self-harming conduct (H2). The Emergency Department may represent a first point of access for users with reactive conditions, even severe ones, secondary to occupational stress factors (burnout, reactive depression). The other categories of environmental problems (H9 and H11) lie at the extremes of dimension 2, showing a degree of independence from the occurrence of behavioural and psychic symptoms.
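The correspondence analyses just described can be reproduced in essence with a few lines of linear algebra. The following NumPy sketch (not the software actually used) computes principal coordinates, per-axis inertia shares and the projection of supplementary points such as the key words:

```python
import numpy as np

def correspondence_analysis(N):
    """Simple correspondence analysis of a contingency table N (rows x cols)."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U * s) / np.sqrt(r)[:, None]     # row principal coordinates
    G = (Vt.T * s) / np.sqrt(c)[:, None]  # column principal coordinates
    inertia = s**2 / (s**2).sum()         # share of total variance per axis
    return F, G, s, inertia

def supplementary_columns(N_sup, F, s, axes=2):
    """Project supplementary columns (e.g. key words counted over the same
    reports) onto the leading CA axes via the transition formula."""
    profiles = N_sup / N_sup.sum(axis=0)              # column profiles
    return profiles.T @ F[:, :axes] / s[:axes]
```

On a texts × HoNOS-items table, `inertia[:2].sum()` gives the share of variance explained by the first factorial plane (34.17% in the first analysis above).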
4. Conclusions
The empirical experience of automatic textual analysis of Emergency Department reports confirms its usefulness for investigating complex phenomena such as the clinical manifestations and risk factors of psychiatric emergencies. Correspondence analysis proves a simple and useful method for exploring the relationships among the dimensions under examination. Some problems nevertheless emerge concerning the quality of the information: having been collected for other purposes, it is overabundant in some areas (symptomatic manifestations) and lacking in others, such as the subject's degree of disability, which could not be analysed as a precipitating factor of the emergency. Such shortcomings might be overcome by acquiring information from other sources, as some researchers have done (Fusar-Poli et al., 2017). There would still remain the open problem of sharing and standardizing data-treatment methods across the various stages of the investigation, from the way information is collected and reports are written, to the creation of a standard thesaurus of psychiatric key words and phrases based on shared theoretical concepts and criteria.

References
Benzécri J.-P. (1973). L'analyse des données. Vol. 2. Dunod, Paris.
Bitetto A. et al. (2017). La consultazione psichiatrica in Pronto Soccorso come fonte informativa sui bisogni inespressi di salute mentale. Nuova Rassegna Studi Psichiatrici, vol. 15, novembre 2017.
Coloma P.M. et al. (2011). Combining electronic healthcare databases in Europe to allow for large-scale drug safety monitoring: the EU-ADR Project. Pharmacoepidemiology and Drug Safety, 20(1): 1-11.
Denaxas S. et al. (2017). Methods for enhancing the reproducibility of biomedical research findings using electronic health records. BioData Mining, 10: 31.
Ehrenberg A., Ehnfors M. (1999). Patient problems, needs, and nursing diagnoses in Swedish nursing home records. Nursing Diagnosis, 10(2): 65-76.
Fusar-Poli P. et al. (2017). Diagnostic and prognostic significance of Brief Limited Intermittent Psychotic Symptoms (BLIPS) in individuals at Ultra High Risk. Schizophrenia Bulletin, 43(1): 48-56.
Gini R. et al. (2016). Automatic identification of type 2 diabetes, hypertension, ischaemic heart disease, heart failure and their levels of severity from Italian General Practitioners' electronic medical records: a validation study. BMJ Open, 6(12): e012413.
Ho M.L. et al. (2012). The accuracy of using integrated electronic health care data to identify patients with undiagnosed diabetes mellitus. Journal of Evaluation in Clinical Practice, 18(3): 606-611.
Lora A. et al. (2001). The Italian version of HoNOS (Health of the Nation Outcome Scales), a scale for evaluating the outcomes and the severity in mental health services. Epidemiology and Psychiatric Sciences, 10(3): 198-204.
Migliardi A. et al. (2004). Descrizione degli incidenti domestici in Piemonte a partire dalle fonti informative correnti. Epidemiologia & Prevenzione, 28(1): 20-26.
Mitchell J. et al. (1994). Using Medicare claims for outcome research. Medical Care, 35: 589-602.
Persell S.D. et al. (2009). Electronic health record-based cardiac risk assessment and identification of unmet preventive needs. Medical Care, 47(4): 418-424.
Vaona A. et al. (2017). Data collection of patients with diabetes in family medicine: a study in north-eastern Italy. BMC Health Services Research, 17(1): 565.
Wing J.K. et al. (1998). Health of the Nation Outcome Scales (HoNOS). Research and development. The British Journal of Psychiatry, 172(1): 11-18.
Zanus C. et al. (2017). Adolescent admissions to emergency departments for self-injurious thoughts and behaviors. PLoS One, 12(1): e0170979.

Exploring the history of American philosophy in a computer-assisted framework
Guido Bonino1, Davide Pulizzotto2, Paolo Tripodi3
1 Università di Torino - guido.bonino@unito.it
2 LANCI, Université du Québec à Montréal - davide.pulizzotto@gmail.com
3 Università di Torino - paolo.tripodi@unito.it

Abstract
The aim of this paper is to check to what extent some tools for computer-assisted concept analysis can be applied to philosophical texts endowed with complex and sophisticated contents, so as to yield results that are significant not only because of the technical success of the procedures leading to the results themselves, but also because the results, though highly conjectural, are a direct contribution to the history of philosophy.

Keywords: philosophy, history of philosophy, paradigm, necessity, idealism, Digital Humanities, text analysis, computer-assisted framework

1. Computer-assisted concept analysis
The development of artificial intelligence poses a methodological challenge to the humanities. Many traditional practices in disciplines such as philosophy are increasingly integrating computer support. In particular, Concept Analysis (CA) has always been a common practice for philosophers and other scholars in the humanities. Thanks to the development of Text Mining (TM) and Natural Language Processing (NLP), computer-assisted text reading and analysis can provide the humanities with new tools for CA (Meunier and Forest, 2005), making it possible to analyze large textual corpora, which were previously virtually unassailable. Examples of computer-assisted analyses of large corpora in philosophy are Allard et al., 1963; McKinnon, 1973; Estève et al., 2008; Danis, 2012; Sainte-Marie et al., 2010; Le et al., 2016; Meunier and Forest, 2009; Ding, 2013; Chartrand et al., 2016; Pulizzotto et al., 2016; Slingerland et al., 2017.
134 JADT’ 18 Exploring the history of American philosophy in a computer-assisted framework Guido Bonino1, Davide Pulizzotto2, Paolo Tripodi3 2 1Università di Torino – guido.bonino@unito.it LANCI, Université du Québec à Montréal – davide.pulizzotto@gmail.com 3Università di Torino – paolo.tripodi@unito.it Abstract The aim of this paper is to check to what extent some tools for computerassisted concept analysis can be applied to philosophical texts endowed with complex and sophisticated contents, so as to yield results that are significant not only because of the technical success of the procedures leading to the results themselves, but also because the results, though highly conjectural, are a direct contribution to the history of philosophy Sommario Lo scopo di questo articolo è di verificare in che misura la computer-assisted concept analysis possa essere applicata a testi filosofici di contenuto complesso e sofisticato, in modo da produrre risultati significativi non solo dal punto di vista del successo tecnico delle procedure, ma anche in quanto i risultati stessi, sebbene altamente congetturali, costituiscono un contributo diretto alla storia della filosofia. Keywords: philosophy, history of philosophy, paradigm, necessity, idealism, Digital Humanities, Text Analysis, Computer-assisted framework 1. Computer-assisted concept analysis The development of artificial intelligence poses a methodological challenge to the humanities. Many traditional practices in disciplines such as philosophy are increasingly integrating computer support. In particular, Concept Analysis (CA) has always been a common practice for philosophers and other scholars in the humanities. Thanks to the development of Text Mining (TM) and Natural Language Processing (NLP), computer-assisted text reading and analysis can provide the humanities with new tools for CA (Meunier and Forest, 2005), making it possible to analyze large textual corpora, which were previously virtually unassailable. Examples of computer-assisted analyses of large corpora in philosophy are Allard et al., 1963; McKinnon, 1973; Estève et al., 2008; Danis, 2012; Sainte-Marie et al., 2010; Le et al., 2016; Meunier and Forest, 2009; Ding, 2013; Chartrand et al., 2016; Pulizzotto et al., 2016; Slingerland et al., 2017. The use of computer- JADT’ 18 135 assisted text analysis is also relevant for the distant reading approach, developed by Franco Moretti in the context of literature studies (Moretti, 2005; Moretti, 2013), but which we are convinced can be usefully extended to different fields (for the application to philosophy see the Conference “Distant Reading and Data-Driven Research in the History of Philosophy” held in Turin in 2017, http://www.filosofia.unito.it/dr2/). The main aim of this paper is to check to what extent some tools for computer-assisted CA can be applied to texts endowed with complex and sophisticated contents, so as to yield results that are significant not only because of the technical success of the procedures leading to the results themselves, but also because the results, though highly conjectural, are a direct contribution to the humanities. Philosophy, in particular the history of philosophy, seems to be a good case to be considered, because of the sophistication of its contents. Our main purpose is that of illustrating some of the different kinds of work that can be done in history of philosophy with the aid of computer-assisted CA. 2. Method 2.1. 
2. Method

2.1. The corpus
To understand how TM and NLP can assist work in the history of philosophy, some standard methods have been applied to a specific corpus, provided by ProQuest (www.proquest.com). The corpus is a collection of 20,751 PhD dissertations in philosophy defended in the US from 1981 to 2015. It therefore contains 20,751 documents: each document is a text comprising the title and the abstract of a dissertation, which are dealt with as a single unit of analysis. The corpus also contains some metadata, such as the author of the dissertation, the year of publication, the name of the supervisor, the university, the department, and so forth. In the present paper we are not going to exploit fully the wealth of information provided by these metadata, which are certainly worth being the subject of further research. However, we will use the crucial datum of the year of publication, which allows us to assume a diachronic (that is, historical) perspective on the investigated documents.

2.2. Data preprocessing
A preliminary step consists in a set of four preprocessing operations that extract the linguistic information needed for the analysis: 1) Part-of-Speech (POS) tagging; 2) lemmatization; 3) vectorization; 4) selection of sub-corpora according to Keyword In Context (KWIC) criteria. POS tagging and lemmatization are performed with the TreeTagger algorithm described by Schmid, 1994 and 1995. This operation annotates each word of each document with its morphological category. Some irrelevant categories (such as determiners, prepositions and pronouns) are eliminated; nouns, verbs, modals, adjectives, adverbs, proper nouns and foreign words are retained. The lemmatization process reduces a word to its lemma, according to the corresponding POS tag. At the end of this process we can identify 17,750 different lemmas, which are called types. The mathematical modeling of each document in a vector space is called vectorization. In such a model, each document is encoded by a vector whose coordinates correspond to the TF-IDF weights of the words occurring in that document; this weighting function computes the normalized frequencies of the words in each document (Salton, 1971). At the end of the process a matrix M is built, with 20,751 rows corresponding to the documents and 17,750 dimensions corresponding to the types. Finally, three sub-corpora are created on the basis of the KWIC criterion. These sub-corpora correspond to the sets of all the text segments in which one of three lexical forms, each conveying the meaning of a concept, appears: 'necessity', 'idealism' and 'paradigm'. The three concepts have been chosen because of the considerable diversity of their statuses: 'necessity' has always been a keyword of several sub-fields of philosophy; 'idealism' refers both to a historically determined philosophical current and to an abstract position in philosophy; 'paradigm' entered the philosophical vocabulary in relatively recent times, mainly after the publication of Kuhn, 1962, as a technical term in the philosophy of science. We obtain a set of 719 documents for 'necessity', 450 documents for 'idealism', and 975 documents for 'paradigm'.
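A minimal sketch of the vectorization and KWIC selection steps, with scikit-learn as a stand-in for the actual pipeline (document strings and variable names are placeholders; POS tagging and lemmatization, e.g. with TreeTagger, are assumed to have been applied upstream):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["title and abstract ...", "..."]   # 20,751 title+abstract units

# TF-IDF vectorization of the lemmatized documents.
vectorizer = TfidfVectorizer()
M = vectorizer.fit_transform(documents)         # ~20,751 x 17,750 here

def kwic_subcorpus(docs, keyword):
    """Select the documents in which a given lexical form occurs."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [i for i, d in enumerate(docs) if pattern.search(d)]

subcorpora = {w: kwic_subcorpus(documents, w)
              for w in ("necessity", "idealism", "paradigm")}
S = {w: M[idx] for w, idx in subcorpora.items()}   # the S_i sub-matrices
```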
2.3. Word-sense disambiguation process
For each sub-corpus, we identify the semantic patterns (usually, word co-occurrence patterns) associated with each lexical form, so as to discover the most relevant semantic structures of that concept. This is done by means of clustering, a common machine-learning method for pattern-recognition tasks (Aggarwal and Zhai, 2012). Clustering techniques applied to texts rest on two hypotheses: a contiguity hypothesis and a cluster hypothesis. The former states that texts belonging to the same cluster form a contiguous region quite clearly distinct from other regions, while the latter says that texts belonging to the same cluster have similar semantic content (Manning et al., 2009, pp. 289 and 350). For our purposes, clustering is an instrument of semantic disambiguation. In our experiment we use the K-means algorithm (Jain, 2010, p. 50), widely employed for word-sense disambiguation tasks (Pal and Saha, 2015). The main parameter to be tuned in K-means is k, which determines the number of centroids to be initialized. Each execution of the K-means algorithm generates a partition Pk with a number of clusters equal to k. Since each centroid is the "center vector" of its cluster, it can also be used to identify the most "prototypical" documents in a given cluster. To this end, a tool generally used to select relevant documents in Information Retrieval is employed: the cosine computed between a query vector and a group of "document vectors" (Manning et al., 2009). In this setting, each centroid of a partition Pk can be used as a query in order to identify the documents with the highest cosine values. Clustering has first been applied synchronically to the Si matrices with k = {2, 3, 4, ..., 50}, thus obtaining the most recurrent semantic patterns; then it has been applied diachronically, dividing each matrix into three periods (1981-1993, 1994-2003, 2004-2015) chosen so as to obtain sets of documents of similar cardinality. On each sub-matrix of Si, several clusterings with k = {2, 3, 4, ..., 50} were performed, in order to identify the temporal evolution of the most important semantic patterns associated with the three concepts under study. For each generated partition Pk we also perform the cosine computation, so as to obtain the most relevant PhD dissertations belonging to each cluster.
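The disambiguation procedure can be sketched as follows, assuming a scikit-learn K-means as a stand-in for the implementation actually used (`S` is a sub-corpus TF-IDF matrix; all names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def partition_with_prototypes(S, k, n_proto=5):
    """K-means partition P_k of a sub-corpus matrix S, plus the n_proto most
    'prototypical' documents per cluster (highest cosine to the centroid)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(S)
    prototypes = {}
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        sims = cosine_similarity(S[members],
                                 km.cluster_centers_[c].reshape(1, -1)).ravel()
        prototypes[c] = members[np.argsort(sims)[::-1][:n_proto]]
    return km.labels_, prototypes

# Explore partitions synchronically and per period, as in the paper:
# for k in range(2, 51):
#     labels, protos = partition_with_prototypes(S["necessity"], k)
```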
3. Analyses
In this section we present three analyses, focusing on three different concepts: paradigm, necessity and idealism. Each case illustrates a different kind of historical-philosophical result.

3.1. Necessity
After exploring, both synchronically and diachronically, several clusterings (with different k) associated with the concept of necessity, we have focused on a clustering with k = 18 over the period 1981-2015 (the clusters do not differ significantly across the three decades). It turns out that there are at least 16 clearly distinct and philosophically interesting meanings of 'necessity': two (perhaps distinct) theological notions; physical necessity; political necessity; necessity as investigated in modal logic and possible-world semantics; moral necessity; necessity as opposed to freedom in debates over determinism; the necessity of historical processes; metaphysical necessity; two notions of causal necessity (attacked by Hume); the necessity of life events; logical necessity; phenomenological necessity; the necessity of the Absolute (Hegel); the necessity of moral duty (Kant); the ancient concept of necessity; the necessity of law. In addition, there is a rather big cluster in which 'necessity' seems to occur mainly with its ordinary, not strictly philosophical meaning.

If the clustering we applied to 'necessity' were extended to a large number of philosophical words (chosen in our corpus by domain experts), that would be the first step towards the construction of a bottom-up vocabulary of philosophy, and ultimately of a data-driven philosophical dictionary, in which the different (though related) meanings of philosophical terms would be determined on the basis of actual use, rather than merely on the lexicographer's discernment. This lexicographic work is also an indispensable step if one wants to overcome the "concordance approach": it seems to us that this bottom-up lexicography could be a promising starting point for the construction of semantic networks.

3.2. Idealism
Unlike 'necessity', the term 'idealism' has different distributions in the decades 1981-1993, 1994-2003 and 2004-2015. We have considered only the largest clusters (> 10 documents), since for our purpose (reconstructing the main historical developments of American academic philosophy) isolated cases and minor tendencies are not relevant. The evolution of some clusters over the decades suggests interesting historical reflections. First, the cluster "Kant" is persistently important. In fact, it becomes more and more important, even in wider contexts, that is, in documents not directly devoted to Kant; this is shown by the rising trend of the cluster "Transcendental" (a term typically, but not always, directly connected with Kant). Second, the cluster "Hegel" disappears in the second decade and then reappears: is this a real phenomenon rather than a statistical artefact? How can it be explained? Third, the cluster "Realism" disappears in the third decade: is there a relationship between the return of "Hegel" and the disappearance of "Realism"? This is not the kind of question that comes naturally to the mind of the historian of philosophy on the basis of his or her knowledge of the well-known developments of recent American philosophy. The hypothesis can be formulated only thanks to some sort of defamiliarization (ostranenie) with respect to the received views in the history of philosophy. Yet it seems unlikely that philosophers in the last decade gave up speaking of realism. The received view may after all be correct: realism is more and more central in late analytic philosophy (think, for example, of the centrality of David Lewis) (Bonino and Tripodi, forthcoming). This is confirmed by other data, such as the number of occurrences of 'realis-' in the abstracts of the corpus: 1981-1993: 373 (5.76% of 6,471); 1994-2003: 465 (6.31% of 7,361); 2004-2015: 482 (5.6% of 8,585). Thus the focus on realism is still there in the third decade. One is therefore led to formulate an alternative hypothesis: philosophers ceased to speak of idealism in relation to realism:
It is usually maintained that statistical techniques are useful in that they restrict the space of possible interpretations (Mitchell, 1997), but in other cases, such as the one described in this section, at least in an early phase of the hermeneutic process, in virtue of their defamiliarizing impact they can also have the opposite effect: that of broadening that same space and discovering nouveaux observables (Rastier, 2011). 3.2. Paradigm This case study deals with the term ‘paradigm’ in the period 1981-2015. After exploring several k in the three decades, we focus on the synchronic analysis of the set of clusters with k=16. The first result that immediately stands out is that ‘paradigm’ occurs rather often: 995 documents, twice as many as ‘idealism’ (450), and considerably more than ‘necessity’ (719), a concept which is widely regarded as central in the recent history of Anglo-American philosophy. Using Google Ngram Viewer, and thus taking into account a generalist, non disciplinary corpus, it turns out that such a high frequency is peculiar to the philosophical discourse (the lowest value of ‘necessity’ is 0.0025%, which is higher than the highest value for ‘paradigm’, which is 0.0016%). Why does ‘paradigm’ occur so frequently? On the one hand, one could find this datum not so surprising, since ‘paradigm’ is a technical term in the philosophy of science, introduced by Kuhn, 1962 to refer to a set of methodological and metaphysical assumptions, examples, problems and solutions, a vocabulary, which are taken for granted, in a given period of normal science, by a scientific community. On the other hand, moving from a priori considerations to the examination of the data, a partly different landscape emerges: ‘paradigm’ seems to be a fashionable concept, which is used in a variety of contexts as a term that is neither technical nor simply ordinary. Only in cluster 8 has the term a straightforward technical use, derived from Kuhn’s philosophy of science. Each of the other clusters (1: theology, 2: music, 3: philosophy of law, 4: education; 5: nursing; 6: philosophy of religion; 7: moral philosophy; 9: bioethics, 10: spiritualism; 11: political theory; 12: self narrative; 13: theology; 14: Kant-Leibniz; 15 140 JADT’ 18 aesthetics; 16: philosophy and language in Wittgenstein, Heidegger etc.) does not correspond to a different meaning of the term ‘paradigm’, but simply to the application of the same concept to different fields. In most cases we have to do with non-technical contexts, in which ‘paradigm’ has neither its original grammatical meaning nor its ordinary, non-philosophical meaning (standard, exemplar). It seems to us that its meaning and use are generic and vague, rather than precise and technical; nonetheless, they evoke Kuhn: a quasi-Kuhnian vocabulary became fashionable; it entered many philosophical discourses, often more “humanistic” than “scientific” in spirit, and much less technical than the philosophy of science. This case study expresses an especially interesting kind of result obtainable by using TM and NLP techniques to assist research in history of philosophy: it shows how the interpretation of clusters fosters the discovery of terminological fashions as opposed to genuine conceptual developments. References Aggarwal C.C., and Zhai C.X. (2012). “A Survey of Text Clustering Algorithms.” In Mining Text Data, 77–128. Springer. Allard M. et al. (1963). Analyse conceptuelle du Coran sur carte perforées. Mouton. Bonino G. and Tripodi P. 
Chartrand L., Meunier J.-G. and Pulizzotto D. (2016). CoFiH: A heuristic for concept discovery in computer-assisted conceptual analysis. In Mayaffre D. et al. (eds.), Proceedings of the 13th International Conference on Statistical Analysis of Textual Data, vol. I, pp. 85-95.
Danis J. (2012). L'analyse conceptuelle de textes assistée par ordinateur (LACTAO); une expérimentation appliquée au concept d'évolution dans l'œuvre d'Henri Bergson. Université du Québec à Montréal (http://www.archipel.uqam.ca/4641/1/M12423.pdf).
Ding X. (2013). A text mining approach to studying Matsushita's management thought. Proceedings of the 5th International Conference on Information, Process and Knowledge, pp. 36-39.
Estève R. (2008). Une approche lexicométrique de la durée bergsonienne. Actes des journées de la linguistique de corpus, vol. 3: 247-258.
Jain A.K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8): 651-666.
Kuhn T.S. (1962). The Structure of Scientific Revolutions. University of Chicago Press.
Le N.T., Meunier J.-G., Chartrand L. et al. (2016). Nouvelle méthode d'analyse syntactico-sémantique profonde dans la lecture et l'analyse de textes assistées par ordinateur (LATAO). In Mayaffre D. et al. (eds.), Proceedings of the 13th International Conference on Statistical Analysis of Textual Data.
Manning C.D. et al. (2009). Introduction to Information Retrieval. Online edition. Cambridge University Press, Cambridge, UK.
McKinnon A. (1973). The conquest of fate in Kierkegaard. CIRPHO, 1(1): 45-58.
Meunier J.-G. and Forest D. (2005). Classification and categorization in computer-assisted reading and analysis of texts. In Cohen H. and Lefebvre C. (eds.), Handbook of Categorization in Cognitive Science, pp. 955-978. Elsevier.
Meunier J.-G. and Forest D. (2009). Lecture et analyse conceptuelle assistée par ordinateur: premières expériences. In Annotation automatique et recherche d'informations. Hermes.
Mitchell T.M. (1997). Machine Learning. McGraw-Hill.
Moretti F. (2005). Graphs, Maps, Trees: Abstract Models for a Literary History. Verso.
Moretti F. (2013). Distant Reading. Verso.
Pal A.R. and Saha D. (2015). Word sense disambiguation: A survey. International Journal of Control Theory and Computer Modeling, 5(3).
Pincemin B. (2007). Concordances et concordanciers: de l'art du bon KWAC. XVIIe Colloque d'Albi. Langages et signification - Corpus en lettres et sciences sociales: des documents numériques à l'interprétation, pp. 33-42.
Pulizzotto D. et al. (2016). Recherche de "périsegments" dans un contexte d'analyse conceptuelle assistée par ordinateur: le concept d'"esprit" chez Peirce. JEP-TALN-RECITAL 2016, vol. 2, pp. 522-531.
Rastier F. (2011). La mesure et le grain. Sémantique de corpus. Champion.
Sainte-Marie M. et al. (2010). Reading Darwin between the lines: a computer-assisted analysis of the concept of evolution in the Origin of Species. 10th International Conference on Statistical Analysis of Textual Data.
Salton G. (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Upper Saddle River, NJ.
Schmid H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing. Manchester, UK.
Schmid H. (1995). Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT-Workshop, pp. 47-50.
Slingerland E. et al. (2017). The distant reading of religious texts: A "big data" approach to mind-body concepts in early China. Journal of the American Academy of Religion: 1-32.

Descending hierarchical classification for the analysis of social representations in an anti-bilingualism petition in New Brunswick, Canada
Marc-André Bouchard, Sylvia Kasparian
Université de Moncton - emb1214@umoncton.ca; sylvia.kasparian@umoncton.ca

Abstract
In this article, we apply Jean-Blaise Grize's theoretical framework and Max Reinert's descending hierarchical classification to a corpus composed of comments published as part of a petition against institutional bilingualism in New Brunswick. Using Iramuteq, we point to the lexical worlds which constitute anti-bilingualism arguments.

Keywords: lexical worlds, social representations, schematization, descending hierarchical classification, online petition

1. Introduction
Every discourse analysis, as Jean-Blaise Grize acknowledges in Logique naturelle et communications (1998; 144-145), is confronted with the problem of the correspondence between discourse and representations. This is attributable in particular to the importance that discourse analysis gives to the communication situation, a factor that complicates the relation of correspondence between what one says and what one "really" thinks. In this article we propose to explore the intersection between discourse analysis and the study of representations, and we will try to show that, although the problem of the correspondence between discourse and individual representations remains difficult to solve, online petition corpora homogenize the discourse and act on the schematization constructed by the speaker, in such a way that logometric analyses can access some of the social representations at stake. To this end, we will use the Reinert method (a descending hierarchical classification originally popularized by the ALCESTE software) (1990), as implemented in the Iramuteq software (Ratinaud, 2009), which consists in identifying the lexical worlds of a corpus. Several authors, including Max Reinert himself, have already established links between this method and the field of social representation studies (1993; 13). Our contribution to the conversation is to apply this logometric methodology and the theoretical framework developed by Grize to a new type of corpus that has been gaining popularity since the beginning of the 21st century: online petitions. The example through which we will illustrate our theoretical exposition is the analysis, with Iramuteq, of the lexical worlds of an online petition launched in New Brunswick (Canada) in 2013 on the platform www.change.org, against the requirement of bilingualism as an employment criterion in the provincial public service.
2. Theoretical framework

According to Denise Jodelet, a social representation can be defined as "a form of knowledge that is socially elaborated and shared, has a practical aim, and contributes to the construction of a reality common to a social group" (1997; 53). Thus, as Serge Moscovici remarks, the study of representations calls for methods of observation rather than experimentation, given that a representation manifests itself "as a 'modelling' of the object that is directly readable in, or inferred from, various linguistic, behavioural or material supports" (idem; 61). Although it is a form of knowledge, a representation differs from scientific knowledge in that it derives from what Jean-Blaise Grize calls natural logic (Grize, 1997; 171-172), yielding a "common-sense knowledge" (Jodelet, 1997; 53). "Natural logic" here means a logic of a logico-discursive order, manifested in discourse through schematization, which "takes into account the contents and not only the forms of thought" (Grize, 1997; 171-172). According to Grize, schematization is articulated by five notions: [1] a schematization is the putting into discourse [2] of the point of view that a speaker A [3] forms, or has, [4] of a certain reality R; this putting into discourse is made for an interlocutor, or a group of interlocutors, B, [5] in a given situation of interlocution (idem). Grize thus proposes that all communication is a situation of interlocution, in which the speaker constructs a schematization according to his cultural preconstructions, his representations of the object in question, and his aims; this schematization is made up of images of the speaker, of the listener and of the object at stake, and it is then reconstructed by the listener according to the listener's own representations, cultural preconstructions and aims (Grize, 1993; 7). Schematization is therefore both partial and biased: "it is partial insofar as its author includes only what he judges useful to his aim, to the effect he wants to produce; it is biased since he arranges it so that B will receive it" (Grize, 1997; 175). As for aims, according to Patrick Charaudeau, discourses, particularly argumentative ones, pursue a double quest, plausibility and influence, whose success depends on the "socio-cultural representations shared by the members of a given group in the name of experience or knowledge" (1992; 784). Given this "double quest" of the argumentative mode, the representations of objects on which the speaker builds his schematization are chosen because the speaker supposes them to be shared by the addressee(s). Consequently, the analysis of the lexical worlds common to a group of speakers in the same communication situation can give us clues to the social representations that the group forms of an object of the social world. Indeed, according to Max Reinert, in a collective corpus a lexical world is an indicator of a reference space common to a group and "the index of a form of coherence linked to the specific activity of the enunciating subject" (Reinert, 1993; 13).
The descending hierarchical classification method (Reinert, 1990) provides a representation of these lexical worlds (or themes) in the form of classification tables obtained by crossing the context units (or segments) of a corpus with its lexemes. The hypothesis underlying the method is that "insofar as a collective representation expresses a certain regularity of structure within a class of singular representations [...] this regularity is due to the constraints of what we call 'a world'" (Reinert, 1993; 29-30). Taking into account the frequency and the environment of the forms of a corpus makes it possible not only to identify the lexical forms most likely to constitute indices of social representations, but also to define those lexical forms according to their co-text.

3. Corpus

The corpus analysed in this study comes from an online petition. Unlike a classic petition, an online petition allows signatories to publish, if they wish, a comment justifying their support for its title and description. The petition in question here, Stop the hiring discrimination against citizens who speak English only (https://www.change.org/p/the-government-of-new-brunswick-stop-the-hiring-discrimination-against-citizens-who-speak-only-english), was launched in 2013 on www.change.org. Its comments, besides being signed by their authors, are publicly accessible on the petition page itself. This particularity of the communication channel, which Contamin (2001) calls "a classic paradox of petitions", affects the addressee of the discourse: the addressee is not only the provincial government but also the general public. Online petition corpora thus homogenize discourse in the sense of Grize's model of communication. First, the group of speakers is in the same situation of interlocution (written monologues in argumentative mode) and is invited to share its point of view on the same reality (here, the institutional bilingualism of the province of New Brunswick). These discourses are produced for a general public, and the committed nature of a petition means that, in theory at least, only speakers sharing the point of view stated in the title are represented. The point of view shared by the contributors to our corpus is that requiring English-French bilingualism for jobs in the provincial public service constitutes discrimination against anglophone New Brunswickers, who are largely unilingual (fewer than 15% of them consider themselves bilingual, compared with a rate of over 70% in the francophone minority community). These discourses belong to a long-running debate within the New Brunswick population on institutional bilingualism, historically divided along linguistic lines: francophones favour state bilingualism and the advancement of language rights, whereas anglophones are more reticent. In all, by its close at the end of 2013, the petition Stop the hiring discrimination against citizens who speak only English had collected 7,758 signatures and a total of 2,372 comments, each ranging in length from one word ("jobs") to 304 words, for an average of 37.66 linguistic units per comment. The corpus contains 4,425 different forms representing a total of 89,338 occurrences.
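Summary statistics of this kind (number of comments, occurrences, distinct forms, mean comment length) are easy to recompute. The following minimal sketch assumes a plain-text file with one comment per line; the file name and the crude tokenizer are illustrative stand-ins, not the preprocessing actually performed by Iramuteq.

import re

def tokenize(text):
    # crude word tokenizer; Iramuteq's own segmentation differs in detail
    return re.findall(r"\w+", text.lower())

# assumed input: one petition comment per line
with open("petition_comments.txt", encoding="utf-8") as f:
    comments = [line.strip() for line in f if line.strip()]

tokens = [t for c in comments for t in tokenize(c)]
forms = set(tokens)

print(len(comments), "comments")
print(len(tokens), "occurrences;", len(forms), "distinct forms")
print("mean comment length: %.2f tokens" % (len(tokens) / len(comments)))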
The cleaned and standardized corpus was submitted to the Iramuteq software, which produced the dendrogram of classes constituting the lexical worlds of the comments, presented in the next section.

4. Analysis

The 89,338 occurrences (4,425 different forms) constituting our corpus are grouped into 3,492 lemmas, namely 2,954 active forms and 538 supplementary forms. The whole corpus is segmented into a total of 2,423 parts of roughly equal size (on average 36.87 forms per segment). The descending hierarchical classification computed with Iramuteq produces the graph presented in Figure 1.

Figure 1: Classification on simple text segments.

Reading Figure 1 shows that the first segmentation of the corpus yields Class 1 (in red), a class representing 30.3% of the classified segments and constituted by a lexicon we call the socio-political axis: it deals first with the "majority"/"minority" dynamic, which, judging by this list of forms, plays a leading role in the representations of Canada and its provinces. Besides a few forms relating to culture and language, one also notices a lexical field that seems to indicate political positioning in the corpus ("right", "common", "sense", "rule", "vote", "political", "equal"), while the verbs ("fight", "cater", "stand", "stop", "start", "push"), also political in nature, reinforce the hypothesis that this class is made up of segments expressing representations of Canadian society. Once Class 1 is constituted, the computation divides the remaining segments into two classes: Class 2 (in green), containing 31.7% of the classified segments, and Class 3 (in blue), with 38%. Collectively, these two classes stand apart from Class 1 by a lexicon rooted in personal experience rather than political opinion. In Class 2 this personal character appears in forms such as "home", "family", "child", "young" and "daughter". The verbs specify the context of this experience: "move", "find", "leave", "work", "live", "stay", "raise", "love" and "born"; as do some evaluative and/or axiological adjectives: "hard", "good", "decent" and "impossible". A few forms, besides "[new] brunswick", refer to another Canadian province, Alberta. The content of Class 2 thus constitutes the biographical axis, often joined to the theme of the exodus to Western Canada. The third and last class of the corpus (in blue) gravitates around the theme of work, more precisely of job seeking. It is also in this class that we find the only direct references to language, apart from the form "language" in Class 1: "bilingual", "speak" and "french". Certain forms specific to Class 3 suggest that it is, in part, more impersonal than Class 2: "employee", "person", "applicant" and "individual".
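The classification just described can be sketched computationally. The following minimal example bisects a set of segments along the first factorial axis of a correspondence analysis of the binary segment-by-form matrix; the actual Reinert algorithm in ALCESTE/Iramuteq instead searches for the cut that maximizes a chi-squared criterion and iterates the bisection with reassignment steps, all omitted here. The toy segments are invented for illustration.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def ca_first_axis(X):
    # row coordinates on the first axis of a simple correspondence analysis
    P = X / X.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return s[0] * U[:, 0] / np.sqrt(r)

def split_segments(X):
    # one Reinert-style bisection: order segments along the first CA axis
    # and cut at the median coordinate
    coords = ca_first_axis(X)
    cut = np.median(coords)
    return np.where(coords < cut)[0], np.where(coords >= cut)[0]

segments = ["speak french to get a job", "family had to move west",
            "minority rules over the majority", "jobs require bilingual staff",
            "my daughter left the province", "majority should have equal rights"]
X = CountVectorizer(binary=True).fit_transform(segments).toarray().astype(float)
left, right = split_segments(X)
print(left, right)

Applying the same split recursively to each resulting class, and stopping when classes become too small or too homogeneous, yields the dendrogram of lexical worlds of the kind shown in Figure 1.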
Starting from the classification on text segments, one can automatically scan all the segments of each class and give each a score according to the number of words representative of the class in which it falls, also taking into account the degree of representativeness of those forms. The two following segments are thus characteristic of Class 1: "discrimination of the english[-]speaking white majority populace should stop with the democratic system becoming more in play with majority rules as a true reflection of the people"; "we as a province cannot afford duplicate books in 2 languages to support a minority and the need to speak french in a majority speaking english province to have a job is ridiculous". In the segments characteristic of Class 1, a reversal of the classic power relation between a majority group and a minority group appears: here the anglophones are oppressed, while it is the francophones who are advantaged, who have the attentive ear of the government and, ultimately, who hold the bilingual labour market. This oppression would show in the difficulty unilingual anglophones have in finding a job, notably in the public service but perhaps also in the private sector. One immediately notices a representation of democracy reduced to majority rule (defined by H. B. Mayo (1957; 50) as "the principle that when there is a majority on a matter, then the wishes of the majority should prevail"), explicitly communicated in the first characteristic segment of Class 1. As for Class 2, here are two of its most characteristic segments: "it is very important to me because my daughter like 1000s of other working children here in new brunswick have had to leave their home province in order to find work because they only speak their own language of english."; and "i have been out of work for over a year. Unable to find a full time job due to bilingualism restrictions. Going to have to move west.". A recurrent motif thus appears in Class 2: to find a good job, or indeed any job, one must be bilingual, failing which one goes into exile, notably in Western Canada. These segments testify to a feeling of powerlessness but also of reluctance at the idea of leaving one's native province. Some characteristic segments of Class 2 deal with the personal experience of the commenter, who had to move, or believes he will have to move, to a non-bilingual province, while others recount the accomplished or anticipated exodus of their child(ren). In the Class 2 segments above, the poverty of the job market for anglophones is readily attributed to the linguistic factor. Finally, the characteristic segments of Class 3 are the following: "because this is a problem, i have 17 years' experience and 2 degrees and i can't even apply for the jobs i qualify for because it's mandatory bilingual positions when over 90% of the day is dealing in english, they won't even interview you unless you speak french"; and "the most qualified person for the job is not always hired because they are not bilingual". The various forms of the concept of "qualification", and others semantically linked to it, are omnipresent in these characteristic segments. Linguistic competence is evidently excluded from this concept.
Indeed, a person who speaks only English is presented as potentially as qualified as, and occasionally more qualified than, a bilingual candidate for a job requiring bilingualism. The scenario put forward, often hypothetical, tends to stage a unilingual person who would be more qualified than another for whom bilingualism is presented as the sole asset.

5. Conclusion

In sum, within this petition, the speakers put into discourse representations of institutional bilingualism in New Brunswick through three lexical worlds, presenting three facets of the perceived discrimination against anglophones in the public service. The first lexical world is socio-political and states general principles about what is just; the second is biographical and relates the personal effects of this discrimination; and the third deals with examples of how this discrimination manifests itself in the world of work. The sample of social representations of institutional bilingualism constituting our corpus thus exhibits a causal link between the bilingualism requirement for certain jobs and the difficulties of the provincial labour market. With the aim of convincing a general public, this point of view is presented from an angle at once ideological, personal and practical, invoking certain images of democracy, exodus and competence; images which, although relatively homogeneous in our corpus, would not necessarily be shared in the social representations of bilingual anglophones and of francophones.

Bibliography

Charaudeau, Patrick (1992). Grammaire du sens et de l'expression. Hachette.
Contamin, J.-G. (2001). Contribution à une sociologie des usages pluriels des formes de mobilisation : l'exemple de la pétition en France. Doctoral thesis, Université Paris 1.
Grize, Jean-Blaise (1998). Logique naturelle et communications. Presses Universitaires de France.
Jodelet, Denise (1997). Les représentations sociales. In Jodelet, ed., Les représentations sociales (5th ed.). Presses Universitaires de France.
Mayo, H. B. (1957). Majority Rule and the Constitution in Canada and the United States. Political Research Quarterly, vol. 10(1): 49-62.
Ratinaud, Pierre (2009). Iramuteq : interface de R pour les analyses multidimensionnelles de textes et de questionnaires. http://www.iramuteq.org.
Reinert, Max (1990). Alceste, une méthodologie d'analyse des données textuelles et une application. Bulletin de Méthodologie Sociologique, vol. 26(1): 24-54.
Reinert, Max (1993). Les "mondes lexicaux" et leur "logique" à travers l'analyse statistique d'un corpus de récits de cauchemars. Langage et société, vol. 66(1): 5-39.
Reinert, Max (1997). Postures énonciatives et mondes lexicaux stabilisés en analyse statistique de discours. Langage et société, no. 121/122: 189-202.

Analysing occupational safety culture through mass media monitoring

Livia Celardo (1), Rita Vallerotonda (2), Daniele De Santis (2), Claudio Scarici (2), Antonio Leva (2)
(1) Sapienza University of Rome
(2) INAIL Research – Headquarters for Research of the Italian National Institute for Insurance against Accidents at Work

Abstract

In recent years, a group of researchers within the Italian National Institute for Insurance against Accidents at Work (INAIL) has launched a pilot project on mass media monitoring, in order to find out how the press deals with the culture of safety and health at work.
To monitor the mass media, the Institute has created a relational database of news concerning occupational injuries and diseases, filled with information obtained from newspaper articles about work-related accidents and incidents, including the full text of the articles. The ultimate objective is to identify the main lines along which to run awareness-raising actions on safety and health at work. In the first phase of this project, 1,858 news articles regarding 580 different accidents were collected; for each injury, not only the news texts but also several variables were recorded. Our hypothesis is that journalists use a different language to narrate different kinds of accidents. To verify it, a text clustering procedure was applied to the articles, together with a lexical correspondence analysis; our purpose is to find language distinctions connected to groups of similar injuries. Identifying the various ways in which events are reported could provide new elements for describing safety knowledge, and could also ground collaborations with journalists aimed at enhancing communication and raising public attention to workers' safety.

Keywords: Occupational safety; Work-related accident; Text mining; Mass media.

1. Introduction

The study described here grew out of the collaboration between the Department of Social Sciences and Economics of Sapienza University of Rome and the Headquarters for Research of INAIL (Italian National Institute for Insurance against Accidents at Work), where since 2012 a team of researchers has developed the idea of monitoring the mass media with a view to preventing accidents at work (INAIL, 2015). With this in mind, those researchers built the so-called "Repertorio Notizie SSL" (News Repository on Occupational Safety and Health), a relational database of media news related to occupational injuries and diseases.
The objective of this project is to observe the culture of occupational safety and health communicated by mass media agencies, in order to identify new elements for increasing prevention against accidents at work. In this study we focus on the hypothesis that there are asymmetries in the language used to describe injuries depending on the characteristics of the event. To test it, we performed some automatic text analysis procedures on the repository data. The article is structured as follows: in Section 2, the News Repository is presented; in Section 3, the data and methodology are described; in Section 4, the results of the analyses are shown; in Section 5, conclusions are drawn.

2. The tool

The News Repository on Occupational Safety and Health (NeRO) is a tool created to allow analyses of the contents and texts of news related to occupational diseases and injuries. Our strategic objective is to increase public awareness and safety culture through a different approach, based also on the study of news articles, their composition and their communication dynamics. The first operational purpose is therefore to understand:
- which kinds of terms are used in news articles about accidents at work or occupational diseases;
- what inspires a title;
- how the same news is treated by different sources/media;
- how the news text can be interpreted in different ways depending on who communicates it;
- whether or not some specific aspects of the events are considered by the media.

Our study plans to analyse the cultural characteristics of mass media communication regarding occupational safety and health (OSH), observing the attitude of the mass media (and journalists) towards the subject and the way users perceive the news depending on which words are used. As mentioned before, NeRO is an ad hoc relational database centred on the gathering of newspaper articles about accidents at work, but it is also arranged to gather news on near misses, occupational diseases and incidents from all kinds of sources (press, television or radio). It involves several interconnected digital tables, which contain structured information (i.e. based on appropriate classifications) and unstructured information (i.e. textual). Information retrieval covers events that happened in Italy, drawing both on online sources and on directly consulted newspapers, since we exploited the Google Alert service (with suitable keywords) and a daily-newspaper subscription ("la Repubblica"). The reference unit is the event (for now restricted to accidents), to which different aspects and pieces of information are linked: one or more articles, one or more injured workers, and so on (a sketch of this relational structure is given after the list below). The data-entry interface consists of a series of thematic screens, starting from the opening one, which lists the already recorded events. These screens allow the following data to be entered, step by step:
- [Screen "Event"] Text containing the event description, date of the event, venue, company where the accident occurred (if appropriate), economic activity;
- [Screens "News"] Texts of each article related to the event, newspaper name (or press affiliation), news title, web URL, date of the article;
- [Screens "Worker" and sub-screens "Accident" and "Harms, disorders or diseases"] Injured worker's biographical data, information about the accident, type of injury, physical implication or resulting disease.
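As a purely illustrative sketch of this event-centred relational structure (not NeRO's actual schema; all field names here are our own guesses), the organization could be modelled as follows.

from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class News:
    newspaper: str              # name or press affiliation
    title: str
    text: str
    published: date
    url: Optional[str] = None

@dataclass
class Accident:
    injury_type: str
    harm: Optional[str] = None  # physical implication or resulting disease

@dataclass
class Worker:
    age: Optional[int] = None   # biographical data, reduced here to one field
    accident: Optional[Accident] = None

@dataclass
class Event:
    description: str
    event_date: date
    venue: str
    economic_activity: str
    company: Optional[str] = None
    news: List[News] = field(default_factory=list)       # one or more articles
    workers: List[Worker] = field(default_factory=list)  # one or more injured workers

The point of the event-as-reference-unit design is that the same accident can be linked to several articles from different sources, which is what makes cross-source comparisons of wording possible.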
3. Methodology and data

At the end of data collection, the repository contained 1,858 news items, related to almost six hundred different accidents. In order to analyse the content of the news texts in connection with the characteristics of the different events, we performed a content analysis using Reinert's method (Reinert, 1983) for a descending hierarchical partition. Starting from the co-occurrence matrix, this algorithm generates groups of lexical units (i.e. words) that most often co-occur in the texts. The lexical groups were then projected onto the factorial axes, together with the variable modalities, using lexical correspondence analysis (Lebart, Salem and Berry, 1997); in this way, we could observe how the language is connected to the features of the accidents. Finally, to better understand the differences between news texts, we analysed the specificities related to the modalities of the variables.

4. Main results and discussion

The cluster analysis performed on the news texts using Reinert's method, taking the articles as segments, produced three lexical groups (respectively the red, the blue and the green ones in Figure 1):
- Cluster 1 (56.5%): this group includes the words related to the description of the events, in terms of what happened;
- Cluster 2 (26.5%): here we have the terms connected to road accidents;
- Cluster 3 (17%): this group concerns the emotional aspects connected to the events.

We projected the lexical groups (Figure 1) and the modalities of the event variables (Figure 2) onto the first two factors obtained by the lexical correspondence analysis. As shown in Figure 2, the language used in the newspapers displays some interesting characterizations. Some variables, such as the economic activity and the accident site, show a strong lexical differentiation among their modalities; this means that the person narrating the event, i.e. the journalist, uses a specific language to describe the accident on the basis of these characteristics. The other variables showed no particular specificities, except the one related to the mortality of the accident. In fact, as shown in Figure 2, the variable "accident mortality" is best represented on the second factor, given the position and distance of the modalities "yes" and "no" from the origin. To better understand the lexical differences, we also analysed the specificities (Bolasco and De Mauro, 2013; Lafon, 1980; Lebart, Salem and Berry, 1997) for this particular variable.

Figure 1: Lexical groups. Figure 2: Lexical correspondence analysis.

Starting from the results shown in Table 1, we can observe a significant difference in the language used depending on whether the accident is fatal or not. The terms used in the case of a non-fatal event relate to the description of the injury, while in the case of a fatal accident the situation is completely different: the words refer to the emotional sphere of the event, so concepts like family or unpredictability are very often used to describe what happened.
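A specificity analysis of this kind can be sketched as follows, using the usual normal approximation of the hypergeometric model underlying Lafon's (1980) specificities; the word-count dictionaries are assumed to have been computed beforehand, and the exact test implemented in lexicometric software may differ in detail.

import math

def specificity_z(sub_counts, all_counts, sub_total, corpus_total):
    # z test-value per word: positive values mean the word is
    # over-represented in the subcorpus (e.g. fatal-accident articles)
    z = {}
    for word, F in all_counts.items():   # F: frequency of the word in the whole corpus
        f = sub_counts.get(word, 0)      # f: frequency of the word in the subcorpus
        p = F / corpus_total
        # mean and variance of the hypergeometric distribution
        var = sub_total * p * (1 - p) * (corpus_total - sub_total) / (corpus_total - 1)
        if var > 0:
            z[word] = (f - sub_total * p) / math.sqrt(var)
    return z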
Table 1: Analysis of the specificities for the variable "accident mortality" (z = test-value).

Fatal accident: No        z       Fatal accident: Yes       z
Hospital               59.17      Tragedy               35.68
Serious                58.84      Family                27.17
To transfer            54.90      Useless               23.62
Dangerous              28.38      To leave              19.84
Rescue                 24.13      Victim                18.68
Ambulance              24.09      Tragic                17.71
Leg                    23.12      Friend                14.95
Injury                 22.06      Band                  14.89
Trauma                 20.55      Condolence            12.65
Hand                   18.84      Province              12.15
Fracture               16.70      Son                   11.49
Helicopter             13.70      Wife                  11.48
Bus                    12.23      Escape                10.63
Crossroad              10.20      Mayor                  9.11

5. Conclusions

The project presented here showed how the News Repository on OSH (NeRO) can contribute to the analysis of occupational safety and health, even though some institutions already maintain databases of newspaper articles dealing with OSH. In addition to news texts, NeRO provides several systematized variables, making it possible to filter news according to various search criteria and, above all, to carry out organized studies and analyses on the textual data as well. In this paper we presented one of the studies implemented on the repository data using automatic text analysis. The results revealed that a large amount of information is contained within these data, although some information asymmetries are present. For that reason, it will be essential to set up a discussion with a network of journalists and other experts, in order to improve and enhance media communication. The challenge is to get out of the inner circle of prevention practitioners and build a bridge connecting the Institution to a more general public, also involving liaison organizations (such as trade unions and employers' associations).

References

Bolasco S. and De Mauro T. (2013). L'analisi automatica dei testi: fare ricerca con il text mining. Carocci Editore.
Iezzi D. F. (2012). Centrality measures for text clustering. Communications in Statistics – Theory and Methods, 41(16-17): 3179-3197.
INAIL (2015). Il monitoraggio dei mass media in materia di salute e sicurezza: strumenti per la raccolta e l'analisi delle informazioni.
Lafon P. (1980). Sur la variabilité de la fréquence des formes dans un corpus. Mots, 1(1): 127-165.
Lebart L., Salem A. and Berry L. (1997). Exploring Textual Data (Vol. 4). Springer Science & Business Media.
Reinert M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2): 187-198.

Is the educational culture in Italian Universities effective? A case study

Barbara Cordella, Francesca Greco, Paolo Meoli, Vittorio Palermo, Massimo Grasso
Sapienza University of Rome – barbara.cordella@uniroma1.it; francesca.greco@uniroma1.it; paolomeoli3@libero.it; vittorio.palermo2511@gmail.com; massimo.grasso@uniroma1.it

Abstract

The paper explores professors' and students' representations of professional training in Clinical Psychology in the Faculty of Medicine and Psychology of the Sapienza University of Rome, in order to understand whether the educational context supports students in developing their ability to enter the job market. To this aim, an Emotional Text Mining of the interviews of 30 students and 17 teachers of the Clinical Psychology Master of Science was performed. Both corpora underwent the analysis procedure implemented in T-Lab, i.e. a cluster analysis with a bisecting k-means algorithm followed by a correspondence analysis on the keyword-per-cluster matrix, and the results were compared.
The results show 4 clusters and 3 factors for each corpus, highlighting a relationship between student and professor representations. Both groups split the training process, separating the educational process from the professional one. The emotional text mining of the interviews turned out to be an enlightening tool, letting emerge the latent dimensions that shape the process and outcome of academic training, and it proved very useful for educational purposes.

Keywords: Education, Clinical Psychology, Job Market, Youth Unemployment, Emotional Text Mining.

1. Introduction

The problem of youth unemployment is highly relevant nowadays. In Italy, 25% of young people under 30 years of age are unemployed, and this percentage grows to 40% for the under-25s (McKinsey & Company, 2014). Why is this percentage so high? According to McKinsey's study (ibidem), the figure of 40% for youth unemployment does not depend on the economic cycle but on "structural causes". Among these causes, education is one of the relevant factors in youth unemployment, while also being a protective factor against poverty and for quality of life, as stated by ISTAT (2017). Graduates are less likely to become poor, although employability and wages depend on the type of degree: 80% of young graduates in psychology are employed after four years (Anpal Servizi, 2017), and psychologists are more likely to become entrepreneurs than employees. Most probably, the length of time needed to get into the job market is connected to the mismatch between the educational system and enterprise (McKinsey & Company, 2014): young people's skills are considered appropriate by 70% of schools and universities, but only by 42% of employers. The effectiveness of education depends in part on the representation of professional training that characterizes the university. Several studies have investigated students' representations in the Psychology Faculty in order to improve the training process (e.g., Carli et al., 2004; Paniccia et al., 2009).
Given the change in the educational plan that took place over the past decade, this study aims to understand whether the present educational context supports students in developing their ability to enter the job market, by performing an emotional text mining (Cordella et al., 2014; Greco, 2016) of interviews with students and teachers of the Master Degree in Clinical Psychology at the Sapienza University of Rome.

2. Methodology

We know that a person's behaviour depends not only on rational thinking but also, and sometimes above all, on an emotional and social way of mental functioning (Carli, 1990; Moscovici, 2005). Namely, people consciously categorize reality and, at the same time, unconsciously symbolize it emotionally (Fornari, 1976). These two thinking processes are the product of the double-logic functioning of the mind (Matte Blanco, 1981), which allows people to adapt to their social environment. According to this socio-constructivist approach, based on a psychodynamic model, unconscious processes are social, as people interactively generate and share the same emotional meanings. The socially shared emotional symbolization shapes interactions, behaviours, attitudes, expectations and communication processes; for this reason, the analysis of narrations allows the latent emotional meaning of a text to be captured (Salvatore & Freda, 2011). If the conscious process sets the manifest content of the narration, namely what is narrated, the unconscious process can be inferred from how it is narrated, that is, from the words chosen to narrate and their association within the text. We consider that people emotionally symbolize an event, or an object, and socially share this symbolization; the words they choose to talk about the event or object are the product of the socially shared unconscious symbolization (Greco, 2016). Accordingly, it is possible to detect the associative links between words in order to infer the symbolic matrix determining the coexistence of these terms in the text. To this aim, we performed a multivariate analysis based on a bisecting k-means algorithm (Savaresi and Boley, 2004) to classify the text, and a correspondence analysis (Lebart and Salem, 1994) to detect the latent dimensions shaping the cluster-per-keyword matrix. The interpretation of the cluster analysis results allows the elements characterizing the emotional representation of education to be identified, while the results of the correspondence analysis reflect its emotional symbolization (Cordella et al., 2014; Greco, 2016). The advantage of this approach is that the factorial space can be interpreted according to the polarization of words, thus identifying the emotional categories that generate professional training representations, and that the interpretation of the clusters is facilitated by exploring their relationships within the symbolic space.

3. Data collection and analysis

In order to explore the emotional representation of education in the Master of Science in Clinical Psychology, we interviewed 30 students (13% of the students) and 17 teachers (71% of the teachers) of the Sapienza University of Rome, on the basis of their voluntary participation. We used an open-question interview for both students and teachers. The students' interviews resulted in a medium-size corpus of 57,387 tokens, and the teachers' interviews in a small corpus of 28,746 tokens.
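The bisecting k-means step mentioned in the methodology can be sketched as follows, in the generic form described by Savaresi and Boley (2004): the largest cluster is repeatedly split in two until the requested number of partitions is reached. This is an illustration of the principle, not T-Lab's actual implementation.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters, seed=0):
    # X: context-units-by-keywords matrix; returns one index array per cluster
    clusters = [np.arange(X.shape[0])]
    while len(clusters) < n_clusters:
        # pick the currently largest cluster and split it in two
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

Splitting only the largest cluster is one common selection rule; variants instead split the cluster with the largest within-cluster dispersion.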
In order to check whether the data could be statistically processed, two lexical indicators were calculated: the type-token ratio and the hapax percentage (TTR students = 0.09, hapax students = 50.3%; TTR teachers = 0.147, hapax teachers = 53.8%). Given the size of the corpora, both lexical indicators highlight their richness and indicate that the analysis can proceed. First, the data were cleaned and preprocessed with the T-Lab software (Lancia, 2017) and the keywords were selected. Given the size of the corpora and the hapax percentage, the keywords were chosen according to the selection criteria proposed by Greco (Cordella et al., 2014; Greco, 2016): in particular, we used stems as keywords instead of types, filtering out the lemmas of the open questions of the interviews. Then, on the context-units-per-keywords matrix, we performed a cluster analysis with a bisecting k-means algorithm (Savaresi and Boley, 2004) limited to ten partitions, excluding all context units that did not contain at least two co-occurring keywords. The eta-squared value was used to evaluate the partitions and choose the optimal solution. To finalize the analysis, a correspondence analysis was performed on the keywords-per-clusters matrix (Lebart and Salem, 1994) in order to explore the relationships between clusters and to identify the emotional categories shaping professional training representations for both students and teachers.

4. Main results and discussion

The results of the cluster analysis show that the selected keywords allow, on average, the classification of 96% of the context units in both corpora. The eta-squared values were calculated on partitions from 3 to 9 clusters, and they show that the optimal solution is four clusters for both corpora. The correspondence analysis detected three latent dimensions. Tables 1 and 2 present the emotional map of professional training emerging from the teachers' and students' interviews, and the location of the clusters in the factorial space.

Table 1: Cluster coordinates on the factors of the teachers' corpus (the percentage of explained inertia is reported in brackets after each factor; CU in Cl = context units classified in the cluster).

Cluster (CU in Cl %)                   Factor 1 (26.53%)    Factor 2 (19.03%)    Factor 3 (14.56%)
                                       Motivation           Outcome              Role
1 Training Group (22.3%)               Group -0.21          Competence 0.51      Teacher -0.50
2 Clinical Training (33.7%)            Institution 0.33     Competence 0.23      Professional 0.39
3 Institutional Obligations (20.2%)    Institution 0.65     Degree -0.66         Teacher -0.38
4 Student Orientation (23.8%)          Group -0.79          Degree -0.39         Professional 0.16

The first factor of the teachers' corpus (Table 1) represents the motivation for teaching, focusing either on the group of students and their specific needs or on the Institution's generic goals; the second factor concerns the training outcome, the degree or professional skills; and the third factor reflects the role of the academic professor, who may represent himself as a teacher or as a professional. As regards the students' corpus (Table 2), the first factor represents the approach to the university experience, which can be perceived as an individual experience or a social (relational) one; the second factor describes how students experience vocational training, perceiving it either as the fulfilment of obligations or as the construction of professional skills requiring personal involvement; and the third factor reflects the outcome of educational training, which can focus on the development of professional skills or on the achievement of qualifications.
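Both the lexical indicators and the eta-squared criterion used above are easy to restate in code. The sketch below assumes a plain list of tokens and a dense context-units-by-keywords matrix; the exact formulas implemented in T-Lab may differ in detail (here the hapax percentage is computed over types).

from collections import Counter
import numpy as np

def lexical_indicators(tokens):
    # type-token ratio and hapax percentage
    freq = Counter(tokens)
    ttr = len(freq) / len(tokens)
    hapax_pct = 100 * sum(1 for n in freq.values() if n == 1) / len(freq)
    return ttr, hapax_pct

def eta_squared(X, clusters):
    # share of total variance explained by a partition (between-cluster SS / total SS);
    # clusters is a list of row-index arrays, as returned by bisecting_kmeans above
    grand = X.mean(axis=0)
    ss_tot = ((X - grand) ** 2).sum()
    ss_between = sum(len(idx) * ((X[idx].mean(axis=0) - grand) ** 2).sum()
                     for idx in clusters)
    return ss_between / ss_tot

Computing eta squared for each partition from 3 to 9 clusters and keeping the solution after which the gain levels off reproduces the kind of model selection described above.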
Table 2: Cluster coordinates on the factors of the students' corpus (the percentage of explained inertia is reported in brackets after each factor; CU in Cl = context units classified in the cluster).

Cluster (CU in Cl %)                   Factor 1 (23.2%)     Factor 2 (15.3%)      Factor 3 (14.0%)
                                       Approach             Training              Outcome
1 Idealized Product (27.6%)            Individual -0.56     Fulfilment 0.45       Skills -0.43
2 Professional Education (20.8%)       -0.04                Construction -0.63    Skills -0.24
3 Group Identity (26.3%)               Relational 0.69      Fulfilment 0.22       -0.01
4 Empty Degree (25.3%)                 Individual -0.32     0.01                  Qualifications 0.59

Table 3: Teachers' clusters (the percentage of context units classified in each cluster is reported in brackets; each keyword is followed by the number of context units, CU, in which it occurs).

Cluster 1 (22.3%) Training Group: studente 59, cercare 43, corso 43, teoria 32, lezione 21, modalità 21, organizzazione 20, intervento 19, relazione 17.
Cluster 2 (33.7%) Clinical Training: psicologia 94, lavoro 81, clinico 54, insegnare 36, contesto 29, problema 27, intervento 27, diverso 25, conoscenza 22, interno 22, modello 16.
Cluster 3 (20.2%) Institutional Obligations: scuola 29, persona 28, laurea 19, università 18, trovare 17, specializzazione 16, importante 16, entrare 15, scegliere 14, percorso 14.
Cluster 4 (23.8%) Student Orientation: domanda 42, idea 40, organizzazione 33, aggiungere 32, processo 30, rispetto 29, orientare 21, parlare 21, corso di laurea 20, attività didattiche 18.

The four clusters of both corpora are of different sizes (Tables 1 and 2) and reflect the representations of professional training (Tables 3 and 4). As regards the teachers' corpus (Table 3), the first cluster represents the group of students as a tool for teaching professional skills, focusing on the group process in which relational dynamics are experienced; the second cluster focuses on clinical training, i.e. teaching skills marketable in the job market; the third cluster focuses on the teachers' institutional obligations, regardless of the students' training needs; and the fourth cluster represents student orientation as a way to support students in managing their academic training, regardless of professional skills. As regards the students' corpus (Table 4), in the first cluster good training involves the students' adherence to lesson tasks, regardless of critical thinking about the theoretical model proposed; in the second cluster, learning professional skills is strictly connected to the ability to grasp and respond to market demand; the third cluster reflects the relevance of belonging to a group of colleagues that supports the construction of a professional identity which, unfortunately, seems unconnected to the development of professional skills; and the fourth cluster represents professional training as a process in which the achievement of the degree is the main goal, regardless of job market demand.
Table 4  Students’ Cluster (the percentage of context units classified in the cluster is reported between brackets) Cluster 1 (27,6%) Idealized Product CU keyword esperienza 116 triennale 44 percorso 43 professione 41 università 37 possibilità 35 capire 33 diverso 31 senso 30 vivere 25 Cluster 2 (20,8%) Professional Education keyword CU pensare 89 esame 71 psicologia 65 seguire 55 realtà 55 vedere 55 iniziare 53 triennale 53 lavoro 44 interessante 44 Cluster 3 (26,3) Group Identity keyword CU scelta 154 studiare 153 frequentare 104 rapporto 102 piacere 98 colleghi 97 parlare 74 organizzare 68 domanda 55 aggiungere 36 Cluster 4 (25,3%) Empty Degree keyword CU vivere 26 trovare 85 tesi 20 sentire 91 riuscire 30 prendere 33 persone 105 maniera 23 livello 35 laboratorio 18 CU = context units classified in the cluster. Students and teachers seem to have similar representations of the training process: the academic need of building a network, highlighted by the students’ cluster on group identity, and the teachers’ cluster on training group and student orientation; the relevance of achieving a qualification, highlighted by the students’ cluster on empty degree and the teachers’ cluster on institutional obligation; and the development of professional skills marketable in the job market reflected by the teachers’ cluster on clinical training and the JADT’ 18 163 students’ cluster on professional education in line with what it was found by Carli and colleagues (2004) and Paniccia and colleagues (2009) by means of a similar methodology, the emotional textual analysis (Carli et al., 2016). The awareness of the psychological demand of the labour market is an indicator of the professional training process’s effectiveness. Nevertheless, students and teachers split the academic achievement from the development of professional skills. This could be a critical aspect, possibly explaining young graduates’ difficulty in entering the job market, focusing more on academic context rather than on market demand. As a consequence, during the training process, students do not develop the connection between professional training (what they are learning) and professional skills (what they are going to do in the future). 5. Conclusion Although the study results could not be generalized, due to the participants’ selection criteria and the methodology we used, they highlight professional training representation characteristics, which are the elements influencing the rate of unemployment among young psychologists. Even though it is not possible to quantify the relevance of the characteristics of the representation, the emotional text mining, allowing for the identification of the words association explanatory of the education representation, allows for hypotheses definition and the identification of the resources and the issues pertaining the professional training in a specific context. The interpretation of the text mining results lets the social unconscious process emerge, setting the education useful to defining the type of psychological intervention able to support the representation transformation toward a more effective training process. In this particular case study, the intervention would aim to develop the connection between professional qualification achievement and the professional skills development, which are currently split. References Anpal Servizi (2017), L’inserimento occupazionale dei laureati in psicologia, dell’università La Sapienza di Roma, Direzione e studi analisi statistica - SAS. Carli R. 
Carli R. (1990). Il processo di collusione nelle rappresentazioni sociali. Rivista di Psicologia Clinica, 4: 282-296.
Carli R., Dolcetti F. and Dolcetti (2004). L'Analisi Emozionale del Testo (AET): un caso di verifica nella formazione professionale. In Purnelle G., Fairon C. and Dister A., editors, Actes JADT 2004: 7es Journées internationales d'Analyse statistique des Données Textuelles, pp. 250-261.
Carli R., Paniccia R.M., Giovagnoli F., Carbone A. and Bucci F. (2016). Emotional Textual Analysis. In L. A. Jason and D. S. Glenwick, editors, Handbook of methodological approaches to community-based research: qualitative, quantitative, and mixed methods. Oxford University Press.
Cordella B., Greco F. and Raso A. (2014). Lavorare con corpus di piccole dimensioni in psicologia clinica: una proposta per la preparazione e l'analisi dei dati. In Nee E., Daube M., Valette M. and Fleury S., editors, Actes JADT 2014 (12es Journées internationales d'Analyse statistique des Données Textuelles, Paris, France), pp. 173-184.
Fornari F. (1976). Simbolo e codice: dal processo psicoanalitico all'analisi istituzionale. Feltrinelli.
Greco F. (2016). Integrare la disabilità. Una metodologia interdisciplinare per leggere il cambiamento culturale. Franco Angeli.
ISTAT (2017). Rapporto annuale 2017. ISTAT.
Lancia F. (2017). User's Manual: Tools for text analysis. T-Lab version Plus 2017.
Lebart L. and Salem A. (1994). Statistique Textuelle. Dunod.
Matte Blanco I. (1981). L'inconscio come insiemi infiniti: saggio sulla bi-logica. Einaudi.
McKinsey & Company (2014). Studio ergo Lavoro: come facilitare la transizione scuola-lavoro per ridurre in modo strutturale la disoccupazione giovanile in Italia. McKinsey & Company, https://www.mckinsey.it/file/2785/download?token=a3VfesjU.
Moscovici S. (2005). Le rappresentazioni sociali. Il Mulino.
Paniccia R.M., Giovagnoli F., Giuliano S., Terenzi V., Bonavita V., Bucci F., Dolcetti F., Scalabrella F. and Carli R. (2009). Cultura locale e soddisfazione degli studenti di psicologia: una indagine sul corso di laurea "intervento clinico" alla Facoltà di Psicologia 1 dell'Università di Roma "Sapienza". Rivista di Psicologia Clinica, Supplemento n. 1: 1-49.
Salvatore S. and Freda M. F. (2011). Affect, unconscious and sensemaking: a psychodynamic, semiotic and dialogic model. New Ideas in Psychology, vol. 29, pp. 119-135.
Savaresi S. M. and Boley D. L. (2004). A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4): 345-362.

Profiling Elena Ferrante: a Look Beyond Novels

Michele A. Cortelazzo (1), George K. Mikros (2), Arjuna Tuzzi (3)
(1) University of Padova – cortmic@unipd.it
(2) National and Kapodistrian University of Athens – gmikros@isll.uoa.gr
(3) University of Padova – arjuna.tuzzi@unipd.it

Abstract

Elena Ferrante is a rather peculiar editorial and journalistic phenomenon: today she enjoys a wide international audience, yet there is surprisingly little scientific literature discussing her works. Since Elena Ferrante is the pseudonym of an anonymous writer, some investigators have already pursued her real identity and, at the moment, the main suspects that have emerged are Domenico Starnone, Marcella Marmo and Anita Raja.
The corpora collected to analyse Elena Ferrante's works and compare them with those of other authors are usually composed of novels; however, Marcella Marmo and Anita Raja are not novelists, and their works do not belong to genres comparable with novels. One of Elena Ferrante's books, La Frantumaglia, makes it possible to collect corpora of texts of different genres (letters, essays, interviews, etc.), which may include texts by authors never taken into consideration in research based on novelists. Nevertheless, these texts raise specific questions concerning their exploitability in traditional authorship attribution procedures, owing to their limited size. This study works on a corpus of texts other than novels by means of a machine learning approach, within the framework of methods for authorship attribution and profiling.

Keywords: authorship attribution, machine learning, profiling, stylometry, support vector machine

1. Introduction

In previous works, the novels signed by Elena Ferrante have been studied within the panorama of contemporary Italian literature, and these studies have shown that this author has a peculiar writing style with relevant individual traits. Moreover, in previous investigations the Italian writer showing the highest level of similarity with Elena Ferrante was Domenico Starnone (Galella, 2005; 2006; Gatto, 2016; Cortelazzo and Tuzzi, 2017; Tuzzi and Cortelazzo, 2018). In this study we aim to test further hypotheses and look at texts that do not belong to the genre of the novel. In this way we can consider new candidates for authorship attribution and profiling experiments, i.e. writers who are not exclusively novelists. A first reference can be made to Marcella Marmo and Anita Raja, two Italian women who have been suspected of being the hand hiding behind the pen name of Elena Ferrante, respectively by Marco Santagata (2016) and by Claudio Gatti (2016).
The corpus collected for this new study focuses on three main suspects (Marcella Marmo, Anita Raja, Domenico Starnone) and includes further suspected authors (Goffredo Fofi, Mario Martone, Valeria Parrella, Francesco Piccolo), authors who in previous analyses showed some common traits with Elena Ferrante's works (Gianrico Carofiglio, Clara Sereni), an author who provocatively claimed to be Elena Ferrante (Laura Buffoni), and members of the E/O publishing house (Sandro Ferri, Sandra Ozzola and the editorial board that is supposed to be the collective editor of the publisher's web pages).

2. Corpus

The corpus includes letters, interviews and further material written by different authors (Table 1) that can be compared with the texts included in the book La Frantumaglia by Elena Ferrante (2016). An innovative perspective has been adopted for analysing the texts: a machine learning (ML) approach based on a support vector machine (SVM) method that takes into consideration 13 authors for a classical authorship attribution (AA) task and different variables (gender, age, geographical area) for profiling tasks. The whole corpus adopted for this study is composed of 113 texts and includes 143,695 word tokens and 19,020 word types. In the classical ML perspective, the corpus is arranged into two groups: a "training set" and a "testing set". The training corpus (Table 1) includes 86 texts (87,458 word tokens), 78 written by 12 authors and 8 by a collective subject (E/O) that represents the editorial staff of the E/O publishing house. The corpus is balanced in terms of gender and partly balanced for age and geographical area (Table 2). Information about gender and age is not available (n.a.) for E/O, as it is presumed to be a group. The testing corpus includes 27 texts (6 essays, 7 interviews, 14 letters, for a total of 56,237 word tokens) signed by Elena Ferrante and collected in her book La Frantumaglia. Five texts are chapters of one large essay written as an answer to Giuliana Olivero and Camilla Valletti's questions (Ferrante, 2016).

Table 1: Authors and categories of texts included in the training corpus.

Author                 texts    tokens
Laura Buffoni              3     4,477
Gianrico Carofiglio        6     4,940
E/O                        8     3,955
Sandro Ferri               2     3,838
Goffredo Fofi              9     7,378
Marcella Marmo             5    12,991
Mario Martone             10     9,320
Sandra Ozzola              4     1,879
Valeria Parrella           7     4,676
Francesco Piccolo          6     5,529
Anita Raja                 4    13,617
Clara Sereni               2     2,271
Domenico Starnone         20    12,587
Tot                       86    87,458

Category     texts    tokens
article         53    42,124
essay            9    22,926
interview       12    15,480
letter           4     1,611
web              8     5,317
Tot             86    87,458

Since most stylometric measures and linguistic features are heavily influenced by text size, we decided to split our texts into equal-sized chunks. Both the training and the testing corpus were segmented into chunks of 200 words. After the chunking procedure, the training corpus inflated from 86 texts to 386 chunks of 200 words, and the testing corpus from 27 texts to 259 chunks of 200 word tokens. This enlargement also had the positive effect of making our sample space larger, giving us the opportunity to use a wider spectrum of linguistic features.

Table 2: Descriptive variables of the texts included in the training corpus.

Gender         authors    texts    tokens
n.a.                 1        8     3,955
f                    6       25    39,911
m                    6       53    43,592
Tot                 13       86    87,458

Age            authors    texts    tokens
n.a.                 1        8     3,955
>60 (old)            7       46    54,561
≤60 (young)          5       32    28,942
Tot                 13       86    87,458

Naples area    authors    texts    tokens
Naples               6       52    58,720
NoNaples             7       34    28,738
Tot                 13       86    87,458
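The 200-word chunking step described above can be sketched as follows; the whitespace tokenizer is a simplifying assumption, not necessarily the segmentation used by the authors.

def chunk_text(text, size=200):
    # split a text into consecutive fixed-size chunks; the trailing
    # remainder is dropped so that every chunk has the same length
    words = text.split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - size + 1, size)]

Dropping the remainder keeps all samples strictly comparable, at the cost of discarding up to 199 words per text.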
Table 2. Descriptive variables of the texts included in the training corpus

  Gender  authors  texts  tokens    Age          authors  texts  tokens    Naples area  authors  texts  tokens
  n.a.       1       8     3,955    n.a.            1       8     3,955    Naples           6      52    58,720
  f          6      25    39,911    >60 (old)       7      46    54,561    not Naples       7      34    28,738
  m          6      53    43,592    <=60 (young)    5      32    28,942
  Tot       13      86    87,458    Tot            13      86    87,458    Tot             13      86    87,458

3. Method
To investigate our research aims, we developed a feature-rich document representation model comprising the following feature groups: 1) Author Multilevel N-gram Profiles (AMNP): 1,500 features, 500 for each n-gram category (2-grams and 3-grams at the character level, and 2-grams at the word level); 2) Most Frequent Words in the corpus (MFW): 500 features. The first feature group (AMNP) provides a robust document representation which is language independent and able to capture various aspects of stylistic textual information. It has been used effectively in authorship attribution problems (Mikros et Perifanos, 2011; 2013) and in gender identification on longer texts (e.g. blog posts, cf. Mikros, 2013). AMNP consists of n-grams of increasing order at both the character and the word level. Since character and word n-grams capture different linguistic entities and function complementarily, we constructed a combined profile of character 2-grams and 3-grams and word 2-grams. For each n-gram we calculated its normalized frequency in the corpus and included the 500 most frequent entries per category, resulting in a combined vector of 1,500 features. The second feature group (MFW) can be considered a classic of the stylometric tradition: it rests on the idea that the most frequent words belong to the class of function words and are beyond the conscious control of the author, thus revealing his or her stylometric fingerprint. In this study we used the 500 most frequent words of the corpus. The features described above were used to train a machine learning classification algorithm, Support Vector Machines (SVM; Vapnik, 1995), in both a standard authorship classification task and three different author profiling tasks (author's gender, age, and geographical area). SVM is considered a state-of-the-art algorithm for text classification: it constructs hyperplanes in the feature space in order to provide a linear solution to the classification problem. In our trials we experimented with various kernels and ended up choosing the polynomial one, as it was the most accurate on our dataset. All the statistical models developed were evaluated using 10-fold cross-validation (90% training set – 10% testing set), and the reported accuracies are the means of the accuracies obtained in each fold. Since the feature space was sparse, we eliminated all features showing a variance close to zero, using the two following rules: the percentage of unique values was less than 20%, and the ratio of the most frequent to the second most frequent value was greater than 20. The near-zero variance feature removal shrank the number of employed features by 47.4% (from the initial 2,000 available features we kept 1,052).
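A minimal sketch of such a pipeline, assuming scikit-learn (the paper does not name its software; the L1-normalized term frequencies stand in for the paper's "normalized frequency" weighting, and the near-zero-variance filter, whose rules echo R's caret package, is omitted here):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.svm import SVC

    def ngram_block(analyzer, n):
        # 500 most frequent n-grams, weighted by normalized frequency.
        return TfidfVectorizer(analyzer=analyzer, ngram_range=(n, n),
                               use_idf=False, norm="l1", max_features=500)

    features = FeatureUnion([
        ("char2", ngram_block("char", 2)),   # AMNP: character 2-grams
        ("char3", ngram_block("char", 3)),   # AMNP: character 3-grams
        ("word2", ngram_block("word", 2)),   # AMNP: word 2-grams
        ("mfw", ngram_block("word", 1)),     # 500 most frequent words
    ])
    pipe = Pipeline([("features", features),
                     ("svm", SVC(kernel="poly"))])  # polynomial kernel

    # With the real data, `texts` would be the 386 training chunks and
    # `authors` their 13 labels; the paper evaluates via 10-fold CV, e.g.
    # cross_val_score(pipe, texts, authors, cv=10).mean()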
4. Results
4.1. Authorship Attribution Results
For the standard authorship classification task (tab. 3), we first worked with the whole corpus as training dataset and obtained an average accuracy of 0.7098 (71%). Among the set of 13 candidates included in the corpus, a large share of the testing text chunks were attributed to Domenico Starnone (32%), Anita Raja (21%) and Mario Martone (21%).

Table 3. Attribution of the text chunks included in the testing corpus (whole and reduced corpus)

  Whole corpus                       Reduced corpus
  Author       chunks     %          Author         chunks     %
  Starnone        84     32%         Starnone          115    44%
  Raja            55     21%         Raja               73    28%
  Martone         55     21%         Martone            39    15%
  E/O             18      7%         E/O enlarged       32    12%
  Buffoni         16      6%         Tot               259   100%
  Parrella        15      6%
  Fofi             7      3%
  Piccolo          3      1%
  Carofiglio       2      1%
  Ferri            2      1%
  Marmo            2      1%
  Ozzola           0      0%
  Tot            259    100%

We deemed it useful to reduce the candidates to Starnone, Raja and Martone and to rearrange the E/O collective author into a new, enlarged version of the E/O group, pooling together all the members of the E/O publishing house (Sandro Ferri, Sandra Ozzola and the E/O staff). As an effect of this selection we obtained an improvement in the performance of the ML algorithm (+13%), with the average accuracy rising to 0.8408 (84%). With reference to this reduced version of the training corpus, which includes only four candidates, again most text chunks seem to belong to Domenico Starnone (44%) and Anita Raja (28%).

Table 4. Cross-classification matrix for the authorship attribution task (rows: whole corpus; columns: reduced corpus)

               Starnone   Raja   Martone   E/O enlarged    Tot
  Starnone         77       2        5           0           84
  Raja              3      48        0           4           55
  Martone          14       2       30           9           55
  E/O               1       2        0          15           18
  Buffoni           6       5        2           3           16
  Parrella          8       7        0           0           15
  Fofi              4       3        0           0            7
  Piccolo           2       0        0           1            3
  Carofiglio        0       2        0           0            2
  Ferri             0       0        2           0            2
  Marmo             0       2        0           0            2
  Ozzola            0       0        0           0            0
  Tot             115      73       39          32          259

From a cross-comparison of the results achieved with the whole and reduced versions of the training corpus (tab. 4), we observed that the text chunks attributed to Domenico Starnone and Anita Raja proved more stable and consistent, compared to the more unstable and weak role of Mario Martone. The existence of an action of the publishing house was confirmed in both versions, although in some cases a confusion of the E/O editors with Starnone's and Raja's hands is somewhat visible.

4.2. Profiling Results
The results achieved in the profiling tasks are more schematic, since the algorithm works with simpler dichotomous variables (tab. 5). With respect to gender, the ML algorithm obtained an average accuracy of 0.8000 (80%), and the automatic classification of the text chunks of the testing corpus suggested that the fragments of La Frantumaglia may come from different hands: at least a man (54%) and a woman (46%). Compared with gender profiling, the ML algorithm achieved a similar accuracy in the classification by age (0.8027, 80%) and geographical area (0.7850, 78%), and for the most part the text chunks appeared to be written by an older author (76%) from Naples (90%).

Table 5. Profiling of the text chunks included in the testing corpus

  Gender  chunks     %       Age          chunks     %       Naples area  chunks     %
  m         141     54%      >60 (old)      197     76%      Naples          233    90%
  f         118     46%      <=60 (young)    62     24%      not Naples       26    10%
  Tot       259    100%      Tot            259    100%      Tot             259   100%

5. Discussion and conclusions
Among the limitations and constraints of this method, first and foremost we have to take into account that the corpus mixes different genres (essays, interviews, newspaper articles, letters), and this surely affects our results. Texts show similarities when they are written by the same author or when they belong to the same genre, and these two effects are not easy to disentangle in our corpus.
Secondly, when the SVM is called to assign testing chunks to authors and/or categories, it always produces an attribution resulting from the formula generated by the ML algorithm (in other words, it never answers "do not know"). Results depend both on the quality of the texts and on the range of options offered during the training phase. As a consequence, we have to refer to the accuracy of the model and consider the classification as the best attribution among the options given by the set of reasonable candidates and available categories. Thirdly, La Frantumaglia represents an interesting set of texts signed by Elena Ferrante that are not ascribed to the genre of novels, and it enables new analyses comparing and contrasting the author's writing style with that of authors who are not strictly novelists. Nevertheless, we cannot be sure that all the texts included in La Frantumaglia were written by the same hand and, moreover, we do not know whether these texts were written by the author who actually wrote the novels signed by Elena Ferrante. From the authorship attribution viewpoint, more than one hand emerged as likely, and we can formulate some hypotheses. If we consider only the main suspects mentioned in the Introduction, Domenico Starnone and Anita Raja are confirmed; on the contrary, Marcella Marmo seems implausible. Mario Martone's role is an interesting suggestion, since the similarities of chunks taken from La Frantumaglia with his texts might be the indirect outcome of direct interactions between Martone and Ferrante (e.g. letters and interviews in which they both speak about the movie L'amore molesto). The role of the E/O staff is also intriguing, as it is easy to imagine the effect on writing style of one or more editors working as proofreaders, copyreaders and ghostwriters when Elena Ferrante has to answer the many interviews and letters collected by the publishing house. From the profiling experiments a composite picture of La Frantumaglia emerges: the procedure reveals once more the existence of different hands, suggests the involvement of at least a man and a woman, and draws the portrait of an author (single or collective) from Naples who is over 60 years old. Does the mystery about Elena Ferrante's work remain a mystery?

Acknowledgements
We thank Arianna Menin for providing us with the corpus of texts of La Frantumaglia collected for her three-year B.A. thesis in Communication (University of Padova, a.y. 2016/2017, supervisor prof. Arjuna Tuzzi).

References
Cortelazzo, M.A. and Tuzzi, A. (2017). Sulle tracce di Elena Ferrante: questioni di metodo e primi risultati. In Palumbo, G. (ed.), Testi, corpora, confronti interlinguistici: approcci qualitativi e quantitativi. Trieste: EUT – Edizioni Università di Trieste, pp. 11-25.
Ferrante, E. (2016). La Frantumaglia. Roma: E/O.
Galella, L. (2005). Ferrante-Starnone. Un amore molesto in via Gemito. La Stampa, 16 January 2005, p. 27.
Galella, L. (2006). Ferrante è Starnone. Parola di computer. L'Unità, 23 November 2006.
Gatti, C. (2016). Elena Ferrante, le «tracce» dell'autrice identificata. Il Sole 24 Ore – Domenica, 2 October 2016, pp. 1-2.
Gatto, S. (2016). Una biografia, due autofiction. Ferrante-Starnone: cancellare le tracce. Lo Specchio di carta. Osservatorio sul romanzo italiano contemporaneo, 22 October 2016. www.lospecchiodicarta.it
Mikros, G.K. (2013). Authorship Attribution and Gender Identification in Greek Blogs. In Obradović, I., Kelih, E. and Köhler, R.
(eds.), Selected Papers of the VIIIth International Conference on Quantitative Linguistics (QUALICO), Belgrade, Serbia, April 16-19, 2012. Belgrade: Academic Mind, pp. 21-32.
Mikros, G.K. and Perifanos, K. (2011). Authorship identification in large email collections: Experiments using features that belong to different linguistic levels. In Proceedings of the PAN 2011 Lab, Uncovering Plagiarism, Authorship, and Social Software Misuse, held in conjunction with the CLEF 2011 Conference on Multilingual and Multimodal Information Access Evaluation, 19-22 September 2011, Amsterdam.
Mikros, G.K. and Perifanos, K. (2013). Authorship attribution in Greek tweets using multilevel author's n-gram profiles. In Hovy, E., Markman, V., Martell, C.H. and Uthus, D. (eds.), Papers from the 2013 AAAI Spring Symposium "Analyzing Microtext", 25-27 March 2013, Stanford, California. Palo Alto, California: AAAI Press, pp. 17-23.
Santagata, M. (2016). Elena Ferrante è …. La lettura – Corriere della Sera, 13 March 2016, pp. 2-5.
Tuzzi, A. and Cortelazzo, M.A. (2018). What is Elena Ferrante? A Comparative Analysis of a Secretive Bestselling Italian Writer. Digital Scholarship in the Humanities (online first version).
Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer-Verlag.

Word Embeddings: a Powerful Tool for Innovative Statistics at Istat
Fabrizio De Fausti, Massimo De Cubellis, Diego Zardetto
ISTAT – Italian National Institute of Statistics – (defausti, decubell, zardetto)@istat.it

Abstract 1
In recent years, word embedding models have proven useful in many Natural Language Processing problems. These models are generated by unsupervised learning algorithms (like Word2Vec and GloVe) trained on very large text corpora. Their main purpose is to map words to vectors of a metric space in such a way that the resulting numeric representation of input texts effectively captures and preserves a wide range of semantic and syntactic relationships between words. In this paper we discuss word embedding models generated from huge corpora of raw text in Italian, and we propose an original graph-based methodology to explore, analyze and visualize the structure of the learned embedding spaces.

Abstract 2
This work illustrates the potential of Word Embedding models in the analysis of large collections of textual data and proposes an original graph-based method for exploring the semantic structure captured by the models.

Keywords: Word Embeddings, Word2Vec, Graphs, Text Summarization, Italian Tweets, NLP.

1. Introduction
Word embedding models represent a powerful tool whose output can feed subsequent machine learning tasks, like text classification, topic modeling and document similarity. This work shows how we built, tested and used word embedding models (based on the Word2Vec algorithm, see Section 2.1) to achieve the following objectives:
• Istat is currently collecting streaming Twitter data on a large scale. Word embedding models helped us devise domain-specific 'filters', namely sets of keywords that we used to filter out off-topic tweets with respect to the intended statistical production goal. Here we show the case of the so-called "Europe filter", meant to measure people's mood about the European Union.
• Istat is currently exploiting textual data automatically scraped from the websites of Italian enterprises in order to predict whether or not they perform e-commerce.
Given the huge corpus of noisy and unstructured texts derived from this web-scraping procedure, word embedding models allowed us: (i) to automatically create an "e-commerce pseudo-ontology" and to smartly summarize the input texts; (ii) to encode the summarized texts into a rich numeric representation in order to feed a Deep Learning classifier.

2. Methodology
In recent years, new successful algorithms for natural language modeling have been proposed, based on Neural Networks (e.g. Word2Vec and GloVe). Starting from very large corpora of raw text, these algorithms are able to create models that map words to low-dimensional vector spaces, called word embeddings (Mikolov et al., 2013a). Although these algorithms do not rely on any linguistic domain knowledge, nor on handcrafted syntactic and semantic relationships between words, they are surprisingly able to learn such relationships from raw data. Indeed, words that are strongly related from a syntactic and/or semantic point of view are mapped to vectors that are almost parallel to each other; conversely, words that are syntactically and/or semantically loosely related are mapped to nearly perpendicular vectors. Moreover, these models perform amazingly well when it comes to solving analogies between words, just as a human would. For example, if one asks a trained word embedding model «which word X completes the analogy [ 'Paris' : 'France' = 'Madrid' : X ]», the answer will very likely be X = 'Spain'. We mention here only one type of relationship (capital-nation), but word embedding models are able to capture a wide variety of relationships, such as male-female, singular-plural, superlative-comparative, synonym-antonym, politician-party, etc.

2.1 Word2Vec
Word2Vec (Mikolov et al., 2013b) is one of the most influential word embedding algorithms. It consists of a neural network trained to solve a predictive problem according to one of two approaches: predicting the central word given the other words of a context (Cbow), or predicting the words of the context given the central word (Skipgram). At the end of training, the predictive ability of the network is not used; instead, its internal structure (the weights of the network) is exploited to represent the coordinates of each word of the dictionary in the embedding space. While a large text corpus is the main input to Word2Vec, the algorithm also exposes several hyperparameters which can be tuned to improve the quality of the learned model. Some scholars (e.g. Levy et al., 2015) consider these hyperparameters key to understanding Word2Vec's superiority over previous language modeling techniques. The main hyperparameters of Word2Vec are:
• Embedding space dimension: the dimension of the vector space to which the words of the corpus are mapped;
• Window size: the width of the sliding window used to process the corpus, which defines how large the context is;
• Iterations: how many times the weights of the neural network are updated during training;
• Learning model: the approach used to train the neural network, either Cbow or Skipgram.
Of course, further factors affect the performance of a Word2Vec model:
• Size of the corpus: bigger corpora perform better than small ones;
• Quality of the corpus: very noisy, fragmented and poorly curated texts generally produce lower-quality embedding spaces.
At the end of the training phase, the quality of the learned word embedding model can be assessed through standard test functions.
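As an illustration, training such a model might look as follows with the open-source gensim library (an assumption: the paper does not state which implementation was used; the file names and hyperparameter values are merely examples):

    # Requires: pip install gensim (v4 API assumed)
    from gensim.models import Word2Vec

    # One pre-tokenized document (e.g. a tweet) per line of a text file.
    with open("tweets.txt", encoding="utf8") as f:
        sentences = [line.split() for line in f]

    model = Word2Vec(
        sentences,
        vector_size=200,   # embedding space dimension
        window=8,          # context window size
        epochs=15,         # training iterations
        sg=0,              # 0 = Cbow, 1 = Skipgram
    )
    model.wv.save("tweets.kv")  # keep only the learned word vectors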
Classical examples of such test functions are the word-similarity and word-analogy tests (see e.g. Pennington et al., 2014).

2.2 Exploring and visualizing big embedding models through graphs
As sketched in Section 2, word embedding algorithms transform words into vectors of a low-dimensional metric space. The dimension of this numeric space is usually set to values in the range 100-300 (see e.g. Mikolov et al., 2013a). When the input corpora are huge, taking into account inflected word forms, the output embedding model can contain hundreds of thousands of vectors. As a consequence, the full structure of the embedding model is very hard to analyze. Exploring and visualizing such models requires (i) reducing the dimensionality of the embedding space, and (ii) focusing on just a subset of vectors, namely those derived from the words most relevant for the analysis at hand. While traditional solutions exist for the first task, like PCA and t-SNE (van der Maaten et Hinton, 2008), no standard methods are available for the second. We propose here a new technique, based on graphs (Gibbons, 1985), that simultaneously addresses both needs: it selects just a subset of relevant words, adopting a filtering criterion based on their semantic proximity, and allows visualizing the resulting sub-model as a two-dimensional graph.

2.3 Building the graphs
Given a "node" vector/word v in the embedding space, let us define the base graph of width W, G_W(v). To build G_W(v), we connect v to its W nearest vectors/words in the embedding space (cosine distance is used). The base graph will thus have W + 1 nodes. Node v can be either the image of an actual word w, i.e. v = V(w), or the vector resulting from the sum of multiple words, say w1 and w2, i.e. v = V(w1) + V(w2). The idea is that, within the embedding space, the sum of word vectors can be exploited to disambiguate the meaning of polysemous words. An example is provided in Table 1, where the 5 words closest to the vector V('rome') are reported in the left panel, and the 5 words closest to the vector V('rome') + V('colosseum') + V('ancient') are reported in the right panel. Evidently, the addition of the words 'colosseum' and 'ancient' to the polysemous word 'rome' moves the semantic area explored by the base graph from a geographical to a historical sense.

Table 1. Word disambiguation by sum of vectors: the polysemous word is 'rome'.

  Closest 5 words   Cosine         Closest 5 words to V(rome)     Cosine
  to V(rome)        similarity     + V(colosseum) + V(ancient)    similarity
  turin             0.6818         roman                          0.5822
  palermo           0.6377         archeological                  0.5318
  naples            0.6212         pompei                         0.5250
  milan             0.6129         trastevere                     0.5217
  bologna           0.5857         trajan                         0.5189

Our approach builds a full output graph by iteratively combining N base graphs G_W. We devised three different methods to combine base graphs according to different exploration strategies, called Geometric, Linear and Geometric-Oriented; the corresponding pseudo-codes are provided in Table 2. Besides the width parameter W and the number of iterations N, all three methods require as input a set of seed words [seeds] to define the starting point for the exploration of the embedding model.

Table 2. Pseudo-codes of the proposed graph generation methods. Function find_leaves() returns all the nodes with zero outdegree; function shortestPath() calculates the shortest path between two nodes.
    Geometric([seeds], N, W):
        v = V(seed1) + V(seed2) + ...
        G_W(v)
        for iteration in [1, ..., N]:
            for leaf in find_leaves():
                G_W(V(leaf))

    Linear([seeds], N, W):
        v = V(seed1) + V(seed2) + ...
        G_W(v)
        for iteration in [1, ..., N]:
            for leaf in find_leaves():
                virtualNode_leaf = 0
                addEdge(leaf, virtualNode_leaf)
                for node in shortestPath(v, leaf):
                    virtualNode_leaf = virtualNode_leaf + node
                G_W(virtualNode_leaf)

    Geometric-Oriented([seeds], N, W):
        v = V(seed1) + V(seed2) + ...
        G_W(v)
        for i in [1, ..., N]:
            virtualNode_i = 0
            for leaf in find_leaves():
                addEdge(leaf, virtualNode_i)
                virtualNode_i = virtualNode_i + V(leaf)
            G_W(virtualNode_i)

As will be shown in Section 3, the Geometric method tends to expand the exploration range very quickly, rapidly losing the initial semantic focus provided by the seed words; the Linear method stays much more focused, but explores just a narrow sub-model; the Geometric-Oriented method provides a satisfactory compromise between the other two.

3. Application
3.1 Building word embedding models on large corpora of Italian tweets
Istat is currently collecting streaming Twitter data on a large scale. Italian tweets are captured provided that they pass at least one active 'filter'. Filters are simply sets of keywords deemed relevant for specific statistical production goals. For instance, the 'Social Mood on Economy' filter involves 60 keywords borrowed from the questionnaire of the Italian Consumer Confidence Survey, and collects about 40,000 tweets per day. We used a large collection of about 100 million Italian tweets to train Word2Vec with different settings of hyperparameters, thereby generating different embedding models. We subsequently analyzed the obtained models and tested their quality as discussed in Section 3.1.2. This way we managed to identify the best-performing set of hyperparameters to be used for the applications described in Sections 3.2 and 3.3.

3.1.1 Process
The data processing pipeline we implemented consists of the following steps:
• collection of Italian tweets through Twitter's streaming API as JSON files;
• parsing of the JSON files and storage of the tweets in a relational database;
• extraction from the database of the textual content of about 100 million tweets and export to a raw text file (corpus);
• preprocessing of the raw text (text cleaning and normalization);
• setting of the Word2Vec hyperparameters;
• training of Word2Vec on the tweets' corpus;
• testing of the learned word embedding model.

3.1.2 Benchmark and selection of the best hyperparameters
With the aim of identifying the best hyperparameters, we customized benchmark word-analogy tests contributed by Stanford University (Pennington et al., 2014), translating them into Italian and adding new word analogies involving specific terms of the Economics field. Note that our tests involved many groups of analogies, encoding a wide range of relationships between words, of both the syntactic and the semantic kind. As a measure of model goodness, we adopted the so-called "Top-1 accuracy" criterion: an analogy [a : b = c : x] is successfully solved by the learned model if and only if the closest (i.e. Top-1) embedding vector to V(c) - V(a) + V(b) is exactly V(x). We evaluated many output models generated by diverse settings of hyperparameters against our customized word-analogy tests, and eventually found the following optimal values: embedding space dimension = 200, window size = 8, iterations = 15, learning model = Cbow.
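A sketch of this Top-1 check, again assuming gensim vectors saved as above (the word lists are illustrative):

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load("tweets.kv")

    def solves_analogy(a, b, c, x):
        # Closest vector to V(c) - V(a) + V(b), input words excluded.
        best, _ = wv.most_similar(positive=[b, c], negative=[a], topn=1)[0]
        return best == x

    analogies = [("parigi", "francia", "madrid", "spagna")]  # (a, b, c, x)
    top1_accuracy = sum(solves_analogy(*t) for t in analogies) / len(analogies)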
3.2 Design of the "Europe" filter
As already mentioned in Section 3.1, Istat collects only the Italian tweets that match at least one active filter. So far, the keywords defining the filters have been designed by subject-matter experts. In this section, instead, we illustrate how word embedding models can be exploited to develop new filters automatically, in a data-driven way. The idea is to leverage our graph-based exploration methodology to select the best keywords, starting from a few relevant seed words. In particular, on the occasion of the 60th anniversary of the Treaties of Rome, our objective was to capture the sentiment of Italian Twitter users about the European Union. Figures 1 and 2 show the graphs resulting from the Geometric-Oriented and Geometric methods, respectively. Note that both graphs were generated using the same seed words, namely: 'europa', 'ue', 'bruxelles', 'europea', 'unione', 'euro'. The Geometric-Oriented graph appears more compact, and its words are indeed closely related to the semantic area of the seed words. The Geometric graph, instead, finds many more words, which are clearly grouped into coherent clusters and represent a valuable semantic enrichment of the original seeds. Given its richness, this second graph was considered by subject-matter experts a very good candidate to play the role of "Europe" filter.

Figure 1: Geometric-Oriented(['europa', 'ue', 'bruxelles', 'europea', 'unione', 'euro'], 8, 8)

3.3 Text Summarization and Encoding
One ongoing Istat Big Data project aims at exploiting textual data automatically scraped from the websites of Italian enterprises in order to predict whether or not they perform e-commerce. To address this task, Deep Learning techniques are being used. Since the input scraped texts are huge and Deep Learning algorithms are computationally intensive, a preliminary text summarization step is in order. Besides increasing efficiency, the summarization algorithm should hopefully improve accuracy by increasing the signal-to-noise ratio of the input data. Word embedding models allowed us to achieve this goal with a purely data-driven approach. To guide the summarization, we leveraged word embeddings trained on the whole web-scraped corpus. We used the Linear graph illustrated in Figure 3 to select a set of marker words with high discriminative power for the detection of e-commerce, adopting as initial seeds the words 'carrello', 'shopping', 'online'. (These marker words constitute what we called an "e-commerce pseudo-ontology" in the Introduction.) To summarize the texts, only the input sentences containing marker words were retained. This way, we obtained a 92.2% reduction of the original noisy text, along with a substantial improvement in the performance of the Deep Learning classifier (+20%, as compared to marker words defined by subject-matter experts). Lastly, we relied again on word embeddings to encode the summarized texts and feed the Deep Learning classifier. Once more, our experiments show that word embedding models outperform more traditional text encoding approaches, like bag-of-words.

Figure 2: Geometric(['europa', 'ue', 'bruxelles', 'europea', 'unione', 'euro'], 3, 8)

Figure 3: Linear(['shopping', 'online', 'carrello'], 11, 8)
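A minimal sketch of the marker-word summarization step described above (the marker set below is a short stand-in for the list actually selected through the Linear graph):

    import re

    markers = {"carrello", "shopping", "online"}  # hypothetical short list

    def summarize(text):
        """Keep only the sentences containing at least one marker word."""
        sentences = re.split(r"(?<=[.!?])\s+", text)
        return " ".join(s for s in sentences
                        if markers & set(re.findall(r"\w+", s.lower())))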
4. Conclusions
The techniques for dealing with large corpora of texts can greatly benefit from recent technological advancements, and Word Embeddings are an example of this opportunity. Extensive evidence shows that Word Embedding models are indeed superior to more traditional text encoding methods like, e.g., bag-of-words. Ongoing work on textual Big Data at Istat makes extensive use of these new approaches, with very promising results.

References
Mikolov, T., Yih, W. and Zweig, G. (2013a). Linguistic Regularities in Continuous Space Word Representations. Proceedings of NAACL-HLT 2013, pp. 746-751.
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013b). Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781.
Levy, O., Goldberg, Y. and Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, vol. 3: 211-225.
Pennington, J., Socher, R. and Manning, C.D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of EMNLP 2014, pp. 1532-1543.
van der Maaten, L.J.P. and Hinton, G.E. (2008). Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research, vol. 9: 2579-2605.
Gibbons, A. (1985). Algorithmic Graph Theory. Cambridge University Press.

Analisi di dati d'impresa disponibili online: un esempio di data science tratto dalla realtà economica dei siti di e-commerce
Viviana De Giorgi, Chiara Gnesi
Istat – degiorgi@istat.it; gnesi@istat.it

Abstract
This work describes the process of extracting, organising and analysing detailed information on firms that trade electronic equipment on the Alibaba.com site. The first part concerns the translation of unstructured information into variables organised in a statistical database, using size classes, indices, indicators and classifications. A company-product matching is realised by encoding a textual variable with an international classification, and an automated analysis is applied in order to explore, describe and analyse the corpus retrieved from the Internet. In the second part, a descriptive and econometric analysis shows that the demographic and economic information on enterprises available on Alibaba.com is highly significant for competitiveness on foreign markets.

Keywords: encoding, classification, textual analysis, regression model.

Sommario
This work develops a model to process, organise and analyse detailed information on the companies trading electronic equipment on the Alibaba.com portal. The first part concerns the process of transforming unstructured information into variables organised in a statistical database through the use of size classes, indices, indicators and classifications. A company-product matching was realised using an international classification, through the coding of a textual variable on which an automated analysis was applied in order to explore, describe and analyse the textual corpus retrieved from the Internet. The second part presents a descriptive and econometric analysis, whose results show that the Chinese portal carries demographic and economic information on companies that is highly significant for competitiveness on foreign markets.

Parole chiave: coding, classification, textual analysis, regression.

1. Introduction
This work grew out of ideas offered during the lectures of a university Master in Data Science (University of Rome Tor Vergata, Department of Enterprise Engineering "Mario Lucertini", academic year 2015/2016; we thank Francesco Borrelli, Valentina Talucci and Domenica Fioredistella Iezzi for their useful suggestions) and focuses on techniques for processing, managing and analysing data coming from freely accessible online sources.
L’approccio adottato è quello della singola impresa che vuole migliorare la propria competitività nel mercato di riferimento, analizzando i dati generati dai processi aziendali nel settore in cui è presente o mira a posizionarsi. A tal fine sono preziose le informazioni dettagliate e aggiornate sui volumi prodotti, transazioni, struttura e demografia delle imprese concorrenti, presenti nei siti di commercio elettronico. Il presente lavoro è stato sviluppato utilizzando i dati estratti attraverso un’intensa attività di web scraping dal portale Alibaba.com, con riferimento alle imprese operanti nel settore delle apparecchiature elettroniche. 2. Dai dati destrutturati alle variabili statistiche: costruzione del database Nel processo di trasformazione dell’informazione destrutturata acquisita online in variabili statistiche, un ruolo centrale riveste laclassificazione delle imprese a partire dal principale prodotto commercializzato. La variabile testuale – che corrisponde alla descrizione non codificata del prodotto commercializzato dalla società – è stata codificata secondo una classificazione di attività economica standardizzata a livello internazionale. Si è scelto l’elenco Prodcom con riferimento alle divisioni 26, 27 e 28, per un totale di 989 sottocategorie di prodotti3. L’attribuzione del codice Prodcom alla singola impresa è stata effettuata implementando un sistema di codifica ad hoc4 strutturato in step successivi. La fase iniziale consiste nella normalizzazione dei testi attraverso lo sviluppo Master universitario in Data Science, Università Tor Vergata, Dipartimento di Ingegneria dell’impresa "Mario Lucertini", anno accademico 2015/2016. Si ringraziano Francesco Borrelli, Valentina Talucci e Domenica Fioredistella Iezzi per gli utili suggerimenti. 2 L’acquisizione dei dati è stata effettuata nell’arco temporale che va dal 26 novembre 2016 al 7 gennaio 2017 dalla dott. Antonella Miele attraverso una attività di web scraping. I dati utilizzati sono relativi a 2.349 imprese presenti sul sito Alibaba.com e operanti nel settore delle apparecchiature elettroniche. 3http://ec.europa.eu/eurostat/ramon/nomenclatures/index.cfm?TargetUrl=LST_C LS_DLD&StrNom=PRD_2011&StrLanguageCode=EN&StrLayoutCode=HIERARCHI C# 4Non avendo a disposizione software già sviluppati utilizzabili, è stato implementato un sistema di codifica ad hoc utilizzando il software SAS. 1 JADT’ 18 185 di un parser applicato alla variabile testuale e alle descrizioni della classificazione utilizzata. Successivamente si è realizzato un matching tra i due campi, attraverso un algoritmo che identifica l’abbinamento tra stringhe, sfruttando il dizionario al massimo livello di dettaglio possibile5. Infine si è realizzato l’abbinamento impresa-prodotto, assegnando a ciascuna impresa un codice Prodcom che identifica univocamente il principale prodotto commercializzato6. Il sistema di codifica ha permesso la classificazione del 95% delle imprese: un 30% circa vende “computer e prodotti di elettronica e ottica, apparecchi elettromedicali, apparecchi di misurazione e orologi”, un quarto vende “apparecchiature elettroniche e apparecchiature per usodomestico non elettriche” e il 40% circa vende “apparecchiature elettriche diverse dalle precedenti” (Tavola 1). 
The analysis of residuals revealed that the main cause of failed matches is the presence on the Alibaba marketplace of highly specialised, cutting-edge products – electrical and otherwise – that are not present in the Prodcom list. Moreover, the abundance of acronyms, abbreviations and slang made the standardisation work particularly complex. After coding the textual variable, an automated analysis was applied to it in order to explore, describe and analyse the corpus retrieved from the Internet. To this end TaLTaC2 was used – a software for automatic text analysis in the twofold logic of Text Analysis and Text Mining, i.e. both analysis of the text and retrieval and extraction of information within it – which is particularly suited to the study of large unstructured textual data and of structured information linked to it. Textual analysis makes it possible to explore the structure of the text both as a corpus – a collection of mutually comparable textual fragments – and in relation to the coding attributed to it. A first insight is offered by lexicometric measures, a set of statistical measures and indices computed on the vocabulary and its frequency classes (Bolasco, 1999). The corpus consists of 25,395 occurrences, corresponding to the total number of word forms counted as units (Giuliano, 2004). The size of the vocabulary, 4,363 distinct word forms, reflects the sectoral specificity of the analysis. Consistently, the percentage lexical extension index (17.2) and the normalised Guiraud index (27.4) confirm that the vocabulary size is affected by a bias due to the specificity of the companies analysed. Nevertheless, the sector features a rather diversified range of products, as suggested by the share of hapax legomena, 50.7% (Table 2).

Table 2. Lexicometric measures on the corpus

  Lexicometric measure                    Value
  Occurrences – N                        25,395
  Distinct word forms – V                 4,363
  Type/Token – (V/N)*100                   17.2
  % of hapax – (V1/V)*100                  50.7
  Overall mean frequency – N/V              5.8
  Guiraud's G – V/sqrt(N)                  27.4
  Coefficient a                             1.2
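The measures of Table 2 can be reproduced with a few lines of code (a sketch; `tokens` is assumed to be the list of word tokens of the corpus):

    import math
    from collections import Counter

    def lexicometric_measures(tokens):
        counts = Counter(tokens)
        N = len(tokens)                                  # occurrences
        V = len(counts)                                  # distinct word forms
        V1 = sum(1 for c in counts.values() if c == 1)   # hapax legomena
        return {
            "type/token %": 100 * V / N,
            "% hapax": 100 * V1 / V,
            "mean frequency N/V": N / V,
            "Guiraud G": V / math.sqrt(N),
        }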
In effetti, l’indice Term Frequency Inverse Document Frequency (TFIDF) calcolato anche sul dizionario, ha consentito di evidenziare le caratteristiche peculiari dei prodotti venduti dalle imprese rispetto al panorama delle stesse che commercializzano prodotti elettronici.Inoltre, attraverso il confronto tra le forme grafiche del 7A tal fine, si è utilizzato TaLTaC2, un software per l’analisi automatica del testo nella duplice logica di Text Analysis e di Text Mining (TM), quindi sia come analisi del testo che come recupero e estrazione di informazione all’interno dello stesso JADT’ 18 187 corpus e quelle del dizionario della Prodcom, si è potuto operare un controllo indiretto sulla qualità della codifica di cui al precedente paragrafo utilizzando lo scarto standardizzato come proxy di significatività8. Tale misura consente, infine, di caratterizzare le imprese rispetto alla peculiarità dei prodotti che le contraddistinguono all’interno del settore di riferimento (figura 1). Figura 1: Parole chiave del corpus in base allo scarto standardizzato Ulteriori elaborazioni sui dati reperiti dal sito hanno consentito la creazione di ulteriori variabili statistiche. Tra queste: tenure – una proxy dell’anzianità dell’impresa, costruita a partire dall’anno di iscrizione al portale; addetti e fatturato medi – a partire dal valore medio delle classi di riferimento; qualità – una variabile dummy che segnala la presenza di una certificazione di prodotto; propensione all’export – come quota percentuale di esportazioni sul fatturato ; ricerca e sviluppo – in termini di addetti medi impiegati nelle attività innovative; efficienza – capacità di risposta dell’impresa alle esigenze dei clienti. Il database finale è costituito da 18 variabili, che afferiscono all’Anagrafica dell’impresa, all’Attività economica, al Commercio estero, alla Dimensione economica, alla Competitività e alla Ricerca & Sviluppo. 3. Analisi descrittiva ed ecometrica dei dati Ai dati descritti precedentemente sono state applicate le tecniche largamente adottate della ricerca statistica: un’analisi descrittiva del collettivo di riferimento, un’analisi multivariata di tipo esplorativo per la ricerca delle variabili da utilizzare in un modello econometrico e un modello di regressione che tenga conto della specificità dei dati9. Si riportano di seguito i principali risultati. Si è utilizzata la formula classicadellamisura di specificità in cui fi* è la frequenzarelativadella forma graficanell’elencoProdcom. 9Le informazionisulleimpresepresentisulsitovengonoaggiornate, anche se non si sa bene quando e come, e l’informazione dell’anno di riferimento è presente talvolta e solo per alcune variabili (per esempio il fatturato) 8 188 JADT’ 18 Per tutti i settori di attività, più della metà delle imprese si dichiara produttrice e venditrice, forse perché tale caratteristica tende a essere un parametro di scelta da parte di chi deve acquistare. Sono per lo più imprese medio-grandi, giovani, che in genere interagiscono con i clienti, con alte percentuali di export sul valore del fatturato, con presenza di dipendenti dedicati alla ricerca e sviluppo, disponibilità del certificato dei prodotti venduti. Cumulano un volume di esportazioni maggiore dell’80% le imprese che hanno più di 50 dipendenti, oppure sono nelle classi più elevate di fatturato, oppure rispondono almeno all’80% di richieste dal sito, o infine si dichiarano produttrici dei prodotti venduti. 
L’analisi condotta, e quindi il modello di regressione studiato, riguarda la dipendenza che il volume di esportazioni ha con le variabili presenti nel data set. Al fine della scelta delle variabili da utilizzare nel modello è stata effettuata un’analisi cluster gerarchica (SAS Institute Inc., 1999), scegliendo la variabile con minimo valore di 1-R2ratio10, e individuando le seguenti variabili: la produttività d’impresa, la variabile dimensionale data dal numero dei dipendenti occupati in ricerca e sviluppo e le tre variabili categoriche percentuale di risposta a richieste, attività economica e tipologia d’impresa. Le prime due risultano avere nel proprio cluster, nella suddivisione in 5 gruppi, il valore minimo di 1-R2ratio; tra le variabili categoriche invece si evidenziano quelle aventi minore correlazione own cluster con le altre variabili. Il modello implementato consente di stimare i valori della variabile dipendente “volume delle esportazioni” sulla base dei valori assunti/osservati da/per alcune variabili indipendenti. Anche come conseguenza dei risultati dell’analisi cluster descritta precedentemente, si è scelto di includere tra queste: il fatturato per dipendente, il numero di dipendenti di ciascuna impresa, la tipologia di prodotto a 2 cifre, la percentuale di risposta alle richieste di possibili acquirenti, la quota di dipendenti d’impresa occupati in ricerca e sviluppo, la tipologia di impresa e il numero di anni di attività. È stato stimato il seguente modello di regressione lineare (Rencher e Schaalje, 2008): + dip+ dove: (a) è il logaritmo naturale del volume di esportazioni; (b) 101-R^2 ratio=(1-R^2 own cluster)/(1-R^2 nextclosest), dove own cluster=correlazione con il proprio gruppo di variabilie nextclosest=correlazione con il gruppo più vicino + JADT’ 18 189 è il logaritmo della produttività; (c) della percentuale di risposta; (d) è il logaritmo è il logaritmo della quota di è la tipologia di impresa, (f) dipendenti occupati in ricerca e sviluppo; (e) ate è la tipologia di prodotto, (g) dip è il numero dei dipendenti. In presenza di una variabile dipendente con distribuzione log-normale11, l’applicazione di una trasformazione logaritmica alla variabile dipendente e alle variabili indipendenti continue ha come primo obiettivo di ottenere una distribuzione assomigliante a quella di una normale. Ciò implica, per i modelli lineari, la possibilità di estensione di tale ipotesi distributiva anche ai residui (ε) del modello e quindi consente di condurre in modo corretto i necessari test di significatività sui coefficienti stimati. Inoltre, la contemporanea trasformazione logaritmica delle variabili indipendenti (continue) consente di interpretare i valori dei coefficienti stimati direttamente in termini di elasticità. L’introduzione della variabile dip2 è utile per verificare l’esistenza di eventuali relazioni non lineari tra dip e la dipendente, ovvero per capire se all’aumento del numero di dipendenti corrisponda una crescita delle esportazioni progressivamente superiore/inferiore. È stato inoltre studiato un secondo modello (modello2) introducendo l’interazione tra la quota di dipendenti occupati nella ricerca e sviluppo e la variabile categoriale relativa alla tipologia d’impresa. Tale scelta è coerente con l’idea che il livello di attività in ricerca e sviluppo possa rappresentare una fonte di valore aggiunto maggiore per le imprese che producono rispetto a quelle che vendono soltanto. 
The results reported in Table 3 can be summarised as follows: (1) the relation between the dependent variable and the productivity measure is significantly positive; a 1% variation in turnover per employee corresponds, on average, to a variation of more than 1% in export volume; (2) exports are also positively correlated with the response rate to requests from the site and with the number of years of activity (coefficients always significant); (3) the estimates of the two coefficients for company size show that size increases export volume (as expected), but at progressively decreasing rates as the number of employees grows (decreasing returns to scale); (4) differential effects seem to exist between export volume and the types of products sold by sector of economic activity, but the coefficients are not always significant; (5) the dummies for company type show coefficients never significantly different from zero in the absence of interaction with the R&D proxy (model 1); (6) when interacted (model 2), the two types producer and producer/seller have a positive effect on exports (relative to the reference category seller-only), and R&D intensity appears to significantly increase exports only for producing companies; (7) the R&D variable is in fact negatively correlated with the dependent variable for companies operating exclusively in trade, and positively for manufacturing companies or those that also sell.

Table 3. Estimated parameters of the linear model (model 1 and model 2)

  Variable                                  model 1     model 2
  ln(fattxdip)                              1.024***    1.025***
  resp                                      0.002***    0.002***
  num_anni                                  0.022***    0.023***
  ate26 (ref.)                                  –           –
  ate27                                    -0.067*     -0.061
  ate28                                    -0.094***   -0.085**
  others                                    0.067       0.063
  dip                                       0.012***    0.012***
  dip²                                     -0.001***   -0.001***
  type: seller (ref.)                           –           –
  type: producer                           -0.008       0.358***
  type: producer/seller                    -0.055       0.161*
  ln(dip_in_rd/dip)                        -0.097***        –
  ln(dip_in_rd/dip) × seller                    –      -0.212***
  ln(dip_in_rd/dip) × producer                  –       0.188***
  ln(dip_in_rd/dip) × producer/seller           –       0.125***
  Constant                                  2.084***    2.291***
  N                                         1,913       1,913
  adjusted R²                               0.866       0.865

  *p<0.1; **p<0.05; ***p<0.01

The density functions of the observed and estimated dependent variable both show an approximately normal distributional shape: no significant differences emerge between the two models, which both provide a good approximation.

References
Bolasco, S. (1999). L'analisi multidimensionale dei dati. Roma: Carocci.
Giuliano, L. (2004). L'analisi automatica dei dati testuali. Software e istruzioni per l'uso. Milano: LED.
Rencher, A.C. and Schaalje, G.B. (2008). Linear Models in Statistics. Second Edition. Wiley.
SAS Institute Inc. (1999). Logistic Regression Modeling Course Notes. Cary, NC: SAS Institute Inc., pp. 56-57.
The use of textual sources in Istat: an overview
Alessandro Capezzuoli, Francesca della Ratta, Stefania Macchia, Manuela Murgia, Monica Scannapieco, Diego Zardetto
ISTAT – Istituto Nazionale di Statistica – nome.cognome@istat.it
(This work comes from a common effort: paragraph 1.1 was written by Manuela Murgia and Stefania Macchia, par. 1.2 by Alessandro Capezzuoli, par. 2 by Francesca della Ratta, and par. 3 by Monica Scannapieco and Diego Zardetto.)

Abstract 1
Text Mining techniques allow a more widespread use of textual materials also in Official Statistics. We show implementations and current pilots realised in Istat, with a focus on both techniques and applications. Initially, text mining techniques were used to manage complex taxonomies or to analyse open-ended questions; more recently, Big Data frameworks have made it possible to expand the range of data sources, to merge several of them, and to reduce response burden.

Abstract 2
Text Mining techniques allow an extensive use of textual data also in Official Statistics. We describe the implementations and experiments carried out at Istat in this field, focusing on the techniques used and the applications developed. Initially, Text Mining was used to manage taxonomies or to perform textual analysis of open-ended answers, while more recently the Big Data context has made it possible to widen the sources used and to integrate them, also with a view to containing response burden.

Keywords: text mining, official statistics, sentiment analysis

1. Automatic coding and semantic search of taxonomies
The first use of text mining techniques in Italian official statistics was aimed at managing complex classifications. Classifications consist of structured lists of mutually exclusive concepts, corresponding to codes that allow a partition of the population. When the identification of the code corresponding to a concept presents no ambiguity, closed questions can be used, with lists of items from which the one matching the response is selected. On the other hand, when codes belong to classifications that are complex in terms of structure, criteria and hierarchies, managing the taxonomy is a very difficult task that requires knowledge of the classification. Consider, for example, the classification of Occupation: to identify the code corresponding to each occupation it is necessary to consider different aspects, such as the level of competences, their scope, or the activities managed. This paragraph describes how, with the evolution of technologies, this activity has been performed in different ways, using different software tools.

1.1. Automatic coding
Until a few years ago, statistical survey questionnaires rarely used open questions allowing textual answers, because of the difficulty of processing them to provide a measure of the phenomenon. For some variables, however, this often could not be avoided: occupation, economic activity and education level must necessarily be coded according to official classifications, for either national or cross-national data comparison. In the past, verbal responses were coded manually, but this was very time-consuming, costly and error-prone, especially for large amounts of data (Macchia et Murgia, 2002).
For this reason Istat decided to adopt automated coding systems, which consist of two main parts: (i) a database (dictionary) and (ii) a matching algorithm. The dictionary is made of texts associated with numeric codes. The codes are those of official classifications and represent the possible values to be assigned to the verbal responses entering the coding process, while the texts are the labels expressing the concepts that the classifications associate with the codes. To improve coding results, dictionaries are enriched with common-language descriptions taken from answers to previous surveys. The matching algorithm is a "weighting algorithm" that assigns a weight to each word of the verbal response to be coded. The weight indicates how informative a word is, and depends on the word's frequency inside the dictionary: the higher its frequency, the lower its weight. The algorithm then compares the input response with all the texts in the dictionary, looking for a perfect match. If no exact match is found, it looks for a partial match with the most "similar" description, choosing the one with the highest weight. The efficiency of automated coding systems allowed Istat to use them not only to code responses of statistical surveys, but also to offer the coding service to a wider public – governmental or private institutions, private citizens – who need to associate free-text descriptions with official classification codes; think, for instance, of businesses that have to identify their economic activity code for declarations to the Chambers of Commerce. The coding service was then made available on the Istat website for the ATECO variable (the Italian version of NACE, the classification of economic activities). The software used for many years was ACTR (1998-2015), developed and distributed by Statistics Canada. In 2015 ACTR no longer worked on the new Istat IT platform and was replaced by CIRCE, which behaves like ACTR but was developed in house and is based on R (Murgia et al., 2016). The choice of R made it possible to create a coding package freely downloadable from the website, and also to offer a web service for the coding of the ATECO. The web service can easily be incorporated into other software applications: electronic questionnaires of Istat surveys, or software systems of external organizations.
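The weighting logic can be sketched as follows (a simplification of the ACTR/CIRCE behaviour described above; the dictionary entries are invented, and the weight formula is a stand-in for the systems' actual one):

    import math
    import re

    # Invented dictionary: normalised description -> classification code.
    dictionary = {
        "vendita al dettaglio di libri": "47.61",
        "commercio di autoveicoli": "45.11",
    }

    def tokenize(text):
        return re.findall(r"\w+", text.lower())

    # Document frequency of each word across the dictionary descriptions.
    doc_freq = {}
    for desc in dictionary:
        for w in set(tokenize(desc)):
            doc_freq[w] = doc_freq.get(w, 0) + 1

    def weight(word):
        # The rarer a word is in the dictionary, the more informative it is.
        return math.log((1 + len(dictionary)) / (1 + doc_freq.get(word, 0)))

    def code(response):
        if response.lower() in dictionary:            # perfect match
            return dictionary[response.lower()]
        words = set(tokenize(response))               # else: partial match
        scores = {d: sum(weight(w) for w in words & set(tokenize(d)))
                  for d in dictionary}
        return dictionary[max(scores, key=scores.get)]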
Elasticsearch is well suited for the solution of some critical issues related to the use of statistical classifications in different fields (surveys, administrative registers, information systems, etc.), such as: • acquisition, storage, management and updates of classifications; • multilingual semantic search for coding; • sharing and dissemination of coding tools. Textual search is a very popular technique for users who seek information on the web. It does not require any special skill and users have already acquired through surfing the web and it is also suitable to search within statistical classifications and facilitate coding. The most common problem related to semantic searches within taxonomies concerns false-positive and falsenegative results. The search is usually done through SQL queries allowing users to perform two types of operations: "exact match" and "full text". String parsing algorithms can be associated to the SQL queries. JADT’ 18 195 A statistical classification can be indexed within Elasticsearch to perform complex and differentiated textual searches through DSL (Domain Specific Language) in JSON format. This solution permits to simplify the formulation of complicated SQL queries and makes the search system from any programming language usable. Elasticsearch allows users to manipulate large volumes of data thanks to an internal document management, completely independent from relational databases, and the opportunity to create distributed cluster. Istat experience in using this methodology has been very satisfactory. The coding systems related to the main statistical classifications (ISCO, NACE, ISCED, COFOG, COICOP) were included in several Istat surveys ("Labour Force Survey", multi-purpose survey "Aspects of daily life", "Consumer prices", etc.) and Information system on occupation. Easy to use, widgets have been developed to include coding systems within web questionnaires and web applications. 2. Open questions analysis Social research uses open questions also when category answers are not known or when researchers prefer to explore interviewees’ different points of view using their own categories. This approach offers a great opportunity to realize analysis in depth, but it is difficult to be applied with the largest sample used in official statistics. So it is generally preferred using open questions only in pilot survey or small samples, to explore the possible list of answers and to obtain the closed-end list for the final survey. As an example, Istat used this approach in a survey on the female participation in parliamentary life: in 2000 an open question was introduced in a quarterly Multipurpose survey and the list of answering categories obtained with textual analysis was used in the 2005 annual Multipurpose survey. However, in the early 2000s Text mining tools made it possible to analyse open questions also when codes does not belong to pre-defined classifications. The first example was introduced in Istat by Sergio Bolasco, who analysed the daily diaries collected in 2002-2003 Time use Survey to obtain a classification of some daily life actions (Bolasco et al., 2007). 
2. Open questions analysis
Social research also uses open questions when the answer categories are not known, or when researchers prefer to explore interviewees' different points of view using their own categories. This approach offers a great opportunity for in-depth analysis, but it is difficult to apply to the large samples used in official statistics. It is therefore generally preferred to use open questions only in pilot surveys or small samples, to explore the possible list of answers and obtain the closed-ended list for the final survey. As an example, Istat used this approach in a survey on female participation in parliamentary life: in 2000 an open question was introduced in a quarterly Multipurpose survey, and the list of answer categories obtained with textual analysis was used in the 2005 annual Multipurpose survey. However, in the early 2000s text mining tools made it possible to analyse open questions even when the codes do not belong to pre-defined classifications. The first example was introduced in Istat by Sergio Bolasco, who analysed the daily diaries collected in the 2002-2003 Time Use Survey to obtain a classification of some daily life actions (Bolasco et al., 2007).

This classification was obtained using the Entity Research by Regular Expression (RE) function of the tool Taltac2. This function represents an important turning point for the use of textual data in statistical surveys, because it made it possible to pass from the simple description of the words contained in a corpus (lexical analysis) to the classification of single records on the basis of the words contained in each of them (textual analysis). The single word is no longer the unit of analysis, as the RE function searches for, or counts, a particular word or combination of words within the entire record, putting the result in a new customized variable. The search for textual information is run by complex queries using regular expressions with Boolean operators (AND, OR, NOT), lexeme reductions (wildcards such as "*" and "?", e.g. contact* and customer?) and distances (LAGxx) between consecutive words, which make it possible to identify different expressions used to convey the same concept (contact* LAG3 customer? identifies sequences such as "to contact the customer", "contacts with customers", "I contact my main customers"; the value of the new variable could be "to contact customers"). This function was afterwards used in other Istat surveys. It was first used in the Survey on Occupations, developed in 2005-2006 and aimed at describing Italian labour market occupations, providing detailed information on each Occupational Unit. Researchers were also interested in the tasks in which workers are involved daily, which was asked through an open question: "What does your job consist of? Which are the activities you are involved in during your working day?". The aim was to provide each Occupational Unit with a list of semi-standardized activities, labelling in the same way similar activities expressed in different ways by respondents. A text categorization strategy was therefore used, adding to the final dataset an extra column variable with a synthesis of the activities stated by interviewees: the final result was a list of over 7,000 specific activities (della Ratta, 2009). A similar approach is currently used to check and correct the coding of economic activity carried out by interviewers in the Labour Force Survey: every quarter, 1,500 records out of the 24,000 responses collected in the survey and referring to specific NACE sections are analyzed. The correctness of the assigned codes is verified from a double perspective: not only by comparing the respondents' vocabulary reported in the response field of the question on economic activity with the specific dictionary of the official classification (NACE Rev. 2), but also by considering other extra information connected with this variable and collected in the same survey questionnaire. The process is completed with a thorough examination of data consistency in each session, to validate the corrections made and to assign the definitive code. Finally, errors are reported back to interviewers during specific training sessions in order to improve the whole data collection process, from the interview to the code assignment (della Ratta and Tibaldi, 2014). Other uses of text mining tools concerned the classification of the open questions of the online survey on the dimensions of well-being (della Ratta and Tinto, 2012), and the analysis of residual answers inserted in single questions ("Other, please specify"), which can improve the exhaustiveness of questionnaires and can be used in training activities for interviewers.
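In standard regular-expression syntax, the LAG operator described above can be emulated with a bounded gap of intervening words, as in the following Python sketch (the gap bound and the tokenization are simplifications of what Taltac2 actually does).

import re

# emulate "contact* LAG3 customer?": a word starting with "contact",
# then at most 3 intervening words, then "customer" plus at most one
# extra character (the "?" wildcard)
pattern = re.compile(r"\bcontact\w*\b(?:\W+\w+){0,3}?\W+customer\w?\b",
                     re.IGNORECASE)

examples = ["to contact the customer",
            "contacts with customers",
            "I contact my main customers",
            "the customer called us"]
for s in examples:
    print(s, "->", bool(pattern.search(s)))
# the first three match and could all be recoded as "to contact customers"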
In conclusion, the availability of text mining tools made it possible to process open questions independently of the size of the text, leaving one free to use unstructured data in official statistics, especially in recursive analyses in which text categorization strategies can be repeated several times.

3. Dealing with textual Big Data
In recent years, in line with European-level strategic directives, Istat has been exploring the potential of Big Data sources for official statistics. Many such sources, and notably those that seem the most promising so far, are made up of huge collections of unstructured and noisy texts. In current Istat projects, two types of unstructured sources were taken into account, namely: (i) textual data collected from the websites of Italian companies, obtained through automatic procedures of access and extraction performed on a large scale (hundreds of thousands of sites); (ii) messages in Italian publicly available on social networks, typically collected in streaming after a preliminary selection step performed using 'filters' (i.e. sets of keywords that a message must match to be deemed relevant). The contexts of use of textual data from company websites include the enrichment of information in statistical business registers and the potential replacement of questions from survey questionnaires. The possible uses of data from social networks mainly concern the production of high-frequency (e.g. daily) sentiment indices. At the moment, the experiments with social network data have focused on the Twitter platform and on the development of "specific" sentiment indices: the goal is to measure the Italian mood about topics or aspects of life that might be relevant for official statistics (such as the economic situation, the European Union, migration, the terrorist threat, and so on). The hope is that such sentiment indices can improve the quality of Istat's economic forecasting models, enrich existing statistical products (for example the BES) or create new statistical outputs in their own right. Among the processing techniques used for these sources, a particularly promising type consists of Word Embedding models. These models are generated by unsupervised learning algorithms (such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), both based on neural networks) trained on large collections of text documents. Their main objective is to map natural language words into vectors of a metric space, in such a way that the numerical representation of texts captures and preserves a wide range of syntactic and semantic relationships existing between words. Istat successfully tested Word Embedding models in both the application scenarios sketched above. In the first scenario, Word Embeddings have been exploited to automatically summarize the huge text corpora scraped from company websites, and subsequently to encode the summarized texts in order to feed a Deep Learning algorithm for downstream analysis (e.g. to predict whether a given enterprise performs e-commerce). In the second scenario, Word Embedding models have been leveraged both to design the 'filters' used to select relevant messages from Twitter and to evaluate the actual performance of the same 'filters' after data collection. (A more detailed focus on the processing of Twitter data is presented in the paper "Word Embeddings: a powerful tool for innovative statistics at Istat", submitted to this conference.) In the remainder of this section, a specific focus is provided on data scraped from enterprise websites.
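A minimal sketch of the Word Embedding step with gensim's Word2Vec follows (the parameter names follow gensim 4; older releases use size instead of vector_size). The corpus and the hyper-parameters are toy stand-ins, not the settings actually used at Istat.

import numpy as np
from gensim.models import Word2Vec

# in the real pipeline, sentences come from scraped website texts or
# filtered tweets after tokenization and word filtering
sentences = [["online", "shop", "cart", "checkout", "payment"],
             ["company", "contact", "address", "phone"],
             ["buy", "online", "payment", "credit", "card"]]

# skip-gram Word2Vec; parameters are purely illustrative
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1,
                 sg=1, epochs=50)

vec = model.wv["payment"]                  # dense vector for one word
print(model.wv.most_similar("payment"))    # nearest words in embedding space

# one simple way to encode a whole text before a downstream classifier:
# the average of its word vectors
doc = ["online", "payment", "checkout"]
doc_vec = np.mean([model.wv[w] for w in doc if w in model.wv], axis=0)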
The Istat sampling survey on Information and Communication Technologies (ICT) in enterprises aims at producing information on the use of the Internet and other networks by Italian enterprises for various purposes (e-commerce, e-skills, e-business, social media, e-government, etc.). In 2013, an Istat project started with the purpose of studying the possibility of estimating some of the indicators produced by the survey directly from the websites of the enterprises; these indicators included the online sale rate, the social media presence rate and the job advertisement rate. The idea was to use web scraping techniques, associated in the estimation phase with text and data mining algorithms, with the aim of replacing traditional instruments of data collection and estimation, or of combining them in an integrated approach (Barcaroli et al., 2015). The recently achieved results are very encouraging with respect to the use of such techniques (Barcaroli et al., 2017). The whole pipeline set up for this project includes:
• a scraping activity performed by ad-hoc developed software (RootJuice, available on GitHub: https://github.com/SummaIstat/RootJuice/);
• a storage step in which scraped data are stored in a NoSQL database, i.e. Apache Solr;
• a data preparation and text encoding step, performed in two different ways: 1. tokenization, word filtering, lemmatization and generation of a term-document matrix; 2. word filtering and word embeddings;
• an analysis step, performed via machine learning methods on each of the text encodings resulting from the previous step (see the sketch below).
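As a sketch of encoding variant 1 followed by the analysis step, the snippet below builds a term-document matrix with scikit-learn and trains a classifier to predict e-commerce from website text. The texts and labels are invented, lemmatization is omitted, and a logistic regression stands in for the Deep Learning algorithm used in the project.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# one short text per enterprise website (toy data); labels say whether
# the enterprise performs e-commerce
texts = ["buy online cart checkout payment",
         "company history mission contact",
         "shop online secure payment delivery",
         "our offices address phone email"]
labels = [1, 0, 1, 0]

# tokenization + word filtering + term-document matrix
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)     # sparse matrix: documents x terms

clf = LogisticRegression().fit(X, labels)
print(clf.predict(vectorizer.transform(["online payment and delivery"])))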
4. Conclusions and remarks
The techniques for dealing with large corpora of texts can greatly benefit from recent technological advances. Word Embeddings are an example of this opportunity, giving additional possibilities for using unstructured data in official statistics for the purpose of integrating analyses or reducing response burden. Extensive evidence shows that Word Embedding models are indeed superior to more traditional text encoding methods such as bag-of-words. Ongoing work on textual Big Data at Istat makes extensive use of these new approaches, with very promising results.

References
Barcaroli G., Nurra A., Salamone S., Scannapieco M., Scarnò M. and Summa D. (2015). Internet as Data Source in the Istat Survey on ICT in Enterprises. Austrian Journal of Statistics, vol. 44, n. 2.
Barcaroli G., Scannapieco M. and Summa D. (2017). Massive Web Scraping of Enterprises Web Sites: Experiences and Solutions. 61st World Statistical Congress, ISI.
Bolasco S., Pavone P. and D'Avino E. (2007). Analisi dei diari giornalieri con strumenti di statistica testuale e text mining. In: Romano (ed.), I tempi della vita quotidiana, Istat, Roma, Argomenti, n. 32.
della Ratta-Rinaldi F. (2009). Il trattamento dei dati. In F. Gallo, P. Scalisi and C. Scarnera, L'indagine sulle professioni. Anno 2007. Contenuti, metodologia e organizzazione. Collana Metodi e Norme, n. 42, Roma, Istat.
della Ratta-Rinaldi F. and Tinto A. (2012). Le opinioni dei cittadini sulle misure del benessere. Risultati della consultazione online. Roma, Istat-Cnel.
della Ratta-Rinaldi F. and Tibaldi M. (2014). Sperimentazione di un sistema di controllo e correzione per la codifica dell'attività economica. Istat Working Papers, n. 4.
Macchia S. and Murgia M. (2002). Coding of textual responses: various issues on automated coding and computer assisted coding. Proc. of JADT 2002: 6es Journées internationales d'Analyse Statistique des Données Textuelles.
Mikolov T., Chen K., Corrado G. and Dean J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of Workshop at ICLR.
Murgia M. and Prigiobbe V. (2016). La nuova applicazione di codifica web dell'ATECO 2007: WITCH, un web service basato sul sistema di codifica CIRCE. Istat Working Papers, n. 19.
Pennington J., Socher R. and Manning C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532-1543.

Twitter e la statistica ufficiale: il dibattito sul mercato del lavoro [Twitter and official statistics: the labour market debate]
Francesca della Ratta, Gabriella Fazzi, Maria Elena Pontecorvo, Carlo Vaccari, Antonino Virgillito
Istat – Istituto Nazionale di Statistica, Rome – Italy
(This work is the result of the authors' shared reflection; section 1 was written by Carlo Vaccari and Antonino Virgillito, section 2.1 by Francesca della Ratta, section 2.2 by Gabriella Fazzi and Maria Elena Pontecorvo, and the conclusions by all the authors.)

Abstract
The goal of the paper is to show the potential and the benefits of integrating big data analysis techniques with the techniques used for textual analysis, through the analysis of a corpus extracted from Twitter. The analysis develops a method already tested in other works (della Ratta, Pontecorvo, Virgillito and Vaccari, 2016 and 2017), starting from the collection of selected tweets through a list of hashtags defined according to the theme of interest. This procedure makes it possible to obtain, in a reasonable time, a selection of tweets of interest on which to apply textual analysis techniques, in order to describe the contents of the text and identify its main semantic contents. The paper analyzes the role of the National Institute of Statistics in the discussion on the labour market in the periods when Istat publishes its monthly and quarterly press releases on employment. The analysis, already conducted at the end of 2016, was replicated and refined in the same period of 2017, in order to show the distinctive elements of the labour market debate and to understand the changes in the perception of public opinion, also taking into account the changes in the economic situation and the political scenario.
Keywords: big data; text mining; Twitter; Istat; labour market

1. Big data and Twitter
Data from social networks are among the Big Data sources most used by researchers: the enormous diffusion of these websites, in which users generate large amounts of information, makes them potentially one of the most interesting sources for textual data as well. Twitter is a social network in which users write and read short messages called "tweets", normally visible to all users, who can also "subscribe" to the tweets of other users (becoming "followers"), forward ("retweet") single tweets to their own followers, or "like" other tweets. Twitter is today one of the most widespread social networks, and in 2017 it passed 300 million active users. According to Alexa (2018), Twitter is today the thirteenth most visited website in the world, and the eighth in the USA.
The aim of this work is to apply textual analysis techniques to a corpus extracted from Twitter, joining the two worlds of Big Data and textual analysis. Data collection from Twitter was carried out using a platform, the "Sandbox", which is the final result of the project "Big Data in Official Statistics", carried on within the High-Level Group on Modernisation of Official Statistics (HLG-MOS). (The results of the Sandbox project, coordinated by Virgillito in 2014 and by Virgillito and Vaccari in 2015, are described in UNECE (2014 and 2016).) The Sandbox is a web-based environment used for numerous experiments based on different data sources, such as visits to Wikipedia pages, foreign trade data from the UN Comtrade site, company websites searched for job advertisements and, indeed, tweets collected in various countries of the world. The Sandbox is still used today to carry out the experiments of the ESSnet on Big Data (https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/ESSnet_Big_Data), a European project coordinated by Eurostat on the use of Big Data in the production of official statistics. The tweets analyzed were collected through an online tool made freely available by Twitter (the Streaming API), queried through programs written in R and executed inside the Sandbox. This solution, although simple to use and immediate to implement, has limitations both in the amount of data that can be extracted and in the incomplete adherence of the data obtained to the filters set in the extraction phase, as explained in the next section. The collected tweets were stored in Elasticsearch, a database installed in the Sandbox and specialized in semi-structured data, which makes it possible to store large quantities of documents and to quickly extract subsets through keyword-based queries.

2. The analysis of posts on the labour market: Istat's impact
2.1 Corpus creation
To analyze the data extracted from Twitter, the method tested in previous works was replicated (della Ratta, Pontecorvo, Virgillito and Vaccari, 2016 and 2017). In this context, it was decided to focus the analysis on the role played by Istat in the diffusion of information on the labour market, automatically extracting a first set of tweets in the weeks in which Istat publishes the monthly and quarterly data on the labour market. This extraction, already carried out at the end of 2016, was replicated in the same period of 2017, starting from a rather broad query that yielded a corpus of 58,277 tweets for the period 28 November - 12 December 2017. The initial query was the following: "(istat OR inps OR #istat OR #inps OR #lavoro OR #occupati OR #disoccupati OR #disoccupato OR #jobsact OR #occupazione OR #disoccupazione OR #mercatodellavoro OR #poletti OR #cassaintegrazione)". The same query was then re-run in Elasticsearch on the first corpus of 58,277 tweets extracted through the Twitter API; this made a further selection possible, eliminating a great many tweets which, although extracted by the same query, did not contain the keywords, and revealing the incomplete accuracy of the free Twitter API in applying the extraction filters. A corpus of about 26 thousand tweets was eventually obtained, on which the subsequent selection was carried out. From this corpus all the hashtags with more than 14 occurrences were extracted (easily identifiable in the text thanks to the "#" symbol), and among them those strictly connected to the labour market discussion were singled out (Table 1). Using the software Taltac2, a corpus of 19,398 tweets containing at least one of the hashtags of interest was then extracted. This corpus was further cleaned by removing the tweets related to job offers (presence of the hashtags #offertalavoro and #annunciolavoro), considered not pertinent. This led to a corpus of 17,419 tweets, consisting of 283,000 occurrences and 18,000 word forms, with a lexical richness (type/token ratio) of 6.7%.
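The hashtag selection step can be sketched in a few lines of Python (the paper's actual processing was done with Elasticsearch queries and Taltac2). Note that lower-casing here merges case variants such as #JobsAct and #jobsact, which Table 1 keeps distinct; the tweets shown are invented.

import re
from collections import Counter

# tweets: texts already retrieved from the Streaming API and stored in
# Elasticsearch (toy examples here)
tweets = ["Nuovi dati #Istat su #occupazione e #lavoro",
          "#jobsact e #precariato, il dibattito continua",
          "#lavoro #roma #romalavoro cercasi commessa"]

counts = Counter(h.lower() for t in tweets for h in re.findall(r"#\w+", t))

# keep hashtags above the frequency threshold (14 in the paper), then
# select the tweets containing at least one hashtag of interest
frequent = {h for h, n in counts.items() if n > 14}
selected = [t for t in tweets
            if any(h.lower() in frequent for h in re.findall(r"#\w+", t))]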
Slightly more than a third of the tweets are original, while retweets make up 63% of the overall corpus, a larger share than in the 2016 corpus. To "measure" Istat's impact on the labour market debate, all the tweets in which the form "Istat" appears were labelled: 13.9% of the total, almost three times the share observed in 2016 (5%). While in 2016 the impact of the concurrent referendum debate had reduced the weight of comments on Istat figures within the labour market discussion, in 2017 the themes of the employment recovery and of its characteristics seem to have attracted users' attention to a greater extent. Moreover, the first release of the integrated annual report on the labour market probably increased the weight of comments on the data. While in 2016 references to Istat clustered around the official releases, in 2017 they are distributed more uniformly, with a peak at the quarterly press release of 7 December (in which the news of the all-time record of fixed-term workers had a wide echo; Figure 1).

Table 1 – Selection of hashtags (occurrences in parentheses): #lavoro (14,172), #jobsact (1,225), #occupazione (948), #JobsAct (861), #Jobsact (587), #disoccupazione (463), #povertà (414), #Poletti (278), #precari (265), #LAVORO (205), #ContailLavoro (201), #disoccupati (196), #licenziamento (190), #disoccupato (173), #Thyssenkrupp (164), #contaillavoro (158), #lavoratori (156), #Disoccupazione (149), #Melegatti (139), #GaranziaGiovani (124), #precariatodistato (110), #precariato (109), #pandoro (98), #articolo18 (53), #occupati (48), #Occupazione (42), #Cococo (42), #Discoll (41), #orientamento (41), #cassaintegrazione (40), #mercatodellavoro (37), #JobsActSempre (32), #smartworking (31), #thyssen (31), #RelazioniIndustrialiA (20), #poletti (20), #MercatoDelLavoro (19), #precarizzazione (19), #precarietà (19), #Smartworking (18), #voucher (17), #freelance (17), #Art18 (15), #dipendente (15), #ScuolaLavoro (15).

[Figure 1 – Share of references to Istat by day, 2016 and 2017. Istat release calendar: 28/11 Births and fertility; 1/12 Employed and unemployed, October (released on 30/11 in 2016), Quarterly national accounts; 5/12 Quarterly note on the economic trend; 6/12 Living conditions, income and tax burden of households; 7/12 The labour market (Q3); 11/12 The labour market (integrated annual report, 2017 only).]
The impact of the monthly employment press release was more modest, as its figures had turned out to be substantially stable (in 2016, by contrast, given the concurrence with the constitutional referendum, the monthly release had recorded the highest share of citations). A considerable volume was recorded on the occasion of the press release on living and income conditions (6/12) and of the presentation of the first integrated report on the labour market (11/12). The peak of 3 December, a Sunday, is anomalous: it was fed by a remarkable rate of retweets highly critical of the labour policies of the government of the time, probably due to the appearance of PD secretary Matteo Renzi on a popular evening television programme (Che Tempo Che Fa), which also focused on the labour market policies of recent years. The most cited was a tweet critical of the mechanism for counting the employed, together with other more political ones on the recent increase in fixed-term work.

2.2 The content of the corpus
The content of the corpus can be described using the key words, computed against standard Italian (Bolasco, 2013), which make it possible to delimit the areas of content. The first to emerge is contratti ("contracts"), with reference to the increase in fixed-term contracts. The over-representation of the term is driven by a limited number of tweets (13) which nevertheless received numerous retweets and which, taking up the Istat figure on contract duration, highlight the growth of precarious work (precariato), itself an over-used term (Figure 2). Other terms very present in the text are disuguaglianze ("inequalities") and esclusione ("exclusion"), used above all in a Caritas post taking up the poverty figure published on 6 December.

[Figure 2 – Tag cloud of the key words.]

The presence of very strong terms, connoting a harsh-toned debate, is striking: trucco ("trick"), fraudolenta ("fraudulent"), infamia ("infamy"), tossico ("toxic"), truffa ("scam"), schiavitù ("slavery"). Analyzing the contexts of use, each of these terms turns out to refer to a different episode (the "trick" of the data on the definition of employment; "fraudulent" labour supply; the government's "scam" on pensions; the "toxic" agreement, with reference to CETA; "infamy against labour", with reference to the Jobs Act), and it is precisely the strongest tweets that generate the highest numbers of retweets. Also significant are the terms used in tweets evoking stories, in which the statistical figure is replaced by the exemplary case, capable of generating empathy and, consequently, retweets. It is no coincidence that the most over-represented terms refer to a single tweet, about a worker struck by leukaemia who recovers but is fired anyway. Among the examples there is also that of a separated mother, dismissed by Ikea in Milan. References to the employment record and to those celebrating the employment figures are also sometimes reported critically; the exception is the reference to the youth unemployment rate, which is taken up neutrally by the Ansa news agency and retweeted many times. Considering the repeated segments (i.e. the sequences of words repeated in the text), four main semantic areas to which the tweets refer can be identified (Table 2). First of all there are the expressions referring to the mere diffusion of the news revolving around the theme of "work", which carry considerable weight also in terms of occurrences.
In particular, on the one hand the reference to the data published by Istat on poverty, birth rates and employment emerges; on the other, two segments stand out which refer to the current-affairs episodes already mentioned: the dismissal by Ikea of a separated mother with two small children, and that of an employee of a paint factory, which occurred after a long period of sick leave. Alongside the segments relating to the news, there are then the segments attributable to the comments of politicians, to legislative measures and to the first signs of the electoral campaign. These are counterbalanced by the tweets characteristic of the public debate, among which polemical or sarcastic notes are not lacking. Finally, although the file was partly cleaned of the hashtags attributable to job advertisements, some segments concerning the search for particular professional profiles nevertheless emerge. As can easily be guessed, moreover, some contents characterize more strongly the fragments in which explicit reference is made to Istat. Compared with the analysis carried out in the same period of the previous year, the analysis of specificities shows the prevalence of a more thematic than technical language when the institute is cited (dati, contratti, #povertà, disoccupazione), while the tweets that speak about work without citing Istat refer to news stories and politics (#jobsact, #pensioni, legge, licenziato, etc.), with fewer personal references to the figures who in 2016 were on the front line of the referendum campaign. Moreover, the analysis of concordances shows that the very reference to Istat is used in different contexts.

[Table 2 – Main repeated segments, organized into four semantic areas: the news (e.g. "dati #Istat", 419 occurrences; "esclusione sociale", 392; "madre separata", 195), political references (e.g. "Missione compiuta #JobsAct", 67; "Ministro #Poletti", 60; "campagna elettorale", 43), public debate and polemic (e.g. "Come ti trucco i dati"; "tutto da rifare"; "dovremmo ribellarci"), and job advertisements (e.g. "#lavoro #roma #romalavoro", 152; "kijiji lavoro", 53).]

Beyond the mere diffusion of the news and the comments on the figure on the increase in fixed-term contracts, there is also an instrumental use of the data as a yardstick for judging labour market policies [#Istat "record di occupati a termine: sono 2,8 milioni". ecco l'unico risultato oggettivo del #jobsact ("record of fixed-term employed: 2.8 million". Here is the only objective result of the #jobsact); continua a calare la #disoccupazione - i nuovi dati #Istat confermano le previsioni, un'altra ventata di ottimismo ("#unemployment keeps falling - the new #Istat data confirm the forecasts, another wave of optimism")]. Compared with 2016, the sarcastic tone of some tweets is directed less explicitly at Istat and more generally at the situation of the country [«record di #precari in Italia, 2,8 milioni. va tutto ben, madama la marchesa.. #lavoro #Istat #occupazione» ("record number of #precarious workers in Italy, 2.8 million. all is well, madame la marquise..")].
A polemical residue nevertheless remains about some definitions of employment and unemployment [«Ricordiamo che per #istat se si lavora un'ora retribuita a settimana si è considerati occupati. #supercazzola» ("Let us remember that for #istat, if you work one paid hour a week you are counted as employed. #supercazzola"); «Come ti trucco i dati #Istat sulla disoccupazione: il 14,6% dei contratti dura meno di 3 giorni, il 31% un mese» ("How I trick the #Istat unemployment data: 14.6% of contracts last less than 3 days, 31% one month")]. Finally, the evaluation of the tone of the text is of interest, made possible by the analysis of the positive and negative adjectives recognized within Taltac2. The ratio of negative to positive adjectives is 50.2%, a value denoting a medium level of criticality, equal to that found in the language of the press (Bolasco and della Ratta, 2004). (Technical terms referring to specific statistical aggregates which carry a negative connotation, such as disoccupato ("unemployed"), scoraggiato ("discouraged") or povero ("poor"), were in any case removed.) The level of criticality varies across the different days: it is lower on the days the press releases are published, especially the monthly one, while it is particularly high on 3 December, owing to the "noise" produced by retweets (retweets show a criticality of 63.6%), probably because of the greater success of polemical tweets. Among the most frequent negative adjectives are precari, fraudolenta, dannoso, fallito ("precarious", "fraudulent", "harmful", "failed").

3. Conclusions
The analysis carried out made it possible to refine a methodology for processing tweets: from the point of view of their extraction, the procedure used yielded a cleaner starting file on which to operate a selection based on the list of hashtags. The analysis of the text then made it possible to highlight the different contexts in which official statistics figures are referred to. The comparison between the results of the same corpus one year apart is particularly interesting. In the same period of the previous year, the discussion was strongly conditioned by the referendum debate, which probably "overturned" the discussion on labour themes. In the tweets of a year earlier the levels of criticality were higher and Istat's role smaller (13% presence today against 5% a year earlier). The tone of the text appears on the whole more neutral, with more references to Istat in its official role as a publisher of data and fewer as an object of mockery and polemic. As for content, in the discussion at the end of 2017 the debate on the effects of the Jobs Act and on the spread of precarious work seems to have carried more weight. The current corpus is also characterized by a wider use of retweets.

References
Alexa (2018). Twitter site overview, http://www.alexa.com/siteinfo/twitter.com.
Bolasco S. (2013). L'analisi automatica dei testi. Fare ricerca con il text mining. Roma, Carocci.
Bolasco S. and della Ratta-Rinaldi F. (2004). Experiments on semantic categorisation of texts: analysis of positive and negative dimension. In JADT 2004 - Le poids des mots, Actes des 7es Journées internationales d'Analyse Statistique des Données Textuelles. UCL, Louvain.
della Ratta-Rinaldi F., Pontecorvo M.E., Virgillito A. and Vaccari C. (2016). Big data and textual analysis: a corpus selection from twitter. Rome between the fear of terrorism and the Jubilee. In JADT 2016 - Statistical Analysis of Textual Data, Vol. 2, Nice.
della Ratta-Rinaldi F., Pontecorvo M.E., Virgillito A. and Vaccari C. (2017). The Role of NSIs in the Job Related Debate through Textual Analysis of Twitter Data. NTTS 2017, Brussels.
UNECE (2016). Big Data in Official Statistics.
http://www1.unece.org/stat/platform/display/bigdata/Big+Data+in+Official+Statistics.
UNECE (2014). Big Data in Official Statistics. http://www1.unece.org/stat/platform/display/bigdata/Big+Data+in+Official+Statistics.
Vaccari C. (2014). Big Data and Official Statistics. PhD Thesis, School of Science and Technologies, University of Camerino.

Gauging An Author's Mood Using Hidden Markov Chains
Sami Diaf
Hildesheim Universität – sami.diaf@uni-hildesheim.de

Abstract
This paper aims to gauge the mood of an author using a text-based approach built upon a lexicon score and a hidden Markov model. The text is tokenized into sentences, each given a polarity score, yielding three evaluative factors (positive, neutral and negative) which represent the observable states. The mood of the author is considered a latent state (good, bad) and is estimated via a hidden Markov model. Tested on a work of psychological fiction, Franz Kafka's novella Metamorphosis, this methodology shows an interesting linkage between the author's feelings and the intent of his writing.
Keywords: sentiment analysis, hidden Markov model, polarity.

1. Introduction
Sentiment analysis is defined as the general method for extracting subjectivity and polarity from a text, while semantic orientation refers to the polarity and strength of words, phrases or texts: a measure of subjectivity and opinion in the text, capturing an evaluative factor and the potency or strength of a given corpus toward a given subject (Taboada et al., 2011). Extracting sentiment automatically usually involves two main approaches (Taboada et al., 2011): a lexicon-based approach, built on computing the orientation of a document from the semantic orientation of its words or sentences, and a text-classification approach, stemming from supervised machine learning techniques, which involves building classifiers from labeled instances of texts or sentences. Lexicon-based models stress the importance of adjectives as an indicator of a text's semantic orientation and have been preferred in the linguistic context, as classifiers have yielded varying results across their areas of application (Taboada et al., 2011). Among the many lexicon-based approaches adopted in the academic field, the one implemented by Hu and Liu (2004) remains popular. It was built upon two hypotheses concerning semantic orientation: independence of context (prior polarity) and expression as a numerical value using an opinion lexicon. This article uses the polarity approach of Hu and Liu to build a sequence of evaluative factors (positive, neutral and negative), considered as realizations of an observable state x, and supposes that the mood of the author can be approached via a latent variable z taking two hidden states (good and bad). To this end, hidden Markov models (Murphy, 2012) are used to estimate the transition probabilities between hidden and observed states, as they capture long-range correlations in the sequence of data better than standard Markov models.

2. Polarity function
Polarity is defined as the measure of positive or negative intent in a writer's tone (Kwartler, 2017) and can be calculated by sophisticated or fairly straightforward methods, usually using two lists of words: one positive and one negative. Hu and Liu (2004) set up the architecture of the polarity function used to tag polarized words in the English language, and Rinker (2017) provides a detailed description of the polarity function and its computation. A context cluster of words is pulled around each polarized word, and its members are treated as valence shifters: words in the context cluster are tagged as neutral, negator, amplifier or de-amplifier. Each polarized word is then weighted according to a dictionary of positive/negative words and weights, and further weighted by the number and position of the valence shifters directly surrounding it. The final computation step is the sum of the context clusters divided by the square root of the word count, which yields an unbounded polarity score.
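The polarity computation is implemented in full by the qdap polarity() function cited below; the following Python sketch is a much-reduced toy version in which the lexicons, the amplification factor and the flat context window are illustrative assumptions, not the actual values of the Hu and Liu dictionary.

import math
import re

positive = {"good", "great", "happy"}      # stand-in for the opinion lexicon
negative = {"bad", "terrible", "sad"}
negators = {"not", "never", "no"}
amplifiers = {"very", "really"}

def polarity(sentence, window=4):
    words = re.findall(r"[a-z']+", sentence.lower())
    clusters = []
    for i, w in enumerate(words):
        if w in positive or w in negative:
            weight = 1.0 if w in positive else -1.0
            for c in words[max(0, i - window):i]:   # context cluster
                if c in negators:
                    weight = -weight                 # valence shifter: negate
                elif c in amplifiers:
                    weight *= 1.8                    # valence shifter: amplify
            clusters.append(weight)
    # sum of context clusters over the square root of the word count
    return sum(clusters) / math.sqrt(len(words)) if words else 0.0

def evaluative_factor(score):
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"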
3. Application
To illustrate this framework, we took the English version of the novella Metamorphosis, written by Franz Kafka, published in 1915 under the title «Die Verwandlung» and freely available in the Project Gutenberg database. The work was translated into English by David Wyllie in 2002 and belongs to the psychological fiction category. The novella is broken down into sentences, a process called tokenization, and the polarity function is then computed for each sentence in order to construct a sequence of evaluative factors (positive, neutral or negative) according to the polarity score, as shown in Figure 1.

[Figure 1. Sequence of data corresponding to the polarity score of each sentence.]

This step generates 812 sentences, of which the positive and negative polarity scores represent 29.1% and 28.6% of the total respectively. The remaining sentences (42.3%) correspond to the neutral evaluative factor. Statistical tests show that the generated time series has its first two autocorrelations significantly different from zero and exhibits slightly persistent memory, as the estimated Hurst exponent is 0.587, significantly different from the value of 0.5 corresponding to a Brownian motion (Mandelbrot and Hudson, 2006). The probability transition matrix of the evaluative factors, estimated via maximum likelihood, shows that the associated Markov chain is irreducible with no persistent states, as shown in Figure 2.

[Figure 2. Probability transition matrix of the evaluative factors.]

We assume the mood of the author can be modeled via a latent variable Z taking two states (good and bad). Hence, we can build a hidden Markov model explaining the interactions between the observable states (positive, neutral and negative) and the latent, unobservable states (good and bad). To estimate the hidden Markov model, the transition matrix of the latent state is set uniformly, i.e. all its elements equal 0.5; the same applies to the initial latent vector. The emission matrix, which describes the links between the latent and the observable states, is instead set arbitrarily, as in Figure 3.

[Figure 3. Prior probability transition of the emission matrix.]

Given these priors, the hidden Markov model estimated using the Baum-Welch algorithm (Murphy, 2012) yields a starting probability vector slightly skewed towards the good mood (0.51) rather than the bad mood (0.49). The estimated transition and emission matrices are reported in Figures 4 and 5 respectively.

[Figure 4. Estimated transition matrix via the Baum-Welch algorithm.]
[Figure 5. Estimated emission matrix via the Baum-Welch algorithm.]
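A sketch of this estimation with the hmmlearn library (the class is CategoricalHMM in recent releases; in older versions the same behaviour was exposed as MultinomialHMM). The short observation sequence and the prior emission values below are toy stand-ins for the 812-sentence sequence and for the priors of Figure 3.

import numpy as np
from hmmlearn import hmm

# observed evaluative factors: 0 = negative, 1 = neutral, 2 = positive
obs = np.array([[1], [2], [0], [1], [1], [0], [2], [1]])

model = hmm.CategoricalHMM(n_components=2, n_iter=100, init_params="")
model.startprob_ = np.array([0.5, 0.5])           # uniform initial vector
model.transmat_ = np.full((2, 2), 0.5)            # uniform latent transitions
model.emissionprob_ = np.array([[0.2, 0.5, 0.3],  # arbitrary priors standing
                                [0.4, 0.4, 0.2]]) # in for Figure 3

model.fit(obs)                 # Baum-Welch (EM) re-estimation
states = model.predict(obs)    # Viterbi: most probable latent mood sequence
print(model.transmat_, model.emissionprob_, states, sep="\n")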
The results show significant links between writing without intent (the neutral state) and being in a good mood, and between negative intent and the bad mood. The most probable states, estimated via the Viterbi algorithm (Murphy, 2012), clearly show the dominance of the good state (71.4%) over the bad one (28.6%), as shown in Figure 6. These findings help clarify the nature of the story (thriller, novel, novella, ...) and the author's narrative style, which could be confirmed by analyzing his remaining works. Finally, it is worth noticing that this methodology could also be used to assess the accuracy of translations with respect to the original work, by comparing the similarities of the transition and emission probabilities of the corresponding hidden Markov models.

[Figure 6. Most probable states estimated via the Viterbi algorithm (Bad in red and Good in blue).]

4. Conclusion
This work expands the application field of semantic orientation by exploring a new probabilistic approach based on hidden Markov models and evaluative factors. The resulting outcomes help in understanding the author's mood by examining the linkage between the evaluative factors which express the author's mindscape through his writing. The emission probabilities between the latent states and the evaluative factors helped identify hidden structures linked to the psychological state of the author and to the development of the facts. This approach could be used as a check on translation accuracy, on condition of having a precise list of positive and negative words in the original language with which to compute the polarity score.

References
Hu M. and Liu B. (2004). Mining and summarizing customer reviews. Proceedings of the ACM SIGKDD, pp. 168-177.
Kwartler T. (2017). Text Mining in Practice with R. Wiley.
Mandelbrot B. and Hudson R.L. (2006). The Misbehavior of Markets: A Fractal View of Financial Turbulence. Basic Books.
Murphy K.P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
Project Gutenberg [www.gutenberg.org].
Rinker T. (2017). Polarity score (Sentiment Analysis) [https://www.rdocumentation.org/packages/qdap/versions/2.2.9/topics/polarity].
Silge J. and Robinson D. (2017). Text Mining with R: A Tidy Approach. O'Reilly.
Taboada M., Brooke J., Tofiloski M., Voll K. and Stede M. (2011). Lexicon-based Methods for Sentiment Analysis. Computational Linguistics, vol. 37, issue 2, pp. 267-307.

Les hémistiches répétés [Repeated hemistichs]
Marc Douguet
Université Grenoble Alpes – marc.douguet@univ-grenoble-alpes.fr

Abstract
In this paper, we propose to use the syllabic structure of the classical alexandrine in order to automatically identify textual recurrences in French 17th-century theater. The two hemistichs of 6 syllables each present a syntactical unity: consequently, extracting recurrent hemistichs is a way, on the one hand, to highlight idiomatic expressions characteristic of this period, and, on the other hand, to evaluate the influence of metric constraints on writing.

Résumé
In this article, we propose to use the metrical characteristics of the classical alexandrine to automatically detect textual recurrences in the corpus of 17th-century French theatre. The two 6-syllable hemistichs that make up the line indeed possess a syntactic unity: the frequent reuse of the same hemistichs therefore makes it possible, on the one hand, to bring out the stock phrases proper to this style of writing and, on the other, to evaluate the influence of metrical constraints on writing.
Keywords: repeated segments, metre, verse, textual recurrences

1. Introduction
The detection of repeated segments in a corpus is a particularly valuable tool for stylometric analysis: it makes it possible both to characterize the style proper to an author, a genre or a period, and to assess an author's originality with respect to his contemporaries, his ability to free himself or not from the stock phrases of his time (cf. notably Salem, 1987; Legallois, 2009; Delente and Legallois, 2016). From this point of view, the classical alexandrine presents a characteristic which, it seems to us, has not yet been fully exploited. The caesura divides the line into two hemistichs of equal length (6 syllables) which constitute units that are both rhythmic and syntactic. And these units are the object of numerous repetitions. Compared with the approach that consists in extracting all segments of n words in order to detect recurrences, this approach (which complements it) has a threefold advantage for the computational stylistics of poetry:
– it extracts only segments that already constitute syntactic units, and avoids having to sort the relevant results manually;
– it extracts segments which, whatever their number of words, have the same number of syllables and are therefore, in the poetic regime, of strictly comparable weight;
– it makes it possible to relate reflection on repetition to the analysis of versification and, in particular, to appreciate the constraint that the metre places on writing.

2. Methodology
We worked on a corpus of 200 plays in alexandrines published between 1630 and 1680, representative of the diversity of the dramatic genres of the period (tragedy, comedy, tragicomedy). (The list of plays, the scripts used and the complete results are available at https://github.com/marcdouguet/dheform.) The corpus is edited in XML-TEI, with markup describing the division into acts, scenes, speeches and verse lines. (The texts are available at https://github.com/dramacode/tcp5. They were provided to us by the "Bibliothèque dramatique" project (http://bibdramatique.parissorbonne.fr/), directed by Georges Forestier, and the "Théâtre classique" project (http://theatre-classique.fr/), directed by Paul Fièvre. We thank them both for making their XML sources available, without which this work would not have been possible.) We developed a syllable counter capable of splitting lines at the caesura and extracting each of the two hemistichs separately. It is more modest than other tools developed for the automatic analysis of verse (notably Beaudouin, 2002; Delente and Renault, 2015; Salvador, 2016), since it does not aim to place the boundary between two syllables within a word exactly. In order to produce a dictionary of diereses and synereses, we first trained it by checking the results manually. The syllable counter automatically recognizes as 12-syllable lines 99.98% of the 55,031 lines by Corneille previously verified to be alexandrines. The margin of error is due solely to the ambiguity of certain words whose pronunciation changes with their grammatical category (for example "content" and "fier", depending on whether they are verbs or adjectives). The corpus is composed of 332,938 lines, i.e. in theory 665,876 hemistichs. We retained only 624,597 of them, after excluding those distributed over several speeches. The number of occurrences of each hemistich is computed after removing punctuation and capitalization.
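The extraction and counting of hemistichs can be sketched as follows. This is not the authors' actual script (which is available in the repository cited above); count_syllables stands for the trained syllable counter just described and is assumed given.

import re
from collections import Counter

def hemistichs(line, count_syllables):
    # split an alexandrine at the caesura, i.e. after the 6th syllable;
    # count_syllables(word) is the trained syllable counter assumed here
    words, total, first = line.split(), 0, []
    for w in words:
        total += count_syllables(w)
        first.append(w)
        if total == 6:
            return " ".join(first), " ".join(words[len(first):])
    return None   # no clean caesura (a word straddles the 6th syllable)

def normalize(h):
    # occurrences are counted after removing punctuation and capitals
    return re.sub(r"[^\w\s'-]", "", h).lower().strip()

# counts = Counter(normalize(h) for line in corpus
#                  for h in (hemistichs(line, count_syllables) or ()))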
3. Frequency of repeated hemistichs
The phenomenon of the textual repetition of hemistichs is of a different order of magnitude from the similar phenomenon affecting whole lines. In our corpus, 499 lines are repeated at least once, i.e. only 0.1%. For someone with a thorough knowledge of the corpus, these repetitions can often be spotted manually, and critical editions point some of them out (one thinks in particular of the famous "Je suis maître, je parle, allez, obéissez" in Corneille's La Mort de Pompée, taken up in Molière's L'École des femmes). The stakes of such borrowings would deserve study (plagiarism, parody, quotation of one character by another, refrain effects, etc.). The repetition of hemistichs raises different issues, owing both to the brevity of the repeated segments and to the very large number of repetitions: 16% of the hemistichs of the corpus are repeated at least twice, and a hemistich appears on average 1.11 times. Writing in verse thus draws on a number of pre-existing stock phrases and idiomatic expressions, which the playwright combines in an original way. In addition to the quantitative counts, we also developed a reading interface (accessible at http://obvil.lip6.fr/dheform): the user can enter a text, whose repeated hemistichs are then highlighted by a colour code.

4. Analysis of the most frequent hemistichs
By way of example, the following table lists the 10 most frequent hemistichs of the corpus, with their number of occurrences and two examples in context:

en cette occasion (119): "Que me donne l'amour en cette occasion" / "N'offrez donc point, Seigneur, en cette occasion"
en l'état où je suis (98): "Que ferai-je, Philante, en l'état où je suis ?" / "Je ne réponds de rien en l'état où je suis."
pour la dernière fois (87): "Dites-lui de ma part pour la dernière fois" / "Pour la dernière fois je me jette à vos pieds."
à votre majesté (87): "Le respect que je dois à votre Majesté" / "Je me livre, grand Prince, à votre Majesté,"
que votre majesté (70): "Que votre Majesté le rappelait près d'elle." / "Ah ! Grand Roi, se peut-il que votre Majesté"
en cette extrémité (68): "Mettre tout en usage en cette extrémité ;" / "Quoi ? vous m'abandonnez en cette extrémité,"
je vous l'ai déjà dit (55): "Je vous l'ai déjà dit, sans vous parler de moi," / "Je vous l'ai déjà dit, j'estime votre flamme,"
une seconde fois (51): "Je renonce à choisir une seconde fois ;" / "J'en ferais un ingrat une seconde fois."
les armes à la main (42): "Les armes à la main, venez si bon vous semble," / "Laissez-nous lui parler les armes à la main,"
de votre majesté (41): "Qui vient offrir aux pieds de votre Majesté" / "Il tira des bienfaits de votre Majesté :"

If the analysis is widened to the 470 hemistichs with more than 10 occurrences, several categories of recurrence can be distinguished.
Many hemistichs are composed of a noun of three or more syllables, preceded by prepositions, conjunctions and determiners, and placed in the position of subject, noun complement or object. In this configuration, several variations around the same noun can be identified: "à votre majesté" (87 occurrences; from now on the number of occurrences of a hemistich is given in parentheses), "que votre majesté" (70), "de votre majesté" (41), "de générosité" (40), "la générosité" (26), "à ma confusion" (30), "cette confusion" (15), etc. The nouns concerned relate mainly to moral or political themes, characteristic of the dramatic writing of the 17th century. More interesting are the adverbial complements that insist on the exceptional character of the situation and on the emotional state of the speaker, thereby reinforcing the pathos of the discourse: "en cette occasion" ("on this occasion", 119), "en l'état où je suis" ("in the state I am in", 98), "en cette extrémité" (68), "en ce malheur extrême" (23), "en cette conjoncture" (22). Numerous expressions modalize the utterance: irritated insistence ("je vous l'ai déjà dit", "I have already told you", 55), certainty ("il n'en faut point douter" (37), "il n'en faut plus douter" (25)), preterition ("je ne vous dirai point" (40)). One may also note the series "pour la dernière fois" ("for the last time", 87), "une seconde fois" (51), "pour la première fois" (29), which links a dramatic situation to others, past or to come. Certain fixed syntagms have, on the contrary, a referential function: violence of relationships ("les armes à la main", "weapons in hand", 42; "un poignard dans le sein" (27)), instinct ("la voix de la nature", "the voice of nature", 19), power ("la suprême puissance" (25), "une entière puissance" (24), "un absolu pouvoir" (22)), etc. Temporal expressions are numerous, and may be associated with a general maxim describing the mores of the time ("dans le siècle où nous sommes" (17)) or with the urgency of a situation ("sans tarder davantage" (19)). The high frequency of "avant la fin du jour" ("before the end of the day", 31) shows the extent to which playwrights make the observance of the unity of time explicit in their works in order to heighten dramatic tension. Spatial expressions likewise refer either to universality ("sur la terre et sur l'onde" (16)) or, on the contrary, to the places frequently invoked in classical theatre ("dans son appartement" (20), "dans la chambre prochaine" (16)). These fixed expressions can often be regarded as "chevilles" (metrical fillers), where one clearly senses verbal invention submitting to the constraints of the metre. Two cases can be identified here. On the one hand, the semantic content of some adverbial hemistichs is sometimes very weak: "en cette occasion" or "en l'état où je suis" could just as well be deleted without harming the meaning of the text, or grafted onto any utterance. On the other hand, even when better anchored in the utterance, the fixed expressions we have noted ("la suprême puissance", "la voix de la nature") certainly owe their success to the fact that they fit easily into the mould of the alexandrine. It is here the recurrent apposition of an adjective (power will be "entière" or "suprême"), or the use of a figurative formula ("la voix de la nature" instead of "la nature"), that is justified by the constraints of versification.
It would be interesting to pursue this analysis by crossing it with Jakobson's theory of the poetic function of language, partly summed up in the following example: "Without its two dactylic words the combination 'innocent bystander' would hardly have become a hackneyed phrase." (1960: 358)

5. Verse and prose
In order to assess the specificity of poetic writing, we assembled a corpus of prose plays of the same period (11 tragedies by d'Aubignac, Baro and Puget de La Serre, and 9 comedies by Molière). (We wish to thank Bernard J. Bourque, who provided us with the digital version of his edition Abbé d'Aubignac, Pièces en prose, Tübingen, Gunter Narr Verlag, coll. «Biblio 17», 2012.) We counted the number of occurrences of each expression corresponding to a recurrent hemistich, relative to the respective sizes of the two corpora measured in number of words. Some "hemistichs" (the quotation marks are necessary here) are as frequent in verse as in prose, but there is no clear correlation between the two corpora, even though both belong to the dramatic genre. Now, the "hemistichs" found as frequently in prose as in verse, or even more frequently, are those that rest both on a single noun (long enough to fill the six syllables together with the determiners, prepositions and conjunctions preceding it) and that do not function as adverbial complements. The fact that they figure among the most frequent hemistichs in the verse corpus is simply explained by the fact that the noun in question is itself extremely frequent. By contrast, the fixed formulas that rest on an association of several terms and merely contribute a modalization are indeed over-represented in verse (for example "je vous l'ai déjà dit": 17 occurrences per million words in verse, 0 in prose; "il n'en faut point douter": 12 in verse, 0 in prose; "pour la dernière fois": 28 in verse, 9 in prose). These expressions, specific to verse theatre, thus do seem to owe their over-use to the need to pour the sentence into the mould of the alexandrine.

6. First and second hemistichs
One of the shortcomings of this approach is that it overrates the caesura to the detriment of the unity of the line, treating it as a break, a pause between two independent segments. Two pitfalls loom. On the one hand, one risks forgetting that the hemistich does not always constitute a relevant syntactic unit within a line. The playwrights of the 17th century often practise internal rejet, contre-rejet and enjambment (for example: "Le temps de cet orgueil me fera la raison", in Corneille's La Galerie du Palais). However, our project is above all lexical, not prosodic. Isolating the hemistichs is merely a way of bringing out idiomatic expressions, on the grounds that, exceptions notwithstanding, the caesura at the hemistich usually remains the strongest syntactic break in the line. Nor should one forget that elocution fuses the two hemistichs into a single movement, and that they are therefore not on the same level: a poem in alexandrines is not a sequence of hemistichs.
Here, the automatic analysis we carried out actually provides arguments in favour of the unity of the line, for it brings out several differences between first and second hemistichs, which complete and confirm Beaudouin's analyses (2002: 275-319) concerning the distribution of phonemes and morphosyntactic categories according to metrical position. They differ first of all in their repetition rate. 13% of the hemistichs placed in first position are used elsewhere in our corpus (whether in first or in second position), which is below the overall recurrence rate. By contrast, this percentage rises to 18% for hemistichs placed in second position. This divergence is easily explained by the fact that the second hemistich is subject not only to the constraint of the metre but also to that of the rhyme. If we consider the proportion of hemistichs beginning with a vocalic sound, we also observe an imbalance: 27% of first hemistichs, but 30% of second ones. The difference is small, but it seems to us to quantify the constraint posed by the presence of an e at the end of the first hemistich, which would be faulty if the second began with a consonantal sound. Thus, while the first hemistich may begin with any sound, a hemistich beginning with a vocalic sound is easier to place in second position than one beginning with a consonantal sound. Finally, the most frequent hemistichs are not the same in first and in second position. Some are used in either place ("en l'état où je suis", for example, appears 40 times in first position and 58 times in second), but a clear distribution is often observed: hemistichs modalizing the utterance come more often first ("je ne vous dirai point": 39 to 1; "je vous l'ai déjà dit": 52 to 3; "je vous le dis encor": 20 to 2), while hemistichs functioning as complements come second ("à votre majesté": 85 to 2; "de votre majesté": 40 to 1; "à mon ressentiment": 37 to 0).
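Position-dependent statistics of this kind reduce to simple counts once the hemistichs are extracted. A sketch follows, with a deliberately rough orthographic test for vocalic onsets (aspirated h and other phonetic subtleties would need real handling); firsts and seconds stand for the normalized hemistichs in first and second position.

from collections import Counter

def repetition_rate(these, others):
    # share of hemistichs in one position that occur elsewhere in the
    # corpus, in either position
    pool = Counter(these) + Counter(others)
    return sum(1 for h in these if pool[h] > 1) / len(these)

VOWELS = set("aeiouyàâéèêëîïôûù")   # rough orthographic test of the onset

def vocalic_onset_share(hems):
    return sum(h[0] in VOWELS for h in hems) / len(hems)

# repetition_rate(firsts, seconds) ~ 0.13
# repetition_rate(seconds, firsts) ~ 0.18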
Considering recurrences at the more abstract level of the syntactic motif (in the line of work of Ganascia, 2001; Longrée et al., 2008; Mellet et Longrée, 2013; Legallois et Prunet, 2015) will moreover allow us to group occurrences displaying a similar syntactic structure (« la voix de la nature », « le flambeau de la guerre », « les fruits de la victoire ») or centred on the same terms (« qu'on le/la/les fasse venir »). Finally, the relative frequency of these recurrent hemistiches strikes us as a particularly promising statistical tool for assessing the specificity of the writing style of a genre or an author, and for observing how these set phrases evolve over time.

References
Beaudouin V. (2002). Mètre et rythmes du vers classique. Corneille et Racine. Honoré Champion.
Delente É. et Legallois D. (2016). La répétition littérale dans Les Rougon-Macquart : présentation d'un phénomène peu connu. Excavatio, vol. 28.
Delente É. et Renault R. (2015). Outils et métrique : un tour d'horizon. Langages, vol. 199: 5-22.
Ganascia J.-G. (2001). Extraction automatique de motifs syntaxiques. In Maurel D. (ed.), TALN - RECITAL 2001 : 8e conférence annuelle sur le Traitement Automatique des Langues Naturelles.
Jakobson R. (1960). Closing statements: Linguistics and Poetics. In Sebeok T. A. (ed.), Style in Language. The Technology Press of MIT/John Wiley and Sons.
Legallois D. (2009). À propos de quelques n-grammes significatifs d'un corpus poétique du XIXe siècle. L'Information grammaticale, vol. 121: 46-52.
Legallois D. et Prunet A. (2015). Sequential patterns: a new corpus-based method to inform the teaching of language for specific purposes. Journal of Social Science, vol. 44: 127-140.
Longrée D., Luong X. et Mellet S. (2008). Les motifs : un outil pour la caractérisation topologique des textes. In Heiden S. et Pincemin B. (eds.), JADT 2008. 9es Journées internationales d'Analyse statistique des Données Textuelles, pp. 733-744.
Mellet S. et Longrée D. (2013). Le motif : une unité phraséologique englobante ? Étendre le champ de la phraséologie de la langue au discours. Langages, vol. 189: 65-79.
Salem A. (1987). Pratique des segments répétés. Essai de statistique textuelle. Klincksieck.
Salvador X.-L. (2016). Versification : outil d'analyse du mètre français (http://www.projetprada.fr/versification and https://gist.github.com/xavierLaurentSalvador).

«Mangiata dall'orco e tradita dalle donne» ("Eaten by the ogre and betrayed by women"). Old and new media narrate the Asia Argento affair, between storytelling and hate speech

Francesca Dragotto1, Sonia Melchiorre2
1 Università di Roma Tor Vergata – dragotto@lettere.uniroma2.it
2 Università della Tuscia – melchiorresmr@unitus.it

Abstract 1
Re-enacted and dissected in the national and international news, the narration of the rape denounced by Italian actress Asia Argento triggered a series of coming-outs revealing the violence perpetrated against other actors and actresses by prominent personalities of the Hollywood star system. Textually moulded between diffuse narration and the blink of a tweet, the story hooked the public while displaying, in the Italian media in particular, a morbid legitimation of victim blaming. Asia Argento became the object of hate speech revealing, in turn, a cultural palimpsest of lies and guilty silences deriving from stereotypes represented in comments of the crassest and basest order.
The present discussion therefore starts from a quantitative and qualitative analysis of texts, in English and Italian, reporting the story, and aims to reveal the similarities and differences between the language practices substantiating the discourse of violence. A second corpus, drawn from social networks, will also document the indignant reactions of internet users to this story and help identify the language patterns at the core of gender-based violence.

Abstract 2
Picked to the bone by national and international news coverage, the narration of the sexual violence denounced by the Italian actress Asia Argento acted as the detonator of an explosive series of coming-outs revealing analogous episodes suffered by other actresses and, to a lesser extent, actors, at the hands of prominent figures of the Hollywood star system. Poured into every textual mould between diffuse narration and the terse tweet, the plot of this affair has held, and still holds, significant media attention, feeding a debate which, in the Italian case, often proved more interested in finding reasons to legitimise victim blaming than in reconstructing the coordinates of the context, psychological above all, in which the violence allegedly took place. The object of innumerable discourses of hate, the account given by the Italian press is a useful object for investigating social sentiment towards stories of violence whose protagonists are famous people (women in particular), towards whom reactions of indignation are activated, mixed with the dregs of the basest gender stereotypes. Starting from the quantitative and qualitative analysis of a corpus of texts centred on this affair, produced in English and Italian, the authors set out to bring to light the points of contact and of separation between the different forms of news coverage, together with the lexical, semantic and pragmatic constellations that substantiated them. This analysis will be complemented by that of a second corpus, extracted from the rich social-media production of accounts, some individual, some belonging to groups known for their tireless activity of indignant commentary on current affairs. The final aim of the work is to intercept the possible linguistic and conceptual pattern of gender-based violence, testing its limits of validity within different systems and within different varieties of the same system.

1. The narration
Humiliated and offended. Such was the fate of the Italian actress Asia Argento, among the first to denounce the violence she suffered at the hands of the Hollywood film producer Harvey Weinstein. She had the courage to publicly expose her rapist together with some eighty other women who, like her, endured first a physical outrage and then an unprecedented media exposure. It is significant, from a narrative point of view, that the affair was triggered by a tweet and subsequently ricocheted through the media of the whole world. In the brief span of a tweet, Asia Argento revealed the names of all the women who had courageously denounced the violence perpetrated against them by a man who believed himself powerful and untouchable.
And so, from the victims' account springs a new narration in which the women become survivors, giving voice to their anger against a patriarchal, sexist and misogynist system, condensed into an already historic slogan: Me too, in which all the women of the world who have suffered violence recognised themselves. Two words then transubstantiated into Time's Person of the Year 2017, took flesh in the black-clad bodies of the actresses attending the 2018 Golden Globes, and finally turned into Time's Up, a refrain offered as an impulse to transform anger into a (re)constructive force, one that would, it was speculated, accompany Oprah Winfrey in a run for the White House. In Italy, meanwhile, it remains a struggle even to admit that the words used by the media in the Asia Argento case demonstrate the existence of a serious cultural problem. In our country toxic words, collectively known as hate speech, led to a shameful victim blaming of Asia Argento: a label that sounds euphemistic to Italian ears but ends up taking the form of an argumentative text from whose weave pour violence and media persecution (contemptuous, sneering irony in the best cases), directed not at the aggressors but at the victims of sexual violence. This tendency emerges clearly from even a cursory examination of texts retrievable from the web. A selection of them, in Italian and English, was collected here and then subjected to contrastive analysis. The analysis revealed a tendency to use terminology, systematically sustained by aggressive tones, that betrays a more complex collusion with a culture it would be hasty to dismiss as merely phallocentric, misogynist and shot through with locker-room male homosociality. Carrying meaning both for what and how it says and for what it leaves unsaid, the language of these texts (and of every text in general) is a door onto the ideological architecture that supports it and that supports the coordinates of those who use it: an architecture concealing a sclerotic world which, in the case at hand, pairs the aggressive attitude of the offender with the injured dignity of the offended, and which must be brought to light if we want to understand the dynamics guiding behaviour in this stretch of time in which social and communicative life is governed by the structures of social media. Pending the results of the analysis of a better-structured and more systematic corpus (with all the limits that systematicity applied to text in the cognitive sense may have), this first phase presents the most significant cores obtained by sampling. The fragments proposed were chosen because each is representative of a corpus with analogous characteristics.

1.1 Victim blaming
These are some of the questions put to Argento by G.M. Tammaro of La Stampa (15 October 2017), immediately after the actress's public denunciation.
It is hard not to recognise in them the centuries-old narrative scheme of the interrogation of the victim of violence (think, above all, of the first rape trial in history, that of 1612, directed at the painter Artemisia Gentileschi rather than at her aggressor Agostino Tassi): the text-genre consists of questions which the person who suffered the violence must answer in detail so as not to be accused of collusion with the predator.1 The elements considered relevant to the discussion are highlighted in bold.

1 http://www.lastampa.it/2017/10/15/italia/cronache/un-orco-mi-ha-mangiata-lacosa-pi-sconvolgente-i-tanti-attacchi-dalle-donnehUwq9t9TFgRHkmcjU8yhAL/pagina.html (last accessed 11/01/2018).

1. Why did you decide to reveal this story after so many years?
2. Don't you think that speaking earlier would have spared other women what you went through?
3. What hurt you the most?
4. And how do you react?
5. How did you live through these years of silence?
6. Do you still feel guilty about it?
7. What did you fear might happen to you had you pressed charges at the time of the events?
8. Fabrizio Lombardo, former head of Miramax Italia, denies having brought you to Harvey Weinstein, as you instead maintain.
9. After the first meeting in a hotel on the Côte d'Azur, did you begin a relationship with Weinstein?
10. Did Weinstein try to contact you again?
11. Did you accept?
12. What was Weinstein's attitude towards you?
13. How did your behaviour towards Weinstein change?
14. So you met on other occasions?
15. But then you decided to come forward in person: why?
16. In Italy not everyone sees it that way. Not everyone believes you. Not everyone is on your side.
17. You are also accused of having signed the petition in favour of Roman Polanski, under investigation for paedophilia.
18. Do you regret it?
19. After coming forward together with the other women and recounting what happened to you, what do you hope will happen?

Placed one after the other, the questions take the form of a narration in their own right, characterised by a constellation of terms and by a semantics centred on the victim not as such but as a witness who must provide explanations for what happened and justify her silence. The following, instead, are some of the statements made, in various capacities, by Mario Adinolfi, Vittorio Feltri and Vittorio Sgarbi, which bounced around numerous sites and newspapers across the world, among them the New Yorker first, then the Guardian and the Independent. The Guardian article reports, for example, the following words: "Far from being hailed as brave, Argento's allegations were initially treated in some Italian media outlets with a mix of scepticism and scorn", where one is struck by the pairing of brave, used by the journalist to describe Asia Argento, with the generalised attitude of scepticism and scorn (scorn also conveying the idea of refusal, the non-acceptance of something being proposed). The journalist then reports Asia Argento's words: "Here people don't understand. They'll say, 'oh it's just touching tits'. Well yeah, and this is a very grave thing for me. It is not normal. You can't touch me, I am not an object". The piece does not omit the detailed description of the violence suffered by the actress, nor Vittorio Feltri's offensive comment belittling the sexual act as merely oral sex (licking, not oral sex, in his rendering).
The most telling element of the article remains one of the journalist's concluding sentences: "For now, not a single fellow female actor who is well known has spoken out in support of her, even though the Italian film industry is rife with abuse", where rife with abuse points on the one hand to the repetition of such acts and, on the other (rife meaning 'packed full'), alludes to the collusive attitude of those who, by keeping a conspiratorial silence, fail to report them. In another article, in the Independent, again in Great Britain, Lydia Smith writes: "But she was subsequently criticised by some sections of the Italian media for not coming forward sooner about the alleged assaults, despite hesitation being common among survivors for fear of reprisals, among other reasons. […]". She then reports the interventions of Renato Farina published in Libero and his victim-blaming comments, aimed at making the victim culpable and typical of someone left far, far behind by a world moving fast:2 "Conservative newspaper Libero published an op-ed by Renato Farina, with the headline: 'First they give it away, then they whine and pretend to repent'".3

2 http://www.liberoquotidiano.it/news/opinioni/13264032/harvey-weinsteinrenato-farina-scandalo-sessuale-hollywood.html (last accessed 08/01/2018).
3 http://www.independent.co.uk/arts-entertainment/films/news/harveyweinstein-sexual-assault-asia-argento-flees-italy-public-condemn-speaking-outa8012511.html (last accessed 08/01/2018).

1.2 Hate speech
"If you report a rape in Italy, you are the whore." And again: "Only in Italy am I considered guilty of my own rape because I did not speak of it when I was 21," Asia Argento declared after the criticism and verbal aggression she received in the Italian media, including from celebrities, who insinuated or openly stated that "You can always say no...". On 13 October Asia Argento returned to the Weinstein case with a bitter tweet: "I reported a rape and for this I am considered a wh...". But the entertainment world addressed the matter in a way it is euphemistic to call prudent. "I know Asia Argento well and I respect her," said Vladimir Luxuria. "When I read that she recounted being forced into oral sex, my first reaction was solidarity. But when I read that, after suffering this violence, she made a film with him, walked the red carpet with him at Cannes and kept seeing him for five years, I told myself something didn't add up. Unfortunately, in these affairs one must have total credibility, otherwise a single piece of fake news is enough to throw everything into question: […]". Ottavia Piccolo, a much-admired stage and film actress, preferred to pass over the matter: "These things have always happened; I don't want to talk about them because I would risk saying nothing but banalities". Rita Dalla Chiesa, by contrast, tackled the subject without hesitation: "The fear of losing one's job can certainly be there. But if a person felt genuinely offended and traumatised and then, instead of fleeing, stays inside this negative circle, takes trains and planes and goes to hotel appointments, I would no longer speak of rape but of a conscious relationship". She then cited the words of Barbara Palombelli, with which she declared her agreement: "[…] You were raped? Then why do you say so years later? Too convenient. Let's not wage feminist battles over things that have nothing to do with feminism". "I must be a rare bird," Alba Parietti said instead, "but nothing of the kind has ever happened to me.
Sometimes one's attitude alone is enough to discourage a man. The central point of the problem is fear: women's eternal fear of men, of their power, of not being believed. I know powerful women managers who, when they get home, let their husbands beat them. Because this kind of attitude does not concern only show business but every working environment. With one aggravating factor: in show business you are not chasing a 1,200-euro-a-month job, but fame and success." On 26 October 2017 Guia Soncini, a columnist for the magazine Gioia, commented in the New York Times on the failure of Italian feminism, referring to the Asia Argento affair:4 "This episode is another example of my country just being male-run, sexist Italy […] This, in a country that has a total of zero national newspapers edited by women and zero female columnists in its main national papers. […] Where the reaction to Ms. Argento's account has been truly vicious has been on social media. And there, it has primarily come from women. […] What this tells us about Italian feminism isn't clear, but it's certainly ugly. […] There's something underripened about the state of feminism in my country". A pity that Soncini herself had posted a decidedly un-feminist tweet a few days earlier (10 October 2017): "I dream of a one-line piece on Weinstein. He may well be an old pig, but you all kept slinging it at him, as long as you thought it would pay", something that did not escape Asia Argento herself. The most direct attack was the one launched on Facebook by Selvaggia Lucarelli in a very long post:5 "Now. Frankly. You go to bed with a powerful lecher for years and don't say no for fear he might ruin your career. Legitimate. Twenty years later you whine in an American newspaper, recounting encounters you had as a consenting woman, which moreover took place at a more than adult age and required crossing oceans, with trips and travel arrangements to organise, painting them as 'abuse'. Less legitimate. At a glance, these are abuses a little too prolonged and planned to deserve the name. And if you are the first to say you did it so that your career would not be damaged, you are admitting you went for reasons of opportunity. Nobody is judging you, Asia Argento. But please. Champion of the victims of harassment, abuse and rape: no. Let's say you ended up in a putrid whirlpool of squalid quid pro quos and repented. Twenty years late, though."6

4 https://mobile.nytimes.com/2017/10/26/opinion/italian-feminism-asia-argentoweinstein.html?partner=IFTTT&_r=0&referer=https://t.co/pj6FLcp4Fx (last accessed 10/01/2018).
5 https://www.leggo.it/gossip/news/asia_argento_stuprata_da_weinstein_selvaggia_lucarelli_frigni_dopo_20_anni_foto_video_11_ottobre_2017-3295503.html (last accessed 10/01/2018).
6 https://www.leggo.it/spettacoli/cinema/asia_argento_weinstein_sfogo_twitter_12_ottobre_2017-3297028.html (last accessed 09/01/2018).

1.3 Stockholm syndrome
At the beginning of this year the media reported the news of yet another femicide in Italy. It was discovered, with astonishment, that the woman, burned alive by her partner, had repeatedly defended her aggressor. This attitude has a name: Stockholm syndrome, a syndrome that seems to afflict many women and whose effect should at least be weighed when explaining the reactions of the many women who responded by attributing responsibility for what happened to Argento, leaving her aggressor entirely out of it. Natalia Aspesi, feminist and woman of culture, maintained that "If you ask me for a massage in the office and I grant it, I cannot then be surprised at how it ends". And again: "Producers, at least for as long as I can remember such affairs, have always acted this way. And the girls, on the famous sofa, settled in knowingly.
They were in a hurry to arrive. And in an even greater hurry were their lawful mothers who, without scruples of any kind, happily threw onto that sofa their heiresses in search of a role, any role."7 "The exception to the rule," Aspesi adds, "is Sophia Loren, who married a producer to protect herself from unwanted attentions." Asked whether she is justifying Weinstein, she further replies: "I justify nothing. Feminism is still one of the most important missions for women all over the world, perhaps the most important of all. It is something I believed in and still believe in blindly. But it does not seem to me that these denunciations will allow it to make a decisive leap forward. I may be wrong, but I have my doubts". Her doubt: "That this is a fratricidal vendetta to get Weinstein out of the way. He was a producer as powerful as few and as filthy as very many others. That the story, known for decades, should have come out with this virulence only now, accompanied by dozens of testimonies, cannot be accidental". Completing the survey, an unsigned piece issued by ADN Kronos (13/10/2017) manages, with its title alone, to sum up the state of the controversy, Donne che odiano le donne, gogna social per Asia Argento ("Women who hate women: a social-media pillory for Asia Argento"): "[…] And in the morass of social-media comments accusing her by turns of opportunism, prostitution and sensationalism, those who strike hardest are, incredibly, above all women. Scrolling through the comments on the articles devoted to the case these days by the main newspapers, it is not difficult (indeed, it is impossible not) to run into the many insults hurled at the actress: they are written by mothers, grandmothers, girls, students, all convinced of Asia Argento's guilt, guilty in the best case, for those commenting, of having waited too long to speak or, in the worst, of having prostituted herself in exchange for a place in the Hollywood sun".8

1.4 The decision to leave Italy
"Newspapers 'slut-shamed' Asia Argento so badly over the Weinstein saga that she's leaving Italy,"9 the foreign press often reported in covering the evolution of the Asia Argento saga, judging her courageous and an inspiration to other women. Out of her homeland. "Part of the criticism from some Italian newspapers and social media users revolves around the counter-argument that these celebrities should have come forward years ago (we debunked this argument here). While these newspapers and internet users are hardly the only ones engaging in this form of victim-blaming, the violent tone used by some is alarming and astonishing […]." The piece then cites the case of Renato Farina. The reaction surprised the foreign press all the more because it had an easy point of comparison in the solidarity shown to the American actresses who had made analogous accusations against Weinstein.

7 https://www.vanityfair.it/news/approfondimenti/2017/10/11/weinsteincommento-natalia-aspesi (last accessed 11/01/2018).
8 http://www.adnkronos.com/fatti/cronaca/2017/10/13/donne-che-odiano-donnegogna-social-per-asia-argento_4KNSPMO49OoLtVvox04GWN.html
9 http://mashable.com/2017/10/18/asia-argento-harvey-weinstein-sexualharassment-slut-shaming/#YIIO i.0cNaql
When news of the voluntary expatriation reached Laura Boldrini, the President of the Chamber of Deputies addressed an appeal to the actress, asking her to desist from her intentions: «Resta in Italia, non mollare» ("Stay in Italy, don't give up").10 Long engaged in activities against violence against women, she commented from New York to the Corriere della Sera: "I have not had a chance to call Asia Argento because I am on mission in New York and Canada. But I send her this message: one must stay in Italy to strengthen solidarity among women. Asia, don't give up". She then added: "I detest the fact that Asia Argento should have to justify herself […]. This is the world upside down; it does not matter whether and when a woman decides to report an abuse. Those are her choices. The scandal is that a man of power, this Weinstein, felt free to jump on girls who wanted to work. This is the rotten system that must be uprooted". The President of the Chamber is herself, for that matter, no stranger to the web's denigrating action, which has often made her the target of avalanches of insults and violent words. Among others, Boldrini's intervention was reported by the daily Libero, which,11 on 19 October 2017, headlined Laura Boldrini: "Cara Asia Argento resta in Italia, le donne sono con te" ("Dear Asia Argento, stay in Italy, women are with you"), an article sparing in commentary but in which the language, disrespectful of gender and of Italian morphology (above all la presidenta, where the standard feminine is la presidente), communicates far more than many words would have: "'As far as harassment and rape are concerned,' la presidenta underlined, 'the problem is men and their behaviour […]'".

10 https://www.vanityfair.it/news/cronache/2017/10/19/caso-weinstein-lauraboldrini-asia-argento
11 http://www.liberoquotidiano.it/news/politica/13266009/laura-boldrini-caraasia-resta-in-italia-donne-sono-con-te-minigonna-uomini.html

2. Final considerations
Pending a scandal with the roles reversed, which, given the dominant cultural and linguistic stereotypes, would today foreshadow an entirely different kind of commentary, we shall limit ourselves to a handful of citations which, even if considerably enlarged, would not (the author is convinced) shift by a comma the state of polarisation that took shape in Italy from the very first days of the affair's diffusion: an oppositional polarisation recalling that typical of sports fandom and, more recently, of political faith, which seems to render people incapable of adopting, even provisionally, a perspective even partly different from their original one, and from which no comment seems able to escape.
For this reason, to prevent even the descriptive approach typical of text analysis from being accused of partisanship by one side or the other, the reference corpus of this work would need to be enlarged at least with: a quantitative and qualitative examination of all the tweets in Asia Argento's account, with reference to the profiles that generated them; an at least quantitative examination of the segments and contexts in which the term vittima ('victim') appears explicitly or is otherwise evoked; an examination of the contexts and forms used to speak of the offender; and the social-media activity arising from the coverage of the year's key moments concerning violence or gender claims, specifically regarding women, such as the day against violence against women or 8 March. Already carried out on a sample basis, the collection and analysis of messages shows a stubborn tendency to repeat one's own communicative acts impermeably, in content as much as in form and in the constellations of terms accompanying whatever focus is under discussion at the moment: an unmistakable sign of the position that the elements from which each constellation radiates occupy in the encyclopedia, consciousness and sensibility of the Italian-speaking linguistic community.

The what and the how of the narrative process. The combined use of Text Analysis and Network Text Analysis in the study of employment precariousness

Cristiano Felaco1, Anna Parola2
Università degli Studi di Napoli Federico II – cristiano.felaco@unina.it; anna.parola@unina.it

Abstract
This paper presents analytic procedures for the joint use of Text Analysis and Network Text Analysis. Text Analysis detects the main themes in the narrations and hence the processes of signification; Network Text Analysis tracks the relations between the linguistic expressions of a text, thereby identifying the paths along which thought flows. Used jointly, the two methods make it possible not only to explore the content of narrations but also, starting from the words and concepts with the greatest semantic weight, to identify the processes of signification. To this end, we present a study of how high-school students perceive employment precariousness in Italy. The lexical corpus was built from narrations collected from 2013 to 2016 on the Repubblica blog "Microfono Aperto".

Riassunto
This work presents the analytic procedures for a joint use of Text Analysis and Network Text Analysis techniques. The first captures the main themes addressed in the narrations and hence the processes of signification; the second traces the relations between the linguistic expressions of a text, identifying the paths of the flows of thought. The combined use of the two techniques thus makes it possible not only to explore the contents of the narrations but also, working on the words and concepts with the greatest semantic weight, to reconstruct the paths through which meaning is built. To this end we present a study aimed at understanding the perception of upper-secondary-school students regarding employment precariousness in Italy.
The textual corpus was created from the narrations collected from 2013 to 2016 on the Repubblica blog "Microfono Aperto".

Keywords: Thematic Analysis of Elementary Contexts; Network Text Analysis; Employment Precariousness; Students.

1. Introduction
Narration, and more specifically the act of narrating, is a process of constituting a textual weave endowed with sense and conveying meanings. Analysing texts makes it possible to grasp, on the one hand, the narrator's perception of a given topic and the process of signification attributed to the narrated experience, and, on the other, to understand the flows of thought, entering into the specifics of the words used and of their sequence. Textual statistics in the service of narrations therefore allows an in-depth recognition of the meaning of words and of the sense they carry (Bolasco, 2005). Among content-analysis techniques, the combined use of Text Analysis (TA) and Network Text Analysis (NTA) lends itself well to these purposes. If TA captures the themes addressed, the words chosen and used, and the dimensions of sense attributed to them (Lebart et al., 1998), that is, what is narrated, NTA offers a further deepening of how it is narrated. By analysing the position of words within the textual network, one can trace the words with the greatest semantic weight, thereby identifying the various paths and contexts of meaning (Hunter, 2014) through the study of the nature of the relations between terms. Starting from the assumption that the structure of relations between the words of a text may correspond to the mental models and cognitive maps deployed by the text's authors (Carley, 1997; Popping et Roberts, 1997), this method models language as a network of words and relations through the creation of a cognitive map (Popping, 2000). The concept is the (mental) nucleus represented by a term or a linguistic expression; terms can be related to one another, forming a statement. Statements sharing a common concept form an interdependent structure, thus creating a conceptual map, or textual network, made up of points (or nodes) representing the individual words (or concepts) and of lines, that is, the ties connecting them.

2. Methodology
The proposed approach first subjects the texts to a statistical analysis of textual data using the automatic-analysis software T-lab, and then analyses them from a network perspective using the software Gephi.

2.1 Pre-treatment of the texts
With the texts grouped into a single corpus, the first processing phase consists of a normalisation of the corpus and a customisation of the dictionary. The former aims to recognise words as graphic forms, which entails a transformation of the corpus (elimination of excess blank spaces, marking of apostrophes, lower-casing of capitals) and the creation of strings for multi-word expressions, sets of words whose unitary meaning cannot be derived from that of their components, leading to the creation of multiwords. The dictionary-customisation phase is carried out through the procedures of lemmatisation and disambiguation of the text, which rename the graphic forms as lemmas.
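As an illustration of this pre-treatment, here is a minimal sketch in Python, assuming spaCy's Italian model as a stand-in for T-lab's normalisation and lemmatisation (the file name and the frequency threshold are hypothetical); it also computes the lexical-tractability measures discussed in the next paragraph:

    import spacy
    from collections import Counter

    # Stand-in for T-lab's pre-treatment: lower-casing and lemmatisation
    # with spaCy's Italian model (python -m spacy download it_core_news_sm).
    nlp = spacy.load("it_core_news_sm")
    raw = open("corpus.txt", encoding="utf8").read()   # hypothetical path
    lemmas = [t.lemma_ for t in nlp(raw.lower()) if t.is_alpha]

    # Lexical-tractability checks: type/token ratio and hapax share.
    counts = Counter(lemmas)
    ttr = len(counts) / len(lemmas)                    # adequate below 0.2
    hapax_share = sum(c == 1 for c in counts.values()) / len(counts)

    # Keep only key words above a minimum occurrence threshold (here 5).
    keywords = {w for w, c in counts.items() if c >= 5}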
The disambiguation step selects homographic forms in order to disambiguate them; the lemmatisation step, starting from the recognition of forms sharing the same lexical root (lexeme) or belonging to the same lexical category, reduces every adjective and noun to the masculine singular, every verb to the present infinitive, and so on. Once this phase is complete, the lexical characteristics of the corpus are checked to assess its statistical tractability, verifying the type/token ratio (adequate below 0.2) and the hapax share (adequate below 50% for large corpora, and at slightly higher percentages for medium or small corpora). Before proceeding to the analysis, one must also review the list of keywords, created automatically by the software, together with their occurrences within the corpus, and set a minimum occurrence threshold, excluding from the analysis all words appearing fewer than n times. The choice of threshold depends on the lexical characteristics and on the size of the corpus under analysis. The keywords can thus be taken in full, reduced according to the occurrence threshold, or further reduced according to the aims of the research.

2.2 Analysis of the texts through Thematic Analysis of Elementary Contexts
The Thematic Analysis of Elementary Contexts, by means of a cluster analysis, makes it possible to construct and explore the contents of the corpus under analysis (Lancia, 2004). The clusters consist of sets of elementary contexts defined by the same patterns of keywords and are described through the lexical units that most strongly characterise those elementary contexts. The cluster analysis is performed with an unsupervised hierarchical method (the bisecting K-means algorithm), based on the co-occurrence of semantic traits. Specifically, the analysis procedure consists of: co-occurrence analysis through the creation of a context-units × lexical-units data table with presence/absence values; pre-treatment of the data via TF-IDF and scaling of each row vector to length 1 (Euclidean norm); use of the cosine and clustering via the bisecting K-means algorithm; comparative analysis with creation of the lexical-units × clusters contingency table; chi-square tests at the cluster × lexical-unit crossings. As for the partition criterion that determines the number of clusters, an algorithm is used that relies on the ratio between inter-cluster variance and total variance, taking as the optimal partition the one in which this ratio exceeds the 50% threshold. Interpreting the position the clusters occupy in the factorial space, and the words characterising them, makes it possible to identify the implicit relations organising the subjects' thought, capturing the narrator's point of view on the narrated event. The latter also comprises a series of evaluative elements, reflections, meanings and value judgements, as well as affective projections.
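A compact sketch of this clustering step, under stated assumptions: scikit-learn's TfidfVectorizer and BisectingKMeans (available from scikit-learn 1.1) stand in for T-lab's implementation, and the elementary contexts are read one per line from a hypothetical file:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize
    from sklearn.cluster import BisectingKMeans

    # One elementary context per line (hypothetical input file).
    contexts = open("contexts.txt", encoding="utf8").read().splitlines()

    # Presence/absence matrix weighted by TF-IDF.
    X = TfidfVectorizer(binary=True).fit_transform(contexts)
    X = normalize(X, norm="l2")   # each row vector scaled to Euclidean length 1

    # With unit-norm rows, Euclidean bisecting K-means ranks points
    # exactly as cosine similarity would.
    labels = BisectingKMeans(n_clusters=4, random_state=0).fit_predict(X)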
2.3 Network analysis
The second step of the analysis loads the corpus into the software Gephi. This software organises the lemmas in an adjacency matrix (lemma × lemma), allowing the creation of a 1-mode network, a useful tool for visualising the structure of relations between the lemmas, represented as circles or nodes and connected by directed lines. This technique makes it possible to grasp how the nodes are connected to one another, identifying neighbourhood zones and singling out the nodes that occupy prominent positions in particular sets or in the whole network. To this end, various centrality measures are computed, among them degree centrality, which indicates the words most frequently used in connection with other words within the narrations and in the various contexts of meaning. More precisely, the incidence of each node can be expressed both as in-degree, the number of arcs entering a point, thus identifying the so-called "predecessors" of each lexical unit, and as out-degree, the number of arcs leaving a point, showing instead its "successors". The relation between predecessors and successors within the textual network helps explain the semantic variety generated by the nodes. Another index used is betweenness centrality, a global centrality measure based on proximity, which expresses the degree to which a node stands "between" the other nodes of the graph. Nodes located in these zones of the network would exert a control function over information flows and a "passage" function, linking two or more sets of the network (Freeman, 1979). From the point of view of textual analysis, these lemmas indeed play a central role in the circulation of meanings within the network, acting as junction points from which different zones of the text connect and specific paths of meaning unwind, thereby defining the semantic variety of the narrations.
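These measures can also be reproduced outside Gephi. The following sketch assumes the networkx library and the lemma list from the pre-treatment sketch above, building a directed network from adjacent lemma pairs as one plausible operationalisation of the lemma × lemma adjacency described here:

    import networkx as nx

    # Directed 1-mode network: an edge lemma_i -> lemma_j for each adjacent pair.
    G = nx.DiGraph()
    for a, b in zip(lemmas, lemmas[1:]):      # `lemmas` from the pre-treatment sketch
        G.add_edge(a, b)

    in_deg = dict(G.in_degree())              # "absorbing" nodes: many predecessors
    out_deg = dict(G.out_degree())            # "source" nodes: many successors
    betw = nx.betweenness_centrality(G)       # junction points between text regions

    print(sorted(betw, key=betw.get, reverse=True)[:5])  # top betweenness lemmas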
3. Case study
We present a study conducted through the combined use of the two techniques, with the aim of understanding students' perception of the world of work in the Italian context. The latest available data show that Italy is among the European countries with the highest youth unemployment rate (Eurostat, 2017). Instability, precariousness and discontinuity of income make young people vulnerable to economic cycles, changing the nature and timing of the transition to the world of work and reducing the opportunities to develop satisfying life plans (Leccardi, 2006). Distrust affects the drivers of the transition, that is, the maintenance of high aspirations, the crystallisation of career goals and intensive job-search behaviour (Vuolo et al., 2012). For the study we used a source of textual data from the Repubblica blog "Microfono Aperto", on which upper-secondary-school students, between 2013 and 2016, responded to the prompt: "Four young people out of ten without a job. What do you think? Whose fault is it? What would you like to be done as soon as possible to guarantee you a dignified future?". Narrating oneself through the Net facilitates the process of reflection on oneself, on one's role and on one's relationship with what happens in the context in which the young person is inscribed. In a situation of distress over employment precariousness, the web can be a useful container for sharing the experience of precariousness, constituting an environment for the sharing and socialisation of one's experiences (Di Fraia, 2007).

3.1 Results
The corpus comprises 130 narrations (10,110 occurrences, 2,484 graphic forms, 1,590 hapaxes), with territorial provenance (north, centre, south) and type of school attended (technical-vocational institute and liceo) as descriptive variables, and satisfies the statistical criteria of tractability. The thematic analysis of elementary contexts produced four clusters (Fig. 1; Tab. 1), renamed CL1 "Looking at opportunities" (14.6%); CL2 "And the government?" (19.8%); CL3 "From dreams to crisis" (38.5%); CL4 "Looking for work, but where?" (27.1%). The narrations of the "Looking at opportunities" cluster revolve around sacrifices and opportunities; what emerges most clearly is the need for "activity", for actions put into practice in the present with a view to a better future. For this reason, the crisis is at the same time an opportunity that young people must seize to prove their abilities: "By now, from what one hears, everyone wonders about their own future. To guarantee that one day there will be more work, we must act NOW. […] But those looking for work must also fly low and settle, for the moment, for little, instead of staying at home in surrender. In my opinion young people must have the opportunity to show what they are worth, to show the world what they can be and to make everyone understand that they are capable, 'if they apply themselves', of doing any job, from the simplest to the most complex." The texts of the second cluster are more oriented towards the search for "blame" and towards a demand for solutions, chiefly from the State: "I think the State should give more room to young people, ensuring their protection and safeguarding. Members of parliament must preserve the rights and possibilities of every young person; we are the future of this state, and as such we need opportunities." The cluster "From dreams to crisis" points to the more inner dimension of being immersed in a society going through a moment of economic crisis. The students stress that the lack of work cancels dreams: "I am really worried; we all dream about what to be when we grow up, and knowing that 38.7% of young people cannot find work makes me indignant. Young people are the future, progress; they apply themselves […] We all know what the first article of our splendid constitution says, and yet it seems to be ignored. Young people must be given more chances, and our constitution taken into account, to open the doors to the future and make Italy better." The narrations of the last cluster cut across all the difficulties of looking for work (the frantic search, the firms that do not hire because of excessive taxes) and the necessity of going abroad: "Italy finds itself in a period of deep crisis, and if it does not recover economically, giving us young people back the chance to show those in charge that we have the ability and the will to work, Italy will lose all those young people and above all all those minds, who will go abroad in search of more favourable living conditions and above all greater work opportunities." The position of the descriptive variables shows a difference for territorial provenance and no difference for type of school attended.
While attending one school rather than another seems not to affect the perception of the world of work or the feelings of distrust, which are instead shared, territorial belonging carries weight. The north modality is, in terms of proximity, placed close to clusters 1 and 4, the centre to cluster 3, and the south to cluster 2. This indicates that students from the north tend more to problematise the phenomenon of precariousness and the difficult search for work, also stressing the opportunities young people have to prove their worth; the themes of those from the south lean more towards blaming the context, in line with the greater resonance of the discussion topic owing to the high incidence of youth unemployment; the narrations of students from the centre, instead, appeal more to their own inner experience.

Figure 1: Cluster analysis.

The resulting network comprises 259 nodes and 414 ties. A first detailed visualisation of the structure of relations between the lemmas shows the highest levels of degree centrality, with lavoro (work), giovani (young people), futuro (future), problema (problem) and possibilità (possibility) as the nodes with the most connections. Moreover, these same nodes also show the highest in-degree centrality values, "absorbing" nodes with more incoming than outgoing ties than any other point; the students tend to direct their discourse, and more generally the flow of thought, towards themes relating to work, both in terms of future possibilities and by analysing the problems tied to it. On the other hand, impegnare (to commit, in the sense of effort put in) and condizioni (conditions) represent the fulcrum from which the narration moves towards other words, "source" nodes with more outgoing than incoming ties than the remaining nodes of the network. The lemmas referring to the students' lived experience, to their states of mind regarding their present condition and an uncertain future working prospect, are those that play a central role in the circulation of meanings within the network, indeed showing the highest betweenness centrality values. In particular, disoccupato (unemployed), costringere (to force), rimanere (to remain) and scoraggiare (to discourage) are the nodes acting as the main junction points from which specific paths of meaning unwind: the different zones of the network, and hence different parts of the narrations, are linked by the lemmas revolving around the theme of the precariousness of the present, a situation of constraint and deep discouragement.

Figure 2: In-degree centrality, out-degree centrality and betweenness centrality.

4. Conclusions
The mixed use of TA and NTA makes it possible to draw a synthetic picture of the semantic structure, to understand what is being talked about but also how: the choice of words and the very order in which an idea or opinion on the topic at hand is presented. The joint use of the two techniques provides: (a) a synthesis of the information contained in the narrations; (b) an analysis of the themes addressed; (c) a focus on the structuring of sentences in terms of relations between lemmas. It thus allows thematic and content categories to be related as a latent structure, retracing the discursive process backwards.

Bibliography
Bolasco S. (2005). Statistica testuale e text mining: alcuni paradigmi applicativi. Quaderni di Statistica, vol. 7: 1-37.
Carley K.M. (1997). Extracting team mental models through textual analysis. Journal of Organizational Behavior, 18(1): 533-558.
Di Fraia G., ed. (2007). Il fenomeno blog. Blog-grafie: identità narrative in rete. Milano: Guerini e Associati.
Eurostat (2017). Statistics on young people neither in employment nor in education or training. Report.
Freeman L.C. (1979). Centrality in social networks: conceptual clarification. Social Networks, vol. 1: 215-239.
Hunter S. (2014). A novel method of network text analysis. Open Journal of Modern Linguistics, vol. 4(2): 350-366.
Lancia F. (2004). Strumenti per l'analisi dei testi. Milano: Franco Angeli.
Lebart L., Salem A. and Berry L. (1998). Exploring Textual Data. Dordrecht: Kluwer Academic Publishers.
Leccardi C. (2006). Redefining the future: youthful biographical constructions in the 21st century. New Directions for Child and Adolescent Development, vol. 113: 37-48.
Popping R. (2000). Computer-assisted Text Analysis. London: Sage.
Popping R. and Roberts C.W. (1997). Network approaches in text analysis. In Klar R. and Opitz O. (eds.), Classification and Knowledge Organization. Berlin, New York: Springer.
Vuolo M., Staff J. and Mortimer J.T. (2012). Weathering the great recession: psychological and behavioral trajectories in the transition from school to work. Developmental Psychology, vol. 48(6): 1759.

Talking about crisis: the communications of the International Monetary Fund

Ana Nora Feldman
Universidad Nacional de Luján – anafeldman@gmail.com

Abstract
The annual reports issued by the International Monetary Fund under the title "World Economic Outlook" from 2005 to 2012 are analysed in this paper using techniques of statistical analysis of textual data. Text exploration lets us see how the IMF describes the world crisis in its reports, highlighting its strengths and weaknesses in its role as ultimate guarantor of global economic balance. Much has been debated about whether the crisis was foreseen and what position the IMF took regarding its consequences. The denial of the crisis, acknowledged only in 2010, is consistent with the mission the International Monetary Fund sees itself as carrying out, lecturing governments on how to correct their economies (Weisbrot et al., 2009), all the while ignoring that "their prescriptions failed" (Stiglitz, 2002), since their "structural adjustment policies" … "produced hunger and unrest", benefiting those with more resources while "the poor sometimes sank deeper and deeper into misery". In particular, by processing the textual corpus with the Taltac2 software, developed by Prof. Sergio Bolasco of the Università di Roma "La Sapienza", we analyse the concepts and the language associated with them, as a contribution to "a significant debate on a variety of exclusions" … "spanning the political, economic and social fields" (Sen et Kliksberg, 2007), and considering that the World Economic Outlook reports can help us understand the IMF's behaviour in the context of the financial crisis. The texts analysed are written by technicians and bureaucrats possessing a high level of expertise and a skilful command of shared codes, and they are the product of a clear intention as to how the global economic situation, and the role of the Monetary Fund (and its technicians) within it, must be read.
These reports, as will be demonstrated, fulfil the goal of preaching the hegemonic conception of markets and policies, pursuing aims tied to communication and marketing strategies in order to align public opinion, officials and governments behind that conception. Along this line, the contradictions between the more political text (the introduction and the summary) and the technical text (the body of the publication) are also shown.

Resumen
With the help of statistical analysis of textual data, we analyse the annual reports of the International Monetary Fund published under the name "Perspectivas de la Economía Mundial" (World Economic Outlook) between 2005 and 2012. The aim is to show, in the texts, the way the IMF describes the crisis, bringing to light its strengths and weaknesses in its role as ultimate guarantor of global economic balance. Much has been debated about the capacity to foresee the crisis and about the Monetary Fund's position regarding its consequences. The denial of the crisis, acknowledged only in 2010, is consistent with the mission the IMF considers itself bound to fulfil, lecturing on how governments should correct their economies (Weisbrot et al., 2009), all the while ignoring that "their prescriptions failed" (Stiglitz, 2002), since "the structural adjustment policies" … "produced hunger and unrest", benefiting those with more resources while "the poor sometimes sank deeper and deeper into misery". In particular, with the help of Taltac2, developed by Prof. Sergio Bolasco of the Università di Roma "La Sapienza", we analyse the concepts and the associated language as a contribution to "a significant debate on a variety of exclusions" … "spanning the political, economic and social fields" (Sen et Kliksberg, 2007), in order to understand the IMF's behaviour in the context of the financial crisis. The texts analysed are written by technicians and bureaucrats possessing a high level of specialisation and a skilful command of shared codes, and they are the product of a clear intention as to how the global economic situation, and the role of the Monetary Fund (and its technicians) within it, must be read. These reports, as will be demonstrated, fulfil the goal of preaching the hegemonic conception of markets and policies, pursuing aims tied to communication and marketing strategies in order to align public opinion, officials and governments behind that conception. From this perspective, the contradictions between the more political text (the introduction and the summary) and the technical text (the body of the publication) are also shown.

Keywords: textual data analysis, content analysis, political language, economic and financial crisis.

1. Introduction
The economic-financial crisis that began in the United States in 2007 and later spread to Europe and other continents was acknowledged belatedly by the International Monetary Fund (IMF). Given that the Fund's mission is to anticipate the risks arising from economic crises and to offer recommendations on mitigation mechanisms, the question that imposes itself is: why, faced with the largest financial crisis since the Great Depression of the 1930s, did the Fund ignore the crisis, avoiding the declaration of a world-scale emergency?
From a political (and discursive) point of view, by denying the crisis the IMF prevented the activation of the mechanisms designed to confront problems of such magnitude. In this work we analyse, with textual data analysis techniques, the annual reports (World Economic Outlook) published over eight years (2005-2012). Congruences and contradictions allow us to analyse, from a different point of view, the political strategies of a Monetary Fund whose image as a valid and competent resource for rescuing endangered economies has been badly worn.

2. Corpus
The criterion for choosing the period under analysis was to gather information at different moments of the crisis: starting from a "moment zero" (before its appearance), passing through the stage at which the state of affairs was acknowledged, and finally considering the most important change in the policy the IMF had pursued until then, namely the shift from the "non-interventionist" neoliberal paradigm (no state action, letting the market regulate itself) to an active policy of assistance by governments (of the United States and the European Union) to "save the main firms, companies and banks in bankruptcy" (Rapoport et Brenta, 2010). From a content-analysis perspective (Krippendorff, 1969), a comparative analysis of these reports is carried out, seeking to establish how the IMF described the crisis and which themes it associated with it. The hypothesis is that this non-neutral language and content, made of technical and political criteria, reflect the agreement of what we have called the international community "of real weight" (Feldman, 1995).

3. Eight years of International Monetary Fund discourse
We have already worked on and presented various aspects of the Monetary Fund's communications in the face of this major economic and financial crisis: discourses that depend on the Managing Director of the day, and the use of lexicometry as a tool for interpreting the reports (Feldman, 2015a and b). In this work we analyse the questions of congruence and of the political use made of these annual publications: the ambiguity of the discourse, the difficulty of foreseeing and acknowledging (or denying) the crisis, its causes and consequences, and the repeated announcements of the end of the crisis (in 2012, 2013 and 2014), which drew criticism from all the blocs of countries more or less close to the IMF. The objective, then, is to identify the positions of the International Monetary Fund over time: to understand how the IMF speaks about this crucial subject, and how it keeps silent about it, as a contribution to "a significant debate" on exclusions "spanning the political, economic and social fields" (Sen et Kliksberg, 2007). Underlying this proposal is the idea that exploring and analysing texts by means of multidimensional exploratory statistics permits "an ecological conception of the treatment of qualitative data" (Bolasco, 2007). The software used is TALTAC.

3.1. The IMF's discourse
The corpus comprises a total of 1,056,336 words (occurrences). These are long texts (more than 300 pages including graphs and tables), averaging 132,042 occurrences each. Although the distribution across years is roughly similar, the 2008 report stands out, concentrating 16% of all occurrences.
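The basic lexicometric measures behind the tables below are straightforward to compute. Here is a minimal sketch, assuming one plain-text file per annual report (the file names are illustrative):

    import re

    years = range(2005, 2013)
    reports = {y: open(f"weo_{y}.txt", encoding="utf8").read().lower()
               for y in years}                       # hypothetical file names

    total = sum(len(t.split()) for t in reports.values())
    for y, text in reports.items():
        tokens = text.split()
        share = 100 * len(tokens) / total            # share of corpus occurrences
        richness = len(set(tokens)) / len(tokens)    # crude type/token richness
        crisis = len(re.findall(r"\bcrisis\b", text))
        print(f"{y}: {share:.1f}% of occurrences, richness {richness:.3f}, "
              f"'crisis' {10_000 * crisis / len(tokens):.1f} per 10,000 words")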
Table 1 - Lexicometric analysis

Just as 2008 stands out for its length, the 2009 report is the one that displays the greatest richness of vocabulary. In our experience (Feldman, 1995), the use of an elevated number of words in a report may indicate a situation of "unease", or the use of "distorted" language: more words must be used to describe something that has not yet been agreed upon among the technicians and has therefore not been adequately conceptualised.

Table 2 - Vocabulary richness

The distribution of the form "crisis" across the years is sufficiently illustrative of the use the IMF made of it over time.

Graph 1 - Distribution of the form "crisis" over time

3.2. Two levels of analysis: the Fund's reports year by year
If we consider only the Introduction and the Executive Summary (which we shall call "political texts"), which precede the body of the technical report (more than 300 pages of text and figures) of each report (which we shall call "technical-economic texts"), the former can be considered communication pieces with a wider public reach: there is a broad range of publics that "consume" the IMF's technical documents (economic journalists, economists, the general public) but do not normally read the full reports. It is often precisely these synthetic texts that have the greater effect in shaping international public opinion. Are there, then, differences and/or inconsistencies between the reports considered as a whole and the executive summaries and introductions? Through a reading of these texts and an analysis of the main statistically significant forms, we comment on their differences and similarities. Without presaging any crisis, in both 2005 and 2006 the texts show economic coherence, given the harmony between the contents of the first part and the statistically significant forms of the technical document: INFLACIÓN (inflation), INVERSIÓN (investment), AHORRO (saving) in 2005; PRODUCTIVIDAD (productivity) and SECTORES PRODUCTIVOS (productive sectors) in 2006. In 2007, the year the crisis began, the IMF starts to speak of an "uncertain and difficult period", and the statistically significant words refer above all to VOLATILIDAD (volatility); at the same time it speaks of growth, so an economic dissonance between the two parts is registered. The 2008 report, as already noted above, concentrates 16% of the total occurrences of the corpus. Here we find a discursive/economic dissonance, with the use of many terms unusual for the IMF, such as VIVIENDA (housing) and CAMBIO CLIMÁTICO (climate change), to describe the economic situation (dissonance and/or incongruence in the use of terms, cf. Feldman, 1995). With the crisis having already broken out in 2009, and under international pressure, the IMF has to begin explaining what it neither foresaw nor announced (see Graph 1 and Table 2). We find greater dissonance between text and context, and new significant forms: DESPLOME (collapse), ALARMAS (alarms). Trying to regain political leadership in 2010, after suffering numerous criticisms for its failure to foresee the crisis, the IMF produces a report in which, between its synthetic part and the technical document, we find political coherence and economic dissonance. Among the significant forms we find CRISIS.
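The measures behind Tables 1-2 and Graph 1 (corpus size, vocabulary, and the yearly distribution of one form) can be sketched as follows. This is an illustration only, not the TALTAC procedure; the file names are hypothetical and the tokenizer deliberately crude.

# Minimal lexicometric sketch: occurrences, vocabulary size and the yearly
# count of the form "crisis", assuming the eight reports are available as
# plain-text files named weo_2005.txt ... weo_2012.txt (hypothetical names).
import re
from collections import Counter

def tokenize(text):
    # lowercase word forms, keeping Spanish accented letters
    return re.findall(r"[a-záéíóúüñ]+", text.lower())

for year in range(2005, 2013):
    with open(f"weo_{year}.txt", encoding="utf-8") as f:
        tokens = tokenize(f.read())
    counts = Counter(tokens)
    print(year,
          "occurrences:", len(tokens),
          "vocabulary:", len(counts),
          "type/token:", round(len(counts) / len(tokens), 3),
          "'crisis':", counts["crisis"])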
From 2011, the year in which we find the greatest distance between what is read in the Introduction and Executive Summary and the content of the full report, politics reappears. Having recovered its institutional space and its raison d'être, the 2012 texts show both political and economic coherence.

5. Conclusions
The Fund makes a contradictory reading of the economic indicators, with an unclear vision of the gravity and consequences of this crisis. Content analysis of the texts (speeches and reports), using tools of textual statistics, makes it possible to chart in an irrefutable way the contradictions and silences into which the IMF falls from the first symptoms of the crisis in 2007. The concepts poured into the "Perspectivas de la Economía Mundial" reports are the product "of a curious blend of ideology and bad economics, a dogma that at times seemed barely to veil vested interests", recommending "old, inadequate solutions" with brutal effects "on the peoples of the countries advised to apply them" (Stiglitz, 2002). These recipes failed on many occasions and produced extremely grave situations in several countries. A message, a sender, an object and a mission that failed, since the IMF did not fulfil its role of preventing the world from falling once again into a new Great Depression. The texts analysed provide some clues about the motivations of this failure. The contradictions brought to light, and the attempts to deny a reality that left no doubt about the magnitude of this crisis, reinforce the idea that there is a governance problem within the Monetary Fund and other international organisations.

Table 3 - Year-by-year analysis of the coherence and dissonance of the reports

References
Bolasco S., D'Avino E. and Pavone P. (2007). Analisi dei diari giornalieri con strumenti di statistica testuale e text mining. In I tempi della vita quotidiana. Un approccio multidisciplinare all'analisi dell'uso del tempo. ISTAT, Roma.
Feldman, A. (1995). Il concetto di sviluppo umano secondo le Nazioni Unite: analisi del contenuto. In Bolasco, S., Lebart, L. and Salem, A. (eds.), JADT 1995 - Analisi statistica dei dati testuali. Roma, CISU, 2 voll.
Feldman, A. (2015a). Análisis del posicionamiento del Fondo Monetario Internacional frente a la crisis del año 2007. Revista Latinoamericana de Opinión Pública, número 6 (2016). EDUNTREF, Buenos Aires.
Feldman, A. (2015b). Text Mining Strategies applied on the annual reports of the International Monetary Fund. A look at the crisis. In ISI 2015 World Statistics Congress, Rio de Janeiro.
Krippendorff, K. (1969). Theories and Analytical Constructs. In G. Gerbner, O.R. Holsti, K. Krippendorff, W.J. Paisley and P.J. Stone (eds.), The Analysis of Communication Content. New York, John Wiley & Sons, pp. 6 ff.
Lebart, L. and Salem, A. (2008). Statistique Textuelle. Dunod, Paris.
Nemiña, P. (2009). Aportes para un esquema de análisis del comportamiento del FMI en crisis financieras a partir de su actuación durante la crisis argentina (2001-2002). Documentos de Investigación Social número 8, ISSN 1851-8788. IDAES, UNSAM, Buenos Aires.
Rapoport, M. and Brenta, N. (2010). Las grandes crisis del capitalismo contemporáneo. Capital Intelectual, Buenos Aires.
Sen, A. and Kliksberg, B. (2007). Primero la Gente. Ediciones Deusto, 9th edition, Editorial Temas, Buenos Aires, Argentina.
Weisbrot, M., Cordero, J. and Sandoval, L. (2009).
Empowering the IMF: Should Reform be a Requirement for Increasing the Fund's Resources? Center for Economic and Policy Research, Washington, D.C., United States. www.cepr.net

Brexit in the Italian and the British press: a bilingual corpus-driven analysis
Valeria Fiasco
Università Roma Tre – valeria.fiasco@gmail.com

Abstract 1 (English)
The spread of English as the Lingua Franca of international communication has given rise to meaningful language-contact phenomena in the world's languages, such as loanwords and pseudo-loanwords: words from one language (the donor language) are adopted by another language (the recipient language), sometimes becoming naturalized (Gusmani 1973). From this perspective, it is interesting to observe their behaviour in real language use. In particular, this study investigates Anglicisms and pseudo-Anglicisms found in the newspaper discourse of Brexit by way of a bilingual corpus collected from two Italian newspapers, i.e. La Repubblica and Il Corriere della Sera, and two British newspapers, i.e. The Independent and The Guardian, selected for both their authoritativeness and their extensive readership. The exit of the United Kingdom from the European Union was chosen because it is a widely covered topic in both the Italian and the British press, thus providing abundant material for comparative analysis as well as useful data for exploring linguistic variation. The topic also lent itself to building an electronic corpus, retrieved from the digital archives of the newspapers' websites, on which an automated text analysis could be carried out. The corpus includes articles collected during the periods that both preceded and followed the Brexit referendum. The analysis follows a corpus-driven methodology, namely an approach that lets hypotheses emerge from corpus observation (Tognini-Bonelli 2001). The investigation was carried out by way of the software TalTac2, and the automated text analysis proved invaluable for investigating and monitoring the newspapers' vocabulary, which included technical terms from the fields of politics, economics and finance as well as general language words. In order to design and sample a representative corpus, the parameters proposed by Biber (1993) were used to identify descriptive criteria so as to select and balance the population. The aim of this study is to get an overview of the Brexit discourse in the vocabulary and terminology of the two countries' newspapers, by using text mining to compare and categorize the whole corpus as a collection of texts and then to cluster documents on the basis of the lexical similarity of their vocabulary, so as to establish semantic fields or conceptual areas. Furthermore, by way of lexical and textual analysis, this study also investigates Anglicisms and pseudo-Anglicisms in the Italian newspapers, identifying and analyzing a list of English words used in Italian. The two British newspapers serve as a reference corpus against which the list of Anglicisms extracted from the Italian corpus is compared. The articles retrieved from the British newspapers serve to find out which words are typical of each corpus and to identify pseudo-Anglicisms, namely new words that look like English forms even though they do not exist in English or, if they do exist, have a clearly different meaning.
Lastly, the data gathered from the bilingual corpus analysis were compared with other, wider corpora available in SketchEngine and on the Brigham Young University platform, in order to make generalizations about the distribution of Anglicisms and pseudo-Anglicisms in general language corpora.

Keywords: Bilingual Corpus, Textual Analysis, Anglicism, Linguistic Interference

Abstract 2 (Italian)
The spread and consolidation of English as the lingua franca of international communication has generated meaningful language-contact phenomena such as loanwords and false loanwords, i.e. words originally born in a model language that enter another language (the replica language), to which they are sometimes assimilated and adapted (Gusmani 1973). It is therefore interesting to observe their use and behaviour in authentic texts presenting the language in its current use. This study analyses Anglicisms and false Anglicisms in the newspaper discourse of Brexit, through a corpus drawn from the Italian newspapers La Repubblica and Il Corriere della Sera and the British newspapers The Guardian and The Independent, selected for their circulation and their authoritativeness. The choice of the topic of the United Kingdom's exit from the European Union was dictated by several factors, among them the wide coverage of the subject in the Italian and British press, which made it possible to create a corpus for a comparative analysis exploring linguistic variation. Since these newspapers offer an online edition with a searchable digital archive, they are particularly suited to building a corpus that can be examined through automated text analysis. The corpus is composed of articles collected during the period preceding and following the Brexit referendum, and the methodology used for the analysis is corpus-driven, that is, an exploratory approach in which hypotheses are formulated starting from the observation of the corpus (Tognini-Bonelli 2001). The software TalTac2 and automated text analysis were extremely valuable for examining and monitoring the press lexicon, which includes technical terms of politics, economics and finance alongside words belonging to the common lexicon. To design the corpus, the parameters proposed by Biber (1993) were used with the aim of identifying descriptive criteria for selecting and balancing the population within the corpus. The objective of this research is to offer an analysis of the lexicon and terminology used in the Brexit discourse in Italian and English newspapers through text mining, so as to compare the texts that make up the corpus, categorize them and group them on the basis of lexical similarities in order to identify semantic fields and conceptual areas.
Furthermore, the lexical and textual analysis allowed the identification of Anglicisms and false Anglicisms in the Italian newspapers, while the corpus of British newspapers served as a reference corpus: the list of Anglicisms extracted from the Italian corpus was compared with the data collected in the British corpus, in order to understand which words are typical of each language and to identify false Anglicisms, that is, words that have an English form but either do not exist in the original vocabulary or, if they do, have a completely different meaning. Finally, the data collected from the analysis of the bilingual corpus were subsequently compared with other, larger corpora available on SketchEngine and on the Brigham Young University platform, with the aim of making generalizations about the distribution of Anglicisms and false Anglicisms in non-specialist corpora.

Keywords: Bilingual corpus, textual analysis, Anglicism, linguistic interference

1. Introduction
The growing influence of English on many languages in the world represents the linguistic change produced by language contact. English is used in both academic and professional settings, revealing a pervasive presence of Anglicisms in European languages (Marazzini & Petralli 2015). This situation can be traced back to economic and trade developments, as well as political and social circumstances, of the past decades. Anglo-American globalization also exerts an influence on language, with an increasing number of EFL (English as a Foreign Language) and ESL (English as a Second Language) learners and the use of English as a Lingua Franca (ELF) for international communication, giving rise to the borrowing of an increasing number of Anglicisms, which have thus become a symbol of the American lifestyle and an expression of dynamism and progress. Pulcini, Furiassi and Rodríguez González (2012:1) use the term Anglicization to describe this growing impact of lexical borrowing of English origin on vocabulary and phraseology, a process that has prompted extensive research. Lexical borrowings adapt to their receiving language in various ways, from occasional coinages to integrated words, spreading from restricted circles to broader groups until they reach the totality of the speakers of the recipient language. Gusmani (1993:28) states that there are cases of complete acclimatization in which the speakers of the recipient language become so used to the foreign word that it is perceived to be part of the recipient language, e.g. film. One of the main sources of neologisms and borrowings is newspapers and magazines, which detect emerging trends in contemporary language and coin new words in a creative fashion. According to Beccaria (1983:65), newspapers are one of the main forums of exchange between written and spoken language, where different varieties coexist, for example bureaucratic, technical and literary language. Moreover, in newspapers the interaction between general and specialized language takes place, allowing specific terms to penetrate popular culture (Cabré 1999:17).

2. Research design
This paper stems from the assumption that the linguistic interference of English on Italian brings about significant effects, giving rise to lexical borrowing phenomena like Anglicisms and false Anglicisms, especially in newspaper language.
This bilingual corpus-driven analysis describes both the Italian and the British discourse of Brexit, with the aim of analyzing its vocabulary and terminology as used in both the Italian and the British press. By way of text mining, patterns and trends can be discovered that allow us to make connections between the two languages under investigation. We can identify Brexit's main themes and get a picture of how the corpus data are shaped and subdivided into text fragments corresponding to the sections of a newspaper article (title, subtitle, summary, text). We can also investigate the linguistic interference of English on Italian and the markedness relation between the Anglicisms/pseudo-Anglicisms retrieved in the Italian newspapers and their Italian equivalent words. The exit of the United Kingdom from the European Union was chosen because it is a historic and momentous event which has been the focus of attention of numerous newspapers, thus providing abundant material to collect in the corpus. The reason behind the choice of the two languages lies in the linguistic interference phenomena they are closely involved in: English performs the role of a highly productive donor language, while Italian is a recipient language under the influence of English. The bilingual corpus is made up of articles retrieved from two Italian newspapers, i.e. La Repubblica and Il Corriere della Sera, and two British newspapers, i.e. The Independent and The Guardian. They were selected for their authoritativeness, their extensive readership and the possibility of accessing their online archives with a free subscription. Moreover, they all dealt with the Brexit issue thoroughly. The corpus was compiled by downloading and storing all the articles about Brexit published in the online versions of these newspapers from June to October 2016, that is, the period that preceded and followed the Brexit referendum. The selected articles provide a brief but detailed overview of Brexit, even though they are not representative of all of the Italian and British press. The corpus is composed of two sub-corpora, the Italian and the British one. The Italian corpus includes 42 articles from La Repubblica and 42 articles from Il Corriere della Sera, for a total of 51,158 tokens, whereas the British corpus includes 31 articles from The Guardian and 31 articles from The Independent, for a total of 49,995 tokens. A difference can be observed in the number of articles that make up the overall corpus: since the average length of the British articles was greater than that of the Italian ones, fewer of them were needed to balance the token counts. On the whole, the corpus includes 146 articles and 101,153 tokens. The corpus was designed and sampled according to the parameters proposed by Biber (1993), in order to build a representative corpus and to identify descriptive criteria for selecting and balancing the population. The issue of whether a corpus is representative and reliable is essential, because the information included in the corpus and the way it is constructed are central in the corpus-driven approach, namely a method that lets hypotheses emerge from corpus observation (Tognini-Bonelli 2001). The automated text analysis of the corpus was carried out by way of the software TalTac2, in order to investigate the newspapers' vocabulary, to observe the behaviour of Anglicisms, and to make a detailed bilingual analysis.
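As a minimal illustration of the corpus-balance bookkeeping just described (not the author's actual pipeline), token and article counts per newspaper can be tallied from a hypothetical folder of plain-text articles:

# Token and article counts for the four sub-corpora; the folder layout
# corpus/<newspaper>/*.txt and the file naming are assumptions.
import pathlib
import re

def count_tokens(path: pathlib.Path) -> int:
    return len(re.findall(r"\w+", path.read_text(encoding="utf-8")))

papers = ["repubblica", "corriere", "guardian", "independent"]
tokens = {p: 0 for p in papers}
articles = {p: 0 for p in papers}
for p in papers:
    for f in pathlib.Path("corpus", p).glob("*.txt"):
        tokens[p] += count_tokens(f)
        articles[p] += 1

print("Italian sub-corpus:", tokens["repubblica"] + tokens["corriere"], "tokens")
print("British sub-corpus:", tokens["guardian"] + tokens["independent"], "tokens")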
In order to make generalizations about the distribution of Anglicisms and pseudo-Anglicisms in general language, and to retrace their routes from/into the donor and the recipient language, other general language corpora were consulted: Sketch Engine (British National Corpus, itTenTen16 and enTenTen13) and the online corpora available on the Brigham Young University website (News on the Web – NOW, Global Web-Based English – GloWbE, TIME Magazine Corpus). Furthermore, the software Iramuteq was used to carry out the cluster analysis of both corpora, to map them and to extract the semantic associations of words according to their similarity.

3. Results
In order to identify the main themes and semantic fields of the corpus, the cluster analysis grouped its lexical content so as to maximize the similarity or the dissimilarity of different groups of words. The analysis divided the Italian and the English corpus into 4 homogeneous clusters whose topics are economics and finance or European and British politics. The output graph was a dendrogram showing the association of all the words included in the two corpora according to their similarity. It grouped the words into two clusters: the first concerns economics/finance and the second is related to politics. The percentage of words included in the Italian economics cluster equals 31%, compared with 23% in the English economics cluster. In both corpora, the words from the semantic field of economics are homogeneously distributed, e.g. bank/banca, market/mercato, growth/crescita, fund/fondo, investor/investimento, rate/tasso. As for the politics cluster, both corpora subdivide the lexical content into three clusters. In the Italian corpus, the cluster of politics generates cluster 4 (23%), grouping the words concerning British politics, and the sub-clusters 1 and 3. Sub-cluster 1 (22%) regards European politics and the Brexit referendum, e.g. Unione, europeo, UE, negoziati, uscire, trattativa, while sub-cluster 3 (23%) is related to European policies linked to political integration and post-Brexit immigration policies, e.g. difesa, migrare, integrazione, emergenza. In the English corpus, the cluster of politics generates cluster 1 (26%), which corresponds to Italian cluster 3, e.g. movement, immigration, person, European, and two sub-clusters (2 and 3) about British politics. In particular, sub-cluster 3 is about the Leave campaign, e.g. Ukip, independence, break, Farage, whereas sub-cluster 2 is about the campaign for the United Kingdom to remain in the European Union, e.g. Cameron, conservative, labour, tory. Moreover, the dendrogram also shows who the main actors of this event are: the European Union, David Cameron, Nigel Farage, Theresa May, Boris Johnson, and Jeremy Corbyn. By way of its textual analysis, the software TalTac2 also identified the words occurring within the specific text fragments into which the corpus was subdivided and labelled, i.e. headline, sub-heading, lead, body. This analysis particularly focused on the headlines. On the whole, the most frequent lexical word in both corpora, Brexit, is mainly found in the headlines and in the body of Italian newspapers, while it can only be observed in the body of the British press. The concept of "exit, leaving the European Union" mainly appears in the body of the articles in the British press, while in Italian newspapers it is predominantly found in headlines.
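The word-grouping step behind such a dendrogram can be sketched as follows. This is a generic reconstruction with scikit-learn and scipy, not Iramuteq's exact procedure, and the four example documents are invented placeholders:

# Group words by the similarity of their distribution across documents,
# then cut the hierarchy into macro-clusters, as a dendrogram-style
# analysis does (economics/finance vs politics in the study above).
from sklearn.feature_extraction.text import CountVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = [
    "bank market growth fund rate investor sterling economy",
    "market fund rate bank growth recession economy pound",
    "referendum campaign vote leave remain Farage Cameron",
    "immigration movement European negotiation vote campaign",
]
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)          # documents x words (0/1)
W = X.T.toarray()                    # each word described by its documents
Z = linkage(W, method="ward")        # hierarchical clustering of the words
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 macro-clusters
for word, lab in sorted(zip(vec.get_feature_names_out(), labels), key=lambda p: p[1]):
    print(lab, word)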
The brief exploration of the headlines starts with the key topics expressed by the nouns in both the Italian and the English corpus. The topics refer to the domain of politics, the governance of the UK, the debate and negotiations between the two parties, and the problems arising from the exit of the United Kingdom from the European Union (e.g. referendum, European Union, leader, government, campaign, support / negoziato, collasso, rischio, leader, referendum). In particular, the most recurrent nouns in both the English and the Italian headlines mirror the themes addressed in the two corpora, i.e. politics: Brexit, EU referendum, Remain, vote / Brexit, premier, uscita; economics and finance: borsa, sterlina / pound. As for the verbs describing the actions, conditions or experiences linked to Brexit, they outline a delicate and unstable situation in both corpora, e.g. to vote, to fail, to resign, to face, to divide / uscire, crollare, affrontare, rischiare, intervenire. As far as the analysis of linguistic interference is concerned, the Italian corpus includes 174 Anglicisms (types) for a total of 1,096 occurrences (tokens), about 2.1% of the corpus. Among the types there are many hapax legomena: 91 of the 174 Anglicisms, approximately 52.3% of types. The 174 Anglicisms belong to the semantic fields of politics (22.5%), economics (27.5%), general language (45.5%) and newspaper language (4.5%). The list of Anglicisms extracted from the Italian corpus was then compared with the British one to check whether they were actually used in English and how: 81 Anglicisms out of 174 were found in the English corpus. The remaining 93 Anglicisms are real English words, except for neo-premier (58.64 per million words), which can be defined as a pseudo-Anglicism. It is a loanblend, or hybrid compound (Furiassi 2010:40), formed by the English word premier and the Greek-derived prefix neo-. These two lexical elements are individually used in English, but they are not used together. The prefix neo- can be found in English compounds referring to political movements, like neo-socialist and neo-fascist, or to art and philosophy, e.g. neo-baroque, neo-Aristotelian. The use and frequency of the compound neo-premier was compared with the Italian itTenTen16 corpus on SketchEngine. This online corpus displays two variants of the compound: the hyphenated word neo-premier (0.02 per million words) and neopremier (0.02 per million words). Conversely, a search for the same word in English corpora like the BNC, enTenTen13 or the NOW corpus did not produce any results. The most frequent Anglicisms in the Italian corpus are Brexit (309 tokens, 0.6%), referendum (111 tokens, 0.22%), premier (89 tokens, 0.17%) and leader (61 tokens, 0.12%). These words are particularly frequent in the British corpus as well: Brexit (232 tokens, 0.46%), referendum (157 tokens, 0.31%), leader (71 tokens, 0.14%). In particular, the word Brexit is productive in both the English and the Italian corpus, with numerous hyphenated compounds formed with Latin and Greek prefixes or English-derived morphemes. Some of them are common to both corpora, i.e. post-Brexit (English corpus 140 per million words, Italian corpus 58.6 per million words), hard-Brexit (English corpus 80 per million words, Italian corpus 58.6 per million words), pro-Brexit (English corpus 100 per million words, Italian corpus 39.1 per million words).
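The per-million-words figures quoted above follow from a simple normalisation over the sub-corpus sizes reported earlier; for instance, 58.64 pmw of neo-premier in the Italian sub-corpus corresponds, by back-computation, to 3 raw occurrences (3 / 51,158 × 10^6 ≈ 58.64). A one-function version:

# Normalised frequency (occurrences per million words), used to compare
# forms across corpora of different sizes; sizes are those reported above,
# raw counts are back-computed from the pmw figures, not taken from the paper.
def per_million(count: int, corpus_tokens: int) -> float:
    return count / corpus_tokens * 1_000_000

ITALIAN_TOKENS = 51_158
BRITISH_TOKENS = 49_995

print(round(per_million(3, ITALIAN_TOKENS), 2))   # 58.64 pmw, cf. neo-premier
print(round(per_million(7, BRITISH_TOKENS), 2))   # 140.01 pmw, cf. post-Brexit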
Other Brexit compounds, like pre-Brexit (39.1 per million words) and dopo-Brexit (19.5 per million words), are only found in the Italian corpus, while the compound anti-Brexit (40 per million words) is only included in the English corpus. As far as the word premier is concerned, in the English corpus it shows only 1 token (20 per million words), while its synonym, prime minister, has a frequency of 119 tokens (2,380 per million words). The occurrence of this compound was then compared with larger English corpora like the BNC, where Prime Minister is written both with capital letters (85.17 per million words) and in lowercase (8.33 per million words). The word premier is also present in the BNC, occurring with a frequency of 0.23 per million words, but it mainly occurs in the semantic field of football, i.e. as a modifier of the noun league in the collocation premier league. However, it is also found in the domain of politics as a noun co-occurring with the modifiers deputy and country. Conversely, in the Italian itTenTen16 corpus in SketchEngine, premier always occurs in the semantic field of politics. Two different uses of the words premier and prime minister can thus be observed in the two languages.

4. Conclusion
The aim of this paper has been to provide an outline of the Brexit discourse in the vocabulary and terminology used by two Italian and two important British newspapers. By way of cluster analysis, Brexit's main themes have been identified: economics, finance, European and British politics, and post-Brexit immigration policies. Another characteristic explored in this paper is the distribution of words across the various sections of newspaper articles, with a focus on the headlines. The analysis showed that the nouns included in newspaper headlines refer, for the most part, to Brexit's main political issues, even though some words from the field of economics can be found as well, whereas the verbs describe the difficult circumstances that both the European Union and the United Kingdom will face. As far as Anglicisms are concerned, the investigation highlighted that even though they are often used by newspapers, they represent only about 2% of the whole corpus. This percentage conforms to the most recent studies on Anglicisms in Italian by Serianni (2015), Cortellazzo (2015) and Scarpa (2015). The Anglicisms mirror the topic subdivision of the corpus: they mainly belong to the semantic fields of economics and politics, whereas almost half of them can be classified as general language words. In the Italian corpus, only one pseudo-Anglicism has been identified, i.e. neo-premier, and its status has been confirmed by numerous general English corpora. The analysis of Brexit-related Anglicisms provides a small but interesting contribution to the research on Anglicisms; it would therefore be interesting to keep collecting data about this historical event, so as to expand the two small corpora under investigation, make them as comprehensive as possible, and carry out an even more detailed contrastive analysis.

References
Biber D. (1993). Representativeness in Corpus Design. Literary and Linguistic Computing, vol. 8 (4): 243-257.
Bolasco S. (1999). Analisi multidimensionale dei dati. Carocci.
Bolasco S. (2013). L'analisi automatica dei testi. Carocci.
Cabré Castellví M. T. (1999). Terminology: Theory, methods and applications. John Benjamins Publishing Company.
Cortellazzo M.A. (2015). Per un monitoraggio degli anglicismi incipienti. In Marazzini C., Petralli A., La lingua italiana e le lingue romanze di fronte agli anglicismi. Accademia della Crusca.
Furiassi C. (2010). False Anglicisms in Italian. Polimetrica.
Görlach M. (2001). A dictionary of European Anglicisms. Oxford University Press.
Gusmani R. (1973). Analisi del prestito linguistico. Libreria scientifica editrice.
Gusmani R. (1993). Saggi sull'interferenza linguistica. Le lettere.
Hunston S. (2002). Corpora in Applied Linguistics. Cambridge University Press.
Lenci A., Montemagni S. and Pirrelli V. (2007). Testo e computer. Elementi di linguistica computazionale. Carocci.
Marazzini C., Petralli A. (2015). La lingua italiana e le lingue romanze di fronte agli anglicismi. Accademia della Crusca.
Pulcini V., Furiassi C. and Rodríguez González F. (2012). The Anglicization of European lexis. John Benjamins.
Scarpa F. (2015). L'influsso dell'inglese sulle lingue speciali dell'italiano. Edizioni Università Trieste.
Serianni L. (2015). Per una neologia consapevole. In Marazzini C., Petralli A., La lingua italiana e le lingue romanze di fronte agli anglicismi. Accademia della Crusca.
Sinclair J. (1991). Corpus Concordance Collocation. Oxford University Press.
Tognini-Bonelli E. (2001). Corpus Linguistics at work. John Benjamins Publishing Company.

Textual analysis to promote innovation within public policy evaluation
Viviana Fini (1), Giuseppe Lucio Gaeta (2), Sergio Salvatore (3)
(1) Ospedale Apuane, Massa – vivianafini@gmail.com
(2) Università di Napoli L'Orientale – glgaeta@gmail.com
(3) Università del Salento – sergio.salvatore65@icloud.com

Abstract
This paper illustrates the contribution of textual analysis to the research activities promoted by FORMEZ PA through the REVES (Reverse Evaluation to Enhance local Strategies) pilot project (*), which aims to innovate public policy evaluation. While evaluation usually embraces a policy/project viewpoint and adopts a sort of top-down approach consistent with the flow of rules/resources from policy makers to citizens, REVES reverses this perspective. Indeed, it aims to assess public policies' performance in intercepting and supporting development strategies promoted by citizens/local actors. One of the three case studies carried out by the REVES project focuses on Melpignano, a small municipality in the Puglia Region of Southern Italy. Semi-structured interviews were carried out with a sample of twenty policy actors (national, regional and local policy designers and policy implementers as well as policy beneficiaries) linked with this municipality. By using the TLab software, textual analyses of the responses were performed in order to identify their symbolic and latent components and to understand the actors' points of view about the world and, specifically, about local development. This allowed us to assess how similar concepts, such as civic participation, innovation and community, are used with profoundly different cultural meanings by the actors, which contributes to understanding public policies' difficulties in enhancing local strategies.

Keywords: Local cultures, textual analysis, innovation within evaluation.

(*) The evaluative research was carried out within the framework of the NUVAL Project, "Actions to support the activities of the National Evaluation System and Evaluation Units", implemented by Formez PA.
The case study was carried out by Viviana Fini and Vito Belladonna, under the scientific coordination of Laura Tagle, Serafino Celano, Antonella Bonaduce and Giuseppe Lucio Gaeta. Viviana Fini carried out the cultural analysis under the supervision of Sergio Salvatore and with the contribution of Giuseppe Lucio Gaeta.

Abstract (Italian)
The article describes the contribution of cultural research, conducted through textual analysis, to the implementation of the REVES (Reverse Evaluation to Enhance local Strategies) pilot research project, promoted by FORMEZ PA with the intent of innovating public policy evaluation. While the evaluation process traditionally follows the flow of financial resources and the implementation of rules/measures by local actors, REVES proposes a reversal of perspective, intending to evaluate the performance of public policies in intercepting and enhancing the development strategies autonomously elaborated by the territories. One of the project's case studies focuses on the Apulian town of Melpignano. Semi-structured interviews were conducted with a sample of 20 policy actors (policy makers and policy implementers active at the national, regional and local level, as well as potential policy beneficiaries) connected in various capacities with the town. With the help of the TLab software, textual analyses were conducted with the objective of highlighting the latent components that orient the interviewed actors' visions of the world and of development. This made it possible to assess how similar concepts, for instance civic participation, innovation and community, are employed by the actors with profoundly different cultural meanings, which contributes to understanding why public policies find it difficult to enhance locally elaborated strategies.

Keywords: Local cultures, textual analysis, innovation within evaluation.

1. Introduction
The article reports on the cultural investigation, carried out through textual analysis, conducted to support the innovation that the REVES project brought to the field of local development policy evaluation. With a reverse accountability approach, the project asked whether and how supra-local policies have been able to grasp and enhance the demands of specific local contexts, investigating the case study of Melpignano, a municipality in the province of Lecce known in the literature for having elaborated, proposed and implemented, over the last 30 years, an innovative vision and strategy of intervention concerning local development (Attanasi et al., 2011; Parmiggiani, 2013). We discuss here the results of the cultural investigation and the advantage that textual analysis brought to the project, allowing a reading that went beyond the content of the individual interviews and making it possible to grasp how similar concepts were sometimes used by the interviewees with culturally very different meanings.

2. The cultural investigation as a premise of the evaluative research
The research work carried out through textual analysis aimed at detecting the cultural dimensions that latently shaped visions of, and interventions on, local development.
This type of investigation falls within a psychological theoretical framework of psychodynamic and psycho-cultural orientation (Carli et al., 2002; Salvatore et al., 2011), which considers the behaviours and discourses of social actors as the expression of cultural dynamics that are only partly conscious and largely unconscious, latent (Matte Blanco, 1975; Fornari, 1979; Carli et al., 2002). What actors do, say and deem salient is, according to this approach, a function of a field of latent forces, a stable system of generalized meanings, which we call culture (Carli et al., 2002; Salvatore et al., 2011). The idea of organizing the evaluative actions around the results of the cultural investigation responded to the project's need to "construct" the object of inquiry starting from a deep understanding of the motivations underlying certain outcomes, as a consequence of the presence/absence of certain initiatives. The cultural investigation made it possible to formulate hypotheses about what brought closer together, or set apart, the action models of different policy actors, allowing their discourses to be classified in relation to the cultural variability that characterizes them and that defines the scenario within which each of them, without the mediation of rational thought, moved.

2.1 Textual analysis: analytical procedure
The method used for the textual analysis is based on the principle of lexical co-occurrences as a source for reconstructing the intra-textual context. This principle was defined within linguistics (Reinert, 1986) and subsequently elaborated in a psychological key (Carli & Paniccia, 2002; Lancia, 2004). In general terms the method, using the TLab software, transforms the lexical corpus into a digital co-occurrence matrix, which is in turn submitted to a multidimensional analysis procedure that extracts the semantic clusters active in the text (i.e. clusters of words co-occurring within the same sentences, and as such indicative of patterns of meaning), which are then submitted to interpretation. The adopted procedure segments the text into Elementary Context Units (ECUs), i.e. portions of text delimited by punctuation, ranging from a minimum of 250 to a maximum of 500 characters. Through a series of operations, the textual corpus is then transformed into a digital matrix representing the text in terms of the presence/absence of lemmas in the ECUs that compose it. The resulting matrix is submitted to a combined multidimensional analysis procedure, which joins Multiple Correspondence Analysis (MCA) and Cluster Analysis (CA). The MCA extracts the ways in which lemmas associate within the ECUs (that is, their intra-ECU co-occurrences). Each factorial dimension identified by the MCA represents a pattern of co-occurrences that recurs throughout the text, or in a sufficiently large portion of it. The factorial dimensions extracted by the MCA are then used as classificatory criteria by the subsequent CA, which groups ECUs (and lemmas) on the basis of their similarity, that is, on the basis of the combinations of words as they occur in the sentences of the text. A schematic sketch of this pipeline follows.
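For illustration only, the segment-reduce-cluster pipeline just described can be approximated in a few lines. This is not TLab's implementation: the placeholder text, the chosen segmentation heuristic, and the use of truncated SVD as a stand-in for MCA are all our assumptions.

# Approximate pipeline: (1) segment text into ECUs at punctuation within a
# 250-500 character window; (2) build a binary lemma x ECU matrix;
# (3) extract factorial dimensions (SVD here, standing in for MCA);
# (4) cluster the ECUs on their factorial coordinates.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

corpus_text = " ".join([
    "Development is the outcome of local actors joining shared projects and building community.",
    "Planning defines boundaries and allocates resources according to the rules of the programme.",
    "Participation means involving beneficiaries in producing the common good they care about.",
] * 4)  # placeholder interview material

def split_ecus(text, lo=250, hi=500):
    units, buf = [], ""
    for piece in re.split(r"(?<=[.;!?])\s+", text):
        buf = (buf + " " + piece).strip()
        if len(buf) >= lo:       # close a unit once the window is filled
            units.append(buf[:hi])
            buf = ""
    if buf:
        units.append(buf)
    return units

ecus = split_ecus(corpus_text)
X = CountVectorizer(binary=True).fit_transform(ecus)     # presence/absence of lemmas
coords = TruncatedSVD(n_components=2).fit_transform(X)   # factorial coordinates
clusters = KMeans(n_clusters=min(4, len(ecus)), n_init=10).fit_predict(coords)
print(len(ecus), "ECUs ->", clusters)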
The final result of the procedure is thus the identification of clusters of sentences that are similar to one another in that they are characterized by the co-presence of the same words; or, conversely, the identification of clusters of words that are similar in that they tend to be used together in the same sentences. Because of this computational characteristic, the identified clusters lend themselves to being interpreted as thematic nuclei, in that they are characterized by reference to a sufficiently stable aggregate of words (Lancia, 2005). The output of the analysis can be considered a representation of the cultural field characterizing the specific policy context (Carli et al., 2002), in which the latent dimensions that dynamize the field (Factors) and the variability of the different ways of thinking of the interviewed subjects (Clusters) become visible.

2.2 Reference population and sample
The reference population is made up of policy actors. The sample consists of 20 subjects who in various capacities have worked on local development, with whom an in-depth interview was conducted; they are considered key figures of the context under study according to the following illustrative variables: role (politicians, citizens, technicians); type of involvement in the policy (policy makers, policy designers, implementers, beneficiaries); level of belonging (local, supra-municipal, regional, national). As this is a pilot study, a cluster sample with non-proportional quotas (Blalock Jr., 1960) was preferred to a representative sample, drawing on the actors present within the contexts, distributed roughly equally across the three variables. This choice of sample made it possible to build hypotheses, rather than verify them, by bringing out the spectrum of cultural heterogeneity present within the reference population.

3. The main results of the cultural analysis
3.1 The Factors: the main latent dimensions of the cultural field
Three main factors were extracted. Their interpretation on the cultural level follows.

First Factor - Symbolization of the process of social regulation: proceduralized operativity vs valorized belonging. Invited to speak about their own vision of development, their role in relation to it, and the policies capable of promoting it, the subjects interviewed speak, in the first instance, of the way in which they regulate the relational process with their interlocutors. On the one hand (proceduralized operativity), the development of the territory is seen as the outcome of local actors' adherence to the value frame and the actions proposed by development policies. On the other, the reference is to building a shared feeling (valorized belonging), governing and administering concrete facts concerning people's lives, giving weight to the affective value of the bonds of belonging. These are two different models of social regulation, implying two alternative visions of development: technicality as a relational model that works on the assumption of a given context (Carli et al., 1999), where development can be achieved by decree, vs a model of social regulation that works experientially, where development is conceived as endogenous development of the system (Fini et al., 2015).

Second Factor - Forms of desire: safeguarding vs achievement.
In the second instance, the subjects interviewed speak of the drive that moves their action, that is, of the form of their desire. On the one hand (safeguarding), turning the community of belonging into a myth seems to respond to the desire to remove one's own history from contingency, an operation that offers "security" in exchange for "dependence". On the other (achievement), a dialectic between identity and otherness is placed at the centre, with "hope" and "future" taking the place of "security". In both cases "community" is a central lemma, but while in the safeguarding polarity the words it co-occurs with make it look like the value and purpose of action, in the second case it appears rather as a product to be built, dialogically, between inside and outside, old and new. These are two different ways of entering into a relationship with otherness: in the first case, what is unknown is adapted to what is already known; in the second, the known is used to explore the unknown.

Third Factor - Symbolization of the demand for development: substitutive function vs integrative function. In the third instance, the subjects interviewed speak of the demand for development. On the one hand, where the aim is to make the beneficiaries conform to the rules of planning, the rules become insurmountable orders; practitioners feel their role debased to mere compliance and feel powerless. On the other, the policy beneficiaries put themselves forward entrepreneurially, keeping in mind what is relevant to them and asking for rules that allow them to move within shared expectations. Two polarized demands for development emerge: the first underlies a model we could call "substitutive" (Carli, Paniccia, 1999), which attributes a high power to the policy, assessable as a finished product, and puts powerlessness in the place of desire; the second relates to a model we could call "integrative" (Carli, Paniccia, 1999), which expresses the desire to contribute to achieving the beneficiaries' objectives, in an interpenetration of functions and choices, and which thinks in terms of processes.

3.2 The main Clusters
The Cluster Analysis identified 4 main Clusters (Table 1).

Table 1 - Elementary Contexts: CL_1: 407 of 2,504 (16.25%); CL_2: 840 of 2,504 (33.55%); CL_3: 593 of 2,504 (23.68%); CL_4: 664 of 2,504 (26.52%).

C1. The words with the most significant chi-square value (reported in brackets) for this cluster are: tema (102.4); amministrazione (100.6); aspetto (83.9); processo (68.5); economico (66.4); contesto (64.4); imprenditoriale (62.3); azione (50.6); amianto (52); costruire (49.6); impresa (48); innovazione (43.5). We named C1 "Entrepreneurial governance of innovation", for the emphasis placed on innovation, considered as a process to be governed proactively.
C2. The most representative words are: io (277.2); tu (154.7); sindaco (80.5); parlare (63.4); trovare (62.1); sentire (56.7); persona (51.7); giorno (45.5); figlio (41.6); paese (34.9); riuscire (32.6). We named C2 "Involvement in the management of public affairs", for the emphasis placed on direct and personal participation, each with their own role and subjectivity, in the governance of the common good.
C3.
The most representative words are: cooperativo (224.7); comunità (182); notte (105.3); anno (103.1); Melpignano (91.2); fare (87.8); cittadino (83.5); acqua (83); bello (78); casa (75.7); pagare (68.9); euro (63.7); Taranta (60.7). We named C3 "Community as identity", for the emphasis placed on everything that made it possible to build Melpignano as a community that recognizes itself in the management of public affairs and in the valorization of popular tradition.
C4. The most representative words are: territorio (442.4); programmazione (191.7); sviluppo (179.4); area (173.9); regione (171.1); GAL (118.6); attività (104.3); intervento (102.6); livello (90.3); vasto (86.9); Puglia (77.8); governance (75.2). We named C4 "Planning as development", for the identification of the territory with administrative boundaries and the overlap between development and various forms of planning, as if defining boundaries and planning actions were in itself a guarantee of producing development.

3.3 Discussion
Table 2 shows the Cluster-Factor relationship.

Table 2 - Cluster-Factor relationship:
CL_01: Factor 1 = -22.2374; Factor 2 = 14.7017; Factor 3 = 63.7361
CL_02: Factor 1 = 37.0788; Factor 2 = 59.5785; Factor 3 = -22.7426
CL_03: Factor 1 = 60.9616; Factor 2 = -52.9475; Factor 3 = 0
CL_04: Factor 1 = -81.5382; Factor 2 = -11.9437; Factor 3 = -30.8565

The projection of the Clusters onto the factorial space made it possible to understand how similar concepts were used by the interviewees with culturally very different meanings. This is the case, for example, of C2 (achievement / valorized belonging quadrant and substitutive function / valorized belonging quadrant). The discourses of C2 concern being active in the management of public affairs. But their different positioning on the factorial space led us to hypothesize a different vision and, consequently, a different use of the theme of civic participation, a strategic topic for the local context and for development policies, closely connected with the activation of citizens. This hypothesis directed the subsequent explorations in a targeted way, and these brought to light, under the same umbrella, very different socio-organizational micro-processes: on the one hand, the policy beneficiary seen as a subject to be involved in the production of the good, exploring and valorizing their desire (consistently with the achievement / valorized belonging quadrant); here participation is considered the outcome of a dialogical construction. On the other (substitutive function / valorized belonging quadrant), beneficiaries alternately seen as passive users of a good produced by others, or as subjects to whom sovereignty is delegated, with participation treated as an instrument for strengthening systems of belonging. This evidence made it possible to go beyond the classic distinction in the literature between top-down and bottom-up processes (Bens, 2005; Sclavi, 2002) and, in feeding results back to the local subjects, to discuss with them how the existing gap lay rather in the different ways of taking charge of the otherness of the policy beneficiary's desire.
Thanks to this type of investigation, it was also possible to grasp how themes such as innovation and community, which in the interviews emerged contiguously as two local myths that to some extent overlap, actually revealed different cultural positionings. When C1-innovation prevails (for example: inventing a tradition such as the popular music festival La Notte della Taranta; introducing separate waste collection; promoting the use of photovoltaic panels among the citizenry), the practices recounted are oriented mainly by the importance attributed to achieving objectives (proceduralized operativity / achievement quadrant) and by the need to understand how to make innovations appealing to the citizenry (proceduralized operativity / integrative function quadrant). When instead the C3-community theme prevails (for example, promoting the development of a Community Cooperative), what seems to drive action is the idea of strengthening one's own system of belonging (valorized belonging / safeguarding quadrant, and valorized belonging / substitutive function quadrant). Finally, the projection of C4 onto the factorial space, in the proceduralized operativity / safeguarding and proceduralized operativity / substitutive function quadrants, made it possible to grasp the extent to which, within this cultural configuration, planning moves detached from contexts, even where the rhetoric of the programmes provides for instruments for listening to and involving policy beneficiaries. It should also be underlined that the illustrative variables polarized mainly on the first factor, proceduralized operativity vs valorized belonging: technicians on one side and citizens/politicians on the other; policy designers on one side and policy makers/beneficiaries on the other. These polarizations suggested a cultural closeness between policy makers/politicians and beneficiaries/citizens, highlighting how local politics, unlike central politics, is in a privileged position to understand demands and interpret the needs, limits and development potential of real contexts. Implementers, on the other hand, position themselves in opposition to policy makers, beneficiaries and policy designers. This made us reflect on their difficult buffer role, between the demands of the direct interlocutors of politics (beneficiaries, policy makers) and the requirements intrinsic to the programmes.

4. Conclusions
The cultural investigation carried out through textual analysis allowed the research team to construct the object of inquiry starting from elements that would otherwise have been difficult to identify, since the contents proposed by the interviewees appeared almost identical. Being able to grasp these substantial differences from the cultural point of view allowed us to carry out observations, interviews and discussions with the local actors about what we were coming to understand that were far more targeted and interesting, also for the local subjects themselves. Herein lies the real innovation that the cultural investigation allowed the REVES project to bring to the field of local development policy evaluation.

References
Attanasi, G., Giordano, G. (2011). Eventi, cultura e sviluppo. L'esperienza de "La Notte della Taranta". Milano: Egea.
Bens, I. (2005). Facilitating with ease! Core skills for facilitators, team leaders and members, managers, consultants and trainers. San Francisco: Jossey-Bass.
Blalock, Jr., H. M. (1960). Social Statistics.
New York: McGraw-Hill Book Company.
Carli, R., Paniccia, R.M. (1999). Psicologia della formazione. Bologna: Il Mulino.
Carli, R., Paniccia, R.M. (2002). L'Analisi Emozionale del Testo. Milano: Franco Angeli.
Fini, V., Belladonna, V., Tagle, L., Celano, S., Bonaduce, A., & Gaeta, L.G. (2016). Progetto Pilota di Valutazione Locale, Studio di Caso: Comune di Melpignano. Come Stato centrale, fondazioni e Regioni possono sollecitare la progettualità locale. Retrieved at http://valutazioneinvestimenti.formez.it/sites/all/files/2_reves_rapporto_caso_melpignano.pdf
Fini, V., Salvatore, S. (in press). The fuel and the engine. A general semio-cultural psychological framework for social intervention. In S. Schliewe, N. Chaudhary & P. Marsico (Eds.), Cultural Psychology of Intervention in the Globalized World. Charlotte (NC): Information Age Publishing.
Fornari, F. (1979). I fondamenti di una teoria psicoanalitica del linguaggio. Torino: Boringhieri.
Lancia, F. (2004). Strumenti per l'analisi dei testi. Introduzione all'uso di T-LAB. Milano: Franco Angeli.
Matte Blanco, I. (1975). L'inconscio come insiemi infiniti. Saggio sulla bi-logica. Torino: Einaudi.
Parmiggiani, P. (2013). Pratiche di consumo, civic engagement, creazione di comunità. Sociologia del lavoro, 132, 97-112.
Reinert, M. (1986). Un logiciel d'analyse textuelle: ALCESTE. Cahiers de l'Analyse des Données, 3.
Salvatore, S., & Zittoun, T. (2011). Outlines of a psychoanalytically informed cultural psychology. In S. Salvatore & T. Zittoun (Eds.), Cultural Psychology and Psychoanalysis in Dialogue. Issues for Constructive Theoretical and Methodological Synergies (pp. 3-46). Charlotte, NC: Information Age.
Sclavi, M. (2002). Avventure Urbane. Progettare la città con gli abitanti. Milano: Eleuthera.

A proposal for Cross-Language Analysis: violence against women and the Web
Alessia Forciniti, Simona Balbi
University of Naples Federico II – alessia.forc@libero.it

Abstract
The aim of the paper is to investigate the mood on the Web with respect to one of the most relevant Human Rights violations, without any geographic distinction: violence against women. While the literature studying the phenomenon is growing rapidly, the field of action is still fragile, and the open questions concern the relationship between public opinion and contextual factors. As a first look at the phenomenon, we aim at mapping gender violence on the Web, in a Big Data perspective. The peculiar problem we deal with consists in analysing short documents (tweets) written in six different European languages, on the occasion of a common event: the International Day for the Elimination of Violence against Women, 25 November 2017. For our statistical analysis, we choose a multi-linguistic, cross-national perspective. The basic idea is that there are some common, language-independent structures ("concepts"), which are declined in the different national natural-language expressions ("terms"). Investigating those structures (e.g. the factors of lexical correspondence analyses separately performed on the different collections) enables a double-level analysis, trying to understand and visualise national peculiarities and commonalities. The statistical tool is given by Procrustes rotations.

Keywords: Big Data, Text Mining, Cross-national study, Procrustes rotations

1.
1. Introduction
This paper proposes a statistical-linguistic analysis of the mood on the Web in relation to a social issue of universal relevance: violence against women (European Union Agency for Fundamental Rights (FRA), 2014; ONU and United Nations Population Fund, 2016, 2017). Social media are becoming an important platform for the collective thought of society and therefore represent an interesting container of content to study. The constant growth of unstructured information on the Web makes text mining applications increasingly important for extracting knowledge about such phenomena. This work addresses public opinion on gender-based violence in Europe as a reply to a common event: the International Day for the Elimination of Violence against Women (United Nations, General Assembly, 1999), 25 November 2017. The proposed method of analysis is a multilingual, cross-national study of the contents extracted from Twitter through Web scraping techniques. The features of the data (Wu X., Wu G-Q., Zhu et al., 2014) suggest an analysis in terms of Big Data (Zielinski et al., 2012). Considering the requirements of comparative research (Finer, 1954; Lijphart, 1975), the number of case studies does not exceed six European countries: three Western countries, the United Kingdom (UK), Italy and France, and three Eastern countries, Bulgaria, the Czech Republic and Romania.

The research raises several methodological issues: it requires the treatment of multilingual corpora (tweets are written in six different languages), and not all the languages treated in this study are typical of textual data mining applications. The implications concern a careful pre-processing step (cleaning the corpora of URLs and emoticons), the absence of any package or software that includes a stop-word list for all the languages investigated here, and the choice of an appropriate system of weights for the analysis units given the nature of the data (short messages of up to 140 characters). The accuracy of these choices is crucial for the quality of the investigation. This work therefore has not only a cognitive function with respect to the phenomenon, but also represents an opportunity to test the scientific method. The cross-linguistic perspective is given by the projection onto a factorial plane of the most frequent terms for pairs of countries. In order to visualise national peculiarities and commonalities, the factors obtained in the two different natural languages are projected, pairwise, onto a common reference space through Procrustes rotations.

2. Theoretical Framework
In order to visualise the relationships between documents and between terms, a factorial approach is commonly adopted in textual data analysis. The starting point is a lexical table cross-tabulating terms and documents (in this case, terms and tweets). This study proposes a Procrustes analysis as an efficient geometric technique for aligning lexical matrices. Our research produces six lexical tables (X1, ..., X6), one per case study. There is an extremely wide multivariate analysis literature devoted to the problem of comparing and synthesising the information contained in two or more matrices. An interesting way of approaching the problem consists in comparing geometrical configurations in some Euclidean space (Gordon, 1981).
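In standard notation, which the paper leaves implicit, this comparison can be framed as the orthogonal Procrustes problem (cf. Gower, 1975): given two centred and standardized configurations X and Y of n corresponding points in p dimensions, find the rotation Q minimising

  min_{Q : Q'Q = I} || X - YQ ||_F^2 ,   with  m^2 = sum_{i=1..n} || x_i - y_i^ ||^2 ,

where y_i^ is the i-th row of the rotated matrix YQ. The statistic m^2 is the sum of the squared distances between corresponding points, the quantity plotted as "Procrustes errors" in Figure 1 below.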
In our case, Correspondence Analysis (CA) is performed on the six tables; it visualises the major themes and suggests similarities and peculiarities between countries. In order to have a measure of this similarity for a pair of countries, we can compute the sum of the squared distances between corresponding points in the two configurations (the quantity m^2 above). The data structure consists of two matrices, X (n, p) and Y (n, p). X is the lexical table having in its rows the n tweets into which the corpus is organised, and in its columns content-bearing words selected among the most frequent terms in the corpus for one country. Y is the lexical table having in its rows the n tweets and in its columns the content-bearing words selected in the natural language of the other country. Through the CA performed on each corpus, we compute the principal coordinates and create two matrices, X1 and Y1, which represent the coordinate matrices of each language. The coordinate matrices have been standardised and normalised, so that no "rescaling" factor is necessary.

3. Data extraction: Web Scraping
Social media are a potentially infinite source of user data, and Twitter is one of the most widely used social networks worldwide. Twitter is a micro-blogging service whose messages (called tweets) are up to 140 characters long. Web scraping is the process of automatically extracting data from the Web through an Application Programming Interface (API) supported by software (or by packages connected to software). For our research, data extraction was conducted with the Twitter API and R, respecting parameters common to each country: a keyword translated into the six different languages, the specification of the language, the geocode (in order to exclude urban semantics deriving from dialects or territorial slang, which change the common sense of words) and, finally, the sample size (with technical limits: it is possible to extract up to n = 3200 tweets per day). The monitoring period is a week around the International Day for the Elimination of Violence against Women, from 23 November to 30 November 2017.

4. Knowledge extraction
Setting aside a detailed description of the methodological issues involved in pre-processing the multilingual and multimedia content, the argument focuses on the results, which represent one of the most interesting developments of our proposal. One note, however, deserves attention: given the structure and length of each document (tweet), the system of weights for the elementary units is the simple term frequency, w_ij = tf_ij, the number of occurrences of term i in document j. The canonical tools of textual data analysis, such as the occurrence counts of the most frequent terms, do not in this case provide a useful means of comparing countries. Other statistical tools allow us to go deeper into the phenomenon, such as the factorial approach.

4.1. Procrustes analysis for a cross-language study
The scientific method this research intends to test is Procrustes analysis, performed by overlapping two different configurations. The configurations to be compared are two normalised CA coordinate matrices.
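A minimal R sketch of this pipeline, assuming two hypothetical lexical tables lex_it and lex_uk whose columns are matched translation pairs (the paper states the analysis was done in R, but does not name the packages):

  library(ca)     # correspondence analysis
  library(vegan)  # procrustes()

  ca_it <- ca(lex_it)   # lex_it, lex_uk: hypothetical lexical tables
  ca_uk <- ca(lex_uk)   # (rows = tweets, columns = content-bearing words)

  # Principal coordinates of the terms = standard coordinates * singular values
  pc_it <- ca_it$colcoord[, 1:2] %*% diag(ca_it$sv[1:2])
  pc_uk <- ca_uk$colcoord[, 1:2] %*% diag(ca_uk$sv[1:2])

  # Standardize so that no rescaling factor is needed, then rotate UK onto Italy
  pc_it <- scale(pc_it)
  pc_uk <- scale(pc_uk)
  fit <- procrustes(pc_it, pc_uk, symmetric = TRUE)

  plot(fit)        # the "Procrustes errors" plot (cf. Figure 1)
  residuals(fit)   # per-term residuals, e.g. the outlying "abuse"

The rows of pc_it and pc_uk must correspond one-to-one (translation pairs), since the Procrustes rotation compares the two configurations point by point.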
[Figure 1. Procrustes errors: comparison between Italy and the UK. Scatter of term points (e.g. violence, women, rights, abuse, violenza, donne, abusi, ...) on Dimension 1 x Dimension 2; only the caption and axis labels are recoverable from the original plot.]

The graphic representation allows us to observe the Procrustes errors between the two configurations: the points of Italy's normalised principal coordinate matrix and those of the United Kingdom's, where the UK matrix is the rotated one. Beyond the descriptive statistics on the residual scores, the graph shows a concentration of points of both X1 and Y1 around the origin of the axes, so we can state that there is not a wide distance between X1 and Y1. The Procrustean approach confirms the similarity suggested by the CA maps of the UK and Italy (Figure 2 and Figure 3); although the third quadrants of the Italian and UK maps suffer from a dense overlapping of statistical entities, one can note similar topics located in nearly the same position in the multidimensional space.

[Figure 2. Correspondence Analysis map for the UK.]

Furthermore, through CA it is possible to investigate language-independent structures ("concepts"), which are declined in the different national natural-language expressions ("terms"). In other words, even when terms are not exact translations of one another, from Italian to English or vice versa, the conceptual aspect does not change. Studying the vocabulary of each country, we can consider the conceptual aspect, create thematic groupings and label the clusters. The Procrustes errors and the Correspondence Analysis allow us to observe the location of the statistical entity "abuse". In the Procrustes errors plot (Figure 1) this term is distant from the other statistical units; it therefore represents a Procrustes residual. The same consideration emerges from the CA maps (Figures 2 and 3): although the word "abuse" is the direct translation of the corresponding Italian term, its location in the multidimensional space is different. The "joint terms space" (Figure 4) of the Italy-UK comparison shows that terms that are exact translations lie almost close together in the projected factorial space, e.g. "women", "violence", "international day" and "rights".

[Figure 3. Correspondence Analysis map for Italy. Axes: Dimension 1 (21.1%) x Dimension 2 (13.6%); term points include violenza, donne, genere, government, rights, abuse, ...]

Finally, confirming the Procrustes errors plot (Figure 1) and the CA maps (Figures 2 and 3), the unit "abuse" (despite being an exact translation) is more distant than the corresponding translations of the other investigated context.
The visualisations of the Procrustes Correspondence Analysis and of the "joint terms space" test the similarity between Italy and the United Kingdom from a cross-linguistic perspective. Their graphic intelligibility confirms the concordance between the two profiles with respect to public opinion on violence against women.

[Figure 4. Joint terms space, Italy-UK.]

Overall, the visualisations lead us to the assertions above, while individually they allow us to investigate specific aspects of the linguistic peculiarities. The "joint terms space" confirms the overlapping of statistical units (between countries) around the origin of the axes, as does the Procrustes errors graph. There is therefore no large difference between Italy and the UK. The closeness between "terms" of different languages located in the same reference space recalls the thematic groupings brought out by CA.

5. Conclusion and perspectives
In this paper we faced the problem of comparing corpora when one is not the translation of the other. Some investigations (e.g. the comparison between the UK and Italy) indicate that the Procrustes approach is a valid tool for cross-language study. However, the cross-national investigations carried out for all the case studies bring out some limits related to the semantics of the countries' natural-language expressions. It is possible that some terms that are natural-language expressions of one country do not coincide with the translation of the expressions of another country. For example, in the same Italy-UK case, "reformer" can indicate the political aspect that the UK expresses through terms such as "legislation" or "government". Different terms (in natural-language expressions) could thus be ascribed to common conceptual labels, since they actually belong to the same semantic category. A future perspective is to resolve these semantic problems between countries by performing an analysis focused on thematic axes.

References
Balbi and Misuraca (2006). Procrustes techniques for text mining. In Zani et al. (Eds.), Data Analysis, Classification and the Forward Search, pp. 227-234. Berlin, Heidelberg: Springer.
Bolasco, S. (1999). Analisi multidimensionale dei dati. Metodi, strategie e criteri d'interpretazione. Roma: Carocci.
Bolasco, S. (2005). Statistica testuale e text mining: alcuni paradigmi applicativi. Quaderni di Statistica, Vol. 7, pp. 1-37.
European Union (2017). Report on equality between women and men in the EU.
Feldman et al. (1998). Mining text using keyword distributions. Journal of Intelligent Information Systems, Vol. 10, Issue 3, pp. 281-300.
Finer (1954). Metodo, ambito e fini dello studio comparato dei sistemi politici. Studi politici, III, 1, pp. 26-43.
FRA, European Union Agency for Fundamental Rights (2014). Report summary: Violence against women: an EU-wide survey. Results at a glance. Publications Office of the European Union.
Gower, J. C. (1975). Generalised Procrustes analysis. Psychometrika, vol. 40: 33-51.
Lijphart (1975). The comparable-cases strategy in comparative research. Comparative Political Studies, VIII, pp. 161-174.
Wu X., Wu G-Q., Zhu et al. (2014). Data mining with big data. IEEE Transactions on Knowledge and Data Engineering, Vol. 26, Issue 1.
Zielinski et al. (2012). Multilingual analysis of Twitter news in support of mass emergency events. Multilingual Twitter Analysis for Crisis Management.
La verbalisation des émotions
Béatrice Fracchiolla, Olinka Solène De Roger
University of Lorraine in Metz
beatrice.fracchiolla@univ-lorraine.fr; olinka-solene.de-roger8@etu.univ-lorraine.fr

Abstract
Our study concerns the correlation between the perception of negative emotions and the discursive productions used to express them. It is based on 26 transcribed oral interviews (13 men and 13 women) analysed with Lexico3. We study the way healthy volunteers react verbally to the conditioned production of negative emotions after viewing the "stop jihad" video produced by the French government and broadcast on television after the 2015 attacks. The interviews were collected between November 2016 and February 2017 within the framework of the COREV project (understanding verbal violence in reception). (Footnote 1: The COREV project (2016-2017), which allowed us to build the corpus studied here, is a collaboration of the CNRS, the University of Lorraine and the Pitié-Salpêtrière hospital, aimed at a comparative analysis of the neurophysiological, emotional and discursive responses to exposure to (verbal) violence before/after sleep and before/after waking.) In parallel, following an identical protocol, we showed another, "neutral" video to the same people in order to have a control condition. All subjects saw both videos, but in different orders, at 11-hour intervals. Following our Lexico3 analysis methodology, we extracted the linguistic data giving an overview of the emotional feelings perceived by the volunteers after viewing each (neutral or violent) video, and propose a synthetic map of them. The analysis was conducted with three tools for the statistical analysis of textual data offered by Lexico3: the search for specificities according to the partitions using the PCLC tool (Main Lexicometric Characteristics of the Corpus), concordances, and ventilation graphs by partition. The overall analysis of the results shows, first, that emotions are distributed according to the nature of the videos (neutral video: positive and/or neutral emotions; violent video: negative emotions) and that the violent video provokes longer speech than the neutral one. Second, while the intensity of perceived emotions seems to differ from person to person, we show that it is also globally correlated with the order in which the videos are shown: the responses and the construction of the discourses display a positive or negative intensity of emotions depending on which video is seen first. Likewise, the analysis seems to show that the reception of violence urges volunteers to express themselves more about their feelings: can we see here a correlation between discursive productivity and negative emotions, a form of verification of the French proverb that "happy people have nothing to say"?

Résumé
Our study concerns the correlation between the perception of negative emotions and the discursive productions that express them. It is based on 26 individual oral interviews, transcribed for analysis with Lexico3 (13 men and 13 women). We study how healthy volunteers react verbally to the conditioned production of negative emotions after viewing the government's "stop-djihad" video, broadcast on television after the 2015 attacks.
The interviews were collected between November 2016 and February 2017 within the framework of the COREV project (understanding verbal violence in reception; see footnote 1). In parallel, following an identical protocol, we showed another, "neutral" video to the same people in order to have a control condition. All subjects saw both videos, but in different orders, at 11-hour intervals. Following our Lexico3 analysis methodology, we extracted the linguistic data giving an overview of the emotional feelings perceived by the volunteers after viewing each (neutral or violent) video, and proposed a synthetic map of them. The Lexico3 analysis was conducted with three of its tools for the statistical analysis of textual data: the search for specificities according to the partitions using the PCLC tool (Principales Caractéristiques Lexicométriques du Corpus), concordances, and ventilation graphs by partition. The overall analysis of the results shows, first, that emotions are distributed according to the nature of the videos (neutral video: positive and/or neutral emotions; violent video: negative emotions) and that the violent video elicits longer speaking time than the neutral one. While the intensity of perceived emotions seems to differ from person to person, we show here that it is also relative to the order in which the videos are shown. Lexical and discursive indices allow us to verify that subjects who saw the jihad video first react with more positive emotions to the "neutral" video and, conversely, that those who saw the neutral video first react with more negative emotions during the screening of the stop-djihad video. In other words, we observe in the responses and in the construction of the discourses a correlation of positive or negative emotional intensity as a function of the video seen first. Likewise, the analysis seems to show that the reception of violence challenges the volunteers and pushes them to express themselves more about their feelings: can we see here a correlation between discursive productivity and negative emotions, a form of verification of the proverb according to which "happy people have nothing to say"?

Keywords: verbal violence, discourse analysis, emotions, textual statistical analysis, Lexico3

1. Introduction
In this study, we are interested in the way subjects confronted with violent material verbally externalise their emotions. In the experiment designed for this purpose, we worked on different types of emotional responses obtained from 26 subjects who viewed a "violent" video (the "stop-djihad" video broadcast by the French government after the 2015 attacks, hereafter video V) and a "neutral" video (on the new Languedoc-Roussillon-Midi-Pyrénées region, hereafter video N). The multimodal protocol used to collect our data was carried out in a hospital setting (in the unit of, and in collaboration with, Professor Isabelle Arnulf, neurologist, director of the sleep pathology unit of the Pitié-Salpêtrière hospital, professor of neurology at Université Pierre et Marie Curie (UPMC), laboratory ICM UMR 7225).
We collected several semi-directive individual interviews on emotional feelings before and after viewing the different videos, as well as numerous neurovegetative data. This research, supported by the CNRS interdisciplinarity mission between November 2016 and December 2017, aimed more particularly at understanding the perception of verbal violence in healthy subjects (Fracchiolla et al., 2013). The experiment allows us both to highlight some of the markers of verbal emotional externalisation and to compare the types of responses to videos V and N. The present publication deals exclusively with the verbal dimension of the externalisation of emotions, once the corpus of interviews with our subjects had been transcribed and studied with the Lexico3 software. Our approach here is more specifically to ask whether the words we use to express ourselves are in line with what we think and, above all, with the emotions we feel. Our corpus thus consists of 26 interviews divided into two groups as follows: Group 1 saw the videos in the order 1/ video N, 2/ video V; Group 2 saw the videos in the reverse order, 1/ video V, 2/ video N. (Footnote 4: One of the main research criteria was to see whether emotions were better integrated over an 11-hour interval by day or by night. All subjects therefore saw both videos twice, with 11 hours between screenings: 13 subjects in the order video V morning and evening and N evening and morning, and 13 subjects, conversely, in the order video N morning and evening and V evening and morning.)

2. Manifestations of an "emotioned" discourse
2.1. Analysis of the PCLC
The breakdown of the corpus according to the "video" partition with the PCLC tool (Main Lexicometric Characteristics of the Corpus) shows the specificities of this first partition, by video and by group. The interviewers' interventions are not included.

Table 1: Main characteristics of the "video" partition

Group 1
Part            V1 N1   V1 N2   V1 Neutral   V2 Dj1   V2 Dj2   V2 Djihad   Group 1 total
Occurrences      8295   33359        41654     7872    40191       48063           89717
Forms            1227    2926         4153     1224     3325        4549            8702
Hapax             689    1538         2227      685     1679        2364            4591
Max frequency     300    1049         1349      260     1225        1485            2834
Form               de      de           de       de       de          de              de

Group 2
Part            V1 Dj1   V1 Dj2   V1 Djihad   V2 N1   V2 N2   V2 Neutral   Group 2 total
Occurrences      12794    35405       48199    5790   36002        41792           89991
Forms             1677     2966        4643     961    3013         3974            8617
Hapax              906     1492        2398     517    1561         2078            4476
Max frequency      368     1096        1464     168    1205         1373            2837
Form                Et       Je          Je      La      Je           Je              Je

For Group 1 (N first and V second), the most frequent form is "de", whereas for Group 2 it is "je". The characteristics are roughly equivalent whichever video is screened first. Whatever the video and whatever the order, for both groups we note that the first exposure to a video provokes fewer reactions (speech = number of forms) than the second, which is a priori due to the fact that interviews 2 (evening) and 3 (next morning) contain a video-recall interview before the second screening, and are therefore longer.
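Lexico3 computes these statistics internally; as a rough illustration only, the same summary could be rebuilt in R from a hypothetical token table tok, with one row per running word and columns form and part:

  # Occurrences, forms, hapax, maximum frequency and top form per part
  pclc <- do.call(rbind, lapply(split(tok$form, tok$part), function(f) {
    tab <- table(f)
    data.frame(occurrences = length(f),
               forms       = length(tab),
               hapax       = sum(tab == 1),
               freq_max    = max(tab),
               top_form    = names(which.max(tab)))
  }))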
However, whatever the order of presentation, all subjects taken together speak more (about 7,000 occurrences more) about video V (stop-djihad) than about video N. A tendency thus emerges whereby confrontation with violence would provoke first-person ("je") speech and a greater need to talk.

2.2. Analysis of the "emotioned" lexicon
Recognised as specific, instantaneous "moments", emotions are defined as "a physical and/or psychological reaction due to a situation", whose effect can sometimes last more or less over time depending on its intensity (Colletta & Tcherkassof, 2003; see also Bourbon, 2009; Feldman et al., 2016; or Fiehler, 2002). To study the lexicon of emotions, we grouped into lists the words identified in the corpus, and checked against their concordances, as relating to the expression of four of the six basic emotions according to Ekman (1972): joy, anger, sadness and fear (here called worry). This choice of four emotions, and of the term "worry" ("inquiétude") instead of "fear" ("peur"), was made to match the BMIS tests (self-assessment scales of emotional state) administered to the volunteers before and after each video screening. The terms of the "emotioned" lexicon are gathered below by "groups of forms"; for example, agréable+ covers agréable(s)(ment). The French items are kept as found in the corpus:

Happiness/Joy: adoucit; agréable+; allégresse; ambiance+; amusé+; apaisant+; bon+; calme+; content+; désir+; emballer+; émerveillé; émouvoir+; excitant+; fière; gai+; heureux+; jaloux*; joie+; marrant+; paisible; ravi; serein+; surpris+
Anger: aberrant+; agacée+; agressé+; blasé+; chiffonne; choc/choquer+; colère; énerver+; fâcher; frappant+; furieux; haine; hard; heurté+; horreur+; horripile+; hostile+; irriter+; révolter+; saoulé
Worry/Fear: agitation+; angoissant+; anxiété+; apeuré+; crainte; effraiement*; effrayant+; flippant+; gêne+; incompréhensible+; nerveux+; perdre+; peur+; stressant+; terreur
Sadness: affecter+; affreux+; attristé+; bouleversé+; déception/déçu+; dégoût+; déprimant+; dérange+; désolant+; impuissance; malheureusement; malheureux; mélancolique; navrée; peine+; triste+

We merged positive and neutral emotions into a single group, which explains why terms such as apaisante, calme, serein are listed under "joy": they do not mean feeling joy, but their axiology is evaluated as positive because it expresses a certain emotional neutrality (Kerbrat-Orecchioni, 1980). Similarly, the term jaloux ("jealous") in the "joy" column invites questions: jealousy is normally associated with the expression of a negative desire, of the order of worry and anger; but here it conveys desire, as the context shows: "… ça faisait, ça faisait très envie et ça rendait un peu jaloux" ("it really made you want it, and it made you a bit jealous"). Here jaloux, like envie, expresses a positive desire, tending towards well-being, contrary to its intrinsic semantic axiology. Likewise, the term chiffonne ("to bother, to vex") is a negative emotion that ought rather to be placed in the worry column; but in context it corresponds here to anger (énerver would be a synonym): "… ça me, ça me chiffonne un peu de voir ce genre de, de, de vidéo à chaque fois" ("it bothers me a bit to see this kind of video every time").
Finally, the neologism effraiement*, a masculine noun built on the verb effrayer ("to frighten"), is here associated with fear, allowing us to classify it in the worry column: "un petit peu de peur et, et d'effraiement" ("a little bit of fear and, and of frightenedness"). (Footnote 5: One may question here, at a broader level, the very principle of neological creation in relation to emotional context, which can manifest itself in verbal production as well as in the body through various perturbations: stuttering, intonation, altered breathing, losing one's words, etc. (see Plantin, 2016); the creation of neologisms would be one lexical manifestation of such perturbations.) Generally speaking, a fine-grained study would require a developed, in-context analysis of all the terms listed here; this is the subject of another publication.

3. Evaluating emotions in context
The concordance analysis of the emotioned lexicon identified above reveals significant elements with the "before" sort, synthesised in the table below. These results were complemented by ventilation graphs.

Table 2: Synthesis of the adverbs and adverbial locutions accompanying expressions of emotion

                          Joy   Anger   Worry   Sadness
un (petit) peu             10      37      37        36
un (peu) plus               8       0       4         0
(encore/beaucoup) plus     20      27       8         9
aussi                       0       2       2         0
assez                       5       9       2         0
plutôt                      8       8       1         2
moins                       7       5       0         0
pas très                    8       0       0         0
pas                        12       0       0         7
très                       13       0       1         0
vraiment                    0       3       4         0
autant                      0       0       3         0
surtout                     0       0       0         4

[Figure 1: Histogram of the adverbial locutions found near expressions of emotion (relative frequencies).]

The interactional context of the study, in which interviewees are asked to evaluate the emotions they felt, generates, as can be seen, responses almost systematically accompanied by adverbs or adverbial locutions expressing a positive, equivalent or negative intensity. Significantly, we then note an accentuation of positive intensity when it comes to expressing joy ("encore/beaucoup/plus" 20 times, "très" 13 times), whereas "un (petit) peu" is omnipresent to significantly attenuate the negative emotions felt (anger, worry, sadness). The second graphic projection shows that when joy is expressed, it is expressed in a more varied way than the negative emotions. These results indicate that for the corpus under study, which concerns the reception of violent discourse, the expression of intensity corresponds to attenuation. One can see, for example, that worry and sadness are the emotions that most attract the intensity locution "un peu", which tends to restrict the intensity of the emotion perceived by the speaker (Coupin, 1995). It is also possible that this is because these are more diffuse emotions, harder to characterise clearly than joy and anger, which are fairly easy to identify when felt. This is confirmed by the fact that positive emotions are accompanied by adverbial locutions marking strong intensity (encore/beaucoup; plus and très): speakers express their joy with certainty and are not afraid to say it. Significantly, this is also the case for the expression of anger, which seems to be the emotion most characterised adverbially, both by attenuating and by intensifying elements ("un (petit) peu" 37 occ.
and "encore/beaucoup/plus" 27 occ.), which can be interpreted as expressing the fact that the volunteers are not particularly happy to find themselves exposed to video V twice, and show it in this way. Context appears fundamental here: anger is linked in one way or another to a form of powerlessness in the face of the terrorist attacks, of the images shown (which are more or less directly related, depending on the subject, to the attacks and the state of emergency) and of the situation of Syrian civilians.

[Figure 2: Ventilation graphs by partition, videos V and N.]

The ventilation graphs by video partition (V and N) show the emotions expressed by the volunteers according to the videos viewed. Negative emotions (anger, worry, sadness) are high for V; conversely, joy is fairly high for N. We note a variation in emotions between the first and second viewing of the videos: the verbalisation of negative emotions tends to decrease at the second viewing (V1 to V2), whereas positive emotions increase from V1 to V2. The reverse phenomenon is also observed: positive emotions decrease from N1 to N2, and negative ones increase from N1 to N2, as shown in the table below.

Table 3: Summary of the ventilation graphs for partitions V1 and V2

Group 1    V1 = N   V2 = DJ   |V1 - V2|      Group 2    V1 = DJ   V2 = N   |V1 - V2|
Joy           159       154           5                     245      259          14
Anger         153       215          62                     167      105          62
Worry         145       202          57                     100       43          57
Sadness        84       134          50                     124       74          50

Conclusion
The subjects' reactions show, as expected, that video V generates negative emotions and video N positive ones. However, the intensity of the emotions expressed tends to be influenced by the order in which the videos are seen: in Group 1, joy is expressed 159 times; it is expressed 259 times for N in Group 2. When volunteers see video V first, their emotional reactions seem to tend statistically in the opposite direction to those observed in the reverse order: the verbal expression of happiness tends to be higher when they see video N after V, and the expression of anger, worry and sadness is clearly lower. The study of the emotioned lexicon suggests that subjects feel more well-being when they see video N after V, as a relief or appeasement that comes after a violent scene. When video N is seen first, however, a certain emotional stress factor remains, probably because the subjects are discovering the experiment and do not know what they are going to see.

References
Bourbon, B. (2009). L'expression des émotions & des tendances dans le langage. University of Michigan Library.
Colletta, J.-M. et Tcherkassof, A. (2003). Les émotions. Cognition, langage et développement. Belgique: Mardaga.
Coupin, C. (1995). La quantification de faible degré : le couple peu/un peu et la classe des petits opérateurs. Thèse de doctorat, dir. Oswald Ducrot, EHESS.
Feldman Barrett, L., Lewis, M., Haviland-Jones, J. M. (2016). Handbook of Emotions, Fourth Edition. The Guilford Press.
Fiehler, R. (2002). How to do emotions with words: Emotionality in conversations. In Fussell, S. (ed.), The Verbal Communication of Emotions. London: Lawrence Erlbaum, pp. 87-107.
Fracchiolla, B., Moïse, C., Romain, C. et Auger, N. (2013). Violences verbales. Analyses, enjeux et perspectives.
Rennes: Presses Universitaires de Rennes.
Kerbrat-Orecchioni, C. (1980). L'énonciation. La subjectivité dans le langage. Paris: A. Colin.
Perrin, L. (2016). La subjectivité de l'esprit dans le langage. In Rabatel, A. et al. (éds), Sciences du langage et neurosciences (Actes du colloque de l'ASL 2015). Lambert-Lucas, 189-209.
Plantin, Ch. (2011). Les bonnes raisons des émotions. Principes et méthode pour l'étude du discours émotionné. Berne: Peter Lang.

Improving Collection Process for Social Media Intelligence: A Case Study
Luisa Franchina(1), Francesca Greco(2), Andrea Lucariello(3), Angelo Socal(4), Laura Teodonno(5)
(1) AIIC (Associazione Italiana esperti in Infrastrutture Critiche) President - blustarcacina@gmail.com
(2) Sapienza University of Rome - francesca.greco@uniroma1.it
(3) Hermes Bay Srl - a.lucariello@hermesbay.com
(4) Hermes Bay Srl - a.socal@hermesbay.com
(5) Hermes Bay Srl - l.teodonno@hermesbay.com

Abstract
Social Media Intelligence (SOCMINT) is a specific branch of Open Source Intelligence. Open Source Intelligence (OSINT) consists in the collection and analysis of information gathered from public, or open, sources. Social Media Intelligence allows data to be collected from social media websites (such as Facebook, Twitter, YouTube, etc.). Both OSINT and SOCMINT are based on the Intelligence Cycle. This paper aims to illustrate the advantages gained by applying text mining to the collection phase of the intelligence cycle in order to perform threat analysis. The first step in detecting information related to a specific target is to define a consistent set of keywords. Web sources are varied and characterised by different writing styles, and repeating this process manually for each source can be very inefficient and time-consuming. Text mining software was therefore used to automate the process and reach more reliable results. A partially automated procedure was developed to gather information on a specific topic from the social medium Twitter. The procedure consists in manually searching a small set of keywords to be used for a specific threat analysis. The twitteR package of R Statistics was then used to gather tweets, which were collected in a corpus and processed with the T-Lab software in order to identify a new list of keywords according to their occurrence and association. Finally, an analysis of the advantages and drawbacks of the developed method is provided.
Keywords: Social Media Intelligence, Twitter, text mining, data collection

1. Introduction
"Open Source Intelligence [OSINT] is the discipline that pertains to intelligence produced from publicly available information that is collected, exploited, and disseminated in a timely manner to an appropriate audience for the purpose of addressing a specific intelligence requirement" (Headquarters Department of the Army, 2010, p. 11-1). OSINT is mainly used in the framework of national security, by law enforcement to conduct investigations, and in the business field to gather important information. Social Media Intelligence (SOCMINT) is a specific branch of OSINT that focuses on social media. In recent years, with the spread of the Internet and the high amount of readily accessible data giving a picture of the actual state of things, the importance of OSINT and SOCMINT has grown, making them key enablers of decision and policy making. To bring the best out of such a flow of data, the intelligence process must take place as a systematic approach structured around clear steps: planning and direction; collection; processing; analysis and production; dissemination. These stages, each of which is vital, make up the Intelligence Cycle (CIA - Central Intelligence Agency, 2013). In order to automatically collect data from both the Web and social media, OSINT dashboards are being developed (Brignoli et Franchina, 2017). This paper describes the contribution provided by automated support tools in the collection phase of the Intelligence Cycle from a social medium (Twitter) on a phenomenon of interest. To capture the real essence of the available text and turn publicly collected data into valuable and reliable knowledge, text mining techniques were implemented. Text mining plays a relevant role here, as it enables the detection of meaningful patterns for extracting knowledge from textual data. As stated by Feldman and Sanger: "Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools. In a manner analogous to data mining, text mining seeks to extract useful information from data sources through the identification and exploration of interesting patterns" (Feldman et Sanger, 2007, p. 1).

2. The use of Twitter
Twitter is a popular social medium, a microblog used mainly for real-time information and communication. With social media becoming the main tool for informational exchange, in October 2017 Twitter reached about 330 million users (Statista, 2018). Twitter's specific characteristics make it particularly suitable for SOCMINT purposes. Contents can be accessed by anyone, with no need to create an account. Its users interact with short messages called "tweets", whose length is limited to 280 characters and which can be embedded, replied to, liked and unliked.
The quick nature of tweets, easily compared to SMS (Short Message Service) messaging, fosters the use of acronyms and slang and provides a real-time feel, as tweets carry the first reactions to an event. Phrasing can be simple in structure or involve a large number of hapaxes. Having become one of the most important web applications, Twitter provides a large amount of data and therefore constitutes a vital source for Social Media Intelligence. Thanks to its characteristics (potential reach, one-on-one conversation, promotional impact), Twitter has gained importance over the years in different social fields, from politics to media communication and terrorism. As a result, it is commonly considered a valuable source for monitoring social phenomena and their changing patterns.

3. Case Study
This section illustrates how text mining tools can be integrated into the SOCMINT data collection phase. The aim of the procedure is to select a suitable and limited list of keywords allowing for effective and efficient information retrieval in support of the analyst's work. In this case study, the analyst was interested in collecting tweets on the criminal and antagonist threat macro-theme, which is related to many specific topics such as critical infrastructures or telecommunications. The collection process has to identify a list of keywords able to collect the messages concerning, for example, "the criminal and antagonist threat in relation to critical infrastructures". The process can be illustrated as a cycle of four steps: selection of keywords related to the specific topic, performed by the analyst; tweet collection; text mining; and verification and definition of the keyword list (Figure 1).

[Figure 1: The four-step cycle of the automated process for Twitter data collection.]

3.1. Keywords selection
The first step is performed by the analyst and consists in defining a suitable list of words that can be used to collect tweets related to a specific theme, which in our example is critical infrastructures. To each topic X corresponds a set of keywords defining it (X1, X2, …, Xn), e.g. railway, station, airport. The topic is covered by the union of all these sets: X = X1 ∪ X2 ∪ … ∪ Xn.

3.2. Tweets collection
Once the keywords are selected, the second step consists in collecting data from the Twitter repository, e.g. using the twitteR package of R Statistics (Gentry, 2016), in order to identify the keywords that allow the collection of a certain amount of tweets, which in our example was more than one hundred per day. A word could perfectly represent the topic but be rarely used in actual messages, resulting in a small sample of tweets. The aim of this step is to find the words that allow for an effective data collection (n ≥ 100), eliminating those that are rarely used in the messages (n < 100). This makes information retrieval more effective, as the number of keywords that can be used is limited.
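As a sketch of this collection step, the call below uses the twitteR package cited above; the keyword, geocode and date values are illustrative placeholders, not the ones actually used by the authors:

  library(twitteR)
  # setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

  tweets <- searchTwitter("infrastrutture critiche",   # candidate keyword (hypothetical)
                          n = 3200,                    # extraction limit per day
                          lang = "it",
                          geocode = "41.9,12.5,600km", # centre/radius territorial filter
                          since = "2018-01-01",        # hypothetical window
                          until = "2018-01-10")

  length(tweets)                                       # keep the keyword only if n >= 100
  corpus <- sapply(tweets, function(t) t$getText())    # raw texts for pre-processing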
3.3. Text Mining
After the keywords' data-collection efficacy was checked, a collection of the messages of the previous ten days, the maximum retrieval window of the Twitter repository, was performed, including retweets (49.3%). The large corpus (tokens = 284,253) of 19,491 tweets was cleaned and pre-processed with the T-Lab software (Lancia, 2017) in order to build a vocabulary (types = 19,765; hapaxes = 8,947) and a list of content words (nouns, verbs, adverbs, adjectives) (Table 1). The list of content words was then checked in order to identify new keywords and extend the list.

Table 1: The first 20 lemmas of the list (lemma and number of occurrences n)

stazione     6066    elettrico   2226    treno        1198    via           825    ferrovia     659
aeroporto    4734    nuovo       1581    regione      1025    Milano        731    repubblica   632
impianti     3605    rifiuti     1536    Zingaretti   1022    autorizzare   720    giorni       627
Roma         3337    comune      1317    aiutare       896    Italia        679    centrale     605

In order to perform a content analysis, keywords were selected. In particular, we used lemmas as keywords, filtering out lemmas with fewer than ten occurrences. Then, on the tweets-by-keywords matrix, we performed a cluster analysis with a bisecting k-means algorithm (Savaresi et Boley, 2004), limited to twenty partitions and excluding all the tweets that did not contain at least two co-occurring keywords. The eta squared value was used to evaluate and choose the optimal solution. The results of the cluster analysis show that the keyword selection criteria allow the classification of 98.53% of the tweets. The eta squared value was calculated for the partitions from 3 to 19 clusters, and it shows that the optimal solution is 13 clusters (η² = 0.19) (Figure 2; a sketch of this computation closes this subsection). The analyst then inspected the lexical profile of each cluster in order to detect the words useful for focusing data collection by means of Boolean operators. This procedure allows the identification of a short list of the most used words (about 20) with regard to both the macro-theme and the related topic. The keyword list was then further reduced to a set of five meaningful words for each intersection of the macro-theme with a specific topic. This reduction stems from the fact that using a larger number of words led to an exponential increase in the false-positive rate.

[Figure 2: Eta squared difference per partition.]

As mentioned above, although this methodology effectively extracts the most frequently used words, with regard to Twitter it is still necessary to test the keywords in order to reduce the "noise" they produce, which can never be eliminated entirely. In other words, this methodology shapes the keyword set on the basis of the redundancies in users' language, but keyword quality should still be tested in the Twitter search engine in order to reach an acceptable rate of both false positives and false negatives. These words made up the vocabulary used to identify the intersection between the macro-theme and a specific topic, i.e. in the first case "the criminal and antagonist threat with regard to critical infrastructures", in the second case "the criminal and antagonist threat with regard to telecommunications", etc. The identified words are combined with an OR relationship, for example: terrorism OR attack OR attack at station OR airport OR railway. The intersection between the cluster "criminal and antagonist threat" and the topic "critical infrastructure" is synthesised by the formula C = A ∩ B, where A is the set of tweets matching the cluster "criminal and antagonist threat", B is the set matching "critical infrastructure", and C is their intersection, i.e. "the criminal and antagonist threat with regard to critical infrastructures". Figure 3 shows an example.

[Figure 3: An example of a possible set of words defining the intersection of the cluster "criminal and antagonist threat" with the topic "critical infrastructure".]
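For the partition-evaluation step described above, a minimal R sketch of the η² computation (assuming a hypothetical numeric matrix coords of tweet coordinates and a list partitions of cluster assignments for k = 3..19) might be:

  # Eta squared: share of total variance explained by the clustering,
  # summed over the columns of the coordinate matrix.
  eta_squared <- function(coords, cluster) {
    ss <- sapply(seq_len(ncol(coords)), function(j) {
      x <- coords[, j]
      grand <- mean(x)
      between <- sum(tapply(x, cluster, function(g) length(g) * (mean(g) - grand)^2))
      total <- sum((x - grand)^2)
      c(between, total)
    })
    sum(ss[1, ]) / sum(ss[2, ])
  }

  eta <- sapply(partitions, eta_squared, coords = coords)
  diff(eta)   # the "eta squared difference per partition" of Figure 2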
3.4. Verification test
Finally, the list of keywords was tested on the Open Source Intelligence dashboard. The collected tweets were analysed in order to assess how reliably the list monitors the desired phenomena.

4. Conclusion
The developed process reflects the reliability of text mining software in supporting the information-gathering process for Social Media Intelligence purposes. The vocabulary identified for four different clusters, each covering a specific topic, is currently being tested on an advanced dashboard in order to evaluate its reliability. The role of the analyst, however, remains fundamental. The relationship between the OSINT dashboard and analysts must be complementary: the dashboard plays a key role in gathering a large amount of tweets, but the analyst's support is still necessary for choosing the suitable keywords to be uploaded into the database, in order to make information collection more effective. Indeed, an OSINT dashboard cannot understand Twitter users' metaphors and similes: the choice of keywords must be made in accordance with the monitoring targets. It should be recalled that the Italian language is very complex, and users' language may not refer to the chosen target. A practical example: some keywords that usually refer to criminal threats ("bomba", bomb, or "furto", theft) can also be used in Italian to refer to other concepts, from football to business offers ("bomba" might describe a goal scored with a powerful strike; "furto" might mean that a particular offer is overpriced). Another important issue that cannot be solved without analysts concerns ironic tweets: the dashboard collects all the information uploaded into the database, but it cannot separate tweets into ironic and non-ironic by interpretation. To conclude, as dashboards do not understand the textual meaning of words, analysts are required to complement the dashboard's capabilities, being the only ones able to interpret the specific meaning of words.

References
Brignoli, M. A., and Franchina, L. (2017). Progetto di Piattaforma di Intelligence con strumenti OSINT e tecnologie Open Source. Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), Venice, Italy, pp. 232-241.
CIA, Central Intelligence Agency (2013). Kids' Zone. https://www.cia.gov/kids-page/6-12th-grade/who-we-are-what-we-do/the-intelligence-cycle.html
Feldman, R. and Sanger, J. (2006). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
Gentry, J. (2016). R Based Twitter Client. R package version 1.1.9.
Headquarters Department of the Army (2010). FM 2-0 Intelligence: Field Manual. https://fas.org/irp/doddir/army/atp2-22-9.pdf
Lancia, F. (2017). User's Manual: Tools for text analysis. T-Lab version Plus 2017.
Savaresi, S. M. and Boley, D. L. (2004). A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4): 345-362.
Statista (2018). Twitter: number of monthly active users 2010-2017. https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/

The impact of language homophily and similarity of social position on employees' digital communication
Andrea Fronzetti Colladon, Johanne Saint-Charles, Pierre Mongeau

1. Introduction
Knowledge creation and organizational communication are fundamental assets for obtaining strategic competitive advantage (Tucker, Meyer, & Westerman, 1996), and in modern organizations most of this happens through digital communication.
We know that the way employees use digital communication can predict their engagement level (Gloor, Fronzetti Colladon, Giacomelli, Saran, & Grippa, 2017) as well as future business performance (Fronzetti Colladon & Scettri, 2017). Hence there is a need to better understand what affects employees' participation in internal communication, in order to foster its efficacy and to deliver effective messages and campaigns in the most strategic way. Based on the idea of homophily, this paper examines whether employees' participation in their organization's intranet is linked with their similarity in discourse and in network position. Communication, digital or not, encompasses both the language people use to communicate and the interactions and relationships they have (Tietze, Cohen, & Musson, 2003; White, 2011). In the last two decades, scholars have explored how people's discourse and relationships are intertwined, notably through the lens of social network analysis. (Footnote 1: Discourse is defined here as "a general term that applies to either written or spoken language that is used for some communicative purpose" (Ellis, 1999, p. 81).) Among others, these studies have shown that social relationships or interactions between people are linked to the similarity of the words and expressions they use (Basov & Brennecke, 2018; Nerghes, Lee, Groenewegen, & Hellsten, 2015; Roth & Cointet, 2010; Saint-Charles & Mongeau, 2018). Gloor and colleagues have also proposed a framework to study online social dynamics in which language plays an important role, especially with regard to the dimensions of sentiment, emotionality and complexity (Gloor et al., 2017). Such results align with the notion of homophily, the tendency to relate to others on the basis of similarities (Lazarsfeld & Merton, 1954), now acknowledged as an important factor in the constitution of social networks (Mcpherson, Smith-Lovin, & Cook, 2001). It is assumed that similarity leads to the development of relationships, since similarity is linked to attraction towards the other (Montoya & Horton, 2013). Considering digital communication, Brown, Broderick, & Lee (2007) and Yuan & Gay (2006) showed that tie strength and computer-mediated interaction increase with homophily. Most studies have explored similarities with regard to sociodemographic variables, but several authors have expanded this to a wide range of variables including attitudes, psychological traits, values, etc., as latent homophily factors (Lawrence & Shah, 2007; Shalizi & Thomas, 2011). Hence, given that interaction in digital communication happens through written text, we assume that the discourse similarity of employees' messages is a key homophilic determinant of employees' interactions in the internal digital communication network. Similarity can also be observed with regard to network position. Indeed, occupying an equivalent position in a network has been shown to lead to similar outcomes (attitudes, points of view, roles, etc.) (Borgatti & Foster, 2003; Burt, 1987). In the study of large online networks, actors' similarity in centrality has proven useful for identifying role similarity in the network (Roy, Schmid, & Tredan, 2014). According to Gloor et al. (2017), it is also important to investigate the dynamic evolution of social positions.
Rotating leaders, for example, have proven to play a very important role in online communities, supporting their growth and participation (Antonacci, Fronzetti Colladon, Stefanini, & Gloor, 2017). In sum, the "homophily phenomenon" has been widely demonstrated through the study of various types of similarity. This paper explores the phenomenon in the context of the use of an internal digital communication system in an organization, using discourse and network position similarity measures; our overall hypothesis is that the two are correlated and that both are correlated with interactions.

2. Research Design and Methodology
We analyzed the digital communications of about 1,600 employees working for a large multinational company, mainly operating in Italy. This company has a widely used intranet social network, structured as an online forum, where only employees can interact, exchanging opinions and ideas through the sharing of news and comments. We extracted and analyzed more than 23,000 posts (news and comments), written in Italian over a period of one and a half years. Users were mostly male (68%), and a small part of them also played the role of content managers (7%). The first step in our analysis was to build the social network representing the forum interactions. This network is made of N nodes, one for each forum user, and M edges. In general, there is an edge between two nodes if the corresponding employees had at least one interaction; for example, they exchanged knowledge or opinions through subsequent comments, or one answered a question of the other. We then calculated the similarity measures for both discourse and network position. Based on what was presented above, we looked at five aspects of discourse similarity: word use, sentiment, emotionality, complexity and length. Additionally, we studied employees' connectivity and interactivity, as suggested by Gloor and colleagues (2017). Length is simply calculated as the average number of characters in an employee's forum posts, after removing stop-words and punctuation via a script written in the Python programming language with the package NLTK (Perkins, 2014). Sentiment expresses the positivity or negativity of forum posts and is calculated with the machine learning algorithm included in the social network and semantic analysis software Condor (Gloor, 2017). Sentiment varies between 0 and 1, where 0 represents a totally negative post and 1 a totally positive one. Emotionality expresses the variation from neutral sentiment and is computed by Condor using the formula presented by Brönnimann (2014); posts that convey less neutral expressions, either positive or negative, are considered more emotional. Lastly, complexity represents the deviation from common language and is calculated as the probability of each word of a dictionary appearing in the forum posts (Brönnimann, 2014); when rare terms appear more often, complexity is higher. This last measure was also obtained from Condor. Concerning the study of employees' positions in the social structure, we referred to network centrality measures (Freeman, 1979), using the two well-known metrics of degree and betweenness centrality.
Degree centrality measures the number of direct links of a node, i.e. the number of people an employee interacted with in the online forum. Betweenness centrality, on the other hand, takes into account the indirect links of a node and counts how many times a social actor lies in between the paths that interconnect his/her peers. Betweenness centrality is calculated by considering the shortest network paths that interconnect every possible pair of nodes and counting how many times these paths include a specific employee (i.e. the node for which the betweenness centrality is calculated). Employees' interactivity was operationalized by calculating rotating leadership. This variable counts the oscillations in the betweenness centrality of a social actor, i.e. the number of times betweenness centrality changed, reaching local maxima or minima. If an employee maintains a static position, his/her rotating leadership is zero. On the other hand, we have rotating leaders when people oscillate between central and peripheral positions, activating or taking the lead of some conversations and then leaving space to other people in the network.

As control variables, we could access employees' gender and forum role (content manager or not). Even if gender homophily is not always supported by social network studies, it is very often used as a control variable, as it has been shown that gender can influence online social communication and behavior (Thelwall, 2008, 2009). Similarly, we control for the content manager role, as we expect different behaviors when employees have the assignment of informally moderating the intranet social network.

All the variables presented above were first calculated at the node level and subsequently transformed into similarity matrices. Like a network adjacency matrix, a similarity matrix is made of N rows and columns, where each row and column represents a specific employee. For categorical attributes (gender and being a content manager or not), we have a value of 1 in a cell of the matrix if the two corresponding employees share the same attribute (for example, they are both female), and 0 otherwise. For continuous variables, we populated the matrices with the absolute value of the differences in individual actor scores (lower values thus indicating greater similarity).

3. Results
In general, we notice a prevalence of male employees, even though most forum content managers are female (many of them working in the internal communication department, which is mostly staffed by women). Being a content manager is also associated with more central and dynamic network positions: content managers have on average higher scores of degree and betweenness centrality, and they rotate more. In other words, they interact with more people, often act as brokers of information and, in general, do not keep a static dominant position after having fostered a conversation. As described in the previous section, we measured similarity with respect to several characteristics of employees: their gender, content manager role, use of language, centrality and interactivity. Text similarity shows the strongest association with digital communication (ρ = 0.48): employees who more frequently use the same vocabulary communicate more with one another. Apart from gender and sentiment, homophily effects appear significant for all the other variables included in our study. Employees who are more similar with respect to their use of language, degree of interactivity and network position tend to interact more with one another.
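The operationalizations described above can be illustrated with a short sketch. It is a minimal illustration, not the code used in the study: the toy graph, the attribute values and the tf-idf treatment of word-use similarity (an assumption in line with the similarity and stemming works cited in the references: Huang, 2008; Jivani, 2011; Tata & Patel, 2007) are all invented for the example.

```python
# Minimal sketch of the node-level measures and similarity matrices
# described above (toy data; not the study's code or data).
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One node per forum user; an edge if two employees interacted at least once.
G = nx.Graph([("anna", "marco"), ("anna", "luca"), ("marco", "sara")])
degree = nx.degree_centrality(G)            # direct links of each node
betweenness = nx.betweenness_centrality(G)  # brokerage on shortest paths

def rotating_leadership(series):
    """Count oscillations (local maxima/minima) in a betweenness series."""
    return sum(
        (b > a and b > c) or (b < a and b < c)
        for a, b, c in zip(series, series[1:], series[2:])
    )

# Word-use similarity: tf-idf cosine over each employee's concatenated posts
# (an assumed operationalization).
posts = ["buongiorno a tutti", "buongiorno collega", "report mensile"]
text_sim = cosine_similarity(TfidfVectorizer().fit_transform(posts))

def categorical_similarity(values):
    """1 if two employees share the attribute (e.g. gender), 0 otherwise."""
    v = np.asarray(values)
    return (v[:, None] == v[None, :]).astype(int)

def continuous_difference(scores):
    """Absolute difference of actor scores; lower values mean more similar."""
    v = np.asarray(scores, dtype=float)
    return np.abs(v[:, None] - v[None, :])

gender_sim = categorical_similarity(["F", "M", "M"])
length_diff = continuous_difference([120.0, 95.5, 300.2])
```

Matrix-level associations such as the reported ρ = 0.48 are typically assessed with permutation procedures for dyadic data, in the spirit of the QAP approach of Krackhardt (1988) cited in the references.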
As per the agreed privacy arrangements, we are prohibited from revealing the company name or other details that could help in its identification. It might be useful to replicate our research to see if our findings are confirmed in different business contexts. Future studies could include more control variables, particularly those which are supposed to produce homophily effects – such as employees' age (Kossinets & Watts, 2009). Having more accurate timestamps could also help in the assessment of average response time, to see if more reactive users tend to cluster. As ours was mainly an association study, we advocate further longitudinal research, which could tell us which actor-similarity effects can be considered significant antecedents of digital communication.

Our findings have practical implications both for company managers and for administrators of online communities. For example, if a company wants to attract the attention of employees to a strategic topic, in the light of our results it appears vital to choose a language close to that of the target people. Employees' participation in conversations can be fostered by online messages aligned with the general use of language and by choosing social ambassadors who have network positions similar to the target.

References
Antonacci, G., Fronzetti Colladon, A., Stefanini, A., & Gloor, P. A. (2017). It is Rotating Leaders Who Build the Swarm: Social Network Determinants of Growth for Healthcare Virtual Communities of Practice. Journal of Knowledge Management, 21(5), 1218–1239. https://doi.org/10.1108/JKM-11-2016-0504
Basov, N., & Brennecke, J. (2018). Duality beyond Dyads: Multiplex patterning of social ties and cultural meanings. Research in the Sociology of Organizations, in press.
Borgatti, S. P., & Foster, P. C. (2003). The network paradigm in organizational research: A review and typology. Journal of Management, 29(6), 991–1013. https://doi.org/10.1016/S0149-2063(03)00087-4
Brönnimann, L. (2014). Analyse der Verbreitung von Innovationen in sozialen Netzwerken. Master's thesis, University of Applied Sciences Northwestern Switzerland. Retrieved from http://www.twitterpolitiker.ch/documents/Master_Thesis_Lucas_Broennimann.pdf
Brown, J., Broderick, A. J., & Lee, N. (2007). Word of mouth communication within online communities: Conceptualizing the online social network. Journal of Interactive Marketing, 21(3), 2–20. https://doi.org/10.1002/dir.20082
Burt, R. S. (1987). Social Contagion and Innovation: Cohesion versus Structural Equivalence. American Journal of Sociology, 92(6), 1287–1335. https://doi.org/10.1086/228667
Ellis, D. G. (1999). From Language to Communication. New York, NY: Routledge.
Freeman, L. C. (1979). Centrality in social networks: Conceptual clarification. Social Networks, 1, 215–239.
Fronzetti Colladon, A., & Scettri, G. (2017). Look Inside. Predicting Stock Prices by Analysing an Enterprise Intranet Social Network and Using Word Co-Occurrence Networks. International Journal of Entrepreneurship and Small Business, in press. https://doi.org/10.1504/IJESB.2019.10007839
Gloor, P. A. (2017). Sociometrics and Human Relationships: Analyzing Social Networks to Manage Brands, Predict Trends, and Improve Organizational Performance. London, UK: Emerald Publishing Limited.
Gloor, P. A., Fronzetti Colladon, A., Giacomelli, G., Saran, T., & Grippa, F. (2017). The Impact of Virtual Mirroring on Customer Satisfaction. Journal of Business Research, 75, 67–76. https://doi.org/10.1016/j.jbusres.2017.02.010
Huang, A. (2008). Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008) (pp. 49–56). Christchurch, New Zealand.
Jivani, A. G. (2011). A Comparative Study of Stemming Algorithms. International Journal of Computer Technology and Applications, 2(6), 1930–1938.
Kossinets, G., & Watts, D. J. (2009). Origins of Homophily in an Evolving Social Network. American Journal of Sociology, 115(2), 405–450. https://doi.org/10.1086/599247
Krackhardt, D. (1988). Predicting with networks: Nonparametric multiple regression analysis of dyadic data. Social Networks, 10(4), 359–381.
Lawrence, T. B., & Shah, N. (2007). Homophily: Meaning and Measures. Paper presented at the International Network for Social Network Analysis (INSNA) conference, Corfu, Greece.
Lazarsfeld, P. F., & Merton, R. K. (1954). Friendship as a Social Process: A Substantive and Methodological Analysis. Freedom and Control in Modern Society, 18, 18–66.
McPherson, M., Smith-Lovin, L., & Cook, J. M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1), 415–444. https://doi.org/10.1146/annurev.soc.27.1.415
Montoya, R. M., & Horton, R. S. (2013). A meta-analytic investigation of the processes underlying the similarity-attraction effect. Journal of Social and Personal Relationships, 30(1), 64–94. https://doi.org/10.1177/0265407512452989
Nerghes, A., Lee, J.-S., Groenewegen, P., & Hellsten, I. (2015). Mapping discursive dynamics of the financial crisis: A structural perspective of concept roles in semantic networks. Computational Social Networks, 2(16), 1–29. https://doi.org/10.1186/s40649-015-0021-8
Perkins, J. (2014). Python 3 Text Processing with NLTK 3 Cookbook. Birmingham, UK: Packt Publishing.
Roth, C., & Cointet, J. P. (2010). Social and semantic coevolution in knowledge networks. Social Networks, 32(1), 16–29. https://doi.org/10.1016/j.socnet.2009.04.005
Roy, M., Schmid, S., & Tredan, G. (2014). Modeling and measuring graph similarity: The case for centrality distance. In Proceedings of the 10th ACM International Workshop on Foundations of Mobile Computing, FOMC 2014 (pp. 47–52). New York, NY: ACM. https://doi.org/10.1145/2634274.2634277
Saint-Charles, J., & Mongeau, P. (2018). Social influence and discourse similarity networks in workgroups. Social Networks, 52, 228–237. https://doi.org/10.1016/j.socnet.2017.09.001
Shalizi, C. R., & Thomas, A. C. (2011). Homophily and contagion are generically confounded in observational social network studies. Sociological Methods and Research, 40(2), 211–239. https://doi.org/10.1177/0049124111404820
Tata, S., & Patel, J. M. (2007). Estimating the selectivity of tf-idf based cosine similarity predicates. ACM SIGMOD Record, 36(2), 7–12. https://doi.org/10.1145/1328854.1328855
Thelwall, M. (2008). Social networks, gender, and friending: An analysis of MySpace member profiles. Journal of the American Society for Information Science and Technology, 59(8), 1321–1330. https://doi.org/10.1002/asi.20835
Thelwall, M. (2009). Homophily in MySpace. Journal of the American Society for Information Science and Technology, 60(2), 219–231. https://doi.org/10.1002/asi.20978
Tietze, S., Cohen, L., & Musson, G. (2003). Understanding Organizations through Language. London, UK: SAGE. https://doi.org/10.4135/9781446219997
Tucker, M. L., Meyer, G. D., & Westerman, J. W. (1996). Organizational communication: Development of internal strategic competitive advantage. Journal of Business Communication, 33(1), 51–69. https://doi.org/10.1177/002194369603300106
White, H. C. (2011). Identité et contrôle. Une théorie de l'émergence des formations sociales. Paris: Éditions de l'École des hautes études en sciences sociales.
Yuan, Y. C., & Gay, G. (2006). Homophily of network ties and bonding and bridging social capital in computer-mediated distributed teams. Journal of Computer-Mediated Communication, 11(4), 1062–1084. https://doi.org/10.1111/j.1083-6101.2006.00308.x

Looking Through the Lens of Social Sciences: The European Union in the EU-Funded Research Projects Reporting

Matteo Gerli
University for Foreigners of Perugia – matteogerli81@gmail.com

Abstract
In the last decades, European integration and scientific production have come to be deeply intertwined as a result of the Europeanization of many research activities. On one side, European institutions promote the realization of research projects aimed at developing a type of knowledge "close" to the end users' interests; on the other side, the resulting knowledge contributes to conditioning the practices that take place in the European and national institutions, according to a circular process that brings the innovations to feed back into the system that expresses them. The purpose of this paper is to explore this relationship by examining two peculiar scientific products realized by researchers operating within the broad domain of the Socio-economic Sciences and Humanities (SSH) as part of the research projects financed by the Seventh Framework Programme (2007-2013) of the European Union: final reports and policy briefs. In other words, it aims to analyse all reports as a whole using automatic text analysis tools, while incorporating some supplementary variables which help to define the broader context of scientific production.

Keywords: European Union, International Research Projects, Socio-economic Sciences and Humanities, Textual Data Exploration, Quantitative Discourse Analysis, IRaMuTeQ.

1. Introduction
The European Research Policy plays a strategic role for thousands of researchers and research institutions operating within the EU borders. Owing to the concomitant decrease in national public funds for scientific activities (see, for instance, Vincent-Lancrin, 2006; 2009), the European research agenda has dramatically increased its appeal among scholars and, consequently, its ability to have an impact on the directions and processes of scientific knowledge production. Indeed, starting in the 1990s, the European Commission has equipped itself with new means to combine and manage, on the basis of medium- to long-term planning cycles, the whole set of scientific and technological initiatives financed by the European budget: the framework programme (Ippolito, 1989; Ruberti and André, 1995; Guzzetti, 1995; Menéndez and Borrás, 2000; Borrás, 2000; Banchoff, 2002; Cerroni and Giuffredi, 2015). In short, the underlying logic is that of the programmatic intersection between research activities and other European policies, so that the promotion of scientific excellence complements the need to foster the creation of cross-border and interdisciplinary collaborations intended to produce a type of knowledge "close" to the end users' interests.
As observed in previous studies (Adler-Nissen and Kropp, 2015), European integration and scientific production have come to be deeply intertwined: on one side, the progress of the integration process influenced (and still influences) research activities through the promotion of particular forms of knowledge and research questions (as far as we are concerned, mainly through the realization of cross-national and cross-disciplinary research projects); on the other side, the resulting knowledge contributes to conditioning the practices that take place in the European and national institutions, according to a circular process that brings the innovations to feed back into the system that expresses them. The Social Sciences and Humanities, which are less directly involved in the production of knowledge with a clear practical usability, are by no means unconcerned by this kind of phenomenon. In this regard, the Journal of European Integration has recently published a special issue on the relationship between the social sciences and European integration, hosting some important articles that have highlighted the existence of several "crossroads" between the European Union's and the scientific community's "itineraries"1. Rosamond (2015), for instance, observed how certain theories of political and economic integration (in particular that of the Hungarian economist Béla Balassa, on the economics side, and neofunctionalism, on the political science side) had informed the "strategic narrative" adopted by the European Commission during the 60s and 70s to legitimize its newly formed institutional role and its economic policy position, in a rather peculiar two-way traffic of influence, with economic integration being theorized while it was happening. Deem (2015) pointed out the existence of a relationship between the birth of a new field of higher education studies, the simultaneous evolution of national university systems and the launch of the so-called Bologna process at the European level. Vauchez analysed, through a sociogenetic approach, the historical process through which the acquis communautaire «has been formulated, stretched, criticized, revised and finally naturalized as the most rigorous and objective measure of Europe against other possible methods» (2015: 196), thanks to the work of those who have been defined as "methodological entrepreneurs", that is, European officials who politically invested in and succeeded in establishing Europe's cognitive and technical equipment.

1 Journal of European Integration, 37 (2015).

Looking beyond such individual cases, what is really relevant to our purpose is the underlying idea of the possibility of studying science production from a sociological point of view, basically by rejecting what was traditionally regarded as an internal/external division (Adler-Nissen and Kropp, 2015: 161-163), and thus admitting that even scientific and academic concepts can be formulated in conjunction with political-economic ambitions and practical problems (see Böhme et al., 1983; Funtowicz and Ravetz, 1993; Slaughter and Leslie, 1997; Gibbons et al., 1994; Ziman, 2000; Albert and McGuire, 2014), such as those mentioned above.
This does not mean that science is equal to politics or economics (Breslau, 1998); what it does mean is that, in order to understand science production, one needs to recognize that "non-academic" resources (such as, for instance, financial or material resources, ideas and beliefs, symbolic resources, political or normative resources, people, etc.) may overstep scientific boundaries and be used for the production of new knowledge. Bourdieu (1975, 1984, 1990, 1992, 1994, 1995, 2001) described this phenomenon through the concept of "field interrelations". In a few words, the social world is composed of multiple semi-autonomous fields: basically, microcosms characterized by different stakes, rules of the game and particular resources which one needs to possess to gain access to the game itself and its specific advantages. He conceptualized these spheres as partially independent, by which he meant that, even though each field develops its own institutions, hierarchies, problems, and tacit or explicit rules, they necessarily interact and affect each other. This is particularly true for cultural fields (art, cinema, religion, science, journalism, etc.), since they are structurally dependent on and subordinated to the political and economic fields. Going straight to the point, this is to say that, when dealing with a sociological analysis of a cultural product (e.g. a text), one can neither consider just its formal characteristics nor limit oneself to its context of production. Instead, one should use a "relational approach", taking into account both the internal features of the product and its external determinants.

In engaging with this broad issue, this paper will try to further contribute to the understanding of the topic by examining two peculiar scientific products realized by researchers operating within the broad domain of the Socio-economic Sciences and Humanities (SSH) as part of the research projects financed by the Seventh Framework Programme (2007-2013) of the European Union: final reports and policy briefs. By using automatic text analysis tools, it will statistically explore the contents of such documents not per se, but in connection with some variables which help to define the broader context of production. Given its exploratory character, this study does not have strong hypotheses to be tested. Nevertheless, following Bourdieu's approach, it aims to offer an original perspective through which to observe the relationship between the field of the social sciences and the public policy field of the European Union (Gerli, 2017).

2. The corpus and methodology
Unlike the studies discussed earlier, which are mainly based on micro-sociological observation, our investigation consists of a macro-sociological analysis of a quite large corpus of 46,513 graphic forms, amounting to 3,025,960 occurrences. It is an ad-hoc constructed corpus: it contains 360 texts, of which 205 are final reports and 155 policy briefs, collected from the digital database CORDIS2, the main institutional source of information related to the research projects financed by the European Union. The choice to focus on these documents is not accidental, but depends on their strict relevance to our research objectives.
In fact, both include a summary of the project results and conclusions, with a description of their potential socio-economic impact (EC 2010), even though the policy brief is specifically designed for policy makers (both European and national), while the final report is addressed to a wider audience, which may (at least potentially) include lay people as well. In this perspective, they represent an effective "shortcut" through which to empirically observe the way in which the research groups awarded a grant "actualized" the inputs they received from the Commission. That is, to return to the previous discussion, to analyse how European institutions and social scientists contribute together to the definition and resolution of some EU-related issues.

With regard to the methodology, both simple and multivariate analyses were performed with the IRaMuTeQ software (Lebart et al., 1998; Bolasco, 2013). In particular, the lexicographical analysis was used for a first exploration of the corpus, that is, to identify and format text units, turn texts into text segments (TS) and classify words by their frequency. The multivariate analysis, instead, was performed to detect the associations between textual data and the following supplementary variables, related to what in the 7FP was defined as macro-activity (MA) and financing scheme (FS)3. Going into more detail, the 7FP included eight macro-activities: Growth, employment and competitiveness in a knowledge society (MA1); Combining economic, social and environmental goals in Europe: towards sustainable development (MA2); Major trends in society and their implications (MA3); Europe in the world (MA4); The citizen in the European Union (MA5); Socio-economic and scientific indicators (MA6); Foresight studies (MA7); Strategic activities (MA8). As for the financing schemes, the 7FP included five main types, which differed from each other in research team size and in the type of purposes to be achieved (the first three mainly focused on the development of new knowledge, while the last two were mainly intended for the coordination and support of research activities and policies): Small or medium-scale focused project (FS1); Small or medium-scale focused research project aimed at international cooperation (FS2); Large-scale integrating project (FS3); Coordination action (FS4); Support action (FS5). Additionally, we also took into account the starting year of the project and the geographic area in which the coordinating institution was located. As a whole, our sample (of non-probabilistic type) involves 223 research projects out of the 251 realized in 2007-2013 (equal to 88.8%) and broadly covers all the macro-activities and financing schemes mentioned above.

2 http://cordis.europa.eu/projects/home_it.html.
3 For more details: Decision No 1982/2006/EC of the European Parliament and of the Council of 18 December 2006.

In Tab. 1, a description of the corpus and its main subsets is provided.

Tab. 1: Description of the corpus

Type           Number of texts   Graphic forms   Occurrences
Final report   205               42,047          2,441,168
Policy brief   155               19,795          584,792
Corpus         360               46,513          3,025,960

3. The main findings
At first glance, the most frequent "full" words used in the SSH research reports do not provide particularly relevant insights. The first ten (social, policy, research, European, project, EU, countries, public, national, Europe) concern the "general context of meaning" where discourses on Europe and related issues took shape.
These are ten words that, without having a clear disciplinary connotation, define some "semantic coordinates" common to all the research projects carried out. Interestingly enough, the wide use of the words country/countries (freq. = 10,531) and national (freq. = 5,527), compared with the words European (freq. = 9,190), EU (freq. = 8,563) and Europe (freq. = 5,408), attests to the great importance of the "national" level of analysis, mainly in a comparative perspective. Scrolling down the list, we can also recognise some typical words of the socio-economic lexicon (economic, market, growth, employment, financial), of the socio-political lexicon (people, education, State, young, groups, cultural, society, governance), and of the methodological one, namely related to the operative context of the research activities (data, case, results, impact, analysis, study). Yet these are terms that, at this early stage of the analysis, do not provide any clear "message".

On a closer look, however, we can identify some specific words which are, in a broad sense, linked to the political macro-orientations defined by the Lisbon Strategy (European Council 2000), demonstrating the "osmosis" existing between European institutions and the social sciences. Here are some examples: innovation (freq. = 5,793), a cornerstone of industrial competitiveness and economic growth (EC 2003, 2006); development (freq. = 5,176), to be understood, among its various meanings, mainly as sustainable development (EC 2005, 2009); education (freq. = 3,490) and knowledge (freq. = 3,221), which, together with the already mentioned innovation, represent the "three sides" of the so-called "knowledge triangle" – from the European Commission's perspective, the ground for greater economic and social dynamism.

For the aims of this study, the geographical scope of the research activities is also of particular interest. Indeed, the most frequent toponyms refer to EU-based countries. Among these, the five main sponsors and recipients of the framework programmes (Germany, UK, France, Italy and Spain) are placed at the top of the ranking. As for the extra-European countries, several of them are located in Asia (e.g. China, Japan, India, Vietnam and Thailand), North Africa (Morocco, Tunisia, Egypt and Libya) and South America (Brazil, Argentina, Colombia, Peru and Chile). This is indicative of a globalization process which is affecting both European institutions and researchers, expanding their interests ("political" for the former, "scientific" for the latter) beyond the European borders. What matters is that they are moving together, insofar as we can suppose the existence of a clear synergy between the emergence of a new multipolar area of political, commercial and cultural influence, in which the European Union is now required to act, and the production of knowledge on topics with a potential "global" added value.

3.1 The main semantic groups and their connections with the "context"
To go deeper into the analysis, and to explore the relationship between the selected texts and some variables related to their context of production, we performed a Descending Hierarchical Analysis (DHA). Indeed, this method allowed us, first, to identify clusters with similar vocabulary within text segments and, then, to visualize them in conjunction with the supplementary variables (Camargo and Justo 2013; Curbelo, 2017). In Fig. 1, the output of the DHA is summarised.
Fig. 1: Dendrogram of the top-down hierarchical classification (Reinert method) of the corpus

As can easily be seen in Fig. 1, the DHA algorithm allowed the identification of five clusters, each with its own specific semantic content. Following Reinert (1987), they can be interpreted as "lexical worlds", namely specific semantic structures which, in our case, refer to different and even competing scientific representations of the European Union and related issues. The second cluster has the largest representation (26.8% of the SSH discourse) and identifies a semantic sphere characterized by a language mainly oriented towards political and social issues. Indeed, the most central word in this cluster is political, followed by cultural, identity, citizenship, border, conflict, citizen, State and so on. Immigration (migrant) and related issues appear to be particularly relevant as well. The fifth cluster (24.1%) delineates a quite peculiar semantic sphere based on a set of words (such as project, conference, research, university, workshop, dissemination, website, etc.) strictly linked with the management and realization of European research projects and, more generally, with scientific research and related activities. The first cluster, third in terms of representativeness (19%), refers to the relationship between economic development and environmental protection, with the most central word being innovation, followed by development, economic, sustainable, environmental, change, rural and so on. This interpretation seems to be supported by the presence of several words that refer to the need for a change with respect to a situation perceived as not desirable (change, impact, strategy, challenge, need, solution, improve, step, etc.). The third cluster (16.2%), instead, covers a semantic area mainly related to the economy and the market. It is a language that involves two main branches, that of the real economy (income, price, household, wage, firm, energy, poverty, etc.) and that of finance (financial, bank, risk, monetary, credit), but above all it is characterized by the large presence of technical terms and acronyms (gdp, estimate, asset, inflation, emu, Eurozone, insurance, macroeconomic, etc.). Finally, the fourth linguistic cluster (13.9%) includes words essentially associated with the relationship between education, training and employment, as shown by the presence of terms such as young, person, child, school, education, aspiration, background, vocational and compulsory. It is a cluster that differs from the others in the greater concreteness of its language, as shown by the recurring use of words referring to "concrete" social actors (child, parent, student, teacher, mother, friend, volunteer, etc.).

Fig. 2, resulting from a Lexical Correspondence Analysis (LCA), shows the relationship between clusters (left side) and between clusters and the supplementary variables (right side). The main aim here was to verify whether or not SSH discourse exhibits clear evidence of "adaptability" with regard to the macro-activities and the financing schemes, as defined by the European Commission.
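The cluster/category associations reported in Figs. 3 and 4 below rest on chi-square statistics computed on segment counts; the sketch below shows, as a purely hypothetical illustration with toy counts (not the study's data), a global test on a cluster-by-category table, while the per-category values in the figures come from analogous contingency comparisons.

```python
# Hypothetical chi-square test of cluster/macro-activity association.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: DHA clusters 1-5; columns: example macro-activities MA1, MA2, MA5.
segments = np.array([
    [120, 410,  60],
    [ 90, 100, 520],
    [380, 350,  70],
    [ 60,  80,  90],
    [110,  90,  85],
])
chi2, p, dof, _ = chi2_contingency(segments)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4g}")
```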
Fig. 2: Association between clusters and supplementary variables

The first two factors together summarize 67.5% of the total inertia: the first one (39.97%) marks a clear opposition between cluster 5 (positive half-plane) and the other four clusters (negative half-plane); the second factor (27.47%), instead, highlights a significant opposition between clusters 1 and 3 (positive half-plane) and clusters 2 and 4 (negative half-plane). As a whole, we can distinguish three different (partially autonomous) semantic contexts, arising from the association between the "cultural" and "socio-political" discourses (third quadrant), between the "economic" discourse and that on "innovation" and "sustainable development" (fourth quadrant), and finally the discourse on "research activities" (between the first and second quadrants).

As far as the relationship between discourses (clusters) and supplementary variables is concerned, Figs. 3 and 4 show the most significant categories (those with a larger chi-square and a lower p-value) for the "macro-activity" and "financing scheme" variables. As shown in the first figure, the MA1 and MA2 categories are only significant in the definition of clusters 1 (innovation) and 3 (economics); MA5 is the most relevant for cluster 2 (politics); similarly, the MA3 category is the only significant one for cluster 4 (culture); and finally, the MA4 and MA8 categories predominate in cluster 5 (research activities). In short, these results strongly support the thesis of adaptability, insofar as the different scientific representations of the European Union that emerged from the analysis proved strongly associated with the macro-activities defined by the European Commission.

Fig. 3: Chi2 significance of variable "macro-activity" by cluster

Cluster   Category   Chi2     %      p-value
1         MA2        1226.7   25.7   <0.0001
1         MA7        762.9    36.5   <0.0001
2         MA5        5220.0   54.8   <0.0001
3         MA1        1282.4   28.9   <0.0001
3         MA2        1414.2   27.0   <0.0001
4         MA3        5238.5   33.0   <0.0001
5         MA4        839.9    33.6   <0.0001
5         MA8        534.9    43.7   <0.0001

On the other hand, the role of the "financing scheme" variable proved much less significant in discriminating the five clusters, except for categories FS4 and FS5, which are the most significant for cluster 5, and category FS1, which instead clearly prevails in cluster 4. Nothing relevant emerged in relation to the variables "geographic area" and "starting year".

Fig. 4: Chi2 significance of variable "financing scheme" by cluster

Cluster   Category   Chi2     %      p-value
1         FS2        186.3    25.7   <0.0001
1         FS3        145.1    24.7   <0.0001
2         FS1        487.6    29.0   <0.0001
3         FS1        286.5    17.6   <0.0001
4         FS1        1245.0   16.7   <0.0001
5         FS4        2195.0   51.5   <0.0001
5         FS5        1583.2   58.5   <0.0001

4. Conclusions
The findings presented herein indicate a close relationship between the programmatic framework defined by the Commission and the contents of the final reports and policy briefs, supporting the thesis of a co-construction of European integration (Adler-Nissen and Kropp, 2015). The scientific discourse has come to be structured around a few semantic macro-aggregates arising from the DHA, which in turn proved to be associated with the supplementary variables examined in the LCA. Furthermore, the SSH linguistic space shows a clear cleavage between the economic discourse and the cultural discourse, which points to a lack of interaction between these two spheres. From a more "general" point of view, all this means that, in connecting the social sciences field with the policy field, the European research projects produced a scientific discourse that, on the whole, is structurally homologous with the "space of possibilities" inherent to the 7FP.
References
Adler-Nissen R., Kropp K. (2015). A Sociology of Knowledge Approach to European Integration: Four Analytical Principles. Journal of European Integration, 37(2): 155-173.
Albert M., McGuire W. L. (2014). Understanding Changes in Academic Knowledge Production in a Neoliberal Era. Political Power and Social Theory, 27: 33-57.
Banchoff T. (2002). The Politics of the European Research Area. ACES Working Paper 3, Paul H. Nitze School of Advanced International Studies.
Böhme G., Van den Daele W., Hohlfeld R., Krohn W., Schäfer W. (1983). Finalization in Science. The Social Orientation of Scientific Progress. Dordrecht: Reidel.
Bolasco S. (2013). L'analisi automatica dei testi. Fare ricerca con il text mining. Roma: Carocci.
Borrás S. (2000). Science, Technology and Innovation in European Politics. Research Paper n. 5, Roskilde University.
Bourdieu P. (1975). The Specificity of the Scientific Field and the Social Conditions of the Progress of Reason. Social Science Information, 6: 19-47.
Bourdieu P. (1984). Homo academicus, trad. it. (2013) Homo academicus. Bari: Edizioni Dedalo.
Bourdieu P. (1992). Les règles de l'art, trad. it. (2013) Le regole dell'arte. Milano: Il Saggiatore.
Bourdieu P. (1994). Raisons pratiques. Sur la théorie de l'action, trad. it. (2009) Ragioni pratiche. Bologna: Il Mulino.
Bourdieu P. (1995). Champ politique, champ des sciences sociales, champ journalistique, trad. it. (2010) Campo politico, campo delle scienze sociali, campo giornalistico. In Cerulo M. (a cura di), Sul concetto di campo in sociologia. Roma: Armando.
Bourdieu P. (2001). Science de la science et réflexivité, trad. it. (2003) Il mestiere di scienziato. Milano: Mondolibri.
Breslau D. (1998). In Search of the Unequivocal: The Political Economy of Measurement in U.S. Labor Market Policy. London: Praeger.
Camargo B. V., Justo A. M. (2013). R Interface for Multidimensional Analysis of Texts and Questionnaires, IRaMuTeQ tutorial, available at: http://www.iramuteq.org.
Cerroni A., Giuffredi R. (2015). L'orizzonte di Horizon 2020: il futuro europeo nelle politiche della ricerca. Futuri, 6: 29-39.
Curbelo A. A. (2017). Analysing the (Ab)use of Language in Politics: the Case of Donald Trump. Working Paper n. 2. University of Bristol: SPAIS.
Deem R. (2015). What is the Nature of the Relationship between Changes in European Higher Education and Social Science Research on Higher Education and (Why) Does It Matter? Journal of European Integration, 37(2): 263-279.
European Commission (2010). Communicating research for evidence-based policymaking. Bruxelles: Directorate-General for Research.
European Commission (2003). Politica dell'innovazione: aggiornare l'approccio dell'Unione Europea nel contesto della Strategia di Lisbona. COM(2003) 112 definitivo, 11.03.2003.
European Commission (2005). Comunicazione della Commissione al Consiglio e al Parlamento europeo sul riesame della strategia per lo sviluppo sostenibile. Una piattaforma d'azione. COM(2005) 658 definitivo, 13.12.2005.
European Commission (2006). Mettere in pratica la conoscenza: un'ampia strategia per l'innovazione per l'UE. COM(2006) 502 definitivo, 10.05.2006.
European Commission (2009). Integrare lo sviluppo sostenibile nelle politiche dell'UE: riesame 2009 della strategia dell'Unione Europea per lo sviluppo sostenibile. COM(2009) 400 definitivo, 24.07.2009.
Funtowicz S., Ravetz J. (1993). Science for the Post-Normal Age. Futures, 25: 735-755.
Gerli M. (2017). Il campo sociale dei progetti di ricerca europei. Il caso delle SSH.
Studi Culturali, 1: 127-150.
Gibbons M., Limoges C., Nowotny H., Schwartzman S., Scott P. and Trow M. (1994). The New Production of Knowledge. London: Sage.
Guzzetti L. (1995). A Brief History of European Union Research Policy. Luxembourg: Publications Office of the European Communities.
Ippolito F. (1989). Un progetto incompiuto. La ricerca comune europea: 1958-88. Bari: Edizioni Dedalo.
Lebart L., Salem A., Berry L. (1998). Exploring Textual Data. New York: Kluwer Academic.
Menéndez L. S., Borrás S. (2000). Explaining Changes and Continuity in EU Technology Policy: The Politics of Ideas. In Dresner S. and Gilbert N. (eds), Changing European Research System. Aldershot: Ashgate.
Reinert M. (1987). Classification descendante hiérarchique et analyse lexicale par contexte: application au corpus des poésies d'Arthur Rimbaud. Bulletin de Méthodologie Sociologique, 13: 53-90.
Rosamond B. (2015). Performing Theory/Theorizing Performance in Emergent Supranational Governance: The Live Knowledge Archive of European Integration and the Early European Commission. Journal of European Integration, 37(2): 175-191.
Ruberti A., André G. (1995). Uno spazio europeo della scienza. Riflessioni sulla politica europea della ricerca. Firenze: Giunti.
Slaughter S., Leslie L. L. (1997). Academic Capitalism: Politics, Policies and the Entrepreneurial University. Baltimore: The Johns Hopkins University Press.
Vauchez A. (2015). Methodological Europeanism at the Cradle: Eur-lex, the Acquis and the Making of Europe's Cognitive Equipment. Journal of European Integration, 37(2): 193-210.
Vincent-Lancrin S. (2006). What is Changing in Academic Research? Trends and Future Scenarios. European Journal of Education, 41(2): 169-202.
Vincent-Lancrin S. (2009). Finance and Provision in Higher Education: A Shift from Public to Private? Higher Education to 2030 (vol. 2), Centre for Educational Research and Innovation: OECD.
Ziman J. (2000). Real Science: What It Is, and What It Means. Cambridge-New York: Cambridge University Press.

Generic and Discursive Specialization of a Lexical Unit: The Example of joggeuse in the French Regional Daily Press

Lucie Gianola (1), Mathieu Valette (2)
(1) Université de Cergy-Pontoise – lucie.gianola@u-cergy.fr
(2) Institut National des Langues et Civilisations Orientales – mvalette@inalco.fr

Abstract
In this paper, we study the distribution of lexical items designating outdoor sport practitioners (joggeur/joggeuse, randonneur/randonneuse, runneur/runneuse, promeneur/promeneuse) in order to identify links between gender, semantic themes and genre in French press discourse. The corpus is sampled from articles published in regional daily newspapers. In press discourse, we observe a convergence between gender and genre through the actualized semantic classes.

Keywords: Press discourse, textometrics, semantic class, genre, gender

1. Introduction
We propose a study in textual lexicology on the distribution of selected lexical units in a corpus of press texts. The study was not carried out from a corpus-driven perspective, as is often the case in textometrics, but with a corpus-based approach (Biber, 2009) in which the observables were defined in advance. Our objective is indeed to focus on the designations of practitioners of outdoor sports along an opposition of gender: joggeur vs joggeuse, randonneur vs randonneuse, runneur vs runneuse, promeneur vs promeneuse. The aim is to identify the correlations between gender, isotopies and textual genres in the journalistic discourse of the French regional daily press.

2. Research Question
2.1. Summation of genre and discourse isotopies into signifieds
Textual lexicology consists in analysing the lexicon on the basis of the textual conditions of its production. It rests on the hypothesis that lexical units are subject to a set of intertextual and infratextual constraints of the same nature as diffuse, non-lexicalized semantic forms, constraints which condition their regimes of production and interpretation. Previous work has set out the theoretical conditions for a textual analysis of the lexicon, focused mainly on the study of semantic neology – or neosemy (Rastier and Valette, 2009) – and of diffuse semantic forms undergoing synthetic lexicalization, or protosemy (Valette, 2010a, 2010b). Here, we study the systematic use of a given lexical unit in a specific textual genre and the impact of this use on its semantics. Indeed, any word placed in a text receives semantic determinations from it, which are liable to modify its signified (afference of semes). Under the hypothesis that the signified is a lexicalized semantic form (Valette, 2010b), we will consider that the semes of a text's isotopies can propagate to the signified of a lexical unit through the summation process described by Rastier (2006). This has been observed for domain isotopies (for example, the re-domanialization of a lexical unit in cases of neosemy), but can generic isotopies (relating to textual genre) or discursive isotopies (relating to discourse) transform the signified of a word in the same way as domain isotopies do? This is the question we shall try to answer here.

2.2. The corpus
The corpus was built along two axes, lexical and discursive: we used 8 forms as keywords to collect texts exclusively from journalistic discourse and, more precisely, from the regional daily press, without regard to textual genre. The corpus was collected semi-automatically with a web-page retrieval script, then cleaned and de-duplicated manually, in order to discard articles consisting of AFP wire copy reproduced from one newspaper to another. The script, based on the Linux cURL command, is fed a list of URLs collected from the newspapers' websites through queries on the Google search engine (site:sitename form, supplemented by the negative term -blade in the case of "runner" to filter out articles about the film Blade Runner). Between 100 and 130 URLs were collected for each form.
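As an indication of what this retrieval step looks like, here is a rough Python transposition of the cURL-based download (illustrative only; the URL-list file name is hypothetical, and the actual script was a shell one):

```python
# Approximate Python equivalent of the cURL retrieval step: download each
# collected URL and save the raw HTML for later manual cleaning.
import pathlib
import requests

urls = pathlib.Path("urls_joggeuse.txt").read_text().split()  # hypothetical URL list
out = pathlib.Path("raw/joggeuse")
out.mkdir(parents=True, exist_ok=True)
for i, url in enumerate(urls):
    try:
        html = requests.get(url, timeout=10).text
        (out / f"{i:04d}.html").write_text(html, encoding="utf-8")
    except requests.RequestException as err:
        print(f"skipped {url}: {err}")  # dead links are simply skipped
```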
The cleaning phase removed page headers, tables of contents, side links, image captions, etc., keeping only the title and body of each article. The corpus is organized into eight sub-corpora corresponding to the 8 forms under study – Joggeur, Joggeuse, Promeneur, Promeneuse, Randonneur, Randonneuse, Runner, Runneuse – whose statistics are presented in the following table.

Table 1: Number of words per sub-corpus

Sub-corpus    Number of words
Joggeur       40,671
Joggeuse      48,285
Randonneur    35,162
Randonneuse   31,931
Promeneur     44,497
Promeneuse    31,009
Runner        22,212
Runneuse      31,367
Total         285,134

The articles come mainly from regional daily newspapers such as Nice Matin, Ouest-France, L'Est Républicain, La Dépêche du Midi, La Montagne, Corse-Matin and La Provence. Collection was not restricted to any particular section but covered each newspaper as a whole, and no time limit was set.

3. Analyses1
3.1. General observations
A preliminary factorial analysis (Figure 1) of the parts of speech alone shows a marked opposition on axis 1 between the Runner and Runneuse sub-corpora and the other sub-corpora. This gap can be explained by the textual genres of the sub-corpora under consideration. Indeed, as shown by the pioneering work of Biber (1988) and, in its wake, that of Malrieu and Rastier (2001), the local variables constituted by parts of speech are particularly stable genre markers. Here, it appears that Runner and Runneuse belong to the genre of sports event reports, while the other 6 sub-corpora consist overwhelmingly of faits divers (news-in-brief stories). In other words, most of the lexical units chosen for our queries, which correspond to outdoor sporting practices, do not belong – or only marginally – to the vocabulary of the sports genres of journalistic discourse.

1 The corpus was analysed with the textometry software TXM (http://textometrie.ens-lyon.fr/) (Heiden et al. 2010).

The factorial correspondence analysis on forms with a frequency of at least 10 occurrences shows a very different distribution. Runner and Runneuse are still very close, but the same now holds for Randonneur and Randonneuse (henceforth Randonneur·se) (Figure 2). The Joggeur, Promeneur and Promeneuse sub-corpora lie at the intersection of the axes and will be examined individually, while Joggeuse stands apart.

3.2. Analysis of the constituent semantic classes
The analysis of the specificities (forms) of the groupings thus constituted indicates the contexts in which the different forms are instantiated. The a priori very homogeneous Randonneur·se grouping displays a vocabulary associated with mountain accidents. This corpus is structured into 3 main semantic classes:
- accidents: « chute », « mortelle », « mètre », « avalanche », « fracture », « cheville », « hôpital », « blessée », « trauma », « glisser », etc.;
- disappearances: « disparu », « alerte », « retrouvé », « emporté », « inquiet », etc.;
- rescue: « PGHM » (Peloton de gendarmerie de haute montagne), « hélicoptère », « Dragon » (a helicopter model), « évacué·e », « pompiers », « CRS », « secouriste », « secteur », « équipe », « sauveteur », « secourir », etc.
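The factorial analyses in section 3.1 were produced with TXM; for readers who want to see the mechanics, the following stand-alone sketch (invented counts, assumed implementation) reproduces the standard computation of row coordinates in a correspondence analysis:

```python
# Correspondence analysis via SVD of standardized residuals (toy data).
import numpy as np

# Toy contingency table: sub-corpora (rows) x parts of speech (columns).
counts = np.array([
    [1200,  800, 400, 300],   # Runner
    [1150,  820, 390, 310],   # Runneuse
    [ 900, 1100, 600, 250],   # Joggeur
    [ 880, 1150, 620, 240],   # Joggeuse
], dtype=float)

P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)                 # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, sig, _ = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sig) / np.sqrt(r)[:, None]        # principal coordinates
print(row_coords[:, :2])  # sub-corpus positions on factorial axes 1 and 2
```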
The Promeneur and Promeneuse sub-corpora essentially relate 3 types of events:
- walks themselves: « sentier », « phare », « littoral », « patrimoine », « chemin », etc.;
- accidents, essentially hunting accidents: « chasseurs », « chasse », etc.;
- discoveries: « macabres », « corps », « cadavre », « tronc », « jambe », « squelette », « ossement », « obus », « pépite », etc.

The Joggeur sub-corpus contains only one main semantic class, that of accidents involving no human third party: « arrêt », « malaise », « crise cardiaque », « algues vertes », animal attacks (« rapace », « aigle », « buse »), « sulfure d'hydrogène », « H2S », « intoxication », « toxique », « gaz ». It should be noted that this class is not actualized in the Joggeuse sub-corpus.

The two remaining sub-corpora – the very homogeneous grouping of Runner and Runneuse (henceforth Runneur·se) and Joggeuse – deserve our full attention. From an ontological point of view, jogging and running are similar forms of running belonging to the domain of sport. But their uses in journalistic discourse differ quite markedly. In the Runneur·se grouping, which, as we have seen, essentially contains articles reporting sports events, the vocabulary is structured around the following semantic classes:
- a definitional class: the hypernym « sport », the synonym « coureur », etc. The Runneur·se sub-corpus is thus the only one whose sense corresponds to the signification;
- competition: « course », « marathon », « semi-marathon », « trail », « triathlon », « championnat », « inscription », « départ », « épreuve », « km », « victoire », « podium », « médaille », « sponsors », etc.;
- injuries: « blessure », « foulure », « ampoule », « contracture », etc.

It also contains two semantic classes linked to the techniques associated with the practice:
- equipment: « équipement », « baskets », « chaussures », « brassière », « connectés », « GPS » or « montre GPS », etc.;
- training: « entrainement », « préparation », « fractionné », « cardio », « conseils », « performances », « yoga » (as a complementary activity meant to prevent injuries), etc.

It should be noted that the Runneuse sub-corpus stands out through its mention of charity sports events linked to the fight against breast cancer: « octobre rose », « prévention ». Conversely, the joggeuse of the eponymous sub-corpus is by no means an athlete, yet her textual characterization is remarkably precise: she is a woman attacked during her jog, and the semantic classes actualized in this sub-corpus belong to crime, law and judicial investigation:
- attacks: « meurtre », « tentative », « agressée », « agression sexuelle », « viol », « enlèvement », « tuée »;
- attackers: « homme », « suspect », « meurtrier », « présumé », « portrait-robot », « violeur », « exhibitionniste »;
- judicial proceedings: « enquêteurs », « avocats », « cour », « procureur », « réquisition », « réclusion », « prison », « accusé », « interpellé », « agresseur », « condamné », « procédure », « instruction », « ADN », etc.

3.3. Synthesis
At the close of this analysis, we choose to concentrate on the mirror-image definitions of the joggeuse and the runneuse, leaving aside the other lexical units detailed above.
The generic and discursive isotopies that make up the semantic fabric of the articles in which these two forms occur give rise, by summation, to the construction of two antagonistic signifieds. The joggeuse appears as: 1. /isolated/ (she runs alone); 2. /vulnerable/ (she is defenceless against an attacker); and, whatever happens, since the fait divers genre demands it, 3. /victim/ (she is attacked, raped, killed). Conversely, the runneuse is: 1. /accompanied/ (she runs in collective sports events); 2. /secured/ (by technology, notably GPS watches that help manage effort and optimize performance, and by regular training; the injuries she suffers appear benign, moreover, compared with the risks run by the joggeuse); 3. /a competitor/ (she takes part in competitions).

4. Conclusion
In this article, we have tried to show how the semantic grounds arising from genres and discourses can modify, by summation, the signifieds of the lexical units used in them. For two lexical units sharing a priori an identical referent – a woman practising running – actualization in a journalistic corpus brings out very different semantic contents. The point is not to claim that joggeuses are necessarily women in danger, but the regularity with which the word joggeuse is actualized in the press as a /victim/, /vulnerable/ and /isolated/, could in time affect the perception of a practice whose media reality is exclusively macabre. In other words, in press discourse, for women, jogging is a dangerous practice and the joggeuse a victim of assault, whereas the runneuse is an athlete involved in social events and running a safe and rewarding practice.

References
Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press.
Biber, D. (2009). Corpus-Based and Corpus-driven Analyses of Language Variation and Use. In B. Heine and H. Narrog (eds), The Oxford Handbook of Linguistic Analysis, 159–191. Oxford: Oxford University Press.
Heiden, S., Magué, J.-P. and Pincemin, B. (2010). TXM : Une plateforme logicielle open-source pour la textométrie – conception et développement. In S. Bolasco et al. (eds), Journées internationales d'Analyse statistique des Données Textuelles, vol. 2, 1021–1032.
Malrieu, D. and Rastier, F. (2001). Genres et variations morphosyntaxiques. Traitement automatique des langues, 42(2): 547–577.
Rastier, F. (2006). Passages. Corpus, 6: 125–152.
Rastier, F. and Valette, M. (2009). De la polysémie à la néosémie. Le français moderne, 77: 97–116.
Valette, M. (2010a). Propositions pour une lexicologie textuelle. Zeitschrift für Französische Sprache und Literatur, 37: 171–188.
Valette, M. (2010b). Méthodes pour la veille lexicale. In L. Messaoudi et al. (eds), Sur les dictionnaires. Kénitra: Publication du laboratoire Langage et société, Université Ibn Tofail, 251–272.

The Transparency Engine – A Better Way to Deal with Fake News
Peter A. Gloor (1), Joao Marcos de Oliveira (2), Detlef Schoder (3)
(1) MIT Center for Collective Intelligence, Cambridge, MA – pgloor@mit.edu
(2) Galaxyadvisors, Aarau, Switzerland – jmarcos@galaxyadvisors.com
(3) University of Cologne, Germany – schoder@wim.uni-koeln.de

Abstract
We introduce the "Transparency Engine", a social network search engine that separates fact from fiction by exposing (1) the hidden "influencers" and (2) their "tribes". Our goals are to quantify the influence and relevance of persons, concepts, or companies on institutions, issues or industries by tracking the dynamics and changes in the observed environment. In particular, we visualize the networks of influence for a given social or economic ecosystem, thus providing a tool for both the scientific community and the general public (including journalists, or anyone interested in checking news) to track the diffusion of new ideas, both good and bad. In particular, the Transparency Engine exposes the hidden influencers behind fake news. We propose a unique solution which combines three subsystems we have been developing over the last five years: (I) Powergraph, (II) Tribefinder, and (III) Swarmpulse. The Powergraph displays the degree and power of a spreader's position by reconstructing her/his (social) network via websites and social position in the Twitter universe. The Tribefinder exposes the tribal echo chambers on Twitter nurturing fake news items through social media mining, thus allowing the news consumer to develop an informed opinion about the motivation of the spreaders of fake news. This is done by mining the Twitter word usage of tribe members with neural networks using TensorFlow. The Swarmpulse system finds the most relevant fake and non-fake news on Wikipedia and Twitter by combining their emergent patterns.

Keywords: Fake News, Transparency Engine, News, Truth, Belief System, Machine Learning, Big Data

1. Introduction
According to independent investigations, Russian misinformation and fake news spread by Western conspiracy theorists on social media may have contributed to the outcome of the Brexit vote1 and the election of Donald Trump2. Misinforming news has become a significant threat to societal discourse and opinion formation. Mechanisms to deal with this type of fake news by making it transparent are urgently needed. The goal of this project is to understand the concept of "fake news" in the context of forming collective awareness through social media. The concept of truth is dependent on a personal belief system. On the other hand, conspiracy theories and satire are nothing new, and people who WANT to believe them have always embraced them. Categorizing news as "fake news" happens when it goes against one's innermost and most passionate beliefs. The more somebody is embedded in a predefined belief system, the more likely they are to believe fake news. For instance, people who use Facebook as their major news source are more likely to believe fake news (Silverman & Singer-Vine, 2016). What mental processes are happening when we embrace fake news? When embedded in a particular belief system, individuals categorize news as fake immediately when they read it, because they do not want to believe it; similarly, they immediately categorize news as true when it fits perfectly into their belief system. For instance, Trump followers label mainstream news as "fake news", while mainstream news labels news from Trump followers as "fake news".

1 Londongrad – Russian Twitter trolls meddled in the Brexit vote. Did they swing it? The Economist, Nov. 23rd, 2017.
2 https://en.wikipedia.org/wiki/Russian_interference_in_the_2016_United_States_elections
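To make the Tribefinder idea concrete, the following sketch trains a small classifier that assigns a tribe label from word usage. It is an invented stand-in, not the system itself: the tweets, labels and architecture are illustrative, and only the general approach (word-usage features fed to a TensorFlow neural network) follows the description above.

```python
# Illustrative stand-in for tribe classification from word usage (toy data).
import tensorflow as tf

tweets = [
    "climate change is a hoax invented by elites",
    "new marathon personal best this weekend",
    "vaccines cause autism wake up people",
    "peer reviewed study confirms the warming trend",
]
tribes = [0, 1, 0, 1]  # invented tribe labels

# Turn word usage into tf-idf features, then train a small dense network.
vectorize = tf.keras.layers.TextVectorization(max_tokens=500, output_mode="tf_idf")
vectorize.adapt(tweets)
X = vectorize(tf.constant(tweets))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, tf.constant(tribes), epochs=10, verbose=0)
print(model.predict(X).argmax(axis=1))  # predicted tribe per tweet
```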
2. Related Work
There are many approaches to creating more transparency in societal discourse; indeed, this may be seen as the core task of quality journalism. Most if not all of these approaches, however, are not well supported by IT tools, do not scale well, and many do not reveal the algorithms they apply. Fact-checking Web sites such as Wikitribune, Snopes.com, PolitiFact, and FactCheck.org, and corporate/proprietary initiatives like Facebook's fake news detection tools, mostly rely on human volunteers and/or paid staff to do the fact checking, which has major disadvantages:
- human bias: fact checkers might have a "leftist" or "right-wing" bias;
- non-scalable: the human pool of fact checkers is by definition restricted;
- deferred access: a machine can check any news item immediately, 24/7, without the expensive detective work of a human fact checker;
- non-replicable: as the fact checking is done by different users, the reader cannot understand why a certain item has been categorized in a particular way.

Among the automated approaches, the Klout score (www.klout.com) gives a metric for the social media influence of a person. However, the Klout score has to be requested manually by a user who wants one, so it is heavily skewed towards self-promoters. Another solution for finding the social media profiles of users is to leverage the Google Knowledge Graph (https://en.wikipedia.org/wiki/Knowledge_Graph), which has been employed in theoretical work by Ciampaglia et al. (2015) for fact checking by measuring the shortest-path distance between related concept nodes (a minimal sketch of this idea follows at the end of this section). Another approach uses machine learning to identify fake news: for instance, Ott et al. (2011) showed that machine learning based on word usage beats humans by wide margins at identifying fake reviews on TripAdvisor, by computing feature vectors from the text of the reviews. More generally, Youyou et al. (2015) have shown that, to identify (tribal) attributes of people, having a computer look at their Facebook likes through machine learning is more reliable than human judgment. A similar research question is addressed when identifying Twitter bots based on their networking patterns and word usage. For instance, Botcheck (botcheck.me) and Botometer (https://botometer.iuni.iu.edu/#!/) (Varol et al. 2017) estimate the likelihood that any Twitter account is a bot, based on its numbers of followers and friends, tweeting dynamics, and the content of its tweets.
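As a rough illustration of the knowledge-graph idea of Ciampaglia et al. (2015), the sketch below scores a candidate statement by the shortest-path distance between its subject and object in a small concept graph. The toy graph and the scoring function are our own illustrative assumptions, not the authors' published implementation.

```python
# Sketch: knowledge-graph fact checking via shortest-path distance,
# after the idea of Ciampaglia et al. (2015). The graph and the scoring
# function are illustrative assumptions, not the published method.
import networkx as nx

# Toy concept graph: nodes are entities, edges are known relations.
G = nx.Graph()
G.add_edges_from([
    ("Barack Obama", "Democratic Party"),
    ("Democratic Party", "United States"),
    ("Barack Obama", "Honolulu"),
    ("Honolulu", "Hawaii"),
    ("Hawaii", "United States"),
])

def plausibility(graph, subject, obj):
    """Score a (subject, obj) claim: short paths through the knowledge
    graph suggest a plausible statement, missing paths do not."""
    try:
        d = nx.shortest_path_length(graph, subject, obj)
    except nx.NetworkXNoPath:
        return 0.0
    return 1.0 / (1.0 + d)  # higher for nearby concepts, -> 0 with distance

print(plausibility(G, "Barack Obama", "Democratic Party"))  # adjacent: 0.50
print(plausibility(G, "Honolulu", "Democratic Party"))      # two hops: 0.33
```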
3. Motivation – How Influencers Spread Fake News
Today's online social media consumers are exposed to a cacophony of fact and fiction as never before. "It is true, I read it on the Internet" is unfortunately a prominent way for information to spread. For example, immediately after the 2016 US Presidential elections, in early November 2016, Hillary Clinton was accused of running a pedophile ring out of a pizza restaurant in Washington. Called "pizzagate", this news item became a favorite call to arms among right-wing extremists and Donald Trump supporters, leading one incensed fanatic to drive a few hundred miles from Salisbury, North Carolina to Washington DC and fire his automatic gun into the pizza restaurant. The origin of this fake news story has been well documented: it started from a white supremacist Twitter account and was then picked up by the conspiracy news Web site of Sean Adl-Tabatabai, where it fell on the willing ears of the American right. Just as Google has revolutionized the way we access information, our proposed Transparency Engine intends to change the way we look at such information, by exposing the hidden influencers, like Sean Adl-Tabatabai, who inject new information into the public discourse.

3.1 The concept of tribes and how they perceive information
Besides knowing the sources of rumors, it is essential to also know the (political) orientation of these influencers. Quantum physics suggests that there are many different universes, with our current world embedded in just one out of infinitely many others. Looking at radically different interpretations of the same news item, it seems we are indeed living in different quantum universes. These different universes can be grouped into "tribes" (Sloterdijk 2011). Each of these tribes has its own reality, defining fact and fiction for the members of the tribe. Previous research (De Oliveira et al. 2017) has exemplified this idea. What is fact for one tribe is fiction for another tribe. It all depends on the tribe, and on what the members of the tribe WANT to believe. Examples are the denial of human-influenced global warming, the explanation of evolution through "intelligent design", or the alleged causal relationship between vaccination and autism, where some tribes perceive the related issues as "fact" and "truth" whereas other tribes perceive the objectively same issues as "fiction", "lies" or "fake news", thus creating an "alternate reality".

In contrast to the power of states and corporations, the growing power and dynamics of networks is mostly invisible. Unlike in hierarchical structures, the central influencers in networks are hard to identify with the "naked eye". What matters for spreading any news – fact or fake – is the influence of the spreader. The main way to quantify the influence of spreaders is their position in a given network and, with it, the power to "multiply" the word to larger audiences. More specifically, the degree and power of the spreaders' position can be measured by reconstructing their (social) network via their Web sites and their social position, for example in the Twitter universe (and other social networking platforms), thus measuring the influence of Web sites and of Twitter accounts on a specific topic.

Figure 1: Twitter retweet network for "pizzagate" (left), and Twitter influence network (right).

Pizzagate only spread because a moderately influential spreader, Sean Adl-Tabatabai, discovered the original tweet and posted it on his conspiracy news Web site. Figure 1 illustrates how social media analysis can increase trust and transparency by visualizing the echo chambers of fake news about pizzagate, using our social media analysis system Condor (Gloor 2017). The picture on the left shows the Twitter network about pizzagate: each node is a person tweeting, and a link between two people means that one person retweets a tweet sent by the other, or mentions the other in a tweet. There is a large cluster in the center of the network, made up of believers in the fake news. They are reinforcing each other, increasing the traffic in their echo chamber. The few supporters of Hillary Clinton, trying to debunk the fake news, are pushed aside; their tweets are ignored by the large echo chamber of conspiracy theory believers. The people in the periphery (the "asteroid belt") are tweeting into the void, as their tweets are ignored by friends and foes alike. Using an influencer algorithm (Gloor 2017) shows that the discourse about pizzagate on Twitter is dominated by Trump followers (the picture on the right). Our algorithm makes somebody an influencer if the words she or he uses are picked up by others and spread quickly through the network. As the right-hand picture in Figure 1 shows, there is just one voice of reason left, while the proponents of pizzagate reinforce each other much more, with a cluster of influential spreaders of wild ideas in the center and other conspiratorialists in the periphery of the cluster, being retweeted by hundreds of like-minded others (shown as "parachutes" in the graph).
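The retweet-and-mention network just described can be assembled and ranked with standard graph tooling. In the sketch below, PageRank serves as a generic stand-in for the proprietary Condor influencer algorithm (Gloor 2017); the sample tweets are invented.

```python
# Sketch: build a retweet/mention network and rank likely influencers.
# PageRank here is a generic stand-in for the Condor influencer
# algorithm (Gloor 2017); the sample tweets are invented.
import networkx as nx

tweets = [  # (author, retweeted_or_mentioned_user)
    ("alice", "conspiracy_hub"),
    ("bob", "conspiracy_hub"),
    ("carol", "conspiracy_hub"),
    ("dave", "alice"),
    ("erin", "debunker"),
]

# Directed edge from the retweeter to the original author:
# influence flows toward accounts whose words get picked up.
G = nx.DiGraph()
G.add_edges_from(tweets)

scores = nx.pagerank(G)
for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{user:15s} {score:.3f}")  # echo-chamber hubs rank highest
```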
4. Our Solution – Transparency Engine
We introduce the "Transparency Engine", a social network search engine that separates fact from fiction by exposing the hidden influencers and their "tribes" behind fake news. Just as Google has revolutionized the way we access information, the Transparency Engine changes the way we look at such information, by exposing the hidden influencers. Our goals are fourfold: (1) quantify the influence and relevance of persons, concepts, and companies on institutions, issues or industries; (2) qualify the dynamics and changes in the observed environment; (3) visualize the networks of influence for a given social or economic ecosystem; (4) provide a tool to track the diffusion of new ideas, both good and bad.

4.1 Powergraph
Our solution combines three subsystems we have been developing over the last five years (Fuehres et al. 2012, de Oliveira et al. 2016, de Oliveira et al. 2017): Powergraph, Tribefinder, and Swarmpulse. Powergraph measures the importance of "notable" people, as defined by Wikipedia, by calculating the number of other Wikipedia people pages that can be reached within two degrees of separation from a particular people page on Wikipedia. This is a proxy for social capital, as it basically measures the influence of the people a person is connected to. The system also identifies those people with Twitter accounts by matching them with sources of information like Wikidata and the Google Knowledge Graph.

Figure 2: Sample Powergraph for "global warming".

Figure 2 illustrates our prototype version of the Powergraph, showing the social network of the most influential people on "global warming", based on their Wikipedia and Twitter presence. We find, not surprisingly, that Donald Trump and the former US presidents are most influential. We measure the importance of people by calculating the number of other Wikipedia people pages and Twitter friendship networks that can be reached within two degrees of separation from a particular people page. This is a proxy for social capital, as it basically measures the influence of the people a person is connected to (Fuehres et al. 2012).
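The two-degrees-of-separation reach that Powergraph uses as a social-capital proxy reduces to counting the nodes within graph radius 2. A minimal sketch follows, assuming an undirected link graph of people pages (the graph itself is invented).

```python
# Sketch: two-degree reach as a social-capital proxy, as in Powergraph.
# The link graph below is an illustrative assumption.
import networkx as nx

links = nx.Graph()
links.add_edges_from([
    ("A", "B"), ("B", "C"), ("C", "D"),  # A reaches B (1 hop), C (2 hops)
    ("A", "E"), ("E", "F"),              # ... and E, F
])

def two_degree_reach(graph, person):
    """Number of other people pages reachable within two hops."""
    ego = nx.ego_graph(graph, person, radius=2)
    return len(ego) - 1  # exclude the person itself

for p in links.nodes:
    print(p, two_degree_reach(links, p))
```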
4.2 Tribefinder
The second component of our system, Tribefinder (de Oliveira et al. 2017), identifies the tribal affiliations of the opinion leaders on any news item. To assign a tribe to an influencer, our system analyzes their word usage, using deep learning. An integral component of the Tribefinder system is "TribeCreator", a subsystem that automatically helps the user find people who belong to a newly defined tribe by looking at profile self-descriptions, the content of tweets, and followers and Twitter friends. For example, if a user wants to create a tribe of treehuggers (people who love nature), they can search for people with profile descriptions that match the idea of this tribe ("nature lover", "I love nature", "nature", etc.), for people who follow pages about nature, or who tweet about nature. In a second step, we calculate the vocabulary that these influentials use in their tweets. This vocabulary is then matched against the vocabulary of any Twitter user, calculating their tribal affiliations. Knowing the tribal affiliations of the thought leaders for a news item allows readers to position the news item correctly, deciding for themselves whether they want to trust news coming from a particular influencer.
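The vocabulary-matching step can be sketched with TF-IDF and cosine similarity, used here as a simple stand-in for Tribefinder's TensorFlow classifier; all tribe corpora and tweets below are invented.

```python
# Sketch: assign a user to the tribe whose seed vocabulary their tweets
# resemble most. TF-IDF + cosine similarity is a simple stand-in for
# Tribefinder's TensorFlow classifier; all texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tribe_corpora = {
    "treehuggers": "nature lover forest hiking green planet trees",
    "petrolheads": "engine horsepower racing turbo fuel lap circuit",
}

user_tweets = "spent the weekend hiking in the forest, what a green planet"

vec = TfidfVectorizer()
docs = list(tribe_corpora.values()) + [user_tweets]
X = vec.fit_transform(docs)

# Similarity of the user's text (last row) to each tribe corpus.
sims = cosine_similarity(X[-1], X[:-1]).ravel()
best = max(zip(tribe_corpora, sims), key=lambda kv: kv[1])
print(best)  # ('treehuggers', ...)
```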
4.3 Swarmpulse
The third component of our system is Swarmpulse (de Oliveira et al. 2016). Swarmpulse finds the most recently edited Wikipedia pages and uses Twitter to see which people are talking about those subjects. This system helps users serendipitously spot the most recent news items they were not aware of, and then check their influencer network on the Powergraph and calculate their tribal affiliations with Tribefinder.

5. Conclusion
The best approach to fact checking is a critical, well-informed mind. Our world needs more powerful ways and tools to support the critical mind, and transparency is a key enabler for this. The Transparency Engine thus provides the foundation for informing the critical mind. The global Powergraph will display the power network of the one million globally most influential people on Wikipedia people pages and the most popular Twitter users, and will allow all other Twitter users to position themselves within the context of the Powergraph. The Tribefinder will show the "truth of tribes" by creating tribes through their use of language on social media, assigning each influencer to one or more tribes and showing the tribal affiliations in the Powergraph. Swarmpulse will build an index of the most recent significant news by combining new edits on Wikipedia with the most popular tweets from influential twitterers, and will show the actors involved through the Powergraph. The landscape of transparency-generating approaches calls for a scientific, open approach such as the Transparency Engine proposes. Our aim is to contribute substantially to popularizing and democratizing fact checking for the whole world. Everyone should be enabled to do this easily and simply by themselves!

References
Ciampaglia, G. L., Shiralkar, P., Rocha, L. M., Bollen, J., Menczer, F., & Flammini, A. (2015). Computational fact checking from knowledge networks. PLoS ONE, 10(6): e0128193.
de Oliveira, J., & Gloor, P. (2016). The Citizen IS the Journalist: Automatically Extracting News from the Swarm. In Designing Networks for Innovation and Improvisation: Proceedings of the 6th International COINs Conference (Springer Proceedings in Complexity), Rome, Italy, June 9-11, 2016.
de Oliveira, J., & Gloor, P. (2017). GalaxyScope: Finding the "Truth of Tribes" on Social Media. In Proceedings of the 7th International COINs Conference (Springer Proceedings in Complexity), Detroit, September 11-14, 2017.
Fuehres, H., Gloor, P., Henninger, M., Kleeb, R., & Nemoto, K. (2012). Galaxysearch: Discovering the Knowledge of Many by Using Wikipedia as a Meta-Search Index. In Proceedings of Collective Intelligence 2012, Cambridge, MA, April 18-20.
Gloor, P. (2017). Sociometrics and Human Relationships: Analyzing Social Networks to Manage Brands, Predict Trends, and Improve Organizational Performance. Emerald Publishing, London.
Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. (2011). Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pp. 309-319.
Silverman, C., & Singer-Vine, J. (2016). Most Americans who see fake news believe it, new survey says. BuzzFeed News; https://www.buzzfeed.com/craigsilverman/fake-news-survey
Sloterdijk, P. (2011). Bubbles: Microspherology. MIT Press.
Varol, O., Ferrara, E., Davis, C. A., Menczer, F., & Flammini, A. (2017). Online human-bot interactions: Detection, estimation, and characterization. arXiv preprint arXiv:1703.03107.
Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences (PNAS).

Brexit and Twitter: The voice of people
Francesca Greco, Leonardo Alaimo, Livia Celardo (Sapienza University of Rome – francesca.greco@uniroma1.it; leonardo.alaimo@uniroma1.it; livia.celardo@uniroma1.it)

Abstract 1
Nowadays there is an increase in Euroscepticism among EU citizens, as shown by the development of ultra-nationalist parties in the European states. Regarding European Union membership, public opinion is divided in two. The 2016 British referendum, in which citizens chose to "Leave", shaking public opinion, and the following general election in June 2017, in which the British Europeanist parties prevailed, in contrast with the 1975 British referendum where 72% of citizens had chosen to "Remain", are clear examples of this fracture. There are still few studies investigating Brexit discourses within social media, and most of them focus on the 2016 British referendum. For this reason, this exploratory research aims to identify how Brexit and the EU are discussed on Twitter today, through a text mining approach. We collected all the tweets containing the terms "Brexit" and "EU" over a period of 10 days. Data collection was performed with the TwitteR package, resulting in a large corpus to which we applied multivariate techniques in order to identify the contents and the sentiments behind the shared comments.

Abstract 2
In recent years there has been an increase in Euroscepticism among EU citizens, as witnessed by the development of ultra-nationalist parties in several European states. On the theme of "Europe", public opinion is divided between Europeanists and Eurosceptics. A clear example of this division is given by recent British events: in the 2016 referendum, British citizens chose to "leave" the EU, shaking public opinion, while the subsequent general election of June 2017 saw the affirmation of the main pro-European parties. There are still few studies in the literature investigating how the theme of Brexit in relation to the EU is dealt with in social media, since most of them focus on the causes and potential effects of the June 2016 vote. In this sense, this exploratory research aims to identify how Brexit and the European Union are being discussed on Twitter at this historical moment, through automatic text analysis. To this end, all messages containing the terms "Brexit" and "EU" were collected for 10 days using the TwitteR package, obtaining a large corpus to which multivariate techniques were applied, in order to identify the contents and sentiments related to the theme under examination.
Keywords: Brexit, Twitter, Emotional Text Mining.

1. Introduction
There is a growing increase in Euroscepticism among EU citizens nowadays, as shown by the development of ultra-nationalist parties in the European states. Regarding European Union membership, public opinion is divided between Eurosceptics and pro-Europeans, as shown by the 2016 British referendum ("Brexit"), in which 52% of citizens chose to "Leave". As further evidence of this division, the following general election of June 2017 saw the affirmation of the main Europeanist parties (especially the Labour Party), and the results led to a hung Parliament. Brexit has shaken European public opinion, as it revealed the strength of the anti-Europeanist trend. During the 60th anniversary of the Treaties of Rome in 2017, millions of citizens expressed their support for the EU by participating in Europeanist demonstrations in many European cities. One useful starting point for explaining the result of Brexit is to focus on the electoral issue: the relationship between the UK and Europe. This has always been a central and rather controversial issue in the British public debate. The media, public opinion and the political class have always been deeply critical and sceptical about European integration. This position influences citizens' attitudes towards the Union, which is not only considered distant and inadequate for resolving everyday issues (immigration, unemployment, and so on), but is often perceived as their major cause, by limiting the political and economic power of the United Kingdom. The electoral outcome created disbelief all over the world. Britain is the home of the term Euroscepticism (Spiering 2004, p. 127). But, while it is clear that a large proportion of UK residents are sceptical about Europe, it is not clear that this position coincides with the wish to leave the EU; Euroscepticism should not be confused with this wish. Szczerbiak and Taggart (2008) distinguished two different types of Euroscepticism: Hard Euroscepticism, a principled opposition to the EU and European integration, and Soft Euroscepticism, where concerns about one (or a number) of policy areas lead to the expression of qualified opposition to the EU. Although there are several studies exploring British Euroscepticism, only a few of them investigate Brexit discourses within social media. For this reason, we decided to perform a quantitative study in which the online discourses regarding Brexit and the EU are analysed using two different approaches, Content Analysis and Emotional Text Mining. The aim is to explore not only the contents but also the sentiments shared by users on Twitter. For this paper, we used one of the most important and best-known microblogging tools, Twitter.
It is an online platform for sharing real-time, character-limited communication with people of similar interests which, in 2017, counted over 300 million users and an average of about 500 million tweets sent per day.

2. Data collection and analysis
In order to explore the sentiments and contents on Brexit and the EU in Twitter communications over ten days, we scraped all the messages in English produced from September 22nd to October 2nd, 2017, containing both the words Brexit and EU. The data extraction was carried out with the TwitteR package of R Statistics (Gentry, 2016). We collected 221,069 messages, including 83% retweets, from which two samples of tweets were extracted. The first, used for the sentiment analysis, is composed of 99,812 messages, where the retweets were limited to a threshold of 31, resulting in a large corpus of 1,601,985 tokens; for the second, used for the content analysis, we excluded all the retweets, resulting in a large corpus of 37,318 tweets and 618,255 tokens. In order to check whether it was possible to statistically process the data, two lexical indicators were calculated: the type-token ratio and the hapax percentage (TTRcorpus1 = 0.02; Hapaxcorpus1 = 39.8%; TTRcorpus2 = 0.04; Hapaxcorpus2 = 52.31%). Given the large size of the corpus, both lexical indicators highlighted its richness and indicated the possibility of proceeding with the analysis.
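The two lexical indicators used here are simple to compute; a minimal sketch follows (whitespace tokenization is a simplifying assumption, and the example corpus is invented; real pipelines typically lemmatize first).

```python
# Sketch: type-token ratio and hapax percentage, the two lexical
# indicators used above. Whitespace tokenization is a simplifying
# assumption; the example corpus is invented.
from collections import Counter

corpus = "brexit talks stall as eu and uk trade blame over brexit bill"
tokens = corpus.lower().split()

counts = Counter(tokens)
types = len(counts)                                # distinct word forms
ttr = types / len(tokens)                          # type-token ratio
hapax_pct = 100 * sum(1 for c in counts.values() if c == 1) / types

print(f"TTR = {ttr:.2f}, hapax = {hapax_pct:.1f}% of types")
```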
2.1 Emotional text mining
We know that people's sentiments depend not only on their rational thinking but also, and sometimes most of all, on the emotional and social functioning of the mind. If the conscious process sets the manifest content of the narration, that is, what is narrated, the unconscious process can be inferred from how it is narrated, that is, from the words chosen to narrate and their association within the text. Accordingly, it is possible to detect the associative links between words in order to infer the symbolic matrix determining the coexistence of these terms in the text (Greco, 2016). To this end we perform a multivariate analysis based on a bisecting k-means algorithm to classify the text (Savaresi et Boley, 2004), and a correspondence analysis to detect the latent dimensions setting the cluster-per-keywords matrix (Lebart et Salem, 1994), by means of the T-Lab software. The interpretation of the cluster analysis results allows us to identify the elements characterizing the emotional representation of Brexit, while the results of the correspondence analysis reflect its emotional symbolization. Through the interpretation of the clusters, we classify the emotional representations into positive, neutral and negative sentiments, determining the percentage of messages for each sentiment modality. To this end, the first corpus was cleaned and pre-processed with the software T-Lab (T-Lab Plus version, 2017) and keywords were selected. In particular, we used lemmas as keywords instead of types, filtering out the lemmas Brexit and EU and those of the low rank of frequency (Greco, 2016). Then, on the tweets-per-keywords matrix, we performed a cluster analysis with a bisecting k-means algorithm limited to twenty partitions, excluding all the tweets that do not have at least two keyword co-occurrences. The percentage of explained variance (η) was used to evaluate and choose the optimal partition. To finalize the analysis, a correspondence analysis on the keywords-per-clusters matrix was performed in order to explore the relationships between clusters and to identify the emotional categories setting the Brexit representations.

2.2 Content analysis
Content analysis is a technique used to investigate the content of a text; in text mining, many methods exist to analyse it automatically. One of these is text clustering, where the corpus is split into different subgroups based on word/document similarities (Iezzi, 2012). In this paper, a text co-clustering approach (Celardo et al., 2016) is used. The objective is to classify rows and columns simultaneously, in order to identify groups of texts characterized by specific contents. To do that, the data were pre-processed with the Iramuteq software, lemmatizing the texts and removing stop words and terms with a frequency lower than 10. The weighted term-document matrix was then co-clustered through the double k-means algorithm (Vichi, 2001); the number of clusters for both rows and columns was fixed using the Calinski-Harabasz index.
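The bisecting k-means step of Section 2.1 can be sketched as a loop of plain 2-means splits; T-Lab's exact variant may differ (e.g., in the criterion used to pick the cluster to split). The tiny document-term matrix below is invented.

```python
# Sketch: bisecting k-means over a tweets-per-keywords matrix, the
# clustering step of Section 2.1. T-Lab's exact variant may differ;
# the tiny document-term matrix is invented.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([  # rows: tweets, cols: keyword counts
    [2, 0, 1, 0], [1, 0, 2, 0], [0, 2, 0, 1],
    [0, 1, 0, 2], [2, 1, 1, 0], [0, 2, 1, 2],
], dtype=float)

def bisecting_kmeans(X, n_clusters):
    clusters = [np.arange(len(X))]  # start with one cluster of all rows
    while len(clusters) < n_clusters:
        # Split the largest cluster (other variants use within-cluster SSE).
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

for c in bisecting_kmeans(X, 3):
    print(c)  # row indices of each of the three final clusters
```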
3. Emotional text mining: main results and discussion
The results of the cluster analysis for ETM show that the 655 selected keywords allow the classification of 88.6% of the tweets. The percentage of explained variance was calculated on partitions from 3 to 19, and it shows that the optimal solution is six clusters (η = 0.057). The correspondence analysis detected five latent dimensions. Table 1 summarizes the emotional map of Brexit and the EU emerging from the English tweets, showing how the clusters are placed in the factorial space produced by the five factors. The first factor represents the political and economic domain where Brexit seems to have its main impact; the second factor reproduces the possible solutions of Brexit: a separation or a new agreement; the third factor represents the national or European level of reaction to Brexit; the fourth factor is blame, distinguishing the blaming of politicians from the blaming of the willingness to be independent; and the fifth factor is political leadership, distinguishing old and new policies.

Table 1: Correspondence analysis results. Five factors (explained inertia: Factor 1, 27.5%; Factor 2, 24.3%; Factor 3, 19.8%; Factor 4, 15.6%; Factor 5, 12.9%), each opposing a negative pole (NP) and a positive pole (PP) of keywords with their absolute contributions.

The six clusters are of different sizes and reflect the representations of Brexit (Table 2), which correspond to three different sentiments: positive, negative for domestic reasons, and negative for foreign ones (Table 1).

Table 2: Clusters, with the percentage of context units (CU) classified in each:
Cluster 1, Good Choice (10.0% CU)
Cluster 2, EU Reaction (14.9% CU)
Cluster 3, Uncertain Future (20.9% CU)
Cluster 4, British Leadership (13.4% CU)
Cluster 5, Divorce Conditions (19.2% CU)
Cluster 6, Informed Choice (21.7% CU)

The first cluster represents the choice to leave the EU as a good option, underlining the need to proceed; the second cluster focuses on the EU political reaction in fixing the divorce conditions, perceiving EU political representatives as unfavourable and therefore threatening; the third cluster represents Britons' hope of improving their economic condition by leaving the EU as naive; the fourth cluster represents the old British political leadership as incompetent, unable to protect and adequately inform Britons in order to support them in remaining in the EU; the fifth cluster reflects the negotiation of the divorce conditions, perceiving the negotiation as unfair and the costs of leaving the EU as a punishment; and the sixth cluster represents Brexit as an informed choice of the Britons, highlighting that its consequences belong to the policy domain, which should respect the citizens' choice.

By interpreting the clusters, we detected six different representations of Brexit that correspond to three different sentiments (Table 1). We considered as positive (31.7%) the representations of Brexit as a Good Choice or an Informed Choice, and as negative all the other representations (68.3%). Among the negative clusters, we distinguished negativity according to the origin of the problem: Uncertain Future and British Leadership are negative for domestic reasons (34.2%), that is, the lack of competence of the UK political leadership; EU Reaction and Divorce Conditions are negative due to foreign factors (34.1%), as after Brexit the EU seems to be perceived as vindictive and, therefore, threatening.
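The correspondence analysis behind the factorial map of Table 1 can be sketched as an SVD of the standardized residuals of the keywords-per-clusters contingency table; the tiny matrix below is invented, and T-Lab's implementation may differ in details such as scaling.

```python
# Sketch: correspondence analysis of a keywords-per-clusters matrix via
# SVD of standardized residuals. The contingency table is invented;
# T-Lab's implementation may differ in scaling details.
import numpy as np

N = np.array([  # rows: keywords, cols: clusters (co-occurrence counts)
    [20,  2,  1],
    [15,  3,  2],
    [ 2, 18,  4],
    [ 1, 16,  5],
    [ 3,  2, 12],
], dtype=float)

P = N / N.sum()                      # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)  # row and column masses
E = np.outer(r, c)                   # expected proportions under independence
S = (P - E) / np.sqrt(E)             # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

inertia = sv**2 / (sv**2).sum()      # share of inertia per latent dimension
row_coords = (U * sv) / np.sqrt(r)[:, None]  # principal coordinates of keywords
print("explained inertia:", np.round(inertia, 3))
print("keyword coordinates (dim 1):", np.round(row_coords[:, 0], 2))
```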
4. Content analysis: main results and discussion
The pre-processing phase, applied to the second corpus, allowed us to identify a set of 1,957 keywords, representing 97% of the tweets; on the term-document matrix of dimension 1,957 × 36,383 we then calculated the Calinski-Harabasz index in order to define the number of clusters for rows and columns. After calculating the index values for partitions from 2 to 10 on each dimension, the Calinski-Harabasz index suggested classifying the words into three groups and the tweets into five groups. Table 3 shows the centroids of the clusters.

Table 3: Centroids matrix (word clusters × document clusters).
                  Doc cl. 1 (55%) | Doc cl. 2 (20%) | Doc cl. 3 (12%) | Doc cl. 4 (11%) | Doc cl. 5 (2%)
Word cluster 1 |       0.005      |      0.003      |      0.004      |      0.000      |      0.000
Word cluster 2 |       0.002      |      0.063      |      0.003      |      0.149      |      0.012
Word cluster 3 |      -0.002      |      0.000      |      0.090      |     -0.003      |      0.309

Table 4: Word groups (first words listed by frequency of occurrence).
Cluster 1, Negotiation: stay, Junker, ambassador, cry, track, surge, peer, shape, turmoil, survive
Cluster 2, Economic Transformation: leave, move, transition, late, deal, trade, retain, post, Macron, urge
Cluster 3, British Identity: home, sound, cake, plan, datum, live, finish, Id, idea, national

As shown in Table 3, the algorithm identified five blocks of specificities: the first cluster of words is connected to the first group of tweets; the second is specific to the second and the fourth clusters of tweets; and the third is related to the third and the fifth groups of tweets. The groups of words are presented in Table 4. The first group of words relates to the need to define new rules and settlements within the negotiation, and it represents more than half of the tweets; it has no strong specificities related to the texts but, compared with all the document clusters, it seems to be most connected to those words. For the other two groups of words, on the other hand, there are more marked specificities: the second cluster of words concerns the definition of new economic agreements and is connected to 31% of the tweets, while the third, related to the need to specify a new identity after Brexit, is representative of 14% of the corpus documents.
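The simultaneous row/column classification described above can be sketched with an off-the-shelf co-clustering routine; SpectralCoclustering is used here as an accessible stand-in for the double k-means algorithm (Vichi, 2001) and, unlike double k-means, it imposes the same number of row and column clusters. The matrix below is invented.

```python
# Sketch: co-clustering a term-document matrix, as in the double k-means
# step above. SpectralCoclustering is a stand-in for double k-means
# (Vichi, 2001) and uses equal row/column cluster counts; data invented.
import numpy as np
from sklearn.cluster import SpectralCoclustering

X = np.array([  # rows: terms, cols: documents (weighted frequencies)
    [5, 4, 0, 0, 1],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [0, 0, 4, 5, 1],
])

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
print("term clusters:    ", model.row_labels_)
print("document clusters:", model.column_labels_)
```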
5. Conclusions
The results of the two analyses showed a strong relationship between the terms "Brexit" and "EU", not only in terms of sentiment but also in terms of content. In line with the literature, the sentiment analysis revealed the presence of both positive and negative opinions with respect to the exit of the United Kingdom from the EU. On the other hand, the analysis of the contents showed that Twitter communications on Brexit focus primarily on the concept of negotiation. The remaining messages take into account both the economic features of Brexit and the need to redefine the national identity. To conclude, the results of the two analyses revealed that Brexit is a theme with a strong emotional charge, mostly negative. British people seem to focus their attention on three issues: the new settlement, the economic consequences, and the national identity. These subjects are treated both positively and negatively by users, probably because of the lack of cohesion within the country.

References
Celardo, L., Iezzi, D. F., & Vichi, M. (2016). Multi-mode partitioning for text clustering to reduce dimensionality and noises. In Proceedings of the 13th International Conference on Statistical Analysis of Textual Data.
Gentry, J. (2016). R Based Twitter Client. R package version 1.1.9.
Greco, F. (2016). Integrare la disabilità. Una metodologia interdisciplinare per leggere il cambiamento culturale. Franco Angeli.
Hobolt, S. (2016). The Brexit vote: a divided nation, a divided continent. Journal of European Public Policy, 23(9): 1259-1277.
Iezzi, D. F. (2012). Centrality measures for text clustering. Communications in Statistics – Theory and Methods, 41(16-17): 3179-3197.
Lebart, L., & Salem, A. (1994). Statistique Textuelle. Dunod.
Savaresi, S. M., & Boley, D. L. (2004). A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4): 345-362.
Spiering, M. (2004). British Euroscepticism. In Harmsen, R., & Spiering, M. (eds), Euroscepticism: Party Politics, National Identity and European Integration. Editions Rodopi B.V.
Szczerbiak, A., & Taggart, P. (2008). Opposing Europe? The Comparative Party Politics of Euroscepticism. Volume 1: Case Studies and Country Surveys. Oxford University Press.
Vichi, M. (2001). Double k-means clustering for simultaneous classification of objects and variables. Advances in Classification and Data Analysis: 43-52.

A text mining on clinical transcripts of good and poor outcome psychotherapies
Francesca Greco (Sapienza University of Rome & Prisma S.r.l., francesca.greco@uniroma1.it), Giulio de Felice (Sapienza University of Rome & NCU University, giulio.defelice@uniroma1.it), Omar Gelo (University of Salento & Sigmund Freud University, omar.gelo@unisalento.it)

Abstract
The text mining of clinical transcripts is broadly used in psychotherapy research, but it is limited to top-down approaches, with a-priori vocabularies that code the transcripts according to a predetermined theoretical framework. Nevertheless, the semantic level that a word or clinical intervention can assume depends on the relational field in which the discourse is produced. Thus, bottom-up approaches seem particularly meaningful for addressing such a relevant issue. With the aim of investigating possible similarities and differences between good outcome and poor outcome psychotherapies, we applied a multivariate analysis to the transcripts of eight single cases of brief experiential psychotherapy (four good outcome vs four poor outcome cases), in order to identify the general core themes and their differences according to therapy outcome. The results showed a significant difference between good and poor outcome cases in the number of context units classified in two of the six core themes (clusters) (χ², df = 5, p < 0.01). These findings show how this bottom-up technique of text analysis of clinical transcripts turns out to be an enlightening tool for letting their latent dimensions emerge, setting the clinical process and outcome, and therefore provides a very useful tool for clinical purposes.

Abstract
The analysis of clinical transcripts has been widely used in psychotherapy research, although it is mostly based on the use of a dictionary that allows the text to be coded according to predetermined criteria. However, the polysemy that a word, or a clinical intervention, can assume depends on the relational field in which the discourse is produced. Therefore, bottom-up approaches seem particularly useful in addressing this issue. In order to investigate the elements characterizing clinical transcripts with positive and negative outcomes, a multivariate analysis was carried out on a corpus composed of eight transcripts of brief psychotherapy (four with positive outcome and four with negative outcome), in order to identify the general core themes and the distribution of context units across the different themes as a function of therapy outcome. The results highlighted a significant difference between cases with a positive outcome and those with an unfavourable outcome (χ², df = 5, p < 0.01), showing how the automatic text analysis of the transcripts of clinical interviews can be a useful tool for bringing out the latent dimensions organizing the process and the outcome, thus constituting a useful tool for clinical purposes.
Keywords: Emotional Text Mining, clinical transcripts, psychotherapy outcome.

1. Introduction
The text mining of clinical transcripts is very broadly used in psychotherapy research, but it is limited to top-down approaches in which a-priori vocabularies code them according to a predetermined theoretical framework. Nevertheless, the semantic level that a word, or a clinical intervention, can assume depends on the relational field in which the discourse is produced. Thus, bottom-up approaches seem particularly meaningful for addressing such a relevant issue. Psychotherapy can be considered a dynamic communicative exchange between the client and the therapist (e.g., Gelo et Salvatore, 2016). Within such an exchange, the content (i.e., the semantics) of what is said plays a primary role. Thus, the textual analysis of therapy transcripts may represent a very useful tool for psychotherapy process researchers as well as for clinicians (Gelo et al., 2013; Salvatore et al., 2017). In the field of psychotherapy research, some methods of text mining have been developed and applied, such as the Therapeutic Cycle Model (Mergenthaler, 2008) and Referential Activity (Bucci et al., 1992). Following a top-down approach, these methods use predefined content categories to semantically classify units of text. Each of these categories corresponds to a thematic dictionary containing all the words indicative of the content represented by that category. Even though these top-down methods of text mining allow for a reliable and valid investigation of the therapeutic process, they present a major limitation: they disregard the contextual nature of linguistic meaning (Carli et al., 2004; Salvatore et al., 2012). In fact, the meaning of a word is polysemic and depends on the way it combines with other words in the communicative interaction, i.e., on its association with other words. Grounded in these considerations, there has recently been a development of text mining approaches which, by means of their bottom-up logic, allow for a context-sensitive textual analysis (e.g., Salvatore et al., 2012; 2017; Cordella et al., 2014; Greco, 2016). The aim of this study is to investigate possible similarities and differences between good outcome and poor outcome psychotherapy cases by applying Emotional Text Mining (Cordella et al., 2014; Greco, 2016). Our assumption is that it is possible to detect the associative links between words in order to infer the symbolic matrix determining the coexistence of the terms in the text. To this end, we perform a multivariate analysis based on a bisecting k-means algorithm (Savaresi et Boley, 2004) to classify the text, and a correspondence analysis (Lebart et Salem, 1994) to detect the latent dimensions setting the cluster-per-keywords matrix. The interpretation of the cluster analysis allows for the identification of the elements characterizing the core themes of the treatment, while the results of the correspondence analysis reflect the emotional symbolization characterizing the therapeutic exchange. The advantage of such an approach is that the factorial space is interpreted according to word polarization, thus identifying the emotional categories that generate the core themes, and that the interpretation of the clusters is facilitated by exploring their relationships within the symbolic space (Greco et al., 2017).

2. Data collection and analysis
2.1 Data collection
The sample of the present study was drawn from the York Depression Study I, a randomized clinical trial assessing the efficacy of brief experiential therapy for depression (Greenberg et Watson, 1998; Watson et al., 1998). (We are grateful to Dr. Les Greenberg for providing us with the files of the transcripts for these cases.) From the original sample, we initially selected the six best outcome cases and the six worst outcome cases based on the Reliable Change Index of the Beck Depression Inventory (BDI; Beck et al., 1988). We then excluded four cases due to missing session transcripts. Our final sample thus comprised a total of eight cases, four with good outcomes and four with poor outcomes. The treatment length was between 15 and 20 sessions (M = 17.62; SD = 1.38), for a total of 141 sessions. Patients (one man and seven women; M = 37.1 years old) met the criteria for major depressive disorder assessed by means of the Structured Clinical Interview for DSM-III-R (SCID; Spitzer et al., 1989). Therapists (seven women and one man; M = 5.5 years of therapeutic experience) had six months of training in experiential psychotherapy (Greenberg et al., 1993). The transcripts were collected in a large corpus of 1,090,234 tokens. In order to check whether it was possible to statistically process the data, two lexical indicators were calculated: the type-token ratio and the percentage of hapax (TTR = 0.01; hapax = 35.3%). They highlighted the richness of the corpus, indicating the possibility of proceeding with the analysis.

2.2 Data analysis
First, the data were cleaned and pre-processed with the software T-Lab and keywords were selected. In particular, we used lemmas as keywords instead of types. We selected all the lemmas in the medium rank of frequency (upper frequency threshold = 933), and those of the low rank of frequency down to the threshold of 17 occurrences, that is, the average number of sessions attended by the patients (Greco, 2016). Then, in order to identify the core themes common to all the psychotherapies, we performed a cluster analysis on the keywords-per-context-units (CU) matrix by means of a bisecting k-means algorithm (Savaresi et Boley, 2004), limited to ten partitions, excluding all the CU that did not have at least two keyword co-occurrences. The eta squared value was used to evaluate and choose the optimal solution. To finalize the text mining, we performed a correspondence analysis on the keywords-per-clusters matrix (Lebart et Salem, 1994) in order to explore the relationships between clusters and to identify the emotional categories setting the psychotherapeutic process. The interpretation of the factorial space was performed according to the procedure proposed by Cordella and colleagues (2014), in which each keyword is considered only in the factor with the greatest absolute value. To finalize the analysis, we performed a chi squared test on the cluster-per-therapy-outcome contingency table, calculating the standardized residuals (SR) in order to identify the differences between good outcome and poor outcome clinical transcripts in terms of core themes.
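The final chi squared test with standardized residuals can be sketched as follows; the counts are invented (the real table has six clusters by two outcomes), but the computation mirrors the one described above.

```python
# Sketch: chi-squared test on the cluster-by-outcome contingency table
# with standardized residuals, as described above. The counts are
# invented; the real table has six clusters x two outcomes.
import numpy as np
from scipy.stats import chi2_contingency

# rows: clusters, cols: (good outcome, poor outcome) context units
table = np.array([
    [120, 110], [150, 140], [200, 310],
    [130, 120], [180, 170], [330, 210],
])

chi2, p, dof, expected = chi2_contingency(table)
residuals = (table - expected) / np.sqrt(expected)  # standardized residuals

print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.4f}")
print(np.round(residuals, 1))  # |residual| > 2 flags the diverging cells
```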
3. Main results and discussion
The results of the cluster analysis show that the 1,351 selected keywords allow for the classification of 56.6% of the context units. The high proportion of unclassified context units is due to the richness of the transcripts in paraverbal interactions (i.e., mhm, yeah, etc.). The eta squared value was calculated on partitions from 3 to 9, and it showed six clusters to be the optimal solution (η² = 0.034). Table 1 shows the emotional map emerging from the clinical transcripts, representing the location of the clusters in the factorial space produced by the interpretation of the five factors. The first factor reflects the patient's positioning, which can be passive or active; the second factor refers to the relationship, which can be familiar or unfamiliar, i.e., a person facing something new and unpredictable; the third factor represents the communication content, which can be emotional or concrete; the fourth factor reflects the outcome of the therapeutic work, that is, the patient's empowerment or making sense of the patient's experiences; and the fifth factor distinguishes daily issues, concerning everyday life, from relational ones, concerning loved ones. (In the negative pole of the fifth factor, Daily Issues, we find words such as house, stay, TV, rule, street, teacher, move out, neighbour, pounds; in the positive pole we find words such as mother, life, problem, sister, relationship.)

Table 1: Factorial space representation (percentage of explained inertia in brackets after each factor; CU = context units classified in the cluster).
Cluster (label, CU%) | Factor 1 (26.7%) Positioning | Factor 2 (25.8%) Relationship | Factor 3 (21.5%) Content | Factor 4 (14.5%) Outcome | Factor 5 (11.5%) Issues
1 Family Structure (11.6%) | Passive 0.20 | Familiar -0.56 | Emotional -0.16 | -0.01 | Daily -0.32
2 Transformative Process (12.1%) | Active -0.46 | Unfamiliar 0.29 | 0.06 | To empower -0.35 | Daily -0.16
3 Concrete Thinking (16.1%) | Passive 0.84 | Unfamiliar 0.34 | Concrete 0.42 | To empower -0.19 | 0.05
4 Therapeutic Relationship (22.4%) | Active -0.25 | Familiar -0.18 | Concrete 0.41 | To understand 0.28 | Relational 0.16
5 Relational Issues (14.6%) | 0.04 | Familiar -0.14 | Emotional -0.47 | To empower -0.18 | Relational 0.45
6 Feelings (23.1%) | 0.06 | Unfamiliar 0.58 | Emotional -0.43 | To understand 0.49 | Daily -0.14

Table 2: Psychotherapy core themes (keywords with the number of context units, CU, in which they occur).
Cluster 1, Family Structure: home 525, kid 371, house 290, father 241, husband 213, child 205, parent 194, stay 190, live 179
Cluster 2, Transformative Process: start 507, able to 504, change 438, different 396, situation 288, point 237, go on 216, mind 213, trying 183
Cluster 3, Concrete Thinking: hear 455, money 326, dollar 267, accept 205, pay 196, listen 175, believe 135, matter 130, sell 126
Cluster 4, Therapeutic Relationship: week 699, sense 675, day 438, bad 432, angry 381, call 253, night 189, morning 169, set 162
Cluster 5, Relational Issues: mother 399, life 335, problem 333, hard 292, care 268, deal 252, family 237, relationship 233, Father 153
Cluster 6, Feelings: understand 416, hurt 300, important 298, person 231, hard 213, pain 195, support 185, inside 170, strong 168
The six clusters are of different sizes (Table 1) and reflect the core themes of the brief psychotherapies (Table 2). The first cluster describes the family structure, with its roles and places; the second cluster reflects the transformative process characterizing a psychotherapy; the third cluster highlights a way of thinking that could be defined as concrete, often rational and frequently concerning economic issues; the fourth cluster represents the therapeutic relationship, made of concrete limits and of the process of making sense of personal experiences; the fifth cluster reflects the relational issues of the patient's private life; and the sixth cluster refers to the process of detecting, recognizing and understanding feelings, characterizing internal emotional experiences.

There is a significant difference between the good and poor outcome therapies in the number of context units classified in each cluster (χ², df = 5, p < 0.01). In particular, the differences lie in the relevance of two of the six core themes: concrete thinking and feelings. While the good outcome brief psychotherapies are characterized by a high number of context units classified in the cluster feelings (SR = 6.8) and a low number classified in the cluster concrete thinking (SR = -5.8), the poor outcome psychotherapies are characterized by a high number of context units classified in the cluster concrete thinking (SR = 6.8) and a low number classified in the cluster feelings (SR = -7.0). That is, patients seem to dwell upon their emotional experiences in the good outcome psychotherapies, while in the poor outcome psychotherapies they tend to dwell upon facts, probably without connecting them to their emotional experiences. Given that this analysis classified the interactions between patients and therapists, the therapy outcome could derive both from the patient's ability to deal with feelings and from the therapist's ability to support the patient in doing so.

The above-mentioned differences between good and poor outcome cases are coherent with findings obtained on the same sample by means of a principal component analysis of the transcripts coded according to three dictionaries: abstract language, positive emotional language, and negative emotional language (de Felice et al., 2018). In that study, differences in the correlation matrices between good outcome and poor outcome cases were evident. The most obvious one concerned the dynamic in which the patient made use of abstract/concrete language, interpreted very positively in poor outcome cases and very negatively in good outcome cases; in the latter, it was probably, and correctly, considered a patient's defense mechanism to be addressed. This was confirmed by the use of positive and negative emotional language, inversely proportional to abstraction, only in poor outcome cases.

4. Conclusion
In the clinical literature, talking about concrete events without any sort of emotional involvement is a defence mechanism that goes under the name of rationalization, and it represents a way of protecting the mind from painful feelings by using an abstract, intellectual and often concrete attitude in dealing with them. While the good outcome psychotherapeutic relationships seem capable of addressing the emotional content lying under the surface of the psychotherapeutic field (i.e., the therapist's use of negative emotional language), the poor outcome dynamics seem to be completely wrapped up in a process of avoiding it.
Both the PCA (de Felice et al., 2018) and the text analysis of the clinical transcripts confirmed the difficulty, in poor outcome psychotherapies, of working on the patient's emotional aspects. This bottom-up technique of text analysis of clinical transcripts turned out to be an enlightening tool for letting their latent dimensions emerge, arranging the clinical process and outcome, and therefore provides a very useful tool for clinical purposes.

References
Beck, A. T., Steer, R. A., & Garbin, M. G. (1988). Psychometric properties of the Beck Depression Inventory: Twenty-five years of evaluation. Clinical Psychology Review, 8: 77-100.
Bucci, W., Kabasakalian-McKay, R., & RA Research Group (1992). Scoring referential activity. Ulm, Germany: Ulmer Textbank.
Carli, R., Dolcetti, F., & Dolcetti (2004). L'Analisi Emozionale del Testo (AET): un caso di verifica nella formazione professionale. In Purnelle, G., Fairon, C., & Dister, A. (eds), Actes JADT 2004: 7es Journées internationales d'Analyse statistique des Données Textuelles, pp. 250-261.
Cordella, B., Greco, F., & Raso, A. (2014). Lavorare con corpus di piccole dimensioni in psicologia clinica: una proposta per la preparazione e l'analisi dei dati. In Née, E., Daube, M., Valette, M., & Fleury, S. (eds), Actes JADT 2014: 12es Journées internationales d'Analyse statistique des Données Textuelles (Paris, France, June 3-6, 2014), pp. 173-184.
de Felice, G., Orsucci, F., Mergenthaler, E., Gelo, O., Paoloni, G., Scozzari, A., Serafini, G., Andreassi, S., Vegni, N., & Giuliani, A. (2018). What differentiates good and poor outcome psychotherapies? A statistical mechanics approach to psychotherapy research. Nonlinear Dynamics, Psychology and Life Sciences. Submitted.
Gelo, O. C. G., & Salvatore, S. (2016). A dynamic systems approach to psychotherapy: A meta-theoretical framework for explaining psychotherapy change processes. Journal of Counseling Psychology, 63(4): 379-395.
Gelo, O. C. G., Salcuni, S., & Colli, A. (2013). Text analysis within quantitative and qualitative psychotherapy process research: introduction to special issue. Research in Psychotherapy: Psychopathology, Process and Outcome, 15: 45-53.
Greco, F. (2016). Integrare la disabilità. Una metodologia interdisciplinare per leggere il cambiamento culturale. Franco Angeli.
Greco, F., Maschietti, D., & Polli, A. (2017). Emotional text mining of social networks: The French pre-electoral sentiment on migration. Rivista Italiana di Economia Demografia e Statistica, 71(2): 125-136.
Greenberg, L., Rice, L., & Elliott, R. (1993). Facilitating Emotional Change: The Moment-by-Moment Process. Guilford Press.
Greenberg, L. S., & Watson, J. C. (1998). Experiential therapy of depression: differential effects of client-centered relationship conditions and process experiential interventions. Psychotherapy Research, 8: 210-224.
Lebart, L., & Salem, A. (1994). Statistique Textuelle. Dunod.
Mergenthaler, E. (2008). Resonating minds: A school-independent theoretical conception and its empirical application to psychotherapeutic processes. Psychotherapy Research, 18(2): 109-126.
Salvatore, S., Gelo, O., Gennaro, A., Metrangolo, R., Terrone, G., Pace, V., Venuleo, C., Venezia, A., & Ciavolino, E. (2017). An automated method of content analysis for psychotherapy research: A further validation. Psychotherapy Research, 27(1): 38-50.
Salvatore, S., Gennaro, A., Auletta, A. F., Tonti, M., & Nitti, M. (2012). Automated method of content analysis: A device for psychotherapy process research. Psychotherapy Research, 22(3): 256-273.
Savaresi, S. M., & Boley, D. L. (2004). A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4): 345-362.
Spitzer, R., Williams, J., Gibbons, M., & First, M. (1989). Structured Clinical Interview for DSM-III-R. American Psychiatric Association.
Watson, J. C., Greenberg, L. S., & Lietaer, G. (1998). The experiential paradigm unfolding: Relationship and experiencing in therapy. In Greenberg, L. S., Watson, J. C., & Lietaer, G. (eds), Handbook of Experiential Psychotherapy. Guilford Press.
DOMINIO: A Modular and Scalable Tool for Open Source Intelligence
Francesca Greco (Sapienza University of Rome, Prisma S.r.l., francesca.greco@uniroma1.it), Dario Maschietti (Prisma S.r.l., d.maschietti@prismaprogetti.it), Alessandro Polli (Sapienza University of Rome, alessandro.polli@uniroma1.it)

Abstract
Prisma has developed an innovative technology for Open Source Intelligence (OSINT) which aims to provide a solution for those knowledge management processes that require the intervention of a human operator, unaided by information technology (IT) support, in one or more stages of the procedure. Such intervention involves a considerable expenditure of time and resources that could be reduced through the use of an IT tool, partially or totally automating entire stages of the procedure. DOMINIO is a platform that implements tools for the automatic aggregation of online information, its analysis, the possible alignment with traditional databases, and its representation through infographic and georeferencing tools, in order to generate a report. This paper describes the platform architecture, the main algorithms used in the content analysis stage, and possible directions of development.

Abstract
Prisma has developed an innovative technology aimed at Open Source Intelligence (OSINT) intended to respond to knowledge management needs that require the intervention of a human operator, unaided by information technology (IT) support, in one or more stages of the procedure. Such intervention entails a considerable expenditure of time and resources, which could be reduced through the use of an IT tool that partially or totally automates entire stages of the procedure. DOMINIO is a platform that implements tools for the automatic aggregation of online information, its analysis, the possible alignment with traditional databases, and its representation through infographic and georeferencing tools, with the aim of generating reports. This paper describes the platform architecture, the main algorithms adopted in the content analysis stage, and possible directions of development.

Keywords: knowledge management, Open Source Intelligence tool, Information Technology

1. Introduction
There is a close link between data management and knowledge on the one hand, and between knowledge and innovation on the other. The growing mass of unstructured information from disparate channels (search engines, RSS feeds, social networks) and from traditional databases entails the need to drastically simplify the preparation, analysis and reporting stages required to structure the information. In fact, only structured information translates into knowledge. Knowledge, in turn, is a major driver of innovation and, properly managed, it translates into a competitive advantage.
The idea at the basis of the OSINT (Open Source Intelligence) tool stems from needs expressed by analysts, mainly those working in the sentiment analysis and opinion mining industry. However, this idea is comprehensive enough to encompass all those knowledge management activities, similar to the former, which require the intervention of a human operator, unaided by IT (Information Technology) support, in one or more stages of the procedure, an intervention which consumes a great deal of time and resources. Although machine learning systems are starting to spread in high-end solutions, the available technology is still characterized by significant limitations, especially in the presence of unstructured information. In particular, supervised machine learning systems require the intervention of an operator in the initial stages of the procedure and, more generally, any automated system applied to the analysis of a text is still unable to identify complex cognitive functions (for example, irony). Of course, these problems are immanent in many fields of OSINT, and they also affect the reporting stage, which requires the direct involvement of the analyst, unaided by IT. The availability of an IT tool that minimizes human operator intervention by partially or totally automating entire stages of the procedure would therefore yield substantial advantages: time savings, increased productivity and, as a result, a more efficient allocation of human and financial resources. Prisma has developed an innovative OSINT technology which aims to fix the problems briefly described above. The platform implements tools for the automatic aggregation of online information, its analysis, its alignment with traditional databases, and its representation through infographic and georeferencing tools, with the aim of also automating the drafting of the final report. This paper describes the architecture of the platform, the main analysis modules and possible directions of development.

2. Platform Architecture
DOMINIO is an OSINT (Open Source Intelligence) platform that automatically aggregates information from online and traditional databases, analyses it and generates reports on a user-defined subject. The platform collects information by querying several channels: search engines (Google, Yahoo, Bing), social networks (Facebook, Twitter, Google+), RSS feeds, blogs (Blogger, Wordpress, Tumblr) and traditional databases. The goal of DOMINIO is to build a structured set of contents, as broad as possible, and to carry out a wide range of qualitative and quantitative analyses. DOMINIO stores these contents in a non-relational database (MongoDB, 2018; Morphia, 2018), classifying the documents by channel of origin (Twitter, Facebook, RSS, etc.) to ensure the homogeneity of the collections. Among the options, the DOMINIO user can run queries on demand or in continuous mode. The on-demand option carries out an asynchronous search, while the continuous mode makes it possible to aggregate data periodically and to track a subject over an extended time span. DOMINIO's architecture allows the user to switch from one mode to the other; the availability of two search modes overcomes the trade-off between accuracy of analysis and speed of processing.
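The paper does not publish DOMINIO's storage code or document schema; the following minimal sketch, in Python with pymongo, only illustrates the kind of channel-classified storage and on-demand retrieval described above. The collection names, field names and the track_subject helper are hypothetical, not the actual implementation.

# Hypothetical sketch of channel-classified storage in MongoDB,
# in the spirit of DOMINIO's data layer (not the actual implementation).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
db = client["osint"]                               # hypothetical database name

def store_document(channel: str, subject: str, text: str) -> None:
    """Store one scraped document in the collection of its origin channel."""
    db[channel].insert_one({
        "subject": subject,
        "text": text,
        "retrieved_at": datetime.now(timezone.utc),
    })

def track_subject(subject: str, channels: list) -> list:
    """On-demand retrieval: collect all stored documents about a subject."""
    results = []
    for channel in channels:
        results.extend(db[channel].find({"subject": subject}))
    return results

store_document("twitter", "migration", "Example tweet text ...")
hits = track_subject("migration", ["twitter", "facebook", "rss"])
print(len(hits), "documents found")

Keeping one collection per channel, as sketched here, is one simple way to preserve the homogeneity of the collections mentioned in the text; a continuous mode would simply re-run the same retrieval on a schedule.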
With regard to one or more subjects selected by the operator, DOMINIO performs synchronous or asynchronous searches on a set of Internet channels, such as search engines (Google, Yahoo, Bing), social networks (Facebook, Twitter, Google+), RSS feeds and blogs (Blogger, Wordpress, Tumblr). The user can also extend the search to the Deep Web, through specific search engines such as Torch or Grams. Moreover, to meet specific information needs, DOMINIO can match these search results with the information obtainable from traditional databases to support many types of analysis (brand reputation, country risk assessment, opinion polls, cyber security, etc.), considerably increasing the operability and flexibility of the tool. Among the traditional databases already available, DOMINIO includes:
• IHS Jane's (2018), which provides updates on the military and political situation, terrorist acts, civil wars and transportation systems for most of the countries in the world;
• Bureau van Dijk (2018), which collects firm data on ratings, shareholdings, equity investments and M&A;
• MIG (a geographic information database drawn up by one of the authors).
In addition, for specific information purposes, DOMINIO can interface with Enterprise Resource Planning databases (such as SAP, Oracle, etc.) through market tools (Business Objects, QlikView). The search results are recalled by the analyst, who operates from a CMS (Content Management System) application to manage the structured set of contents and to conduct a wide range of qualitative and quantitative analyses (from simple summary statistics to sophisticated multivariate analyses and text and opinion mining techniques). The statistical methods implemented in DOMINIO are chosen by the Prisma research team according to a set of criteria that privileges the suitability of an algorithm to automate entire stages of the procedure, in accordance with the original design idea. Moreover, the modular architecture of DOMINIO, described briefly below, allows a quick integration of the latest analysis tools and innovative methodologies produced in the academic field. Once the content analysis stage is completed, the CMS application generates a micro-site containing the results (geo-referenced maps, summary statistics, multivariate analysis results, textual and semantic sentiment analysis, etc.). After selecting a graphic layout for the final report, the analyst only has to write notes and final remarks. The possibility of including features generating automatic and/or autocompleted comments, customizable by the user, is also being studied. Once this last stage is completed, the report is ready for online publication, for traditional diffusion in pdf format, or for linking to external services. From an architectural point of view, DOMINIO is designed following the most modern criteria of modular software design, with the parallel development of the platform's modules. In short, in order to ensure greater fault tolerance and high safety standards, the system is divided into three independent logical units (cf. Figure 1):
• DOMINIO Engine Unit (MEU), which implements the features of 1) scraping information from the sources mentioned above (web, social networks, RSS feeds, traditional databases); 2) storage of results in the MEDB database; 3) qualitative and quantitative analysis;
• DOMINIO RESTurl Unit (MRESTU), which receives requests from the MCMS unit, verifies their consistency and forwards them to the ME unit.
Upon receiving the response, it enriches it with additional fields (username, token, etc.) and returns it to the MCMS client. The MRESTU unit contains the database (MRESTDB) for user profiling;
• DOMINIO Content Management System Unit (MCMSU), which manages the reporting stage and the archiving of reports according to predefined criteria (organization by topic, chronologically, by template, etc.).

Figure 1 - DOMINIO General Overview

3. Main analysis modules
3.1. Country Threat Assessment
The Country Threat Assessment module supports the Company Intelligence and Security analyst in the country risk assessment process. Through a responsive interface, it aggregates information from the major global industry databases (e.g., IHS Jane's), giving an assessment of external and internal risk, of risk due to political and socio-economic factors, and of potential outbreaks or revolutionary movements for 192 different countries. Country Threat Assessment is integrated with intelligence information updated weekly for each country. Through an automatic report, the data is aggregated into a single file, optimizing the timing of risk assessment and providing a solid foundation for any further detailed analysis. DOMINIO offers the possibility of a full or partial information download and the generation of an automatic report, thus streamlining the drafting process.

3.2. Due Diligence
The Due Diligence module supports the Economic Intelligence analyst in the process of business valuation in relation to suppliers, partners and customers. The aspects analysed in the module include assessments of profitability and financial performance as well as creditworthiness. Through a simple and intuitive interface, the module aggregates information from leading industry databases and returns an economic, financial and credit risk profile for hundreds of millions of businesses around the world. The Due Diligence module also allows an assessment of individuals, through the analysis of politically exposed persons, returning an automatic report that integrates the main aspects of each business and its economic risk analysis.

3.3. Open Source Intelligence
This module aggregates large amounts of data from the major social networks (Facebook, Twitter, Youtube) and the main Italian newspapers, based on keywords predetermined by the analyst; it returns a statistical representation of the main trending topics and generates an output of structured data for subsequent multivariate analysis. Furthermore, the module allows the geo-referencing of content, highlighting geographically localized signals useful to the analyst. As for each of DOMINIO's modules, automatic reports can be generated.

3.4. Geographic Information Module
This module analyses the information inferable from a dataset of basic statistical information and related indicators, with reference to a multitude of subjects, nine of which are currently under development. The basic statistical information refers to the division of the Italian territory into provinces, covering a time period between 1995 and the latest available year, which for some subject areas is the current year or, more frequently, the year before the current one. The dataset will support a wide range of applications, from forecasting and scenario analysis to counterfactual and spatial analysis.
3.5. Text Mining Module
This module performs the automatic analysis of textual data using statistical methods (Lebart and Salem, 1994; Feldman and Sanger, 2006; Bolasco, 2013) in order to extract structured information. The main statistical methods for the analysis of textual data implemented in DOMINIO are: factor analysis (correspondence analysis, multiple correspondence analysis); cluster analysis (k-means, bisecting k-means, fuzzy clustering, etc.); network analysis; Markov analysis; pattern recognition. For example, during the French presidential campaign of 2017 we analysed the sentiment about migration, which was one of the most debated themes. We performed an Emotional Text Mining (Greco et al., 2017) in order to explore the emotional content of the Twitter messages concerning migration written in French in the last two weeks before the first round of the 2017 presidential election. The aim was to analyse the opinions, feelings and shared comments, classifying the contents and the sentiments. We retrieved the messages from the Twitter repository, collecting a sample of over one hundred thousand tweets. The large corpus of 2,154,194 tokens (TTR = 0.01; hapax percentage = 40.4) underwent a multivariate analysis based on a bisecting k-means algorithm (Savaresi and Boley, 2004) to classify the text, and a correspondence analysis (Lebart and Salem, 1994) to detect the latent dimensions structuring the cluster-per-keywords matrix. The advantage of this approach is that the factorial space can be interpreted according to word polarization, thus identifying the emotional categories that generate the representations of migration, and that the interpretation of clusters is facilitated by exploring their relationships within the symbolic space (Greco, 2016). The interpretation of the results allowed for the detection of seven representations of migrants, corresponding to three different sentiments: positive (42%), negative for the community (45%), and negative for migrants (13%). We considered as negative the representations of migrants as squatters, invaders, terrorists, trafficked slaves and migration victims, and as positive the sport heroes and the EU solidarity target. Among the negative clusters, we distinguished negativity according to the direction of the action: squatters, terrorists and invaders are negative for the community, while trafficked slaves and migration victims are negative for the migrants themselves (see Greco et al., 2017). Moreover, it was possible to highlight the connection between real-life events and the production of tweets. While the terrorist attack in the centre of Paris three days before the first round of voting had only slightly modified the production of messages, the candidates' interviews had a higher impact. This suggests that media coverage was more important than the terrorist attack in driving the production of messages (see Greco et al., 2017).
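The authors' Emotional Text Mining pipeline itself is not listed in the paper; the sketch below, assuming scikit-learn (version 1.1 or later for BisectingKMeans), only illustrates the two steps named above: clustering the documents with a bisecting k-means and building the cluster-per-keywords matrix that a correspondence analysis would then decompose. The toy tweets and all parameter choices are invented for illustration.

# Minimal sketch of the clustering step of an Emotional Text Mining analysis:
# bisecting k-means over a bag-of-words matrix, then a cluster x keyword
# contingency table to be passed to a correspondence analysis.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import BisectingKMeans  # scikit-learn >= 1.1

tweets = [
    "les migrants envahissent nos villes",   # toy examples, not real data
    "solidarite avec les migrants",
    "victimes de la traite des etres humains",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)         # documents x keywords

n_clusters = 2                               # 7 clusters in the study above
labels = BisectingKMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)

# Cluster-per-keywords matrix: one row per cluster, keyword frequencies.
table = np.zeros((n_clusters, X.shape[1]))
for cluster in range(n_clusters):
    table[cluster] = np.asarray(X[labels == cluster].sum(axis=0)).ravel()

for cluster in range(n_clusters):
    top = table[cluster].argsort()[::-1][:3]
    print(cluster, [vectorizer.get_feature_names_out()[i] for i in top])

The resulting table is exactly the kind of cluster-per-keywords matrix on which the correspondence analysis cited above (Lebart and Salem, 1994) detects the latent emotional dimensions.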
4. Conclusion
The innovative aspect that characterizes DOMINIO is its ability to aggregate data of different types and from different information channels automatically, simply and transparently. Moreover, its structure allows for the integration of the latest analytical tools and innovative methodologies produced in academia. By means of an automated reporting system, the analyst is supported in the assessment of risk and in the collection of information in the geopolitical and economic fields and from open sources. The set of modules allows the analyst to generate knowledge from an ever-growing amount of data, streamlining the processes of assessment and risk reduction.

References
Bolasco S. (2013). L'analisi automatica dei testi: Fare ricerca con il text mining. Carocci.
Bureau van Dijk (2018). A Moody's Analytics Company. Bureau van Dijk, https://www.bvdinfo.com/it-it/home
Feldman R. and Sanger J. (2006). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
Greco F. (2016). Integrare la disabilità. Una metodologia interdisciplinare per leggere il cambiamento culturale. Franco Angeli.
Greco F., Maschietti D. and Polli A. (2017). Emotional text mining of social networks: The French pre-electoral sentiment on migration. RIEDS, 71(2): 125-136.
IHS Jane's (2018). Jane's Information Group. IHS Jane's, http://www.janes.com
Lebart L. and Salem A. (1994). Statistique Textuelle. Dunod.
MongoDB (2018). MongoDB for GIANT ideas. MongoDB, https://www.mongodb.com
Morphia (2018). The Java Object Document Mapper for MongoDB. MongoDB, https://mongodb.github.io/morphia/
Savaresi S.M. and Boley D.L. (2004). A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. Intelligent Data Analysis, 8(4): 345-362.

Is training worth the trouble? A PoS tagging experiment with Dutch clinical records

Leonie Grön, Ann Bertels, Kris Heylen
KU Leuven – leonie.gron@kuleuven.be; ann.bertels@kuleuven.be; kris.heylen@kuleuven.be

Abstract
Part-of-speech (PoS) tagging is a core task of Natural Language Processing (NLP), which crucially influences the output of advanced applications. For the tagging of specialized language, such as that used in Electronic Health Records (EHRs), the domain adaptation of taggers is generally considered necessary, since the linguistic properties of such sublanguages may differ considerably from those of general language. Previous research suggests, though, that the net benefit of domain adaptation varies across languages. Therefore, in this paper, we present a case study to evaluate the effect of training with in-domain data on the tagging of Dutch EHRs.

Keywords: Electronic Health Records; Part-of-Speech tagging; medical sublanguage; Dutch

1. Background
EHRs are valuable resources for data-driven knowledge-making. To unlock the relevant information from free text, domain-specific NLP systems are required. Such systems must deal with a text genre characterized by a high density of specialized terms, including non-canonical variants, and by non-standard syntactic constructions. These properties affect all further steps in a processing pipeline, starting from core tasks such as PoS tagging. Since PoS values are important features for further processing, the output of many systems, such as tools for term extraction and term-to-concept mapping (e.g. Doing-Harris et al., 2015; Scheurwegs et al., 2017), crucially depends on the accuracy of the PoS tags assigned in the first place. Processing suites such as cTAKES (e.g. Savova et al., 2010), which have been developed specifically for the medical domain, are known to boost tagging performance. As most tools are only available for English, though, systems dealing with other languages, such as Dutch, must start the domain adaptation from scratch. Typically, this process involves the re-training of an existing tool on hand-coded data, which is time- and labor-intensive.
Besides, evidence from German challenges the widely-held belief that domain training is a prerequisite for good tagging performance (Wermter and Hahn, 2004). Given these considerations, we conduct a pilot study to investigate the potential benefit of domain adaptation for the PoS tagging of Dutch EHRs. Firstly, we assess the impact of training with a hand-coded clinical dataset on the accuracy of an off-the-shelf tagger. Secondly, we evaluate how the difference in accuracy affects the output of a term extraction method based on PoS patterns.

2. Related Work
For the PoS tagging of clinical writing, the main challenges reside in the particular linguistic properties of the genre, both at the lexical and the syntactic level. On the one hand, EHRs contain a high proportion of specialized terminology and idiosyncrasies, including misspellings and non-canonical abbreviations; a tagger developed for general language will thus encounter a high number of out-of-vocabulary words (Knoll et al., 2016). To complicate this matter, the PoS distributions in clinical corpora differ from those found in general language, which may be detrimental to the statistical classification of unknown or ambiguous tokens (Pakhomov et al., 2006). On the other hand, EHRs are typically composed in a telegraphic style, characterized by the omission of functional syntactic elements; the lack of linguistically informative context may prevent the accurate prediction of PoS transitions within n-grams (Coden et al., 2005). At the same time, the average sentence length in EHRs is relatively short; the high number of inter-sentential transitions may pose additional pitfalls for an out-of-domain tagger (Pakhomov et al., 2006). Most previous research thus agrees that the use of off-the-shelf taggers on clinical writing is highly prone to errors, which are likely to be propagated through the different levels of an application (Ferraro et al., 2013). Therefore, many state-of-the-art systems use an annotated set of EHRs for training. The creation of training materials comes at a cost, though, and entails a range of methodological challenges in itself, such as the creation of suitable guidelines and tagsets (Albright et al., 2013). To circumvent these issues, alternative ways of domain adaptation have been explored, including the integration of a domain-specific vocabulary and the exploitation of morphological features to classify unknown words (Knoll et al., 2016). However, languages other than English may present a different case: in an early study, Wermter and Hahn (2004) come to the conclusion that in German, taggers trained on newswire perform very well on EHRs. This surprising finding can be partly attributed to the rich inflectional system of the language, which lends itself to the prediction of PoS categories. On the other hand, the low complexity of the medical sublanguage may be a factor: in their study, the general training data subsumed all PoS transitions found in the clinical test data, so that the tagger was sufficiently equipped to handle the latter.

3. Methods
3.1. Corpus and manual tagging
Our study is based on the analysis of a mixed sample of EHRs, containing a total of 375 documents.
As detailed in Table 1, the subsets of this sample differ with regard to their medical subdomain, institutional origin and document structure: the EN and RD sets cover only one medical specialty, whereas the DL, SP and GP sets are less homogeneous; the DL, EN and RD sets were composed at a single institution, while the documents in the GP and SP sets are drawn from a multi-source database, Integrated Primary Care Information (IPCI), which contains EHRs from medical practices all across the Netherlands. Finally, the EHRs in four subsets (DL, GP, RD, SP) had been split into shorter fragments to comply with privacy standards; these documents are therefore much shorter than those in the EN set, which count 204.2 tokens on average. All EHRs are tokenized with the NLTK tokenizer (http://www.nltk.org/_modules/nltk/tokenize.html) and manually labelled by the authors, using the Universal Tagset (Petrov et al., 2012). Finally, for each subset, the EHRs are split into a training and a test set, containing 67% and 33% of the files respectively.

Table 1: Overview of the subsets of our file sample. The columns specify the name of the subset, the document types, the origin and number of institutions involved in their creation, the number of documents, the absolute length in tokens, and the average document length in tokens.

Subset  Document types                                             Origin                    Nr. of sources  Nr. of documents  Subset length  Average document length
DL      Clinical discharge letters                                 EMC Rotterdam             One             88                3597           40.88
EN      EHRs from endocrinology                                    UZ Leuven                 One             80                16337          204.2
GP      EHRs from general practitioners                            IPCI (Vlug et al., 1999)  Multiple        60                1431           23.85
RD      EHRs from radiology                                        EMC Rotterdam             One             60                1441           24.02
SP      Specialist letters from various fields (e.g. cardiology)   IPCI (Vlug et al., 1999)  Multiple        87                4784           54.99
Σ                                                                                                            375               27590          73.57

3.2. Evaluation
3.2.1. Effect of domain training on tagging performance
Firstly, we assess the impact of using in-domain data for training on tagging accuracy. For evaluation, we use the state-of-the-art Perceptron Tagger (http://www.nltk.org/_modules/nltk/tag/perceptron.html). This tagger uses context tokens as well as suffix features for classification. As Knoll et al. (2016) show, this configuration outperforms a primarily sequential tagger, as used by Wermter and Hahn (2004), on clinical data. The pre-compiled model for Dutch is trained on the Alpino Treebank (van Noord, 2006). In addition, we build a domain-specific model based on the manually labelled training set. Then, we feed both models into the tagger to classify the test set. To measure the accuracy of each model, we calculate the precision, i.e. the proportion of tags that match those in the manually labelled gold standard. (The Alpino model uses a more fine-grained tagset than the Universal Tagset used for the manual tagging. To enable the comparison across models, the redundant labels from Alpino are mapped to the respective categories of the Universal Tagset, e.g. adj, comparative → ADJ.) To compare the effect across the different subsets, we calculate the gain in precision achieved with the domain model relative to the precision achieved with the Alpino baseline.
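The training and evaluation loop described here is straightforward to reproduce with NLTK; the following sketch assumes the hand-labelled EHRs are already available as lists of (token, tag) sentences. The toy sentences are invented for illustration and are not from the authors' corpus.

# Minimal sketch of training NLTK's averaged-perceptron tagger on in-domain
# sentences and measuring tagging precision against a gold standard.
from nltk.tag.perceptron import PerceptronTagger

# Hypothetical hand-labelled clinical sentences, Universal Tagset style.
train_sents = [
    [("zwelling", "NOUN"), ("enkel", "NOUN")],
    [("diabetische", "ADJ"), ("retinopathie", "NOUN")],
]
gold_sents = [
    [("oedeem", "NOUN"), ("enkel", "NOUN")],
]

# load=False skips the pre-trained English model; we train from scratch.
tagger = PerceptronTagger(load=False)
tagger.train(train_sents)

correct = total = 0
for gold in gold_sents:
    tokens = [tok for tok, _ in gold]
    predicted = tagger.tag(tokens)
    correct += sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    total += len(gold)

print("precision:", correct / total)

Evaluating the pre-compiled Alpino baseline would follow the same loop, simply loading that model instead of training one.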
3.2.2. Effect of tagging performance on term recognition and extraction
Secondly, we quantify the effect of tagging performance on pattern-based term recognition. For the identification of term candidates, we use a set of PoS sequences that are characteristic of termhood in the domain. Similar to Scheurwegs et al. (2017), we focus on complex nominals, i.e. nouns surrounded by one or more modifiers; Table 2 provides some examples of such patterns.

Table 2: Examples of PoS patterns used for term retrieval. The left column lists the target tag sequence, the middle and right columns provide Dutch examples and English translations of term candidates.

PoS pattern            Dutch example               English translation
adjective noun         'diabetische retinopathie'  diabetic retinopathy
noun adposition noun   'syndroom van Apert'        syndrome of Apert
noun noun              'zwelling enkel'            swelling ankle

Using a sliding-window approach, we iterate through the three tagged versions of the test set, i.e. the manually tagged gold standard, the version tagged with the Alpino model and the version tagged with the domain model. We identify all PoS sequences that match the pre-specified patterns, and extract the respective tokens for manual validation. For each version, we calculate the precision as the proportion of true positives, i.e. domain-specific phrases, relative to the total list of matches. (To qualify as domain-specific, a phrase must contain at least one noun that has a concept entry in the clinical terminology SNOMED-CT (International Release July 2017; http://browser.ihtsdotools.org/). For instance, 'echografie rechterschouder' echography right shoulder, which refers to a clinical procedure, would count as a true positive; the general expression 'pak koekjes' bag of biscuits would not.) To assess the individual effect size, we also calculate the relative gain in precision for each subset.
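A sliding-window matcher over tagged sentences can be written in a few lines; the sketch below is a simplified stand-in for the authors' extraction step, with the pattern list taken from Table 2 and an invented tagged sentence as input.

# Minimal sliding-window extraction of term candidates whose PoS sequence
# matches one of the target patterns from Table 2.
PATTERNS = [
    ("ADJ", "NOUN"),          # adjective noun
    ("NOUN", "ADP", "NOUN"),  # noun adposition noun
    ("NOUN", "NOUN"),         # noun noun
]

def extract_candidates(tagged):
    """tagged: list of (token, tag) pairs; returns matched token sequences."""
    tags = [tag for _, tag in tagged]
    candidates = []
    for pattern in PATTERNS:
        width = len(pattern)
        for i in range(len(tags) - width + 1):
            if tuple(tags[i:i + width]) == pattern:
                candidates.append(" ".join(tok for tok, _ in tagged[i:i + width]))
    return candidates

# Invented example sentence, tagged with the Universal Tagset.
sentence = [("syndroom", "NOUN"), ("van", "ADP"), ("Apert", "NOUN"),
            ("met", "ADP"), ("diabetische", "ADJ"), ("retinopathie", "NOUN")]
print(extract_candidates(sentence))
# ['diabetische retinopathie', 'syndroom van Apert']

The extracted candidates would then be validated manually against SNOMED-CT, as described above, to separate true positives from out-of-domain phrases.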
‘kleine boterhammen’ small sandwiches). Then, we calculate the proportion of error types among the false positives provided by both models. With the Alpino model, the vast majority of errors (74.4%) is based on false PoS tags. About 18.2% of the proposed term candidates are out-of-domain, while only a small portion (7.3%) of errors is caused by mistakes in segmentation. Conversely, with the domain model, most false positives (49.7%) are out-of-domain terms; errors in tagging and segmentation account for 30.1% and 20.2% respectively. Table 3 : Precision of PoS tagging and term extraction across subsets. The first column specifies the subset. The second and third column provide the percentage of correct tags assigned by the domain model and the Alpino model respectively; the fourth column contains the relative increase in precision. The remaining three columns provide the corresponding values for the extraction task. Term extraction PoS tagging % % Prec Prec % domain subset % Prec domain model % Prec Alpino % increase model Alpino increase DL 89.62 EN GP 76.61 16.99 7.33 2.64 177.87 86.82 67.5 28.62 21.48 8.04 167.1 79.81 61.76 29.23 3.28 0.84 291.31 RD 88.98 74.1 20.08 8.89 3.31 168.52 SP 83.68 54.5 53.53 5.52 2.26 144.09 Σ 85.78 66.9 29.69 9.3 3.42 189.78 4. Discussion Overall, the positive effect of domain adaptation is evident: Using clinical data for training improved the accuracy of PoS assignments and, as a consequence, the output of the term extraction method. Based on our results, we do not see a clear relation between the amount of training data and the global level of precision: For PoS tagging, DL and RD, which are among the smaller subsets, score highest; on the other hand, for the term extraction task, EN, which is the largest subset, produces the best results by far. This indicates that the benefit of training hinges on linguistic and semantic qualities, rather than the mere quantity of the data. In particular, tagging performance correlates with the homogeneity and wellformedness of the data. The homogeneity depends, on the one hand, on the medical field: A dataset such as RD, which is confined to one clinical JADT’ 18 357 specialty, only makes reference to a fairly limited number of medical concepts; by contrast, a more heterogeneous set, such as SP, covers a wider range. Besides, the number of institutions involved in data creation plays a role: In an EHR sample provided by a single hospital, such as EN, it is likely that preferred terms and phrases are perpetuated throughout the dataset. By contrast, in a set drawn from a multi-source database, such as GP, the potential for variation is higher. Both these factors affect the overall size of the vocabulary, which, in turn, determines the complexity of the tagging task. The well-formedness, on the other hand, depends mainly on the EHR type. The GP set, for instance, contains mostly notes intended for internal documentation; these notes are written in an informal style, whereby function words and suffixes may be left out or truncated. As these features usually serve as predictors for PoS classification, their omission may cause a drop in tagging performance. While the global level of precision is thus lowest in conceptually and lexically EHR samples, such as GP and SP, the relative benefit of domain adaptation is the greatest here. 5. Conclusion We conclude that the training with in-domain data benefits the output of PoS taggers for clinical Dutch. 
Especially if the file sample covers different subdomains, or if the language used deviates strongly from the standard, the potential gain in performance is great. At the same time, considerable training efforts may be required to achieve only marginal improvements. Depending on the scope of the project and the composition of the sample, it may thus be preferable to implement a cheaper alternative, for instance by integrating a domain dictionary into the tagger.

Acknowledgements
This work was supported by Internal Funds KU Leuven.

References
Albright D., Lanfranchi A., Fredriksen A., Styler W.F., Warner C., Hwang J.D., Choi J.D. et al. (2013). Towards Comprehensive Syntactic and Semantic Annotations of the Clinical Narrative. J Am Med Inform Assoc vol. 20: 922-30.
Coden A.R., Pakhomov S.V., Ando R.K., Duffy P.H. and Chute C.G. (2005). Domain-Specific Language Models and Lexicons for Tagging. J Biomed Inform vol. 38: 422-30.
Doing-Harris K., Livnat Y. and Meystre S. (2015). Automated Concept and Relationship Extraction for the Semi-Automated Ontology Management (SEAM) System. J Biomed Semantics vol. 6 (15): 1-15.
Fan J.-W., Prasad R., Yabut R.M., Loomis R.M., Zisook D.S., Mattison J.E. and Huang Y. (2011). Part-of-Speech Tagging for Clinical Text: Wall or Bridge between Institutions? In AMIA Annu Symp Proc, pp. 382-91.
Ferraro J.P., Daumé H.I., DuVall S.L., Chapman W.W., Harkema H. and Haug P.J. (2013). Improving Performance of Natural Language Processing Part-of-Speech Tagging on Clinical Narratives through Domain Adaptation. J Am Med Inform Assoc vol. 20: 931-39.
Knoll B.C., Melton G.B., Liu H., Xu H. and Pakhomov S.V.S. (2016). Using Synthetic Clinical Data to Train an HMM-Based POS Tagger. In 2016 IEEE-EMBS (International Conference on Biomedical and Health Informatics), pp. 252-55.
van Noord G. (2006). At Last Parsing Is Now Operational. In Proceedings of TALN 2006, pp. 20-42.
Pakhomov S.V., Coden A. and Chute C.G. (2006). Developing a Corpus of Clinical Notes Manually Annotated for Part-of-Speech. Int J Med Inform vol. 75: 418-29.
Petrov S., Das D. and McDonald R. (2012). A Universal Part-of-Speech Tagset. In Calzolari N., Choukri K., Declerck T., Doğan M.U., Maegaard B., Mariani J., Moreno A., Odijk J. and Piperidis S., editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pp. 2089-96.
Savova G.K., Masanz J.J., Ogren P.V., Zheng J., Sohn S., Kipper-Schuler K.C. and Chute C.G. (2010). Mayo Clinical Text Analysis and Knowledge Extraction System (cTAKES): Architecture, Component Evaluation and Applications. J Am Med Inform Assoc vol. 17: 507-13.
Scheurwegs E., Luyckx K., Luyten L., Goethals B. and Daelemans W. (2017). Assigning Clinical Codes with Data-Driven Concept Representation on Dutch Clinical Free Text. J Biomed Inform vol. 69: 118-27.
Vlug A.E., van der Lei J., Mosseveld B.M., van Wijk M.A., van der Linden P.D., Sturkenboom M.C. and van Bemmel J.H. (1999). Postmarketing Surveillance Based on Electronic Patient Records: The IPCI Project. Methods Inf Med 38 (4/5): 339-44.
Wermter J. and Hahn U. (2004). Really, Is Medical Sublanguage That Different? Experimental Counter-Evidence from Tagging Medical and Newspaper Corpora. In Fieschi M., Coiera E. and Li Y.-C.L., editors, Proc. of the 11th World Congress on Medical Informatics (MEDINFO 2004), pp. 560-64.

Textual statistics tools for analysing survey data corpora from official statistics

France Guérin-Pace, Elodie Baril
Institut national d'études démographiques

Abstract
For more than 20 years, textual statistics methods have allowed us to explore and analyze data from official statistics surveys and the different corpora they contain: answers to an open question, associated words, significant life events. Based on three corpora of data, namely the Populations-Espaces de vie-Environnements survey (Ined, 1992), the EuroBroadMap survey on representations of Europe in the world (2009) and, more recently, the Information et Vie quotidienne survey on adult reading skills (INSEE, 2011), we demonstrate the diverse uses of these methods and their richness for characterizing corpus content in relation to the individual characteristics of respondents as well as to the other survey questions.
More recently, we have used these methods to post-code the events collected in the IVQ survey. We present the results of this work, together with the benefits and limitations of the textual statistics approach.

Keywords: textual statistics, open-ended questions, associated words corpus, post-coding.

1. Corpora of varied nature
Introducing open-ended questions into a general population survey is always a challenge for survey designers, even though textual statistics methods have long proven their value and effectiveness for processing them. Delineating the contours and accepted meaning of a catch-all word was the objective of introducing the open question "If I say environment, what does that evoke for you?" in the Populations-Espaces de vie-Environnements survey carried out in 1992 (INED) on a sample of 6,000 people representative of the French population. One of the objectives was to examine which representations people construct of the very notion of environment. A somewhat different collection technique was adopted, for example, in the EuroBroadMap survey conducted in 2009 in 18 countries. Surveying nearly 10,000 students around the world about their representations of Europe was one of the objectives of this European project. A central element of this design was to collect the words the students associate with Europe (the question asked was: "Which words do you associate most with 'Europe'? Choose at most 5 words.") after asking them to draw, according to their perception, its boundaries on a world map. Unlike the previous corpus, the words were not proposed as a list; it is the spontaneous representations that are collected. The advantage of this associated-words technique is that it constrains the format of the answers more strongly and yields a more homogeneous corpus. One of the main difficulties of this corpus is the language in which the associated words are collected. To partly solve this problem, we chose to have the answers translated into English for each country at data-entry time, following precise instructions (for reasons of cost and time, partners were instructed to translate the associated words into English themselves while keying in the questionnaires; the first textual analyses revealed inconsistencies and required going back to the questionnaires in their original language). Another form of qualitative material worth collecting in surveys concerns life events. For demographers, collecting elements of individual life courses has a highly relevant explanatory dimension, whether these are turning points, ruptures within biographical trajectories, or one-off elements without long-term consequences (Laborde et al., 2007).
This is what we set up in the Information et Vie quotidienne (IVQ) survey (Guérin-Pace, 2009). Significant events can be collected in an open or a closed form. The advantage of collecting them through a closed question is that systematic comparisons can be made, insofar as all respondents answer the same question. We introduced into the survey, as a closed question, the most frequently cited events (divorce or separation of the parents, death of a close relative, health problem, etc.). Events collected in a "closed" form do not allow all themes to be addressed, particularly those concerning sensitive subjects (cases of violence, for example). Collection by enumeration indeed quickly becomes intrusive, and sometimes inappropriate, for people who are not concerned. Moreover, with this approach, one assumes the a priori traumatic nature of an event without knowing whether Ego experienced it as such during childhood (Laborde et al., 2007). We therefore chose to complement this questioning with the following open question: "Did you experience another significant event during your childhood? If so, which one?". Nearly a quarter of respondents report an "other significant event" from their childhood in answer to this question. Among them, one in two mentions a death, one in ten an event linked to a health problem, and, in the same proportion, a situation of violence experienced during childhood (Baril and Guérin-Pace, 2016).

Table 1: Description of the corpora analysed

Survey                                           Corpus                         Number of responses  Number of occurrences  Number of distinct words
Populations-Espaces de Vie-Environnement (1992)  Environment                    4596                 28716                  2130
EuroBroadMap (2009)                              Words associated with Europe   9343                 40800                  5111
Information et Vie Quotidienne (2011)            Significant childhood events   3167                 15993                  2161

2. An underestimated step: reading the words of the corpus and lexical statistics
An essential first step of analysis is reading the lexicon of the most frequent words of a survey corpus. This lexicon alone gives an overview of the tone of the vocabulary (positive or negative) and of the registers addressed. For example, in the corpus of words associated with Europe, the first word with a pejorative connotation only appears in 26th position (colonialism). Reading the most frequent events indicates, for its part, the individual or collective (most often historical) character of the perceived events. For international or repeated surveys, lexical statistics make it possible to compare the richness of the vocabulary in a meaningful way. Thus, in the "Europe" corpus, comparing the proportions of distinct words (Figure 1) provides interesting information. It appears that the students surveyed in the countries furthest from the European Union (Cameroon, China, Russia, Brazil, India) have a more consensual or shared vision of Europe than those in countries which are members of it, or at its margins.
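The lexical statistics mentioned here (number of occurrences, number of distinct words and their ratio) are easy to reproduce; the following sketch computes them for toy lists of associated words. The per-country comparison of Figure 1 would simply apply the same ratio to each survey country; the word lists below are invented for illustration.

# Minimal lexical statistics of an associated-words corpus: occurrences,
# distinct words and lexical diversity, computed per country (toy data).
from collections import Counter

corpus = {  # hypothetical answers to "Which words do you associate with Europe?"
    "France":   ["culture", "euro", "history", "euro", "union"],
    "Cameroon": ["euro", "euro", "union", "euro", "union"],
}

for country, words in corpus.items():
    occurrences = len(words)
    distinct = len(Counter(words))
    print(country, "-", occurrences, "occurrences,", distinct, "distinct,",
          "diversity =", round(distinct / occurrences, 2))

A lower proportion of distinct words (as in the hypothetical "Cameroon" list) corresponds to the more consensual, shared vocabulary discussed above.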
Figure 1: Diversity of the words associated with Europe by survey country. Source: EuroBroadMap survey (2009).

3. Bringing out the content of an open question from the complete lexical table
Another application of textual analysis methods to a corpus of answers to an open question consists in extracting the lexical worlds following the Alceste methodology. A descending hierarchical classification (DHC) performed on the table crossing the answers to the open question with the lexicon associated with the word "environment" brings out two fundamentally different approaches to the notion of environment (Figure 2). The first addresses the environment through a cognitive approach concerning a physical and social space (quality of life, local surroundings, etc.), while the second approach is more symbolic or imaginary (iconography of nature, sensation of well-being).

Figure 2: The lexical worlds of the "environment" corpus (Alceste). In Guérin-Pace F., 1997.

4. Crossing spontaneous answers with closed questions
The limits of interpretation of an open question lie in the impossibility of interpreting what was not mentioned by the respondents. Complementing the design with closed questions remedies this. Following the open question, we therefore introduced two closed questions proposing a list of nouns and adjectives that the respondent could associate, or not, with the word "environment" (the questions were worded as follows: "Here is a list of nouns (adjectives). Which ones seem to you to be linked to the notion of environment? (For each one, answer yes or no)."). The joint observation of the answers to these two modes of questioning, through a multiple correspondence analysis (MCA) on the lexical table, enriches the analysis of the "spontaneous" content in the light of the closed representations. We thus observe (Figure 3) that the opposition between an environment made of "relations" and an environment made of "nature" (horizontal axis) is accompanied, for example, by the choice or refusal of words and adjectives describing urban nuisances. On the vertical axis, the opposition between an environment conceived as immediate proximity and an environment based on the relations between "man and his milieu" corresponds to an associated vocabulary that reinforces this perception. Close to the first perception, we find the words "house-yes", "friendly-yes", "security-yes" and "planet-no".

Figure 3: Proximity between the forms of the "environment" corpus and the proposed associations (Guérin-Pace F. and Garnier B., 1995). Reading: close to the words "health" or "liberty" cited in answer to the open question, we find "no" answers to the association of the word environment with the words "city" or "violence".
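Correspondence analysis, used throughout these applications to project words and answer modalities into a common factorial space, can be sketched directly from its definition. The snippet below runs a plain CA on a tiny invented contingency table via the SVD of the standardized residuals, using numpy only; it is a didactic stand-in, not the software actually used by the authors.

# Plain correspondence analysis of a small words x modalities contingency
# table (toy data), via the SVD of the standardized residuals.
import numpy as np

N = np.array([[12,  3],   # rows: words from the open question (invented)
              [ 4, 10],
              [ 8,  8]], dtype=float)

P = N / N.sum()                       # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sv) / np.sqrt(r)[:, None]   # principal row coordinates
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]

print("row coordinates on axis 1:", np.round(row_coords[:, 0], 3))
print("column coordinates on axis 1:", np.round(col_coords[:, 0], 3))

Plotting the row and column coordinates on the first two axes reproduces the kind of factorial plane read in Figure 3, with words and yes/no modalities positioned in the same space.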
5. Post-coding significant childhood events with textual statistics
Another, more recent application of these methods, the post-coding of answers to an open question, may seem to contradict the very spirit of textual statistics. More precisely, it consists in post-coding the events collected in the Information et Vie quotidienne (IVQ) survey. To do so, we performed a descending hierarchical classification (DHC) on the complete lexical table crossing the answers to the question "Did you experience other significant events?" with the lexicon of the corpus. We retain a partition into five classes, within which we observe a first dichotomy between events of a collective nature (the Algerian war, May 1968, etc.) and a set of classes evoking events of an individual nature: death, illness, accident and violence (Figure 4). To these five classes we added two further classes: a class labelled "Refusal", grouping all the answers that indicate the respondent's unwillingness to detail the significant event to the interviewer (while having answered affirmatively to the question "Did you experience another significant event?"); and a class labelled "Other", in which we grouped the unclassified answers (nearly 90% of the 3,167 answers to this question are classified). We then sought to refine this typology by specifying the actors possibly involved in the events. For example, within the "Illness" class (class 2), we filtered by means of a family vocabulary (father, mother, brother, sister, aunt, friend, etc.) and constructed four distinct sub-modalities according to the persons concerned.

Figure 4: Typology of significant childhood events. Source: IVQ survey, Iramuteq (Reinert method classification).

We proceeded in the same way for the "violence" class, this time distinguishing the persons concerned by the event and its possible perpetrator. We finally obtain a typology built on the open and closed questions, composed of 43 items (Baril and Guérin-Pace, 2016), which could be reused for other national surveys. In conclusion, these various applications to varied corpora from official statistics surveys highlight the diversity of the contributions of textual statistics methods. Today, more and more national surveys address sensitive topics (violence, precarity, illiteracy, etc.). The use of open-ended questions thus proves indispensable, allowing the researcher to objectify his or her approach. Textual statistics methods are essential in this perspective.

References
Baril E. and Guérin-Pace F. (2016). Compétences à l'écrit des adultes et événements marquants de l'enfance : le traitement de l'enquête Information et vie quotidienne à l'aide des méthodes de la statistique textuelle. Economie et statistique, n°490, pp. 17-36.
Guérin-Pace F. (2009). Illettrismes et parcours individuels. Economie et statistique, n°424-425.
Brennetot A., Emsellem K., Guérin-Pace F. and Garnier B. (2013). Dire l'Europe à travers le monde. Les mots des étudiants à travers l'enquête EuroBroadMap. Cybergeo: European Journal of Geography.
Guérin-Pace F. and Collomb P. (1998). Les contours du mot environnement : Enseignements de la statistique textuelle. L'Espace Géographique, n°1, pp. 41-52.
Guérin-Pace F. (1997). La statistique textuelle : un outil exploratoire en sciences sociales. Population, n°4, pp. 865-888.
Laborde C., Lelièvre E. and Vivier G. (2007). Trajectoires et événements marquants, comment dire sa vie : Une analyse des faits et des perceptions biographiques. Population, vol. 62(3), pp. 567-585.
Annotation-based Digital Text Corpora Analysis within the TXM Platform

Serge Heiden
Université de Lyon, ENS de Lyon, IHRIM – UMR5317, CNRS – slh@ens-lyon.fr

Abstract
This paper presents new developments in the TXM textual corpora analysis platform (http://textometrie.org) towards direct text annotation functionalities. Some annotations are related to a web-based external historical ontology called SyMoGIH, others to co-reference information between words or to word properties like part of speech or lemma. The paper discusses the methodological stakes of unifying in a single framework the production and the analysis of those annotations with the traditional ones already available in TXM, corresponding to the XML markup of the text sources and to the linguistic annotations automatically added to texts by NLP tools.

Keywords: textometry, TXM, digital text representation, XML, TEI, annotation, ontology, co-reference, part of speech, digital hermeneutic circle.

1. Introduction
TXM (Heiden, 2010) is a software platform offering textual corpora analysis tools. It is delivered as a standard desktop application for Windows, Mac and Linux, and as a web portal server application (http://textometrie.org). Its analysis tools combine qualitative tools like word lists, concordancing or text edition navigation (close reading) with synthetic quantitative tools like factorial analysis, clustering, keywords or statistical co-occurrence analysis (distant reading). To be able to work on texts, the platform first imports the corpus sources to build a rich internal representation of texts through the following general workflow:
a) first, the "base text" of each text is established: this operation implements "digital philology" principles and consists of decoding information in the various formats of the source documents (TXM can analyze three main types of corpora: corpora of written texts, possibly as paginated editions including facsimile images; corpora of transcribed recordings, possibly time-synchronized with the audio or video source; and parallel multilingual corpora aligned at the level of a textual structure such as the sentence or the paragraph) in order to decide primarily where the text limits, the internal structure boundaries, and the words and punctuation marks of the text are. The result is represented in a pivot XML format especially designed for TXM, called "XML-TEI TXM", which extends the standard encoding recommendations of the Text Encoding Initiative consortium (TEI Consortium, 2017);
b) then, natural language processing (NLP) tools are optionally applied to the base text to automatically add linguistic information like sentence boundaries, grammatical category (pos = part of speech) and lemma of words, by e.g. TreeTagger (Schmid, 1994). As NLP tools generally don't take XML as input, the pivot representation is first converted to raw text for NLP processing, and the results are added back into the XML-TEI TXM representation;
c) finally, a specialized representation of texts is built into TXM for efficient execution of its tools (by indexing for search engines and text edition rendering).
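The exact XML-TEI TXM schema is not reproduced in this paper; the sketch below, using Python's standard ElementTree, only illustrates how word-level annotations of the kind produced in steps a) and b) can be read back from a TEI-style pivot file. The element and attribute names are simplified assumptions, not the actual TXM format.

# Reading word tokens and their linguistic annotations from a simplified,
# hypothetical TEI-style pivot document (not the actual XML-TEI TXM schema).
import xml.etree.ElementTree as ET

PIVOT = """<text>
  <s n="1">
    <w pos="DET" lemma="le">Les</w>
    <w pos="NOUN" lemma="loi">loix</w>
  </s>
</text>"""

root = ET.fromstring(PIVOT)
for sentence in root.iter("s"):
    for w in sentence.iter("w"):
        # Each <w> carries both the surface form and the NLP annotations.
        print(w.text, w.get("pos"), w.get("lemma"))

Encoding both manual markup and automatic annotations as attributes on the same word elements is what lets the analysis tools described next treat them indiscriminately.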
From the point of view of TXM, the results of the NLP tools in b) are seen as automatic annotations added to the initial XML-TEI TXM representation of texts built in a), and the XML tags of the initial XML-TEI TXM representation in a) can be seen as manual annotations added to the base text (or raw text), typically edited philologically with the help of specialized XML editors (like Oxygen XML Editor, https://www.oxygenxml.com) outside of TXM when the source is in XML format, or as automatic annotations added by TXM when converting from some other format into XML-TEI TXM. All TXM tools apply indiscriminately to all types of annotation regardless of their origin (automatic or manual). Thus, TXM implements a traditional workflow combining a "text source encoding and annotation" step with an "application of analysis tools to annotated texts" step. The text analysis tools use text annotations (for example word pos) to offer their services and produce their results (for example the concordance of all infinitive verbs). The workflow is unidirectional, and the whole of it must be passed through again completely if any annotation needs to be corrected. To add or correct annotations, the user has to edit the sources or the annotations outside of TXM. For example, word properties can be exported from the XML-TEI TXM representation, edited in a spreadsheet and inserted back into the texts before re-import (see for example this tutorial based on TXM macros: https://groupes.renater.fr/wiki/txm-users/public/tutoriel_correction_mots). This paper introduces new services developed in TXM to annotate texts directly from within the results view of specific tools, for a better integration of philological and analytic work.

2. Annotation services in TXM
The new annotation services concern both adding and correcting information, and all the annotations edited are meant for further exploitation by the usual TXM tools.

2.1. SyMoGIH annotation by concordance
The first new service, developed in partnership with the LARHRA research laboratory in history (http://larhra.ish-lyon.cnrs.fr), is based on the annotation of concordance pivots: any sequence of words composing the pivots can be annotated with any semantic category coming from the SyMoGIH historical ontology framework (Beretta, 2015; http://symogih.org/?lang=en). (Pivots can also optionally be annotated with simple keywords or with key-value pairs, managed by TXM in a local repository.) In this architecture, the SyMoGIH web platform hosts the ontology of historical facts and knowledge, and concordances provide the user interface to link identifiers of those data to text spans for further analysis. As an illustration, see in Figure 1 the annotation of the "Faculté de droit d'Aix" entity (of id CoAc13562) in unverified OCRed texts of the "Bulletin administratif de l'Instruction publique" corpus (see the Bibliothèque historique de l'éducation (BHE) project: http://www.persee.fr/collection/bhe). TXM's internal management of those annotations is equivalent to a re-import of the current pivot representation of the annotated texts. After re-import (after saving annotations), the new annotations are available for all TXM tools to work on, like any original "annotation" of the texts (internal structures and their properties, word properties, etc.).

2.2. URS annotation in text edition
The second new service is based on the manual annotation of word sequences inside text editions with elements of a Unit-Relation-Schema (URS) annotation model. URS-type annotations are designed to encode discourse entities like co-reference chains in texts (Schnedecker, Glikman, & Landragin, 2017).
In a URS model, Units (or entities) have any number of properties and can be linked together by the two other annotation types: Relations, having any number of properties (1-to-1 relation type), and Schemas, having any number of properties (1-to-n relation type). All types and properties of units, schemas and relations are definable in the annotation model before and during annotation. The types and properties are chosen by the user; they are not limited to co-reference chains.

Figure 1: TXM screenshot of a Concordance of a "Faculté de droit d'Aix" word sequence pattern to annotate (top) and of browsing SyMoGIH semantic categories to use for the annotation (bottom).

The original URS model was designed and developed in the Glozz (Widlöcher & Mathet, 2009) and Analec (Landragin, Poibeau, & Victorri, 2012) software. It is being integrated into TXM through the text edition reading tool for a project funded by the French National Research Agency (ANR) called DEMOCRAT (http://www.agence-nationale-recherche.fr/en/anr-funded-project/?tx_lwmsuivibilan_pi2%5BCODE%5D=ANR-15-CE38-0008). As an illustration, see in Figure 2 the annotation of the "ses loix" word sequence with a unit of type MENTION, of grammatical category "GN.POS" and referent "les lois de la divinité", in the first chapter of the 1755 edition of De l'esprit des lois by Montesquieu. TXM's internal management of those annotations can be represented as new XML-TEI stand-off annotations anchored to the word elements of the XML-TEI TXM representation of texts (Grobol, Landragin, & Heiden, 2017).

Figure 2: TXM screenshot of the edition of the first page of De l'esprit des lois with units of type MENTION highlighted in yellow and the selected unit in bold (top) and the current values of the properties of the selected unit (bottom).

2.3. Word properties annotation by concordance
The third service will be based on the annotation of concordance pivot words: a word present in the pivots of a concordance will be able to be annotated with properties. The primary goal of this service is to annotate and correct the grammatical properties and lemma of the word elements of the XML-TEI TXM representation of texts. This development is carried out for a project co-funded by the ANR and the Deutsche Forschungsgemeinschaft (DFG) called PaLaFra (http://www.agence-nationale-recherche.fr/en/anr-funded-project/?tx_lwmsuivibilan_pi2%5BCODE%5D=ANR-14-FRAL-0006).

2.4. Editing XML sources
Finally, we are developing the possibility to directly edit the XML sources from within TXM through an internal XML editor. This editor will eventually be accessed through TXM tools as a "back to source" operation similar to the current "back to text" operation (for example from a concordance line to a text edition page).

3. Discussion
By using a common XML-TEI pivot representation for the internal management of corpora across all the annotation services, TXM unifies transcription and annotation activities in a single framework. In this framework, annotations represent manual (user), semi-automatic (machine+user) or automatic (machine) interpretation results used further for analysis and interpretation work.
2.3. Word properties annotation by concordance
The third service will be based on the annotation of concordance pivot words: a word present in the pivots of a concordance will be able to be annotated with properties. The primary goal of this service is to annotate and correct the grammatical properties and lemmas of the word elements of the XML-TEI TXM representation of texts. This development is carried out for a project co-funded by the ANR and the Deutsche Forschungsgemeinschaft (DFG) called PaLaFra (http://www.agence-nationale-recherche.fr/en/anr-funded-project/?tx_lwmsuivibilan_pi2%5BCODE%5D=ANR-14-FRAL-0006).

2.4. Editing XML sources
Finally, we are developing the possibility to directly edit the XML sources from within TXM through an internal XML editor. This editor will eventually be accessed through TXM tools as a "back to source" operation similar to the current "back to text" operation (for example, from a concordance line to a text edition page).

3. Discussion
By using a common XML-TEI pivot representation for the internal management of corpora across all the annotation services, TXM unifies transcription and annotation activities in a single framework. In this framework, annotations represent manual (user), semi-automatic (machine+user) or automatic (machine) interpretation results, used further for analysis and interpretation work. The reflexive nature of the resulting text analysis workflow is schematized in figure 3. Texts are first digitized by OCR, transcribed or converted from digital formats. They are then philologically corrected and established through manual XML-TEI encoding, and then automatically processed by NLP tools while being imported into TXM to produce the TXM internal corpus model. Corpus analysis is then assisted by TXM tools applied to the corpus model. The pivot representation that gathers all annotations produced by annotation tools is figured as the node labeled "Pivot rep.", and the interpretation workflow itself is figured as a digital hermeneutic circle.

Figure 3: Digital hermeneutic circle integration into TXM. Legend: red box = automatic annotation activity; black box = tool; blue box = manual annotation activity; green box = TXM corpus data model; purple disk = data representation; black arrow = activity; green arrow = annotation equivalence.

4. Conclusion
All the new annotation services integrated into TXM together build a comprehensive annotation-based digital text corpora analysis platform. From an epistemological point of view, the integration of different annotation models and tools into the platform should help its users to better define what comes from the source corpus they analyze and what comes from their own or others' interpretation work.

This work was funded by the ANR and the DFG under grant numbers ANR-15-CE38-0008 (DEMOCRAT project) and ANR-14-FRAL-0006 (PaLaFra project).

References
Beretta, F. (2015). Publishing and sharing historical data on the semantic web: the SyMoGIH project – symogih.org. Presented at the Workshop: Semantic Web Applications in the Humanities. Retrieved from https://halshs.archives-ouvertes.fr/halshs-01136533
Grobol, L., Landragin, F., & Heiden, S. (2017). Interoperable annotation of (co)references in the Democrat project. Presented at the Thirteenth Joint ISO-ACL Workshop on Interoperable Semantic Annotation. Retrieved from https://hal.archives-ouvertes.fr/hal-01583527/document
Heiden, S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In K. I. Ryo Otoguro (Ed.), 24th Pacific Asia Conference on Language, Information and Computation (pp. 389–398). Institute for Digital Enhancement of Cognitive Development, Waseda University. Retrieved from http://halshs.archives-ouvertes.fr/halshs-00549764/en/
Landragin, F., Poibeau, T., & Victorri, B. (2012). ANALEC: a New Tool for the Dynamic Annotation of Textual Data (pp. 357–362). Presented at the International Conference on Language Resources and Evaluation (LREC 2012). Retrieved from https://halshs.archives-ouvertes.fr/halshs-00698971/document
Schmid, H. (1994). Probabilistic Part-Of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing (Vol. 12).
Schnedecker, C., Glikman, J., & Landragin, F. (2017). Les chaînes de référence : annotation, application et questions théoriques. Langue française, (195), 5–16. https://doi.org/10.3917/lf.195.0005
TEI Consortium. (2017). TEI P5: Guidelines for Electronic Text Encoding and Interchange. TEI Consortium. Retrieved from http://www.tei-c.org/Guidelines/P5
Widlöcher, A., & Mathet, Y. (2009). La plate-forme Glozz : environnement d'annotation et d'exploration de corpus. In Actes de la 16e Conférence Traitement Automatique des Langues Naturelles (TALN'09), session posters (p. 10). Senlis, France. Retrieved from https://hal.archives-ouvertes.fr/hal-01011969
Quantifying Translation: an analysis of the conditional perfect in an English-French comparable-parallel corpus
Daniel Henkel
Université Paris 8 Vincennes St-Denis – dhenkel@univ-paris8.fr

Abstract
The frequency of the conditional perfect in English and French was observed in an 8-million-word corpus consisting of four 2-million-word comparable and parallel subcorpora, tagged by POS and lemma, and analyzed using regular expressions. Intra-linguistically, the Wilcoxon-Mann-Whitney test was used to compare authors and translators. Frequencies in source and target texts were evaluated using Spearman's correlation test to identify interlinguistic influences. Overall, the past conditional in English was found to have a stronger influence in the translation process.

Résumé
La fréquence du conditionnel parfait en anglais et en français a été observée dans un corpus de 8 millions de mots comprenant quatre sous-corpus comparables et parallèles de 2 millions de mots chacun, étiquetés par catégorie grammaticale et par lemme, et analysés par expressions rationnelles (regex). Le test de Wilcoxon-Mann-Whitney a servi pour comparer les auteurs et traducteurs, tandis que la corrélation entre textes-sources et -cibles a été évaluée au moyen du coefficient de corrélation de Spearman. Globalement, l'influence du conditionnel parfait en anglais sur le processus traductionnel paraît plus sensible.

Keywords: corpus, translation, regular expressions, statistical analysis, Wilcoxon-Mann-Whitney, Spearman, conditional perfect

1. Introduction
Since Corpus-based Translation Studies (CBTS) first began to gain momentum around the turn of the 21st century, differences have consistently been shown between corpora of translated English, French and other languages in comparison with untranslated reference corpora in the same languages. The hybrid nature of translated texts is thus now widely acknowledged as an established fact among specialists in the field (albeit with some divergence of opinion as to whether such differences are best interpreted as evidence of source-language interference or as consequences of the translation process regardless of the source language, i.e. characteristics inherent in the 'third code' or 'translationese'; cf. Koppel & Ordan, 2011), so much so that any further proof might seem superfluous. These studies have focused on phenomena such as the use of 'that' to introduce subordinate clauses (Olohan & Baker, 2000), contractions (Olohan, 2003), manner-of-motion verbs (Cappelle, 2012) and existential predications (Loock & Cappelle, 2013), most often in terms of their overall frequency. (Olohan (2002) apparently subscribes to Stubbs' (2001) view that "corpus linguistics […] investigates relations between frequency and typicality, and instance and norm. It aims at a theory of the typical," while nonetheless encouraging investigation of individual translators' styles in her conclusion; the predominance of this approach is confirmed again over a decade later by Loock (2013), who observes that "many studies within the CBTS framework still solely rely on overall quantitative analyses to establish differences between original and translated languages.") Such comparisons have provided valuable insights about the languages involved and the translation process. Little consideration has been given so far, however, to the fact that each language-system consists of many individual styles or idiolects which gravitate around a common center, but individually exhibit widely differing characteristics. In other words, while the variation from one author or translator to another is inherent in the very nature of corpus linguistics, this dimension remains absent from the equation in many, if not most, corpus-based translation analyses.

2. Methods
Two important terminological distinctions must be made at the outset. The first is between ex nihilo, a.k.a. 'original', English (En0) and French (Fr0), i.e. discourse in each language produced independently of any known prior influence, as opposed to English-translated-from-French (EtrF) and French-translated-from-English (FtrE), which will be used to refer to translations into each language, based on a pre-existing work in the other language, and therefore potentially subject to inter-linguistic influences.
The second distinction is between two sorts of bilingual corpora, 'comparable' and 'parallel'. In keeping with the clarification offered by McEnery & Xiao (2007), the term 'comparable corpus' will hereafter refer to a bilingual corpus consisting of two subcorpora of ex nihilo English and French texts, which are therefore not translations of one another, but which share a certain number of common characteristics, whereas the term 'parallel corpus' will designate a corpus made up of one sub-corpus of ex nihilo works in a source language and another sub-corpus consisting of the translations of those same works into the target language.

The corpora used in this study were compiled from public-domain works available in electronic format (.epub, .mobi, .html or .txt), the translations of which were also available in electronic format via publicly available sources (primarily Project Gutenberg). Common criteria based on size and date (whenever several works by the same author were available, preference was given either to the most recent or to the one with the highest word-count; in general, date was given precedence over size, except where a major difference in word-count was found between works published within a relatively close interval) were then used to select 20 works by 20 different authors in En0 and the same number in Fr0, so as to obtain, first of all, two reference subcorpora comparable in terms of date, size, discourse type and diversity:

Table 1: Summary of characteristics for the comparable En0 and Fr0 subcorpora. Word-counts were estimated using the text editor Geany, after replacing punctuation with whitespace, given that punctuation has been found to artificially inflate word-counts in French as compared to English.
Subcorpus 1, En0 (n=20) – Word-counts: max. 199,976 (Collins, The Moonstone); min. 59,771 (Mansfield, The Garden-party); median 99,558 (Wells, The War in the Air); total 2,114,517. Dates: max. 1928 (Woolf, Orlando); min. 1868 (Collins, The Moonstone); median 1901 (Kipling, Kim).
Subcorpus 2, Fr0 (n=20) – Word-counts: max. 192,521 (Zola, Les trois villes: Paris); min. 62,539 (Rolland, Les précurseurs); median 90,873 (Leroux, La chambre jaune); total 2,083,787. Dates: max. 1921 (Leblanc, Les dents du tigre); min. 1866 (Gaboriau, L'affaire Lerouge); median 1901 (Bazin, Les Oberlé).

The translations of these works were then compiled into two subcorpora of EtrF and FtrE, so as to produce an 8m-word 'super-corpus' consisting of four 2m-word subcorpora, designed to be both comparable and parallel and thereby provide a basis for three types of comparisons:
– between En0 and Fr0, in order to establish benchmark data for each language,
– between EtrF and En0, so as to ascertain whether the linguistic indicator
under investigation, i.e. the conditional perfect, has a similar distribution in EtrF compared to En0, and likewise for FtrE in comparison with Fr0,
– between source and target texts, to determine whether correlations exist between the parallel subcorpora (i.e. EtrF~Fr0 and FtrE~En0) which could be taken as evidence of interlinguistic interference.

All of the texts were cleaned of metatext, tagged for POS and lemma in TreeTagger, and interrogated in TextSTAT using the following regular expressions to target the conditional perfect.

English (all verbs):
d) (((w|c|sh)ould)|('d)|(might)|(ought))(e?st)?/\S+( \S+/RB[RS]?/\S+)*( to/\S+)?( ((ha|')ve|of)/\S+)( \S+/RB[RS]?/\S+)* \S+/V[BHV][ND]/

French (verbs taking AVOIR as an auxiliary; verbs taking ÊTRE; reflexive constructions):
e) \S+/VER:cond/avoir( \S+/ADV/\S+)* \S+/VER:pper
f) \S+/VER:cond/être( \S+/ADV/\S+)* \S+/VER:pper/(r[eé])?(aller|(ad|de|inter|par|pro|sur)?venir|rester|demeurer|(ap|dis)?paraître|naître|mourir|décéder|arriver|partir|tomber|monter|descendre|passer|rentrer|retourner|sortir)
g) ((je/\S+( \S+/ADV/\S+)* m[e']/\S+)|(tu/\S+( \S+/ADV/\S+)* t[e']/\S+)|(nous/\S+( \S+/ADV/\S+)* nous/\S+)|(vous/\S+( \S+/ADV/\S+)* vous/\S+)|(s[e']/\S+))( en|y/\S+)* \S+/VER:cond/être( \S+/ADV/\S+)* \S+/VER:pper/

The results obtained from these queries were converted into frequencies per 1,000 words (freq./1k) for each author or translator and analyzed using the Wilcoxon-Mann-Whitney and Spearman tests as described in the following section.

3. Results and analysis
The data collected from each of the subcorpora are presented in the following tables and summarized in Fig. 1.

Table 2a: Conditional perfect frequencies in En0 (author, Cond.Pf. n, words n, freq./1k).
Buchan 139 102022 1.36
Burnett 78 84093 0.93
Collins 326 199976 1.63
ConanDoyle 108 105040 1.03
Cox 142 114352 1.24
Eliot 319 164456 1.94
Hardy 254 153076 1.66
Hope 115 83189 1.38
Joyce 26 69225 0.38
Kipling 109 107601 1.01
Lewis 58 83799 0.69
London 57 100816 0.57
Mansfield 67 59771 1.12
Reid 200 94254 2.12
Stevenson 81 70366 1.15
Stoker 127 161255 0.79
Wallace 135 101948 1.32
Wells 54 99558 0.54
Wilde 76 79412 0.96
Woolf 76 80308 0.95
max: 2.12, min: 0.38, median: 1.03

Table 2b: Conditional perfect frequencies in EtrF (translator, Cond.Pf. n, words n, freq./1k).
Tr.Barbusse 48 116179 0.41
Tr.Bazin 74 76312 0.97
Tr.Benoît 41 64301 0.64
Tr.Flaubert 125 175678 0.71
Tr.France 66 76830 0.86
Tr.Gaboriau 335 170870 1.96
Tr.Gourmont 76 69399 1.1
Tr.Hugo 104 125428 0.83
Tr.Huysmans 46 130181 0.35
Tr.Leblanc 112 128493 0.87
Tr.Leroux 127 74920 1.7
Tr.Loti 15 65837 0.23
Tr.Massenet 42 57736 0.73
Tr.Maupassant 45 76070 0.59
Tr.Mirbeau 76 101959 0.75
Tr.Proust 408 198721 2.05
Tr.Rolland 27 65872 0.41
Tr.Vanderem 80 95884 0.83
Tr.Verne 89 63760 1.4
Tr.Zola 179 205503 0.87
max: 2.05, min: 0.23, median: 0.83

Table 2c: Conditional perfect frequencies in Fr0 (author, Cond.Pf. n, words n, freq./1k).
Barbusse 47 114877 0.41
Bazin 41 78395 0.52
Benoît 33 67915 0.49
Flaubert 108 149808 0.72
France 20 71998 0.28
Gaboriau 53 120464 0.44
Gourmont 60 73000 0.82
Hugo 18 118095 0.15
Huysmans 22 132824 0.17
Leblanc 47 130277 0.36
Leroux 78 90873 0.86
Loti 15 72386 0.21
Massenet 45 76711 0.59
Maupassant 46 75598 0.61
Mirbeau 59 117035 0.5
Proust 296 170105 1.74
Rolland 11 62539 0.18
Vanderem 44 91476 0.48
Verne 50 76890 0.65
Zola 141 192521 0.73
max: 1.74, min: 0.15, median: 0.5

Table 2d: Conditional perfect frequencies in FtrE (translator, Cond.Pf. n, words n, freq./1k).
Tr.Buchan 69 105082 0.66
Tr.Burnett 74 80743 0.83
Tr.Collins 138 198988 0.69
Tr.ConanDoyle 119 117280 1.01
Tr.Cox 194 130967 1.48
Tr.Eliot 120 168125 0.71
Tr.Hardy 217 151435 1.43
Tr.Hope 99 82966 1.19
Tr.Joyce 49 72739 0.67
Tr.Kipling 68 124885 0.54
Tr.Lewis 80 96211 0.83
Tr.London 49 86378 0.57
Tr.Mansfield 82 68674 1.19
Tr.Reid 120 93025 1.29
Tr.Stevenson 64 76757 0.83
Tr.Stoker 167 176623 0.95
Tr.Wallace 97 87316 1.11
Tr.Wells 74 108529 0.68
Tr.Wilde 63 82430 0.76
Tr.Woolf 56 87475 0.64
max: 1.48, min: 0.54, median: 0.83

Fig. 1: Distributions of conditional perfect frequencies in En0, EtrF, FtrE and Fr0.

As is readily apparent from Fig. 1, the conditional perfect is used more frequently in En0 than in Fr0, which, aside from one extreme outlier (Proust), is situated below the 1st quartile of En0. EtrF and FtrE (as usual) occupy an intermediate zone, with practically identical medians (0.83) which are both inferior to Q1 in En0 and superior to Q3 in Fr0. The most striking difference is between authors in Fr0 and translators, who use the conditional perfect almost twice as often in FtrE. As a result, the entire distribution in FtrE is superior to the median for Fr0, with 75% of FtrE (Q2-Q4) in the same range as the top quartile (Q4) of Fr0. Wilcoxon-Mann-Whitney confirms that a similar disparity could hardly occur by chance (U=337, n1=n2=20, p=0.0002) and that it is therefore reasonable to infer that – notwithstanding the considerable amount of variation that can be observed from one author or translator to another – FtrE and Fr0 are clearly different with respect to their use of the conditional perfect. Between EtrF and En0, however, the difference is less obvious. Although the interquartile range for EtrF (0.63-1) is noticeably lower than in En0 (0.9-1.37), there is nonetheless a great deal of overlap between the two distributions, and Wilcoxon-Mann-Whitney (U=135, n1=n2=20, p=0.08) indicates that the risk of error is too great to say with confidence whether any substantial difference exists between EtrF and En0 in their use of the conditional perfect.

To what extent such differences may be attributed to the influence of the analogous forms in the source texts can be assessed statistically, as illustrated in Fig. 2a and 2b:

Fig. 2a: Frequency of conditional perfect forms in FtrE vs. En0 (ρ=0.47, p=0.036).
Fig. 2b: Frequency of conditional perfect forms in EtrF vs. Fr0 (ρ=0.57, p=0.009).

In both cases, Spearman's correlation test reveals a statistically significant (p<0.05) positive correlation (ρ=0.57 for EtrF/Fr0, ρ=0.47 for FtrE/En0) of moderate strength, which somewhat unexpectedly obtains a higher score for EtrF/Fr0. (Spearman's was preferred due to the presence of outliers; Pearson's r yields an almost identical result for FtrE/En0, and a somewhat stronger coefficient (r=0.67) for EtrF/Fr0, with similar p-values in both cases.)
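As an illustration of the measurement pipeline, the sketch below applies regex d) to one invented line of TreeTagger-style word/POS/lemma output, then runs the two statistical tests with SciPy. The three small frequency vectors are simply the first five per-author rows of Tables 2c, 2d and 2b, so the numbers are real but the samples are far too small to be conclusive.

    # Pocket version of the pipeline: (1) the paper's English regex (d) on
    # TreeTagger-style "word/POS/lemma" tokens, (2) Wilcoxon-Mann-Whitney on
    # two frequency samples, (3) Spearman on paired source/target values.
    import re
    from scipy.stats import mannwhitneyu, spearmanr

    COND_PERF_EN = re.compile(
        r"(((w|c|sh)ould)|('d)|(might)|(ought))(e?st)?/\S+"
        r"( \S+/RB[RS]?/\S+)*( to/\S+)?( ((ha|')ve|of)/\S+)"
        r"( \S+/RB[RS]?/\S+)* \S+/V[BHV][ND]/")

    tagged = "i/PP/i would/MD/would never/RB/never have/VH/have believed/VVN/believe it/PP/it"
    n_hits = len(COND_PERF_EN.findall(tagged))          # -> 1 in this toy sentence
    freq_per_1k = 1000.0 * n_hits / len(tagged.split())

    fr0  = [0.41, 0.52, 0.49, 0.72, 0.28]   # Barbusse ... France (Table 2c)
    ftre = [0.66, 0.83, 0.69, 1.01, 1.48]   # Tr.Buchan ... Tr.Cox (Table 2d)
    etrf = [0.41, 0.97, 0.64, 0.71, 0.86]   # Tr.Barbusse ... Tr.France (Table 2b)

    u, p_u = mannwhitneyu(fr0, ftre, alternative="two-sided")  # inter-style contrast
    rho, p_rho = spearmanr(fr0, etrf)                          # source~target correlation
    print(n_hits, round(freq_per_1k, 2), u, p_u, rho, p_rho)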
These correlations of similar strength suggest an intuitively plausible tendency to translate individual instances of the conditional perfect in one language by the analogous form in the other language, in both directions and in roughly similar proportions (although this remains to be verified by manual examination of translation segments). Such a hypothesis would help to explain why the medians and interquartile ranges observed in EtrF and FtrE occupy a middle zone between En0 and Fr0, but it does little to account for the greater disparity between FtrE and Fr0 as opposed to EtrF and En0. Other contextual parameters may well be involved, or perhaps the higher frequency of the conditional past in En0 exerts a sort of subliminal effect on translators, who then use it more freely in FtrE, with or without a syntactic counterpart in the corresponding En0 segment.

4. Conclusion
These findings demonstrate how quantitative analysis of translated parallel corpora, in comparison with untranslated comparable corpora, can be used both to identify disparities between target texts and the target language as represented in an ex nihilo corpus, and to assess the influence of the source texts on the target texts. Such relationships are often asymmetrical: in this case, the correlation between the original French conditional perfect and the translations into EtrF is stronger, while the higher frequency of conditional perfect forms in English, though less strongly correlated on a text-to-text basis, nonetheless fosters a style of French-translated-from-English which is markedly different from ex nihilo French. While the exact mechanisms involved will require further investigation, the conditional perfect in English appears to exert a stronger influence in the translation process than the corresponding form in French.

References
Hu K. (2016). Introducing corpus-based translation studies. Springer.
Koppel M. and Ordan N. (2011). Translationese and Its Dialects. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 1318–1326, June 19-24, 2011.
Kruger A., Wallmach K. and Munday J. (Eds.) (2011). Corpus-based translation studies: Research and applications. Bloomsbury Publishing.
Loock R. (2013). Close encounters of the third code. In Lefer M.-A. and Vogeleer S., eds, Interference and normalization in genre-controlled multilingual corpora, Belgian Journal of Linguistics 27: 61-86.
Olohan M. (2002). Comparable corpora in translation research. In LREC Language Resources in Translation Work and Research Workshop Proceedings, pp. 5-9.
Zanettin F. (2013). Corpus methods for descriptive translation studies. Procedia - Social and Behavioral Sciences, 95, 20-32.
Hüning M. TextSTAT 2.9c © 2000/2014, Niederländische Philologie, Freie Universität Berlin, http://neon.niederlandistik.fu-berlin.de/en/textstat/
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Schmid H. TreeTagger, Universität Stuttgart, http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Extraction of lexical repetitive expressions from complete works of William Shakespeare
Daniel Devatman Hromada
Universität der Künste, Berlin, Germany – daniel at udk dash berlin dot de

Abstract
Rhetoric tradition has canonized dozens of repetition-involving figures of speech.
Our article shows how hitherto ignored repetition-involving schemata can be identified by translating so-called "entangled numbers" into back-referencing regular expressions. Each regex is subsequently exposed to all utterances in all works of William Shakespeare, allowing us to pinpoint 3667 instances of 172 distinct repetitive schemata.

Keywords: rhetoric stylometry, figures of speech, repetition, chiasm, entangled numbers, regular expressions, William Shakespeare, non-zipfian distribution

Résumé
On montre comment on peut identifier des figures de style jusqu'ici inconnues. Le but en question est atteint grâce au fait qu'on peut concevoir un certain groupe de figures de style tel un nombre ayant quelques propriétés particulières. Une fois découverts et énumérés, ces nombres peuvent être transcrits en expressions régulières qui peuvent ensuite être exposées à un corpus textuel. Dans le cas de notre étude préliminaire, il s'agissait du corpus de William Shakespeare.

Mots clés : stylométrie rhétorique, figures de style, répétition, chiasme, nombres intriqués, expressions régulières, William Shakespeare

1. Introduction
Masterpieces of literature and drama abound with repetitions. Rhetorics abounds with repetitions; successful oratories abound with repetitions. Many a schema and a figure exists which exploits repetition: e.g. a polysyndeton and an anaphora, an anadiplosis and an epistrophe, a symploce and an antanaclasis, a paronomasia and an antimetabole. And alliterations and paregmenons, and polyptotons, epizeuxes or even a good old psittacism? Many are such schemata, many are such figures. Woe to the one who thinks he knows them all!

Our article presents a way of enumerating many a new schema involving one or more repetitions of one or more lexical signifiers. The procedure starts with the theoretical insight that at least a certain subset of the set of all such schemata is easily enumerable. This insight is subsequently transcribed into an algorithm enumerating the natural numbers which satisfy the relevant properties. Once identified, these numbers are translated into Perl-Compatible Regular Expressions exploiting back-references and negative lookaheads.

1.1. Computational rhetorics and its roots
In literature studies it is fairly common to speak about so-called "rhyme schemes" like AAAA for monorhymes, ABAB for alternate rhyme, ABBA for enclosed rhymes, etc. It is therefore barely surprising that analogous formalisms – that is, formalisms that involve alphabetic indices – have been adopted by scholars aiming to formalize a subgroup of rhetoric figures, known as the group of schemes. For example, Harris and DiMarco (2009) use the following formalism: [W]a … [W]b … [W]b … [W]a to denote the rhetoric figure known as antimetabole. Subsequent studies in automatized chiasm identification pursue a similar route and often use formulae like ABXBA, ABCBA, ABCXCBA to denote schemata corresponding to utterances such as "Drake love loons. Loons love Drake.", "All as one. One as all." (Hromada, 2011) or "In prehistoric times women resembled men, and men resembled women." (Dubremetz & Nivre, 2015).
Table 1: The 14 lowest E-numbers, their corresponding alphabetic representations and some corresponding Shakespearean expressions. (Note that sometimes one single word is attributed the role of a distinct "brick", sometimes a concatenation of two or even more words assumes such a role. As will be indicated in sections two and three, this behaviour is not a bug, but an anticipated property of our method.)
11 / AA / "we split we split"
111 / AAA / "we split we split we split"
1111 / AAAA / "justice justice justice justice"
1122 / AABB / "gross gross fat fat"
1212 / ABAB / "to prayers to prayers"
1221 / ABBA / "my hearts cheerly cheerly my hearts"
11111 / AAAAA / "so so so so so"
11122 / AAABB / "great great great pompey pompey"
11212 / AABAB / "come come buy come buy"
11221 / AABBA / "high day high day freedom freedom high day"
11222 / AABBB / "o night o night alack alack alack"
12112 / ABAAB / "too vain too too vain"
12121 / ABABA / "come hither come hither come"
12122 / ABABB / "come buy come buy buy"

1.2. Entangled numbers
The set of entangled numbers (or E-numbers) is a subset of the set of natural numbers (i.e. integers). Entangled numbers are defined as "words of length n over an alphabet of size 9 that are in standard order and which have the property that every letter that appears in the word is repeated" (OEIS, 2016). Note that the term word, as used in the preceding as well as in following citations, is used in the mathematician's sense, meaning something like "a sequence of symbols": "A word is in 'standard order' if it has the property that whenever a letter i appears, the letter i-1 has already appeared in the word. This implies that all words begin with the letter 1." (Arndt et Sloane, 2016). Hence, numbers like 22 or 33 are not entangled numbers because they are not in "standard order", and numbers like "12" or "121" are not entangled because some (or all) of their digits are not repeated. The fourteen smallest (i.e. with the lowest numeric value) entangled numbers and their corresponding alphabetic transcriptions are enumerated in Table 1. Given that entangled numbers are natural numbers, they can easily be enumerated by an incremental algorithm starting at one and iterating towards infinity. Once enumerated (OEIS, 2016), we can bridge the realm of numbers with the realm of text and apply our method.
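The incremental enumeration just described can be implemented directly; the sketch below simply walks the integers and tests the two defining properties (OEIS A273978 remains the authoritative list):

    # Enumerate entangled numbers: digits in "standard order" and every
    # occurring digit repeated at least once. Sketch only.
    def is_entangled(n: int) -> bool:
        s = str(n)
        top = 0
        for ch in s:
            d = int(ch)
            if d == 0 or d > top + 1:   # digit 0 never occurs; standard order
                return False
            top = max(top, d)
        return all(s.count(ch) >= 2 for ch in set(s))   # every letter repeated

    print([n for n in range(11, 12123) if is_entangled(n)])
    # [11, 111, 1111, 1122, 1212, 1221, 11111, 11122, 11212, 11221, 11222,
    #  12112, 12121, 12122]  -- the fourteen smallest, as in Table 1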
What's more, when we combine the backreference with a negative lookahead operator – traditionally expressed by the formula (?!) - we can make sure that a so-called non-identity principle is also satisfied. That is : "Each distinct digit corresponds to distinct content" For example, by translating the E-number 121 into the regex (.{2,23}) (?!\1)(.{2,23}) \1 we can make sure that the content matched by the brick denoted by digit 2 shall be different from the content matched by the brick denoted by digit 1. Thus, a phrase "no no no" shall not be matched by such a regex while an expression "no yes no" shall. Going somewhat further, an E-number 12321 - which could be understood as an instance of chiasm or antimetabole ABXBA - is to be translated into regex (.{2,23}) (?!\1)(.{2,23}) (?!\1\2)(.{2,23}) \2 \1 whereby the disjunctive backreference contained in the negative lookahead 2 These are the only variable parameters of our method. 388 JADT’ 18 (?!\1\2) assures that the content matched brick no.3 - corresponing to filler X - shall be different from content matched by the brick representing digit 1 as well as the brick representing digit 2. 3. Corpus & Processing A digital, unicode-encoded version of Craig's edition of "Complete works of William Shakespeare" has been downloaded from a publicly available Internet source3 . This corpus contains 17 txt files stored in the sub-folder "comedies", 10 txt files stored in the sub-folder "tragedies" and 10 txt files stored in the sub-folder "historical". Texts were subsequently split into utterances by interpreting closing tags (e.g. , etc.) as utterance separator. Even more concretely, one can simply consider the slash symbol / to be the utterance separator. Only two further text-processing steps have been executed during the initialization phase of the experiment hereby presented. Primo, content of each utterance has been put into lowercase. Secundo, non-alphabetic symbols (e.g. dot, comma, exclamation mark etc.) have been replaced by blank spaces. We are aware that such replacement could potentially lead to certain amount of loss of prosody- or pathos- encoding information. However, we consider this step as legitimate because the objective of our experiment was to focus on repetition of lexical units4. Pre-processing code once executed, identification of expressions containing diverse types of lexical repetitions is as simple as matching each Shakespearean utterance with each regex. 4. Results All in all, 3667 instances of a repetitive expressions have been detected in Shakespeare's complete works. These were contained in 2295 distinct utterances and corresponded to 172 distinct schemata. Among these, 71 matched more than one instance: these schemata could thus potentially correspond to a certain cognitive pattern or a habitus in Shakespeare's mind. Table 2 contains summary information concerning 23 schemata matching at least five distinct utterances. 3 http://www.lexically.net/downloads/corpus_linguistics/ShakespearePlaysPlus.zip 4 Regexes matching repetitions of phonotactic clusters, syllables, or phrases, are also possible. We prefer, however, not to focus on this topic within the limited scope of this conference proposal. JADT’ 18 389 Table 2: Repetitive schemata matching at least 23 distinct utterances present in collected works of William Shakespeare. 
3. Corpus & Processing
A digital, unicode-encoded version of Craig's edition of the "Complete works of William Shakespeare" has been downloaded from a publicly available Internet source (http://www.lexically.net/downloads/corpus_linguistics/ShakespearePlaysPlus.zip). This corpus contains 17 txt files stored in the sub-folder "comedies", 10 txt files stored in the sub-folder "tragedies" and 10 txt files stored in the sub-folder "historical". Texts were subsequently split into utterances by interpreting closing tags as utterance separators; even more concretely, one can simply consider the slash symbol / to be the utterance separator. Only two further text-processing steps have been executed during the initialization phase of the experiment presented here. Primo, the content of each utterance has been put into lowercase. Secundo, non-alphabetic symbols (e.g. dot, comma, exclamation mark, etc.) have been replaced by blank spaces. We are aware that such replacement could potentially lead to a certain loss of prosody- or pathos-encoding information. However, we consider this step legitimate because the objective of our experiment was to focus on the repetition of lexical units (regexes matching repetitions of phonotactic clusters, syllables, or phrases are also possible; we prefer, however, not to focus on this topic within the limited scope of this conference proposal). Once the pre-processing code is executed, identification of expressions containing diverse types of lexical repetitions is as simple as matching each Shakespearean utterance with each regex.

4. Results
All in all, 3667 instances of repetitive expressions have been detected in Shakespeare's complete works. These were contained in 2295 distinct utterances and corresponded to 172 distinct schemata. Among these, 71 matched more than one instance: these schemata could thus potentially correspond to a certain cognitive pattern or habitus in Shakespeare's mind. Table 2 contains summary information concerning the schemata with the most matches.

Table 2: Repetitive schemata matching at least 23 distinct utterances present in the collected works of William Shakespeare (instances / E-number / example).
2332 / 11 / "bestir bestir"
525 / 1212 / "to prayers to prayers"
170 / 111 / "ha ha ha"
100 / 123123 / "cover thy head cover thy head"
48 / 12121 / "come hither come hither come"
35 / 1221 / "fond done done fond"
32 / 12341234 / "let him roar again let him roar again"
32 / 1122 / "with her with her hook on hook on"
30 / 1111 / "great great great great"
23 / 121212 / "come on come on come on"

Another phenomenon may be found noteworthy by a reader interested in the purely quantitative aspects of our research. It concerns the relation between the length of the E-number (i.e. the amount of corresponding bricks) and the number of utterances matched by such numbers. In the case of trivial repetitions, this relation seems to be plainly Zipfian. For example, Shakespeare's dramas seem to contain 2332 duplications (e.g. E=11), 170 triplications (E=111), 30 tetraplications (E=1111), 8 pentaplications (E=11111), two hexaplications (E=111111), one heptaplication (E=1111111) and zero octaplications.

Table 3: Comparison of frequencies of occurrence of schemata of a certain length (in digits).
Digits: 2, 3, 4, 5, 6, 7, 8, 9
Theoretical: 1, 1, 4, 11, 41, 162, 715, 3425
Matched: 2332, 170, 622, 91, 211, 56, 86, 67

It is worth mentioning, however, that the generic relation between the length (in digits) of an E-number and the amount of utterances it matches seems not to be Zipfian. As indicated by Table 3, an observed preference for repetitive expressions including two, four, six or eight bricks cannot be explained in terms of the number-theoretical distribution of E-numbers themselves. For example, there exist eleven E-numbers with five digits and forty-one E-numbers of length six. However, when exposed to the Shakespeare corpus, regexes generated from six-digit E-numbers seem to match 211 utterances while five-brick-long regexes match only ninety-one of them. Whether this observed asymmetry is an artefact of our method, or whether it is due to a sort of cognitive bias, a sort of preference for balanced repetitions within the Poet's mind, places us in front of a question which we do not dare to tackle here.

5. Conclusion
The insight that a certain class of repetition-based schemata can be enumerated allows us to generate myriads of hitherto unseen Perl-Compatible Regular Expressions which involve back-references and negative lookaheads (we remind the reader that PCREs are much more powerful than so-called regular grammars: regular grammars are unable to backreference, while for PCREs backreferencing is a completely legal act). In the end, such regexes have been exposed to a corpus containing the collected works of William Shakespeare. Matching all utterances with all regexes generated out of all 4360 E-numbers with less than 10 digits lasted 9555 seconds in the case of Shakespearean comedies, 6607 seconds in the case of tragedies and 6900 seconds in the case of historical dramata – all this on one single core of a 1.4 GHz CPU. This approach allowed us to pinpoint 3667 instances, in utterances matching at least one among 172 distinct repetitive schemata (see https://refused.science/rhethorics/shakespeare-regex/matches.csv, licensed under CC BY-NC-SA, for the list of all matched utterances, including information about the respective entangled numbers, theater pieces, genres (comedy / tragedy / drama) and the dramatis personae). 23 among these schemata matched at least 5 distinct utterances, and 71 among them matched at least two utterances. This may potentially point to a sort of neurolinguistic habit residing in the opaque sphere between the syntactic and lexical layers. We believe that at least some of these "figures" could be of certain interest not only for scholars trying to understand the inner intricacies of Shakespeare's genius, but also for addressing more generic topics in fields as distinct as digital humanities, computational rhetorics, discourse stylometry or even more general cognitive sciences.

References
Aho, A. V. (2014). Algorithms for finding patterns in strings. Algorithms and Complexity, 1:255.
Arndt, J., Sloane, N. J. A. (2016).
Counting words that are in "standard order". The On-Line Encyclopedia of Integer Sequences. https://oeis.org/A278984/a278984.txt.
Dubremetz, M., Nivre, J. (2015). Rhetorical figure detection: the case of chiasmus. On Computational Linguistics for Literature, page 23.
Friedl, J. E. F. (2002). Mastering regular expressions. O'Reilly Media, Inc.
Harris, R., DiMarco, Ch. (2009). Constructing a rhetorical figuration ontology. In Persuasive Technology and Digital Behaviour Intervention Symposium, pages 47–52. Citeseer.
Hromada, D. D. (2011). Initial experiments with multilingual extraction of rhetoric figures by means of PERL-compatible regular expressions. In RANLP Student Research Workshop, pages 85–90.
OEIS (2016). List of words of length n over an alphabet of size 9 that are in standard order and which have the property that every letter is repeated at least once. https://oeis.org/A273978

Spécificités des expressions spatiales et temporelles dans quatre sous-genres romanesques (policier, science-fiction, historique et littérature générale)
Olivier Kraif, Julie Sorba
Univ. Grenoble Alpes, LIDILEM – olivier.kraif@univ-grenoble-alpes.fr; julie.sorba@univ-grenoble-alpes.fr

Abstract
In this paper, we test whether classifications of phraseological units based on recurring-tree and n-gram methods are effective for distinguishing novel subgenres from one another. Our results confirm that these two methods are relevant for expressions relating to space and time in our corpora.

Résumé
Notre objectif est de tester les classifications des phraséologismes, opérées par les méthodes des ALR et des SR, dans le but de distinguer des sous-genres romanesques les uns des autres. Dans nos corpus, nos résultats confirment la pertinence de ces classifications pour les deux champs de l'espace et du temps.

Keywords: ngram, recurring trees, novel genres, phraseology

1. Introduction
Our study, which falls within the framework of exploratory analysis of textual data, concerns contemporary French novels collected within the ANR-DFG PhraseoRom project. This corpus (more than 110 million words for French) is partitioned into several subcorpora corresponding to different literary subgenres (crime fiction, science fiction, fantasy, historical novel, romance, general fiction). Our objective is to characterize these textual genres and subgenres through the specific phraseological units they contain. Following Boyer, we postulate that "each genre comprises a certain number of subsets, series founded on the reuse of identical components" (1992, p. 91). Insofar as extended phraseology is interested in everything that is "prefabricated" in lexical sequences, it constitutes a privileged point of entry for bringing these "series" to light.
For this study, we specifically retain four subgenres: science-fiction novels (SF), crime novels (POL), historical novels (HIST) and novels of so-called general literature (GEN). Text mining uses the technique of Recurring Lexico-syntactic Trees (ALR; Kraif & Diwersy, 2012; Kraif, 2016), whose validity has already been shown by the identification of specific phraseological units in scientific texts (Tutin & Kraif, 2016). We further propose to compare this extraction technique with that of repeated segments (Salem, 1987), since ALRs have shown better handling of syntactic variability when spotting routines, but sometimes prove deficient in identifying surface-frozen segments, owing to the dependency model employed. In earlier work, we showed how ALRs made it possible to spot recurring patterns built around specific expressions strongly tied to the thematic component of the subgenres in question: this was the case for "scène de crime" in POL (Kraif, Novakova & Sorba, 2016). Here, we concentrate on expressions less directly tied to the reference universes of the subgenres (crime, love, science, etc.), in order to bring out less predictable traits. This is why we chose to select the sequences – very often adverbial – linked to the expression of time and space. We will first present the results obtained in previous work (part 2), then describe our experimental methodology (part 3). Finally, we will set out and discuss our observations (part 4) before proposing conclusions and perspectives for our study (part 5).

2. Previous work
Lefer, Bestgen & Grabar (2016) rely on the extraction of n-grams of 2 to 4 words to characterize three textual genres: European parliamentary debates, press editorials and scientific articles. These authors use correspondence analysis to identify the most typical expressions and draw contrastive observations from them concerning the expression of certainty and opinion. For our part, we analyzed generic contrasts on a qualitative level by identifying ALRs in corpora of crime and science-fiction novels, on the basis of specificity measures (Kraif, Novakova & Sorba, 2016). We also used ALR extraction to automatically classify, in a supervised approach, POL, SF and GEN subcorpora (Chambre & Kraif, 2017). This preliminary work showed that ALRs gave better results than the other categories of features (punctuation, morphosyntax, lexicon), and made it possible to classify 98% of the corpus texts correctly from a selection of discriminant features. Most of these features belonged to precise lexical fields tied to the reference universes of each subgenre, such as those of the 'telephone' (le numéro de portable, passer un coup de fil, etc.) or the 'car' (à travers le pare-brise, démarrer en trombe, etc.) for POL. In addition, temporal expressions (e.g., for POL, à huit heures, vingt et une heures, au bout de X minutes) and highly varied spatial indications (e.g., for SF, par la voie, dans le territoire, dans la sphère, dans l'espace, la zone de) were brought to light.
Here we propose an extension of this experiment, on the one hand by studying spatial and temporal expressions, and on the other hand by adding the subgenre of historical novels (HIST), in order to determine whether these classes of expression are sufficient to differentiate the four subgenres (POL, SF, GEN, HIST).

3. Methodology
For each subgenre, our corpus comprises a sample of roughly 8 million words, corresponding to about 70 works by some forty authors (cf. Table 1). These works are all later than 1950, and the majority of them were first published after 2000. The classification of the works into genres was carried out a priori according to editorial criteria, based on the collections in which they were published.

Table 1: Composition of the corpus (authors / novels / size in words).
POL: 46 / 69 / 8 008 395
SF: 36 / 75 / 8 001 582
HIST: 38 / 70 / 8 015 933
GEN: 46 / 69 / 8 008 395

Figure 1: ALR representing the expression en une fraction de seconde.

To identify the phraseological expressions characteristic of the different subgenres, we use two extraction methods (a naive sketch of the second is given below):
- the ALR method: our corpora being parsed into dependencies with XIP (Aït-Mokhtar et al., 2002), these ALRs are subtrees respecting criteria of frequency (here ≥ 10 occurrences), dispersion (here ≥ 10 different authors, belonging to at least 3 different subgenres) and size (here ≥ 3 nodes and ≤ 8 nodes). Moreover, during the search for these ALRs, an association measure is computed so as to retain only the nodes significantly associated with the rest of the tree. Figure 1 shows an example of an ALR corresponding to the expression en une fraction de seconde.
- the repeated segments method (SR; Salem, 1987): we applied the same criteria of dispersion and size (≥ 3 and ≤ 8), so as to compare the two methods in fine. SRs are made up of sequences of lemmas (obtained with XIP), not of inflected forms. This latter method is simpler to implement and requires few linguistic resources, although it raises problems of combinatorial explosion (cf. part 4).
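A naive version of such a repeated-segments extractor can be sketched as follows (toy thresholds and input; a realistic corpus requires safeguards against the combinatorial explosion discussed in part 4):

    # Lemma n-grams of length 3-8, filtered by frequency and author
    # dispersion, mirroring the criteria stated above. Illustrative only.
    from collections import Counter, defaultdict

    def repeated_segments(texts, n_min=3, n_max=8, min_freq=10, min_authors=10):
        """texts: iterable of (author, [lemma, lemma, ...]) pairs."""
        freq, authors = Counter(), defaultdict(set)
        for author, lemmas in texts:
            for n in range(n_min, n_max + 1):
                for i in range(len(lemmas) - n + 1):
                    seg = tuple(lemmas[i:i + n])
                    freq[seg] += 1
                    authors[seg].add(author)
        return {seg: c for seg, c in freq.items()
                if c >= min_freq and len(authors[seg]) >= min_authors}

    corpus = [("A", "le salle de bain être grand".split()),
              ("B", "vers le salle de bain".split())]
    print(repeated_segments(corpus, min_freq=2, min_authors=2))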
In a second step, we apply keyword filtering so as to retain only the sequences linked to the two subdomains under study, namely the expression of time and of space. The keywords for space are names of places, natural spaces, geographic description, distance measures and adverbs of place, selected after a first sampling of the extracted ALRs:
- SPACE keywords: cave, salon, hôpital, immeuble, bâtiment, camp, restaurant, village, route, rue, quai, chaussée, terrasse, ministère, parc, bureau, carlingue, maison, toit, chambre, hôtel, palais, rez-de-chaussée, entrée, pont, escalier, chemin, place, salle, jardin, seuil, cour, couloir, colline, sentier, sol, rive, rivage, plage, rivière, mont, montagne, mer, océan, lac, bois, forêt, espace, endroit, coin, pays, continent, frontière, direction, cap, sud, est, nord, ouest, confins, mètre, kilomètre, année-lumière, hectare, acre, loin, proche, près de, au bord de, orée, distance.

The keywords for time designate moments of the day and of the year, units of measure and conventional divisions into periods (nouns, adverbs and adverbial locutions):
- TIME keywords: matin, soir, soirée, après-midi, nuit, jour, temps, fois, moment, instant, toujours, jamais, parfois, souvent, autrefois, jadis, tôt, tard, longtemps, brièvement, immédiatement, subitement, tout à coup, tout de suite, aujourd'hui, demain, hier, lendemain, maintenant, heure, minute, seconde, journée, semaine, mois, an, année, décennie, siècle, millénaire, printemps, été, automne, hiver.

These lists make no claim to exhaustiveness, and the filtering performed produces both silence and noise, owing to ambiguities. These nonetheless remain marginal (according to a manual sampling, noise is below 10%). To identify the sets of features relevant from the point of view of the subgenres, we feed these expressions (ALRs or SRs) into an automatic classification system. In so doing, we pursue a double objective: on the one hand, to verify that our a-priori classes are coherent and correlated with objectifiable criteria; on the other hand, to identify these criteria in the form of sets of discriminant features for classification.
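For readers outside the Weka ecosystem, this classification setup can be approximated in Python as sketched below. SMO trains a support-vector classifier, so scikit-learn's LinearSVC is used here as a stand-in, and the matrix X and labels y are random placeholders standing in for the 166 novels and their feature frequencies.

    # Rough analogue of the SVM/SMO evaluation described in part 4:
    # 10-fold cross-validated accuracy plus Kappa agreement with the
    # a-priori genre labels. All data below are fake placeholders.
    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score, cross_val_predict
    from sklearn.metrics import cohen_kappa_score

    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(166, 1000)).astype(float)  # ALR/SR frequencies (fake)
    y = rng.integers(0, 4, size=166)                      # 0=POL 1=SF 2=HIST 3=GEN

    clf = LinearSVC()
    accuracy = cross_val_score(clf, X, y, cv=10).mean()
    kappa = cohen_kappa_score(y, cross_val_predict(clf, X, y, cv=10))
    print(f"accuracy={accuracy:.3f}  kappa={kappa:.3f}")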
4. Results and discussion
In a first step, we extracted the 6,000 most frequent ALRs over the whole corpus. Performing a classification on these features, with an SVM model optimized by SMO (using the Weka platform, Eibe et al., 2016), we obtain, in 10-fold cross-validation, a precision of 74% (123 out of 166), with a Kappa of 0.65, which corresponds to very good agreement with the reference classification. The confusion matrix (cf. Table 2) shows that the two best-classified genres are SF (93.1%) and POL (79.5%). The genre GEN obtains the lowest precision (64%), with frequent confusions with POL and HIST; HIST, for its part, is frequently confused with GEN. Examination of the most discriminant ALRs shows, as might be expected, the strong presence of certain themes in POL, HIST and SF (the car, crime and the telephone for POL; war and religion for HIST; outer space and technological artefacts for SF) and the absence of salient features in GEN.

4.1 Selection of TIME+SPACE features
When we select the features linked to the expression of time alone (about a thousand), we obtain a degradation relative to the previous results, with an overall precision of 48.8% and a Kappa of 0.31, signifying weak agreement between the a-priori classification and the automatic one. Spatial expressions, for their part (we obtain 1,560 of them, but retained the 1,000 most frequent so as to have comparable results), obtain somewhat better results, though still worse than the unfiltered features: precision is 59.6%, with a Kappa of 0.46 corresponding to moderate agreement. When the TIME and SPACE ALRs are selected jointly, we obtain a slight improvement over the classification with SPACE alone: 61.4% (102 instances correctly classified out of 166), with a fairly good Kappa of 0.48. The confusion matrix (cf. Table 2) shows that POL obtains the best precision (69%) and GEN the worst (55.9%). If we select the most discriminant features (SfcSubsetEval attributes with the BestFirst method in Weka), we obtain a set of 54 attributes. The classifying power of these attributes on our corpus can be assessed, indicatively, by feeding them back into an SMO classification: we then obtain a very slightly higher overall precision (62%), but it is interesting to note that the marked genres POL, SF and HIST are very well classified on the basis of these features (precision of 85.7% for HIST, 84% for SF, 75.7% for POL), with a strong degradation for GEN (43.4%), as shown by the confusion matrix below (Table 2).

Table 2: Confusion matrices for the classifications with (1) all features, (2) the filtered ALRs (TIME+SPACE) and (3) the selected ALRs. Rows = reference genre; columns = SF, POL, GEN, HIST.
(1) All features (6,000 most frequent ALRs):
SF: 27 2 2 5
POL: 1 35 9 1
GEN: 1 5 32 8
HIST: 0 2 7 29
(2) TIME+SPACE (2,571 filtered features):
SF: 18 5 6 7
POL: 5 29 12 0
GEN: 3 3 33 7
HIST: 3 5 8 22
(3) TIME+SPACE, selection of 54 features:
SF: 21 2 13 0
POL: 3 28 15 0
GEN: 1 6 36 3
HIST: 0 1 19 18

Detailed examination of the 54 selected features reveals several salient points:
- generally speaking, ALRs relating to space are very largely in the majority, with 33/54 against 17/54 for time, after elimination of noise (4/54).
- considering the features specific to HIST, the spatial expressions designate above all places of power (la place forte, de son palais, salle du palais, salle du château, pénétrer dans la grande salle) and the sea (sur la mer, de la mer), while the temporal expressions refer to a long temporality (au bout de quelques mois, règne de X années, avoir le temps) and to absolute or relative datings (du Ne siècle, venir le lendemain, à trois heures de l'après-midi).
- for POL, by contrast, the temporal expressions indicate clock datings (à 8 heures, 21 heures) and short durations (une vingtaine de secondes). The spatial expressions, which are numerous, indicate rooms and interior spaces (de la salle de bain, vers la salle de bain, entrer dans le bureau, vers le bureau, dans le coin), urban places (aller à l'hôtel, passer à l'hôpital, à l'hôpital), and vague localizations (dans le coin in the sense of "in the vicinity").
- for SF, the temporal expressions are more numerous (7/18) than in the other subgenres. They refer to durations extreme in their length (milliers d'années, de mille ans) or their brevity (une fraction de seconde, un centième de seconde). For space, we find expressions of numerically specified distances (dizaines de mètres, centaine de mètres, plusieurs centaines de mètres), expected references to interstellar space (dans l'espace, à travers l'espace, être dans l'espace, voyager dans l'espace, flotter dans l'espace) and to space-time, and expressions with sol (sur le sol, sous-sol).
- for GEN: the only specific expression appearing among the selected features is chemin de traverse.

4.2 Comparison with repeated segments
We were not able to extract the totality of the SRs of 3 to 8 words for the whole corpus, because of problems of combinatorial explosion (about 40,000,000 SRs generated for 100 texts of the corpus). We therefore retained the SRs containing the keywords selected for TIME and SPACE, keeping the 1,000 most frequent SRs so as to have feature sets comparable to the filtered ALRs. Better results are obtained than for the ALRs, with a precision of 66.7% for SPACE and 58.3% for TIME, against 59.6% and 48.8% respectively.
For TIME+SPACE, a certain degradation is observed, with precision falling to 64.1%. At this stage of our observations, these quantitative results are difficult for us to interpret, since selecting the best feature set for SPACE yields more or less the same expressions as with the ALRs (lemma sequences, hence the uninflected forms): le chambre de, le cour de, à le cour, dans le espace, le salle de bain, de le espace, dans son bureau, de le immeuble, le maison et, à le hôtel de, centaine de mètre, sur le bureau, sur le place de, le palais de, dans le grand salle, de bureau de, de le salle de bain, sur son bureau, cour de France, en route pour, dans mon bureau, dans tout le direction, un dizaine de mètre, de son pays, à le rue, dans le sous-sol, quitter le salle, dans un restaurant, sur le rivage, mètre plus bas, vers le bureau, route vers le, dizaine de mètre de, un kilomètre de, à ministère de, dans le espace et, de un montagne, le espace et le. The two methods therefore converge in qualitative terms, extracting the same expressions. Nevertheless, further investigation will be needed to interpret correctly the fact that SRs obtain better quantitative results.

5. Conclusion and perspectives
This study confirms that phraseological expressions are good descriptors for subgenre classification (Chambre & Kraif, 2017). Indeed, even if the results obtained here from the subset of spatial and temporal expressions are appreciably lower than those obtained from features more directly tied to the reference universe of each subgenre (61.4% vs 98%), these less informative expressions nevertheless allow novels to be classified satisfactorily into the marked subgenres POL, SF and HIST. For the category of general novels (GEN), on the other hand, they are not discriminant. Our method also brings out generic specificities proper to the two fields SPACE and TIME (places of power in HIST vs interiors and urban places in POL; extreme durations and distances in SF). Finally, from this selection of spatio-temporal expressions, the repeated-segments method produces a more precise subgenre classification than the ALRs. This point, difficult to interpret from our first qualitative observations, requires deeper study. These results encourage us to pursue the exploration of other lexical fields at the margins of the reference universes of each subgenre, in order, on the one hand, to refine our methodology and, on the other, to target the elements at the heart of phraseology.

References
Aït-Mokhtar S., Chanod J.-P. and Roux C. (2002). Robustness beyond Shallowness: Incremental Deep Parsing. Natural Language Engineering, 8:121-144.
Boyer A.-M. (1992). La paralittérature. Presses Universitaires de France.
Chambre J. et Kraif O. (2017). Identification de traits spécifiques du roman policier et de science-fiction. Communication présentée aux Journées Internationales de la Linguistique de Corpus - JLC2017, Grenoble, 05.07.2017.
Eibe F., Hall M. A. and Witten I. H. (2016). The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, Fourth Edition.
Kraif O., Novakova I. et Sorba J. (2016). Constructions lexico-syntaxiques spécifiques dans le roman policier et la science-fiction. Lidil, 53 : 143-159.
Kraif O. et Diwersy S. (2012).
Le Lexicoscope : un outil pour l'étude de profils combinatoires et l'extraction de constructions lexico-syntaxiques. Actes de la conférence TALN 2012, pp. 399-406.
Lefer M.-A., Bestgen Y. et Grabar N. (2016). Vers une analyse des différences interlinguistiques entre les genres textuels : étude de cas basée sur les n-grammes et l'analyse factorielle des correspondances. Actes de la conférence conjointe JEP-TALN-RECITAL 2016, pp. 555-563.
Tutin A. et Kraif O. (2016). Routines sémantico-rhétoriques dans l'écrit scientifique de sciences humaines : l'apport des arbres lexico-syntaxiques récurrents. Lidil, 53 : 119-141.
Salem A. (1987). Pratique des segments répétés. Essai de statistique textuelle. Klincksieck.

Les phrases de Marcel Proust
Cyril Labbé (1), Dominique Labbé (2)
(1) Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France (cyril.labbe@imag.fr)
(2) Univ. Grenoble Alpes, PACTE (dominique.labbe@umrpacte.fr)

Abstract
Analysis of sentence lengths in Marcel Proust's A la recherche du temps perdu. Counting standards and the various available measures are presented. For most of his reading time, the reader of this novel is confronted with very long and syntactically complex sentences. A comparison with other writers shows that these sentences are atypical but not unique, and that some of their characteristics can be observed in a number of other works, some of which are cited in the Recherche du temps perdu.

Résumé
Analyse des longueurs de phrases dans A la recherche du temps perdu de Marcel Proust. Présentation des normes de dépouillement et des différentes mesures possibles. Durant la majorité de sa lecture, le lecteur se trouve confronté à des phrases très longues et syntaxiquement complexes. Une comparaison avec un large panel d'écrivains montre qu'il s'agit d'un phénomène exceptionnel mais pas unique et que certaines caractéristiques se retrouvent dans quelques œuvres dont certaines sont citées dans la Recherche du temps perdu.

Keywords: lexicometry, stylometry, sentence length, French literature, Proust

1. Introduction
Are Marcel Proust's (1871-1922) sentences exceptional? The question has mostly been treated from a qualitative angle (notably Curtius 1970). A few quantitative estimates exist (Bureau 1976, Brunet 1981, Milly 1986), with diverging results, for reasons that will be explained at the beginning of this paper. Above all, we present a statistical comparison with other writers which will make it possible to judge how exceptional the Proustian sentence is. The analysis of sentences raises several of the problems confronting lexicometry (statistics applied to language). First of all, there is the choice of the reference edition. Indeed, for the Recherche du temps perdu this choice exists and introduces a slight uncertainty concerning the punctuation of the work (discussion in Ferré 1957 and Serça 2010), especially for the last three volumes. We have held to the general principle that the authoritative version is the last one revised by the author or, failing that, the one closest to his death: here, the original Gallimard edition (appendix 1). Moreover, this original edition imposes itself since it is in the public domain and can be freely communicated to researchers wishing to reproduce our results and take this analysis further.
Le mot et la phrase Le mot est défini comme l’occurrence d’un vocable, c’est-à-dire une entrée dans le lexique de la langue française selon la norme présentée par Muller 1963. Cette norme est fondée notamment sur la nomenclature de Hatzfeld et al. 1898. Son implémentation est décrite dans Labbé 1990. Par exemple, "aujourd’hui", "parce que" ou "Saint-Loup" sont des mots uniques et non deux "formes graphiques". Il y a 1 449 "parce que" dans la Recherche, soit plus d’un mot pour mille ; et 787 fois "Saint-Loup" (l’un des principaux personnages du roman). A l’inverse, les formes graphiques "le", "la", "les" ont deux entrées (pronom ou article) ; "du" ou "des" sont la contraction de deux entrées du lexique - préposition "de" et article "le". En fonction de la norme retenue (vocable ou formes graphiques), le nombre de mots dans un texte peut varier de près 10%. Selon cette "norme Muller", la Recherche compte 1 327 859 mots (N dans la suite) et 21 836 vocables différents. Quant à la phrase, il y a un accord général pour la définir comme l’empan de texte dont le premier mot comporte une majuscule initiale et qui se trouve compris entre deux ponctuations majeures. Les ponctuations majeures sont le point, les points d’interrogation et d’exclamation, les points de suspension. Cependant, aucun de ces 4 signes typographiques ne marque automatiquement une fin de phrase : - le point dans « M. Verdurin » ne termine pas une phrase même s’il est suivi d’un mot à majuscule initiale. Il y a dans la Recherche 3 152 « monsieur » écrits "M.". C’est le deuxième substantif le plus fréquent dans la Recherche (juste derrière "Mme"), soit 2,4 pour mille mots. Ce point "non-terminal" se retrouve dans les initiales que Proust utilise pour "anonymiser" certains noms (Mme X.) ou derrière des abréviations (etc.). - dans la Recherche, plus de trois points d’interrogation sur 10 sont internes à la phrase (721). - il y a 1 201 points d’exclamation internes à la phrase et 190 points de suspension également dans cette situation. Proust a plusieurs fois déclaré son hostilité envers ces derniers mais il les utilise parfois. Par exemple : « La duchesse émit très fort, mais sans articuler : « C’est l’... i Eon l... b... frère à 402 JADT’ 18 Robert. » (la Prisonnière). Cette rapide discussion permet de comprendre la solution adoptée : un automate détermine les fins de phrase et, en cas de doute, l’opérateur choisit : fin de phrase ou ponctuation interne ? A condition que l’opérateur suive toujours la même norme, le dépouillement est fait sans erreur et, surtout, les résultats obtenus sur un auteur sont comparables à ceux de tous les autres. Ce recensement établit le nombre de phrases de la Recherche (voir tableau en annexe). P = 37 336 phrases. Comment caractériser ces phrases en fonction de leurs longueurs ? 3. Les indices statistiques usuels. Les P phrases sont rangées par longueur croissante, dans des classes d’intervalles égaux (ici 1 mots). Par exemple, la première classe (1 mot, généralement une exclamation) contient 124 phrases, soit 0,37% du total. L’effectif de chaque classe est ainsi recensé et son poids relatif est calculé. Ce recensement fournit les informations suivantes : - Etendue de la distribution : 1 à 931 mots. La plus longue phrase est celle sur les homosexuels au début de Sodome et Gomorrhe. Les phrases de la Recherche ne sont pas réparties uniformément sur cet intervalle. 
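A sketch of the semi-automatic census described above may help: an automaton proposes sentence ends and the doubtful cases are left to the operator. This is our illustration, not the authors' program; the abbreviation list is a deliberately tiny stub and the real counting norm (Labbé 1990) is far richer.

```python
import re

# Illustrative abbreviation list; the actual norm is far richer than this stub.
ABBREVIATIONS = {"M.", "Mme.", "etc.", "cf."}
MAJOR = re.compile(r"\.\.\.|[.?!…]")  # the four "ponctuations majeures"

def candidate_sentence_ends(text):
    """Yield (offset, certain) pairs for each major punctuation mark.
    certain=False flags the doubtful cases left to the human operator,
    mirroring the semi-automatic procedure described above."""
    for m in MAJOR.finditer(text):
        words = text[:m.end()].split()
        token = words[-1] if words else m.group()
        # "M. Verdurin", "etc.", initials such as "X.": never sentence ends.
        if token in ABBREVIATIONS or re.fullmatch(r"[A-Z]\.", token):
            continue
        # A following capital makes a sentence end likely; anything else
        # (e.g. an internal "?" or "!") is submitted to the operator.
        certain = bool(re.match(r"\s+[A-ZÀ-Ý«]", text[m.end():]))
        yield m.end(), certain
```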
La seconde plus longue – celle sur les chambres au début de Combray – compte 542 mots ; la troisième (le salon des Verdurins dans la Prisonnière) : 430 ; la quatrième (l’église de Combray) : 399. Ensuite, il n’y a plus de "trou" important dans l’étalement des longueurs. - Le mode est la classe la plus peuplée, ou longueur de phrase que le lecteur a le plus de chance de rencontrer : 11 mots. Il y a donc, dans la Recherche, une prédominance des phrases courtes et syntaxiquement simples. Il en est ainsi dans la plupart des textes en français. - La médiane est la valeur de la variable pour l’individu du milieu ou individu "médian". Dans les P phrases rangées par longueurs, l’individu médian est celui qui occupe la place (P+1)/2. Lorsque l’effectif total de la population (P) est pair, la médiane est la moyenne des valeurs de la variable pour les 2 individus situés de part et d’autre. Dans un texte étendu comme la Recherche, la médiane se trouve dans une classe dont l’effectif est assez élevé. Dans ce cas, la valeur est interpolée en divisant l’intervalle de la classe où se situe l’individu médian par l’effectif de cette classe. Dans la Recherche, ce calcul aboutit à une médiane de 26,28 mots. Etant donné que la variable "longueur de phrase" ne prend que des valeurs entières, les décimales indiquent le sens de l’arrondi et la position de la borne. La longueur médiane des phrases de la Recherche est donc de 26 mots. Ou encore la moitié des phrases ont une longueur inférieure ou égale à 26 mots et l’autre moitié une longueur supérieure à 26. JADT’ 18 403 - La moyenne (N/P) : 35,57 mots. A cet indice est associée une déviation "standard" des valeurs de la variable autour de la moyenne (écart-type) : racine carrée de la variance (moyenne des carrés des écarts de chaque valeur de la variable à la moyenne arithmétique). L’écart type de la longueur des phrases de la Recherche est de 31,42 mots. La dispersion des valeurs autour de la moyenne mesurée par le coefficient de variation relative : rapport de l’écart-type à la moyenne arithmétique (ici 89%). Etant donné l’effectif considéré (37 336 phrases), si les valeurs de la variable "longueur de phrase" étaient distribuées normalement autour de la moyenne (cas d’une population homogène), ce coefficient serait d’environ 4%. Autrement dit, les observations sont extrêmement dispersées. Dans ce cas, la moyenne n’est pas représentative de la série et, en particulier, il n’est pas possible de considérer que cette moyenne se situe à peu près "au milieu" de la population. Dès que la dispersion relative approche les 50% de la moyenne, celle-ci est située dans la partie basse de l’étendue de la distribution qui est fortement asymétrique. Le profil de la distribution des longueurs de phrases dans la Recherche est donné par la figure 1 dans laquelle l’effectif relatif de chaque classe est représenté par la hauteur du bâton correspondant (histogramme). Figure 1. Histogramme de la distribution des longueurs phrases D’une part, le graphique s’interrompt à la classe 200+ mots et le bâton pour cette classe – à l’extrême-droite du graphique - correspond aux 96 phrases longues de 200 mots et plus (0,3% du total des phrases mais 2,1% de la surface du texte). Le graphique complet est encore plus étalé sur la droite, la grande masse des phrases apparaissant serrées sur la gauche… D’autre part, le bâton le plus haut correspond au mode principal (11 mots) mais l’on observe de nombreux modes secondaires (17, 20, 24, etc.) : plusieurs 404 JADT’ 18 populations sont donc mélangées. 
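These indices are straightforward to compute from the raw list of sentence lengths. A minimal sketch (ours, not the authors' code; the grouped-median interpolation is one standard reading of the procedure described above):

```python
from collections import Counter
from statistics import mean, pstdev

def central_indices(lengths):
    """Mode, interpolated median, mean, standard deviation and relative
    coefficient of variation for a list of sentence lengths (in words)."""
    P = len(lengths)
    counts = Counter(lengths)
    mode = counts.most_common(1)[0][0]
    ordered = sorted(lengths)
    med_class = ordered[(P - 1) // 2]   # class containing the median individual
    below = sum(c for v, c in counts.items() if v < med_class)
    # Grouped-data interpolation over a class of width 1 (lower bound v - 0.5).
    median = (med_class - 0.5) + (P / 2 - below) / counts[med_class]
    m, sd = mean(lengths), pstdev(lengths)
    return {"mode": mode, "median": round(median, 2), "mean": round(m, 2),
            "std": round(sd, 2), "cv_percent": round(100 * sd / m, 1)}
```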
La plupart des phénomènes sociaux présentent des caractéristiques semblables et, en premier lieu, la distribution des revenus ou des patrimoines. Dans de pareils cas, l’analyse ne se contente pas des valeurs centrales. Elle se centre sur la distribution du caractère étudié (ici la surface du texte) au sein de la population (ici les phrases).

4. L’inégal partage de la surface du texte entre les phrases
Ce renversement de perspective présente un avantage : la surface de texte correspond grosso modo à la durée de la lecture. Deux méthodes sont possibles pour l’évaluer.

4.1 Quantile et médiale
Les phrases étant classées par longueurs croissantes, la surface du texte qu’elles couvrent est découpée en masses égales (tableau 1).

Tableau 1. Partage de la surface du texte en fonction de la longueur des phrases

  Surface divisée en quantiles           Longueur (mots)   % des phrases (cumulé)
  Premier décile (10 %)                       18,58               33,8
  Deuxième décile (20 %)                      26,70               49,6
  Premier quartile (25 %)                     29,53               54,5
  Troisième décile (30 %)                     33,30               60,6
  Quatrième décile (40 %)                     41,35               70,1
  Deuxième quartile – médiale (50 %)          49,93               77,5
  Sixième décile (60 %)                       60,20               84,6
  Septième décile (70 %)                      72,93               89,7
  Dernier quartile (75 %)                     81,13               92,3
  Huitième décile (80 %)                      90,57               94,2
  Neuvième décile (90 %)                     121,00               97,8

Dans ce tableau, le premier décile est la borne supérieure de l’intervalle comprenant les phrases les plus courtes couvrant en tout 10 % de la surface du texte, et la borne inférieure du 2e décile. Il indique que les phrases de longueur inférieure ou égale à 18 mots couvrent 10 % du texte et représentent plus du tiers du total des phrases (33,8 %). Le lecteur n’y passe au mieux qu’un dixième du temps de la lecture. Or c’est au-dessus de cette longueur que l’on commence à rencontrer des phrases syntaxiquement complexes. Autrement dit, au mieux, le lecteur de la Recherche se trouve face à des phrases simples pendant un dixième de sa lecture (ou il est face à des phrases plus ou moins complexes pendant les neuf dixièmes !). A l’opposé, 2,2 % des phrases (700) comptent plus de 121 mots (9e décile). Elles couvrent également 10 % du texte, c’est-à-dire la même surface que le tiers évoqué ci-dessus. Cela signifie que le lecteur de la Recherche passe (au moins) autant de temps à lire des phrases très longues – dont la construction est nécessairement complexe – qu’il n’en consacre à la masse des phrases les plus brèves et structurellement simples. Dans cette perspective, la valeur centrale la plus caractéristique est la longueur de la phrase qu’il faut atteindre pour avoir lu la moitié du texte. Pour éviter les confusions, cette seconde médiane est appelée médiale (Ml). Elle correspond à la borne haute du cinquième décile (ou du deuxième quartile). Dans la Recherche, elle est égale à 49,93 mots, soit 50 mots. Le tableau indique que 77,5 % des phrases (près de 8 sur 10) sont inférieures à cette médiale. Autrement dit, le lecteur de la Recherche passe au moins la moitié de son temps confronté à des phrases de 50 mots et plus, ce dont la plupart des lecteurs n’ont guère l’habitude. Malgré le talent de l’écrivain, c’est évidemment cela que les lecteurs retiennent.

4.2 Mesure de l’inégalité
Deuxième méthode : un indice unique mesure l’inégale répartition de la surface du texte entre les phrases (en fonction de leurs longueurs). Deux calculs sont proposés :
- l’écart relatif entre la médiale (49,93) et la médiane (26,28), soit (Ml – Me)/Me = 0,90. Autrement dit, la médiale est de 90 % supérieure à la médiane (pour des comparaisons avec d’autres écrivains, voir l’annexe 2).
Cet écart considérable suffit à attester la prédominance des phrases longues dans la Recherche.
- le second calcul est utilisé en science économique pour étudier la distribution des revenus ou des patrimoines. Il s’agit de l’indice de Gini, qui mesure l’écart entre la situation réelle et celle qui serait observée en cas d’égale répartition du caractère (ici la surface du texte) entre les individus (les phrases) composant le livre. En cas d’équirépartition, toutes les phrases de la Recherche auraient la longueur moyenne (≈ 36 mots). Pour chaque centile, on calcule la proportion de la surface de texte couverte et l’écart par rapport à ce que serait cette surface dans l’hypothèse d’équirépartition. L’indice de Gini est la somme de ces écarts. Ici, il est égal à 55,4 %. Autrement dit, dans la Recherche, les longueurs de phrases s’écartent de plus de 55 % de ce qui serait constaté dans une population homogène.
Le « diagramme de Gini » permet de visualiser cette situation. Les phrases étant rangées par longueurs croissantes, on compte le nombre qu’il faut lire pour atteindre 1 % de la surface (premier centile), puis 2 %, etc. jusqu’à 100 %. Les valeurs observées pour chaque centile sont reportées sur la figure 2, où la diagonale représente l’hypothèse d’équirépartition. L’indice de Gini est la surface comprise entre la diagonale et la courbe. Deux auteurs contemporains, et importants pour M. Proust, sont ajoutés sur le diagramme afin d’en illustrer les propriétés.

Figure 2. Diagramme de concentration (Gini) de la surface de la Recherche sur les phrases longues, comparée à celle de J. Barbey d’Aurevilly et de A. France.

Ce diagramme permet de comprendre pourquoi la médiane ou la moyenne rendent mal compte des distributions fortement asymétriques comme les longueurs de phrase. Par exemple, les deux tiers des phrases ont des longueurs inférieures à la moyenne et pourtant ces phrases ne couvrent qu’à peine plus d’un tiers du texte (34,5 %). La figure 2 montre également que, si les phrases de la Recherche sont singulières par rapport à certains écrivains du XIXe – à commencer par A. France, qui aurait fourni le modèle de Bergotte (Levaillant 1952) –, elles semblent très proches de quelques livres comme Une vieille maîtresse (1851) de Barbey d’Aurevilly, écrivain que Proust cite à plusieurs reprises (Rogers 2000). C’est la dernière question abordée dans cette communication.

5. Singularité de Proust ?
Pour juger de cette singularité : à qui le comparer ? Et comment décider si les écarts constatés sont statistiquement significatifs ? Premièrement, il faut comparer Proust à lui-même. Un de ses ouvrages se trouve dans le domaine public : Les Plaisirs et les jours (1896), dont les valeurs centrales sont indiquées en première ligne dans le tableau 2.

Tableau 2. Caractéristiques des phrases des Plaisirs et les jours comparées à la Recherche

                          Etendue   Mode   Médiane   Moyenne   Médiale   Me/Ml   Gini
  Plaisirs et jours        1-250      7     21,30     27,87     37,16    0,754   0,542
  Recherche                1-931     11     26,28     35,57     49,93    0,900   0,554

Toutes ces valeurs sont significativement inférieures à celles observées dans la Recherche. Cependant, l’indice de Gini indique que le jeune Proust avait déjà tendance à concentrer une proportion importante du texte dans les phrases longues. Deuxièmement, il faut comparer Proust aux auteurs qu’il cite explicitement ou par allusion, non seulement dans la Recherche (Nathan 1968) mais aussi dans ses autres œuvres et dans sa correspondance (Chantal 1967).
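The comparisons of this section (Tableau 2 above, Annexe 2 in appendix) rest on one fixed battery of indicators per work. A self-contained sketch, with two simplifications flagged in the docstring; this is our illustration, not the authors' program:

```python
from collections import Counter

def author_profile(lengths):
    """Columns of Tableau 2 and Annexe 2 for one work, computed from its
    list of sentence lengths. Two simplifications: the plain median is
    used instead of the interpolated one, and the Gini index is the
    standard trapezoidal (per-sentence) version of the per-centile sum
    described above."""
    ordered = sorted(lengths)
    n, total = len(ordered), sum(ordered)
    mode = Counter(ordered).most_common(1)[0][0]
    mediane = ordered[(n - 1) // 2]
    cum, mediale, area = 0, None, 0.0
    for L in ordered:
        prev = cum / total
        cum += L
        if mediale is None and cum >= total / 2:
            mediale = L                          # "seconde médiane" (section 4.1)
        area += (prev + cum / total) / (2 * n)   # slice under the Lorenz curve
    return {"etendue": (ordered[0], ordered[-1]), "mode": mode,
            "mediane": mediane, "moyenne": round(total / n, 2),
            "mediale": mediale,
            "me_ml": round((mediale - mediane) / mediane, 3),  # column "Me/Ml"
            "gini": round(1 - 2 * area, 3)}
```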
Dans la Recherche, Racine et Mme de Sévigné sont les plus cités, puis en seconde position : Balzac et Saint-Simon ; en troisième : Chateaubriand, Hugo, Molière, Musset, Sand et Vigny. La singularité des phrases théâtrales (Labbé & Labbé 2010) ne permet pas de comparer la Recherche (qui est un roman) avec les pièces produites par Molière, Hugo, Musset, Racine ou Vigny. Enfin, il faut le comparer aux autres romanciers contemporains : ont été ajoutés les principaux écrivains du XIXe et du début du XXe - comme Bourget, Giraudoux, Flaubert, Maupassant, Zola – et quelques auteurs moins connus mais singulièrement proches de Proust. L’annexe 2 présente un échantillon des résultats. Chaque écrivain est singulier et parfois les indices peuvent varier selon ses oeuvres. La Recherche se situe dans la partie haute pour tous les indices et notamment pour la propension à concentrer une proportion importante du texte dans les phrases les plus longues (Gini). Cependant, on observe des caractéristiques supérieures à celle de Proust dans quelques œuvres - Huysmans (A rebours), les frères Goncourt (Mme Gervaisais) - ou proches dans Barbey d’Aurevilly, mais aussi dans les Lettres de Mme de Sévigné ou les Mémoires de SaintSimon. 6. Conclusions Lorsque, dans une population – ici les phrases d’un texte -, un caractère (la surface de ce texte) est très inégalement réparti, la moyenne et la dispersion standard sont de peu d’utilité. L’indice statistique le plus éclairant est la seconde médiane ou médiale. Pour mesurer le degré de dispersion de la série autour de cette valeur centrale, de nombreux indices sont concevables, notamment les rapports entre quantiles extrêmes. Cependant, le rapport entre médiane et médiale, ou l’indice de Gini paraissent les plus aptes à donner une indication de la concentration du caractère sur une proportion 408 JADT’ 18 plus ou moins restreinte de la population totale. Ces indices montrent que, durant la majorité du temps, le lecteur de la Recherche se trouve confronté à des phrases très longues (50 mots et plus) et syntaxiquement complexes. Ils confirment que M. Proust a une propension à concentrer une proportion importante du récit dans les phrases les plus longues. Ces conclusions ont été acquises grâce à un dépouillement rigoureux, à des indices statistiques adaptés et à une vaste base de textes traités selon les mêmes procédures. A ce prix, la statistique lexicale peut être une auxiliaire utile de l’analyse littéraire. Enfin, dans une œuvre littéraire, il n’existe pas un type de phrase unique mais plusieurs qui ont chacun leurs particularités lexicales et stylistiques (Monière et al. 2008 ; Labbé & Labbé 2010). Une prochaine publication présentera ces types de phrases avec leurs singularités lexicales, stylistiques et thématiques. Elle répondra aussi à une question pendante : comment déterminer que les écarts entre œuvres et auteurs sont ou non significatifs ? References Brunet E. (1981). La phrase de Proust. Longueur et rythme. Travaux du cercle linguistique de Nice, p. 97-117. Bureau C. (1976). Marcel Proust ou le temps retrouvé par la phrase. Linguistique fonctionnelle et stylistique objective. Paris : PUF, p. 178-231. Curtius E.-R. (1971). Etude de lilas. Le rythme des phrases. In Tadié J.-Y. (dir.). Lectures de Proust. Paris : A. Colin. Milly J. (1975). La phrase de Proust. Des phrases de Bergotte aux phrases de Vinteuil. Paris : Larousse. Ferré A. (1957). La ponctuation de M. Proust. Bulletin de la Société des Amis de Marcel Proust, 7, p 171-192. 
Hatzfeld A., Darmesteter A. et Thomas A. (1898). Dictionnaire général de la langue française du commencement du XVIIe siècle jusqu’à nos jours. Paris : Delagrave.
Labbé C. et Labbé D. (2010). Ce que disent leurs phrases. In Bolasco S., Chiari I. et Giuliano L. (eds), Proceedings of the 10th International Conference Statistical Analysis of Textual Data. Rome : Edizioni Universitarie di Lettere Economia Diritto, vol. 1, p. 297-307.
Labbé D. (1990). Normes de saisie et de dépouillement des textes politiques. Grenoble : Cahiers du CERAT.
Levaillant J. (1952). Note sur le personnage de Bergotte. Revue des sciences humaines, janvier-mars 1952, p. 33-48.
Milly J. (1986). La longueur des phrases dans "Combray". Paris-Genève : Champion-Slatkine.
Monière D., Labbé C. et Labbé D. (2008). Les styles discursifs des premiers ministres québécois de Jean Lesage à Jean Charest. Canadian Journal of Political Science / Revue canadienne de science politique, 41:1, p. 43-69.
Muller C. (1963). Le mot, unité de texte et unité de lexique en statistique lexicologique. Langue française et linguistique quantitative. Genève-Paris : Slatkine-Champion, 1979, p. 125-143.
Nathan J. (1969). Citations, références et allusions de Marcel Proust dans A la recherche du temps perdu. Paris : Nizet (première édition : 1953).
Rogers B. (2000). Proust et Barbey d’Aurevilly. Le dessous des cartes. Paris : Champion.
Serça I. (2010). Les coutures apparentes de la Recherche. Proust et la ponctuation. Paris : Champion.

Annexe 1. Corpus : A la Recherche du temps perdu (Marcel Proust, Paris, Gallimard, 1919-1927)

  Livre                                          Longueur    Vocabulaire   N phrases
  Combray                                          79 906        6 502        1 727
  Un amour de Swann                                84 142        5 859        2 226
  Noms de pays : le nom                            19 434        2 823          374
  Du côté de chez Swann (1919)                    183 482        9 347        4 327
  Autour de Mme Swann                              91 451        6 532        2 511
  Noms de pays : le pays                          134 192        8 283        3 334
  A l’ombre des jeunes filles en fleur (1919)     225 643       10 396        5 845
  Le côté de Guermantes 1                          75 494        6 281        1 903
  Le côté de Guermantes 2, chapitre 1              84 354        6 368        2 781
  Le côté de Guermantes 2, chapitre 2              89 727        6 707        2 700
  Le côté de Guermantes (1920-21)                 249 575        6 707        7 384
  Sodome et Gomorrhe                               13 512        2 476          271
  Sodome et Gomorrhe 2, chapitre 1                 30 699        3 779        2 082
  Sodome et Gomorrhe 2, chapitre 2                117 774        7 822        3 056
  Sodome et Gomorrhe 2, chapitre 3                 57 603        5 311        1 811
  Sodome et Gomorrhe 2, chapitre 4                  8 137        1 373          250
  Sodome et Gomorrhe (1921-22)                    227 725       10 972        7 470
  La prisonnière (1923)                           173 409        9 062        5 124
  La fugitive (1925)                              115 866        6 456        3 255
  Le temps retrouvé (1927)                        152 159        8 708        3 931
  Dernier volume (posthume)                       441 434       13 518       12 310
  Total général (A la recherche du temps perdu) 1 327 859       21 837       37 336

Annexe 2. Longueur des phrases chez quelques écrivains antérieurs ou contemporains de Proust

                             Etendue  Mode  Médiane  Moyenne  Médiale  Me/Ml  Gini
  Recherche                      931    11    26,28    35,57    49,93  0,900  0,554
  Balzac                         391    10    17,27    21,88    29,00  0,680  0,511
  Barbey d’A. (Chevalier)        192     7    21,92    29,4     43,00  0,964  0,557
  Barrès                         195     8    17,86    21,94    28,59  0,601  0,497
  Bourget                        201     7    16,62    21,34    29,58  0,780  0,539
  Chateaubriand (Mémoires)       195    22    24,46    28,5     34,28  0,401  0,437
  Daudet                         203     5    13,14    17,84    25,26  0,923  0,549
  Dumas                          243     7    14,90    20,28    29,00  0,947  0,567
  Flaubert                       231     7    13,75    18,37    25,24  0,837  0,528
  France                         394     8    15,79    19,98    26,06  0,651  0,504
  Gautier*                       282    18    27,11    33,07    41,90  0,546  0,493
  Giraudoux*                     466     4    18,60    25,77    37,76  1,031  0,580
  Goncourt (Gervaisais)          670     8    24,17    34,05    51,47  1,130  0,597
  Goncourt (Journal)             373     3    19,80    25,37    37,62  0,900  0,580
  Hugo*                          828     6    11,39    16,89    23,68  1,079  0,561
  Huysmans (A rebours)           254    28    44,24    51,49    65,82  0,488  0,557
  Maupassant*                    168     6    14,44    18,98    26,39  0,828  0,542
  Musset*                        197    16    19,56    23,82    29,57  0,512  0,485
  Nerval*                        136    12    19,93    24,21    31,27  0,569  0,499
  Saint-Simon                    361    18    27,89    34,15    44,14  0,523  0,506
  Sand (Champi)                  117    21    22,11    26,19    32,56  0,473  0,477
  Sévigné (Lettres)              307    11    25,72    31,99    40,96  0,593  0,490
  Stendhal                       235    18    20,18    23,92    29,79  0,477  0,463
  Vigny*                         315    17    20,82    27,47    37,41  0,797  0,538
  Zola                           153     8    15,80    19,91    25,66  0,624  0,491
  * Uniquement les romans

Verso un dizionario corpus-based del lessico dei beni culturali: procedure di estrazione del lemmario
Ludovica Lanini1, María Carlota Nicolás Martínez2
1 Università degli Studi di Roma La Sapienza – ludovica.lanini@uniroma1.it
2 Università degli Studi di Firenze – cnicolas@unifi.it

Abstract
The vocabulary of Italian cultural heritage has become a crucial object of interest for different categories of users from a number of countries. However, there are no satisfactory multilingual lexical resources available. The present work moves in that direction. The aim of the paper is twofold: on the one hand, it describes the LBC database, a resource for developing a multilingual electronic dictionary of cultural heritage terms, made up of comparable corpora from nine languages; on the other hand, a corpus-based method for building a comprehensive headword list is proposed.

Keywords: electronic lexicography, multilingual lexical resources, corpus linguistics

1. Introduzione
Di fronte a un interesse crescente, a livello internazionale, per il lessico italiano dei beni culturali, emerge oggi l’esigenza, da parte di diverse categorie di utenti, di risorse elettroniche multilingui relative al patrimonio culturale; nonostante ciò, allo stato attuale, non sono disponibili strumenti multilingui adeguati. Il progetto LBC (Lessico dei Beni Culturali) si propone di affrontare il problema, sviluppando una banca dati testuale comprendente corpora specialistici e comparabili per nove lingue (cinese, francese, inglese, italiano, portoghese, russo, spagnolo, tedesco, turco). Fine ultimo è la creazione di un dizionario multilingue del lessico dei beni culturali a base testuale, che abbia come principali utenti studiosi del settore, ma anche traduttori e operatori turistici. L’approccio corpus-based viene applicato sin dal processo di definizione del lemmario, focus specifico del contributo.

2. La Banca dati LBC
La Bd-LBC (Banca dati LBC) è un database testuale multilingue progettato per essere rappresentativo del lessico dei beni culturali: per il suo disegno si è considerato l’italiano quale punto di partenza, ma si è pensato anche al valore aggiunto derivante dalla possibilità di stabilire relazioni tra le diverse lingue.
L’italiano viene scelto come punto di riferimento in virtù della sua 412 JADT’ 18 centralità nello sviluppo storico del lessico dei beni culturali; molti testi non italiani relativi a tale dominio hanno inoltre lo sguardo rivolto proprio verso le tecniche e i monumenti realizzati in Italia. La prima fase di lavoro, dedicata alla raccolta dei materiali, è partita dunque dai testi italiani che sono alla base della storia dell’arte e dalle relative traduzioni, ma anche da opere in altre lingue, applicando una metodologia di studio che facesse leva sulle potenziali sinergie plurilingui. Per dare fondamento alla struttura del corpus (Cresti et Panunzi 2013:57), la rappresentatività della risorsa è stata definita fin dall’inizio attraverso dei criteri di campionamento dei testi (Billero et Nicolás 2017: 208): «la rilevanza storico-culturale dell’opera dell’ambito specifico di studio (ad es. testi di Vitruvio o Leonardo); la diffusione internazionale di un’opera relazionata con l’ambito di studio (es. libri di Vasari); il prestigio dato a livello internazionale al patrimonio italiano da parte di un’opera (es. testi di Stendhal o Ruskin); la specificità dell’argomento in rapporto alla storia dell’arte italiana ed in particolare della Toscana (es. Burckhardt) ». Si è in questo modo delimitato un nucleo di testi di base condivisi tra lingue, tale da rendere il corpus parzialmente parallelo, cui si sono aggiunti via via testi peculiari per ogni lingua. La progettazione del database ha previsto inoltre una macrostruttura omogenea per i diversi corpora, che condividono i metadati associati a ogni testo, a partire dai quali viene generato automaticamente un nome di file univoco. Per quanto riguarda la microstruttura, la regola fondamentale è stata quella di rispettare il testo originale, mantenendo eventuali note, divisione in capitoli e tratti ortografici arcaici. Seguendo tali regole strutturali, ogni squadra di lavoro, specificamente rivolta a una delle lingue, ha avviato lo sviluppo dei singoli corpora (Corpus LBC-francese, Corpus LBC-inglese, etc.), sottoposti a un’operazione di validazione della digitalizzazione da parte di professori e studenti competenti nelle diverse lingue. La banca dati, così disegnata, presenta un’omogeneità in grado di favorire il lavoro lessicografico: la forte coesione strutturale tra corpora permette infatti di operare davvero in parallelo. Tra gli obiettivi del progetto vi è anche quello di implementare strumenti informatici di gestione e interrogazione dei corpora, che consentano ai membri del gruppo di effettuare ricerche ed estrarre dati sull’uso lessicale, fondamentali per lo svolgimento del lavoro lessicografico. Si è dunque realizzato un software online, per ora accessibile ai soli membri dell’unità di ricerca, ma in prospettiva disponibile anche per gli utenti, che consenta la consultazione dei corpora, sia in chiave monolingue che multilingue. Nella ricerca di soluzioni per l’implementazione di un’installazione del corpus su apposito server Internet, si è optato per l’ultima release di NoSketchEngine, versione open source di Sketch Engine. JADT’ 18 413 3. Il dizionario LBC: processo di definizione del lemmario La banca dati, così elaborata, si pone quale risorsa di base per lo sviluppo di un dizionario elettronico multilingue del lessico dei beni culturali, che possa risultare strumento utile soprattutto in ambito traduttivo e turistico. 
In vista della particolare utenza e applicazione, l’intento è quello di fornire una risorsa lessicografica che presenti le seguenti caratteristiche: - trattamento dei lemmi più “problematici” del dominio, con inclusione a lemma di nomi propri ed espressioni multiparola, categorie lessicali generalmente assenti dalle risorse, tuttavia di particolare rilevanza in virtù delle difficoltà traduttive e del forte carico culturale; - attenzione per l’aspetto più prettamente pratico e referenziale del lessico della cultura, con apertura a quelle voci di arti e mestieri tradizionalmente trascurate dalla lessicografia italiana, nonché interesse rivolto alle persone, alle opere e ai luoghi fisici della storia culturale, più che al carattere teorico e mentale (Harris, 2003) ed estetico generale (De Mauro, 1971) che ha a lungo connotato il lessico artistico, in particolare quello della critica d’arte; - inclusione non solo di nomi, ma anche di verbi, di norma esclusi dalle risorse terminologiche, qui ritenuti di interesse per rendere conto di tecniche e pratiche; - impianto corpus-based, non solo per la selezione, descrizione e traduzione dei lemmi, con individuazione degli equivalenti a partire dall’analisi di concordanze bilingui, ma anche per l’offerta all’utente, entro la scheda lessicografica, di esempi e citazioni testuali reali. L’approccio corpus-based viene adottato sin dal processo di definizione del lemmario, sviluppato a partire dal corpus LBC-italiano. Il metodo proposto prevede la combinazione di tre ordini eterogenei di dati: dato lessicografico; dato testuale quantitativo; dato testuale qualitativo. Il dato di origine lessicografica, assunto sullo sfondo a frame di riferimento, viene dunque incrociato con il dato testuale, tanto di livello quantitativo keyword e liste di frequenza- quanto di livello qualitativo -prodotto di ricerche mirate su corpus e di osservazione dei contesti. Per quanto riguarda le risorse adottate, la fonte lessicografica scelta è il Grande Dizionario Italiano dell’Uso (De Mauro, 2007), la più estesa risorsa lessicografica esistente per la lingua italiana, mentre alla banca dati LBC viene affiancato, quale corpus generale di riferimento, il corpus Paisà (www.corpusitaliano.it), costruito nel 2010 tramite web-crawling e raccolta mirata di documenti da specifici siti web, per un totale di 250 milioni di token, inteso come rappresentativo della lingua e cultura comune contemporanea (Lyding et al., 2014). Indirettamente, viene assunto come corpus di riferimento anche itTenten16, il corpus per la lingua italiana implementato in Sketch Engine, interamente raccolto tramite web-crawling nel 2016 414 JADT’ 18 (5.864.495.700 token). Riguardo agli strumenti impiegati, l’adozione di un software di corpus management e query all’avanguardia come Sketch Engine (www.sketchengine.co.uk) risulta infatti cruciale per il processo di lavoro, descritto di seguito nel dettaglio. 3.1 Fasi di lavoro La prima operazione è consistita nell’estrazione dal corpus LBC di una lista di parole chiave (2000), applicando la funzione keywords di Sketch Engine: le keyword vengono ordinate in base al keyness score, dato dal rapporto tra la frequenza normalizzata della parola nel focus corpus (LBC) e la sua frequenza normalizzata in un corpus generale (itTenten16), previa applicazione di una costante, denominata simple math parameter1 (Kilgariff et al., 2014). 
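The keyness score just described reduces to one smoothed ratio of normalised frequencies. A minimal sketch, in which focus_fpm and ref_fpm are assumed to be dictionaries of frequencies per million tokens (our names, not Sketch Engine's API):

```python
def keyness(word, focus_fpm, ref_fpm, k=100):
    """'Simple maths' keyness score (Kilgariff et al., 2014, cited above):
    ratio of the word's normalised frequency (per million tokens) in the
    focus corpus (here LBC) and in the reference corpus (itTenten16),
    smoothed by the constant k; k=100 is the value the authors apply."""
    return (focus_fpm.get(word, 0.0) + k) / (ref_fpm.get(word, 0.0) + k)

# Ranking the vocabulary by decreasing keyness yields the kind of
# 2000-item candidate list returned by Sketch Engine's keywords function:
# top = sorted(vocab, key=lambda w: keyness(w, lbc_fpm, ittenten_fpm),
#              reverse=True)[:2000]
```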
Alla lista delle keyword è stata affiancata la lista di matrice lessicografica, estratta dal Gradit selezionando l’insieme dei lemmi etichettati con marca [TS] (tecnico-specifico) per arte, pittura, scultura e architettura, per un totale di 2515 lemmi, di cui molti (370) multiparola. In maniera inattesa, dal confronto tra le due liste emergono solo 24 coincidenze. Risultando poco pulita, la lista delle keyword è stata sottoposta a uno spoglio manuale, che ha ridotto i 2000 lemmi candidati a 219, primo vero lemmario di base (comprendente nomi propri come Mantegna, arcaismi come fregiatura, tecnicismi come nicchia). Si è proceduto a questo punto a una serie di confronti, a partire dalla lista di frequenza lemmatizzata del corpus LBC, come sintetizzato in Tabella 1. L’incrocio con la lista del Gradit ha restituito 272 lemmi comuni, di cui 235 sono stati accolti previo controllo. Il lavoro di confronto con il corpus generale Paisà ha seguito invece due linee di sviluppo: lo studio dei lemmi caratterizzati da più alta differenza di frequenza relativa con peso maggiore in LBC (i primi 600), da cui sono emersi 77 lemmi di interesse (figura, Firenze, Raffaello) e lo spoglio dei lemmi presenti in LBC ma non in Paisà, che ha permesso di individuarne 62 (tecnicismi come scalea e imbasamento, numerosi arcaismi e varianti arcaiche come scarpellino, Florenzia, Buonarruoto). L’insieme delle voci della lista Gradit assenti in LBC (ben 2243) è stato inoltre sottoposto a un esame puntuale, che ha portato ad aggiungere al lemmario 1629 lemmi2. Il corpus LBC è in effetti in fase di sviluppo, per cui molte aree di interesse (per esempio il dominio dell’arte contemporanea) non risultano ancora adeguatamente rappresentate: la lista del Gradit può offrire in questa direzione materiali utili, in attesa dell’ampliamento del corpus. Dalla convergenza dei lemmi accolti è stato così possibile arrivare alla definizione di un primo lemmario, per un totale di 2147 lemmi.

1 A seconda dei bisogni dell’utente e della natura dei corpora, la costante può essere modificata per restituire una lista con candidati a frequenza maggiore o minore, con 100 come valore consigliato per ottenere parole del vocabolario core e rumore minimo, qui applicato.
2 Non si sono accolti: lemmi astratti, propri della critica d’arte (asemanticità); lemmi riferiti a movimenti e tendenze generali (astrattismo); aggettivi o avverbi. Si

Tabella 1

  Risorse (lemmi)                            Lemmi di interesse                                         Lemmi estratti   Lemmi accolti
  Lista LBC (8388) / Lista Gradit (2515)     Lemmi comuni                                                    272              235
  Lista LBC (8388) / Lista Paisà (1032178)   Lemmi con differenza di frequenza relativa significativa        600               77
  Lista LBC (8388) / Lista Paisà (1032178)   Lemmi presenti in LBC assenti in Paisà                         1139               62
  Lista keywords                                                                                            2000              219
  Lista Gradit (assenti in LBC)                                                                             2243             1629
  TOT. 2222 (-75 lemmi ripetuti) = 2147

[…]

… (W, H ≥ 0) (Lee and Seung, 1999; Berry et al., 2007; after Paatero & Tapper, 1994; see also Gaujoux, 2010). In the topic modeling context, the main output of NMF is a set of topics characterized by lists of words (software ‘scikit-learn’ [Python] by Grisel O., Buitinck L., Yau C.K.; in: Pedregosa et al., 2011). LDA (Latent Dirichlet Allocation) (Blei et al., 2003; Griffiths et al., 2007) is a generative statistical model (involving unobserved topics, words and documents) devised to uncover the underlying semantic structure of a collection of texts (documents, supposed to be a mixture of a small number of topics). The method is based on a hierarchical Bayesian analysis of the texts (package R: ‘topicmodels’, and software ‘scikit-learn’ [Python]).
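Both methods being available in scikit-learn, as stated above, the shape of such a run can be sketched as follows. Apart from the topic counts and the word-frequency threshold of 2 used in the study, every parameter is an illustrative default rather than the authors' setting:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

def nmf_lda_topics(docs, n_topics=10, n_words=15):
    """Fit NMF (on tf-idf weights) and LDA (on raw counts) and return,
    for each method, the n_words highest-weighted words per topic."""
    tfidf = TfidfVectorizer(min_df=2)          # word-frequency threshold: 2
    nmf = NMF(n_components=n_topics,
              random_state=0).fit(tfidf.fit_transform(docs))
    counts = CountVectorizer(min_df=2)
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    random_state=0).fit(counts.fit_transform(docs))
    topics = {}
    for name, model, vec in (("NMF", nmf, tfidf), ("LDA", lda, counts)):
        vocab = vec.get_feature_names_out()
        topics[name] = [[vocab[i] for i in comp.argsort()[::-1][:n_words]]
                        for comp in model.components_]
    return topics
```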
At this stage, we have limited our investigation to six techniques out of a great number of approaches likely to identify topics. Among these approaches let us mention the direct use of CA without fragmentation of the texts, the techniques of clustering (used in FCA and LOA) which contain many more methods and variants, the already mentioned Alceste methodology (Reinert, 1986). The present piece of research evidently needs to be extended. In fact, each method involves also a series of parameters (threshold of frequency for the words; preprocessing options such as lemmatization/stop words; size of fragments or context units, number of iterations). The following experiment limited to six methods will be tersely summarized. A thorough investigation would need many more pages. 4. Excerpts from the list of 49 topics (limited to two topics per method) The number of topics detected by each of the six selected methods varies between six and ten. Only two topics are printed below for each method. 4.1 Rotated Factor Analysis (Rotation Oblimin). (2 topics out of 6) RFA1 eyes see bright lies best form say days RFA2 beauty false old face black now truth seem 4.2 FCA (Fragmented Correspondence Analysis) (2 topics out of 7) FCA1 beauty truth muse age youth praise old eyes glass long seen lies false time days FCA2 night day bright see look sight 4.3 Logarithmic Analysis (Spectral mapping) (2 topics out of 8) LOA1 summer away youth sweet state hand seen age rich beauty time hold nature death LOA2 pen decay men live earth verse muse once life hours make give gentle death 4.4 Latent Semantic Analysis (2 topics out of 8) LSA1 time heart beauty more one eyes eye now myself art still sweet world LSA2 end grace leave words lie spirit change shame self could ever decay write 4.5 NMF topics (2 topics out of 10) NMF0: love true new hate sweet dear say prove lest things best like ill let know fair soul NMF1: beauty fair praise art eyes old days truth sweet false summer nature brow black live 4.6 Latent Dirichlet Allocation LDA (2 topics out of 10) LDA0 summer worse praise nature making time like increase flower let copy JADT’ 18 441 rich year die LDA1 sing sweets summer hear love music eyes bear single confounds prove shade eternal. 5. A synthesis of produced topics How to compare the complete lists of topics, since neither the order of topics, nor the order of words within a topic are meaningful? We deal here with real ‘bags of words’ exemplified by the excerpts of lines in section 4. We will add the eight a priori themes defined in table 1. Each a priori theme corresponds to a subset of sonnets. That subset will be described by its characteristics words. We can then perform a clustering of these 57 topics/themes (49 + 8). The technique of additive trees (Sattath and Tversky, 1977; Huson and Bryant, 2006) seems to be the most powerful tool for synthesizing in compact form these 57 topics/themes (figure 2). Let us recall one important property of additive trees: the real distance between two points can be read directly on the tree as the shortest path between the two points. Ideally, we expect to find a tree with as many branches as there are real topics in the corpus, each branch of the additive tree being characterized by seven labels: six labels corresponding to the six methods briefly described above, plus one label corresponding to one a priori theme. Such situation occurs when each method has uncovered the same real topics. 
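The distance underlying this synthesis, per the caption of Figure 2 below, is the chi-square distance between the lexical profiles of the topics/themes. A sketch, assuming a topics-by-words count matrix; the additive tree itself is then computed with SplitsTree and is not reproduced here:

```python
import numpy as np

def chi2_distances(F):
    """Chi-square distances between lexical profiles. F is the
    topics x words matrix of word counts (57 rows in the present case);
    the resulting distance matrix is what an additive-tree program is
    then fed with."""
    F = np.asarray(F, dtype=float)
    profiles = F / F.sum(axis=1, keepdims=True)   # row (lexical) profiles
    masses = F.sum(axis=0) / F.sum()              # column masses f.j / f..
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2 / masses).sum(axis=2))
```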
The observed configuration is not that good, but we can distinguish between six and nine main branches, which is probably the order of magnitude of the number of different topics. We note also that several different methods often participate in the same branch, which suggests that the branch corresponds to a real topic discovered by almost all six methods. Let us mention that a similar additive tree performed on the 49 topics (not involving the eight a priori themes) produces approximately the same branches. Thus, the eight a priori themes can be considered here as illustrative elements, serving only as potential identifiers of the branches. It is remarkable that the eight a priori themes (boxed labels) are well distributed over the whole of Figure 2. If we except the branch of the tree located in the upper right part of the display, to the right of the label “Young man”, all the main branches have as a counterpart one of the a priori themes. As an example of interpretation, the branch in the lower center part of Figure 2, [NMF7, LOA4, RFA3, LDA7, LSA5], is clearly linked to the a priori topic named Rivalry (see section 2.2), with a concurrence of five methods out of six. Most of the branches of the additive tree could be interpreted likewise. The upper right branch, identified by none of the a priori themes, may represent an unforeseen topic. More research and expertise in Elizabethan poetry are required to confirm that we are dealing here with an undetected new theme. To conclude, we can only observe that each of the methods involved, be it ancient or modern, may contribute to detecting topics… and that exploratory tools are essential to visualize the complexity of the process and to assess the obtained results.

Figure 2. Additive tree describing the links between the 49 topics provided by the 6 selected methods and the 8 a priori themes. The identifiers are those of section 4 for the 6 selected methods: the first three letters indicate the method, followed by the index of the produced topic. The distance between two topics is the chi-square distance between their lexical profiles. Threshold of frequencies for words: 2. The boxed identifiers of the a priori themes are those (possibly shortened) of table 1.

References
Alden R.M. (1913). Sonnets and a Lover's Complaint. New York: Macmillan.
Berry M.W., Browne M., Langville A.N., Pauca V.P. and Plemmons R.J. (2007). Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1): 155-173.
Blei D., Ng A. and Jordan M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993-1022.
Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K. and Harshman R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407.
Garnett J.-C. (1919). General ability, cleverness and purpose. British Journal of Psychology, 9: 345-366.
Griffiths T.L., Steyvers M. and Tenenbaum J.B. (2007). Topics in Semantic Representation. Psychological Review, 114(2): 211-244.
Huson D.H. and Bryant D. (2006). Application of Phylogenetic Networks in Evolutionary Studies. Molecular Biology and Evolution, 23(2): 254-267. Software available from www.splitstree.org.
Kazmierczak J.-B. (1985). Analyse logarithmique : deux exemples d'application. Revue de Statistique Appliquée, 33(1): 13-24.
Lee D.D. and Seung H.S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401: 788-791.
Lebart L.
(2012). Articulation entre exploration et inférence. In : JADT_2012. Dister A., Longree D., Purnelle G., Editors. Presse Universitaire de Liège. Lewi P.J. (1976). Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim. Forsch. in: Drug Res. 26, 1295-1300. Paterson D. (2010). Reading Shakespeare Sonnets. Faber & Faber Ltd. London. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot M. and Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research , 12, 2825-2830. Reinert, M. (1986). Un logiciel d’analyse lexicale: [ALCESTE]. Cahiers de l’Analyse des Données, 4, 471–484. Sattath S. and Tversky A. (1977). Additive similarity trees. Psychometrika, vol. (42), 3: 319-345. Shakespeare, W. (1901). Poems and sonnets: Booklover's Edition. Ed. The University Society and Israel Gollancz. New York: University Society Press. Shakespeare Online. Dec. 2017. Spearman C. (1904). General intelligence, objectively determined and measured. Amer. Journal of Psychology, 15, 201-293. Gaujoux R.et al. (2010). A flexible R package for nonnegative matrix factorization. In: BMC Bioinformatics 11.1 (2010): 367. Thurstone L. L. (1947). Multiple Factor Analysis. The Univ. of Chicago Press, Chicago. 444 JADT’ 18 Analyse Diachronique de Corpus : le cas du poker Gaël Lejeune1, Lichao Zhu2 1 STIH, Sorbonne Université – gael.lejeune@sorbonne-universite.fr 2 LLSHS, Université Paris XIII – lichao.zhu@univ-paris13.fr Abstract In this paper we will investigate a diachronic corpus. We want to highlight how people’s mentalities evolve regarding the gambling especially the poker game and how the evolution is correlated with the way that the game is considered in press articles. We study plain or metaphorical meanings of the terms in question by using clustering and statistical methods in order to detect changes of meanings in a relatively large period of time. Résumé Dans cet article nous nous intéressons à l'étude diachronique de corpus de presse dans le but d'illustrer des évolutions dans la vision de la société sur les jeux d'argent et de hasard ainsi que sur les joueurs. Nous utilisons des méthodes de statistique textuelles et de clustering pour détecter les grandes tendances visibles sur noter échelle de temps en nous focalisant sur le poker . Nous montrons que si le regain de popularité du jeu de poker se traduit par un traitement médiatique plus important, les métaphores exploitant la notion de poker restent très fréquentes. Keywords: analyse diachronique, corpus, jeux d'argent et de hasard 1. Introduction L'analyse diachronique de corpus opère sur un champ assez large. Nous pouvons en juger par exemple en observant les nombreux travaux sur l'évolution des langues, travaux qui passionnent aussi bien la communauté scientifique (Dediu & de Boer 2016) que les médias si l'on se fie par exemple à l’intérêt renouvelé porté par ceux-ci sur l’évolution des dictionnaires. Dans le champ purement scientifique, les intérêts dans le domaine embrassent tous les niveaux de l'analyse linguistique même si la morphologie (Macaulay 2017) et le lexique (la néologie par exemple chez Gérard et al. 2014). La sémantique est un autre aspect des études diachroniques notamment pour étudier les représentations mentales des locuteurs (Hamilton et al. 2016). 
Le travail présenté ici s'intéresse à une autre catégorie de représentations mentales qui est l'image que certaines activités ludiques peuvent prendre au cours du temps. Nous nous intéressons ici à un jeu d'argent et de hasard qui JADT’ 18 445 a connu une sorte de nouvelle jeunesse ces dernières années : le jeu de poker. Dans ce travail, nous nous inspirons de l’analyse de l’usage du lexique dans (Hamilton et al. 2016), nous souhaitons examiner l’évolution de l’usage d’un mot, d’un terme particulier au cours du temps. Ce travail, même si notre ambition est moins large, peut se rattacher aux études sur la néologie sémantique (Sablayrolles 2002) ou néosémie (Rastier et Valette 2009). Pour illustrer l’intérêt que représente le poker en tant que phénomène de société, nous pouvons considérer le retentissement autour du Moneymaker Effect1 ou encore cette citation du journal Le Monde daté du 22 janvier 2007 qui illustre le changement d’image de ce jeu: « Considéré il y a encore peu de temps comme un jeu sulfureux se jouant dans les arrière-salles de bars louches ou dans des appartements huppés à l'abri des regards indiscrets, le poker fait une entrée en force à la télévision ». En particulier, dans sa variante à la mode Texas Hold'Em, le poker est redevenu un jeu dont on parle et dont on parle plutôt positivement. Notre objectif est d’une part de mesurer à quel point ce regain d’attention a pu se traduire par une amélioration de l’image du jeu de poker en général. D’autre part, il s’agit de voir dans quelle mesure les usages métaphoriques du terme poker, plutôt connotés “négativement” (poker menteur, coup de poker2…) ont pu évoluer conjointement à cette plus grande popularité du jeu lui même. Dans la section 2 nous présenterons le corpus que nous avons constitué pour cette étude. Puis, nous proposerons dans les deux sections suivantes une analyse statistiques des prédicats puis une analyse sous forme de clustering. Enfin, nous présenterons nos conclusions et perspectives. 2. Présentation de notre corpus d’étude De manière à pouvoir s’affranchir des variations de choix éditoriaux entre journaux, nous nous avons souhaité nous concentrer sur une seule publication. Nous avons choisi le Monde ce qui nous permettais d’exploiter des articles dont la publication s’étalent sur 30 ans : 1988-2017. Pour la partie 1988-2005 nous avons utilisé le corpus du monde distribué par ELRA3, nous avons restreint aux textes contenant le terme poker. Pour les années 2006 à 2017 nous avons extrait d’Europresse4 les articles qui comportait le terme poker. Dans les deux cas nous avons considéré toutes les variantes possibles dans la casse. Nous avons ainsi obtenu 3528 textes dont la répartition dans le Par exemple : http://www.slate.com/articles/news_and_politics/explainer/ 2011/06/the_moneymaker_effect.html 2 Dans le sport par exemple, on remarque des contextes de « tentative désespérée », « dernière chance » ... 3 http://catalog.elra.info/product_info.php?products_id=438&language=fr 4 http://www.europresse.com/fr/ 1 446 JADT’ 18 temps est présentée Figure 1. Nous pouvons observer que le nombre d’articles a connu une chute entre 2005 et 2006. Ceci semble être dû au fait que nous passions à ce moment précis d’une étude du corpus complet du monde tel qu’existant auprès d’ELRA à une étude fondée sur la base Europresse. De fait, sur nos critères de recherche, la base Europresse ne totalise que 47 articles pour 2003 (contre 129 dans le corpus ELRA), 62 pour 2004 (contre 117) et 67 articles pour 2005 (contre 117). 
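The selection and counting step described in this section amounts to a case-insensitive match and a per-year tally. A hedged sketch: the (year, text) input format is our assumption, not the actual layout of the ELRA or Europresse exports.

```python
import re
from collections import Counter

POKER = re.compile(r"poker", re.IGNORECASE)   # all case variants of the term

def poker_articles(articles):
    """Keep the articles containing the term, tally them per year, and
    report the share of selected articles with a single occurrence.
    `articles` is assumed to be an iterable of (year, text) pairs."""
    kept = [(year, len(POKER.findall(text)))
            for year, text in articles if POKER.search(text)]
    per_year = Counter(year for year, _ in kept)
    single = sum(1 for _, n in kept if n == 1)
    return per_year, single / max(len(kept), 1)
```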
Les contraintes respectives d‘utilisation de ces deux sources de données nous ont interdit de pouvoir disposer d’un corpus dont la constitution soit constante. Nous nous sommes efforcés de s’affranchir de ce biais en adaptant notre méthodologie (notamment le clustering). Figure 1 : Répartition du nombre d'articles par année Nous avons 4353 occurrences du terme recherché, leur répartition est instructive (Figure 2) : la très grande majorité des articles (2834/3528 soit 80,33%) ne comporte qu’une seule occurrence. Nous pensons que ceci est le reflet de deux tendances. D’une part le sujet de l’article est rarement le poker pour lui même, il est question d’un personnage qui par ailleurs joue au poker par exemple. D’autre part, cette rareté de la répétition révèle un usage massivement métaphorique, en effet comme l’a montré (Lejeune 2013) une métaphore perd de sa force en étant répétée. Si un terme est répété, il est très probable qu’il soit employé dans son sens plein. Si cette observation était faite sur des noms de maladies infectieuses, il nous semble que ceci est avant tout lié au genre de texte et que cela s’applique également ici. Si nous allons un peu plus loin, nous pouvons faire l’hypothèse que la métaphore peut être filée, mais qu’elle est rare dans les articles expositifs. D’autre part, dans le cas peu probable d’une métaphore filée, les conventions stylistiques impliquent de changer le terme employé, le journaliste utilisera plutôt des termes du même champ lexical. JADT’ 18 447 Figure 2 : Répartition des d'articles selon le nombre d’occurrences du terme « poker » La répartition des articles entre ceux qui comportent une et une seule occurrence et ceux qui en comportent plusieurs montre des variations importantes dans le temps (Figure 3). Si l’on observe des périodes de 5 ans, on peut se rendre compte que le nombre d’articles comprenant plusieurs occurrences de “poker” représente 15% des articles sélectionnés sur la période 1988-1992, se pourcentage descend à 10% jusqu’en en 2003 puis remonte progressivement pour finalement rester au-dessus de 20% à partir de 2004-2008 avec une pointe à 30% pour les périodes 2007-2011 à 2009-2013. Figure 3 : Répartition par année des articles selon le nombre d’occurrences 3. Prédicats et séquences figées Dans la théorie linguistique lexique-grammaire de M. Gross (1975) et de G. Gross (2012), les prédicats sont considérés comme les noyaux d’une phrase capables de disposer d’arguments, grâce à leurs propriétés transformationnelles et distributionnelles. Parmi les apports de cette théorie figurent le « schéma d’arguments » et les « prédicats appropriés ». Nous relevons dans notre corpus les contextes gauches et droits des séquences 448 JADT’ 18 figées « partie de poker » et « coup de poker » afin de distinguer leurs emplois métaphoriques et non métaphoriques. Ce travail est fait en étudiant le premier verbe précédant ou suivant l’expression (sans remonter au-delà d’une phrase). Nous montrons dans les tableaux 1 et 2 les 20 verbes les plus fréquents pour chaque contexte se trouvent le plus fréquemment dans ces contextes (20 dans les contextes gauches, 20 dans les contextes droits). 
Tableau 1 : Effectif des verbes dans le contexte gauche de « [partie|coup] de poker » : être (76), jouer (62), faire (15), tenter (14), gagner (11), avoir (11), ressembler (10), prendre (9), tenir (8), lancer (8), perdre (7), voir (6), partir (6), engager (6), agir (5), réussir (4), livrer (4), remporter (3), organiser (3), mener (2).

Tableau 2 : Effectif des verbes dans le contexte droit de « [partie|coup] de poker » : être (98), avoir (75), jouer (16), pouvoir (13), devoir (8), gagner (7), engager (7), venir (6), livrer (6), faire (5), vouloir (4), voir (4), tenter (4), tenir (4), réussir (4), prendre (4), monter (4), bluffer (4), aller (4), retrouver (3).

Hormis les verbes « être » et « avoir », qui sont susceptibles d’être des verbes auxiliaires ou semi-auxiliaires, pour les autres verbes on peut se trouver dans trois cas de figure :
a) verbe support ;
b) prédicat approprié : le sens littéral de l’expression peut être activé ;
c) prédicat non approprié : le sens métaphorique de l’expression est activé.
Le cas des verbes supports n’est pas pertinent pour notre étude. Pour le second cas, nous observons que le verbe jouer, prédicat approprié pour les deux séquences décrites, est très souvent lié à un usage métaphorique. Dans le troisième cas, de loin le plus fréquent, les verbes « tenter », « s’engager », « réussir », « mener », « lancer », voire « remporter », ne sont pas tout à fait congruents avec le sens premier de la séquence, c’est-à-dire qu’ils ne sont pas des prédicats appropriés au sens propre du jeu de poker. Des occurrences de ces verbes dans le corpus confirment cette intuition :
Il leur fallait lancer la partie de poker que Bonn et Paris s’apprêtent à jouer sur le GATT (1993)
les enjeux de la partie de poker qui s’engagera mercredi à la mi-journée lorsque l’ambassadeur [...] (2017)
[ils] avaient pu croire un moment que leur coup de poker allait réussir. (1989)
[Celui qui] est davantage connu pour ses coups de poker financiers continue à mener sa stratégie (2015)
Elle venait de remporter la partie de poker menteur qui constitue l’essentiel des premiers hectomètres. (1995)

4. Étude des champs lexicaux par clustering
Si les séquences « partie de poker » et « coup de poker » sont ambiguës dans le sens où elles figurent dans des champs lexicaux différents, on peut se demander ce qu’il en est des champs lexicaux du terme « poker » en général. Pour étudier cette question, nous avons réalisé un clustering de notre corpus. Nous avons utilisé l’implantation des k-moyennes (K-means) de la bibliothèque Python scikit-learn. Nous avons fixé le nombre de clusters K à 10 (selon la méthode du coude, ou elbow method, la valeur optimale se situait entre 9 et 12) et le nombre maximal d’itérations à 400 ; la mesure des poids est le tf-idf. Nous avons extrait tous les n-grammes de mots avec n allant de 1 à 3, puis seulement nous avons utilisé une stop-list. De sorte que, par exemple, « de » n’était pas gardé en tant que tel mais que nous le retrouvions dans « coup de poker » ou « loi de Robien ». Nous avons tout d’abord travaillé sur le corpus lemmatisé, puis nous avons observé que les résultats étaient semblables sans lemmatisation ; nous avons donc supprimé ce pré-traitement (une esquisse de cette chaîne de traitement est donnée à la fin de cette section). Nous allons maintenant décrire chaque cluster en donnant la proportion du corpus qu’il couvre ainsi que les 10 termes les plus significatifs.
Cluster 0, « sport et poker 1 » : 3,1 % (club, football, équipe, Ligue, France, championnat, saison, joueurs, OM, Marseille). Ce cluster comporte deux volets : l’un sur les « coups de poker » dans les championnats de football et l’autre où il est question des championnats de poker eux-mêmes.
Cluster 1, « politique » : 18,79 % (ministre, président, politique, gouvernement, pays, État, premier ministre, premier, États, faire). Un cluster autour de l’action politique, notamment au niveau européen. Un exemple intéressant de métaphore (filée) ici : « M. Erdogan remet tout en jeu, comme un joueur de poker fait tapis ».
Cluster 2, « fourre-tout » : 38,01 % (être, bien, film, vie, entre, Jean, monde, France, temps, homme). Le seul de nos clusters qui n’ait pas d’unité ni de tendance thématique ; ici, les expressions contenant poker sont pour moitié métaphoriques.
Cluster 3, « culture_1 » : 5,13 % (film, Booker Prize, roman, prix, livres, livre, littéraire, base, prix littéraire, attribué). Ce cluster rassemble les livres ayant trait au poker ; les expressions liées sont prises dans leur sens littéral (l’expression « coup de poker » y est quasi absente).
Cluster 4, « finance » : 4,2 % (Vivendi, marché, groupe, Bourse, marches, actionnaires, titres, taux, millions, fonds, terme, milliards, prix). Il se caractérise uniquement par des thématiques associées au domaine de la finance et notamment aux coups de poker boursiers.
Cluster 5, « sport et poker 2 » : 5,04 % (Coupe, match, équipe, joueurs, France, club, football, finale, francs, PSG). Nous avons ici un cluster sur le sport, où environ la moitié des articles concernent toutefois le poker lui-même.
Cluster 6, « industrie du poker » : 12,96 % (jeux, paris, ligne, marché, milliards, euros, millions, Internet, dollars, Bourse). Ici nous avons tout ce qui est lié à l’industrie du poker et notamment à l’essor des jeux d’argent sur Internet (dont le poker a été un fer de lance).
Cluster 7, « sport » : 3,26 % (Tour, numéros, France, coureur, étape, peloton, course, équipe, Tour de France, maillot). Nous avons ici des usages, massivement métaphoriques, dans le domaine du sport (principalement le cyclisme). Un exemple avec le terme spécialisé flop : « [P.A. Bosse] avait trouvé cette image [...] : Si on compare le 1500 m au poker, il a un flop d’avance. »
Cluster 8, « culture_2 » : 7,14 % (blues, musique, CD, rock, John Lee Hooker, jazz, album, guitare, musiciens, scène). Un usage métaphorique dans le domaine de la musique, avec des expressions telles que « poker face », « poker perdant »...
Cluster 9, « culture_3 » : 2,38 % (Dracula, Bram Stoker, vampire, roman, film, fantastique, Christie, Coppola, comte, Frankenstein). Le cluster 3 était centré sur le domaine littéraire ; ici, il est question de cinéma et particulièrement des personnalités liées au poker. L’usage y est surtout littéral.
Pour ce qui est de la répartition temporelle, il est très intéressant de noter que le cluster 6 (l’industrie du poker) devient le second plus important, derrière le cluster 2, à partir de 2005 (popularisation des jeux d’argent sur Internet) et plus encore à partir de 2010 (légalisation des paris en ligne). Le cluster 0 (sport et poker) devient plus important à partir de 2004, d’autant qu’en son sein la thématique poker y est alors largement majoritaire.
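The clustering pipeline of section 4 can be sketched with the scikit-learn implementation the authors cite. The stop-list below is an illustrative stub (the actual list is not given) and the tokenisation details are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative stub: the authors' French stop-list is not given.
STOP = {"de", "la", "le", "les", "un", "une", "des", "et", "à", "en"}

def cluster_articles(texts, k=10):
    """Word n-grams with n from 1 to 3, tf-idf weighting, then K-means
    with K=10 and at most 400 iterations. Filtering the feature list
    *after* n-gram extraction keeps "coup de poker" while dropping the
    bare stop word "de", as described above."""
    vec = TfidfVectorizer(ngram_range=(1, 3))
    X = vec.fit_transform(texts)
    keep = [i for i, g in enumerate(vec.get_feature_names_out())
            if g not in STOP]
    X = X[:, keep]
    km = KMeans(n_clusters=k, max_iter=400, n_init=10, random_state=0)
    return km.fit_predict(X)
```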
Nous avons observé que dans plus de 80 % des cas, le terme poker n'apparaissait qu'une fois dans les textes. Nous avons montré que ceci était dû à un usage principalement métaphorique (on ne répète pas une métaphore), mais aussi au fait que le poker est rarement le sujet central de l'article. Cette tendance change quelque peu à partir de 2005, le poker devenant lié à des championnats et des retransmissions télévisuelles plutôt qu'à des tripots et des casinos. Enfin, nous avons montré que les usages métaphoriques relevaient très majoritairement de 3 domaines : la finance, la politique et le sport.

References
Dediu D. and de Boer B. (2016). Language evolution needs its own journal. Journal of Language Evolution, 1(1): 1-6.
Gérard C., Falk I. and Bernhard D. (2014). Traitement automatisé de la néologie : pourquoi et comment intégrer l'analyse thématique ? Actes du 4e Congrès mondial de linguistique française (CMLF 2014), Berlin, pages 2627-2646.
Gross, M. (1975). Méthodes en syntaxe : régime des constructions complétives. Hermann.
Gross, G. (2012). Manuel d'analyse linguistique : approche sémantico-syntaxique du lexique. Presses Universitaires du Septentrion.
Hamilton W.L., Leskovec J. and Jurafsky D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proc. of the Association for Computational Linguistics Conference (ACL) 2016.
Lejeune G. (2013). Veille épidémiologique multilingue : une approche parcimonieuse au grain caractère fondée sur le genre textuel. Thèse de doctorat en informatique de l'Université de Caen.
Macaulay, M. and Salmons, J. (2017). Synchrony and diachrony in Menominee derivational morphology. Morphology, 27: 179.
Rastier, F. and Valette, M. (2009). « De la polysémie à la néosémie ». Le français moderne (S. Mejri, éd., La problématique du mot), 77: 97-116.
Sablayrolles, F. (2002). « Fondements théoriques des difficultés pratiques du traitement des néologismes ». Revue française de linguistique appliquée, VII-1: 97-111.

Approche textométrique des variations du sens
Julien Longhi 1, André Salem 2
1 Université de Cergy-Pontoise, France – julien.longhi@u-cergy.fr
2 Université de la Sorbonne nouvelle, France – salem@msh-paris.fr

Abstract
The use of textometric methods relies on the hypotheses, firstly, that stable units exist (forms, lemmas or their graphical approximations) and, secondly, that occurrences of these forms can be retrieved from different parts of a corpus. Once automatic counting is performed, more sophisticated textometric methods can be employed to focus on textual variations (repeated segments, collocations, etc.) that occur around the same unit but in different contexts found within the corpus. This approach leads to the identification of semantic variations in relation to the context of each occurrence, as highlighted through automatic segmentation. We will illustrate this by using examples of repeated segments within the corpus that contain the n-gram /ennemi/, taken from a widely-studied chronological text series.

Résumé
Pour pouvoir mettre en œuvre les méthodes de la textométrie, il est indispensable de postuler, dans un premier temps, l'existence d'unités stables (formes, lemmes ou leurs approximations graphiques), dont on recensera ensuite les occurrences dans les différentes parties du corpus étudié.
Une fois les dépouillements automatiques réalisés, il est cependant possible d'utiliser des méthodes textométriques plus élaborées pour accéder aux variations textuelles (segments répétés, cooccurrences, etc.) qui peuvent se réaliser autour d'une même forme dans chacun des contextes particuliers du corpus. Cette démarche permet d'accéder au repérage de variations sémantiques qui se rapportent à chacune des occurrences des formes produites par la segmentation automatique. Nous illustrons notre démarche à l'aide d'exemples prélevés dans les parties d'une série textuelle chronologique largement étudiée : des segments répétés du corpus qui contiennent le n-gram /ennemi/.

Keywords: unité textométrique, sémantique, variation du sens

1. Introduction
Notre étude s'inscrit dans une perspective de prise en compte des dynamiques du sens à l'œuvre dans les discours, qui tiendrait compte de la variation, de l'hétérogénéité, ou encore de l'articulation entre topologie textuelle et discursive, sens et profilage. Le sens se construit dans différents champs où il est susceptible de paraître, et s'analyse « par le contexte, sous forme d'indices de position liés aux modalités de sa mise en place dans le champ » (Cadiot et Visetti, 2001), la caractérisation sémantique se faisant alors sur la base de la composition et décomposition des profils disponibles. L'automatisation du dépouillement de vastes corpus de textes, à des fins textométriques, nécessite au contraire que le repérage des unités de décompte puisse être confié à des machines. Pour pouvoir mettre en œuvre les méthodes de la textométrie, il est indispensable de postuler, dans un premier temps, l'existence d'unités stables (lexèmes, lemmes ou leurs approximations graphiques), dont on recensera ensuite les occurrences dans différentes parties du texte. Cette manière de faire permet d'étudier la répartition de chacune des unités dans un corpus, ou encore de rapprocher les différents contextes qui contiennent chaque unité textométrique. Ces simplifications, incontournables dans le premier temps de l'analyse, nous éloignent de l'étude du sens de chacune des occurrences que l'on peut élaborer dans chaque contexte particulier. Cependant, une fois les premiers dépouillements automatiques réalisés, il est possible d'utiliser des méthodes textométriques plus élaborées pour accéder aux variations textuelles qui peuvent se réaliser autour d'une même forme dans le corpus (segments répétés, cooccurrences, etc.). C'est ce croisement de perspectives et ce va-et-vient entre approche empirique et théorisation sémantique que nous souhaitons mettre à l'épreuve dans la présente étude.

2. Application au corpus Duchesne
Pour illustrer notre démarche, nous appliquons ces méthodes à l'étude de la ventilation, dans les différentes parties d'une série textuelle chronologique largement étudiée, des segments répétés du corpus qui contiennent le n-gram /ennemi/.

2.1. Rappels sur l'analyse de la série chronologique Duchesne
La série chronologique Père Duchesne a déjà fait l'objet de nombreuses analyses textométriques [1].

[1] Le corpus Père Duchesne est constitué par la réunion d'un ensemble de livraisons du journal Le Père Duchesne de Jacques-René Hébert, parues entre 1793 et 1794. Pour une description plus avancée de ce corpus, on consultera, par exemple, (Salem, 1988). Les analyses dont nous rendons compte ci-dessous ont été effectuées à l'aide du logiciel Lexico5. Cedric Lamalle, William Martinez et Serge Fleury ont largement contribué au développement des fonctionnalités de ce logiciel. Les auteurs tiennent à les en remercier.
Nous avons montré, en particulier, que les typologies réalisées à partir d'une partition de ce corpus en huit périodes, correspondant chacune à un mois de parution, mettaient en évidence un renouvellement lexical fortement lié à l'évolution dans le temps. On peut vérifier, sur la figure 1, que les parties correspondant aux périodes successives de parution sont proches sur les facteurs issus de l'analyse du tableau (8 parties x 1420 formes dont la fréquence dépasse dix occurrences) [2].

[2] Ce phénomène, connu sous le nom d'effet Guttman, a été largement décrit par Guttman (1941, 1946, 1950), Benzécri (1973) et Van Rijckevorsel (1987).

La méthode des segments répétés permet de repérer toutes les occurrences de suites de formes graphiques qui apparaissent plusieurs fois dans un corpus de textes (Lafon et Salem, 1983 ; Salem, 1986). Pour la présente étude, nous avons constitué un ensemble d'unités textuelles qui contient, outre les formes graphiques ennemi et ennemis, tous les segments répétés qui contiennent l'une ou l'autre de ces formes. On a projeté sur la figure 1, en qualité d'éléments supplémentaires, cet ensemble de segments. La position sur ce graphique des différents segments montre que ces unités ne sont pas employées de manière uniforme tout au long des périodes.

Figure 1 : Duchesne. Les segments contenant la séquence ennemi sur le plan des deux premiers facteurs issus de l'analyse du tableau 8 parties x 1420 formes (F>=10)

Guide de lecture pour la figure 1 : la figure fournit la représentation des huit parties du corpus Duchesne sur les deux premiers axes issus d'une analyse des correspondances, réalisée sur l'ensemble des formes dont la fréquence dépasse 10 occurrences. Les segments répétés du corpus contenant la séquence de caractères /ennemi/ ont été projetés sur ce même plan, en tant qu'éléments supplémentaires. La figure a été allégée des segments redondants (ex. : segments contenus dans des segments plus longs). Certains des éléments superposés par l'analyse ont été très légèrement déplacés afin de rendre la figure plus lisible.

Ainsi, par exemple, le segment plus cruels ennemis trouve toutes ses occurrences au début du corpus, alors que celles du segment ennemis de la liberté sont plutôt concentrées vers la fin. L'analyse des projections des différents segments qui contiennent le n-gram /ennemi/ va nous permettre de dégager des contextes dont la distribution diffère fortement entre le début et la fin de la période temporelle couverte par le corpus.
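À titre purement indicatif, l'inventaire des segments répétés contenant /ennemi/ et leur ventilation dans les huit parties peuvent se schématiser ainsi en Python. Esquisse naïve, sans rapport avec l'implémentation de Lexico5 ; le nom parties (liste des huit textes mensuels) est une hypothèse.

    from collections import Counter

    def ngrammes(tokens, n_min=2, n_max=5):
        # toutes les suites de n formes graphiques consécutives
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                yield tuple(tokens[i:i + n])

    # parties : liste des huit textes correspondant aux huit périodes (hypothèse)
    tokens_par_partie = [texte.lower().split() for texte in parties]
    freq_corpus = Counter(g for tokens in tokens_par_partie for g in ngrammes(tokens))

    # segments répétés (fréquence >= 2 dans l'ensemble du corpus) contenant /ennemi/
    retenus = [g for g, f in freq_corpus.items()
               if f >= 2 and any(m in ("ennemi", "ennemis") for m in g)]

    # ventilation par partie : tableau (segments x 8 parties), projetable
    # en éléments supplémentaires sur le plan factoriel de l'AFC
    compteurs = [Counter(ngrammes(tokens)) for tokens in tokens_par_partie]
    ventilation = {" ".join(g): [c[g] for c in compteurs] for g in retenus}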
2.2. L'évolution du contexte de la forme ennemi(s)
On peut estimer que le contenu sémantique de la forme ennemi(s) conserve une valeur relativement stable tout au long des périodes couvertes par le corpus que nous étudions. Le chercheur confronté à l'analyse de ces textes retrouvera sans peine, lors de l'examen de chacune des occurrences du terme, les principaux traits sémantiques décrits dans un dictionnaire de langue à propos de ce lexème (opposé, hostile, etc.). Cependant, l'analyse de ces mêmes contextes montre qu'il en va tout autrement pour ce qui concerne les référents auxquels la forme renvoie, dans chaque période particulière.

Aux plus cruels ennemis, plus mortels ennemis, ennemis du dehors (les puissances étrangères, les expatriés) des périodes du début succèdent bientôt les ennemis du dedans et du dehors, expressions qui peuvent s'analyser comme une dénonciation du fait que les ennemis du dehors ne constituent pas le seul danger et qui opèrent donc une modification manifeste du référent de départ. Par la suite, la mention des ennemis de l'intérieur complètera la notion d'ennemis du dedans. Il faut noter que les ennemis de l'intérieur sont de plus en plus souvent précédés de l'article défini les, qui les désigne comme une réalité dont l'existence est présupposée (elle n'est plus à démontrer). Progressivement, nos ennemis deviennent vos ennemis, puis les ennemis. Dans la dernière période, les ennemis, désormais désignés de manière préférentielle au pluriel, ne sont plus qualifiés par leur localisation ou par leur rapport aux destinataires du message (nos/vos ennemis) mais par des valeurs supposées communes auxquelles ils sont censés s'opposer : ennemis du peuple, ennemis de la république, ennemis de la révolution, ennemis de la liberté, ennemis de l'égalité.

3. La sémantique de ennemi(s)
Les variations constatées montrent que la forme ennemi(s) prend différents sens selon les contextes dans lesquels elle s'inscrit, en ce qu'ils sont associés à des référents distincts. Plutôt que de représenter le sens comme la somme des cooccurrences constatées, nous souhaitons analyser ces valeurs comme un sous-ensemble prélevé sur un ensemble de valeurs acquises. Les espaces sémantiques déterminés et caractérisés par l'analyse statistique jouent un rôle fondamental : au-delà des synonymies ou des polysémies, ils se renouvellent « en étant confronté aux textes – ce qui impliquerait de prêter attention à d'autres corrélations » (Visetti, 2004 : 11). La description sémantique que nous proposons s'inscrit dans le champ de la sémantique lexicale [3], du côté des approches qui envisagent la construction des référents comme extrinsèque. Cependant, alors que ces approches mobilisent en général des analyses phrastiques et travaillent sur des exemples forgés, nous introduisons une perspective statistique qui précède la représentation du sens. La description de l'objet ennemi(s) n'est pas séparée des rapports que l'on entretient avec lui, et sa description suppose une prise en compte différenciée de ses propriétés extrinsèques (relatives à ces rapports) et de ses propriétés intrinsèques (supposées stables et indépendantes).

[3] Cadiot et Némo (1997 : 127-128)

Figure 2 : Niveaux et unités d'analyse

L'intérêt de cette démonstration textométrique est pour nous de fournir des résultats concrets et matériels pour l'analyse des sens d'une unité lexicale. Ceci a plusieurs conséquences pour la mise en œuvre d'une sémantique soucieuse de l'exploitation des constats empiriques :
1) la représentation des variations du sens en contexte nous a permis d'identifier la manière dont les propriétés sont introduites et attribuées dans le corpus. Le référent change au fil du temps, puisque les ennemis, initialement définis comme du dehors et introduits par nos, deviennent vos ennemis, et se présentent finalement sous la forme ennemi(s) de + N. Se manifeste alors « le besoin d'être déterminé par un complément du nom, ou son équivalent, qui indique avec quoi le terme "relatif" se trouve mis en relation », cette complémentation explicitant « ainsi la référence identitaire » (Steuckardt, 2008).
2) L'évolution dans le corpus au fil du temps permet de rendre compte de la dynamique sémantique à l'œuvre, laquelle rend compte diachroniquement des évolutions de sens. La textométrie permet ainsi de saisir les processus, et donc de donner du sens à la dimension potentiellement « hétéroclite » des propriétés des référents. Ainsi, au plan linguistique, le passage du référent 1 au référent 2 se fait par l'intermédiaire d'une transformation des propriétés de ennemi(s) : défini de manière situationnelle (du dehors) et relative (nos, nos plus cruels), il acquiert des propriétés plus polémiques (vos, du dedans et du dehors), pour s'intégrer ensuite dans un processus discursif qui construit le référent (ennemi de + N : ennemi de la liberté ; ennemi du peuple), par l'introduction de termes à forte charge axiologique. Le référent introduit alors un point de vue, qui n'est pas strictement géographique ou institutionnel, mais aussi politique et idéologique. L'approche statistique dévoile, en outre, que c'est le pluriel qui est prioritairement mobilisé.

4. Conclusion
De manière désormais classique, les méthodes de la textométrie permettent de mettre en évidence les variations du vocabulaire qui surviennent au cours des périodes successives d'une même série textuelle chronologique. Dans la présente étude, nous avons appliqué les méthodes d'analyse statistique multidimensionnelle (AFC) à l'étude d'un ensemble particulier, celui des segments répétés réunis sur la base du fait qu'ils contenaient tous une même unité graphique (en l'occurrence, le n-gram /ennemi/). La confrontation des segments ainsi sélectionnés nous permet d'observer des variations autour des formes graphiques ennemi et ennemis. L'analyse de ces variations dans le temps nous conduit à distinguer des référents qui varient en fonction des périodes réunies dans le corpus. Au-delà des séries textuelles chronologiques, la méthode que nous avons présentée est susceptible de recevoir des applications dans l'étude de nombreux types de corpus. L'extraction semi-automatique des unités dont les contextes varient fortement en fonction des parties d'un corpus textuel peut également être envisagée.

References
Benzécri J.-P. et coll. (1981). Pratique de l'analyse des données. Linguistique et lexicologie. Dunod.
Cadiot P. and Nemo F. (1997). Propriétés extrinsèques en sémantique lexicale. Journal of French Language Studies, 7(2) : 127-146.
Cadiot P. and Visetti Y.-M. (2001). Pour une théorie des formes sémantiques. PUF.
Guttman L. (1941). The quantification of a class of attributes: a theory and method of scale construction. In P. Horst, The prediction of personal adjustment. SSRC, New York.
Lafon P. and Salem A. (1983). L'inventaire des segments répétés d'un texte. Mots. Les langages du politique, 6 : 161-177.
Lamalle C., Martinez W., Fleury S. and Salem A. (2002). Les dix premiers pas avec Lexico3. Outils lexicométriques. http://www.cavi.univ-paris3.fr/Ilpga/ilpga/tal/lexicoWWW
Lebart L. and Salem A. (1994). Statistique textuelle. Dunod.
Longhi J. (2008). Objets discursifs et doxa. Essai de sémantique discursive. L'Harmattan, coll. « Sémantiques ».
Rastier F. (2011). La mesure et le grain. Sémantique de corpus. Honoré Champion, coll. « Lettres numériques ».
Salem A. (1987). Pratique des segments répétés. Klincksieck.
Salem A. (1988). Approches du temps lexical. Mots. Les langages du politique, 17 : 105-143.
Steuckardt A. (2008). Les ennemis selon L'Ami du peuple, ou la catégorisation identitaire par contraste. Mots. Les langages du politique [En ligne], 69 | 2002. http://journals.openedition.org/mots/10023
Van Rijckevorsel J. (1987). The application of fuzzy coding and horseshoes in multiple correspondence analysis. DSWO Press.
Visetti Y.-M. (2004). Le Continu en sémantique : une question de formes. Texto !, juin 2004. http://www.revuetexto.net/Inedits/Visetti/Visetti_Continu.html

ADT et deep learning, regards croisés. Phrases-clefs, motifs et nouveaux observables
Laurent Vanni 1, Damon Mayaffre 1, Dominique Longrée 2
1 UMR 7320 : Bases, Corpus, Langage – prenom.nom@unice.fr
2 L.A.S.L.A. – prenom.nom@uliege.be

Abstract
This contribution confronts ADT and machine learning. The extraction of statistical key-passages is undertaken following several calculations implemented in the Hyperbase software. An evaluation of these calculations according to the filters applied (taking into account only positive specificities, only substantives, etc.) is given. The extraction of key passages obtained by deep learning - passages that have the best recognition rate at the time of a prediction - is then proposed. The hypothesis is that deep learning is of course sensitive to the linguistic units on which the computation of the statistical key sentences is based, but also sensitive to phenomena other than frequency and to other complex linguistic observables that ADT has more difficulty taking into account, as would be the case with underlying patterns (Mellet et Longrée, 2009). If this hypothesis is confirmed, it would on the one hand permit better understanding of the black box of deep learning algorithms and on the other hand offer the ADT community a new point of view.

Résumé
Cette contribution confronte ADT et deep learning. L'extraction de passages-clefs statistiques est d'abord proposée selon plusieurs calculs implémentés dans le logiciel Hyperbase. Une évaluation de ces calculs en fonction des filtres appliqués (prise en compte des spécificités positives seulement, prise en compte de substantifs seulement, etc.) est donnée. L'extraction de passages-clefs obtenus par deep learning - c'est-à-dire des passages qui ont le meilleur taux de reconnaissance au moment d'une prédiction - est ensuite proposée. L'hypothèse est que le deep learning est bien sûr sensible aux unités linguistiques sur lesquelles le calcul des phrases-clefs statistiques se fonde, mais sensible également à d'autres phénomènes que fréquentiels et à d'autres observables linguistiques complexes que l'ADT a plus de mal à prendre en compte - comme le seraient des motifs sous-jacents (Mellet et Longrée, 2009). Si cette hypothèse se confirmait, elle permettrait d'une part de mieux appréhender la boîte noire des algorithmes de deep learning et d'autre part d'offrir à la communauté ADT de nouveaux points de vue.

Keywords: ADT, deep learning, phrase-clef, motif, spécificités, nouveaux observables

1. Introduction
Pour des raisons techniques avant tout, l'ADT s'est constituée à partir des années 1960 autour du token, c'est-à-dire du mot graphico-informatique. Depuis lors, la discipline n'a cessé de varier et d'élargir ses observables, convaincue que le token seul rendait difficilement compte du texte dans sa complexité linguistique. Ainsi, la tokenisation en particules graphiques élémentaires reste l'acte informatique premier des traitements textométriques, et le calcul des spécificités lexicales reste l'entrée statistique privilégiée de nos parcours interprétatifs.
Cependant, la recherche d'unités phraséologiques élargies et complexes, caractérisantes et structurantes des textes, est devenue le programme d'une discipline désormais adulte. Historiquement, dès 1987, le calcul des segments répétés (Salem, 1987) ou des n-grams a représenté une avancée puisque les segments significatifs du texte, de taille indéterminée, étaient automatiquement repérés ; et aujourd'hui, la détection automatique, non supervisée, de motifs (Mellet et Longrée, 2009 ; Quiniou et al., 2012 ; Mellet et Longrée, 2012 ; Longrée et Mellet, 2013) - objets linguistiques complexes à empans variables et discontinus - apparaît un enjeu décisif. C'est dans cette perspective que cette contribution travaille et met à l'épreuve l'idée de passages-clefs du texte, tels qu'ils sont implémentés dans les deux versions d'Hyperbase (locale, développée par Etienne Brunet, et web, développée par Laurent Vanni) que l'UMR Bases, Corpus, Langage produit en collaboration avec le LASLA. La démonstration se fait en deux temps. D'abord, nous proposons une extraction statistique de passages-clefs, avec évaluation de leur pertinence interprétative sur un corpus français et un corpus latin. Ensuite, une confrontation méthodologique avec le deep learning est mise en œuvre puisque le traitement deep learning attribue, après apprentissage, les passages de texte à leur auteur avec un taux de réussite éprouvé : par déconvolution, nous repérons alors au sein de ces passages les zones d'activation, en soupçonnant qu'il s'agit, d'un point de vue linguistique, de motifs remarquables.

2. Les passages-clefs en ADT
2.1. Terminologie
Si nous préférons le terme de passage-clef à celui de phrase-clef, c'est que les traitements ici présentés n'ont pas de modèle syntaxique, et que la ponctuation forte qui délimite habituellement la phrase est un jalon utile mais non nécessaire à nos traitements. La notion de passage a été fortement théorisée par (Rastier, 2007) dans un article éponyme et désigne une « grandeur » du texte dont la valeur textuelle, c'est-à-dire interprétative, est patente. Un passage est donc un morceau de texte jugé suffisamment parlant, notamment par sa taille qui gagne à dépasser le mot, le segment voire la phrase, pour prétendre rendre compte d'un texte. Le passage-clef, quant à lui, s'appuie sur la définition rastiérienne mais est une unité de surcroît textométrique, c'est-à-dire une unité dont la pertinence est calculable et l'extraction automatique.

2.2. Implémentations
Les logiciels ADT comme Hyperbase, Dtm-Vic ou Iramuteq implémentent des calculs et l'extraction de passages-clefs. Dans tous les cas, les calculs proposés reposent sur l'examen des mots spécifiques (Lafon, 1984) : grosso modo, plus un passage concentre de spécificités, plus ce passage est jugé remarquable. Nous présentons ici deux types d'approche sur des passages arbitrairement constitués de 50 mots : un calcul naïf et sans filtre, dans lequel tous les mots du passage sont considérés, et un calcul filtré par nos connaissances linguistiques (sélection a priori des mots à considérer). Une évaluation de ces deux types d'approche est ensuite donnée.

2.3. Calcul sans filtre
Dans le cadre des études contrastives habituelles en ADT, l'indice de spécificité de chaque mot (Lafon, 1984) est sommé, qu'il soit positif ou négatif, en postulant que si les mots positifs (les mots sur-utilisés par un auteur, par exemple) doivent promouvoir le passage, il est légitime que les mots négatifs (les mots sous-utilisés par un auteur) doivent le handicaper. Chaque passage du corpus se trouve ainsi doté d'un super-indice de spécificité, et Hyperbase fait remonter en bon ordre les passages les plus caractéristiques des textes comparés.
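À titre purement indicatif, ce calcul naïf peut se schématiser comme suit. L'esquisse approche l'indice de spécificité de Lafon (1984) par un score hypergéométrique signé et somme ces indices sur des passages de 50 mots ; elle ne reproduit pas le code d'Hyperbase.

    from collections import Counter
    import math
    from scipy.stats import hypergeom

    def indices_specificite(tokens_sous_corpus, tokens_corpus):
        # Indice par mot, dans l'esprit de Lafon (1984) : -log10 de la probabilité
        # hypergéométrique d'une fréquence au moins aussi extrême, signé
        # (positif = sur-emploi, négatif = sous-emploi).
        freq_corpus, freq_sous = Counter(tokens_corpus), Counter(tokens_sous_corpus)
        T, t = len(tokens_corpus), len(tokens_sous_corpus)   # tailles corpus / sous-corpus
        spec = {}
        for mot, F in freq_corpus.items():
            f = freq_sous.get(mot, 0)
            if f >= F * t / T:                               # sur-emploi : queue droite
                p = hypergeom.sf(f - 1, T, F, t)
                spec[mot] = -math.log10(max(p, 1e-300))
            else:                                            # sous-emploi : queue gauche
                p = hypergeom.cdf(f, T, F, t)
                spec[mot] = math.log10(max(p, 1e-300))
        return spec

    def passages_clefs(tokens_sous_corpus, spec, taille=50):
        # Super-indice d'un passage = somme des indices de ses 50 mots.
        # Variante « filtrée » (cf. 2.4) : ne sommer que les indices positifs,
        # voire les seuls substantifs (étiquetage morphosyntaxique en amont).
        scores = []
        for i in range(0, len(tokens_sous_corpus) - taille + 1, taille):
            passage = tokens_sous_corpus[i:i + taille]
            scores.append((sum(spec.get(m, 0.0) for m in passage), i))
        return sorted(scores, reverse=True)      # passages les plus caractéristiques d'abord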
Ainsi, pour le français, sur le corpus de la présidentielle française 2017, le passage-clef le plus fortement spécifique d'E. Macron (versus les autres candidats) est le suivant :

[...] nous croyons dans l'innovation, dans la transformation écologique et environnementale, parce que nous voulons réconcilier cette perspective et l'ambition de nos agriculteurs, parce que nous croyons dans la transformation digitale, parce que nous sommes pour une société de l'innovation, parce que nous voulons […]

Quoique naïf, le calcul apparaît performant puisque l'interprétabilité sociolinguistique de ce passage est évidente : de fait, Macron s'est fait élire sur un discours dynamique (voulons, innovation (deux fois), transformation (deux fois), digitale) et un discours rassembleur susceptible de transcender le clivage gauche/droite (nous (5 fois), réconcilier).

2.4. Calcul filtré
Par connaissances linguistiques et statistiques, le calcul peut être raffiné. Par exemple, seules les spécificités positives – et parmi elles, les spécificités les plus fortes – peuvent être considérées, au motif qu'un objet s'identifie mieux par ses qualités que par ses défauts. Ensuite, les mots outils (conjonctions, déterminants) peuvent être écartés : ils présentent le double inconvénient d'avoir de très hautes fréquences (potentiellement déterminantes pour le calcul des spécificités) et d'être peu parlants d'un point de vue sémantico-thématique. Et encore, la catégorie grammaticale peut être choisie : par exemple, seuls les noms propres et communs, parfois plus chargés de sens, sont pris en compte. Ainsi, pour le latin, un passage-clef de Jules César, contrasté à de nombreux auteurs contenus dans la base du LASLA, est le suivant :

[...] partes Galliae uenire audere quas Caesar possideret neque exercitum sine magno commeatu atque molimento in unum locum contrahere posse sibi autem mirum uideri quid in sua Gallia quam bello uicisset aut Caesari aut omnino populo Romano negotii esset his responsis ad Caesarem relatis iterum ad eum Caesar […]

De fait, ce passage de la Guerre des Gaules peut être effectivement considéré comme très représentatif de l'œuvre de César. On relève des noms propres connus (Galliae, Caesar, Gallia) ou des noms communs correspondant à la réalité militaire du moment (bello, commeatu). Toutefois, la méthode ne permet pas de repérer des structures caractéristiques de la langue et du style de César, comme par exemple une proposition participiale marquant la transition entre épisodes dans une négociation : His responsis ad Caesarem relatis, « Ces réponses ayant été rapportées à César ».

2.5. Evaluation
Calcul naïf ou calcul élaboré : nous récapitulons quelques performances. Dans un corpus contrastif, nous calculons le score de super-spécificité de chaque passage en fonction des différents auteurs comparés (Tableau 1). Par exemple, pour le français, sans aucun filtre, 58% des passages du corpus de la présidentielle sont attribués justement à leur auteur ; et en ne considérant que les spécificités positives, le score descend à 52%. A l'opposé, en imposant le double filtre de la catégorie grammaticale (seulement les substantifs) et de l'indice de spécificité (seulement les spécificités positives), nous élevons le taux de bonne attribution à 89% pour le français et 82% pour le corpus latin du LASLA.

Tableau 1. Taux d'attribution ADT et taux de prédiction deep learning
3. Deep learning : à la recherche de nouveaux marqueurs linguistiques
3.1. Convolution et déconvolution, les principes
Le découpage du texte en segments de taille fixe est une méthode qui peut aussi être utilisée pour entraîner un réseau de neurones. Chaque segment devient alors une image d'un texte que le réseau va utiliser pour apprendre (Ducoffe et al., 2016) et faire ensuite de la prédiction. Sur nos deux corpus de référence (français et latin), les taux de précision convergent rapidement et atteignent le même niveau que ceux obtenus avec l'ADT (Figure 1). Si nous connaissons les paramètres à faire varier pour optimiser la détection des passages-clefs ADT, ceux issus du deep learning sont complètement non supervisés et découverts automatiquement par le réseau. L'idée des réseaux à convolution est de proposer un modèle capable de faire automatiquement une abstraction performante des données [1]. La convolution utilise pour cela un mécanisme de filtres qui va lire le texte avec une fenêtre coulissante pour extraire à chaque fois une partie de la matière linguistique présente dans la fenêtre (Figure 2). Avec des centaines de filtres de tailles différentes, le texte est lu en utilisant tous les empans linguistiques possibles, et le mécanisme de back-propagation [2] finit par accorder un certain poids à certains éléments du texte qui le pousse à prendre la bonne décision. Le deep learning est souvent considéré comme une boîte noire, faute de pouvoir mettre en évidence précisément ces éléments. Nous avons donc ici concentré nos efforts sur la déconvolution. Ce mécanisme, utilisé notamment en analyse d'images, permet de démêler le réseau et de lui redonner une forme interprétable par l'humain. Notre modèle est composé d'une couche de pré-apprentissage (Mikolov et al., 2013) pour la représentation des mots en vecteurs, d'une couche de convolution (Kim, 2014), d'un max-pooling pour compresser l'information et enfin d'un réseau classique de perceptron à une couche cachée pour la classification (Figure 2). La déconvolution est en fait une simple copie partielle de ce réseau (jusqu'à la convolution) à laquelle on ajoute à la fin une transposée de la convolution. On copie bien sûr le poids de chaque neurone après l'entraînement dans cette copie de réseau, et on obtient un nouveau réseau dont la couche de sortie correspond au résultat de chaque filtre de la convolution. Une simple somme de ces filtres pour chaque mot nous donne un indice d'activation du mot dans son contexte. Au final, nous observons ici des zones de texte s'activer plus ou moins suivant l'importance que leur a accordée le réseau.

[1] L'abstraction des données peut être considérée comme les saillances lexicales d'un texte qui lui donnent une identité propre.
[2] Correction de l'erreur à chaque phase d'apprentissage.

Figure 2. Convolution et déconvolution d'un passage du discours d'E. Macron
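À titre purement indicatif, l'architecture et la déconvolution décrites peuvent se schématiser ainsi avec Keras. Esquisse simplifiée, et non le code d'Hyperbase : un seul filtre de taille 3 y remplace les centaines de filtres de tailles différentes, et les dimensions comme les noms E, X et y sont purement illustratifs.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    L, DIM, V, AUTEURS = 50, 300, 30000, 5                # dimensions purement illustratives
    E = np.random.rand(V, DIM).astype("float32")          # à remplacer par des vecteurs word2vec
    X = np.random.randint(0, V, size=(100, L)).astype("int32")   # segments encodés (factices)
    y = np.random.randint(0, AUTEURS, size=(100,))               # auteurs (factices)

    entree = layers.Input(shape=(L,), dtype="int32")
    emb = layers.Embedding(V, DIM, embeddings_initializer=tf.keras.initializers.Constant(E),
                           trainable=False)(entree)       # pré-apprentissage (Mikolov et al., 2013)
    conv_layer = layers.Conv1D(128, 3, padding="same", activation="relu")  # convolution (Kim, 2014)
    conv = conv_layer(emb)
    pool = layers.GlobalMaxPooling1D()(conv)              # max-pooling : compression de l'information
    cache = layers.Dense(128, activation="relu")(pool)    # perceptron à une couche cachée
    sortie = layers.Dense(AUTEURS, activation="softmax")(cache)
    modele = Model(entree, sortie)
    modele.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    modele.fit(X, y, epochs=1, verbose=0)                 # apprentissage (réduit ici à une époque)

    # Déconvolution : copie partielle du réseau (jusqu'à la convolution), complétée
    # d'une transposée de la convolution qui reprend les poids appris.
    deconv_layer = layers.Conv1DTranspose(DIM, 3, padding="same", use_bias=False)
    lecteur = Model(entree, deconv_layer(conv))
    deconv_layer.set_weights([conv_layer.get_weights()[0]])   # mêmes poids, opération transposée
    # somme des filtres pour chaque mot = indice d'activation du mot dans son contexte
    indices = tf.reduce_sum(lecteur(X[:1]), axis=-1)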
3.2. Résultats et perspectives
A la lecture des résultats, nous voyons que le modèle identifie, sans surprise, des mots que le traitement statistique avait calculés comme spécifiques. Mais pas seulement. Certaines zones éclairées par le réseau semblent relever d'une nouvelle forme de lecture du texte. Nous pouvons illustrer ce constat avec un extrait des vœux d'E. Macron le 31 décembre 2017 :

[...] une transformation en profondeur de notre pays advienne à l'école pour nos enfants, au travail pour l'ensemble de nos concitoyens, pour le climat, pour le quotidien de chacune et chacun d'entre vous. Ces transformations profondes ont commencé et se poursuivront avec la [...]

Dans ce passage, les mots transformation et notre, fortement spécifiques de Macron, sont activés : ici, il n'y a pas de plus-value heuristique par rapport à l'ADT. De même, le segment répété chacune et chacun, très spécifique, est repéré par le réseau. Mais il y a aussi les mots pays et advienne, qui ne sont pas statistiquement spécifiques de Macron et qui ont pourtant fortement contribué à la reconnaissance du passage. Si l'on regarde maintenant les activations autour de ces mots ciblés, on voit que c'est une expression formée de plusieurs mots, pas forcément contigus, qui est repérée par le réseau. Il semble donc que le deep learning ait identifié des structures phraséologiques ou motifs linguistiques sensibles aux occurrences et à leur organisation syntagmatique. Plus loin, la visualisation du passage dans son ensemble met au jour une topologie textuelle ou un rythme auxquels le deep a été sensible (Figure 3).

Figure 3. Déconvolution : observation de la topologie d'un passage

4. Conclusion
L'ADT et le deep learning ne sont peut-être pas des continents étrangers l'un à l'autre (Lebart, 1997). Cette contribution, en croisant approche statistique et réseau de neurones, nous a permis d'identifier des passages-clefs et peut-être des motifs susceptibles de nourrir nos traitements textuels. Si les observables qui ont présidé à la détection de passages-clefs par l'ADT (les spécificités lexicales) sont connus et éprouvés, les zones d'activation du deep learning semblent relever de nouveaux observables linguistiques. Rappelons que la matière linguistique et la topologie des passages ne sauraient renvoyer au hasard : les zones d'activation permettent d'obtenir des taux de reconnaissance de plus de 90 % sur le discours politique français et de 85 % sur le corpus du LASLA, soit des taux équivalents ou supérieurs aux taux obtenus par le calcul statistique des passages-clefs. Reste désormais à améliorer le modèle et à en comprendre tous les aboutissants mathématiques comme linguistiques. La première amélioration que l'on se propose désormais d'implémenter est l'injection d'informations morphosyntaxiques dans le réseau afin de mettre à l'épreuve des motifs linguistiques toujours plus complexes.

References
Ducoffe, M., Precioso, F., Arthur, A., Mayaffre, D., Lavigne, F. et Vanni, L. (2016). Machine learning under the light of phraseology expertise: use case of presidential speeches, de Gaulle - Hollande (1958-2016). Actes de JADT 2016, pages 155-168.
Kim, Y. (2014). Convolutional neural networks for sentence classification. EMNLP, pages 1746-1751.
Lafon, P. (1984). Dépouillements et statistiques en lexicométrie. Genève-Paris, Slatkine-Champion.
Lebart, L. (1997). Réseaux de neurones et analyse des correspondances. Modulad (INRIA Paris), 18, pages 21-37.
Longrée, D. et Mellet, S. (2013). Le motif : une unité phraséologique englobante ? Étendre le champ de la phraséologie de la langue au discours. Langages, 189, pages 65-79.
Mellet, S. et Longrée, D. (2009). Syntactical motifs and textual structures. Belgian Journal of Linguistics, 23, pages 161-173.
Mellet, S. et Longrée, D. (2012). Légitimité d'une unité textométrique : le motif. Actes de JADT 2012, pages 715-728.
Mikolov, T., Chen, K., Corrado, G. et Dean, J. (2013). Efficient estimation of word representations in vector space. ArXiv : 1301.3781.
Quiniou, S., Cellier, P., Charnois, T. et Legallois, D. (2012). Fouille de données pour la stylistique : cas des motifs séquentiels émergents. Actes de JADT 2012.
Rastier, F. (2007). Passages. Corpus, 6, pages 25-54.
Salem, A. (1987). Pratique des segments répétés. Essai de statistique textuelle. Paris : Klincksieck.

Déconstruction et reconstruction de corpus... À la recherche de la pertinence et du contexte
Lucie Loubère
Lerass, Université de Toulouse – lucie.loubere@iut-tlse3.fr

Abstract
Faced with corpora made of large sets of texts, we propose a selection method based on the identification of text segments relevant to a topic by successive classifications, then on the recomposition of the corpus with all the texts having at least one relevant segment. This approach makes it possible to preserve the contextualizations and narrative discourses surrounding a theme while excluding off-topic texts.

Résumé
Face aux corpus constitués de grands ensembles de textes, nous proposons une méthode de sélection, basée sur l'identification de segments de textes pertinents pour une thématique par classifications successives, puis recomposition du corpus avec l'intégralité des textes ayant au moins un segment pertinent. Cette démarche permet ainsi de conserver les contextualisations et discours narratifs entourant une thématique tout en excluant les textes hors sujet.

Keywords: Big corpus, Reinert classification, Iramuteq

1. Introduction
La multiplication d'outils d'extraction de contenus numériques ou l'abonnement des universités aux bases de données de presse sont autant de raisons favorisant la création de corpus de grande taille. À ces facilités grandissantes s'opposent de nouvelles difficultés. L'hétérogénéité des contenus mis à disposition par une communauté, les algorithmes de recherche des bases de données, ou simplement les limites d'ambiguïté des requêtes ajoutent de nombreux bruits à nos corpus. Nous proposerons ici une méthode s'appuyant sur une identification de contenu par classifications successives (Ratinaud et Marchand, 2015), puis une régénération du corpus par concaténation de l'intégralité des articles contenant au moins un segment de texte (ST) dans le matériel identifié comme pertinent.

2. Problématique
La sélection de corpus par classifications successives, en utilisant comme unité le segment de texte, permet d'obtenir un sous-corpus pertinent pour une thématique (Loubère, 2014 ; Ratinaud et Marchand, 2015). Cependant, lorsque le corpus de départ est constitué de textes au contenu narratif structuré et délimité (article de presse, blog, argumentaires dans une concertation…), ce processus peut supprimer les éléments périphériques au thème étudié. Ces contenus restent pourtant pertinents pour la compréhension de l'objet d'étude, mais peuvent être classés avec le bruit des textes hors sujet dès les premières étapes de sélection. L'objectif de cette méthode est donc d'exclure le bruit des textes hors sujet tout en conservant le contexte d'évocation de la thématique principale.

3. Méthodologie
Le processus proposé ici se décompose en trois étapes :
1) Numérotation des textes par un identifiant en métadonnée ;
2) Extraction des segments de texte propres à notre thématique par classifications successives. Cette étape repose sur la classification hiérarchique descendante (CHD) de type Reinert (Reinert, 1983) proposée par le logiciel Iramuteq (Ratinaud, 2009). En permettant de faire émerger les mondes lexicaux, ce traitement nous permet de sélectionner les segments concernant notre thématique, puis de les re-soumettre à une CHD afin de préciser le corpus. Cette étape est reconduite jusqu'à avoir une classification dont toutes les classes concernent la thématique étudiée ;
3) Recomposition du corpus par concaténation des articles apparaissant au moins une fois dans l'extraction finale de l'étape 2 (voir l'esquisse ci-après).
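À titre purement indicatif, les étapes 1) et 3) peuvent se schématiser ainsi en Python, la CHD de l'étape 2) restant effectuée dans Iramuteq. Les noms articles et segments_finaux sont des hypothèses ; la ligne étoilée « **** *id_N » correspond au format de métadonnées d'Iramuteq.

    # 1) Numérotation : chaque article reçoit un identifiant en métadonnée,
    #    sous forme de ligne étoilée au format attendu par Iramuteq.
    with open("corpus_numerote.txt", "w", encoding="utf8") as fichier:
        for i, article in enumerate(articles):            # articles : liste des textes (hypothèse)
            fichier.write(f"**** *id_{i}\n{article}\n\n")

    # 2) Classifications successives : exports depuis Iramuteq des segments des
    #    classes pertinentes, re-soumis à une CHD jusqu'à stabilité thématique.

    # 3) Recomposition : concaténation de l'intégralité des articles ayant au
    #    moins un segment dans l'extraction finale.
    ids_retenus = {id_article for id_article, segment in segments_finaux}  # couples (id, ST) : hypothèse
    with open("corpus_recompose.txt", "w", encoding="utf8") as fichier:
        for i, article in enumerate(articles):
            if f"id_{i}" in ids_retenus:
                fichier.write(f"**** *id_{i}\n{article}\n\n")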
4. Exemple empirique
Dans les parties qui suivront, nous présenterons une mise en application de cette méthode sur un corpus utilisé lors de notre thèse (Loubère, 2018). Il est constitué d'une extraction d'articles de presse quotidienne nationale (Libération, L'Humanité, Le Monde, La Croix, Le Figaro) portant sur la thématique du numérique éducatif, du 01/01/2000 au 31/12/2014. Afin de couvrir le plus d'informations possible, la requête exécutée sur la base de données d'Europresse retournait tous les articles contenant au moins un terme éducatif dans la liste : collège, lycée, école, éducation, et au moins un terme numérique dans la liste : numérique, informatique, multimédia, TICE.

4.1. Les classifications successives
Cette extraction retourna 18 804 articles, auxquels nous avons retiré 875 doublons. Le corpus exploité ici est donc constitué de 17 929 articles représentant 450 815 segments de texte, sur lesquels nous avons apposé en métadonnée le numéro de l'article source. Nous allons présenter ici les classifications successives. Nous avons effectué une CHD de 20 classes en phase 1 et un minimum de 1000 ST par classe ; nous obtenons 16 classes représentant 99,72 % du corpus. Le résultat obtenu est présenté sur le dendrogramme en illustration 1.

Illustration 1 : dendrogramme de la première CHD

Ce premier découpage montre une séparation en trois blocs. Le premier est composé des personnalités publiques, le second est composé par des thématiques extérieures à notre sujet. En effet, de nombreux articles contiennent les termes de notre requête sans être pour autant dans le domaine éducatif (ou numérique). Ainsi, les classes 9 et 8 regroupent les actualités ou dossiers portant sur le domaine de la culture. Nous citerons comme exemple non exhaustif d'article de ce domaine un article du journal Le Monde commentant les sorties cinématographiques, dans lequel nous relèverons « les enfants privés d'école jouant dans les rues » et, pour un autre film, « les décors numériques ». Nous retrouvons sur le même principe les classes 6, 5 et 13, traitant des conflits armés détruisant les lycées et relatant une infériorité numérique. Enfin, le troisième bloc présente une classe centrée sur le numérique (classe 12), deux classes centrées sur l'éducatif (11 et 10) et deux classes sur l'aspect législatif et économique (classes 1 et 2). Afin de pouvoir affiner ces thématiques et les possibles interactions, nous avons choisi de conserver le bloc entier, soit les segments composant les classes 1, 2, 10, 11, 12 et 14. L'export précédent nous a permis d'obtenir 194 966 segments de texte, sur lesquels nous avons effectué une deuxième CHD de 15 classes en phase 1 et un seuil minimal de 100 ST. Nous obtenons 14 classes portant sur 99,97 % des segments. Le résultat est présenté en illustration 2. Ce deuxième découpage reprend une structure en trois groupes.
Ici, nous relevons le contexte économique du marché du numérique (classes 14, 5 et 6).

Illustration 2 : dendrogramme de la deuxième CHD

Le second bloc (classes 4, 3, 7, 8, 10) est constitué des différents discours témoins de la numérisation de la société. Le troisième groupe, séparé du reste du corpus par le premier facteur, est centré sur le champ éducatif. Les trois premières classes à se détacher partagent un discours sur l'après-formation et le recrutement (classes 9, 2 et 1). La classe 11, constituant 10,3 % du corpus, est centrée sur l'éducation primaire et secondaire, alors que la classe 12 porte sur l'enseignement supérieur et la recherche. Notre étude portant sur le système scolaire secondaire, nous ne conserverons que la classe 11 pour l'étape suivante. L'export de cette dernière constitue un corpus de 20 167 segments de texte, sur lesquels nous avons effectué une CHD de 15 classes en phase 1 et un minimum de 100 ST par classe. Nous obtenons 8 classes rapportant 99,22 % des segments. Ce dendrogramme, structuré en deux blocs, nous montre une séparation entre un discours centré sur l'aspect structurel de l'éducation (classes 8, 6, 4, 3) et celui traitant de l'enseignement (classes 2, 1, 5, 7).

Illustration 3 : dendrogramme de la troisième CHD

Dans la partie structurale, nous retrouverons les segments de texte traitant des réformes sous un angle gouvernemental (classe 8), suivis de tout le discours se regroupant autour des aspects temporels, comme le temps de travail mais également les rythmes scolaires (classe 6). La classe 3 constitue un discours sociologique sur l'éducation ; nous y retrouvons de nombreuses statistiques étudiant les répartitions sociales dans les différents cursus. Enfin, la classe 4 traite des établissements scolaires dans leur diversité. Les autres classes portent toutes sur le domaine pédagogique : la classe 7 concerne les contenus d'enseignement ; la classe 5 traite de la mise en place d'outils numériques parascolaires (jeux éducatifs, fiches de révision), alors que la classe 2 est centrée sur la mise en place de formations à distance. Enfin, la classe 1 porte le discours sur le numérique dans l'éducation : les mots-clés employés dans notre requête y sont tous surreprésentés. Nous ne conserverons donc que les segments composant cette classe. L'extraction de cette dernière classe nous permet d'obtenir 2072 segments, sur lesquels nous avons effectué une CHD de 20 classes en phase 1 avec un seuil de 100 ST par classe. Cette classification nous a montré une réelle stabilité de la thématique. En effet, les 8 classes exposées portent chacune sur un aspect du numérique éducatif.

Illustration 4 : dendrogramme de la quatrième CHD

4.2. Classification du corpus recomposé
Le corpus recomposé des 2902 articles contenant au moins un segment de texte dans la classe 1 de la troisième CHD est constitué de 72 460 segments. Une CHD de 20 classes en phase 1 et un minimum de 800 ST par classe nous donne le dendrogramme suivant :

Illustration 5 : dendrogramme de la CHD sur le corpus recomposé

Nous y retrouvons donc, au-delà des discours sur l'utilisation du numérique dans les établissements, un discours sur l'économie reflétant le marché du numérique éducatif et les frais engendrés par les dotations des établissements ; un discours à la frontière de la culture et de l'éducation, avec les formations de ces domaines empreintes de numérique.
Mais également un discours sur l'actualité géopolitique mondiale, contextualisant des initiatives où le numérique apporte des solutions éducatives lors de ségrégations ethniques ou d'éloignements géographiques. Tous ces mondes lexicaux constituent des éléments du discours social sur notre sujet, qu'une étude réduite aux segments ciblés lors des CHD successives ne permettrait pas d'explorer.

5. Conclusion
Le principe des CHD successives, s'il nous permet d'accéder finement aux segments contenant le discours sur le numérique éducatif, nous éloigne d'une compréhension globale du sujet. En effet, interroger les bases de données de presse sur une longue période et une sélection de presse généraliste apporte une quantité importante de documents hors contexte. Ces données portent des éléments contextuels communs avec les articles traitant de notre sujet (personnalités politiques, discours économique…) ; la proximité lexicale des segments de ces champs structure les classes de discours communes aux articles portant sur notre sujet ou non. Cette hétérogénéité, associée à l'insécurité d'un grand ensemble (Geffroy et Lafon, 1982) nous empêchant une connaissance du corpus antérieure à l'analyse lexicométrique, conduit « à tracer un peu trop vite une autoroute » (Geffroy et Lafon, 1982, p. 140) jusqu'à notre classe 1 finale. Ce phénomène questionne la constitution d'un corpus sur une dimension architextuelle, alors même que l'outil de classification utilisé ici joue sur un niveau intertextuel et cotextuel (Rastier, 2015), rapprochant des passages de textes en fonction de leur structure lexicale. La présence de textes aux sujets hétéroclites fait ressortir de façon précoce des thématiques, indépendamment de leur hypothétique poids dans le corpus qu'aurait constitué une sélection de textes centrés sur notre sujet. Ainsi, les segments traitant de sujets de politique générale ou exposant le contexte social d'un pays dans les articles traitant du numérique éducatif sont classés avec ceux des articles hors sujet. Cette difficulté éloigne le chercheur de la compréhension d'un discours. La démarche que nous venons de présenter nous permet de nous rapprocher d'un positionnement de textomètre (Pincemin, 2012), sélectionnant les segments pertinents par une démarche inductive, mais en conservant l'unité sémantique du texte dans la construction du corpus final.

Bibliography
Geffroy, A. et Lafon, P. (1982). L'insécurité dans les grands ensembles. Aperçu critique sur le vocabulaire français de 1789 à nos jours d'Etienne Brunet. Mots, 5(1) : 129-141.
Loubère, L. (2014). Le traitement des TICE dans les discours politiques et dans la presse. Présenté aux 12èmes Journées internationales d'Analyse statistique des Données Textuelles.
Pincemin, B. (2012). Sémantique interprétative et textométrie. Texto ! Textes et Cultures, 17(3) : 1-21.
Rastier, F. (2015). Arts et sciences du texte. Paris : Presses universitaires de France.
Ratinaud, P. (2009). IRAMUTEQ : Interface de R pour les Analyses Multidimensionnelles de TExtes et de Questionnaires. Consulté à l'adresse http://www.iramuteq.org
Ratinaud, P. et Marchand, P. (2015). Des mondes lexicaux aux représentations sociales. Une première approche des thématiques dans les débats à l'Assemblée nationale (1998-2014). Mots. Les langages du politique, (2) : 57-77.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2) : 187-198.

L'apport du corpus-maquette à la mise en évidence des niveaux descriptifs de la chronologie du sens. Essai sur une Série Textuelle Chronologique du Monde diplomatique (1990-2008)
Heba Metwally
Université d'Alexandrie, Égypte – heba.metwally77@gmail.com

Abstract
Chronological corpora, and particularly time series (Lebart et Salem, 1994), organize the textual data in corpora according to their natural sequence in time. Today, scholars increasingly work with chronological corpora, following the democratization of access to big data. Lexicometry develops into stylometry, textometry and logometry, and statistical data analysis integrates the observation of cooccurrential systems and lexical networks in their complexity. This improves the analysis of semantic contents according to their localisation in the semantic strata. This contribution aims to enhance the description of the chronology of meaning. The study is based on a corpus of more than 5000 articles (ca. 11 million tokens) published in the Monde diplomatique between January 1990 and December 2008. To analyze big chronological corpora, we propose a scale model of the chronological corpus, obtained by compressing the initial corpus to its most frequent nouns. The compression procedure is duplicated in the four sub-corpora of relevant semantic stability. We obtain two descriptive levels of chronology: the synthetic level of dominant contents and the analytical level of the four chronological phases of meaning. The two levels are intended to respond to different investigations on time and meaning. Working on sets of scale models that are either connected horizontally (chronological sequence) or vertically (the synthetic perspective clarified by an analytic perspective) enlarges our field of observation and deepens our understanding of chronological data in particular and of the unfolding of text in general.

Keywords: chronological corpus – logometry – logogenesis – clustering – method Reinert – corpus semantics – media analysis

Résumé
474 JADT’ 18 L’apport du corpus-maquette à la mise en évidence des niveaux descriptifs de la chronologie du sens. Essai sur une Série Textuelle Chronologique du Monde diplomatique (1990-2008). Heba Metwally Université d’Alexandrie, Égypte – heba.metwally77@gmail.com Abstract Chronological corpora and particularly time series (Lebart et Salem 1994) organize the textual data in corpora according to their natural sequence in time. Today, scholars are interfacing increasingly with chronological corpora following the democratization of access to big data. The lexicometry develops into stylometry, textometry and logometry. And statistical data analysis integrates the observation of co-occurrential systems and lexical networks in their complexity. This improves the analysis of semantic contents according to their localisation in the semantic strata. This contribution aims to enhance the description of the chronology of meaning. The study is based on a corpus of more than 5000 articles (ca 11 millions of tokens) published in the Monde diplomatique between January 1990 and December 2008. To analyze big chronological corpora we propose a scale model of the chronological corpus by compressing the initial corpus to its most frequents nouns. The compression procedure is duplicated in the four sub-corpuses of relevant semantic stability. We obtain two descriptive levels of chronology: the synthetic level of dominant contents and the analytical level of the four chronological phases of meaning. The two levels are intended to respond to different investigations on time and meaning. Working on sets of scale models that are either connected horizontally (chronological sequence) or vertically (the synthetic perspective clarified by an analytic perspective) enlarges our field of observation and deepens our understanding of chronological data in particular and the unfolding of text in general. Keywords: chronological corpus – logometry – logogenesis – clustering – method Reinert – corpus semantics – media analysis JADT’ 18 475 Résumé Les corpus chronologiques et a fortiori les Séries Textuelles Chronologiques (Lebart et Salem, 1994) organisent les données textuelles dans le corpus selon leur enchaînement naturel dans le temps. La banalisation des corpus textuels et l’accès facilité et accéléré au big data multiplient les corpus chronologiques, puisque finalement toute production textuelle s’étale dans le temps. La lexicométrie – au sens classique – doublée de la stylométrie, de la textométrie voire de la logométrie, et la statistique occurrentielle enrichie par un outillage cooccurrentiel (Viprey, 1997), (Mayaffre, 2014), la voie est ouverte aujourd’hui à une observation améliorée des contenus sémantiques qui gagnent en visibilité grâce aux tentatives parfois incontrôlées de leur objectivation. Cette contribution a pour objectif de contribuer à la description de la chronologie des contenus sémantiques. On s’appuie sur un corpus d’articles du MD (1990-2008). On compte plus de 5000 articles et plus de 11 millions d’occurrences. On propose pour cela le recours à un corpus-maquette, une compression du corpus chronologique intégral à partir des noms les plus fréquents. Cette démarche de compression est reproductible dans les souscorpus des périodes de stabilité sémantique. On obtient deux niveaux descriptifs de la chronologie, à savoir le niveau global, synthétique des contenus dominants et le niveau subordonné, analytique des sens particuliers des phases transitoires du discours. 
Les deux niveaux répondent à un questionnement différent sur le temps, en multipliant les pistes d'interrogation et en articulant le niveau synthétique et son niveau analytique.

Mots-clés : corpus chronologique – logométrie – logogénétique – classification – méthode Reinert – sémantique de corpus – analyse de discours médiatique

1. Introduction
Dans la tradition lexicométrique, les STC (Séries Textuelles Chronologiques) problématisent les investigations sur le temps [1]. Ce type de corpus est né, dans les études à caractère historique, du questionnement sur le changement dans le discours au fil du temps. Et les travaux d'André Salem [2] témoignent de l'intérêt porté à la description des corpus textuels chronologiques. Pour ce faire, André Salem généralise les STC, décrit la particularité des sorties machines des analyses statistiques qu'elles produisent (AFC ; calcul de spécificités), introduit la notion de « temps lexical » et conçoit une gamme de calculs visant, dans un premier temps, la « mise en évidence et la mesure du stock lexical au cours du temps » (Salem, 1988 : 118) et, dans un second temps, la caractérisation des périodes dans une STC.

[1] « Nous appelons séries textuelles chronologiques ces corpus homogènes constitués par des textes produits dans des situations d'énonciation similaires, si possible par un même locuteur, individuel ou collectif, et présentant des caractéristiques lexicométriques comparables. » (Lebart et Salem, 1994 : 217)
[2] Cf. (Salem, 1988, 1991, 1993, 1994)

Plus généralement, la particularité des STC est de concilier la linéarité du texte et du temps, et la sérialité du corpus. Si tous les corpus sont partitionnés en séries pour permettre la comparaison, ces séries ont l'avantage de conserver l'ordre naturel des textes qui s'échelonnent – sans conflit – dans le corpus et dans le temps. Aujourd'hui, le champ des observables est constamment élargi grâce à l'évolution des outils informatiques et au progrès de la tokenisation, pour embrasser progressivement des niveaux descriptifs textuels que le chercheur filtre ou articule à sa guise. La lexicométrie est enrichie et mise à jour par la textométrie et la logométrie, dont le projet est de dépasser la lexie vers les textes, le discours et le sens. Le sens est objectivable grâce à la formalisation de la cooccurrence et à son baptême comme unité minimale de contextualisation, i.e. de sens (Mayaffre, 2008). Dès lors, la statistique occurrentielle se double de la statistique cooccurrentielle. La cooccurrence devient unité de décompte généralisée à laquelle s'appliquent les calculs statistiques traditionnels (Brunet, 2012). Des applications d'ADT de tradition benzécriste se développent pour appréhender les réseaux lexicaux dans leur complexité. La cooccurrence généralisée (Viprey, 1997, 2005, 2006) se donne une visée exploratoire et la méthode Alceste (Reinert, 1983, 1993) procède à la démarche classificatoire des réseaux lexicaux structurants des textes. C'est dans ce cadre des progrès de la méthodologie et de la technologie qu'une sémantique de corpus (Rastier, 2011) est envisageable. Ce champ d'investigation intéresse naturellement les études chronologiques, qui peuvent désormais observer le mouvement des contenus sémantiques dans le temps pour comprendre l'impact du temps dans la thématisation d'une Série Textuelle Chronologique [3].

[3] Ce point précisément constitue la problématique de notre thèse de doctorat intitulée « Les thèmes et le temps dans Le Monde diplomatique (1990-2008) », soutenue le 11 décembre 2017 à l'Université Côte d'Azur (UCA) à Nice.
Pour l'objectivation des fonds sémantiques [4] du discours, on sollicite la méthode Alceste implémentée dans le logiciel libre Iramuteq (Ratinaud et Marchand, 2012), qui s'articule à Hyperbase. Pour une visualisation améliorée des topics du discours, on propose de recourir à une maquette du corpus et de ses sous-corpus. Au sens propre, la maquette est une représentation en trois dimensions, à échelle réduite, qui reste fidèle dans ses proportions. Ici, dans le cas des corpus textuels, la maquette est une compression du corpus intégral, qui se réduit à ses noms les plus fréquents. A partir d'une STC du Monde diplomatique (1990-2008), cette contribution se donne deux objectifs. Dans un premier temps, elle vise à mettre en exergue les deux niveaux descriptifs complémentaires de la chronologie du sens, à savoir la chronologie des contenus dominants (3.) et la logogénétique (4.), tout en relevant l'intérêt de l'étude conjointe de ces deux niveaux. Dans un second temps, il s'agit également de mettre à l'épreuve notre proposition de la maquette. On recherche une visualisation améliorée des contenus sémantiques structurants grâce au recours à une maquette, reproduction grossière et fidèle des textes, dont l'usage spécifique sera illustré dans les lignes suivantes.

[4] Les fonds sémantiques sont les isotopies ou les macrostructures sémantiques sur lesquelles se détachent les formes sémantiques que sont les thèmes. Cf. (Rastier, 2011 : 24)

2. Du corpus intégral à la maquette du sens et du temps
Le choix du Monde diplomatique pour l'étude de l'évolution du sens s'appuie sur la richesse et la stabilité de son contenu. La période couverte par cette étude marque un moment historique important, à savoir le monde après la chute du Mur de Berlin. En plus, cette période se caractérise par une continuité éditoriale [5]. Bref, nous avons affaire à un discours stable, sans complexe, qui à l'examen multidimensionnel épouse un schéma évolutif classique, sans ruptures [6]. On estime que la stabilité du discours est un facteur indispensable à l'étude de l'évolution, celle-ci reposant principalement sur la continuité. La finalité de ce travail, à savoir l'étude de la chronologie du sens d'un gros corpus textuel, préside à la conception de la maquette.

[5] Il s'agit du mandat d'Ignacio Ramonet, qui est directeur de la publication de janvier 1990 à mars 2008.
[6] Par examen multidimensionnel, on entend l'AFC de la distance entre les textes, qui dans le cas des données sérielles reproduit une forme parabolique baptisée parabole Guttman, symbolique du mouvement linéaire des données ordonnées dans le temps. Cf. (Salem, 1991)
We posit here, without generalizing, that media discourse, by its informative vocation and its reference to the world, structures its content in a privileged way around nouns. The nominal class (common nouns and proper nouns) is the largest grammatical class in the corpus, covering 28.9% of the corpus surface. It also shows distributional stability across the CTS. Its absolute numerical weight and its balanced distribution satisfy the criterion of statistical representativeness. (In our doctoral work (Metwally, 2017) we studied the contents of the frequency classes of the full corpus in order to understand the numerical hierarchy of the lexicon, and we analyzed the grammatical structure of the data and its distribution within the CTS.) A comparison with other corpora (Labbé and Monière, 2003; Mayaffre, 2004) of the lists of the most frequent lemmas sorted by part of speech likewise confirms the characterizing power of the nominal class in general and of proper nouns in particular.

We therefore rely on the nominal class and on the frequency argument to reduce the full corpus to its 380 most frequent nouns. The procedure leaves the partitions of the corpus and the sequence of the texts intact, so as to respect the sequential structure of the texts and the chronological design of the corpus. Both guarantee the authenticity of the textual corpus; only their preservation allows us to examine the working hypothesis that governed the corpus design. To explain this philological operation, simple in its principle: it consists in placing a mask over the entire text except for the 380 most frequent nouns.

The procedure is then repeated within subcorpora of semantic stability. Semantic stability can be measured endogenously by computing the distance between texts on the basis of the minimal form of thematic signification, the cooccurrence. The intertextual distance computed on the cooccurrences among the nouns of the maquette reveals four periods, which define four subcorpora, themselves reduced in turn to maquettes. This endogenous periodization establishes the semantic time according to which the full corpus and its maquette are remodeled. (We allow ourselves to speak of semantic time by analogy with André Salem's lexical time (1988): semantic time is the rhythm according to which semantic contents organize themselves over time, measured here by the intertextual distance computed on cooccurrences.) Table 1 below summarizes the lexical structure of the corpus, the subcorpora and their maquettes. Each maquette covers approximately 9.8% of the surface of its original corpus; this stability of numerical representativeness licenses comparison across the data.

Table 1: Summary of the lexical structure of the corpus, the subcorpora and their maquettes

corpus / subcorpus    size (N = occurrences)    vocabulary (V = words)    maquette (V = nouns)    maquette (size)
1990-1993             2,697,013                 67,989                    307                     266,439
1994-1997             2,402,434                 67,571                    282                     218,643
1998-2001             2,552,998                 70,954                    290                     229,119
2002-2008             3,765,908                 86,032                    375                     382,298
1990-2008             11,418,356                140,690                   380                     1,115,311

We thus end up with a complex two-level apparatus: the global level of the semantic contents of the whole chronological span under study, whose dynamics can be examined (section 3); and the analytical level, chronologically ordered, of the stable semantic phases, which allows both the observation of the movement of semantic contents and their confrontation with the global synthetic level (section 4).
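The maquette reduction itself requires very little machinery. The following sketch is purely illustrative: it assumes tokens already POS-tagged as (form, pos) pairs with an invented tagset, keeps the N most frequent nouns, and masks everything else while preserving token order and the period partition, as described above.

from collections import Counter

def build_maquette(periods, n_nouns=380, noun_tags=("NOUN", "PROPN")):
    # periods: dict period -> list of (form, pos) tokens in text order
    freq = Counter(
        form
        for tokens in periods.values()
        for form, pos in tokens
        if pos in noun_tags
    )
    keep = {form for form, _ in freq.most_common(n_nouns)}
    # Mask everything except the selected nouns; order and partitions intact.
    return {
        period: [form for form, pos in tokens
                 if pos in noun_tags and form in keep]
        for period, tokens in periods.items()
    }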
The study of the semantic backgrounds becomes feasible by mobilizing cooccurrence statistics, which bring out the relevant semantic structures. The descending hierarchical classification (DHC) applied to the maquette and to its sub-maquettes makes observable the stabilized "lexical worlds" (Reinert, 1993, 2008) of the global meaning and of its transitory phases (see the dendrograms in Figures 1, 3 and 4).

3. The dynamics of the dominant contents

The usual approach in chronological studies rests first on a static study of the global meaning, followed by a dynamized view. The static views are a provisional methodological artifice intended to bring out the semantic contents that stabilize at the end of a dynamic movement. Grasping the global meaning answers the question of the dominant, consensual contents, those that persist from one period to the next and survive across 19 years of article production.

To analyze the semantic structure of the maquette, we feed Iramuteq the global maquette, in which the 380 most frequent nouns follow one another on the syntagmatic axis in their order of appearance, and whose partitions give the corpus a chronological structure matched to its semantic time. Iramuteq then cuts the text into text segments of configurable size. The choice of the extent of the text segments (TS) is crucial, since the TS are the utterances analyzed and classified by the Alceste method. For these context units we estimated a run of 10 nouns in the maquette corpus to be the equivalent, in the full corpus, of a contextual window of 33 words, an estimate based on the share of the nominal class in the whole corpus, 28.9% (see Metwally, 2017). We thereby target an intermediate space between the sentence and the paragraph. Once Alceste is run, it performs a DHC crossing the TS and the nouns, classifying the TS by their dominant lexical character.

Figure 1: The lexical worlds of the maquette (1990-2008). (A few verbs can be spotted in the class lists: partir, produire, revenir, sentir, passer. This is an error due to a lemmatization performed by Iramuteq despite our attempts to prevent it; the items are in fact the nouns parti, produit, revenu, sens, passé.)

We impose demanding parameter settings on the algorithm so as to obtain a sufficiently rich reading grid. Of 15 classes requested, 8 are stabilized at the end of phase 1 (Figure 1). The machine outputs of the DHC are numerous. The dendrogram representation corresponds to the classification stricto sensu, and it is enriched with additional information that highlights the DHC. We begin with a rapid identification of the semantic structure of the discourse and of the hierarchy of information. Through its binary logic of representation, the dendrogram opposes the economic contents, the largest with 41.5% of the classified TS, to the non-economic contents. The latter divide into political themes (35.2% of the classified TS) and the themes of Man (23.3% of the classified TS), socio-cultural themes dealing with historical and cultural subjects and with societal questions.
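The segmentation into 10-noun text segments and the splitting logic of the descending classification can be sketched as follows. This is a deliberately minimal illustration of the principle, one correspondence-analysis split on a presence/absence segment-by-noun matrix; it is not a reimplementation of Iramuteq's Alceste procedure, which stabilizes classes over repeated splits and parameter settings.

import numpy as np

def segments(nouns, size=10):
    # Cut the maquette into consecutive 10-noun text segments (TS).
    return [nouns[i:i + size] for i in range(0, len(nouns) - size + 1, size)]

def reinert_split(segs):
    # Presence/absence TS x noun matrix, as in the Alceste method.
    vocab = sorted({w for seg in segs for w in seg})
    idx = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(segs), len(vocab)))
    for i, seg in enumerate(segs):
        for w in set(seg):
            X[i, idx[w]] = 1.0
    # Correspondence analysis of the matrix; split segments on the first axis.
    P = X / X.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    axis1 = U[:, 0] * sv[0] / np.sqrt(r)   # row coordinates, first CA axis
    return axis1 > 0                       # the two candidate "lexical worlds"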
Following the descending logic of the classification, specialized classes stabilize that further characterize the three identified semantic domains. Within the economic classes, a socio-economic class devoted to questions of employment and work takes shape (class 8: « emploi », « travail », « chômage », « salaire », « syndicat »); it stands apart from the two macro-economic classes, one dealing with domestic economies and the economic machinery of countries (class 2: « développement », « industrie », « concurrence », « secteur »), the other with the world economy (class 7), covering finance and the economic performance of countries on the world market (« dollar », « banque », « dette », « prix », « croissance »). Attached to the same branch as the political themes, the lexical worlds of Man display a variation that separates philosophical and/or ideological questions about history and culture (class 1: « histoire », « siècle », « monde », « culture », « sens », « conscience », « passé ») from the everyday life of human beings in this world (class 6: « femme », « enfant », « victime », « quartier », « violence », « police », « vie », « école »).

If the analysis of meaning necessarily involves a provisional suspension of the serial structure of the corpus, querying the partitions of the maquette on their share in the lexical classes restores the temporality that defines the corpus. Projecting the classes onto the periods of semantic stability brings out the dynamics of the classes and the thematization of each period, finally allowing inferences about the evolution of meaning. The lexical classes follow different tendencies over time. The themes of power (classes 4 and 5) form an important informational axis that undergoes hardly any quantitative variation. The class of international politics (class 3) shows an exceptional positive peak in the last period.

Figure 2: Periods and classes of the maquette (deviations in chi-squared)

It is the economic and the socio-historical contents that are traversed by two opposite evolutionary logics. Ordering the positive bars highlights the large positive peaks, exclusive to the first two periods, of two economic classes. This importance fades progressively, and in the last period the largest deficits are precisely those of the economic classes. Facing the regression of the economic contents, progression is reserved to the socio-historical contents (classes 1 and 6). A changing thematic color follows from one period to the next. The economic contents that mark the 19 years following the fall of the Berlin Wall come mostly from the first two periods, while the two following periods show socio-historical centers of interest, mingled in the third period with economic themes and in the last period with the global events of international politics.

To the naked eye, the histogram of the dynamics of the global meaning divides into two distinct and asymmetric evolutionary moments. Quantitatively, the over-use seen in the first half of the series is never matched by a comparable over-use in the second half. Qualitatively, the majority contents of the first part are technical contents belonging to the most important informational axis, a technical axis of macro-level visions.
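Before turning to the second half of the series, note that period profiles of the kind shown in Figure 2 rest on a simple computation: signed chi-squared deviations of a contingency table crossing lexical classes and periods. A minimal sketch, with placeholder counts rather than the paper's data:

import numpy as np

counts = np.array([          # rows: lexical classes, columns: periods
    [420, 390, 260, 210],    # e.g. an economic class
    [180, 200, 260, 340],    # e.g. a socio-historical class
])
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / counts.sum()
# Signed chi-squared contribution per cell: positive = over-represented.
deviation = np.sign(counts - expected) * (counts - expected) ** 2 / expected
print(deviation.round(1))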
By contrast, the dominant contents of the second half of the series are more varied and express a growing interest in philosophical and humanistic subjects. A general movement seems to shift the focus from the world order toward human beings and the meaning of their lives in the world.

Here the description of the chronology of meaning reaches its limits, for the dominant contents observed here are precisely the consensual ones, those that find expression in every period in proportions that ultimately guarantee their quantitative superiority. The dynamic movement of these contents thus leads back to a question about the periods specific to them. That said, we posit that the dynamics of the dominant contents necessarily rests on the particular meanings of those periods. The study of the subordinate level of the genetics of discourse (immediately below) is certainly instructive for a more detailed analysis of the semantic specificity of each period. The study of the formation of meaning also informs us about the relation between the particular, temporary meaning and the general, dominant one; it is indispensable for completing and clarifying our observations on evolution.

4. Logogenetics, or the genetics of discourse

The word logogenetics takes up the English logogenesis, whose meaning and interest Halliday (1994) spells out in the following terms:

"It is helpful to have a term for this general phenomenon – i.e. the creation of meaning in the course of the unfolding of text. We shall call it logogenesis, with 'logos' in its original sense of 'discourse' (see Halliday & Matthiessen, 1999: 18; Matthiessen, 2002b). Since logogenesis is the creation of meaning in the course of the unfolding of a text, it is concerned with patterns that appear gradually in the course of this unfolding; and the gradual appearance of patterns is, of course, not limited to single texts but is rather a property of texts in general instantiating the system of language." (Halliday, 1994: 601)

Logogenetics, the genetics of discourse, allows us to reconnect with the linguistic models that run through the text and contribute to its formation. Concretely, in the time-ordered observation and confrontation of the DHCs of the four sub-maquettes we see a great opportunity to reconstruct the semantic models proper to the periods of semantic stability, the models that found the general movement of meaning and its stabilization at the global level over time. We reuse the parameters of the DHC of the global maquette on the four sub-maquettes and obtain the dendrograms below (Figures 3 and 4). A careful examination of the internal structure of the sub-maquettes of meaning can offer analytical reading grids for the dominant contents, for their dynamics and for their formation. We cannot exhaust the heuristic value of these dendrograms here, and we confine ourselves to underscoring the main contribution of this approach to the description of meaning, without claiming to carry out an in-depth analysis of meaning; the latter would have to rest on a systematic study of the lexical networks, which exceeds the objective of this contribution.

Figure 3: The lexical worlds of the first two periods

Figure 4: The lexical worlds of the last two periods

The first point to underline is the permanence of the fundamentals of the discourse and the fixed number of lexical worlds that stabilize from one period to the next.
This stability of the semantic structure ratifies the pertinence of studying evolution, which necessarily takes place within a stable environment. Let us observe the evolution of the hierarchy of information from one period to the next. The graph below (Figure 5) reports the weight of each semantic domain within the classified TS. The comparison is instructive from period to period, and between the level of the sub-maquettes and the higher level of the global maquette.

Figure 5: Evolution of the weight of the fundamentals of the discourse over time (in percentages)

Whatever the period, the political contents remain the most dominant. Examining the internal distribution of the political classes, we note the weight of the international-politics classes, which consistently number two (Figures 3, 4), as against the global level, which has only one (Figure 1, class 3). It is the scale of the international-politics classes in the sub-maquettes that gives the political themes their superiority; and yet this is not the case at the global level. The reason lies mainly in the conjunctural nature of international events: the American wars of the first and last periods, the security questions topical in Europe after the fall of the Berlin Wall, the Kosovo war in the third period, and the Israeli-Palestinian conflict with its variants and its ebbs and flows over time (see the contents of the lexical classes, Figures 3, 4). So many events are specific to certain periods, and not all of them manage to stabilize at the global level so as to characterize the 19 years; hence the predominance of political contents in the sub-maquettes and their retreat at the global level.

The economic contents, by contrast, show the opposite tendency. At the global level they occupy the top of the hierarchical pyramid, with three classes. At the subordinate level of the sub-maquettes they come second, and fall to third place in the last period. The number of their classes fluctuates between three and one. Curiously, it is the maximal variety in the number of economic classes that ends up stabilizing at the global level. Unlike the international-politics themes, the economic themes have more durable prolongations: it suffices to observe the dendrograms of the sub-maquettes to locate in time the sources of the three economic classes of the global maquette.

As the evolution of the hierarchy of information clearly shows (Figure 5), the socio-historical themes keep growing, overtaking the economic themes in the last period. This finding is fully compatible with the dynamics of the global meaning (Figure 2), where we observed the record deficits of the economic themes and the significant over-use of the socio-historical classes. Note also that the latter grow both quantitatively and qualitatively: it is exclusively in the last period that two socio-historical classes appear. In this last period, class 6, characterized by « enfant » and « femme », resembles class 6 of the global maquette (Figure 1), whereas the neighbouring class (class 2), lexicalized by « science », « recherche », « individu », « pratique », has no lexical equivalent at the global level. These are emergent contents with no precedent in the CTS.
The vocabulary of class 2 sits halfway between the societal and the social. The most characteristic TS of the class sheds light on its rhetorical particularity. On the occasion of the 2007 G8 Summit, whose theme was "growth and responsibility", Le Monde diplomatique published a tract calling for a generalized cultural revolution. We widen the observation window beyond the limits of the TS in order to better identify the semantic content. (While the TS is limited to a run of 10 nouns among the 380 most frequent nouns of the corpus, reading does not stop at the boundaries of the TS: it starts from them. According to Rastier (2007), the passage, an islet of relevance, "has no fixed bounds, and its span obviously depends on the point of view that determined its selection" (p. 31). Our settings target the paragraph, i.e. the period, which belongs to the mesotextual level, the locus where themes are observed and objectified; reading then follows, along the syntagmatic axis, the development of a theme from one TS to the next. In the original layout, only the nouns specific to class 2 were set in red.)

"When, here again, will worldwide initiatives be launched by a few courageous countries (we are waiting for France) to wrong-foot the old temptation of subjugating research to objectives dictated by a few manipulators, and to send researchers, on the contrary, to the assault of the new vital questions: such as, in the human sciences, the forms of anthropological, political and democratic legitimacy that would suit a world-society in formation; such as, in the technological sciences, the necessary break with the great energy-devouring systems, a break that would tomorrow allow societies (local, urban, regional) to secure their food and energy autonomy without withdrawing from the worldwide conversation made possible by the instantaneous circulation of data? In short, the worst reflexes of defensive solidarity can no longer hide the questions that are now immediately planetary: the question we shall no longer hesitate to call simply nature, that support of terrestrial life which has become the principal point of resistance against the mirage of money-value; the question of culture, identity-related and artistic as much as scientific, which constitutes, at least on a par with a material production that is now technologized, a vast universe of essential activities whose open logic cannot be subjugated to industrial or financial yield without peril for civilized humanity and for its democratic plurality; and finally the crucial question of societies more autonomous with respect to the techno-chrematistic whirlwind, societies that will in the future be so many sources of more stable employment, of activities less wasteful of energy and less polluting, and also of political conversations closer to the citizens." (August 2007)

The most specific TS belongs to a passage that calls for a generalized cultural revolution, one that takes on the most urgent societal and civilizational questions and promotes alternative solutions. The revolution is that of scientific culture: what is urgent is an overhaul of the dominant, single-track thinking in every domain.
Everything is to be reinvented: theories of reference for a world-society other than globalization, economic theories in the service of societies and of human beings, other, bioethical technologies that respect nature, all this in order to remain faithful to democratic culture. This passage gives an idea of the semantic color of this class, exclusive to the last period and escaping the global meaning. More generally, the socio-historical contents undergo a qualitative turn over time. On the dendrograms (Figures 3, 4) one can identify their free placement between the political themes and the economic themes from one period to the next: in the first two periods, the questions about history and the human condition are mobilized by the political situation, whereas the regressing economic contents of the last two periods attract the socio-historical themes.

5. Conclusion

Relating the semantic structure of the sub-maquettes to the dynamics of the dominant contents illuminates the formation of the global meaning and its logic. In other words, the dynamization of the global meaning through the projection of the lexical classes onto the chronology constitutes an intermediate level between the level of the sub-maquettes, that of the stable semantic phases and their particular meanings, and the synthetic level of meaning, which finally stabilizes at the global level after the accumulation of particular meanings.

What we wished to illustrate here is, precisely, the value of resorting to a maquette: a reasoned reduction of the corpus to its most frequent nouns, a scale model replicated in the subcorpora of semantic stability. This device, coupled with cooccurrence statistics targeting the structuring lexical networks, gives rapid access to the semantic backgrounds, the first condition for practicing a corpus semantics. The maquette marks out a corpus semantics that proceeds from the global to the local (Rastier, 2001). More concretely, if the cooccurrence is the minimal interpretant grasped within the passage (Rastier, 2007), we have assigned it the task of measuring semantic time in order to determine the phases of semantic stability in which the stabilized lexical worlds (Reinert, 1993, 2008) can be observed. The latter are the maximal interpretants objectifiable at the level of the maquette and the sub-maquettes.

The maquette as we conceive it does not amount to a generalizable model but to a generalizable use, a use which, for each corpus, contributes to reconstructing its semantic model, whatever its specificity, and to fulfilling the purpose of its design. Here, in the case of chronological corpora, the maquette reconciles the study of meaning with the study of time. While the former requires delinearizing the text and capturing its non-sequential structure, the latter follows the sequential organization of the texts. As a device designed for a predefined use, the maquette integrates the study of the non-sequential into the sequential and erases the false contrast between them.

References

Brunet E. (2008). Les séquences (suite). JADT 2008.
Brunet E. (2012). Nouveau traitement des cooccurrences dans Hyperbase. Corpus (11).
Halliday M. A. (1994). Introduction to Functional Grammar. London: Edward Arnold.
Lebart L. and Salem A. (1994). Statistique textuelle. Paris: Dunod.
Mayaffre D. (2008a). Quand 'travail', 'famille', 'patrie' co-occurrent dans le discours de Nicolas Sarkozy.
Etude de cas et réflexion théorique sur la cooccurrence. JADT 2008.
Mayaffre D. (2008b). De l'occurrence à l'isotopie. Les co-occurrences en lexicométrie. Sémantique & syntaxe (9).
Mayaffre D. (2014). Plaidoyer en faveur de l'Analyse des Données co(n)textuelles. Parcours coocurrentiels dans le discours présidentiel français (1958-2014). JADT 2014.
Metwally H. (2017). Les thèmes et le temps dans Le Monde diplomatique (1990-2008). Doctoral thesis, Université Côte d'Azur.
Rastier F. (2001). Arts et sciences du texte. Paris: PUF.
Rastier F. (2007). Passages. Corpus (6), pp. 25-54.
Rastier F. (2011). La mesure et le grain. Sémantique de corpus. Paris: Champion.
Ratinaud P. and Marchand P. (2012). Application de la méthode ALCESTE aux « gros » corpus et stabilité des « mondes lexicaux » : analyse du « CableGate » avec IRAMUTEQ. JADT 2012.
Reinert M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2), pp. 187-198.
Reinert M. (1993). Les « mondes lexicaux » et leur « logique » à travers l'analyse statistique d'un corpus de récits de cauchemars. Langage et société (66), pp. 5-39.
Salem A. (1988). Approches du temps lexical. Statistique textuelle et séries chronologiques. Mots (17), pp. 105-143.
Salem A. (1991). Les séries textuelles chronologiques. Histoire & Mesure, VI (1/2), pp. 149-175.
Salem A. (1993). De travailleurs à salariés. Repères pour une évolution du vocabulaire syndical (1970-1993). Mots (63), pp. 74-83.
Salem A. (1994). La lexicométrie chronologique. In Actes du colloque de lexicologie politique « Langages de la Révolution ». Paris: Klincksieck.
Viprey J.-M. (2005). Corpus et sémantique discursive : éléments de méthode pour la lecture des corpus. In A. Condamines (ed.), Sémantique et corpus. Paris: Lavoisier.
Viprey J.-M. (2006). Structure non-séquentielle des textes. Langages (183).

Séries textuelles homogènes

Jun Miao 1, André Salem 2
1 Université Lumière de Lyon 2, France – miaojun@miaojun.net
2 Université de la Sorbonne nouvelle - Paris 3, France – salem@msh-paris.fr

Abstract
Textometric methods, widely used for the study of large corpora, are applied here to a set of small texts which nonetheless present homogeneous characteristics. Our study focuses on a chronological textual series consisting of the reports of the successive congresses of the CCP (Chinese Communist Party) during the period 1982-2017. The textometric methods are first used to highlight the changes that occurred at the 2017 congress. Secondly, we apply these same methods to subcorpora consisting of collections of fragments, automatically extracted from each congress and related to the same topic. The subcorpora thus constituted make it possible to observe, with greater efficiency, the contextual variations that occur over time around the same type. The method can be extended to any corpus consisting of fragment systems that present a certain level of homogeneity.

Keywords: Textual series, Chinese political speeches, homogeneous subcorpora

Résumé
Nous appliquons ici des méthodes textométriques, largement utilisées pour l'étude de vastes corpus, à des ensembles de textes dont la taille est réduite mais qui présentent de fortes caractéristiques d'homogénéité. Notre étude porte sur une série textuelle chronologique constituée par les rapports successifs des congrès du PCC (Parti Communiste Chinois) durant les années 1982-2017.
Les méthodes textométriques sont d'abord mises en œuvre pour mettre en évidence les changements survenus lors du congrès de 2017. Dans un deuxième temps, nous appliquons ces mêmes méthodes à des sous-corpus, constitués par la réunion de fragments extraits de chacun des congrès et relatifs à un même thème. Les sous-corpus ainsi constitués permettent d'observer avec une efficacité accrue des variations contextuelles qui surviennent au fil du temps autour d'une même forme-pôle. La méthode peut être appliquée à tout corpus constitué de systèmes de fragments présentant une certaine homogénéité entre eux.

Mots-clés : séries textuelles, discours politique chinois, sous-corpus homogènes

1. Introduction

(The analyses reported below were carried out with the Lexico5 software. Cedric Lamalle, William Martinez and Serge Fleury contributed substantially to the development of its functionalities; the authors wish to thank them.)

The growth of textometric capacities now makes it possible to explore with profit extremely large and often heterogeneous collections of texts. We have nonetheless insisted, along with others, on the value of applying these same methods to corpora made up of textual productions with strong characteristics of homogeneity, which are necessarily smaller for that reason (Salem, 1991). Beyond the chronological series from which we borrow our examples, the approach we present can be applied to corpora of various kinds.

For several decades, the National Congress of the Chinese Communist Party (CCP) has been held once every five years. It is the Party's highest body, in which important decisions are announced. (The article by Salem and Wu (2008) is a chronological study covering all CCP congresses from the Party's foundation in 1921 up to 2012. Beyond the chronological evolutions it brought to light, that study shows the heterogeneous character of the congress form when considered on so large a scale.) Over the last decade, commentaries and quantitative analyses of the CCP congress texts, more or less grounded in statistical methods, have multiplied in the press and on various websites.

The corpus we study gathers the texts produced at the CCP congresses between 1982 and 2017. For reasons we analyze, the texts produced during this period are highly homogeneous in size as well as in the themes they address and in the style they employ. We begin with a classical study of the chronological series PCC1982-2017, divided by congress, in order to bring out variations in vocabulary use. We then propose a method that allows, in our view, the closest possible study of variations in the immediate context of a given term.

2. Chronological analysis of the PCC1982-2017 series

The corpus thus constituted totals 1,151,338 occurrences for 7,365 distinct forms. (The continuous textual sequence of the Chinese texts, composed of juxtaposed characters (scriptio continua, in which words are not separated by spaces), was submitted to the NLPIR automatic segmenter (Zhang, 2016), very widely used in the Chinese-speaking world, to be segmented into graphic words.) The division by congress partitions the corpus into eight parts. The lengths of the parts range, for each congress, between 2,400 and 2,900 occurrences. The form of maximal frequency is always 的 (de, DE1), whose marked decline across the congresses can be verified. (We explain in a parallel study how this progressive decline can be related to the evolution of writing style.)
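The preprocessing just described, segmenting Chinese scriptio continua into graphic words before any counting, can be sketched as follows. The paper uses the NLPIR segmenter; the widely available jieba tokenizer is substituted here purely as a stand-in, and the file names are invented.

from collections import Counter
import jieba   # stand-in for the NLPIR segmenter used in the paper

corpus = {}    # congress year -> list of graphic-word tokens
for year in (1982, 1987, 1992, 1997, 2002, 2007, 2012, 2017):
    with open(f"pcc_{year}.txt", encoding="utf-8") as f:  # invented paths
        corpus[year] = [w for w in jieba.cut(f.read()) if w.strip()]

# e.g. the relative frequency of 的 (de, DE1) congress by congress
for year, tokens in corpus.items():
    counts = Counter(tokens)
    print(year, round(counts["的"] / len(tokens) * 100, 2), "%")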
2.1 The 2017 congress

When a new congress completes a pre-existing chronological series, the specificity method answers the question: what are the main lexical evolutions that occurred at the last congress of the series? This is an operation of lexical monitoring. The specificity computation applied to the 2017 congress flags positive specificities whose content is distinctly lexical: 时代 (shídài, era, S +24), 治理 (zhìlǐ, to govern, S +21), 生态 (shēngtài, ecology, S +15), 梦 (mèng, dream, S +14). (In our examples, the native Chinese form is followed by its pinyin transcription and then by a gloss, which cannot claim the status of a satisfactory translation for every occurrence of the term. A specificity coefficient of the form S +/- xx indicates the degree of positive or negative specificity of the form in the part of the text considered.) Conversely, the negatively specific forms for this same period are rather grammatical forms, such as 的 (de, DE1, S -38), 这 (zhè, this, S -22) and 地 (de, DE2, S -14).

The same computation applied to the repeated segments of the corpus refines the picture of the changes at this congress. The spotlighting of the term 新 时代 (xīn shídài, new era), used 36 times at the 2017 congress, has been widely commented on by the analysts who examined this text. (Many articles published on that occasion explicitly mentioned the frequency, 36 occurrences, of the formula 新 时代, e.g. Vandepitte (2017); other sites invited readers to rank the congresses by the frequency of several terms repeated in each congress (Qian, 2017).) A systematic survey of the strongly specific segments for this period brings out repeated sequences, some of which may have escaped the commentators, and which likewise constitute neologisms with respect to the preceding congresses: 新 时代 中国 特色 社会主义 (socialism with Chinese characteristics for the new era; 13 occ., S +12), 治理 体系 (governance system; 13 occ., S +12). More remarkable in our eyes, certain expressions that were extremely common in the preceding periods have completely disappeared from the text of the last congress. Such is the case, for example, of segments like 有 中国 特色 (to possess Chinese characteristics; 0 occ., S -7) and 有 中国 特色 社会主义 (socialism with Chinese characteristics, in its older phrasing; 0 occ., S -5).

The specificity analysis also makes it possible to locate the parts of the text in which lexical renewal proves particularly important. In Figure 1, a section map has been drawn for each congress divided into chapters. Sections appear darker the more occurrences they contain of terms specific to the last congress. The representation shows that, within the last congress, renewal does not proceed uniformly.
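The repeated segments on which the second computation operates are recurring word n-grams; once counted, they can be scored with the same specificity model as single forms. A minimal, illustrative sketch:

from collections import Counter

def repeated_segments(tokens, n_max=5, min_freq=2):
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(2, n_max + 1)
        for i in range(len(tokens) - n + 1)
    )
    return {seg: f for seg, f in counts.items() if f >= min_freq}

# e.g. repeated_segments(corpus[2017]).get(("新", "时代")) would return the
# frequency of xin shidai ("new era") in the 2017 congress.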
Part of the vocabulary specific to the 2017 congress was already largely present in the two preceding congresses. The map also makes it possible to locate precisely the chapters of the last congress that undergo the strongest lexical renewal.

Figure 2 below shows the evolution of vocabulary in the last period by combining a factorial representation of all the congresses with the specificities computed for the last congress. An analysis carried out on the eight congresses brings out the gradualness of the lexical changes. The positively specific forms of the last part have been projected as supplementary elements. This type of representation can be articulated with the section maps presented above to illustrate the lexical changes.

3. Using document structure

In each text of the original edition of the congresses, editorial markers (subheadings, numbering of subsections, etc.) allow a division into smaller units that we shall call chapters. Each chapter corresponds to the treatment of a particular theme (economic development, international outlook, state of the armed forces, etc.). At each congress these themes are addressed in turn, often in a similar order, which invites a global description of the organization of these congress texts.

Figure 1: Locating the characteristic portions of the last congress (2017). Reading guide: on the left, a section map built from the division into chapters; each row groups the chapters of one congress. The darkest squares correspond to the chapters most loaded with forms specific to the last congress (S+ > 10). Below the map, the text of the second chapter of the last congress, flagged as particularly loaded with specific forms:

同志 们 : ¶ 现在 , 我 代表 第十八 届 中央 委员会 向 大会 作 报告 . ¶ 中国共产党 第十九 次 全国 代表大会 , 是 在 全面 建成 小康 社会 决胜 阶段 * 中国 特色 社会主义 进入 新 时代 的 关键 时期 召开 的 一 次 十分 重要 的 大会 . ¶ 大会 的 主题 是 : 不 忘 初心 , 牢记 使命 , 高举 中国 特色 社会主义 伟大 旗帜 , 决胜 全面 建成 小康 社会 , 夺取 新 时代 中国 特色 社会主义 伟大 胜利 , 为 实现 中华民族 伟大 复兴 的 中国 梦 不懈 奋斗 /... /. ¶

Figure 2: Positive specificities of the 2017 congress highlighted in the correspondence analysis. Reading guide: the successive congresses spread out over time along a parabola. This spread results from the substantial renewal of vocabulary from one congress to the next. The forms most specific to the last congress have been projected as supplementary elements.

3.1 Analysis by chapters

When the same corpus, divided this time into chapters, is submitted to typological analyses, chapters corresponding to the same themes but belonging to different congresses show a strong tendency to cluster together, because they use similar vocabularies. The chronological structure brought out by the analysis by congress gives way, in this case, to a typology of a thematic order. Figure 3 shows the results of a correspondence analysis performed on the PCC1982-2017 corpus divided into 89 chapters. In this figure the chapter identifiers consist of two parts: the first number indicates the congress from which the chapter is taken, the second the chapter's position within the congress.
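The factorial representations of Figures 2 and 3, a correspondence analysis with the most specific forms projected as supplementary elements, can be sketched in a few lines of linear algebra. This is a generic CA implementation, not Lexico5's; the second function applies the usual transition formula to place an extra form on existing axes without letting it shape them.

import numpy as np

def ca(table):
    # table: parts x forms contingency matrix
    P = table / table.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U * sv / np.sqrt(r)[:, None]      # principal coordinates of parts
    cols = Vt.T * sv / np.sqrt(c)[:, None]   # principal coordinates of forms
    return rows, cols, sv

def supplementary_col(col_counts, rows, sv):
    # Transition formula: project a supplementary form onto the axes
    # computed by ca() without letting it influence them.
    profile = col_counts / col_counts.sum()  # distribution over the parts
    return (profile @ rows) / sv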
As Figure 3 shows, chapters corresponding to the same theme tend to cluster strongly.

Figure 3: Correspondence analysis of the corpus divided into chapters

By way of example, we have enlarged the portions of the graph corresponding to two thematic groups: (a) the one country, two systems group, which corresponds to a constant political orientation of the CCP, reaffirmed at every congress; (b) a group of chapters corresponding to the analysis of international relations, likewise an unavoidable moment of every congress from the 14th onward.

3.2 The thematic subcorpus "one country, two systems"

The next step consists in reiterating these same analyses on reduced subcorpora that gather only the chapters relating to a given theme. The textometric analyses performed on these homogeneous subcorpora yield particularly legible results, and in the analysis of this type of corpus the chronological dimension returns to the foreground. The subcorpus gathering the passages relating to the theme one country, two systems counts only some two thousand occurrences over all the congresses. The analysis of the forms that appear specifically in the contexts of this term nevertheless shows a clear evolution of its immediate context. The 1987 congress presents the formula as a principle to be implemented. In the following congresses the verbs maintain and continue appear (2002), then implement unswervingly (2007). In 2017, the point is to apply the principle one country, two systems fully and with precision. Figure 4 shows a projection of the various segments containing the expression onto the analysis carried out on the subcorpus. (The graph has been slightly modified for greater legibility: redundant segments were discarded and superimposed points slightly displaced.)

Figure 4: Lexical variations around the expression one country, two systems

4. Conclusion

Our experiments lead us to conclude that textometric analysis carried out on groupings of homogeneous fragments, collected around one theme across the years covered by a chronological series, produces results whose interpretation proves particularly easy. The great lexical homogeneity of the fragments brought together then allows very fine variations to be observed; it largely compensates for the reduced size of the corpus, a priori unfavourable for textometric studies. Beyond chronological textual series alone, the method can be used for any kind of corpus, in a wide variety of languages, provided that homogeneous thematic subsets can be distinguished.

References

Miao J. (2012). Approches textométriques de la notion de style du traducteur. Analyses d'un corpus parallèle français-chinois : Jean-Christophe de Romain Rolland et ses trois traductions chinoises. Doctoral thesis supervised by André Salem, Paris 3.
Qian G. (2017). 中共历届党代大会报告语象分析 (Lexical analyses of the reports of all congresses of the Chinese Communist Party). Lianhe Zaobao, 19 November 2017.
Salem A. (1991). Les séries textuelles chronologiques. Histoire & Mesure, VI (1/2), pp. 149-175.
Salem A., Wu L.-C. (2008). Essai de textométrie politique chinoise.
In André Salem and Serge Fleury (eds.), Lexicometrica – Explorations textométriques, Vol. 1. URL: http://lexicometrica.univ-paris3.fr/numspeciaux/special8.htm (accessed 5 February 2017).
Vandepitte M. (2017). Quatre choses à savoir sur la Chine – dans le cadre du XIXème congrès du Parti. Translated from Dutch into French by Anne Meert. Investig'Action, 15 November 2017. URL: goo.gl/8fgSkq (accessed 25 November 2017).

Software used:
Zhang H.P. (2017). NLPIR Chinese automatic segmenter. URL: http://www.nlpir.org/
Salem A. (2017). Lexico 5 textometric analysis tool. URL: http://www.lexi-co.com/index.html

TaLTaC in ENEAGRID Infrastructure

Silvio Migliori 1, Andrea Quintiliani 1, Daniela Alderuccio 1, Fiorenzo Ambrosino 1, Antonio Colavincenzo 1, Marialuisa Mongelli 1, Samuele Pierattini 1, Giovanni Ponti 1, Sergio Bolasco 2, Francesco Baiocchi 3, Giovanni De Gasperis 4
1 ENEA DTE-ICT – silvio.migliori@enea.it
2 Sapienza Università di Roma
3 Staff TaLTaC – info@taltac.it
4 Dip. DISIM, Università dell'Aquila

Abstract
The aim of this joint ENEA-TaLTaC project is to enable the TaLTaC user community and digital humanists to have remote access to the TaLTaC software through the ENEAGRID infrastructure. ENEA's research activities on the integration of language technologies (multilingual text mining software and lexical resources) into the ENEA distributed digital infrastructure provide a "community Cloud" approach within a digital collaborative environment and on an integrated platform of tools and digital resources, for the sharing of knowledge and the analysis of textual corpora in the economic and social sciences and the e-Humanities. Access to the TaLTaC software in its Windows and Linux versions will exploit the high computational capacity (800 Teraflops) of the e-infrastructure, which users access as a single virtual supercomputer.

Riassunto
Obiettivo del progetto congiunto ENEA-TaLTaC è consentire alla comunità degli utenti TaLTaC e ai ricercatori nelle Digital Humanities l'accesso remoto al software TaLTaC attraverso l'infrastruttura digitale ENEAGRID. Le attività di ricerca dell'ENEA sull'integrazione delle tecnologie linguistiche (software di Text Mining per testi multilingue e risorse lessicali) in ENEAGRID forniscono un approccio "community Cloud" in un ambiente collaborativo digitale e su una piattaforma integrata di strumenti e risorse digitali, per la condivisione delle conoscenze e l'analisi di corpora testuali in Scienze Economiche e Sociali ed e-Humanities. L'accesso al software TaLTaC in versione Windows e Linux sfrutterà l'elevata capacità computazionale (800 Teraflops) dell'infrastruttura di calcolo, a cui gli utenti accedono come ad un unico supercomputer virtuale.

Keywords: Text Mining Software, Cloud Computing, Digital Humanities, Socio-Economic Sciences, Big Data.

1. Introduction

"TaLTaC in CLOUD" is a joint ENEA-TaLTaC project for the set-up of an ICT portal on the ENEA distributed e-infrastructure (Ponti et al., 2014) hosting the TaLTaC software (Bolasco et al., 2016, 2017). (The ENEAGRID infrastructure is based on several software components which interact with each other to offer an integrated distributed system, giving access to all these resources as a single virtual system, with an integrated computational availability of about 16,000 cores provided by several multiplatform systems.) Users will access the TaLTaC software (Windows and Linux versions) in a remote and ubiquitous way, together with the computational power (800 Teraflops) of ENEA's distributed ICT resources, as a single supercomputer. The aim of this joint project is to enable the TaLTaC user community and digital humanists to have remote access to the TaLTaC software through the ENEAGRID infrastructure, integrating ICT into digital cultural research.
ENEAGRID offers a digital collaborative environment and an integrated platform of tools and resources assisting research collaborations, for sharing knowledge and digital resources and for storing textual data. In this virtual environment, the TaLTaC software evolves from a stand-alone uniprocessor application toward a multiprocessor design, integrated in an ICT research e-infrastructure. Furthermore, it evolves toward implementing lexical and semantic knowledge and e-resources for ancient languages, addressing research needs and implementing solutions for Digital Humanities communities as well.

2. TaLTaC Software

The TaLTaC software package, conceived at the beginning of the 2000s, has been progressively developed to date in three major releases: T1 (2001), T2 (2005) and T3 (2016). It is widespread in the text analysis community in Italy and abroad, with over 1,000 licenses held by some two hundred entities, including university departments, research institutions and other organizations. The 2018 release of the software, T3, implemented the following priority objectives: (i) the processing of big data (around a billion words), achieving independence from the dimensions of the text corpora, limited only by hardware resources; (ii) the automatic extraction of results from text parsing (tokenization) on multiple layers: layer zero (text in the original version), layer 1 (recognition of words with automatic correction of accents), layer 2 (pre-recognition of the most common Named Entities), layer 3 (reconstruction of pre-defined multiwords); (iii) computing speed, taking advantage of the multi-core processing readily available on current computers (personal or cloud).

Table 1 shows the processing times of three parsings, up to layer 2, for increasingly large corpora on a PC (1 core and 8 cores) and on ENEAGRID. Preliminary results on ENEAGRID (1 core, CRESCO) show that as corpus size increases the saving of time grows even larger. TaLTaC was installed in the ENEAGRID infrastructure, but the computational capabilities of the HPC system are not yet exploited, because the current version of the software does not support multi-core. The present ENEAGRID capabilities therefore allow only multi-user access and computation; future versions of the software will be tested for multi-core capabilities in order to exploit the real power of ENEA's ICT high-performance computing.

Table 1. Preliminary results: processing times of three parsings on PC and on ENEAGRID.

corpus                               tokens (millions)   file size (GB)   MAC i7 (7th gen.) 1 core (min)   MAC i7 8 cores (min)   8 cores / 1 core   ENEAGRID 1 core, CRESCO (min)
1 "La Repubblica" (100k articles)    74                  0.41             3.4                              1.1                    0.33               3.5
2 "La Repubblica" (400k articles)    284                 1.55             13.0                             3.8                    0.29               13.2
3 Italian and French press           535                 2.89             37.4                             8.8                    0.24               41.3
4 Various press collection           1,138               6.18             88.2                             14.0                   0.16               54.7

For the characteristics of the technological architecture of the TaLTaC3 platform, see previous works
(Bolasco et al., 2016, 2017), which can be summarized as follows:

a1) HTML5 for the GUI, with jQuery and its derived JavaScript frameworks encapsulating the GUI user-interaction functions for the macOS and Cloud solution;
a2) a native Windows .NET desktop application;
b) JSON (JavaScript Object Notation) as the inter-module language standard, a structured and agile format for data exchange in client/server applications;
c) Python/PyPy, an advanced script/compiled programming language, mostly used for textual data analysis and natural language processing in the CORE back end;
d) No-SQL: the high-performance key/value data store Redis, adopted for the persistence of vocabularies and linguistic resources;
e) RESTful interfaces as the standard for data exchange over the HTTP web protocol;
f) multi-processing, exploiting multi-core hardware in the best possible way by distributing processing among different CPU cores.

The choice of the Python language allowed the development of a cross-platform computational core running on Windows, Linux and macOS. In particular, the overall system of software processes runs smoothly on a Linux-based cloud computing facility such as ENEAGRID. Furthermore, the Python code compiled through the 64-bit PyPy just-in-time compiler allows very efficient macro operations over large sets of data stored as hash dictionaries, so that the upper limits of performance and capacity are set only by the physical limits of the host machine in terms of RAM, number of cores and OS kernel scheduler. In our tests, each node of the ENEAGRID infrastructure hosted a single Redis instance and 24 logical cores, with 16 GB of RAM (a minimal sketch of this Redis-plus-multiprocessing pattern is given below).
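As an illustration of items c), d) and f) above, the following sketch persists a vocabulary as a Redis hash after counting tokens in parallel across CPU cores. It assumes a local Redis server; the key name and the toy tokenizer are invented and do not reflect TaLTaC's internal API.

from collections import Counter
from multiprocessing import Pool
import re
import redis   # assumes a Redis server on localhost:6379

def count_tokens(text):
    return Counter(re.findall(r"\w+", text.lower()))

def build_vocabulary(documents, redis_key="vocab:demo"):   # invented key
    with Pool() as pool:                 # one worker per available CPU core
        partial_counts = pool.map(count_tokens, documents)
    vocab = sum(partial_counts, Counter())
    redis.Redis().hset(redis_key, mapping=dict(vocab))  # persist as a hash
    return vocab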
3. ENEAGRID Infrastructure

ENEA's activities are supported by its ICT infrastructure, which provides advanced services such as High Performance Computing (HPC), Cloud and Big Data services, and communication and collaboration tools. These advanced ICT services build on ENEA's research and development in HPC, high-performance networking and data management, including the integration of large experimental facilities, with special attention to public services and industrial applications. As far as High Performance Computing is concerned, ENEA manages and develops ENEAGRID, a computing infrastructure distributed over six ENEA research centres, totalling about 16,000 cores and a peak computing power of 800 Tflops. The HPC clusters are mostly based on conventional Intel Xeon CPUs, with the addition of some accelerated systems such as Intel Xeon Phi and Nvidia GPUs. Storage resources include RAID systems for a total of 1.8 PB, in SAN/Switched and SRP/InfiniBand configurations. Data are made available by distributed, high-performance file systems (AFS and GPFS). The ENEA Portici Centre became one of the most important Italian HPC centres in 2008 with the CRESCO project (Computational RESearch Center for COmplex Systems). The CRESCO HPC clusters are used in many of ENEA's main research and development activities, such as energy, atmosphere and sea modelling, bioinformatics, materials science, critical infrastructure analysis, fission and fusion nuclear science and technology, and complex systems simulation. In 2015 and 2016 the CRESCO clusters provided more than 40 million core hours each year to ENEA researchers and technologists and to their external partners (external users account for about 30% of the total machine time). CRESCO6, the new HPC cluster recently installed in Portici in the framework of the 2015 ENEA-CINECA agreement, provides a peak computing power of 700 Tflops and is based on the new 24-core Intel SkyLake CPU. Its nodes will be connected by the new Intel OmniPath high-performance network, providing 100 Gbps of bandwidth. The ENEA ICT department also provides general-purpose communication, computing and collaboration tools and services, such as network management, e-mail, video conferencing and VoIP services, and cloud computing and storage. User-friendly access to scientific and technical applications (such as Ansys, Comsol, Nastran, Fluent) is provided by dedicated web portals (virtual laboratories) relying on optimized remote data access tools such as NX technology.

4. TaLTaC in the ENEAGRID Infrastructure

4.1 Software installation and access on the ENEA e-infrastructure

The TaLTaC software is available on Windows and Linux through ENEAGRID via AFS, a geographically distributed file system that allows remote access to every computing node of the HPC CRESCO systems and of the cloud infrastructure from anywhere in the world. This provides three capabilities: (i) data mining, sharing and storage; (ii) the ICT services necessary for the efficient use of HPC resources, collaborative work, visualization and data analysis; (iii) the deployment of the software and its settings for future data processing and analysis. Moreover, the availability of the software on the ENEA ICT infrastructure benefits from the advantages of AFS, such as scalability, redundancy and backup. Through ACL rules it is possible to manage the accessibility of the software for the community of users, in compliance with the licence policies that will be put in place. Two options are provided for running TaLTaC: the first is to use the applications installed in the Windows system, and the second is to use FARO2 – Fast Access to Remote Objects (the general-purpose web interface to hardware and software capabilities) to directly access the applications installed in the Linux environment, which refer to the data in AFS.

4.1.1 TaLTaC2 (Windows) via remote desktop access

The TaLTaC2 software is available on Windows Server 2012 R2 by remote desktop access to a virtual machine reachable through the general-purpose, intuitive ThinLinc interface. All users involved in the project activities can access the server, but only the person in charge of developing and installing the application can obtain administrator privileges. For this reason, AFS authentication is always required. Every TaLTaC2 user with AFS credentials can access ENEAGRID to run the software and to manage data in their own AFS areas via the web and from any remote location. In the AFS environment, an assigned disk area with large memory capacity is defined. This area is mainly used for the storage and sharing of large amounts of data (less than 200 MB), i.e. analyses, reports and documents produced by running the software on a single processor, in serial mode, or for future parallel data mining applications.

4.1.2 TaLTaC3 (Linux) on the CRESCO system

On the CRESCO systems, accessible from the ENEAGRID infrastructure, TaLTaC3 is available on CentOS Linux nodes, making it possible to leverage the overall computing power dedicated to the activities of the TaLTaC and Digital Humanities communities. Every user can start their own work session, allocating a node with a reserved Redis instance and as many computing cores as needed.
Performance improvements are obtainable through parallelization, so that a single user can exploit the full capacity of the assigned node in terms of number of computing cores. The TaLTaC3 package is started automatically by a shell script as the user logs in to the node. In the current beta version, the open-source Mozilla Firefox web browser provides the user interface. Access to the TaLTaC3 portal uses the ThinLinc remote desktop visualization technology, which allows an almost transparent remote session on the HPC system, including the graphical user interface, thanks to built-in features such as load balancing, accelerated graphics and platform-specific optimizations. Input and output data can be accessed through the ENEAGRID file systems and are therefore easily uploaded and downloaded.

4.2 Case studies

The ENEA distributed infrastructure (and its cloud services) enables the management of the research process in the economic and social sciences and the Digital Humanities, providing technological solutions and tools to academic departments and research institutes: building and analyzing collections to generate new intellectual products or cultural patterns, data or research processes, building teaching resources, and enabling collaborative work and interdisciplinary knowledge transfer.

4.2.1 The TaLTaC user community

The current (2018) TaLTaC community has aggregated over the years users from the computer laboratories of automatic text analysis and text mining, including those run within institutional bachelor's and master's degree courses, plus Ph.D. students from doctoral programmes at the universities of Rome "La Sapienza" and "Tor Vergata", Padua, Modena, Pisa, Naples and Calabria (an estimated total of over 1,300 students over the last eight years). A further set of users subscribed to specific tutorial courses dedicated to TaLTaC (more than 60 courses, for a total of 750 participants). A call about the opportunity of using the software "remotely" via the ENEA distributed computing facilities received manifestations of interest from 40 departments and other research institutes.

4.2.2 The Digital Humanities community as TaLTaC user

In collaboration with academic experts, ENEA has focused on Digital Humanities projects in text mining and analysis of the ancient writing systems of the Near East, and used TaLTaC2 to perform quantitative linguistic analysis on cuneiform corpora transliterated into the Latin alphabet (Ponti et al., 2017). Cuneiform was used by a number of cultures in the ancient Near East to write 15 languages over 3,000 years. The cuneiform corpus is estimated to be larger than the corpus of Latin texts, yet only about one tenth of the extant cuneiform texts have been read even once in modern times. This huge corpus and the restricted number of experts motivate the use of text mining and analysis, clustering algorithms and social network analysis in the TIGRIS Virtual Lab for Digital Assyriology (TIGRIS – Toward Integration of e-tools in GRId Infrastructure for e-aSsyriology; http://www.afs.enea.it/project/tigris/indexOpen.php; http://www.laboratorivirtuali.enea.it/it/prime-pagine/ctigris), a virtual research environment implemented in the ENEA research e-infrastructure. In the TIGRIS V-Lab, researchers perform basic tasks to extract knowledge from cuneiform corpora (e.g. dictionary extraction with word lists of toponyms, chrononyms, theonyms and personal names, grammatical and semantic tagging, concordances, corpus annotation, lexicon building, grammar writing, etc.).
5. Conclusions

Researchers and their collaborators can use the computational resources in ENEAGRID to perform their work regardless of the location of the specific machine or of the hardware/software platform employed. ENEAGRID offers computation and storage resources and services in a ubiquitous and remote way. It integrates a cloud computing environment and exports: a) remote software (e.g. TaLTaC); b) Virtual Labs: thematic areas accessible via the web, where researchers can find sets of software (and documentation regarding specific research areas); c) remote storage facilities (with the OpenAFS file system). In this virtual environment, the TaLTaC software evolves from a uniprocessor design toward a multiprocessor one, integrated in an ICT research e-infrastructure. This project leads TaLTaC to evolve from stand-alone software (allowing Text Mining & Analysis to search for linguistic constructions in textual corpora and show the results in a table or concordance list) to software that is "always and anywhere on", providing an interface where users can visualize results, create interpretative models, collaborate with others, combine different textual representations, store data and co-develop research practices. Furthermore, this project reflects the shift from an individual-researcher approach to a collaborative research-community approach, leading to community-driven software design, tailor-made to the needs of specific research communities, and to Community Cloud Computing. This interdisciplinary knowledge transfer enables the creation of new knowledge from big (cultural and socio-economic) data, in both modern and ancient languages.

TIGRIS - Toward Integration of e-tools in GRId Infrastructure for e-aSsyriology: http://www.afs.enea.it/project/tigris/indexOpen.php ; http://www.laboratorivirtuali.enea.it/it/prime-pagine/ctigris

References

Bolasco, S., Baiocchi, F., Canzonetti, A., De Gasperis, G. (2016). "TaLTaC3.0, un software multi-lessicale e uni-testuale ad architettura web", in D. Mayaffre, C. Poudat, L. Vanni, V. Magri, P. Follette (eds.), Proceedings of JADT 2016, CNRS University Nice Sophia Antipolis, Volume I, pp. 225-235.

Bolasco, S., De Gasperis, G. (2017). "TaLTaC 3.0 A Web Multilevel Platform for Textual Big Data in the Social Sciences", in C. Lauro, E. Amaturo, M.G. Grassia, B. Aragona, M. Marino (eds.), Data Science and Social Research: Epistemology, Methods, Technology and Applications (series: Studies in Classification, Data Analysis, and Knowledge Organization), Springer, pp. 97-103.

Ponti, G., Palombi, F., Abate, D., Ambrosino, F., Aprea, G., Bastianelli, T., Beone, F., Bertini, R., Bracco, G., Caporicci, M., Calosso, B., Chinnici, M., Colavincenzo, A., Cucurullo, A., Dangelo, P., De Rosa, M., De Michele, P., Funel, A., Furini, G., Giammattei, D., Giusepponi, S., Guadagni, R., Guarnieri, G., Italiano, A., Magagnino, S., Mariano, A., Mencuccini, G., Mercuri, C., Migliori, S., Ornelli, P., Pecoraro, S., Perozziello, A., Pierattini, S., Podda, S., Poggi, F., Quintiliani, A., Rocchi, A., Sciò, C., Simoni, F., Vita, A. (2014). "The Role of Medium Size Facilities in the HPC Ecosystem: The Case of the New CRESCO4 Cluster Integrated in the ENEAGRID Infrastructure". In: Proceedings of the International Conference on High Performance Computing and Simulation, HPCS (2014), ISBN: 978-1-4799-5160-4.
Ponti, G., Alderuccio, D., Mencuccini, G., Rocchi, A., Migliori, S., Bracco, G., Negri Scafa, P. (2017). "Data Mining Tools and GRID Infrastructure for Text Analysis", in R. De Boer and J.G. Dercksen (eds.), Private and State in the Ancient Near East: Proceedings of the 58th Rencontre Assyriologique Internationale, Leiden, 16-20 July 2012. Eisenbrauns. LCCN 2017032823 (print) | LCCN 2017034599 (ebook) | ISBN 9781575067858 (ePDF) | ISBN 9781575067841.

ENEAGRID: http://www.ict.enea.it/it/hpc
Laboratori Virtuali: http://www.ict.enea.it/it/laboratori-virtualixxx/virtual-labs
TIGRIS Virtual Lab: http://www.afs.enea.it/project/tigris/indexOpen.php
TaLTaC: www.taltac.it

The dimensions of Gender in the International Review of Sociology. A lexicometric approach to the analysis of the publications in the last twenty years

Isabella Mingo, Mariella Nocenzi
Sapienza University of Rome – isabella.mingo@uniroma1.it; mariella.nocenzi@uniroma1.it

Abstract 1 (in English)
The social sciences and, specifically, sociological research have progressively assumed the gender factor as one of the strategic keys to understanding contemporary phenomena. Indeed, whether as a variable for socio-statistical analysis or as a characterizing trait of individual identity, it is a decisive factor in the interpretation of deep social transformations, and it inspires sociologists' self-reflection about the analytical tools of their discipline. This contribution proposes, through a lexicometric approach, an analysis of the articles published in the last two decades by the oldest journal of sociology, published by Routledge. The main aim is to highlight the different ways in which gender issues are articulated in the international sociological research presented in the repertoire of the International Review of Sociology and to outline, both on the lexical level and on the topical level, the changes that occurred over time.

Abstract 2 (in French, Italian or Spanish)
Le scienze sociali e, nello specifico, la ricerca sociologica hanno progressivamente assunto il fattore del genere come una delle più strategiche chiavi di lettura dei fenomeni contemporanei. Si tratta, infatti, di un fattore che, quale variabile per l'analisi socio-statistica o come tratto caratterizzante dell'identità individuale, si rivela dirimente nell'interpretazione delle profonde trasformazioni sociali in atto e spunto per un'autoriflessione degli stessi sociologi sugli strumenti di analisi della loro disciplina. Il contributo propone, mediante un approccio lessico-metrico, un'analisi degli articoli pubblicati nelle ultime due decadi dalla più antica rivista di sociologia, edita da Routledge, con l'obiettivo di evidenziare i diversi modi con cui il concetto di genere viene declinato nelle ricerche sociologiche internazionali presentate nel repertorio dell'International Review of Sociology e di delineare, sia sul piano lessicale che su quello delle tematiche, i cambiamenti intervenuti nel corso del tempo.

Keywords: Gender, International Review of Sociology, Lexicometric Analysis, Textual Analysis, Social Change, Sociological Analysis

1. Introduction and the hypothesis of the paper

Ever since 1955, when in an influential paper the American scholar John Money (Money et al., 1955) coined the term gender to denote "those things that a person says or does to disclose himself or herself as having the status of boy or man, girl or woman", the social sciences have developed entire subfields and a wide range of topics to analyse it, with a variety of research methods.
Sociologists, in particular, have outlined specific theoretical approaches and have conducted many detailed studies to understand, first of all, what gender is and how it differs from sex. They share the view that while sex is a biological classification based on body parts, gender is a social classification based on one's identity, presentation of self, behavior, and interaction with others. Sociologists, hence, view gender as a learned behavior and a culturally produced identity and, for these reasons, define it as a "social" category. It has always been a very relevant category for the critical analysis of social construction, because one of the most important social structures is status, and one of the most strategic statuses is precisely gender. In the last decades, sociological theories and research based on gender have become more and more widespread, articulated, and integrated with other subfields of sociology and of the other social sciences. One of the most representative indicators of this development and specialization is not only the common recognition, and subsequent institution, of the sociology of gender as a subfield of sociology, but also the increasingly frequent use of gender as a reference concept in all the other sociological theoretical approaches to the analysis of the social system. The sociology of gender has itself studied many topics, with multiple research methods, including identity, social interaction, power and oppression, and the interaction of gender with race, class, culture, religion, and sexuality, among others. This paper aims to observe and, where possible, to interpret this progressive diffusion and specialization in the use of gender as a theoretical and research category through the publications of the International Review of Sociology, a sociological journal edited by Routledge with worldwide online and paper circulation, during the last two decades. This journal, the oldest review in the field of sociology in Europe, founded by René Worms in 1893 in Paris, still maintains – as its "Aims and scope" state – «the traditional orientation of the journal as well as of the world's first international academic organization of sociology, which started as an association of contributors to the International Review of Sociology: it assumes that sociology is not conceived apart from economics, history, demography, anthropology and social psychology. Rather, sociology is a science which aims to discover the links between the various areas of social activity and not just a set of empty formulas. Thus, the International Review of Sociology provides a medium through which up-to-date results of interdisciplinary research can be spread across disciplines as well as across continents and cultures» (see the journal's "Aims and scope" page: https://www.tandfonline.com/action/journalInformation?show=aimsScope&journalCode=cirs20). The authors propose to highlight the different ways in which gender issues are articulated in international sociological research, through an analysis of the articles published in the last two decades (1997-2017) in the International Review of Sociology. We consider the last two decades of publication not only because of the better accessibility of the International Review of Sociology catalogue.
For sociology, indeed, recent gender studies and research have registered a deeper specialization in terms of connection with other disciplines, unusual applications of the gender approach to some social phenomena, and the exploration of new research frontiers (multiple gender identities, gender-sensitive data arrangement, the non-alignment of sex and gender statuses, et similia).

2. Data and Methods

The analysis of the International Review of Sociology papers was carried out mainly through a lexicometric approach, integrated with hermeneutic analysis, useful both in the first and in the last phase of the study. The first phase concerned the collection of the corpus, while the last one concerned the interpretation of the results obtained from quantitative and automatic procedures. The lexicometric analyses, supported by the software IRaMuTeQ (an open-source package, distributed under the GNU GPL license, based on the R statistical software and on the Python language; it has currently reached version 0.7 alpha 2 and is still under development: Ratinaud, 2009), were carried out to extract the most relevant forms/lemmas and to apply some exploratory techniques for identifying the main lexical-textual dimensions, the relationships between some keywords, the recurring topics, and possible differences over the period analysed.

2.1. The Corpus: Selection Criteria and Preliminary Analysis

The texts analyzed in this study have been collected from the archive of the International Review of Sociology, considering the papers published from 1997 to 2017. In the first stage, all the papers containing the term gender in the title, abstract, body text and/or keywords were extracted. These numbered 235, distributed over the past 20 years, as shown in Table 1. Then, only those papers were selected which present a relevant reference to gender as a theoretical or empirical category – and not merely as part of the title of a cited source, as a statistical variable, or as a synonym – in order to outline meaningful remarks for the aims of each article. This selection was supported by a hermeneutic analysis, based on a careful reading of the papers to evaluate the centrality of gender issues in their hypotheses and theses, as well as in the implementation of the theoretical and/or empirical methodologies. The selected papers numbered 67, distributed over the past 20 years, as shown in Table 1.

Table 1 - Extracted and Selected Papers

Period      Extracted Papers (EP)   Selected Papers (SP)   SP/EP%
1997-1999   19                      2                      10.53
2000-2002   18                      3                      16.67
2003-2005   22                      3                      13.64
2006-2008   21                      3                      14.29
2009-2011   45                      20                     44.44
2012-2014   55                      15                     27.27
2015-2017   55                      21                     38.18
Total       235                     67                     28.51

The incidence of the selected papers on the extracted ones (SP/EP%) highlights the increased relevance of the term gender over time: it is used more and more often as an analytic category in sociological research, rather than as a synonym or to indicate only a demographic characteristic of individuals. The corpus submitted to the subsequent analyses therefore includes the 67 selected papers and has the following lexicometric measurements: size N = 495,470 tokens; word types V = 21,680; type/token ratio TTR = 4.38%; Hapax/V = 41.56%; Hapax/N = 1.82%. These characteristics show that the corpus can be considered sufficiently large for a quantitative analysis (Bolasco, 1999, p. 203).
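These measurements follow directly from the token and type counts. A minimal sketch of how they can be computed (generic code, not the IRaMuTeQ implementation):

```python
from collections import Counter

def lexicometric_measures(tokens):
    """Compute corpus size, vocabulary size, TTR and hapax ratios."""
    counts = Counter(tokens)
    N = sum(counts.values())            # corpus size (tokens)
    V = len(counts)                     # vocabulary size (types)
    hapax = sum(1 for f in counts.values() if f == 1)
    return {
        "N": N,
        "V": V,
        "TTR (%)": 100 * V / N,
        "Hapax/V (%)": 100 * hapax / V,
        "Hapax/N (%)": 100 * hapax / N,
    }

print(lexicometric_measures("the cat sat on the mat".split()))
```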
2.2. Strategy of Analysis

The analyses of the corpus, carried out with IRaMuTeQ, are the following:
1. Lexicon analysis: exploration of the lexicon used in the corpus and identification of theme words/lemmas;
2. Analysis of the specific lexicon: identification of specific words/lemmas by time period and by the gender of the author(s);
3. Correspondence analysis: extraction of lexical dimensions starting from the Aggregated Lexical Table (ALT) Lemma/Texts (Lebart, Salem 1994), in which the texts were identified according to the year of publication (Y = 1997, ..., 2017) and the gender of the author(s) (G = 1-Female; 2-Male; 3-Male and Female);
4. Cluster analysis: identification of the main topics through descending hierarchical analysis (Reinert, 1983) applied to the Binary Lexical Table (BLT) Text segments/Lemmas;
5. Similarity analysis: description of the clusters obtained in point 4, through a graphic representation starting from the proximity matrix between forms or lemmas.

References
Bolasco S. (1999). Analisi multidimensionale dei dati. Metodi, strategie e criteri d'interpretazione. Roma: Carocci.
Lebart L., Salem A. (1994). Statistique textuelle. Paris: Dunod.
Money, John; Hampson, Joan G.; Hampson, John (1955). "An Examination of Some Basic Sexual Concepts: The Evidence of Human Hermaphroditism". Bulletin of the Johns Hopkins Hospital, 97 (4), pp. 301-319.
Ratinaud, P. (2009). IRaMuTeQ: Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires. http://www.iramuteq.org.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les Cahiers de l'Analyse des Données, 8, 187-198.

The Rhythm of Epic Verse in Portuguese From the 16th to the 21st Century

Adiel Mittmann, Alckmar Luiz dos Santos
Universidade Federal de Santa Catarina (Florianópolis, Brazil)
adiel@mittmann.net.br, alckmar@gmail.com

Abstract
The verses of most epic poems in Portuguese have been written following the example of the Italian endecasillabo: a verse whose last stressed syllable is the tenth, which usually means, in both Italian and Portuguese, that most verses have a total of eleven syllables. In addition to the tenth, other syllables may be stressed within the verse as well, and the specific distributions of stressed and unstressed syllables make up different rhythmic patterns. In this article, we investigate how such patterns were used in six epic poems written in Portuguese, ranging from the 16th to the 21st century, for a total of 52,412 verses. In order to analyze such a large number of verses, we used Aoidos, an automatic scansion tool for Portuguese. By using supervised and unsupervised machine learning, we show that, though the influence of earlier poets (especially Camões) is ever present, poets favor different rhythmic patterns, which can be regarded as their rhythmic signature.
Keywords: Epic poetry, Portuguese, Scansion.

Résumé
Les vers de la plupart des épopées en portugais ont été écrits à l'instar de l'endecasillabo italien : un vers dont la dernière syllabe accentuée est la dixième, ce qui signifie généralement, en italien et en portugais, que la plupart des vers ont onze syllabes au total. En plus de la dixième, d'autres syllabes peuvent aussi être accentuées dans ce vers, chaque combinaison de syllabes accentuées et non accentuées représentant un standard rythmique.
Dans cet article, nous examinons comment ces standards ont été utilisés dans six épopées écrites en portugais, du XVIème au XXIème siècles, dans un total de 52 412 vers. Pour analyser une telle quantité de vers, nous avons employé Aoidos, un outil automatique de scansion pour le portugais. En utilisant des apprentissages supervisés et non supervisés, nous concluons que, encore que l'influence de poètes précédents (surtout celle de Camões) se fasse toujours remarquer, chaque poète préfère de différents standards rythmiques, qui peuvent être considérés comme sa signature rythmique.
Mots-clés : Epopée, Portugais, Scansion.

1. Introduction

Poets are frequently compared to one another, but over the centuries such comparisons have rarely been made objectively, especially with respect to verse structures. When critics state that a poet has followed the steps of another too closely and has therefore produced unoriginal and derivative work, they can seldom rely on objective facts. Works such as that of Chociay (1994), who manually analyzed and tabulated more than 1,500 verses, are not the rule but the exception. It is indeed a tedious and tiresome task for any human to carry out; but looking at a great amount of text from afar and extracting relevant information from it constitutes a core element of distant reading (Moretti, 2013).

Table 1: Poems included in the corpus. The code is derived from the poem's title.

Code  Author                         Born in    Poem            Year  Verses
L     Luís de Camões                 Portugal   Os Lusíadas     1572   8,816
M     Francisco de Sá de Meneses     Portugal   Malaca          1634  10,656
C     Santa Rita Durão               Brazil     Caramuru        1781   6,672
A     Fagundes Varela                Brazil     Anchieta        1875   8,484
B     Carlos Alberto Nunes           Brazil     Os Brasileidas  1938   8,504
F     José Carlos de Souza Teixeira  Brazil     Famagusta       2016   9,280
                                                Total                 52,412

In this article, we turn our attention to the verse most commonly used in epic poetry in Portuguese, the decassílabo, which was borrowed from Italian. (In both Italian and Portuguese, this kind of verse always has its 10th syllable stressed and typically has a total of eleven syllables, since most words in both languages have a stress on the penult. However, in Italian this verse is called endecasillabo because of the total number of syllables, whereas the Portuguese term decassílabo emphasizes the fact that the 10th is the last stressed syllable in the verse.) It is the verse used by Dante in his Divina Commedia and by Petrarch in his Canzoniere. Stressed syllables are distributed in the verse according to certain rules; in particular, the 10th syllable (which defines the length of the verse) must always be stressed. Other syllables may also be stressed, producing many possible rhythmic patterns—which are, both in Portuguese and Italian, required to have their 6th or, less commonly, their 4th syllable stressed (Versace, 2014). We identify such patterns by indicating the syllabic positions that are stressed within a given verse, so that a pattern like 3-6-10 means that the 3rd, 6th and 10th syllables are stressed. We are interested in tracking which rhythmic patterns poets have favored over the centuries and whether such patterns are characteristic of each poet. For this purpose, we have assembled a corpus consisting of six poems, whose publication dates range from the 16th to the 21st century, for a total of 52,412 verses (about 300,000 words).
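To make the pattern notation concrete, a pattern such as 3-6-10 can be encoded as a fixed-length binary vector over the ten syllabic positions (a generic sketch, not the Aoidos code):

```python
def pattern_to_vector(pattern: str, length: int = 10) -> list[int]:
    """Encode a rhythmic pattern like '3-6-10' as a 0/1 vector of stressed positions."""
    stressed = {int(p) for p in pattern.split("-")}
    return [1 if i in stressed else 0 for i in range(1, length + 1)]

print(pattern_to_vector("3-6-10"))    # [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
print(pattern_to_vector("2-4-6-10"))  # [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
```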
In order to analyze such a quantity of verses, we have used our automatic scansion tool, Aoidos (Mittmann et al., 2016), which is capable of scanning thousands of verses in a few seconds and of producing rhythmic information. The next section describes the corpus we used in our experiments; Section 3 reports the results obtained with our analyses; finally, Section 4 presents our conclusions and discusses future work.

2. Corpus

The poems chosen to compose the corpus for this article are summarized in Table 1. We adopted two criteria in order to select these poems. Firstly, we searched for an important—and thus well known—or exemplary epic poem in each century, from the 16th up to the present. Secondly, we required trustworthy and reliable digital editions; in one case (the 17th century), we produced a digital edition especially for this article, since no suitable candidate was found. Camões' poem Os Lusíadas is by far the most important epic poem ever written in Portuguese. Its influence can be felt, for instance, even in 20th-century lyrical poets such as Jorge de Lima. Meneses' Malaca Conquistada and Durão's Caramuru follow the Camonean model very closely: they use identical rhyme schemes, they have a similar argument and they celebrate a protagonist in like manner. Nevertheless, we would like to investigate whether the two authors innovated with respect to rhythm, even though they kept the overall model of the Camonean epic. These three poems in our corpus were written by Portuguese citizens (Durão was born in colonial Brazil and died before the country's independence), while the remaining three poems were written by Brazilian poets.

Figure 1: Scansion produced by Aoidos (the third stanza of Os Lusíadas, showing each verse's syllable division and rhythmic pattern).

Fagundes Varela's Anchieta, a Romantic piece of the 19th century, would not at first glance be an epic poem, since its subject is the telling of New Testament stories to Brazilian Indians by the priest José de Anchieta. However, as the historian Maria Aparecida Ribeiro and others remark, Anchieta is a kind of "religious epopee" (Ribeiro, 2003), which draws our attention to the Romantic effort to renew the ancient models inherited from Classical or Neoclassical literature (although it clearly returns to the Greek epic model, as it does not adopt stanzas of regular size). Despite some important differences in the narrative logic, the verses reproduce the most important invariants of the genre: the honoring of a protagonist (Anchieta) and the use of the decassílabo (blank ones, in this case). As for Carlos Alberto Nunes' Os Brasileidas, this poem also presents some invariants that characterize the traditional epic poem: blank decassílabo verses; several cantos, beginning with the proposition; the intention of celebrating an individual hero, in this case Antônio Raposo Tavares, a 17th-century Brazilian trailblazer.
In addition to the absence of rhymes, and in order to emphasize the differences in relation to the Camonean epic style, there is no regular stanza division in any of the nine cantos (ten, if we consider the epilogue), as in Anchieta; stanzas may vary significantly, from seven up to sixty-five or more verses. Finally, regarding Famagusta, by José Carlos de Souza Teixeira, one quickly notices that it is a curious combination of traditional epic elements from different ages. In addition to the epic intention of celebrating a historical event and some sort of heroic action, its formal elements are, to say the least, very heterogeneous. For instance, it takes the Camonean eight-verse stanza but adopts a different rhyme scheme, resulting no longer in the well-known ottava rima (ABABABCC), but in the medieval Sicilian stanza called strambotto romagnuolo (ABABCCDD), scarcely used in Brazilian literature. (The Brazilian-born baroque poet Manoel Botelho de Oliveira did use this stanza in some madrigals written in Spanish, such as this one: Si Cupido me inflama, / Si desdeñas mi empleo; / En amorosa llama, / En nieve desdeñosa el Etna veo, / Con amor, y tibieza / Tenemos su firmeza, / Y en disonancia breve / Suspiro fuego yo, tu brotas nieve.)

3. Analysis

In order to analyze the corpus, we used Aoidos, an automatic scansion tool for Portuguese (Mittmann et al., 2016), much like Métromètre (Beaudouin and Yvon, 2004) and Anamètre (Delente and Renault, 2015) for French. Starting from the written word, Aoidos produces a phonetic transcription for each verse and then applies many rules (such as elision or syncope) to produce a series of alternative scansions. By examining the poem as a whole, the system then selects the most appropriate alternative and, by applying a set of heuristics, proposes a rhythmic pattern for each verse. The scansions generated by Aoidos have been manually verified to be correct in 99.0% of cases (Mittmann, 2016). Figure 1 shows the output produced by the system for the 3rd stanza of Camões' Os Lusíadas.

Table 2: Rhythmic pattern usage (%) for each poem (rows L, M, C, A, B, F; one column per rhythmic pattern with an average usage of at least 1%, such as 2-6-10, 2-4-6-10, 3-6-10, 1-3-6-10 and 3-6-8-10).

A total of 42 different rhythmic patterns were found among all 6 poems. Table 2 shows how frequently patterns with an average usage of at least 1% were employed in each poem. In each row, the bold number indicates the pattern most favored by that row's poem. Although some patterns, such as 3-6-8-10 and 1-3-6-10, remain more or less constant, many others display a wide range of relative usage: pattern 2-6-10 ranges from 7.1% to 16.2%, and pattern 1-4-8-10 from 0.1% to 3.1%. Whereas Camões (L) does seem to set the tone for the following poems, there are clear differences when one considers patterns such as 2-4-6-10 and 2-4-8-10.
In fact, pairs such as Malaca (M) and Caramuru (C), or Anchieta (A) and Os Brasileidas (B), are more similar to each other than Camões' Os Lusíadas (L) is to any other poem. Looking at the numbers from one century to the next, a change of more than 5% can be seen twice: from Caramuru (C) to Anchieta (A) there was a decrease of 5.1% for the pattern 2-4-6-8-10, and from Os Lusíadas (L) to Malaca (M) the pattern 2-4-8-10 increased in usage by 5.4%. An interesting question arises at this point: do smaller parts of the poems reflect the overall distribution shown in Table 2? In other words, given a smaller part of a poem, could we tell from which work it was taken simply by looking at its rhythmic signature? To answer this question, we divided each poem into its cantos, for a total of 72 divisions, with an average of 727.9 verses per canto. We then extracted the usage frequency of the rhythmic patterns, thus producing a feature vector for each canto. By iteratively clustering such vectors, we obtained the dendrogram shown in Figure 2; complete linkage was used. Each canto in the figure is indicated by a letter (the poem code) and a number (the canto number within the poem). Cantos from the same poem are also displayed with the same color. The closer to the center two branches link together, the more different the cantos they contain are.

Figure 2: Dendrogram built from all cantos of all poems.

We can immediately see that, in general, cantos that belong to the same poem are located next to each other. All cantos of Camões' Os Lusíadas (L), in particular, are tightly grouped in their own branch. It is also interesting to note that, except for Famagusta (F), whenever a smaller group of cantos from the same poem was placed far from the larger group, there is a certain order: it was the first three cantos of Caramuru (C) that were separated; the last four of Anchieta (A); and the first two of Os Brasileidas (B). Two cantos from Famagusta (F1 and F16) are only linked with other nodes at a great distance; this stems from the fact that these two cantos are the shortest ones in the whole corpus: the first canto has only 24 verses, the sixteenth 112. Such small amounts of verses produce poor feature vectors. In order to further investigate how well the cantos reflect the poems, we employed a nearest centroid classifier. In this case, each of the 72 feature vectors (the rhythmic signatures of the cantos) was labeled with the poem it belongs to. We then used stratified k-fold cross-validation, with k = 4 and 100 repetitions, to assess the classifier's performance. The mean precision obtained was 96.5%, the mean recall 95.9% and the mean F1 score 95.5%; the mean accuracy was 95.6%. This means that, given a sample of 54 cantos (because k = 4), the classifier guesses the right poem for the other 18 cantos in about 96% of the cases.
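Both experiments map onto standard library calls; the paper's acknowledgments mention Scikit-learn for the classifier. A minimal sketch, using a random stand-in for the 72 x 42 matrix of per-canto pattern frequencies (the real labels have unequal counts per poem; equal counts are assumed here for brevity):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((72, 42))              # stand-in: per-canto pattern frequencies
y = np.repeat(list("LMCABF"), 12)     # stand-in: poem label of each canto

# Complete-linkage hierarchical clustering (basis of the Figure 2 dendrogram).
Z = linkage(X, method="complete")
print("linkage matrix shape:", Z.shape)

# Nearest centroid classification with stratified 4-fold cross-validation.
scores = []
for _ in range(100):                  # 100 repetitions, as in the paper
    cv = StratifiedKFold(n_splits=4, shuffle=True)
    scores.extend(cross_val_score(NearestCentroid(), X, y, cv=cv))
print(f"mean accuracy: {np.mean(scores):.3f}")
```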
4. Conclusion

The frequency with which poets employ certain patterns of stressed and unstressed syllables in their verses can be regarded as a rhythmic signature, at least in epic poems, the subject of this article. In this work, we have subjected 72 individual cantos to a hierarchical clustering technique (Figure 2), which shows that rhythmic patterns do reflect an author's preferences (unconscious as they might be). Furthermore, a nearest centroid classifier obtained a mean accuracy of 95.6%, which is also evidence for the existence of a rhythmic signature. This kind of analysis is made possible by automatic scansion systems such as Aoidos, which allow a large number of verses (more than 50,000 in this case) to be scanned and analyzed. Although Camões, whose poem Os Lusíadas is the oldest in our corpus, has influenced newer generations of poets, this article shows that, at least rhythmically, each poet in our corpus took their own path. In fact, Camões' verses are the ones most easily distinguished from the others (see Figure 2). Lesser-known poems, such as Malaca or Os Brasileidas, have not failed to produce rhythmic signatures that, in most cases, set them apart from other works. In addition to the rhythmic signature, we would like to investigate, in the future, additional features that could be extracted from verses and used in stylometric analyses. In particular, the decassílabo usually falls into one of two categories: either the 6th syllable has the dominant stress or, less commonly, the 4th; in the former case, the verse is heroic; in the latter, Sapphic. A verse whose rhythmic pattern includes the 6th syllable but not the 4th is heroic; but one that includes both the 6th and the 4th could be either heroic or Sapphic. It would be interesting to resolve this ambiguity and evaluate how well these categories characterize a poet's style. Although this article has only considered epic poems, there is no reason to believe that rhythmic signatures are limited to this genre. In the future, we would like to explore how well the approach shown here fares when applied to other verses and other genres.

Acknowledgments
For the nearest centroid classifier we employed Scikit-learn (Pedregosa et al., 2011). For the dendrogram, we used Dendextend (Galili, 2015) and Circlize (Gu et al., 2014).

References
Beaudouin, Valérie and Yvon, François (2004). "Contribution de la métrique à la stylométrie". 7èmes Journées internationales d'Analyse statistique des Données Textuelles (2004), pp. 107-118.
Chociay, Rogério (1994). A Identidade Formal do Decassílabo em "O Uraguai". Revista de Letras 34, 229-243.
Delente, Éliane and Renault, Richard (2015). Projet Anamètre : le calcul du mètre des vers complexes. Langages 3.199, 125-148.
Galili, Tal (2015). dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics 31 (22), 3718-3720.
Gu, Zuguang et al. (2014). circlize implements and enhances circular visualization in R. Bioinformatics 30 (19), 2811-2812.
Mittmann, Adiel (2016). "Escansão Automática de Versos em Português". PhD thesis. Universidade Federal de Santa Catarina.
Mittmann, Adiel, Wangenheim, Aldo von, and Luiz dos Santos, Alckmar (2016). "Aoidos: A System for the Automatic Scansion of Poetry Written in Portuguese". 17th International Conference on Intelligent Text Processing and Computational Linguistics (2016).
Moretti, Franco (2013). Distant Reading. London: Verso.
Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825-2830.
Ribeiro, Maria Aparecida (2003). Anchieta no Brasil: Que Memória? História Revista 8, 21-56.
Versace, Stefano (2014). A Bracketed Grid account of the Italian endecasillabo meter. Lingua 143, 1-19.

Le vocabulaire des campagnes électorales

Denis Monière (Université de Montréal, denis.moniere@umontreal.ca)
Dominique Labbé (PACTE CNRS - Université de Grenoble, dominique.labbe@umrpacte.fr)

Abstract
After serving a first presidential term, V. Giscard d'Estaing, F. Mitterrand,
J. Chirac and N. Sarkozy were candidates for a second term. In this study, their electoral speeches are compared with their presidential ones by means of vocabulary specificity scores. It appears that this calculation is mainly biased by grammatical categories and word frequency. We present modifications of the classical formulae which make it possible to neutralize the influence of grammatical categories and, at least partially, that of word frequency. Electoral discourse privileges the verb over the noun; such speech is more personalized than governmental discourse, and it focuses on the country and its inhabitants, the rest of the world being pushed into the background. Finally, in recent years, the polemical dimension has become predominant.

Résumé
Après un premier mandat présidentiel, V. Giscard d'Estaing, F. Mitterrand, J. Chirac et N. Sarkozy ont été candidats à un deuxième mandat. On compare leurs discours électoraux avec leurs discours présidentiels à l'aide des spécificités du vocabulaire. Il apparaît que ces spécificités dépendent surtout des catégories grammaticales et des effectifs des mots. On présente des modifications du calcul classique qui permettent de neutraliser l'influence des catégories grammaticales et, au moins partiellement, celle des fréquences. Le discours électoral privilégie le verbe au détriment du nom, il est plus personnalisé que le discours au pouvoir, il se centre sur le pays et ses habitants, le reste du monde passant au second plan. Enfin, ces dernières années, la dimension polémique devient prédominante.

Keywords: lexicometry; political discourse; French presidential campaigns; specific vocabulary; spécificités du vocabulaire.

1. Introduction

Le discours électoral diffère-t-il du discours de gouvernement et en quoi ? La réponse est difficile car il faut neutraliser l'effet des personnalités et des conjonctures pour isoler, dans le discours, l'effet des choix stratégiques du locuteur. L'idéal serait de pouvoir étudier les mêmes hommes à peu près simultanément dans les deux positions de gouvernant puis de candidat. Le corpus des discours des présidents français depuis 1958 remplit ces deux conditions (présentation du corpus dans Arnold et al., 2016). En effet, pour cinq présidents (C. de Gaulle, V. Giscard d'Estaing, F. Mitterrand, J. Chirac et N. Sarkozy), ce corpus contient leurs interventions lorsqu'ils étaient présidents et leurs discours de campagne pour leur réélection. Certes, en 1965, de Gaulle n'a pratiquement pas fait campagne (Labbé 2005), mais ses successeurs ne l'ont pas imité en 1981, 1988, 2002 et 2012 (corpus en annexe). Pour comparer ces corpus, le calcul des « spécificités » semble l'outil le plus adapté (Lafon 1980 et 1984). Il rapporte le vocabulaire d'un sous-ensemble de textes (sous-corpus) à un corpus de référence. Mais il se heurte à une double difficulté : la spécificité éventuelle d'un vocable est liée à sa catégorie grammaticale et à sa fréquence d'emploi (Labbé, Labbé 1994 ; Monière et al. 2005), comme nous allons le vérifier d'abord avec le cas de Sarkozy en 2012 (sur cette campagne : Labbé, Monière 2013). Dès lors, la mesure des spécificités doit neutraliser, autant que possible, ces deux inconvénients.

2. Les catégories grammaticales du discours électoral

Le discours présidentiel de Sarkozy s'étend de son investiture (16 mai 2007) au 12 février 2012 (annonce de sa candidature). La campagne s'étend jusqu'au soir du second tour (6 mai 2012).
Le corpus complet (P) compte 1074 interventions, soit au total 3 221 259 mots avec 21 602 vocables différents. À partir de sa déclaration de candidature, Sarkozy est intervenu 110 fois (sous-corpus E), soit 369 808 mots et un vocabulaire de 8 511 vocables différents. Ces interventions sont d'abord marquées par un net changement de style (tableau 1).

Tableau 1. Densités des catégories grammaticales dans les interventions de Sarkozy lors de la campagne de 2012 comparées à ses interventions comme président 2007-2012 (en ‰)

Catégories               P-E (corpus - sous-corpus)   E (sous-corpus)   Variation (%)
Verbes                   159.2    169.4    +6.4
Futurs                   7.0      7.2      +1.6
Conditionnels            3.2      2.8      -11.2
Présents                 82.9     89.3     +7.7
Imparfaits               6.4      6.4      -0.2
Passés simples           0.6      0.3      -55.2
Participes passés        20.8     23.8     +14.6
Participes présents      2.1      2.1      +2.9
Infinitifs               36.3     37.6     +3.6
Noms propres             27.9     23.0     -17.3
Substantifs              178.4    176.0    -1.3
Adjectifs                54.0     46.6     -13.7
Adj. participes passés   5.2      4.5      -13.1
Pronoms                  124.3    132.6    +6.7
Pronoms personnels       65.4     69.6     +6.5
Déterminants             181.6    182.5    +0.5
Articles                 131.9    128.1    -2.9
Nombres                  18.7     20.9     +11.9
Possessifs               14.5     17.0     +17.3
Démonstratifs            7.6      7.8      +2.7
Indéfinis                8.9      8.7      -2.4
Adverbes                 67.1     68.9     +2.7
Prépositions             150.1    145.6    -3.0
Coordinations            29.1     25.4     -12.7
Subordinations           25.9     27.9     +8.0

Dans le discours présidentiel, on rencontre 159 verbes en moyenne pour 1 000 mots ; dans les discours électoraux, cette proportion passe à 169 ‰, soit une augmentation de +6,4 %, écart significatif avec moins d'une chance sur 10 000 de se tromper. Les lignes suivantes donnent le détail des temps et des modes. Le recul le plus significatif concerne le conditionnel (le discours électoral ne doit pas connaître le doute). En revanche, le participe passé connaît l'augmentation la plus forte (le président sortant peut difficilement éviter de défendre sa gestion). Les pronoms, les adverbes et les conjonctions de subordination évoluent dans le même sens que les verbes : ils sont réunis dans le « groupe du verbe ». À l'inverse, les substantifs, adjectifs, articles et prépositions suivent la tendance opposée : « groupe du nom ». Le tableau 2 donne les densités des deux groupes chez les quatre présidents.

Tableau 2. Densités des groupes du verbe et du nom (en ‰) dans les discours électoraux (E) comparés aux discours présidentiels (P-E)

                                 P-E      E        Variation (%)   Indice
Sarkozy (2007-2012)
  Groupe du verbe                376.6    398.9    +5.9            +
  Groupe du nom                  621.1    599.2    -3.5            -
Giscard d'Estaing (1974-1981)
  Groupe du verbe                351.5    392.5    +11.7           +
  Groupe du nom                  646.1    604.5    -6.4            -
Mitterrand (1981-1988)
  Groupe du verbe                386.4    427.1    +10.5           +
  Groupe du nom                  611.0    569.8    -6.7            -
Chirac (1995-2002)
  Groupe du verbe                329.5    333.2    +1.1            +
  Groupe du nom                  668.8    665.1    -0.6            -

Chez tous les présidents en campagne, il se produit une augmentation du groupe du verbe et un recul de celui du nom. Statistiquement, ces mouvements sont significatifs (avec α = 1 %). L'écart le plus fort est observé chez Giscard d'Estaing, puis chez Mitterrand. Cependant, Chirac tranche sur les autres avec une densité du verbe beaucoup plus faible et une campagne présidentielle presque aussi distanciée que ses interventions lors de son premier mandat, marqué par une cohabitation de cinq ans (1997-2002) avec un Premier ministre socialiste (Jospin). Dans son discours électoral, la densité des verbes augmente nettement (+3,6 %) mais se trouve en partie compensée par un recul des pronoms, ce qui accentue le caractère dépersonnalisé des propos de Chirac, à l'opposé des trois autres. En conséquence, pour les quatre présidents, les principaux verbes apparaissent en spécificités positives du discours électoral et il ne s'en trouve que quelques-uns en spécificités négatives. Il en est de même pour les pronoms et les adverbes. La situation inverse se constate pour les adjectifs, les substantifs, etc. Autrement dit, si un mot appartient à une catégorie sous-employée dans le sous-corpus (par rapport à sa densité d'utilisation dans le corpus entier), ce vocable a toute chance d'apparaître dans les spécificités négatives (et positives dans le cas inverse). Il est possible de neutraliser ce biais.
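À titre d'illustration, le calcul de ces densités en ‰ et de leur variation relative peut s'écrire comme suit ; esquisse générique (et non le code des auteurs), avec des effectifs de catégories inventés, seuls les totaux étant ceux du texte :

```python
def densites_pour_mille(effectifs: dict, total: int) -> dict:
    """Densité (pour 1 000 mots) de chaque catégorie grammaticale."""
    return {cat: 1000 * n / total for cat, n in effectifs.items()}

# Effectifs fictifs, pour illustrer le calcul (valeurs inventées).
P_moins_E = {"Verbes": 454_000, "Substantifs": 509_000}   # corpus - sous-corpus
E = {"Verbes": 62_600, "Substantifs": 65_100}             # sous-corpus électoral

# Totaux tirés du texte : 3 221 259 - 369 808 = 2 851 451 mots pour P-E.
d_pe = densites_pour_mille(P_moins_E, 2_851_451)
d_e = densites_pour_mille(E, 369_808)
for cat in d_pe:
    variation = 100 * (d_e[cat] - d_pe[cat]) / d_pe[cat]
    print(f"{cat}: {d_pe[cat]:.1f} ‰ -> {d_e[cat]:.1f} ‰ ({variation:+.1f} %)")
```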
3. Neutralisation de la catégorie grammaticale

Le calcul standard est le suivant. Soit :
- le corpus de référence (P), long de Np mots ;
- le sous-corpus E, long de Ne mots, dont on recherche les spécificités par rapport à P ;
- un vocable i avec Fip occurrences dans P et Fie occurrences dans E.

Si sa répartition est uniforme, ce vocable apparaîtra Eie(u) fois dans le sous-corpus E :

$$E_{ie}(u) = F_{ip} \times U \quad \text{avec} \quad U = \frac{N_e}{N_p} = \frac{369\,808}{3\,223\,570} \approx 0{,}113 \qquad (1)$$

La probabilité pour que le vocable i soit observé Fie fois dans E suit une loi hypergéométrique de paramètres Fip, Fie, Ne, Np :

$$P(X = F_{ie}) = \frac{\dbinom{F_{ip}}{F_{ie}} \dbinom{N_p - F_{ip}}{N_e - F_{ie}}}{\dbinom{N_p}{N_e}} \qquad (2)$$

L'indice de spécificité (S) est la somme des probabilités – calculées avec (2) – de survenue des valeurs entières de X variant de 0 à Fie :

$$S = P(X \le F_{ie}) = \sum_{j=0}^{F_{ie}} P(X = j) \qquad (3)$$

Si, au seuil α, Fie excède Eie(u), le vocable est « spécifique plus » (S+) ; S- dans le cas contraire. Avec ce calcul, la plus grande partie des verbes usuels de Sarkozy apparaissent donc en S+ de sa campagne électorale et la majorité des substantifs en S-, parce que, dans ses discours électoraux, la première catégorie est privilégiée par rapport au discours de gouvernement où elle est moins utilisée (à l'inverse des substantifs). Pour corriger ce biais, le calcul prend en compte les catégories grammaticales (g). La modification est présentée dans : Monière, Labbé, Labbé 2005 ; Mayaffre 2006 et Monière, Labbé 2012. Soit Nge et Ngp le nombre de mots appartenant à la catégorie grammaticale G, respectivement dans le sous-corpus E et dans le corpus entier P. Les formules (1) et (2) deviennent :

$$E_{ie}(u) = F_{ip} \times U \quad \text{avec} \quad U = \frac{N_{ge}}{N_{gp}} \qquad (4)$$

$$P(X = F_{ie}) = \frac{\dbinom{F_{ip}}{F_{ie}} \dbinom{N_{gp} - F_{ip}}{N_{ge} - F_{ie}}}{\dbinom{N_{gp}}{N_{ge}}} \qquad (5)$$

Les formules (4) et (5), appliquées aux quatre corpus, aboutissent à un équilibre relatif, au sein de chaque catégorie, entre les S+ et les S-. Ces formules neutralisent donc la liaison entre spécificités et densité des catégories grammaticales. Comme indiqué dans Monière & Labbé 2012, cette modification change drastiquement la liste des « mots spécifiques », mais elle laisse subsister la liaison entre spécificité et fréquence.
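L'indice S de la formule (3), comme sa variante par catégorie (formules 4 et 5), correspond à la fonction de répartition d'une loi hypergéométrique. En voici une esquisse minimale (code générique, indépendant des outils des auteurs ; l'exemple chiffré est fictif) :

```python
from scipy.stats import hypergeom

def indice_specificite(F_ie: int, F_ip: int, N_e: int, N_p: int) -> float:
    """Indice S de la formule (3) : P(X <= F_ie) pour une loi
    hypergéométrique de population N_p, F_ip « succès », N_e tirages."""
    return hypergeom.cdf(F_ie, N_p, F_ip, N_e)

def indice_par_categorie(F_ie: int, F_ip: int, N_ge: int, N_gp: int) -> float:
    """Variante (4)-(5) : la population est restreinte à la catégorie g."""
    return hypergeom.cdf(F_ie, N_gp, F_ip, N_ge)

# Exemple fictif : un vocable vu 60 fois dans E pour 300 occurrences dans P.
S = indice_specificite(60, 300, 369_808, 3_223_570)
print(f"S = {S:.4f}")  # S proche de 1 : suremploi (S+) ; proche de 0 : sous-emploi (S-)
```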
Dans le corpus Sarkozy, cela représente JADT’ 18 527 plus de la moitié du vocabulaire (54 % des vocables). Autrement dit, seulement 46% du vocabulaire peut être S+ ; - le "seuil de spécificité négative" correspond à la situation suivante : un vocable i absent de E (Fie = 0) alors qu’on en attend au moins 5 (Eie(u) ≥ 5). En pratique, cela signifie que son effectif dans P est égal ou supérieur à 5*1/U, soit ici 40. Autrement dit, pour le discours électoral de Sarkozy, 83% du vocabulaire de P ne peut apparaître en S-. Dès lors, les vocables dont les effectifs dans P sont compris entre 5 et 39 peuvent être S+ mais pas S- dans E. On s’attend donc à ce qu’il y ait plus de vocables S+ que S-. 5. Liaison entre spécificité et fréquence 9 876 vocables apparaissent 5 fois ou plus dans P. Si ce corpus était homogène (hypothèse nulle H0), une distribution normale des vocables laisserait attendre - avec = 1% - environ 100 vocables spécifiques. Le tableau 3 compare les résultats observés et attendus (avec H0). Tableau 3. Effectifs des vocables classés par catégories grammaticales et par spécificités Verbes Mots à majuscule Substantifs Adjectifs Pronoms Adverbes Déterminants Prépositions conjonc. Total & Effectifs (Fip ≥ 5) 1 540 1 501 4 175 2 065 52 411 72 60 H0 15 15 42 21 1 4 1 1 S+ 176 112 455 140 18 20 21 21 S143 142 468 115 13 57 12 9 Total S 319 254 923 255 31 77 33 30 9 876 100 963 959 1 922 Il y a donc vingt fois plus de vocables spécifiques que n’en laisse attendre H0 (répartition homogène des mots entre corpus et sous-corpus). A priori, cela signifie simplement que discours électoral et discours de gouvernement sont fortement contrastés. En fait, ce décalage provient essentiellement des vocables les plus fréquents (tableau 4 et Figure 1). Tableau 4. Proportion des vocables spécifiques de E dans l’ensemble du vocabulaire (P) classé par fréquence absolues. Classe de fréquence (P) 5-9 10-14 15-19 20-29 Vocables spécifiques de E dans la classe 64 68 55 89 Total vocables de P dans la classe 2 759 1 237 757 987 Proportion des vocables de P spécifiques de E 2,3 5,5 7,3 9,0 528 JADT’ 18 30-49 50-99 100-199 200-499 500+ Total 143 317 332 398 473 1 939 997 1 054 799 686 640 9 916 14,3 30,1 41,6 58,0 73,9 19,6 Figure 1. Liaison entre la spécificité et la fréquence Au-dessus du seuil de spécificité positive (ici 40), la proportion de vocables spécifiques est directement corrélée avec la fréquence : la courbe suit la diagonale du tableau et le coefficient de détermination de Y par X est égal à 0,997, ce qui indique une liaison rigide et linéaire. Il en est toujours ainsi : plus un vocable donné est fréquent dans un corpus, plus il a de chances d’être "spécifique" à l’une quelconque des parties de ce corpus. Cette dépendance peut être interprétée de deux manières. D’une part, l’essentiel des choix thématiques seraient véhiculés par les vocables les plus fréquents et la variation dans leurs fréquences d’emploi seraient la principale manifestation de ces choix. Cependant, dès que le corpus atteint une certaine longueur, l’observateur se trouve noyé dans des listes qui contiennent la plus grande part du vocabulaire usuel, ce qui en rend l’interprétation difficile. D’autre part et à l’inverse, on peut penser que le raisonnement probabiliste qui sous-tend ce calcul - doit être adapté à cette liaison manifeste entre spécificité et fréquence. 6. 
Neutralisation de la liaison entre fréquence et spécificité Les limites des classes de fréquence du tableau 5 et de la figure 1 ont été fixées selon une échelle proche d’une progression géométrique, ce qui assure JADT’ 18 529 aux classes des effectifs sinon égaux du moins suffisamment proches et importants. Ceci correspond à une particularité dite "loi de Zipf" - ou "ZipfMandelbrot" - selon laquelle le nombre d’occurrences d’un mot dans un texte est lié à son rang dans la distribution des fréquences (Zipf 1935 ; Mandelbrot 1957). Dès que le corpus atteint une longueur suffisante (au moins un demi-million de mots) et que le sous-corpus est égal à au moins d’un dixième du corpus, on peut découper le vocabulaire en quelques classes de fréquence. Pour un corpus de la dimension de celui de Sarkozy (et des trois autres présidents), trois classes suffisent : vocables "rares" (inférieurs à 100 occurrences) ; "fréquents" (de 100 à moins de 500) ; "très fréquents" (500 et plus). Dans ces trois classes, les vocables sont classés par catégorie grammaticale puis en fonction de leur indice de spécificité et, dans chacune des classes, seuls les plus caractéristiques sont retenus. Le tableau 5 donne les 5% les plus caractéristiques du discours électoral de Sarkozy comparé à son discours présidentiel, pour trois catégories grammaticales. Tableau 5. Spécificités les plus remarquables du discours électoral de Sarkozy par rapport à son discours présidentiel (par catégories grammaticales en trois classes de fréquence) <100 Vocables significativement sur-employés : Verbes : voler, cotiser, détester, casser, éduquer, suspendre, démolir Mots à majuscule Mélenchon, Le Pen, Substantifs honte, rassemblement, héritier, socialiste, colère, délit, amalgame 100 – 499 adresser, bénéficier, apprendre, souffrir, supprimer, régulariser François, Polynésie, Hollande, Schengen, TVA jeunesse, souffrance, gauche, destin, erreur, étranger, salaire, outremer Vocables significativement sous-employés : Verbes admirer, illustrer, progresser, témoigner, expérimenter, inaugurer, évoquer, marquer, associer Mots à majuscule Bush, Poutine, Roumanie, Quatar Russie, Inde, Iran, Barroso Substantifs refondation, coalition, scientifique, lycéen processus, visite, équipe, conférence, planète, gouvernance, alliance, 500+ dire, vouloir, parler, vivre, proposer, changer, respecter, défendre France, Français, Corse travail, entreprise, droit, république, vie, emploi, ami, enfant, territoire, peuple, être, devoir, savoir, comprendre, trouver, attendre, remercier, essayer Afrique, G20, Méditerranée, Merkel, Paris, Chine pays, monsieur, président, état, ministre, politique, gouvernement, question 530 JADT’ 18 Chez Sarkozy, le discours électoral est affaire de volonté, il se centre sur le pays, ses habitants mais aussi l’adversaire – la gauche, Hollande - dont il dénonce les amalgames et les erreurs. Les spécificités négatives indiquent que le discours électoral n’est pas affaire de devoir ou de connaissance ; il "oublie" le reste du monde et ses dirigeants, les institutions du pays comme le gouvernement et les ministres, etc. 7. Conclusions Lorsqu’un président entre en campagne, il doit descendre dans l’arène et adopter un discours de combat qui se caractérise avant tout par une augmentation de la densité des verbes, une forte personnalisation et un recul de la place accordée aux substantifs et aux adjectifs. Ces caractéristiques se retrouvent dans les discours électoraux des Premiers ministres canadiens (Monière, Labbé 2010). 
Cependant, en campagne ces derniers insistent sur le "nous" car, dans un système parlementaire, il s’agit de faire élire une majorité de députés, alors que les présidents français privilégient le "je"… Enfin, ces dernières années en Amérique du nord comme en France, la forte présence de la construction négative et la désignation des adversaires (noms propres) soulignent le caractère polémique du discours électoral. Le calcul des spécificités – tel qu’il est utilisé en analyse des données textuelles – enregistre la catégorie grammaticale du vocable analysé et sa fréquence d’emploi et non pas les choix thématiques du locuteur. La neutralisation de la catégorie grammaticale est aisée si les mots ont été étiquetés. En revanche, l’effet de la fréquence est susceptible de plusieurs interprétations. Toutefois, si l’on souhaite ne pas être enseveli sous les listes produites par le calcul classique, la solution réside dans le classement des vocables en classes de fréquence –selon une échelle géométrique - et, au sein de chacune de ces classes, dans la sélection des vocables les plus singuliers. A ce prix, les singularités d’un sous-corpus peuvent être identifiées sans avoir à effectuer des tris discutables dans des listes trop longues. References Arnold E., Labbé C. & Monière D. (2016). Parler pour gouverner : Trois études sur le discours présidentiel français. Grenoble : Laboratoire d'Informatique de Grenoble, 2016. Labbé C., Labbé D. (1994). Que mesure la spécificité du vocabulaire ? Grenoble : CERAT, décembre 1994. Reproduit dans Lexicometrica, 3, 2001. Labbé D., Monière D. (2010). Quelle est la spécificité des discours électoraux? Le cas de Stephen Harper. Canadian Journal of Political Science, 43:1, p. 69– 86. Labbé D., Monière D. (2013). La campagne présidentielle de 2012. Votez pour JADT’ 18 531 moi ! Paris : l’Harmattan. Lafon P. (1980). Sur la variabilité de la fréquence des formes dans un corpus. Mots, 1, p. 127-165. Lafon P. (1984). Dépouillements et statistiques en lexicométrie. Genève-Paris : Slatkine-Champion. Mandelbrot B. (1957). Étude de la loi d'Estoup et de Zipf Fréquences des mots dans le discours. Apostel L et al. Logique, langage et théorie de l'information. Paris, PUF, p. 22-53. Mayaffre D. (2006). Faut-il pondérer les spécificités lexicales par la composition grammaticale des textes ? Tests logométriques appliqués au discours présidentiel sous la Vème République. Condé C., Viprey J.-M. Actes des 8e Journées internationales d'Analyse des données textuelles. Besançon : Presses universitaires de Franche Comté, II, p. 677-685. Monière D., Labbé C., Labbé D. (2005). Les particularités d'un discours politique : les gouvernements minoritaires de Pierre Trudeau et de Paul Martin au Canada. Corpus, 4, p.79-104. Monière D., Labbé D. (2012). Le vocabulaire caractéristique du Premier ministre du Québec J. Charest comparé à ses prédécesseurs. Dister A. et al. (éds). Proceedings of the 11th International Conference on Textual Data Statistical Analysis. Liège : LASLA - SESLA, p.737-751. Zipf G. K. (1935). La psychobiologie du langage. Paris : CEPL, 1974. 532 JADT’ 18 Faire émerger les traces d’une pratique imitative dans la presse de tranchées à l’aide des outils textométriques Cyrielle Montrichard ELLIADD, UBFC – cyrielle.montrichard@edu.univ-fcomte.fr Abstract The main goal of this paper is to show how textometric tools can help to reveal the imitative usage of genres. During the Great War, soldiers must not criticize the hierarchy or the governement. 
The trench press was written by and for French soldiers and features a great number of media and literary genres. Moreover, we assume that writers used a number of discursive schemes to convey implicitly their point of view on the war, the government and the "sacred union" discourse that had become mainstream in the public space from the very beginning of the war. Therefore, a corpus of this press seems to be the perfect place to investigate the notion of imitative usage of genres. To put the results given by the textometric tools into perspective, we use a sample corpus from the French national press.

Résumé
L'objectif de cette contribution est d'interroger la pratique imitative des genres médiatiques et littéraires. Pour ce faire, nous mobilisons un corpus de presse de tranchées dans lequel se déploient de nombreux genres et sous-genres. Portant notre attention tout particulièrement sur les genres des dépêches et du roman-feuilleton, nous montrons, en comparant ce corpus à un corpus échantillon de textes parus dans la presse quotidienne nationale, en quoi la presse de tranchées copie les genres instaurés dans la presse civile. La seconde partie interroge le corpus au niveau syntagmatique pour tenter de faire émerger les registres ludiques et satiriques ayant cours dans cette presse.

Keywords : presse écrite, genre, pratique imitative, première guerre mondiale, presse de tranchées.

1. Introduction

La presse de tranchées est un type de document né pendant la première guerre mondiale. Cette presse a la particularité d'être écrite par et pour les combattants (Audoin-Rouzeau, 1986). La censure ainsi que le discours doxique d'union sacrée tenant place dans l'espace public durant la période du conflit ne permettent pas aux locuteurs d'exprimer ouvertement leur opinion (Forcade, 2016). L'objectif de cette communication est de montrer comment émergent les registres ludiques et satiriques dans la presse de tranchées à travers l'inscription de discours dans des genres faisant écho à la matrice générique médiatique et littéraire. Comment repérer, à l'aide des outils textométriques, les traces discursives d'une pratique imitative des genres médiatiques et littéraires dans la presse de tranchées ? Cette communication vise à interroger la « pratique imitative », c'est-à-dire les « différentes formes ou genres qui permettent à un auteur de produire un texte (T2) attribué, sérieusement ou non, et de manière plus ou moins explicite, au modèle dont il s'est inspiré (T1) » (Aron, 2013). Pour ce faire, nous avons réuni en corpus cinq titres de presse de tranchées au format XML-TEI, pour plus de 500 000 occurrences, permettant une analyse du discours outillée. À l'aide des outils textométriques et de la plateforme TXM (Heiden et al., 2010), nous proposons de montrer comment les textes s'inscrivent dans les codes établis des genres médiatiques et littéraires et les reprennent. Ensuite, nous proposons des pistes d'analyse visant à faire émerger le registre ludique ou satirique utilisé par les rédacteurs pour détourner le genre.

2. Contexte de la recherche et présentation du corpus

Notre étude propose d'investir la notion de pratique imitative. Cette dernière est proche de l'hypertextualité et de l'imitation (Genette, 1982), c'est-à-dire la reproduction d'un style, d'une manière. En analyse du discours, D. Maingueneau (1984) a investi la notion de pastiche, confirmant que celui-ci peut s'opérer sur un genre.
For G. Genette (1982), however, pastiche is associated primarily with a playful function; since, in our study, the question between satirical and playful registers remains open, we confine ourselves to the notion of "imitative practice". To our knowledge, no work has yet examined imitative practice through tool-assisted discourse analysis. Xavier Garnerin (2009), himself a pasticheur, tries to pin down the methods of pasticheurs, which in his view lie "between analysis and intuition", a remark that captures how difficult it is for the researcher to expose systematically the links between an imitating text T2 and its model T1. We propose to put textometric tools to the test in an attempt to perceive the imitative practice of genres.

Our corpus comprises five trench-press titles published between 1915 and 1918. We set up variables to investigate genres and sub-genres (Malrieu and Rastier, 2002). The genre variable splits the corpus into two parts: the literary genre (287,184 tokens, 747 articles) and the media genre (216,534 tokens, 1,005 articles). For a fine-grained study, we also categorised the texts by sub-genre, distinguishing serialised novels, short stories, poems, etc. within the literary genre, and news briefs, filets, dispatches, échos, faits divers, etc. within the media genre. Space does not allow us to analyse each of these sub-genres separately, so we concentrate on one literary sub-genre, the roman-feuilleton, and one media sub-genre, the dépêche. To put the results into perspective, we built a sample corpus of 38 dispatches published between 1915 and 1918 in two national dailies (Le Petit Journal and Le Matin) and three serialised novels: Delly's Entre deux âmes (1912), published in L'Écho de Paris, and Gaston Leroux's Le Château noir (1914) and Confitou (1916), published in Le Matin. This sample corpus mainly serves to observe syntactic constructions and the place of morphosyntactic categories in the two sub-genres, so sample sizes are not decisive here.

3. Anchoring in the media and literary discursive moulds
In this part we show how the texts take up the codes established in the press and in literature, through the study of morphosyntactic categories and of the lexicon.

3.1. Morphosyntactic categories
The correspondence-analysis (AFC) plot below shows the distribution of morphosyntactic categories (row points, in blue) in the literary sub-corpus partitioned into sub-genres (column points, in red). Axis 1 accounts for 60.63% of the structure of the plot and appears to be organised by verb tense: on the left one finds verbs in the present and future, while on the right one finds the past tenses (simple past, imperfect). The roman-feuilleton lies on the side of the past tenses, consistent with the characteristics of a genre that uses narrative tenses. Moreover, comparing the percentage distribution of verb tenses in the trench press and in the national daily press (PQN), the tenses employed prove very close.
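The AFC just described can be reproduced in outline from a contingency table crossing morphosyntactic categories and sub-genres. The NumPy sketch below performs a basic correspondence analysis via the SVD of standardized residuals; the input table and any labels are placeholders, not the paper's data.

```python
import numpy as np

def correspondence_analysis(counts):
    """Minimal CA: rows = morphosyntactic categories, columns = sub-genres.
    Returns row/column coordinates on the first two axes plus the share of
    inertia per axis (the paper reports 60.63% for axis 1)."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sv ** 2 / (sv ** 2).sum()
    row_coords = (U * sv) / np.sqrt(r)[:, None]          # POS categories (blue points)
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]       # sub-genres (red points)
    return row_coords[:, :2], col_coords[:, :2], inertia[:2]
```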
Figure 1. AFC of the morphosyntactic categories of the literary sub-corpus partitioned into sub-genres, in the trench-press corpus.
Figure 2. Tense distribution per hundred verbs in the serialised novels published in the trench press (left) and in the PQN (right).

On the media-genre side, the specificity calculation over morphosyntactic categories shows that the dispatches have a positive score for common nouns (+2), while adverbs and personal pronouns are under-used (scores of -5.4 and -8.7 respectively). These results connect directly with the characteristics of the dispatch: "[...] the author of the dispatch submits to a model of representation that must dispense with the stylistic resources proper to literature: no dialogue, no internal focalisation, no comment on the reported event." (Kalifa et al., 2011: 738). This explains the under-use of adverbs and personal pronouns, often used to introduce comment, while the objectivation of information and enunciative effacement favour nominal over verbal categories (Rabatel, 2004). Moreover, the figure below shows how close the use of morphosyntactic categories is between the dispatches of the trench press and those of the PQN.

Figure 3. Proportion of the major morphosyntactic categories used in dispatches published in the trench press (bottom) and in the PQN (top).

The distribution of morphosyntactic categories thus suggests that the civilian press and the trench press use the same categories, genre by genre.

3.2. Lexicon and repeated segments
In the press of the early twentieth century, a dispatch usually opens with a line giving the place and day of the event. The dispatches in our trench-press corpus follow this rule and take over this staging of information, as numerous place names with positive specificity show: "Londres" (+4.9), "Paris" (+4.2), "Berlin" (+2.3), etc. The PQN dispatches confirm this tendency, with an average of four place names per article. "The effacement of the author relies first on a two-stage enunciative system: either the source of the event is indicated, always referring to a neutral point of view, or the event is reported directly, without explicit mention of the source." (Kalifa et al., 2011: 738). The improvised soldier-journalists often mention a source, visible in the over-use of the graphical forms "communiqué" (score +16.5) and "dépêche" (+2). Turning to repeated segments, seven dispatches of L'Argonnaute open with "Communiqué officiel de l'intérieur téléphoné par [...]". On the civilian-press side, we find the forms "dépêche" and "annonce" (9 and 6 occurrences respectively), as well as "Havas" (17 occurrences). For the trench-press roman-feuilleton, terms again indicate respect for the staging of the novel in "chapitre" (score +49) and for the serial format, with "suite" (+37.4) and "suivre" (+22.4). Inventories of this kind can be sketched as below.
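A repeated-segments inventory can be built naively as follows; the frequency threshold is a free parameter (set to 5 here by analogy with common lexicometric practice), and real tools such as TXM or Lexico3 implement this far more efficiently.

```python
from collections import Counter

def repeated_segments(tokens, max_len=8, min_freq=5):
    """Naive inventory of repeated segments: every word n-gram of length
    2..max_len occurring at least min_freq times."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(2, max_len + 1)
        for i in range(len(tokens) - n + 1)
    )
    return {seg: f for seg, f in counts.items() if f >= min_freq}
```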
4. Locating the imitative practice
At this stage of our study we have shown the proximity between the trench press and the PQN, but neither the study of grammatical categories nor the lexical study has brought out the playful and/or satirical registers that signal an imitation rather than a mere inscription in the genre. It therefore appears necessary to run queries broader than the lemma yet more precise than those conducted so far on morphosyntactic categories: a search at the syntagmatic level imposes itself.

4.1. Over-used syntactic constructions in the dispatches
We queried the whole corpus, partitioned by sub-genre, for syntactic constructions chaining two morphosyntactic categories. The first syntagms with positive specificity confirm what we had already observed: the category preposition followed by a proper noun scores +10.3, and a return to the text confirms that this corresponds to the presentation of the place of the event ("à Londres", "de Paris", etc.). We also find a construction that induces a passive (a present-tense verb followed by a past participle), once again indicating enunciative effacement (Rabatel, 2004). Among the positive specificities we find the combination noun followed by adjective (score +2.3); the resulting list contains 74 syntagms. Fourteen of them (19%) drew our attention, either because of the implausibility of what is said ("homme volant", "provision inépuisable") or because their presence makes no sense in the genre in which they occur ("bicyclette usagée", "cellules nerveuses", "chauffage central", "crayon ennemi"). Note also the syntagm "agence Ivile", playing on homonymy with "agence civile". Returning to the text clarifies how the writers use these syntagms, often playing on double meanings:

"Plusieurs saucisses boches (de Francfort) ont été capturées à la devanture d'un charcutier par un audacieux homme volant." (L'Argonnaute, 15 March 1916)

In 1916 the syntagm "saucisses boches" could refer to two signifieds: the pork product or the enemy projectile. The utterance is built on this ambiguity, reinforced by the presence of the noun "charcutier" and of the past participle "capturées", each pointing to a different interpretation. Likewise, the "homme volant" can be read either as a brigand who has stolen charcuterie or as a man able to fly who has captured the enemy projectiles before impact (the French "voler" meaning both "to steal" and "to fly"). This example shows how the writers, through a playful register, create complicity with a readership that shares the same references.
Another example introduces the idea of a satirical register, with criticism of the dominant discourse of the public space:

"[...] Paris, 31 avril [...] Rue du Paon-Blanc (14 h.) Paris gronde. Le régime a vécu. Vive la révolution ! Les bains de la Samaritaine sont en état de siège. Le syndicat de la Grande Presse n'autorise plus que la parution d'un bulletin relatant le Communiqué. La censure s'est tranchée la gorge avec ses ciseaux. L'héroïsme sacré fait battre les cœurs. [...] C'est l'union sacrée. Concierges, locataires et propriétaires s'embrassent aux portes des immeubles." (Le Rigolboche, 10 May 1917)

The article challenges censorship and the Parisian festivities, and also echoes the disagreements between landlords and mobilised tenants, thereby contesting the "sacred union" discourse while re-investing its very words (Authier-Revuz, 1984). Searching for syntagms thus lets us enter the corpus at text level and perceive what, in the articles, seems to subvert the genre for playful and satirical ends.

4.2. Over-used syntactic constructions in the roman-feuilleton
The serialised novel occupies an important place in the press of the nineteenth and twentieth centuries (Kalifa et al., 2011), and the conflict did not change the place of this fiction. "The war very quickly penetrated the 'rez-de-chaussée' [the serial section at the foot of the page], and the roman-feuilleton, in the form of patriotic narratives, turned into an instrument designed to sustain and intensify the population's mobilisation in favour of the war effort." (Erbs, 2016: 740). This is what the combatants, who received and read the civilian press, were given to read (Gilles, 2013). As for the dispatches, we first searched for two-token syntagms through specificities over grammatical categories; for the roman-feuilleton these searches were not fruitful, so we extended the search to three tokens. The syntagmatic construction "verb in the simple past + determiner + noun", with a score of +52, drew our attention. Of the 130 syntagms, 24 (14%) stood out. First, we spotted syntagms apparently built on fixed expressions in which one term has been altered, such as "fouilla l'horizon" or "coupa la pipe"; we also spotted syntagms that seem to make no sense, such as "revêtit l'ampleur" or "trancha les jours".

"Alors une colère terrible parut animer l'Armada toute entière. Proue baissée, les navires foncèrent sur le pirate boche... Cependant une première torpille alla frôler par bâbord le vaisseau amiral ; une deuxième, lancée trop haut, coupa la pipe du commandant qui, flegmatiquement, sortit d'un étui une cigarette qu'il ajusta au tuyau mutilé de sa pipe. [...]" ("Krotufex", Le Rigolboche, 10 December 1917)

The torpedo literally cuts the commander's pipe, whereas in such a context one would have expected him to "casser sa pipe" (to kick the bucket). This points to the playful register, with its play on the fixed expression, but certainly also to a satirical one, offering a critique of the patriotic serialised novels that describe bloody battles in which the hero never succumbs. Studying the same syntagms in the PQN roman-feuilleton sub-corpus, we find an abundance of nouns referring to body parts ("leva les yeux", "prit la main", "secoua la tête", "tendit la main"): six of the first ten syntagms have this characteristic. The body is also present in these syntagms in the trench press, but there they once again seem surrealistic and used for playful ends, copying the genre while subverting it: "cala les joues", "déchaussa son pied", "frotta la mandibule", "tomba le torse", etc.
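The syntagmatic queries of sections 4.1 and 4.2 amount to matching short sequences of morphosyntactic categories over a tagged corpus. A minimal sketch follows, with purely illustrative tag names that would have to be mapped onto the actual tagset of the tagger used.

```python
from collections import Counter

def pos_pattern_matches(tagged, pattern):
    """Collect the word sequences whose tag sequence equals `pattern`,
    e.g. ("NOUN", "ADJ") for section 4.1 or ("VERB_PS", "DET", "NOUN") for
    the simple-past pattern of section 4.2. `tagged` is a list of
    (word, tag) pairs; the tag names are illustrative only."""
    k = len(pattern)
    hits = Counter()
    for i in range(len(tagged) - k + 1):
        words, tags = zip(*tagged[i:i + k])
        if tags == tuple(pattern):
            hits[" ".join(words)] += 1
    return hits

# The per-sub-genre frequency of a pattern can then be fed to a specificity
# index (as sketched earlier) to reproduce scores such as +2.3 or +52.
```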
5. Conclusion
Our contribution aimed to investigate imitative practice with textometric tools, on a singular trench-press corpus set against a sample corpus drawn from the PQN. We first showed how the genres are imitated through the reuse of codes established in the civilian press. To bring out the traces of an imitative practice, it proved necessary to interrogate the corpus at the syntagmatic level with the textometric software TXM. In our study, this search brought out the playful and satirical registers current in the trench press, an enunciative space where the implicit and complicity matter greatly, given censorship but also the particular ties between readers and writers. It would be interesting to see whether the same methodology, applied to other texts and genres, yields similar results.

References
Aron, P. (2013). Le pastiche et la parodie, instruments de mesure des échanges littéraires internationaux. In Gauvin, L. (ed.), Littératures francophones : Parodies, pastiches, réécritures. ENS Éditions.
Audoin-Rouzeau, S. (1986). 14-18, les combattants des tranchées : à travers leurs journaux. A. Colin.
Authier-Revuz, J. (1984). Hétérogénéité(s) énonciatives. Langages, 73: 98-111.
Erbs, D. (2016). Le roman-feuilleton français et le serial britannique pendant le premier conflit mondial, 1912-1920. Doctoral thesis.
Forcade, O. (2016). La censure en France pendant la Grande guerre. Fayard.
Garnerin, X. (2009). Le pastiche, entre intuition et analyse. Modèles linguistiques, 60: 77-91.
Genette, G. (1982). Palimpsestes. Seuil.
Gilles, B. (2013). Lectures de poilus : livres et journaux dans les tranchées, 1914-1918. Éd. Autrement.
Heiden, S., Magué, J.-P. and Pincemin, B. (2010). TXM : une plateforme logicielle open-source pour la textométrie – conception et développement. In Bolasco, S. et al. (eds), Proc. of JADT 2010 (10th International Conference on the Statistical Analysis of Textual Data), pp. 1021-1032.
Kalifa, D., Régnier, P., Thérenty, M.-É. et al. (2011). La civilisation du journal : histoire culturelle et littéraire de la presse française au XIXème siècle. Nouveau Monde éditions.
Maingueneau, D. (1984). Genèses du discours. Mardaga.
Malrieu, D. and Rastier, F. (2002). Genres et variations morphosyntaxiques. Traitement automatique des langues, 42: 548-577.
Rabatel, A. (2004). Effacement énonciatif et effets argumentatifs indirects dans l'incipit du Mort qu'il faut de Semprun. Semen, 17: 111-148.

Diachronic evolution of legal-administrative terminology and phraseology in the Statutes of Autonomy of Catalonia of 1932, 1979 and 2006
Albert Morales Moreno
Università Ca' Foscari Venezia / Université de Genève – albert.morales@unige.ch

Abstract
During the first half of 2017, research was carried out at the Institut de Lingüística Aplicada of the Universitat Pompeu Fabra, thanks to a grant from the Generalitat de Catalunya's Institut d'Estudis de l'Autogovern, in order to study diachronically the Statutes of Autonomy of Catalonia (EAC) approved in 1932, 1979 and 2006. As in other countries and traditions, the negotiation of such an important law is a challenge in the historical moment in which it occurs, in both legal and political terms (see Abelló (2007) for the 1932 EAC, Sobrequés (2010) for the 1979 EAC and Serrano (2008) for the 2006 EAC).
We take lexicometrics as our analytical methodology and the communicative theory of terminology (Cabré, 1999) as the grounds of our research, in order to study the use of legal and administrative terminology with respect to the assignment of competences from a diachronic approach. Specifically, we are interested in combining the study of repeated segments with the study of specificities to identify the key terms, offices and institutions of each EAC, as well as the use of certain locutions between 1932 and 2006 in Catalan statutory discourse.

Keywords: discourse analysis, legal discourse, Catalan statute of autonomy, repeated segments, terminology, diachronic analysis

1. Introduction
This article presents a study framed within a broader project of diachronic analysis of normative drafting in Catalan. In that research, carried out thanks to postdoctoral funding from the Institut d'Estudis de l'Autogovern of the Generalitat de Catalunya, the Statutes of Autonomy of Catalonia (EAC) of 1932, 1979 and 2006 were studied through lexicological, statistical, terminological, translation-oriented and pragmatic analyses. Here we focus, from a terminological standpoint, on repeated segments, to assess whether they are a valid strategy for tracing the evolution of specialised phraseology in a specialised field such as law, through the repeated segments specific to each EAC. We also compare these units from a diachronic point of view. After an exhaustive lexicometric study of the corpus, we selected specialised terminological units (UTE) from the legal-administrative domain that help establish the competences of Catalonia in the various EAC, with terms such as competència/es, correspon or atribució/ons. The analysis starts from the statistical indices produced by the lexicometric exploration carried out with Lexico3.6, and its theoretical framework is the Communicative Theory of Terminology (Cabré et al., 1999).
2. The EAC of 1932, 1979 and 2006
A statute of autonomy should first be defined as a unit of constitutional law: the "basic institutional norm of the autonomous communities" (Diccionario del español jurídico (DEJ), Real Academia Española). Many jurists regard a community's statute as functionally "equivalent to the constitution of a member state of a federation, because it regulates the autonomous institutions, establishes the competences they must have and cannot be modified by any other law, whether autonomous or state-level: it can only be reformed by the procedure the Statute itself provides, a characteristic proper to constitutions and not to laws" (Albertí et al. 2002:111). The Statute, then, "has the rank of a state organic law, forms part of the block of constitutionality and is subject to aggravated procedures of approval and reform, and its provisions enjoy reinforced guarantees that ordinary legislation does not provide" (Pons and Pla 2007:187).

In Catalonia, the recovery of self-government began in the early twentieth century with the Mancomunitat. Within that institution, a first draft Statute of autonomy was written, although it was never debated, "because on 27 February 1919 parliamentary sessions were suspended as a consequence of the Canadenca strike" (Fontana 2014:327). Owing to the convulsive historical development of the following years and the dictatorship of Miguel Primo de Rivera, the autonomist projects stalled. Only in 1931, with the Republic, was the first EAC drafted; that text was debated in the Cortes in May 1932. Abelló states that it envisaged "the insertion of Catalonia in a federal republic" (2007:35) and describes it as "moderate" (2007:44). Despite the cuts it suffered, "it became a useful tool which, with the recovery of the Catalan institutions of self-government, would enable a legislation of Catalonia's own, even if a limited one" (Abelló 2007:187). The Generalitat de Catalunya exercised these competences only briefly: after the events of 6 October 1934 the 1932 EAC was partially suspended, and with the arrival of Franco's troops in Catalonia, Franco approved the law repealing the EAC on 5 April 1938. Under Franco's dictatorship the State was conceived from a recentralising standpoint and, as already noted, the autonomy of the communities was abolished. Only after the dictator's death on 20 November 1975 did Spain and Catalonia begin, according to Sobrequés (2010:11), the process that was to change their history: the Transition. During it, the constitutional pact of 1978 was sealed (the Constitution entered into force on 29 December of that year) and the legal foundations of the State of autonomies were laid, with an order that, through the statutes of autonomy (at least in theory), grants the autonomous governments considerable self-government. Drafting began on 8 September 1978, and the final text was approved by referendum on 25 October 1979.
At the beginning of the twenty-first century, however, a considerable sector of the Catalan social and political spectrum perceived the 1979 EAC as an exhausted model (the so-called Argullol doctrine, which proposed a less centralist rereading of the Spanish Constitution), and it was quickly established that "there is a significant number of competences which, despite being included in the Statute of autonomy, have not been the object of legislative development" (BOPC 2002:89). For that reason, after the 2003 regional elections, the three-party coalition of PSC, ERC and ICV-EUiA began, in 2004, the parliamentary procedure for statutory reform. This involved a first negotiation leading to approval in the Parlament de Catalunya on 30 September 2005, and a second negotiation for approval in the Cortes Generales (in that second phase, as shown in Morales (2015), the most significant changes were made). The final text was approved in parliament on 10 May 2006, the day on which the plenary of the Senate approved the new Statute with 128 votes in favour, 125 against and 6 abstentions. On 31 July 2006, Federico Trillo-Figueroa y Martínez-Conde, together with 98 other PP deputies, filed an appeal of unconstitutionality against most articles of the new Statute (Bosch 2013:44) because, among other reasons, "it applied the term nation to Catalonia, imposed Catalan, established a series of rights and duties that restricted the liberties of the citizens of Catalonia [...] and questioned the unity of Spain" (Segura 2013:217-218). On 28 June 2010 the Constitutional Court made public part of ruling 31/2010 on the constitutionality of the Statute, declaring some parts of the 2006 EAC unconstitutional. According to many political scientists and historians, that date is key to contemporary political history, because "it was the day of the sentimental rupture with Spain, the day on which [many Catalans] became convinced that Catalonia and the citizens of Catalonia had no place in Spain" (Segura 2013:32); for many citizens it meant the leap from autonomism to independentism, without passing through nationalism (Segura 2013:241). The corpus assembled is therefore representative for studying diachronically the evolution of statutory discourse in Catalan across the Statutes approved over the course of history. Finally, following André Salem (1991:149), this corpus constitutes a "chronological textual series", since the texts are linguistically and pragmatically comparable across a time span that supports conclusions about the evolution of Catalan statutory discourse over the last eighty years.

3. Theoretical and methodological framework
Since the restoration of the institutions of self-government, there have been numerous initiatives, both public and private, to modernise Catalan normative discourse; noteworthy are the work of the Grupo de Estudios de Técnica Legislativa (GRETEL), the Direcció General de Política Lingüística, TERMCAT, the Escola d'Administració Pública de Catalunya and the Parlament de Catalunya. The model followed is that of Québec, adopting, and adapting, the guidelines of Sparer and Schwab's Rédaction des lois : rendez-vous du droit et de la culture. According to Montolío, the opportunity was taken to renew this tradition: "A special case is that of the other official languages of the Spanish State (Galician, Basque and Catalan).
For these three languages, the renewal of legal language has been driven by an additional motivation: the will to recreate a legal tradition truncated by forty years of prohibition. Among them, the renewal of Catalan legal language stands out." (Montolío and Albertí 2012:99). For that reason, the criteria and principles on which the normalisation of Catalan legal language rests are economy, clarity and precision of expression (DGPL 1999:7).

The lack of exhaustive linguistic studies of a component of Catalan normative discourse as essential as its Statute of autonomy motivated this work. It arises from the need to analyse the successive EAC in force to date by combining textual statistics and discourse analysis in a diachronic perspective, on the basis of a consolidated discipline: corpus linguistics. As reviewed in Morales (2015:101-175), such methodologies have been used to study similar texts. To guarantee an objective, pertinent and representative selection of the units of analysis based on statistical criteria, our work starts from lexicometry, a school that has served to characterise, among others, the vocabulary of socio-political figures and of social and historical movements. Within lexicometry, our approach is a formalist one, since our basic unit of analysis is the graphical form. We then normalised the text (following methodologies such as those of Arnold (2008:110) and Menuet (2006:157)) to correct forms with grammatical or typing errors and to avoid duplicate counts due to minimal orthotypographic differences. Finally, we inserted into our corpus the structural marks required by Lexico3.6 to identify the different EAC. Of the program's many functionalities, the study of concordances, of repeated segments and of specificities yielded especially interesting results. After the first lexicometric exploration, we analysed some key terms identified through the repeated-segments analysis, to see whether this allows us to characterise the phraseology and terminology proper to the field.
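As a rough illustration of this preparation step, the sketch below normalises the text and prefixes each statute with a structural mark. The <eac=...> key syntax is only a stand-in for Lexico3-style markers and should be checked against the tool's manual (Lamalle et al., 2003); the normalisation choices are ours, in the spirit of Arnold (2008) and Menuet (2006).

```python
import re
import unicodedata

def normalize(text):
    """Light normalisation before counting: unify Unicode composition,
    apostrophes and whitespace so that trivially variant spellings are
    not counted as distinct forms."""
    text = unicodedata.normalize("NFC", text).replace("\u00ad", "")  # soft hyphens
    text = text.replace("\u2019", "'")                               # curly apostrophe
    return re.sub(r"\s+", " ", text).strip()

def tag_partitions(statutes):
    """Prefix each statute with a structural mark so the tool can build
    the 1932/1979/2006 partition; `statutes` is a list of (year, text)."""
    return "\n".join(f"<eac={year}>\n{normalize(text)}" for year, text in statutes)
```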
4. Analysis
The analysed corpus has the following main lexicometric characteristics (identifiers 01_1932, 02_1979, 03_2006):

            EAC 1932        EAC 1979         EAC 2006         Total
Tokens      4,242 (7.7%)    10,580 (19.3%)   40,011 (73.0%)   54,833 (100%)
Forms       1,009           1,766            3,457            4,226
Hapax       606             935              1,546            1,804

Given the obvious differences in size, we applied, thanks to Professor Arjuna Tuzzi, statistical techniques that take size into account in the expected representativeness and selection computations, following, among others, Tuzzi (2003:128-129) and van Gijsel, Speelman and Geeraerts (2005:1).

This part of the analysis concentrates on the already mentioned repeated segments (SR), i.e. sequences of forms repeated with a frequency above 5. The lexicometric exploration yielded 2,398 repeated segments; we focus on some of the most significant. Their distribution by length is as follows:

Length 2: 1,282 sequences (e.g. de Barcelona; les llibertats; la coordinació)
Length 3: 660 (de la Constitució; de seguretat pública; en aquest Estatut)
Length 4: 281 (les lleis de Catalunya; a les Corts Generals; el president o presidenta)
Length 5: 98 (de conformitat amb les lleis; els poders públics han de)
Length 6: 31 (correspon a la generalitat la competència; d'acord amb allò que)
Length 7: 23 (sens perjudici d'allò que disposa; el president o presidenta de la generalitat)
Length 8: 10 (els poders públics han de vetllar per la; impost sobre la renda de les persones físiques)
Length 9: 7 (en una votació final sobre el conjunt del text; en el diari oficial de la generalitat de Catalunya)
Length 10: 11 (correspon a la generalitat la competència exclusiva en matèria de)
Length 11: 4 (de l'apartat 1 de l'article 149 de; la carta dels drets i els deures dels ciutadans de Catalunya)

Of the 20 most frequent, for instance, only five were of interest for our linguistic study as units with full semantics, such as la Generalitat, de Catalunya or la competència. Besides isolating segments such as de les quals (10), els altres (23), la resta (17), les quals (18), en el termini (25) or la seva (57), which could interest stylometric or authorship-attribution research, we analyse below some of the units with higher frequency. The system identified, for example, syntagms corresponding to offices and institutions provided for in the Statutes, such as les Corts (46) (and les Corts Generals (33)), Poder Judicial (46), la Comissió Mixta d'Afers Econòmics i Fiscals Estat-Generalitat (14), l'Agència Tributària de Catalunya (10), el Consell de Justícia de Catalunya (19), el Govern (50), el President (38), el President o Presidenta de la Generalitat (26), la Unió Europea (31) and el Parlament de Catalunya (24). It thus gave good results for identifying syntagms corresponding to highly lexicalised units such as offices and institutions.

One of the most frequent SR is correspon a la Generalitat, distributed as follows in the corpus:

SR: correspon a la Generalitat   EAC 1932   EAC 1979   EAC 2006
FA                               1          9          144
FR (x10,000)                     2.4        8.5        36.0

Its use is, as can be seen, paradigmatic of the 2006 EAC (E+11), with negative specificity in the 1932 (E-05) and 1979 (E-07) EAC. As shown in Morales (2018, in press), the domain of competence attribution, of which this repeated segment is one of the most characteristic linguistic expressions, at least in contemporary statutory drafting, is among those showing the most singularities in the 2006 EAC and the most diachronic change in the corpus studied.

Another of the most frequent SR (105 occurrences) is la Constitució, distributed as follows:

SR: la Constitució   EAC 1932   EAC 1979   EAC 2006
FA                   17         42         46
FR (x10,000)         40.1       39.7       11.5

In most cases these are contexts referring to a specific article of the 1978 Spanish Constitution: formulas that serve to restrict the statutory scope and establish a cross-reference to the Spanish Carta Magna. Interestingly, the specificity analysis shows a positive specific use of this SR in the 1932 (E+04) and 1979 (E+07) EAC.
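The FA / FR rows of these tables follow directly from the subcorpus sizes. A minimal sketch (the relative frequency is per 10,000 tokens, as above):

```python
def segment_profile(seg_counts, sizes):
    """Absolute frequency (FA) and relative frequency per 10,000 tokens (FR)
    of one repeated segment in each subcorpus; both arguments are dicts
    keyed by subcorpus identifier."""
    return {k: {"FA": seg_counts.get(k, 0),
                "FR": round(10_000 * seg_counts.get(k, 0) / sizes[k], 1)}
            for k in sizes}

# With the corpus sizes of section 4:
# segment_profile({"EAC1979": 9, "EAC2006": 144},
#                 {"EAC1979": 10_580, "EAC2006": 40_011})
# -> FR = 8.5 and 36.0, matching the row for 'correspon a la Generalitat'.
```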
Other legislative cross-references identified through the study of repeated segments are aquest Estatut (96), l'article 149 (de la Constitució) (26) and Title V of the EAC itself (12). The corpus being legislative, the analysis also identified as SR numerous units of legal-administrative language following the pattern determiner + noun or noun + adjective, such as l'article, l'estatut, la legislació, una llei, llei orgànica, administracions públiques, l'administració, aquest article, comunitat autònoma, de catalunya, de seguretat, del règim jurídic, disposició addicional, domini públic, dret civil, el control, el foment, el règim, els àmbits, els articles, els deures, els mecanismes, els principis, els procediments, els processos, la llei, la llengua, la majoria, la normativa, la propietat, la salut, les activitats, les actuacions, les administracions, les administracions públiques, les comunitats, les empreses, les entitats, les iniciatives, les matèries, les normes, les organitzacions, les polítiques, les universitats, llei del parlament, polítiques públiques, règim jurídic, serveis públics, serveis socials, tributs estatals and una llei del parlament.

The aspect in which this study yielded the most interesting results is, without a doubt, that of the locutions most used in some of the EAC, in some cases with specialised use. Units we studied in depth include en matèria de/d', si escau, d'acord amb, en tot cas, en els termes que and sens perjudici. The SR en tot cas shows specificity in the 2006 EAC: its use is positively specific there (E+05) and negatively specific in the 1932 (E-04) and 1979 (E-03) EAC. Its 95 occurrences are distributed as follows:

SR: en tot cas    EAC 1932   EAC 1979   EAC 2006
FA                –          10         85
FR (x10,000)      –          9.5        21.2
Specificity       E-04       E-03       E+05

In the thesis (Morales 2015:398-400) this clause was shown to be quite common in contemporary Catalan statutory discourse, and its uses were described there. The Parlament's style guide, a basic reference for contemporary statutory drafting, defines it thus: "en tot cas: adverbial locution, equivalent to en qualsevol cas, which can be used with concessive value or in the sense of 'in all cases'. When it has the latter sense, for reasons of clarity and precision, it is preferable to replace it with sempre or en tots els casos or, where appropriate, to dispense with it." (SAL 2014:272)

Another clause identified through the SR analysis is en els termes, distributed in the corpus as follows:

SR: en els termes   EAC 1932   EAC 1979   EAC 2006
FA                  –          12         63
FR (x10,000)        –          11.3       15.7
Specificity         E-04       –          E+03

The specificity analysis indicates positively characteristic use in the 2006 EAC, while the others show either no specificity (1979 EAC) or negative specificity (E-04, in the 1932 EAC).
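Reading concordances, as done next for en els termes, can be sketched with a minimal keyword-in-context routine:

```python
def kwic(tokens, segment, span=6):
    """Keyword-in-context lines for a multi-word segment: the kind of
    concordance reading applied below to 'en els termes'."""
    target = segment.split()
    k = len(target)
    out = []
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] == target:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + k:i + k + span])
            out.append(f"{left:>45}  [{segment}]  {right}")
    return out
```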
A close reading of the concordances shows that it appears above all in contexts such as en els termes que disposin/determini/estableix and the like (en els termes establerts...). Note that the 1979 EAC shows more variety in the use of this clause (its 12 occurrences display 12 different realisations), while the 2006 EAC shows less variation: of the 63 contexts in which it appears, those accumulating the most occurrences are en els termes que estableix/estableixen/estableixi/estableixin + [les lleis, la legislació...] (41) and en els termes que determinin/determinen + [la llei orgànica, la legislació...] (7). A higher degree of fixation is thus observed. Larger corpora would have to be analysed to verify this hypothesis, but the tendency towards a more fixed statutory discourse in the 2006 EAC seems confirmed. In most repeated segments, however, we observe a different linguistic behaviour between the 1932 and 1979 EAC on the one hand and the 2006 EAC on the other. These results therefore confirm the hypothesis raised initially and corroborated by the intertextual-distance study carried out by Dr. Arjuna Tuzzi (Università degli Studi di Padova).

Another identified segment equivalent to a locution is sens perjudici, which presents the following distribution in the corpus:

SR: sens perjudici   EAC 1932   EAC 1979   EAC 2006
FA                   1          28         23
FR (x10,000)         2.4        26.5       5.7
Specificity          –          E+09       E-06

As already seen in the section devoted to the pronoun allò, this SR sometimes forms part of the locution sens perjudici d'allò que. Carles Viver Pi-Sunyer (2007:37) relates the use of this clause to the following legislative technique: "The Statute of Andalusia and the proposals of the Canary Islands and Castilla-La Mancha apply the same technique as the Statute of Catalonia, although in some cases not all the sub-matters considered exclusive in the Catalan Statute have the same status in the other three. By contrast, the statutes or draft statutes of the Valencian Community, Aragon, the Balearic Islands and Castilla y León do not identify exclusive sub-matters within material domains in which the State has so far been able to enact basic legislation; yet in other cases, as we have seen, they declare competences exclusive 'sens perjudici' (without prejudice to) basic State competences, domains in which the State can clearly set bases because the Constitution expressly says so." (Viver Pi-Sunyer 2007:37)

Although we found sens perjudici appearing 23 times in the 2006 EAC, the literature indicates that its drafting brought an innovation in legislative technique concerning precisely this clause, as Ernest Benach states: "It seems to me that [the 2006 EAC] is important for 'its new legislative technique of assuming competences, which renounces the "sens perjudici" clause and opts for a casuistic and detailed definition, within each area of competence, of sub-matters or competence profiles'. And I would add that no one can be surprised that, after twenty-five years of suffering the prejudices of the 'sens perjudici', the drafters of the proposal for the new Statute opted for a modern legislative technique that states with clarity the scope of the Generalitat's competences." (Benach 2006:20)
This change, then, proves to be a product of the modernisation of legislative drafting, aimed at obtaining in the 2006 EAC a broader shielding of competences than had been achieved with the 1979 EAC.

5. Conclusions
The study presented here belongs, as already noted, to a broader postdoctoral research project carried out during the first half of 2017 at the Institut Universitari de Lingüística Aplicada of the Universitat Pompeu Fabra, funded by the Institut d'Estudis de l'Autogovern de la Generalitat de Catalunya. In that study we carried out several linguistic analyses (lexical richness, intertextual distance, specificities, etc.) of a corpus of Catalan legal discourse comprising the Statutes of Autonomy of Catalonia approved in 1932, 1979 and 2006. As noted, we analysed the repeated segments (SR) generated by the Lexico3.6 lexicometric analysis. Since it returned 2,398 results, many of which were not representative for studying the evolution of normative discourse from a discourse-analysis standpoint, we chose to analyse qualitatively some of the SR showing specificity in one of the subcorpora. The study also identified the lexical and terminological units most used in Catalan statutory drafting, as well as the institutions and offices regulated in the EAC. We found that Correspon a la Generalitat is an SR specific to the 2006 EAC which has become, as analysed in Morales (2018, in press), one of the formulaic structures most used in the drafting of laws in Catalan. Likewise, while in the 2006 EAC the syntagm la Constitució shows negative specificity, in the other two EAC it is used above statistical expectation; broader research will be needed to understand this evolution in Catalan statutory drafting. The area where the present research proved most useful is the identification of locutions, in some cases used as specialised knowledge units (UCE, in the terminology of Cabré (1999)). The most characteristic, positively, of the 2006 EAC are en tot cas and en els termes que, whereas sens perjudici tended to be used more in the drafting of the 1979 EAC; the literature identifies the motivations for these changes. Thus, by crossing two lexicometric analyses obtained with Lexico3.6 (repeated segments and specificities), this study identified linguistic units (locutions, terms and polylexical units of statutory and legal-administrative discourse, as well as offices and institutions) that evolved in Catalan normative discourse over the period 1932-2006. In future research we will extend this study of n-grams to phraseological units and formulaic structures, which appear likely to yield interesting results for describing Catalan statutory discourse from a chronological perspective.

Bibliography
[BOE] Boletín Oficial del Estado (1978). Constitución española. Madrid: Agencia Estatal Boletín Oficial del Estado.
"Moció 187/VI del Parlament de Catalunya, sobre l'exercici de l'autogovern." Butlletí Oficial del Parlament de Catalunya. 366. Barcelona: Parlament de Catalunya, 2002. 89. [DGPL] Direcció General de Política Lingüística. Criteris de traducció de textos normatius del castellà al català. Barcelona: Generalitat de Catalunya. Departament de Cultura, 1999. [SAL] Serveis d’Assessorament Lingüístic. Llibre d’estil de les lleis i altres textos del Parlament de Catalunya. Barcelona: Parlament de Catalunya, 2014. Abelló Güell, Teresa. El debat estatutari del 1932. Barcelona: Parlament de Catalunya, 2007. Albertí, Enoch, et al. Manual de dret públic de Catalunya. Barcelona: Generalitat de Catalunya. Institut d'Estudis Autonòmics, 2002. Arnold, Edward. "Le sens des mots chez Tony Blair (people et Europe)." JADT 2008: actes des 9es Journées internationales d’Analyse statistique des Données Textuelles, Lyon, 12-14 mars 2008: proceedings of 9th International Conference on Textual Data statistical Analysis, Lyon, March 12-14, 2008. Eds. Heiden, Serge, Bénédicte Pincemin and Liliane Vosghanian. Lió: Presses Universitaires de Lyon, 2008. 109-19. Benach, Ernest. L'Estatut: una aposta democràtica i moderna: Barcelona, 7 de novembre de 2005. Barcelona: Parlament de Catalunya, 2006. Bosch, Jaume. De l'Estatut a l'autodeterminació: esquerra nacional, crisi econòmica, independència i Països Catalans. Barcelona: Base, 2013. Cabré Castellví, M. Teresa. La terminología. Representación y comunicación. Elementos para una teoría de base comunicativa y otros artículos. Sèrie Monografies, 3. Barcelona: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra, 1999. Fontana, Josep. La formació d'una identitat. Una història de Catalunya. Vic: JADT’ 18 555 Eumo Editorial, 2014. Lamalle, Cédric, et al. Manuel d'utilisation. Lexico3 (Version 3.41 - Février 2003). París: SYLED–CLA2T. Université de la Sorbonne nouvelle–Paris 3, 2003. Menuet, Laëtitia. "Le discours sur l’espace judiciaire européen: analyse du discours et sémantique argumentative." Université de Nantes, 2006. Montolío, Estrella, and Enoch Albertí. Hacia la modernización del discurso jurídico: contribuciones a la I Jornada sobre la Modernización del Discurso Jurídico Español. Barcelona: Publicacions i Edicions de la Universitat de Barcelona, 2012. Morales Moreno, Albert. "Estudi lexicomètric del procés de redacció de l’Estatut d’Autonomia de Catalunya (2006)." Tesi doctoral no publicada. Universitat Pompeu Fabra, 2015. Pons, Eva, and Anna M. Pla. "La llengua en el procés de reforma de l'Estatut d'autonomia de Catalunya." Revista de Llengua i Dret.47 (2007): 183-226. Real Academia Española. Consejo General del Poder Judicial. "[DEJ] Diccionario del español jurídico." Madrid. Salem, André. "Les séries textuelles chronologiques (1)." Histoire et mesure.VI1/2 (1991): 149-75. Salem, André, M. Teresa Cabré, and Lydia Romeu. Vocabulari de la lexicometria: català, castellà, francès. Barcelona: Centre de Lexicometria, Divisió de Ciències Humanes i Socials, 1990. Segura, Antoni. Crònica del catalanisme: de l'autonomia a la independència. Barcelona: Angle Editorial, 2013. Sobrequés, Jaume. L'Estatut de la Transició: l'Estatut de Sau (1978-1979). Barcelona: Parlament de Catalunya, 2010. Tuzzi, Arjuna. L’analisi del contenuto. Introduzione ai metodi e alle tecniche di ricerca. Roma: Carocci, 2003. van Gijsel, Sofie, Dirk Speelman, and Dirk Geeraerts. "A Variationist, Corpus Linguistic Analysis of Lexical Richness." 
Viver Pi-Sunyer, C. (2007). Les competències de la Generalitat a l'Estatut de 2006: objectius, tècniques emprades, criteris d'interpretació i comparació amb els altres estatuts reformats. In Viver i Pi-Sunyer, C. et al. (eds), La distribució de competències en el nou Estatut. Barcelona: Institut d'Estudis Autonòmics, pp. 13-52.

How should the search for a sign be designed for a multilingual, multimodal written French / French Sign Language platform?
Cédric Moreau
Grhapes EA 7287 - INS HEA - UPL – cedric.moreau@inshea.fr

Abstract
This article examines access to the signs of French Sign Language (LSF) within a corpus taken from the collaborative platform Ocelles, from a bijective multilingual French/LSF perspective. There is currently no monolingual dictionary in any sign language (SL), so deaf users must master the written language of the country in order to access SL content. Most of the available tools rest on a hypothetical conceptual relationship of equivalence between the signs of SLs and the words of the dominant vocal languages. This approach originates in studies that ask deaf speakers to translate a vocal-language lexeme, out of context, into the signed language. Such corpora are subsequently used to build inventories of minimal pairs, in which handshapes, locations and movements are widely represented. This approach thus anchors a phonological hypothesis of SLs in which the above "sign - word" equivalence is dominant and decisive in the design of dictionaries. Our study lies within a completely different paradigm: the semiological model, which stems from the description of a typology and the identification of the three main transfer structures (size-and-form, situational, and personal). According to Cuxac, the signer can thus "make visible" the experience by relying on maximal resemblance between the sign sequence and the experience, or use the lexical unit, which bears no resemblance to the referent. This model, which is also integrative, takes into account the diachronic link existing within the language, under the influence of pressures between transfer structures and lexical units. The morphemic approach to the study of lexical units is in this case legitimate, since their compositionality does not rely on a strict phonology but, in the first place, on a complex morphology. We first present our paradigm and the origins of the Ocelles multilingual and multimodal platform (written, oral and signed languages), on which our written French/LSF corpus is built. We then describe a process likely to enable users to search for an LSF signifier and to link this result to that of the corresponding written French signifier.
Keywords: collaborative platform, multilingualism, multimodality, French Sign Language, LSF, deaf, sign search, semiological model, Ocelles

1. Introduction
When a sign-language speaker wants to access a resource in his or her own language, notably to look up a definition in a sign-language (SL) dictionary, two obstacles arise. The first is that very few of the tools presented as digital sign-language dictionaries are more than lexicons. Of 105 sites inventoried on the web, a majority use the label "dictionary", yet only 17 of them provide written definitions, and of those 17 only 7 give definitions in SL. The number of genuine SL dictionaries is therefore extremely small; moreover, the number of definitions does not exceed 5,000, very far from the 135,000 offered by the online Larousse dictionary (Moreau, 2012). The second obstacle concerns the difficulty, for deaf users, of accessing the very content of such a dictionary. In the vast majority of cases, the proposed entries are closely tied to knowledge of the country's written language. Mastery of that language is thus a prerequisite, which constitutes a major obstacle for deaf people whose first language is an SL and whose written language, often poorly mastered, is a second language. Of the 7 sites mentioned above, only 2 offer an entry point via the linguistic parameters of the SL (Moreau, 2012).
This question takes on particular resonance when we consider how SLs are transmitted. Transmission is not hereditary: about 95% of deaf people have hearing parents who, for the most part, do not sign. The language is therefore learnt in varied contexts, at any age, often without the stable reference of a close adult. The remaining 5% or so are deaf children of deaf parents, parents who themselves mostly belong to the previous category and come from hearing families; only 0.02% of the signing deaf population has a genealogy of three successive generations of deaf signers. The norms of SL acquisition therefore cannot be compared with those of hearing people (Cuxac and Pizzuto, 2010). Moreover, French Sign Language (LSF), marked by more than a century of prohibition as a language of instruction, has been recognised as a language of the Republic only since 2005. It is in this context that the collaborative, multilingual and multimodal Ocelles project was born (https://ocelles.inshea.fr, a project under the aegis of and with support from the Délégation générale à la langue française et aux langues de France (DGLFLF) and the Ministère de l'Éducation nationale), with the ambition of defining all concepts, in every field of knowledge and in every language, written, oral or signed (Moreau, 2017).

2. Two paradigms in confrontation
2.1. A phonological hypothesis for SLs
From a study of preschool communication between deaf toddlers and their hearing environment, Susan Goldin-Meadow identified the creation of gestures called "home signs" (Goldin-Meadow and Mylander, 1991; Goldin-Meadow, 2003). In trying to communicate with those around them, these children produce such gestures in the perceptual-practical world. These productions support the hypothesis of pre-linguistic conceptual stabilisations, unlike the productions of hearing children of the same age, for whom no such link between language and perceptual-practical knowledge exists. Once in school, these children then come into contact with an institutionalised sign language. According to Goldin-Meadow, insofar as the signifying forms of institutionalised sign languages have phonological status, the components of the child's home signs would then lose their morpheme status and become phoneme equivalents. This hypothesis can be taken as the starting point of the confrontation between two paradigms: iconicity is then compared to illustrative co-verbal gesture, relegated to the rank of pantomime outside any linguistic phenomenon. Most current sign-language "dictionaries" fall within this paradigm. Their entries are mostly defined from a hypothetical conceptual equivalence between the words of the dominant vocal languages and the lexematic units (UL) of signed languages (Fusellier-Souza, 2006). This methodology originates in studies that ask deaf speakers to translate a vocal-language lexeme, out of context, into the signed language.
The corpus thus collected is then used to build an inventory of minimal pairs in which handshapes, locations and movements are widely represented (Klima and Bellugi, 1979).

2.2. A morphemic hypothesis of SL
Our work belongs to an entirely different paradigm, in which the consequence of deafness is no longer a mere change of channel. The possibility of both saying and showing, specific to the visuo-gestural channel, has given sign languages an architecture different from that of vocal languages. According to Cuxac (2000), two discursive strategies of enunciation coexist in LSF: through the visuo-gestural channel, the signer chooses either to say without showing or to say while showing. He can thus "give to see" the experience by relying on the maximal resemblance between the sign sequence and the experience, or use an LU bearing no resemblance to the referent. The semiological model (Cuxac and Pizzuto, 2010) originates in the description of a typology and in the identification of the three main transfer structures:
- the volumes of entities (size-and-shape transfers (TTF));
- the movements of actants relative to stable locatives, in the image of a four-dimensional environment (the three spatial dimensions plus time) recreated in front of the signer (situational transfers (TS));
- the entity intended by the signer, who then becomes that entity (personal transfers (TP)) (Cuxac, 2000; Sallandre, 2003).

Real or imaginary experiences are thus anamorphosed by the signer. The semiological model therefore takes into account the diachronic link that exists within the language under the pressure exerted between transfer structures and LU, a link sometimes found in the etymology of certain LU. A morphemic approach to the study of LU is legitimate in this case, since their compositionality derives not from a phonology in the strict sense but, first and foremost, from a complex morphology. When a sign (transfer or LU) is produced, the signer's whole body takes on semantic value through the organization of the morphemic elements that compose it: gaze, facial expression, posture, head orientation, handshape, movement, location (Stokoe et al., 1965), orientation (Friedman, 1977; Liddell, 1980; Moody, 1980; Yau, 1992).

3. Salient elements in sign search for a multilingual and multimodal written French / LSF platform
3.1. Context of a sign search in a bilingual written language / SL corpus
As contributions accumulate, the collaborative Ocelles project links definitions of concepts to several signifiers, whether textual, spoken or signed. Entries are not constrained by the language of origin, and the architecture unfolds as users contribute. A textual entry may therefore take the form of a word or of a phrase when the deposited item originates in a transfer structure of the sign language. Current work therefore concerns the kind of sign indexing needed to support the search for a sign within a bilingual written language / SL corpus.

3.2. Automating the indexing
A sign is indexed via its corresponding textual entry.
There is as yet no automatic indexing of a dynamic collaborative corpus of SL signs that could serve as the basis of a search engine for an LU or a transfer operating directly on the linguistic parameters of SL. The very nature of the video signal, which is highly complex to analyse, rules out automatic indexing. Beyond the loss of three-dimensional information caused by projecting 3D space onto the 2D space of video, such work would require fine-grained tools for analysing and recognizing the various bodily components involved in parallel, at very different spatial and temporal scales; the available tools were developed for vocal languages, which are linear and single-source, not for SL (Braffort and Dalle, 2012).

3.3. Current situation and limits
Today, access via the linguistic parameters of SL signs relies mostly on the handshape. Of the 105 surveyed sites offering SL signs, only 18 allow direct access to a sign from the linguistic parameters of the sign language, without recourse to a written language. Of these 18, 17 offer entry by handshape (the number of such manual entries ranges from 9 to 211 depending on the site), 6 by movement, 10 by location, and 1 by symmetry, lip pattern and facial expression (Moreau, 2012). This phonological indexing of SL, with such a spread in the number of possible handshapes (from 9 to 211), raises the question of how to handle the potential error of a user looking for a sign perceived in discourse (the majority of cases, given the oral character of SL). Moreover, choosing among 211 handshapes, the user has one chance in 211 of picking the right one, and 210 chances in 211 of being wrong.

3.4. Description and search criteria
Indexing therefore cannot rest on a strictly phonological approach alone and must accommodate possible errors. Our hypothesis is that, in a morphemic approach, certain linguistic units involved in the formulation of a sign are salient for the signer (Moreau, 2012). Our approach is based on collaborative indexing, which captures the perceptions of the signers themselves. The process is the following:
- identification of the type(s) of transfer (TS / TP / TTF) used in producing the sign, or, for a lexematic unit, of the transfer in which it may originate;
- iterative selection of key images on which a description of the salient linguistic units is based (Thom, 1988);
- a finer description of the retained units is then proposed.

Even if linguistic structures cannot today be assumed to be familiar to all contributors, they cannot be ignored either. Two approaches are envisaged. The first, inherent in the platform's primary purpose, consists in offering definitions of these concepts so as to familiarize users progressively with their use; a succession of possible, increasingly precise anamorphoses is then proposed. This approach is consistent with the use of any tool requiring a minimum of prerequisites, just as a dictionary requires the alphabet. The second approach takes these gaps into account by inscribing the process in a continuum, allowing a contribution based on the selection and then the description of images that are, from the user's point of view, representative of the sign. It is thus the whole set of macro-to-microscopic descriptions supplied by each contributor that serves as the basis for weighting the salient linguistic units. These data are then reused as search criteria for a sign.
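A minimal sketch of how such collaborative weighting could work. All names, the description format and the scoring scheme are hypothetical illustrations, not the Ocelles implementation:

```python
from collections import Counter

# Hypothetical contributor descriptions of one sign: each contributor
# records the transfer type(s) perceived and the linguistic units found
# salient in the key images he or she selected.
contributions = [
    {"transfers": ["TTF"], "units": ["handshape:flat", "location:chest"]},
    {"transfers": ["TTF", "TS"], "units": ["handshape:flat", "movement:arc"]},
    {"transfers": ["TTF"], "units": ["handshape:flat", "location:chest"]},
]

def weight_units(contribs):
    """The more contributors mention a unit, the more salient it is
    assumed to be; weights are normalized mention frequencies."""
    counts = Counter()
    for c in contribs:
        counts.update(c["transfers"])
        counts.update(c["units"])
    total = sum(counts.values())
    return {unit: n / total for unit, n in counts.items()}

def score(query_units, weights):
    """Rank candidate signs by summed weights, so a user mistaken about
    one parameter can still retrieve the intended sign."""
    return sum(weights.get(u, 0.0) for u in query_units)

weights = weight_units(contributions)
print(score(["handshape:flat", "location:chest"], weights))
```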
Conclusion

ADT and visualization: a new way of reading corpora
The second-round debates of the French presidential elections (1974-2017)
Jean Moscarola¹, Boris Moscarola²
¹ Université Savoie Mont Blanc, ² Le Sphinx-Développement

Abstract
The progress of textual data analysis leads from a statistical and lexical description of corpora to their semantic analysis. Software thus offers qualitative researchers the opportunity to ground their interpretations in surrogates that summarize the corpus, or to code the corpus automatically. Finally, data visualization offers the reader an experience of the corpus, creating the conditions for critical control. This approach is illustrated by the analysis of the second-round presidential debates, carried out with DataViv, the new Sphinx module.

Keywords: discourse analysis, lexical statistics, semantic analysis, data visualization, Sphinx software

1. Introduction
ADT (textual data analysis), born of an encounter between literary research and statistics, has moved from the study of great works to that of mass media and political communication. With big data and the semantic web, it has been enriched by the new tools of AI and now addresses all types of corpora. In the human sciences, content analysis developed at the junction of purely qualitative research and quantitative methods, but with no explicit relation to ADT. This paper addresses researchers and qualitative analysts who remain reluctant to use ADT tools. It draws on the corpus of the second-round presidential debates and uses Sphinx's new DataViv application to illustrate a new reading experience.

2. Methods and techniques
2.1 From digital humanities to artificial intelligence
Computers have long been used to digitize the great corpora of literature (Frantext). The 1960s thus saw the birth of the digital humanities (Burdick) and the use of statistics to characterize the style of great authors or to attribute anonymous works to them (Muller). Then, in the 1970s, statisticians founded the French school of textual data analysis, which resonated with structuralism and discourse analysis (Beaudouin).
In the 1960s, another path was opened in the United States with the construction of computerized thesauri (Stone) used to code the content of mass media. These approaches are at the origin of the techniques presented below. They were enriched in the 2000s by progress in linguistic engineering and natural language processing (Véronis).

2.2 Textual data analysis
The statistical examination of texts has evolved from counting words to studying their associations. In the tradition of concordancers, the way was opened to the search for repeated segments (Lebart) that stud political (Marchand) or advertising (Floch) discourse. Computer graphics, cognitive maps (Eden) and word clouds give a visual representation of these concordances. The influence of contexts and the search for lexical specificities complement these global descriptions (Brunet, Lebart). Factorial analysis methods (Benzécri) offer a synthesis between the rigidity of repeated segments and the disorder of word clouds. By revealing affinities between frequently associated terms, they provide a structural analysis of texts, popularized by factorial maps that lay out the lexical universes revealing the themes of a text; it is then up to the analyst to read them semiotically. Dually to the identification of lexical universes, Reinert proposed grouping units of meaning (answers, sentences or word sequences, etc.) to create a partition from several factorial analyses used to progressively define homogeneous classes. This method, implemented in the ALCESTE software that gave it its name, has been taken up and enriched by other packages (IRaMuTeQ, Sphinx). Similar approaches are found among English-speaking researchers. Latent semantic analysis (Landauer) shifts attention from the observation of co-occurrences to the search for latent dimensions measured by factorial axes. Frame analysis, formulated by Goffman, interprets the use of certain key words and their relations as "diffuse conceptualizations"; such frames are one way of interpreting lexical universes.

2.3 Linguistics
Originally, software recognized only graphical forms (sequences of letters containing no separator), without managing to distinguish singular from plural or the different inflections of the same verb. Lemmatization was a major step forward: the various spellings of a word are replaced by its lemma, the infinitive for verbs, the masculine singular for nouns and adjectives. The analysis of morphosyntactic properties then made it possible to distinguish "full words" by grammatical status: nouns give the objects of texts or speeches, adjectives the judgements and opinions, verbs the actions. The search for syntagms identifies the expressions specific to a domain, the most expressive forms of concordances (Mayaffre).

2.4 Semantics
Semantics is concerned with meaning, moving from the level of signifiers to that of signifieds. Despite their theoretical interest, works in general linguistics did not lead to applications; it is corpus linguistics that marks the real rise of semantic analysis. The idea is to model the knowledge of particular domains as signifieds defined by the set of signifiers attached to them (Saussure).
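A minimal sketch of this idea of coding a text against categories defined extensionally by their attached signifiers. The categories and word lists are invented for illustration; the real systems discussed next rely on large curated resources:

```python
# Hypothetical thesaurus: each conceptual category (signified) is
# documented by a set of words (signifiers).
thesaurus = {
    "employment": {"emploi", "chômage", "salarié", "travail"},
    "citizenship": {"citoyen", "vote", "république", "nation"},
}

def code_text(tokens, thesaurus):
    """Count, for each category, how many tokens document it."""
    return {category: sum(token in words for token in tokens)
            for category, words in thesaurus.items()}

tokens = "le chômage pèse sur chaque salarié et chaque citoyen".split()
print(code_text(tokens, thesaurus))
# {'employment': 2, 'citizenship': 1}
```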
As early as the 1960s, the General Inquirer project developed at Harvard computational resources for automatically coding media content; its dictionaries are still available. WordNet®, the large lexical database of English developed at Princeton University, generalizes this approach, improving the efficiency of dictionaries through semantic networks; WordNet can be regarded as a generalized thesaurus reflecting the corpora on which it is built. These ideas have been taken up by semantic engines. In the 2000s, linguistic engineering and natural language processing (Normier) went beyond the purely lexical approach by specifying thesauri (Da Silva) with ontologies (Gruber) and semantic networks (Godard). The thesaurus defines the tree of conceptual categories: the signifieds. Ontologies consist of the lists of words that document these categories: the signifiers. Semantic networks specify the assignment of terms to thesaurus categories according to the links observed in reference corpora: the referents. With the rise of social networks, it finally became essential to grasp the tone of messages capable of making or unmaking reputations. The 2010s thus saw the appearance of natural language processing applications for synthesizing the opinions expressed on the web, which became known under the names "sentiment analysis" and "opinion mining" (Thelwall). These analyses complement the recognition of thesaurus categories by rating texts according to their positive or negative orientation on a scale comparable to an opinion measure. Textual data analysis has thus evolved from a descriptive, statistical and lexical approach to a semantic approach founded on knowledge modelling. Made very accessible by software (Boughzala), it is a resource for qualitative research, as we now illustrate on a political corpus.

3. Contributions of ADT to corpus analysis
3.1 The example of the second-round debates
The analysis of political discourse is an ADT classic (Marchand, Mayaffre). Debate transcripts, analysed at different levels (speakers, turns, sentences), are treated as data to reveal the style, lexical structures, ideas and opinions that characterize them. The corpus of the 7 second-round debates covers 43 years of political life, from 1974 to 2017. It is analysed at https://www.sphinxonline.net/debats/1974-2017/analyse.htm, which presents in detail what this article can only sketch. Our aim is to illustrate the methods just described and to discuss their relevance for qualitative research. The reader is invited to explore the site first-hand, an experience richer than the overview that follows:
- The candidates' statements are precise: definite articles appear in 2 sentences out of 3. The shifters "je" and "vous" are used more frequently than "nous".
- The expressions "premier ministre", "assemblée nationale", "pouvoir d'achat", "général de gaulle" and "milliard d'euro" dominate over the whole period.
- The map of lexical universes shows an opposition between the evocation of political life on one side and the vocabulary of economy and society on the other.
- Of the 11 themes identified by automatic classification, the themes "Government, Majority", "Country, French", "Year, Nuclear" and "Company, Employee" come first.
- The main concepts recognized by the thesaurus of the application used¹ are "Vote", "Civilization", "Employment and wages", "Fiscal policy", "Citizenship"...
- The tone is neutral in half of the interventions; among the rest, positive stances are slightly more frequent.

Reference to the candidates and periods completes this global description.
- To each his style: Jospin, Royal and Mitterrand stand out for their use of "je"; Chirac for the more collective "nous"; Marine Le Pen addresses her opponent ("vous") when she is not addressing the audience; Macron shows the most balanced usage.
- The key words over-represented in each period clearly mark the change of century: "politique", "gouvernement", "problème" in the 20th century; "entreprise", "emploi", "européen" in the 21st.
- The thematic categories of the lexical classification are associated with groups of candidates: Sarkozy, Royal and Hollande develop the themes "Company, Employee", "Law", "Crisis, Priority" and "Power, President"; Mitterrand and Giscard d'Estaing "Socialist, Communist" and "Government, Majority"; Macron and Le Pen "Unemployment, Employment" and "French, Country".
- Finally, the concepts of the semantic analysis clearly distinguish the periods: "Vote", "Civilization", "Degree of liberalism" in the 20th century; "employment", "citizenship", "fiscal policy" in the 21st.

¹ The Larousse thesaurus (Péchon, 1994) integrated into SphinxIQ2.

3.2 Contribution to purely qualitative analysis
These results, described at greater length in the online application, can be used in the spirit of purely qualitative research provided they are taken as part of a descriptive and exploratory process whose value lies solely in the researcher's ability to read and interpret them (Moscarola). The key words, clouds, maps, classifications and concepts proposed by the software are surrogates of the corpus. They bear the trace of the mental models (Johnson-Laird) and of the representations and social influences addressed by speech act theory (Austin) and by sociolinguistics. ADT provides a kind of radioscopy of these and helps to understand them better. It also allows a distanced reading that escapes the risk of recursivity (Dumez), or makes it possible to control that risk: the lexical or semantic surrogates on which the researcher bases his interpretations can be communicated, exposing his reading to the criticism of a discussion grounded in shared elements.

3.3 Contribution to content analysis
ADT can also be seen as a modality of traditional content analysis (Berelson, Bardin). It differs from it by the automatism of an "artificial reading" that identifies categories established statistically, by learning, or from a thesaurus. One thus recovers the inductive approach that interprets a posteriori the structures revealed by factorial analyses, or recognizes in the corpus the concepts of the thesaurus. Each unit of meaning can then be coded in a variable that "measures" meaning and can be used with the classical procedures of quantitative analysis.
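A minimal sketch of this coding step. The theme lexicons and speaker data are invented for illustration; an actual analysis would take its classes from a Reinert-style classification or from a thesaurus:

```python
from collections import Counter

# Hypothetical theme lexicons derived from a lexical classification.
themes = {
    "Company, Employee": {"entreprise", "salarié", "emploi"},
    "Government, Majority": {"gouvernement", "majorité", "ministre"},
}

def code_unit(tokens):
    """Assign a meaning unit (e.g. one sentence) to its best-matching theme."""
    scores = {t: sum(tok in lex for tok in tokens) for t, lex in themes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"

# Once coded, each sentence is one observation of a categorical variable
# that can be cross-tabulated with speaker, year, and so on.
corpus = [
    ("Mitterrand", "le gouvernement et sa majorité ont échoué".split()),
    ("Sarkozy", "chaque entreprise doit créer un emploi".split()),
]
print(Counter((speaker, code_unit(tokens)) for speaker, tokens in corpus))
```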
In our example, one can thus look for the lexical or semantic elements that explain or discriminate the candidates' political affiliations.

3.4 Back to the text and data visualization
Despite its practical and scientific interest, lexical or semantic ADT carries two major risks: the systematic error to which machine reading exposes the analyst, and the abusive reduction imposed by the researcher's choices, whether in the framing of the problem or in the results he chooses to report. The first risk can be avoided by returning to the text for a verification reading. This is the only way for the researcher and his reader to check the meaning of the lexical elements or the relevance of the concepts and evaluations identified by the semantic engines. This possibility appears with hypertexts, and it is all the more necessary as infographics (word clouds, maps) make the representations ever more eloquent. So-called navigation methods facilitate this return to the text and can be enriched by entries coming from the lexical and semantic codings or from elements of the visual representations. Generalized lexical navigation, in the spirit of data visualization (Faulx-Briole), thus gives the reader direct access to the verbatims associated with the words of a cloud or a map, with the categories of an automatic classification, or with the concepts and evaluations of a semantic analysis. For example: which verbatims correspond to the use of the words "gens" or "français", do they lean left or right, what does the concept "citizenship" cover, and is it tied to a period or specific to certain candidates? Back to the text, but to the context as well: the analysis of political discourse has been a pioneer in this field. On 15-03-2012, Le Monde published a dynamic infographic giving access to the candidates' campaign speeches (Véronis). The observatory of political discourse (Mayaffre) is another example: starting from a word cloud synthesizing the content of the speeches, it details their meanings through verbatims and specifies their usage by candidate. With this type of application, the qualitative researcher can complement the communication of his results and interpretations by giving access to the corpus through an interactive navigation offered to the reader, who can then check the author's interpretations and extend them with his own conjectures. This is what we offer at https://www.sphinxonline.net/debats/1974-2017/analyse.htm, which presents the surrogates and syntheses leading to the conclusion that a profound transformation of political debate began at the turn of the century. The reader can test these tendencies, feed a critical discussion or prompt new explorations and conjectures. The software thus produces results while giving the reader the possibility of discussing them, which is the essence of the scientific approach.

Bibliography
Bardin, L. (1977). L'Analyse de contenu. PUF.
Beaudouin, V. (2016). Retour aux origines de la statistique textuelle : Benzécri et l'école française de l'analyse de données. JADT 2016.
Benzécri, J.-P. (1992). Correspondence Analysis Handbook. Marcel Dekker.
Berelson, B. (1952). Content Analysis in Communication Research. Glencoe: Free Press.
Boughzala, Y., Hervé, H. and Moscarola, J.
(2014). Sphinx Quali : un nouvel outil d'analyses textuelles et sémantiques. JADT, Université de Paris.
Brunet, E. (2016). Apports des technologies modernes à l'histoire littéraire. HAL.
Burdick, A., Drucker, J. et al. (2012). Digital Humanities. MIT Press.
Da Silva, L. (2006). Thésaurus et systèmes de traitement automatique de la langue. Documentation et bibliothèques.
Dupuy, P.-O. and Marchand, P. (2016). Les débats de l'entre-deux-tours de l'élection présidentielle française (1974-2012). Mots. Les langages du politique.
Eden, C. (1988). Cognitive mapping. European Journal of Operational Research.
Faulx-Briole, A. (2017). Datavisualisation et tableaux de bord interactifs. Solution Business.
Floch, J.-M. (1988). The contribution of structural semiotics to the design of a hypermarket. International Journal of Research in Marketing, 4(3), Semiotics and Marketing.
Goffman, E. (1974). Frame Analysis: An Essay on the Organization of Experience. Harper and Row.
Gruber, T. (1992). Toward Principles for the Design of Ontologies Used for Knowledge Sharing. International Journal of Human-Computer Studies.
Johnson-Laird, P. N. (1983). Mental Models: Toward a Cognitive Science of Language, Inference and Consciousness. Harvard University Press.
Landauer, T. K., Foltz, P. W. and Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes. Routledge.
Lebart, L. and Salem, A. (1988). Analyse de données textuelles. Dunod.
Marchand, P. (2016). Les représentations sociales dans le champ des médias. In G. Lo Monaco, S.
Mayaffre, D. (2005). Analyse du discours politique et logométrie : point de vue pratique et théorique. Langage et société, 114.
Mayaffre, D. (2014). Plaidoyer en faveur de l'analyse de données co(n)textuelles. Parcours cooccurrentiels dans le discours présidentiel français. Actes JADT, Nice.
Moscarola, J. (2018). Faire parler les données. Éditions EMS.
Muller, C. (1979). Étude de statistique lexicale. Le vocabulaire du théâtre de Pierre Corneille. Paris: Slatkine.
Normier, B. (2007). L'apport des technologies linguistiques au traitement et à la valorisation de l'information textuelle. ADBS.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2), pp. 187-198.
Stone, P. J., Dunphy, D. C., Smith, M. S. and Ogilvie, D. M. (1966). The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
Thelwall, M. (2017). Sentiment Analysis for Small and Big Data. SAGE.
Véronis, J. (2014). Le traitement automatique des corpus oraux. In Traitement automatique des langues. Hermès.

A conversation analysis of interactions in personal finance forums
Maurizio Naldi
University of Rome Tor Vergata – maurizio.naldi@uniroma2.it

Abstract
Interactions on a personal finance forum are investigated as a conversation, with post submitters acting as speakers. The presence of dominant positions is analysed through concentration indices. Patterns in replies are analysed through the graph of replies and the distribution of reply times.

Keywords: Personal finance; Conversation analysis; Concentration indices.

1. Introduction
Decisions concerning personal finance are often taken by individuals not just on the basis of factual information (e.g., a company's official financial statements or information about the past performance of funds), but also considering the opinions of other individuals. Nowadays, personal finance forums on the Internet have often replaced friends and professionals in that role.
In those forums, the interaction occurs among people who typically do not know one another personally and have very little personal information (if any) about the other participants. Nevertheless, they often create online communities that can bring value to all participants [1]. Examples of such forums are SavingAdvice (http://www.savingadvice.com/forums/) or Money Talk (http://www.money-talk.org/board.html). The actual influence of such forums on individuals' decisions has been investigated in several papers, considering, e.g., how the level of activity on forums impacts stock trading levels [2], how participation in such forums pushes towards more risk-seeking behaviour [3], or introducing an agent-based model to determine how individual competences evolve through the interaction [4]. It has been observed that such forums may be employed by more aggressive participants to manipulate more inexperienced ones [5], establishing a dominance over the forum. In addition to being undesirable for ethical reasons, such an influence is often contrary to the very rules of the forum. Here we investigate the subject by adopting a different approach from the semantic analysis of [5]. In particular, we investigate the presence of imbalances in the online discussion and the dynamics of the interaction between participants. The rationale is that participants wishing to manipulate others would try to take control of the discussion by posting more frequently and being more reactive. For that purpose we employ two datasets extracted from the two most popular personal finance threads on the SavingAdvice website. For the purpose of the analysis, a thread is represented as the sequence of participants taking turns, with the date and time of each post attached. We conduct a conversation analysis, wishing to assess whether: 1) there are any dominant participants (in particular the thread starter); 2) repetitive patterns appear, such as sustained monologues or sparring matches between two participants; 3) replies occur on a short time scale. The paper provides the following contributions:
- through the use of concentration indices we find that, though no dominance exists, the top 4 speakers submit over 60% of the posts (Section 3);
- both recurring reply sequences and monologues appear (Section 4);
- reply times can be modelled by a lognormal distribution, with 50% of the posts being submitted no more than 14 or 23 minutes (for the two datasets respectively) after the previous one (Section 4).

2. Datasets
We consider the two most popular threads on the SavingAdvice website. The topics are the following, with an identifying short name between parentheses:
1. Should struggling families tithe? (Struggling)
2. African-American Personal Finance Gurus (Guru)
The main characteristics of those datasets are reported in Table 1. For each thread we identify the set of speakers S = {s_1, s_2, ..., s_n}, i.e., the individuals who submit posts. We also identify the set of posts P = {p_1, p_2, ..., p_m} and a function F : P → S that assigns each post to its submitter. For each speaker we can therefore compute the number of posts he/she submitted. Using the indicator function 1(·), the number of posts submitted by the generic speaker s_i is

N(s_i) = \sum_{j=1}^{m} \mathbf{1}(F(p_j) = s_i)    (1)

Table 1: Datasets
Thread       Creator             No. of speakers   No. of posts
Struggling   jpg7n16             25                155
Guru         james.hendrickson   18                104
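A minimal sketch of the computation in (1) and of the per-speaker shares used in the next section, on an invented toy thread in place of the real datasets:

```python
from collections import Counter

# A thread as the sequence of submitters taking turns (toy data).
posts = ["alice", "bob", "alice", "carol", "alice", "bob"]

# N(s_i): the number of posts per speaker, i.e. formula (1).
counts = Counter(posts)   # Counter({'alice': 3, 'bob': 2, 'carol': 1})

# The per-speaker fractions of posts play the role of market shares
# in the concentration indices of Section 3.
m = len(posts)
shares = {speaker: n / m for speaker, n in counts.items()}
print(counts, shares)
```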
3. Dominance in a thread
In this section we wish to examine whether some dominance emerges in a thread. We adopt concentration indices borrowed from the field of industrial economics. We analyse dominance by considering the frequency of posts: an individual (or a group of individuals) is dominant if it submits most of the posts. We first examine how posts are distributed by looking at the rank-size plot: after ranking speakers by the number of posts they submit, the frequency of posts is plotted vs the rank of the speaker. In Figure 1, we see that a roughly linear relationship appears between log N(i) and the rank i, so that a power law N(i) = k/i^α (a.k.a. a generalized Zipf law) may be assumed to apply, where k is a normalizing constant and α is the Zipf exponent (see, e.g., [6]), measuring the slope of the log-linear curve and hence the imbalance between the contributions of the speakers. By performing a linear regression, we get a rough estimate of α, reported in Table 2.

Table 2: Concentration measures
Thread       Zipf exponent   HHI      CR4
Struggling   0.2545          0.1220   61.94%
Guru         0.2501          0.1396   67.31%

As more general indices to assess a dominant position, we borrow two from industrial economics: the Hirschman-Herfindahl Index (HHI) [7, 8, 9] and the CR4 [10, 11]. For a market where n companies operate, with market shares v_1, v_2, ..., v_n, the HHI is

HHI = \sum_{i=1}^{n} v_i^2    (2)

The HHI satisfies the inequality 1/n ≤ HHI ≤ 1, where the lowest value corresponds to no concentration (a perfect equidistribution of the market) and the highest value represents monopoly. Therefore, the larger the HHI, the larger the concentration. The CR4 instead measures the percentage of the whole market owned by the top four companies; similarly, the higher the CR4, the heavier the concentration. In our case, the fraction of posts submitted by a speaker can be considered as his/her market share, so that the HHI can be redefined as

HHI = \sum_{i=1}^{n} \left( \frac{N(s_i)}{m} \right)^2    (3)

and the CR4 as

CR4 = \sum_{i=1}^{4} \frac{N(s_{(i)})}{m}    (4)

where s_(1), ..., s_(4) are the four speakers submitting the most posts. For our datasets we get the results reported in Table 2. According to the guidelines provided by the U.S. Department of Justice, the point of demarcation between unconcentrated and moderately concentrated markets is set at HHI = 0.15 [12]. Since the values in Table 2 are below that threshold, we cannot conclude that there is a significant concentration phenomenon. However, the CR4 index shows that the top 4 speakers submit more than 60% of all the posts. Delving deeper into the top 4, we also see that the most frequent speaker typically contributes around 1/4 of the overall number of posts, which represents a major influence. In the Struggling dataset, the most frequent speaker is the thread originator (with 22.6% of the posts), while that is not true in the Guru dataset, where the most frequent speaker contributes 26.9% of the posts and the originator just 2.88%.

Fig. 1: Rank-size plot

4. Replies
After examining dominance, we turn to interactions. In this section we analyse the pattern of replies, looking for recurrences in the sequence of replies and examining the time elapsed before a post is replied to. We build a graph representing how speakers reply to each other. We consider each post as a reply to the previous one. We build the replies graph by setting a link from a node A to a node B if the speaker represented by node A has replied at least once in the thread to a post submitted by the speaker represented by node B.
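A minimal sketch of this construction, again on toy data; networkx is one possible choice of library, not necessarily the one used by the author:

```python
import networkx as nx

# The thread as a turn sequence: each post replies to the previous one.
posts = ["alice", "bob", "alice", "alice", "carol", "bob"]

G = nx.DiGraph()
for prev, cur in zip(posts, posts[1:]):
    # Edge cur -> prev: cur replied to prev. A self-loop marks a
    # monologue (the same speaker posting twice in a row).
    G.add_edge(cur, prev)

# Bidirectional links reveal couples of speakers replying to each other.
mutual = [(a, b) for a, b in G.edges if a < b and G.has_edge(b, a)]
print(list(G.edges), mutual)
```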
The resulting graphs are shown in Figure 2, ordered from the core to the periphery by decreasing node degree, laid out on concentric rings. Here the degree of a node represents the number of speakers to which it replies. In both cases, an inner core of highly connected nodes appears, representing the speakers who reply to most other speakers. Reply patterns emerge as bidirectional links (couples of speakers who reply to each other). Loops instead represent monologues, i.e., speakers submitting two or more posts in a row.

Fig. 2: Replies graph

Further, we are interested in how fast the interactions between contributors to the thread are. We define the reply time as the time elapsing between a post and the subsequent one. The main statistics of the reply time are reported in Table 3. In both datasets the mean reply time is around 1 hour, but 50% of the replies take place within either 14 minutes (Guru dataset) or 23 minutes (Struggling dataset), i.e., with a much smaller turnaround. There is therefore a significant skewness to the right. A more complete view of the variety of reply times is obtained by modelling the probability density function. In Figure 3, we report the curves obtained through a Gaussian kernel estimator, an exponential model, and a lognormal model (whose parameters have been estimated by the method of moments). By applying the Anderson-Darling test, we find that the exponential hypothesis is rejected at the 5% significance level, while the lognormal one is not rejected, with a p-value as high as 0.72 for the Struggling dataset and 0.076 for the Guru dataset.

Fig. 3: Reply time

Table 3: Reply time statistics (in minutes)
Thread       Mean   Median   Standard deviation   95% percentile
Struggling   70.5   23       156.2                254.7
Guru         58.9   14       112.7                406.7

5. Conclusions
We have analysed two major threads within a personal finance forum as a conversation between submitters acting as speakers, searching for dominance and interaction patterns. Though no significant concentration exists, the top four speakers submit over 60% of the posts. Patterns of interaction emerge as the presence of several couples of speakers who reply to each other, several monologues, and short reply times (with 50% below 14 and 23 minutes for the two datasets, though a significant distribution tail is present).

References
[1] Arthur Armstrong and John Hagel. The real value of online communities. Knowledge and Communities, 74(3):85-95, 2000.
[2] Robert Tumarkin and Robert F. Whitelaw. News or noise? Internet postings and stock prices. Financial Analysts Journal, 57(3):41-51, 2001.
[3] Rui Zhu, Utpal M. Dholakia, Xinlei Chen, and René Algesheimer. Does online community participation foster risky financial behavior? Journal of Marketing Research, 49(3):394-407, 2012.
[4] Loretta Mastroeni, Pierluigi Vellucci, and Maurizio Naldi. Individual competence evolution under equality bias. In 2017 European Modelling Symposium (EMS), Nov 2017.
[5] John Campbell and Dubravka Cecez-Kecmanovic. Communicative practices in an online financial forum during abnormal stock market behavior. Information & Management, 48(1):37-52, 2011.
[6] Maurizio Naldi and Claudia Salaris. Rank-size distribution of teletraffic and customers over a wide area network. Transactions on Emerging Telecommunications Technologies, 17(4):415-421, 2006.
[7] Stephen A. Rhoades. The Herfindahl-Hirschman Index. Fed. Res. Bull., 79:188, 1993.
[8] Maurizio Naldi. Concentration indices and Zipf's law. Economics Letters, 78(3):329-334, 2003.
[9] Maurizio Naldi and Marta Flamini. Censoring and distortion in the Hirschman-Herfindahl Index computation.
Economic Papers: A Journal of Applied Economics and Policy, 2017.
[10] I. Pavic, F. Galetic, and Damir Piplica. Similarities and differences between the CR and HHI as an indicator of market concentration and market power. British Journal of Economics, Management and Trade, 13(1):1-8, 2016.
[11] Maurizio Naldi and Marta Flamini. Correlation and concordance between the CR4 index and the Herfindahl-Hirschman index. SSRN Working Paper Series, 2014.
[12] The U.S. Department of Justice and the Federal Trade Commission. Horizontal Merger Guidelines, 19 August 2010.

Textual analysis, semantic noise and morphosyntactic peculiarities: problems and pre-processing strategies for special corpora
Stefano Nobile
Sapienza Università di Roma – stefano.nobile@uniroma1.it

Abstract
The proliferation of text analysis techniques has made possible the combined use of different software packages, each suited to specific analysis and research needs. However, the opportunities offered by the various packages do not mitigate a fundamental problem inherent in the characteristics of some peculiar corpora. Perfectly suited to the analysis of carefully written texts in a supervised style, these packages nevertheless cannot eliminate certain issues. Among the most common are those tied to the morphosyntactic rules of the language and to the semantic noise they carry. Problems of "noise", such as that generated in spontaneous conversations, require many precautions in the preparation of the corpus. The situation is exacerbated with Twitter, whose ease of access and message download has produced analyses that are not always adequately supported from the theoretical point of view. Poems and songs present a similar problem: there, the problem derives from the structure of the communicative genre itself, whose rhetorical devices inflate the critical mass generated by some words, thereby affecting the computation of several relevant parameters and making the results less readable. What strategies can adequately prepare the corpora to be analysed in these two particular situations? This contribution proposes some strategies for operating in these conditions, highlighting the advantages on the empirical level as well as the repercussions on the theoretical one.
Keywords: semantic noise, poetry, song, rhetoric, corpus pre-processing, constructivism vs. realism

1. Semantic noise and peculiar textual corpora
The multiplication of textual analysis techniques has given researchers in this field the possibility of using, even in combination, different software packages, each with its own peculiarities answering specific analysis and research needs. However, the breadth of opportunity offered by the many packages on the market (T-Lab, Taltac, Spad-T, R, etc.) does not mitigate a fundamental problem inherent in the characteristics of some peculiar corpora: the distortions attributable to the semantic noise generated both by elements irrelevant from the point of view of content and by redundancies that alter the balance of power between words. Perfectly suited to the analysis of carefully written texts in a supervised style, such as newspaper articles or institutional material, these packages nevertheless cannot rescue the user from the difficulties arising in circumstances which, more or less in step with the spread of social networks, have become hegemonic in the production of texts on the web. Among these circumstances, one of the most common concerns what might be called written orality, which pays little or no attention to the morphosyntactic rules of the reference language and therefore carries a high share of semantic noise, understood here in relation to obtaining a readable and processable form of text. Problems of "noise" like that generated in spontaneous conversations, found on the web in the most disparate forms, require of the researcher a series of precautions in preparing the corpus, so as to avoid obtaining strongly distorted data. That is to say, contracted linguistic forms (cmq, nn, xké), expressive devices meant to render the tones of speech (belloooo, bravaaaa), quite peculiar graphical signs (Ã, ò, ðŸ, ù, é, 🠳), redundancies, retweets, text not in ASCII format, hashtags, multimedia links and markup language are all terms of a sum whose result is a proliferation of semantic noise, to whose effects are added the distortions affecting the derived indices (lexical sophistication and lexical extension) as well as the corpus measures (occurrences, graphical forms, hapaxes). This situation is exacerbated with Twitter, whose ease of access and message download has for some time been producing analyses that are not always adequately supported from the theoretical point of view (Ebner, Altmann and Softic, 2011).
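A minimal sketch of the kind of normalization involved; the substitution list and patterns are illustrative only, and a real pipeline would need a far richer inventory:

```python
import re

# Illustrative expansions of contracted forms found in Italian tweets.
CONTRACTIONS = {"cmq": "comunque", "nn": "non", "xké": "perché"}

def normalize(tweet):
    text = re.sub(r"^RT\s+", "", tweet)        # drop the retweet marker
    text = re.sub(r"https?://\S+", "", text)   # drop links
    text = re.sub(r"[@#](\w+)", r"\1", text)   # unwrap mentions and hashtags
    text = re.sub(r"(\w)\1{2,}", r"\1", text)  # belloooo -> bello
    return " ".join(CONTRACTIONS.get(w, w) for w in text.split())

print(normalize("RT @amico cmq nn è belloooo #estate http://t.co/xyz"))
# amico comunque non è bello estate
```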
Indeed, it happens ever more often that "the high degree of automatism of the procedures and the strong tendency towards statistical modelling can expose textual analysis to research styles marked by a naive pursuit of objectivity through the obsessive extremization of numerical computation applied to texts, with the consequent grave loss of the role of context" (Tipaldo, 2014: 191; italics added). The need to squeeze a text into 140 characters (doubled only from November 2017, but the substance does not change) leads users to find solutions, sometimes conventional, sometimes original, for fitting the text within the set limits, just as people did when SMS had limited character sets and were relatively expensive. Hence the production of a considerable quantity of semantic noise that makes "natural" textual data hard to process. To the cases just reviewed, today very widespread, we may add equally peculiar yet entirely different corpora, such as those of poems and songs (Nobile, 2012). In texts of this nature, the problem derives from the very construction of these communicative genres. When they resort to certain rhetorical devices (anadiplosis, epanalepsis, polyptoton, anaphora, epanadiplosis and others), they inflate the critical mass generated by some words. This ends up affecting the computation of several relevant parameters (typical and exclusive specificities, lexical extension, lexical sophistication, the rank of individual words, comparison with peculiar lexicons, etc.), making the results less readable. A very frequent case, here taken to its extreme, is the following: in a song, some words are repeated obsessively, out of strictly rhythmic necessity or to heighten the hook effect. This is what happens, to give a single example within the narrow margins allotted to this contribution, in the song Pino (fratello di Paolo), where the word Pino appears no fewer than 60 times within a few seconds, inevitably inflating all the modalities of the variables (artist, decade of publication, musical macro- and micro-genre, gender) to which this single song is linked (Nobile, 2012). An analogous argument holds for the use of rhetorical figures; among the many, we may take anaphora as an example. Anaphora is a rhetorical figure consisting in the repetition of one or more words at the beginning of a sentence or a verse. Although it can also be found in prose, it is in poetry and song that it optimizes its expressive potential. Among the countless possible examples, one is Vai in Africa, Celestino!, a song released by the singer-songwriter Francesco De Gregori in 2005: pezzi di stella, pezzi di costellazione / pezzi d'amore eterno, pezzi di stagione / pezzi di ceramica, pezzi di vetro / pezzi di occhi che si guardano indietro / pezzi di carne, pezzi di carbone / pezzi di sorriso, pezzi di canzone / pezzi di parola, pezzi di parlamento / pezzi di pioggia, pezzi di fuoco spento. Here it is the word pezzi ("pieces") that appears a considerable number of times thanks, precisely, to the rhetorical device of anaphora.
The case of literature is of course no different; by way of example we can invoke the celebrated Canto III (canto and canzone, indeed...) of Dante's Inferno: Per me si va ne la città dolente / per me si va ne l'etterno dolore / per me si va tra la perduta gente. Poetry and song, then, can present structural characteristics that affect the text mining performed by the various software packages, insofar as they yield altered numerical information. Although the redundancy of some terms does not necessarily imply an upheaval of the syntagmatic axis (Bolasco, 2005), i.e. of the possibility of reconstructing the sense of the text by a criterion of adjacency of words within elementary contexts, it can compromise the sense expressed by the data on the frequency of full words, on the peculiarities (both endogenous, expressible in terms of specificities, and exogenous, translatable in terms of peculiar language) and on the number of graphical forms. What strategies, then, are available to the researcher to adequately prepare the corpora to be analysed in these two particular situations, i.e. a flood of graphical signs and repeated words? It is certainly not enough to clean the text orthographically, nor to expunge from it all those signs, such as emoticons or Twitter's own communicative syntax, that interfere with many analysis parameters. Nor, on the other hand, can the corpus be "domesticated" to the point of distorting its essential character, i.e. the spontaneity of quasi-speech in the first case and the morphosyntactic and rhetorical structure in the second.

2. Corpus pre-processing strategies
The solutions to the kinds of problems just described vary according to the nature of the problem, the computing skills of the user and the analytical perspective adopted by the researcher, and will depend on the combination of these three dimensions. Let us examine them. Cleaning the characters of natural texts depends largely on the user's computing skills, over and above the capabilities of the software used. To date, a user without advanced computing skills cannot do an impeccable cleaning job on very "dirty" textual corpora such as those coming from Twitter. While a large dose of patience will help in using a spell-checker to purge the text of the typos typical of "natural", unsupervised texts, such a user will inevitably collide with the swarm of special characters mentioned above. Three solutions are available. The basic level consists in manually replacing, in bulk, all the graphical signs to be corrected, paying attention, when using an ordinary word processor, to upper and lower case. This operation is all the longer and more tedious the longer, the more complex and the richer in hypertextual references the corpus to be cleaned is. In some cases, packages such as Taltac include a built-in function for removing certain special characters. A second solution is to program macros (or, alternatively, to use external programs) that solve the same kind of problem. This solution is more effective in terms of the final result, but equally demanding in terms of the skills and time required.
The third solution is, on paper, the one that best optimizes the cost/benefit ratio. It consists in exploiting research programs whose very goal is the cleaning of web-born texts used for textual analysis. Projects such as Readability or CleanEval (Baroni et al., 2008) go in this direction, but they in turn present two orders of problems: one tied to costs, the other to effective accessibility. Both, moreover, show flexibility problems with respect to the different formats of the corpora to be processed (Claridge, 2007; Petri and Tavosanis, 2009). The question of the treatment of corpora that owe their peculiarity to their underlying structure, while posing no major computing problems, is more complex and implies decisive choices on the researcher's part. The researcher will in fact have to make gnoseological and theoretical choices with respect to the aims pursued, well aware that these decisions will inevitably affect the empirical findings. In other words, the researcher confronted with textual material born not as prose but as verse essentially has to choose between a faithful, "photographic" representation of the characteristics of the corpus under examination and one that instead discounts the redundancies and all those elements that can inflate some parameters of the corpus, from the count of graphical forms down to the hapaxes. In the first case, the outcomes of the analysis will bear the impact not only of the rhetorical and morphosyntactic elements that may characterize the song form or the poem form, but above all of the refrain. Accepting this perspective means treating certain sections of text, and the elements of it that thicken certain terms through the authors' choices about the verses, as elements which, precisely because they are repeated, deserve to stand out parametrically in the analysis of the corpus. Can we say that in such a case the results are deceptive? It depends, precisely, on the perspective one intends to adopt. A hyperrealist representation leads us to choose the first formula, that of maximum philological rigour, of absolute zeal: to a certain amount of words, however endlessly repeated, must correspond the real frequency value of those words, with all that this implies in terms of relations between words, frequencies and identification of topics within the corpus. At the opposite end, the researcher may have excellent reasons to lean towards a constructivist perspective, by virtue of which the datum is shaped not by the actual frequency of words, with the redundancies that some corpora carry for the reasons already stated, but by the text purged of its redundant elements. An example that should make palpable the implications of the two options and the difference between them can be drawn from a recent study of the lyrics of Italian songs, which updates, in a more markedly sociolinguistic direction, earlier work of mine (Nobile, 2012).
From the corpus¹ gathering the lyrics of the artists who placed one or more records in the top sixty of the sales charts between the 1960s and 2016, we select the two with the highest number of chart entries²: Mina (170 songs) and Renato Zero (177). From each of the two corpora we extract, after lemmatization and normalization of the text, the full words. We can then assign a rank to each of them according to its number of occurrences in two situations: one in which the text is reported exactly as sung (hence with all the elements of redundancy discussed above) and one in which it has been purged of those elements which, owing to the very structure of the song, cause the substantial repetition of certain terms³. The comparison between the two rankings, carried out for each of the two artists, calls for Spearman's rank correlation coefficient (ρ). The values obtained from the two comparisons are of undoubted interest: for Mina, Spearman's ρ is 0.61; for Renato Zero, 0.68. This information alone tells us how much the cleaning of the text and the semantic noise generated by redundancies can produce more than tangible consequences in the structuring of the data to be processed: a word with a low rank is more likely to be selected among the key words, to appear as a term specific to a certain subset, to emerge as a word that stands out by virtue of the rank it occupies in reference dictionaries (De Mauro et al., 1993) and, therefore, to rise to the status of an indicator of the linguistic peculiarity of a given speaker or of a certain aggregated unit of analysis. Thus, in Mina's corpus the word specchio ("mirror"), once the refrains are sacrificed, shifts by 165 rank positions, and the word rabbia ("anger") loses 100 positions between the two treatments of the corpus. Similarly, in Renato Zero's corpus the word identikit loses 226 positions depending on whether or not the corpus is purged of redundancies: it occurs in a single song (Io uguale io), repeated countless times. The same goes for the word fame ("hunger"), which loses 183 positions: although, unlike identikit, it cuts across the whole of the Roman singer-songwriter's songbook, it recurs a considerable number of times as the hook of the song C'è fame.

¹ The corpus consists of the lyrics of 5,940 songs, amounting to 1,321,994 occurrences, 43,855 distinct graphical forms, 22,160 full words and 1,905 hapaxes.
² For the sampling criteria, see Nobile, 2012: 51-53, or Nobile, L'italiano della canzone dagli anni sessanta a oggi. Una prospettiva sociolinguistica, forthcoming.
³ The text purged of duplicated verses was produced with an Excel function (Data, Remove duplicates), keeping fixed the references to the individual songs and to the different artists, so as to avoid removing duplicated verses regardless of these two reference parameters.
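A minimal sketch of this comparison with scipy; the two toy rank vectors stand in for the real rankings of the full and deduplicated corpora:

```python
from scipy.stats import spearmanr

# Ranks of the same full words in the raw corpus (refrains kept) and in
# the deduplicated corpus (repeated verses removed): toy values.
rank_raw   = [1, 2, 3, 4, 5, 6]
rank_dedup = [2, 1, 6, 3, 4, 5]

rho, pvalue = spearmanr(rank_raw, rank_dedup)
print(round(rho, 2))  # values close to 1 mean the refrains barely matter
```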
3. Conclusions
In these pages we have seen that the ease of access to a cyclopean quantity of textual material retrievable on the web is not matched by an equally effortless possibility of analysing it. On the one hand, this material incorporates the typical features of so-called natural language and, as such, runs into not only the common problems of machine learning and text mining (the most frequent of which can be seen, for example, in machine translators or in speech recognition programs), but also those created by the overdose of increasingly widespread elements such as emoticons, special characters, and so on. To these problems one can add others, belonging to the domain of poetry and song, which make a particularly careful and well-considered pre-treatment phase of the texts necessary before they are submitted to analysis. The article has tried to show how the epistemological choices made upstream by the researcher have, in the case of the peculiar linguistic forms discussed here, significant repercussions on the empirical results themselves. Moreover, lexicometric operations on textual material with strong semantic noise risk – if not adequately supported by a cleaning of the corpus which is anything but straightforward – producing results in which the share of semantic noise may even exceed that of the text carrying actual meaning (Nobile, 2016).

References
Baroni M., Chantree F., Kilgarriff A. and Sharoff S. (2008). CleanEval: A competition for cleaning webpages. In Proceedings of the 6th Conference on Language Resources and Evaluation (LREC), pp. 638-643. ELDA.
Bolasco S. (2005). Statistica testuale e text mining: alcuni paradigmi applicativi. Quaderni di Statistica, 7: 17-53.
Chiari I. (2007). Introduzione alla linguistica computazionale. Laterza.
Claridge C. (2007). Constructing a corpus from the web: message boards. In M. Hundt, N. Nesselhauf and C. Biewer, eds, Corpus Linguistics and the Web, pp. 87-108. Rodopi.
De Mauro T., Mancini F., Vedovelli M. and Voghera M. (1993). Lessico di frequenza dell'italiano parlato. EtasLibri.
Ebner M., Altmann T. and Softic S. (2011). @twitter analysis of #edmedia10 – is the #informationstream usable for the #mass. Form@re, 11(74): 36-45.
Lancia F. (2004). Strumenti per l'analisi dei testi. FrancoAngeli.
Nobile S. (2012). Mezzo secolo di canzoni italiane. Una prospettiva sociologica (1960-2010). Carocci, Roma.
Nobile S. (2016). Consenso e dissenso. Le reazioni degli elettori ai post dei candidati. In Morcellini M., Faggiano M.P. and Nobile S., eds, Dinamica Capitale. Traiettorie di ricerca sulle amministrative 2016, pp. 115-138. Maggioli.
Pandolfini V. (2017). Il sociologo e l'algoritmo. L'analisi dei dati testuali al tempo di Internet. FrancoAngeli.
Petri S. and Tavosanis M. (2009). Building a Corpus of Italian Web Forums: Standard Encoding Issues and Linguistic Features. JLCL, 24(1): 115-128.
Tipaldo G. (2014). L'analisi del contenuto e i mass media. Il Mulino.
The Individual in the Group(s): Focus Groups and Corpus Partitioning
Daniel Pélissier
Université Toulouse 1 Capitole – daniel2.pelissier@ut-capitole.fr

Abstract
Lexicometric analyses of focus groups depend in particular on the researcher's choice of corpus partitioning. After proposing a typology of possible partitionings, we present the results of an experiment applying one of these approaches to a corpus of ten focus groups. These analyses highlight some contributions and limitations of lexicometry compared to conversational analysis.

Keywords: focus groups, partitioning, individual, group.

1. Introduction
Lexicometry first studied written discourse (newspaper articles, political speeches, etc.) and answers to open-ended questions (Lebart and Salem, 1988), before turning to transcribed oral conversations (Rouré and Reinert, 1993; Bonneau and Dister, 2010). The latter are indeed more delicate to analyse, because the texts are generally shorter and display particular syntactic patterns. Focus groups belong to this family of data, while posing the specific problem of a large number of participants. According to some authors, this type of data is difficult to analyse with lexicometric software (Duchesne and Haegel, 2014). Yet lexicometric analysis has been used in several studies (Guerrero et al., 2009; Grésillon et al., 2012; Hulin, 2013; Bengough et al., 2015; Brangier et al., 2015), and methodological articles have examined the effectiveness of lexicometric treatments (Dransfield et al., 2004; Peyrat-Guillard et al., 2014). The possibility of processing focus groups with lexicometry is thus established. However, the specific contributions of a quantitative approach remain to be clarified in a field dominated by qualitative approaches, foremost among them conversational analysis. For example, the link between focus groups and social representations has been emphasized (Jovchelovitch, 2004), and Reinert's (1983) descending hierarchical classification (DHC) forms lexical worlds (Ratinaud and Marchand, 2015) whose nature is close to social representations. In this article we focus on the place of the individual in the group(s), an issue that lexicometry makes approachable through a suitable set of variables. But such an analysis requires preparing the corpus with specific methods. We therefore first present a typology of preparation methods for a focus group corpus, complementing the analyses of Peyrat-Guillard et al. (2014) and highlighting those centred on the individual. We then analyse the results of an experiment with one of these methods, showing how it affords an understanding of the individual's discourse within the group(s).

2. Typology of partitionings of a focus group corpus
Before starting the lexicometric treatment of focus groups, the corpus requires specific preparation.
Indeed, certain partitioning decisions will determine which lexicometric methods can be employed and which analyses are possible. The moderator's turns are often removed from the focus group data (Guerrero et al., 2009; Peyrat-Guillard et al., 2014), since in a focus group their role is to keep the exchanges flowing without steering them. It may nevertheless be advisable to compare the results with and without the moderator's interventions (Peyrat-Guillard et al., 2014). The second question concerns the partitioning of the corpus derived from the focus group. Several methods exist. A first possibility is to analyse the focus group as a single entity, without taking the exchanges between individuals into account. Either each focus group constitutes one text with no distinction between individuals (Dransfield et al., 2004) – the argument put forward by users of this method is that it simplifies the statistical analyses, although this is not self-evident, the number of segments being stable – or the focus group is partitioned into themes on the basis of a content analysis (Bengough et al., 2015), an approach which makes it possible, for instance, to compare the results of a thematic analysis with the one the lexicometric software proposes to the researcher. The second family of partitionings is the one that seeks to preserve the exchanges of the focus group. Either the partition is centred on the individuals, the so-called decrowded approach (Peyrat-Guillard et al., 2014), in which the turns of each individual are gathered together (Guerrero et al., 2009); or each turn is treated as a text, the so-called crowded approach (Peyrat-Guillard et al., 2014). Each of these methods has advantages and drawbacks. We do not believe that any single partition should be favoured: the decision depends on the analyses the researcher envisages, given his or her research question. In this article we concentrate on the second family, which makes it possible to study the individual in the group(s) and not only the themes addressed.

3. Results of the experiment with speaker-based partitioning
We were able to experiment with these partitioning methods on a focus group corpus drawn from research we conducted among recent higher-education graduates (bac+3 and bac+5 levels). The focus group discussions concerned the digital recruitment communication of banks, and these young graduates exchanged views on the devices companies use to recruit. We moderated and then transcribed 10 focus groups of 6 to 7 people, i.e. 67 speakers in total.

3.1. Corpus preparation and partitioning
Once the texts had been prepared (anonymization, integration of proper nouns (BNP, Facebook, etc.) into the dictionary, adaptation of the dictionary to the specificities of the discourse, etc.), we decided to remove the researcher's interventions, since they remained neutral with respect to the discourse of the young graduates we wished to analyse. We then created a partition by speech turns according to the following principle (variables in square brackets):

[Groupe1, Ingénieurs, NUM1, 18ans, masc]: il y a des choses marquantes, il y a un site web où on n'a pas beaucoup d'informations et un autre site où il y a beaucoup d'informations.
[Groupe1, Ingénieurs, NUM2, 20ans, masc]: je suis d'accord avec toi.

Indeed, we wanted to identify individual discourses within the focus groups and to be able to associate profile variables with each speaker.
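The paper describes this association as performed by an automatic script. A minimal sketch of what such a script could look like is given below; the file names, the speaker-label pattern (NUM1, NUM2, …) and the profile table are all assumptions, and the output follows the starred-line metadata format expected by Iramuteq, illustrated further below.

import csv
import re

# Hypothetical inputs: a transcript with lines like
# "NUM1: il y a des choses marquantes..." and a CSV of speaker profiles
# (columns: num, formation, groupe, sexe, participation, initial).
profiles = {row["num"]: row for row in csv.DictReader(open("speakers.csv"))}

with open("transcript.txt", encoding="utf8") as src, \
     open("corpus_iramuteq.txt", "w", encoding="utf8") as out:
    for line in src:
        m = re.match(r"(NUM\d+):\s*(.+)", line.strip())
        if not m:
            continue  # moderator turns, assumed unlabelled, are skipped
        num, text = m.groups()
        p = profiles[num]
        # One starred metadata line per speech turn, then the turn itself.
        out.write(f"**** *num_{num} *formation_{p['formation']} "
                  f"*groupe_{p['groupe']} *sexe_{p['sexe']} "
                  f"*participation_{p['participation']} *initial_{p['initial']}\n")
        out.write(text + "\n")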
The variables used (Table 1) were determined according to our research hypotheses and their accessibility, and were then associated by an automatic script with each speaker turn.

Table 1. Focus group variables associated with the speakers.

Num  Variable code   Values                                Source                               Description
1    num             1, 2, 3, etc.                                                              Number of each speaker
2    formation       3IL: engineering school;                                                   Name of the group's programme
                     LPB: professional banking degree
3    groupe          1, 2, 3, etc. (10 groups in total)                                         Group number
4    sexe            M, F                                                                       Speaker's sex
5    participation   TA (very active), A (active),         SONAL statistics, based on the       Quantitative indicator of each
                     PA (not active)                       number of turns                      speaker's participation
6    initial         STS, IUT                              Data from the training institution   Speakers' initial education

The corpus was thus presented in the following form for use in Iramuteq (Ratinaud, 2009):

**** *num_44 *formation_LPB *groupe_1 *sexe_M *participation_A *initial_STS
moi je veux bien commencer. Quand je suis allé sur le site de la SG, …

The characteristics of the corpus obtained and processed with the Iramuteq software are the following: 1,876 texts, ranging from a single form (Oui, for example) for the shortest to 126 forms or 280 occurrences for the longest; 40,404 occurrences and 2,094 forms in total; 21.54 occurrences per text on average; hapaxes account for 41.26% of the forms. Each text thus corresponds to one speaker's turn in a focus group.

3.2. Methodological choices
While Reinert's DHC is the method most often cited in the literature (Duchesne et al., 2010; Grésillon et al., 2012; Hulin, 2013; Peyrat-Guillard et al., 2014; Brangier et al., 2015; Freitas and Luis, 2015, etc.), other techniques are also involved, such as factor analysis (Dransfield et al., 2004; Guerrero et al., 2009) or, more rarely, similarity analysis (Bengough et al., 2015). Our choice of the Reinert classification is linked to our research hypotheses, which relate the discourse of these young graduates to social representations; Reinert's (1983) DHC indeed favours the identification of social representations (Ratinaud and Marchand, 2015). We ran several simple DHCs on text segments, varying the number of classes requested and the minimum number of segments per class. We chose to retain the forms with a frequency above 3 (687 forms in this case), in order to focus the processing on the most frequent forms. At the end of these simulations, we retained a DHC with 15 classes and a rate of classified segments of 83.63%.

3.3. Example of variable use: groups and degree of participation
Since each turn has been associated with context variables, the chosen method makes it possible to test the link between the groups and each identified class. Thus, for this focus group corpus, class 1 (Chi² = 20.82, job seeking) and class 12 (Chi² = 16.76, newspaper articles) are associated with the 3IL students, while class 7 (Chi² = 32.17, Dupuy) and class 13 (Chi² = 11.44, benefits and values) are rather linked to the banking degree group (Fig. 1).

Figure 1. Chi² per class for the variable 'formation'.

Likewise, the participation variable (Table 1 and Fig. 2) made it possible to associate certain classes with this characteristic. The results of the DHC thus allow us to formulate a hypothesis about the degree of consensus surrounding a social representation.
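The class/variable associations reported above rest on a chi-square statistic computed per class and per variable modality. A minimal sketch of that computation for one class and one modality follows; the counts are invented for illustration, and scipy's chi2_contingency stands in for Iramuteq's internal computation.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: segments classified into class 8 or elsewhere,
# split by the 'participation' value of their speaker (PA vs. other).
#                  in class 8   in other classes
table = np.array([[ 42,          310 ],    # PA (not very active) speakers
                  [ 118,        1530 ]])   # other speakers

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"Chi2 = {chi2:.2f}, p = {p:.3f}")
# A positive association holds when the observed count in the
# (class, modality) cell exceeds the expected count.
print("over-represented:", table[0, 0] > expected[0, 0])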
Figure 2. Association of class 8 with the participation variable.

Indeed, class 8, about the size of the organization, is associated with the speakers who participated little overall (variable PA (not very active), Chi² = 4.19; Fig. 2), as is class 3 (mobility). The discussions about job seeking (class 1), the Dupuy bank (class 7) or the rankings of websites and testimonials are dominated by the most active speakers (variable TA (very active): Chi² = 5.69 for class 1, and variable A (active): Chi² = 7.51 for class 7). These can be perceived as more conflictual or more engaged. The exchanges about size thus left more room for the less active speakers, with more consensual, less conflictual discussions than for less stabilized representations. This hypothesis then points to the possible structure of this social representation, built around a stable central core, which would require further studies to be confirmed.

3.4. Identifying individual discourses through correspondence analysis (CA)
The partitioning carried out also makes it possible to identify individuals whose discourses differ (Fig. 3), thanks to a CA performed after a Reinert DHC. In this case, two individuals mainly stand out: 17 and 37. Returning to the text confirms this identification. Another benefit is that it also brings out groupings of individuals that differ from their assignment to a given focus group. By highlighting sets of speakers, the CA offers an approach that crosses the boundary of each focus group and proposes an analysis of the individual in the groups.

Figure 3. CA based on the DHC, displaying the variables (F1/F2, 19.57% of inertia).

4. Conclusion
The lexicometric methods used to analyse focus groups depend in particular on the corpus partitioning carried out upstream. In our research, associating variables with each speaker turn made it possible to identify sub-groups of individuals within the focus groups, discourses of isolated individuals, and sub-groups spanning several focus groups, none of which were obvious during the exchanges themselves. This approach nevertheless has certain limitations. First, the automated variable-association procedure used in this experiment does not allow one to track the evolution of themes during the discussion; a variable recording the position of the speech turns would have been necessary for that. Second, the identification of individuals relied on a CA that explains a small share of the variance (19.57%), so the causes of the singularity of the discourses are difficult to relate to the DHC. Finally, other methods could have been brought to bear (analysis of anti-profiles, specificities, similarities, etc.). Without replacing conversational analysis, which contributes its own nuances, certain lexicometric methods can thus help us understand the corpus differently and complement the understanding of this type of rich and deep data, in particular by crossing the boundary of each focus group and facilitating a transversal approach to meaning.

Acknowledgements: thanks to Pascal Marchand, Pierre Ratinaud and Lucie Loubère for their introduction to lexicometry and to Iramuteq.
References
Bengough T., Bovet E., Bécherraz C., Schlegel S., Burnand B. and Pidoux V. (2015). Swiss family physicians' perceptions and attitudes towards knowledge translation practices. BMC Family Practice, December: 1-12.
Bonneau J. and Dister A. (2010). Logométrie et modélisation des interactions discursives, l'exemple des entretiens semi-directifs. Journées internationales d'Analyse statistique des Données Textuelles, pp. 253-264.
Brangier E., Barcenilla J., Bornet C., Roussel B., Vivian R. and Bost A. (2015). Prospective ergonomics in the ideation of hydrogen energy usages. In Proceedings of the 19th Triennial Congress of the IEA, Melbourne, pp. 1-2.
Dransfield E., Morrot G., Martin J.-F. and Ngapo T.-M. (2004). The application of a text clustering statistical analysis to aid the interpretation of focus group interviews. Food Quality and Preference, 15(4): 477-488.
Duchesne S. and Haegel F. (2014). L'entretien collectif. Armand Colin, Paris.
Duchesne S., Haegel F., Frazer E., Van Ingelgom V., Garcia G. and Frognier A.-P. (2010). Europe between integration and globalisation: social differences and national frames in the analysis of focus groups conducted in France, francophone Belgium and the United Kingdom. Politique Européenne, 30(1): 67-105.
Freitas E. A. M. and Luis M. A. V. (2015). Perception of students about alcohol consumption and illicit drugs. Acta Paul Enferm., 28(5): 408-414.
Grésillon E., Cohen M., Lefour J., Goeldner L. and Simon L. (2012). Les trames vertes et bleues habitantes : un cheminement entre pratiques et représentations. L'exemple de la ville de Paris (France). Développement Durable et Territoires, 3: 2-17.
Guerrero L., Guàrdia M. and Xicola J. (2009). Consumer-driven definition of traditional food products and innovation in traditional foods. A qualitative cross-cultural study. Appetite, 52(2): 345-354.
Hulin T. (2013). Enseigner l'activité « écriture collaborative ». Tic&société, 7(1): 89-116.
Jovchelovitch S. (2004). Contextualiser les focus groups : comprendre les groupes et les cultures dans la recherche sur les représentations. Bulletin de Psychologie, 57(3): 245-261.
Lebart L. and Salem A. (1988). Analyse statistique des données textuelles. Dunod, Paris.
Peyrat-Guillard D., Lancelot Miltgen C. and Welcomer S. (2014). Analysing conversational data with computer-aided content analysis: The importance of data partitioning. Journées internationales d'Analyse statistique des Données Textuelles, pp. 519-530.
Pélissier D. (2016). Pourquoi et comment utiliser la lexicométrie pour l'analyse de focus groups ? Présence numérique des organisations, 11/07/2016.
Ratinaud P. (2009). Iramuteq. Lerass.
Ratinaud P. and Marchand P. (2015). Des mondes lexicaux aux représentations sociales. Une première approche des thématiques dans les débats à l'Assemblée nationale (1998-2014). Mots. Les Langages du Politique, 108(2): 57-77.
Reinert M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les Cahiers de l'Analyse des Données, 8(2): 187-198.
Rouré H. and Reinert M. (1993). Analyse d'un entretien à l'aide d'une méthode d'analyse lexicale. Journées internationales d'Analyse statistique de Données Textuelles. ENST, Paris, pp. 418-42.

Using the First Axis of a Correspondence Analysis as an Analytical Tool. Application to Establish and Define an Orality Gradient for Genres of Medieval French Texts
Bénédicte Pincemin1, Céline Guillot-Barbance2, Alexei Lavrentiev3
Univ. Lyon, CNRS, IHRIM UMR5317 – benedicte dot pincemin at ens-lyon dot fr; celine dot guillot at ens-lyon dot fr; alexei dot lavrentev at ens-lyon dot fr

Abstract
Our corpus of medieval French texts is divided into 59 discourse units (DUs) which cross text genres and spoken vs non-spoken text chunks (as tagged with q and sp TEI tags). A correspondence analysis (CA) performed on selected POS tags indicates orality as the main dimension of variation across DUs. We then design several methodological paths to investigate this gradient as computed by the first CA axis. Bootstrap is used to check the stability of observations; gradient-ordered barplots provide both a synthetic and an analytic view of the correlation of any variable with the gradient; a way is also found to characterize the gradient poles (here, more-oral or less-oral poles) not only with the POS used for the CA analysis, but also with words, in order to get a more precise and lexical description. This methodology could be transposed to other data with a potential gradient structure.

Keywords: textometry, Old French, represented speech, spoken genres, methodology, correspondence analysis, 1D model, data visualization, XML TEI, TXM software, DtmVic software.

1. Linguistic issue and preparation of textual data
We investigate spoken language features of Medieval French in a corpus composed of 137 texts (4 million tokens), taken from the Base de français médiéval(1). The corpus is annotated with part-of-speech (POS) tags at the word level; speech quotation chunks and speech turns are marked up using TEI XML tags at an intermediate level between sentences and paragraphs; and every text can be situated in a 32-genre typology (Guillot et al., 2017). Our hypothesis is that the features of orality may be related to text chunks representing speech, and also to text genres, as for instance some text genres are intended for oral performance. In order to perform a textometric analysis (Lebart et al., 1998) on our XML-TEI annotated data, we use the TXM open-source corpus analysis platform (Heiden, 2010; Heiden et al., 2010)(2). We divide our corpus into 59 discourse units (DUs) obtained by splitting every genre into parts which represent speech on the one hand, and the remaining parts on the other hand (some text genres have no spoken passages). Discourse unit labels, like q_rbrefLn for instance, combine four pieces of information: (i) the first letter is either q for quoted speech chunks, sp for speech turns, or z for remaining (non-oral) chunks; (ii) then we have the short name of the text genre (here, rbref means "récit bref", i.e. short narrative); (iii) the uppercase letter stands for the domain(3); (iv) the last character indicates whether this DU is represented in our corpus by one (1), two (2) or more (n) texts. We linguistically represent our texts with the POS tags(4) they use(5). The reliability of POS tags was measured in a previous study (Guillot et al., 2015) for a subset of 7 texts in which tags had been manually checked. For the present analysis, we eliminate low-frequency POS tags (freq. < 1,500), which include many high error rate tags and do not carry much weight in the quantitative analysis. For the remaining high error rate tags (with more than 25% wrong assignments), we measure their influence on the correspondence analysis (CA) by checking their contribution to the first axis. Then we remove the proper nouns category (NOMpro), which shows both a high error rate and a high contribution to the first axis (14.66%). A new correspondence analysis enables two additional improvements from a linguistic perspective. We remove compound determiners (DETcom, PRE.DETcom, like ledit) as they emerged at the end of the 13th century, so that they introduce a singular and substantial diachronic effect (high contributions on the first axis). Moreover, the second axis describes mainly the association between psalms (z_psautierRn) and possessive adjectives (ADJpos): this corresponds to very specific phrases with some distinctive nouns (la meie aneme, li miens Deus, la tue misericorde), and the adjective is equivalent to a possessive determiner in other contexts, so we merge the two categories (DETADJpos). We finally get a contingency table crossing 59 DUs with 33 POS tags to explore with a CA.

1 Base de français médiéval: http://bfm.ens-lyon.fr
2 Textometry Project and TXM software: http://textometrie.org
3 There are 6 domains: literature (L), education (D for "didactique"), religion (R), history (H), law (J for "juridique"), practical acts (P).
4 We use the Cattex2009 tagset, designed for Old French: http://bfm.ens-lyon.fr/spip.php?article176.
5 We exclude punctuation, editorial markup and foreign words. CQL query: [fropos!="PON.*|ETR|OUT|RED"]
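Under standard assumptions, the CA itself reduces to a singular value decomposition of the standardized residuals of the DU × POS contingency table. The sketch below (plain numpy, hypothetical input file) recovers the first-axis coordinates used throughout the paper as the orality gradient, along with the contributions used to screen the tags; it is an illustration of the method, not the authors' TXM workflow.

import numpy as np

# Hypothetical input: a 59 x 33 contingency table of POS counts
# per discourse unit, rows = DUs, columns = POS tags.
N = np.loadtxt("du_pos_counts.tsv")
P = N / N.sum()                               # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)           # row and column masses
# Standardized residuals: (P - r c') / sqrt(r c')
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the DUs on axis 1: the orality gradient.
axis1 = (U[:, 0] * sv[0]) / np.sqrt(r)
# Contribution of each DU to axis 1 (mass times squared coordinate,
# divided by the axis inertia).
ctrb1 = r * axis1**2 / sv[0]**2
# Same on the column side, to check which POS tags drive the axis.
col1 = (Vt[0, :] * sv[0]) / np.sqrt(c)
ctrb1_pos = c * col1**2 / sv[0]**2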
2. Linguistic and methodological results from correspondence analysis
Our study reveals that the first axis can in fact be interpreted as an orality gradient. The factorial map (Fig. 1) shows z_ DUs on the left-hand side of the first axis, opposed to q_ and sp_ DUs on the right-hand side. Some genres intended for oral performance move to the right with the speech chunks (especially plays – dramatiqueL, dramatiqueR), whereas genres related to written processing (especially practical acts (P): charters, etc.) move to the left with the out-of-speech chunks. As this opposition matches the first axis, orality appears as the first contrastive dimension for Old French (as regards POS frequencies), as it is in Biber's experiments with English (Biber, 1988), with the same kind of linguistic features (Table 1). As a second result, DUs can then be sorted according to their degree of orality, from "less oral" to "more oral" (see the Appendix(6)). Peculiar positions (for didactic dialogues or psalms, for instance) can be explained by a formal use of language dictated by the rules of the genre. The linguistic analysis of the DU gradient is detailed in (Guillot-Barbance et al., 2017)(7).

Figure 1. CA map of the 59 DUs (TXM). 21 DUs with low representation quality (squared cosine on the 1 × 2 plane < 0.3) and no significant contribution to this plane (ctrb1 < 2% and ctrb2 < 2%) have been filtered out (macro CAfilter.groovy), so that the figure is clearer.

6 The Appendix is available online as a related file of this paper in the HAL archive: https://halshs.archives-ouvertes.fr/halshs-01759219
7 Improvements made to the statistical processing in 2018 (management of the second axis with ADJpos and DETpos merging, confidence ellipses) strengthen the linguistic interpretation published in 2017; no significant change is observed on the gradient given by the first axis, according to the four zones defined by the analysis, except for a few points which are not related to this axis (low squared cosine).
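The cos²/contribution filter applied in Figure 1 can be reproduced outside TXM. The following lines are a small continuation of the CA sketch above (they reuse the names S, U, sv, r, axis1 and ctrb1 defined there) and apply the thresholds quoted in the caption; this is again an illustrative reconstruction, not the CAfilter.groovy macro itself.

# Squared cosines of each DU on the 1x2 plane (quality of representation):
# cos2(i) = (f_i1^2 + f_i2^2) / d_i^2, where d_i^2 is the squared
# chi-square distance of row i to the centroid.
axis2 = (U[:, 1] * sv[1]) / np.sqrt(r)
d2 = (S**2).sum(axis=1) / r                 # squared distances to centroid
cos2_plane = (axis1**2 + axis2**2) / d2
ctrb2 = r * axis2**2 / sv[1]**2
# Keep a DU if it is either well represented on the plane
# or contributes significantly to one of the two axes.
keep = (cos2_plane >= 0.3) | (ctrb1 >= 0.02) | (ctrb2 >= 0.02)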
A bootstrap validation (Dupuis & Lebart, 2008; Lebart & Piron, 2016) is applied to evaluate the stability of DU positions on the first axis (Figure 2). Sizes of ellipses in the 1×2 map are correlated with the sizes of the DUs: the fewer words there are in a DU, the less data the statistics process, and the larger the confidence ellipse (Table 1). Only five DUs are ascribed a big ellipse showing their uncertain position (Figure 2): all of them are DUs of about ten to about a hundred words, which correspond to very singular linguistic usages and are neither representative nor relevant for this overall linguistic analysis. The orality gradient is thus confirmed through a statistical validation on our data.

Figure 2. CA map of the 17 DUs with the largest confidence ellipses (DtmVic). The two largest ones (q_proverbesD2, q_lapidaireD2) could not be drawn; the next three largest (q_commentaireD1, q_dialogueD2, q_sermentJ1) show that these DU positions cannot be interpreted; the other, smaller ellipses indicate that the 54 remaining DU positions on axes 1 and 2 are stable.

Table 1. The eight POS with the highest contributions on the first axis, for both sides.

"Less oral" pole                                "More oral" pole
PRE          preposition                        PROper      personal pronoun
NOMcom       common noun                        ADVgen      general adverb
PRE.DETdef   preposition + definite determiner  ADVneg      negative adverb
VERppe       past participle                    VERcjg      finite verb
DETdef       definite determiner                PROadv      adverbial pronoun (en, y)
DETcar       cardinal determiner                DETADJpos   possessive determiner or adjective
VERppa       present participle                 CONsub      subordinating conjunction
CONcoo       coordinating conjunction           VERinf      infinitive verb

The 2D factorial map provides a synthetic and efficient visualization. The display of the second axis reveals that the "more oral" pole is more compact, more consistent, than the "less oral" pole, which is more heterogeneous (the squared cosine values corroborate this). But what we want to stress in this methodological paper is that the main linguistic result is provided solely by the interpretation of the first axis. Benzécri illustrated the same kind of approach by using a 1D CA to reveal the hierarchy of characters in Racine's Phèdre (1981: 68). This method emphasizes the analytic power of CA, which separates the data (by the mathematical means of Singular Value Decomposition) into "deep" components (factors), just as a prism breaks light up into its constituent spectral colors. Despite its main use as a 2D illustration of a corpus structure in the textual data analysis field, CA is much more than a suggestive visualization or a quick sketch.

3. Complementary tools to analyse a 1D gradient in textual data
We now test new means to gain insight into the causes of this gradient in our data.

3.1. Gradient-ordered barplot
The first method we propose is to visualize the evolution of POS frequencies along the orality gradient, using a specificity bar-plot chart where the DU order on the x-axis is given by the DU order on the first CA axis: this display visually reveals how strongly a POS is correlated with speech or non-speech features, and details its affinity with each DU. For instance, personal pronouns are typical of the more-oral pole: this is displayed as a rising profile (Figure 3), and one can easily find out which DUs have an outlying use of this POS. Conversely, a POS like adjectives (Figure 4), which is not correlated to the orality gradient, gets a chart with no overall pattern.

Figure 3. Gradient-ordered specificity barplot for the personal pronoun, as an example of a POS which is correlated to the first axis. For readability reasons, the height of the specificity bars is capped at 20.
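TXM's specificity score is based on the hypergeometric model (Lafon's specificities). The sketch below is a minimal reconstruction of the gradient-ordered barplot under that assumption; it reuses the count matrix N and the axis1 coordinates from the CA sketch above, and pos_tags (the list of the 33 POS labels) is an assumed variable.

import matplotlib.pyplot as plt
from scipy.stats import hypergeom

def specificity(k, n, K, T):
    # Signed -log10 of the hypergeometric tail probability of observing
    # k tokens of a POS in a part of size n, given K tokens of that POS
    # in a corpus of T tokens.
    if k >= n * K / T:                        # over-represented: right tail
        return -np.log10(hypergeom.sf(k - 1, T, K, n))
    return np.log10(hypergeom.cdf(k, T, K, n))

pos = "PROper"
j = pos_tags.index(pos)                       # pos_tags: assumed label list
T, K = N.sum(), N[:, j].sum()
order = np.argsort(axis1)                     # DUs sorted along the gradient
scores = [specificity(N[i, j], N[i].sum(), K, T) for i in order]
scores = np.clip(scores, -20, 20)             # cap bars for readability
plt.bar(range(len(order)), scores)
plt.xlabel("DUs ordered by CA axis 1 (less oral to more oral)")
plt.ylabel(f"specificity of {pos}")
plt.show()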
Figure 4. Gradient-ordered specificity barplot for adjectives, as an example of a POS which is not correlated to the first axis. For readability reasons, the height of the specificity bars is capped at 20.

3.2. Back-to-text close reading: representative words for each side of the first axis
The second methodological innovation concerns obtaining lexical information about orality characteristics in our texts. We select two sets of DUs based on their squared cosine scores on the first CA axis, so as to represent the more-oral (cos² on axis 1 > 0.4) and less-oral (cos² on axis 1 > 0.35) poles (Table 2). The cos² thresholds are adjusted to get two balanced sets with enough different DUs to ensure adequate representativeness. Then a specificity computation, which statistically characterizes the distribution of words across these two sets, reveals lexical features of the more-oral and less-oral poles, showing typical words as they can be read in the texts. Light is thus shed on the quantitative result through qualitative observations.

Table 2. Representative DUs.
Less-oral pole: z_journalJ2, z_plaidsP1, z_commentaireD1, z_diversP1, z_registreP2, z_lettreH1, z_dialogueD2, z_rvoyageL1.
More-oral pole: q_romanLn, sp_dramatiqueR1, q_rbrefLn, q_bestiaireD2, sp_dramatiqueLn, q_lyriqueLn, z_lyriqueLn, q_chroniqueHn, sp_lyriqueLn, q_hagiographieRn, q_romanDn, q_mémoiresHn.

Table 3a. Adjectives typical of the less-oral subcorpus. Table 3b. Adjectives typical of the more-oral subcorpus.

Our example sheds light on the uses of adjectives: whereas adjectives are not related to the orality gradient as a category (Figure 4), they show strong associations at the lexical level (Table 3). Represented speech makes heavy use of terms of address introducing speech turns (bel, douz – and their formal variants: biaus, biax, etc.) and of evaluative adjectives (grant, mal, boen). For the less-oral pole, there are more POS-tagging errors; adjectives are more diverse and often associated with a subset of DUs: for instance, present, saint, maistre are typical of two texts.
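The word-level characterization of the two poles can be sketched with the same hypergeometric specificity function defined in the barplot sketch above. The token list (pairs of DU label and word) is an assumed data structure, and only the less-oral pole is shown; the more-oral pole is symmetric.

from collections import Counter

# Hypothetical input: tokens is an iterable of (du_label, word) pairs
# for the whole corpus.
less_oral = {"z_journalJ2", "z_plaidsP1", "z_commentaireD1", "z_diversP1",
             "z_registreP2", "z_lettreH1", "z_dialogueD2", "z_rvoyageL1"}

pole = Counter(w for du, w in tokens if du in less_oral)
whole = Counter(w for du, w in tokens)
n, T = sum(pole.values()), sum(whole.values())
# Score each word of the pole against its frequency in the whole corpus,
# reusing specificity() from the barplot sketch.
typical = sorted(whole, key=lambda w: -specificity(pole[w], n, whole[w], T))
print(typical[:20])   # candidate words for the less-oral pole (cf. Table 3a)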
4. Conclusion
In this contribution, we have shown several ways to take into account the limits of real data, especially textual data: managing POS-tag reliability (§1), a validation process to identify where data is lacking (§2), and refining the morphosyntactically based analysis with lexical information (§3). But our main objective is to establish a methodology for revealing and studying any gradient-like deep structuration of data. A simple seriation (as illustrated in Dupuis & Lebart, 2008) could provide the same results for the first step, as it generates the same ordered view of the data. But CA gives much more information, qualifying the relation of each variable to the gradient with indicators like contributions and squared cosines. Interpretation can go further: CA coordinates are controlled with bootstrap and confidence ellipses, gradient-ordered barplot visualizations are efficient for analysing in detail the relationship of any individual variable to the overall gradient, and the gradient poles can be illustrated by words, which add a concrete and textual account of the deep structure. Thus, on our corpus of French medieval texts, we discover that orality is the main contrastive dimension and that it characterizes represented speech as well as text genres. The methodology could be applied to other data, and it is already entirely implemented using tools freely available to the scientific community.

This research has benefited from the PaLaFra ANR-DFG project (ANR-14-FRAL-0006), for corpus extension and POS evaluation. We are also very grateful to Ludovic Lebart for his inspiring comments on a preliminary presentation of this research, and for the DtmVic software, which has evolved to take into account the quantitative particularities of our data.

References
Benzécri J.-P. et al. (1981). Pratique de l'Analyse des données, tome 3. Linguistique & lexicologie. Dunod, Bordas, Paris.
Biber D. (1988). Variation across speech and writing. Cambridge University Press.
Dupuis F. and Lebart L. (2008). Visualisation, validation et sériation. Application à un corpus de textes médiévaux. In Heiden S. and Pincemin B., eds, Actes JADT 2008, Presses universitaires de Lyon: 433-444.
Guillot C., Heiden S., Lavrentiev A. and Pincemin B. (2015). L'oral représenté dans un corpus de français médiéval (9e-15e) : approche contrastive et outillée de la variation diasystémique. In Kragh K. J. and Lindschouw J., eds, Les variations diasystémiques et leurs interdépendances dans les langues romanes – Actes du Colloque DIA II, Éditions de linguistique et de philologie, Strasbourg: 15-28.
Guillot-Barbance C., Pincemin B. and Lavrentiev A. (2017). Représentation de l'oral en français médiéval et genres textuels. Langages, 208: 53-68.
Heiden S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In Otoguro R. et al., eds, PACLIC 24, Waseda University, Sendai: 389-398.
Heiden S., Magué J.-Ph. and Pincemin B. (2010). TXM : Une plateforme logicielle open-source pour la textométrie – conception et développement. In Bolasco S. et al., eds, Statistical Analysis of Textual Data – Proceedings of JADT 2010, Edizioni Universitarie di Lettere Economia Diritto, Rome: 1021-1031.
Lebart L. and Piron M. (2016). Pratique de l'Analyse de Données Numériques et Textuelles avec Dtm-Vic. L2C, http://www.dtmvic.com.
Lebart L., Salem A. and Berry L. (1998). Exploring Textual Data. Kluwer Academic Publishers, Boston.

Exploring Disagreements in the Discussion Threads of the French-language Wikipedia
Céline Poudat
Université Côte d'Azur, CNRS, BCL, France – poudat@unice.fr

Abstract
This article concentrates on the exploration of French Wikipedia talk pages, with a focus on conflicts. We developed a typology of speech acts expressing disagreement, including direct and explicit forms (je ne suis pas d'accord / je suis en désaccord) as well as indirect acts, which are moreover the most widespread: disagreement is a negative reaction that may threaten the face of the addressee, and for this reason it tends to be expressed indirectly in order to protect faces in interaction. A subset of the Wikiconflits corpus (Poudat et al., 2016) was annotated according to the typology, and we carried out a first exploration of the data using statistical methods.
Keywords: Wikipedia, CMC corpora, conflicts, disagreements, pragmatics, semantic annotation, text statistics.

1. Introduction
This study focuses on the exploration of one of the greatest successes of Web 2.0: the Wikipedia encyclopedia, which brings together thousands of contributors across the world, yet paradoxically remains little examined by linguistic studies, certainly because of the complexity of the object, which multiplies versions, page types and text genres. We are specifically interested in the threads of the talk pages of the French-language Wikipedia, with a particular focus on conflicts. Mostly addressed by the social sciences (cf. Kittur and Kraut, 2008, 2010; Auray et al., 2009; Sumi et al., 2011; Borra et al., 2014), conflicts in Wikipedia have rarely been described from a linguistic point of view. We propose to describe them by means of a speech-act annotation, distinguishing between markers of (dis)agreement and markers of conflict: while not every disagreement turns into a conflict, a conflict often arises from a disagreement. Two enterprises of annotating conflictual interactions in Wikipedia have been carried out in recent years (Bender et al., 2011; Ferschke et al., 2012), but they did not concern French and were positioned within a different framework. The present paper concentrates specifically on the exploration of disagreement markers in Wikipedia, from its most direct and explicit expression (je ne suis pas d'accord / je suis en désaccord) to its most indirect – and in fact most usual – forms; disagreement is indeed rather expressed indirectly, so as to preserve one's own face and that of the other. After presenting the working corpus (2.), we describe the exploratory typology we developed and the markers we annotated manually (3.). We finally present some of the regularities observed (4.).

2. Wikiconflits: conflictual pages and threads
The working corpus on which our study is based comprises a subset of the Wikiconflits corpus (Poudat et al., 2016), namely all the discussions around six articles identified by Wikipedia as conflictual: Igor et Grichka Bogdanoff, Chiropratique, Éolienne, Histoire de la logique, Psychanalyse and Quotient intellectuel. The conflictuality of each thread was assessed and annotated with a three-value variable: non-conflictual threads are categorized C0, C1 signals the presence of a disagreement, and C2 the presence of a conflict in the thread.
Table 1. Working corpus.

Page                     Tokens    Messages   C0 threads   C1 threads   C2 threads
Bogdanoff                 73864         493           30           16           20
Chiropratique             29919         226            5            3           12
Éolienne                  13454         152            2            7            0
Histoire de la logique     3358          46            4            2            0
Psychanalyse             102338         878           54           39           34
Quotient intellectuel     20059         170           10           20           12

Disagreements and conflicts are two forms of verbal confrontation, with the difference that disagreement is a reactive act expressing a negative reaction to a previously expressed assertion (Kerbrat-Orecchioni, 2016), whereas conflict is an aggressive act, implying the presence of at least one attack-reply sequence characterized by the use of markers of verbal violence and of speech acts that are aggressive towards the addressee's face (Poudat and Ho-Dac, 2018). These definitions must be refined with respect to the very particular genre embodied by the Wikipedia discussion, whose main function is to allow the editors of an article to coordinate and to clarify their possible differences. The encyclopedic article is thus the primary field of cooperation between contributors, the discussion acting rather as the backstage of the writing process; many regular Wikipedia users are in fact unaware that these discussions exist. In other words, the article is the primary genre, the discussion being a linked, non-autonomous genre. The disagreements and conflicts observed there lean on the article, which led us to observe, for instance, that a disagreement could bear on a passage of the article, considered in that case as a contestable assertion. In the same way, a conflict can originate during the writing of the article, via a contentious deletion or revert, which may give rise to an aggressive reply on the talk page. Note that we leave aside non-verbal conflicts and edit wars, which have been widely studied by the social sciences. Threads categorized C1 bear the verbal trace of a disagreement, while threads labelled C2 contain at least one overt attack on the face of one of the thread's contributors. This annotation is of course not self-evident, and often required us to arbitrate between the content of a message and its position in the interaction thread. A message can express disagreement or be aggressive without receiving an answer, while a contributor can disagree with an existing point of view which is not that of one of his or her co-enunciators. We retained only the disagreements or attacks directed at the co-enunciator(s)/co-editor(s): a passage that is very aggressive towards a third-party author or article, for example, was not considered conflictual.

3. Disagreement as a speech act: types and markers
We then concentrated on the manual annotation of the speech acts expressing disagreement, developing a typology suited to the characteristics of the working corpus. Since disagreement is an act expressing a negative reaction, it is potentially threatening for the face of the addressee; this is why it is generally expressed indirectly. The figures in our corpus are eloquent: 82% of the acts expressing disagreement are indirect, while nearly half of the directly expressed disagreements are softened or minimized.
The two most recurrent types of indirect expression of disagreement we observed consist in (i) resorting to concession in order to stage a partial agreement, and (ii) expressing one's disagreement by explicitly positioning oneself as the evaluative source (personnellement, je ne pense pas que… ; j'avoue ne pas comprendre, etc.). As indicated in Table 2, we chose to annotate concessions accompanied by an explicit agreement, such as « Ok, mais des solutions existent (développement de pales furtives absorbant les ondes radars) » (Éolienne discussion), which perhaps explains why we ultimately obtain only a small number of them (9 occ.). The indirect expression of disagreement seems to significantly favour the secondary acts of incomprehension (48 occ.) and of the expression of an opinion (29 occ.). For comparison, we systematically annotated the explicit manifestations of agreement we encountered. Unlike disagreement, agreement is, in our culture, a positive act for the face of the addressee. Rarely used indirectly, it tends to be intensified rather than attenuated (je suis tout à fait d'accord). We find 57 acts of explicit agreement in the corpus; by comparison, there are three times as many forms expressing disagreement, which is probably due to the conflictual dimension of the corpus. Finally, it must be stressed that more than two thirds of the 270 discussion threads considered contained none of the observed forms, which is not surprising: a quarter of the threads contain only one message, while we kept the threads categorized as harmonious for the sake of contrast.

Table 2. Typology of disagreement.

Attribute: polarity. Values: agreement (je suis d'accord) / disagreement (Je suis contre l'avis de X).
Attribute: type. Values: explicit / implicit (see the indirect acts below).
  Explicit agreement: je suis d'accord, je suis pour X, favorable à X, tout à fait de votre avis, je suis de ton avis, OK pour X…
  Explicit disagreement: pas d'accord, en désaccord, je ne suis pas favorable, je suis contre, totalement contre.
Attribute: attenuation. Values: yes / no.
  Attenuation of an explicit agreement: je suis assez d'accord.
  Attenuation of an explicit disagreement: Nous sommes en désaccord (mineur) sur un point (mineur).
Attribute: indirect act. Values:
  Concession (only explicit-agreement acts accompanied by a concession were retained): D'accord pour refuser le paragraphe ajouté à partir d'arkiv ; en revanche la suppression de la participation d'AR à la mission ne me semblait pas déraisonnable (Bogdanoff discussion).
  Opinion (positioning oneself as the evaluative source): « Personnellement, je pense que non », je ne crois pas, je ne pense pas… Keywords: personnellement, pense, crois, trouve.
  Emotion (rare in the corpus as an expression of disagreement): j'ai été personnellement choqué par les affirmations gratuites comme "de gauche/de droite" dès le début de l'article, que je pense tout à fait intempestives et parfaitement corrélées à la hauteur du QI du contributeur et aux théories raciales de Rushton (QI discussion).
  Doubt: Je doute de la pertinence de ce passage dans cet article. Keywords: certain, sûr, doute.
  Incomprehension: Encore une fois, je ne comprends pas le problème. Ce n'est pas du tout une question de vocabulaire secondaire (Bogdanoff discussion).
  Strong negative assertion: Je ne vois pas bien quel rapport ta source a avec ce constat. (Psychanalyse discussion)
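The explicit markers and keyword cues listed in Table 2 lend themselves to a first automatic pre-annotation pass, which a human annotator would then validate. A minimal sketch follows; the cue inventories are abridged from Table 2, the message structure is hypothetical, and this is not the annotation procedure actually used in the study, which was manual.

import re

# Abridged cue inventories taken from Table 2; a real pre-annotation
# pass would need the full lists and manual validation.
PATTERNS = {
    ("disagreement", "explicit"): r"pas d'accord|en d[ée]saccord|je suis contre",
    ("agreement", "explicit"):    r"je suis (tout à fait )?d'accord|je suis pour|OK pour",
    ("disagreement", "opinion"):  r"personnellement|je ne (pense|crois|trouve) pas",
    ("disagreement", "doubt"):    r"je doute|pas (certain|s[ûu]r)",
    ("disagreement", "incomprehension"): r"je ne comprends pas|je ne vois pas",
}

def preannotate(message):
    # Return the candidate (polarity, type) labels found in a message.
    return [label for label, pat in PATTERNS.items()
            if re.search(pat, message, re.IGNORECASE)]

print(preannotate("Encore une fois, je ne comprends pas le problème."))
# [('disagreement', 'incomprehension')]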
4. Analyses
The annotated corpus was then submitted to various textual data analysis methods, in order to explore its characteristics and to bring out the relations between the types of disagreement and the situation of the thread: harmonious, dissonant or conflictual. As Figure 1 shows, the threads identified as the site of a disagreement (C1) are those containing the most significant number of agreement and disagreement markers. Conversely, the threads identified as conflictual contain significantly fewer marks of explicit agreement and fewer marks of disagreement. We are thus reassured as to the coherence of our annotation.

Figure 1. Distribution of agreement and disagreement types across thread types (Hyperbase Web data).

In order to assess more precisely the structure of the set of annotations applied to the texts, we performed a Principal Component Analysis (PCA) on the table of annotation counts, taking the discussion thread as the textual unit. We had to make certain adjustments: (i) discarding the threads containing no annotation at all; (ii) isolating some overly marginal variables (e.g. the 2 occurrences of the emotion value); and (iii) distinguishing, among the remaining observations, those to be used as active versus supplementary variables. Thus, the variables bearing the attenuation attribute were included for illustrative purposes only. In total, the PCA was performed on a small data set: 98 threads × 8 active variables (and 13 supplementary variables). Interestingly, the PCA brings out a size factor: all the observations are positively correlated with one another and therefore cluster on the same side of the first factorial axis. Some discussion threads have high values on all variables, while others have low values on all of them. If we turn to factors 2 and 3 (Figure 2), onto which the degree of conflictuality and the pages of the corpus are projected for illustrative purposes, we observe an opposition between agreement and disagreement and, to a lesser extent, between explicit and implicit on factor 2. Agreements and explicit acts would lie on the side of harmony and of disagreement, while disagreements in general, and indirect disagreements in particular, would be more characteristic of conflict. This last remark, which will have to be tested and confirmed on larger data sets, seems interesting to us: do implicit disagreement markers go hand in hand with conflict markers? Is there a negative correlation between the explicit expression of disagreement and personal attacks?

Figure 2. Factors 2 and 3 of the PCA – 98 threads × 8 active variables – DtmVic.
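The active/supplementary distinction described above can be sketched as follows: active variables shape the axes, supplementary ones are only correlated with them afterwards. This is an illustrative numpy reconstruction (hypothetical input file and column layout), not the DtmVic computation itself.

import numpy as np

# Hypothetical input: annotation counts per thread; the first 8 columns
# are the active variables, the remaining 13 the supplementary ones.
X = np.loadtxt("thread_annotation_counts.tsv")
X = X[X[:, :8].sum(axis=1) > 0]          # discard threads with no annotation
active, supp = X[:, :8], X[:, 8:]

# Standardized PCA of the active variables.
Z = (active - active.mean(0)) / active.std(0)
U, sv, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * sv                           # thread coordinates on the factors

# A size factor shows up as axis-1 loadings all sharing the same sign.
print("axis-1 loadings:", np.round(Vt[0], 2))

# Supplementary variables: correlate them with the factors instead of
# letting them shape the axes.
for k in (1, 2):                          # factors 2 and 3 (0-indexed)
    r = [np.corrcoef(supp[:, j], scores[:, k])[0, 1]
         for j in range(supp.shape[1])]
    print(f"factor {k + 1} correlations:", np.round(r, 2))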
5. Conclusion and perspectives
We have thus proposed a first typology of the acts expressing disagreement in French; this typology was developed within a more general project of exploring conflicts in Wikipedia. A second typology, centred on markers of verbal violence and presumed characteristic of conflict, is under development and will form a system with the disagreement typology, in order to bring out the characteristics of conflictual interactions in Wikipedia and in CMC. As regards the annotation presented here, a guide is currently being written; each marker will be validated and evaluated by means of a Cohen's kappa. The typology is still being improved; a third form of indirect expression of disagreement we observed consists in neutralizing it by shifting the focus onto a proposal or a suggestion, i.e. a positive speech act (ne vaudrait-il pas mieux… ? Il faudrait peut-être d'abord définir ce qu'on entend par…). This type of sequence, more complex to identify because more ambiguous, is currently being integrated. Finally, suitable interpretive paths remain to be devised for exploring this type of annotated data with our textual-statistics methods; this is also one of the avenues we have been pursuing in recent years, in our work (Poudat and Landragin, 2017) and within the CORLI consortium.

References
Auray N., Hurault-Plantet M., Poudat C. and Jacquemin B. (2009). La négociation des points de vue : une cartographie sociale des conflits et des querelles dans le Wikipédia francophone. Réseaux, 154(2): 15-50.
Bender E.M., Morgan J.T., Oxley M., Zachry M., Hutchinson B., Marin A. and Ostendorf M. (2011). Annotating Social Acts: Authority Claims and Alignment Moves in Wikipedia Talk Pages. In Proceedings of the Workshop on Languages in Social Media, pp. 48-57. Association for Computational Linguistics, Stroudsburg, PA, USA.
Borra E., Weltevrede E., Ciuccarelli P., Kaltenbrunner A., Laniado D., Magni G. and Venturini T. (2014). Contropedia – the Analysis and Visualization of Controversies in Wikipedia Articles. In Proceedings of The International Symposium on Open Collaboration, pp. 34:1. New York, NY, USA.
Ferschke O., Gurevych I. and Chebotar Y. (2012). Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 777-786. Association for Computational Linguistics, Stroudsburg, PA, USA.
Kerbrat-Orecchioni C. (2016). Le désaccord, réaction « non préférée » ? Le cas des débats présidentiels. Cahiers de praxématique, 67.
Poudat C. and Ho-Dac L.-M. (2018). Désaccords et conflits dans le Wikipédia francophone. In Travaux linguistiques du Cerlico, Presses Universitaires de Rennes (in press).
Poudat C. and Landragin F. (2017). Explorer un corpus textuel. Méthodes – Pratiques – Outils. Collection Champs linguistiques, De Boeck, Louvain-la-Neuve.
Poudat C., Grabar N., Paloque-Berges C., Chanier T. and Kun J. (2017). Wikiconflits : un corpus de discussions éditoriales conflictuelles du Wikipédia francophone. In Wigham C.R. and Ledegen G., eds, Corpus de communication médiée par les réseaux : construction, structuration, analyse, pp. 19-36. Collection Humanités numériques, L'Harmattan, Paris.
Sumi R., Yasseri T., Rung A., Kornai A. and Kertész J. (2011). Edit wars in Wikipedia. In Proceedings of the ACM WebSci'11, Koblenz, Germany, pp. 1-3.
Textometric Exploitation of Coreference-annotated Corpora with TXM: Methodological Choices and First Outcomes
Matthieu Quignard1, Serge Heiden2, Frédéric Landragin3, Matthieu Decorde2
1 ICAR, CNRS, University of Lyon – matthieu.quignard@ens-lyon.fr
2 IHRIM, ENS Lyon, CNRS, University of Lyon – {slh,matthieu.decorde}@ens-lyon.fr
3 Lattice, CNRS, ENS Paris, University Sorbonne Nouvelle, PSL Research University, USPC – frederic.landragin@ens.fr

Abstract
In this article we present a set of measures – some of which can lead to specific visualisations – with the objective of enriching the possibilities of exploration and exploitation of annotated data, in particular coreference chains. We first present a specific use of the well-known concordancer, here adapted to display the elements of a coreference chain. We then present a histogram generator that allows, for example, the display of the distribution of the various coreference chains of a text, given a value from the annotated properties. Finally, we present what we call progression charts, whose purpose is to display the progress of each chain throughout the text. We conclude on the interest of these (interactive) modes of visualization for making the annotation phase more controlled and more effective.

Keywords: coreference chain, corpus annotation, annotation tool, visualisation tool, exploration tool, statistical analysis of textual data.

1. Introduction
The manual annotation of a textual corpus with referring expressions (Charolles, 2002) and coreference chains (Schnedecker, 1997; Landragin & Schnedecker, 2014) requires suitable tools. A coreference chain can cover the whole text; it is therefore a linguistic object for which the existing means of visualization and exploration are few and often perfectible. The MMAX2 tool (Müller & Strube, 2006) allows for visualizing the links between referring expressions using arrows which connect markables. The GLOZZ tool (Mathet & Widlöcher, 2009) offers several means of visualization: with arrows, like MMAX2, or with specific marking in the margin or the middle of the text. The ANALEC tool (Landragin et al., 2012) and its specific extension for coreference chains (Landragin, 2016) propose a graphic metaphor based on a succession of coloured dots. This allows the analyst to configure visual parameters, for instance the colour, which can be linked to any of the annotated properties. This type of visualization makes it possible to see at a glance the structural differences between the different reference chains of a text.
2. Linguistic objects and methodology
In continuity with previous works (Heiden, 2010; Landragin, 2016), we present here a set of measures – some of which lead to specific visualisations – whose objective is to enrich the possibilities of exploring and exploiting annotated data. We focus in particular on annotations of discursive phenomena like coreference, i.e. annotations that are necessarily described on two levels:
1. the markable, a group of contiguous words to which labels are assigned, for instance using a feature structure;
2. the set of markables, or the links between markables, as is the case for any chain of annotations: anaphoric chains, chains of textual organizers, chains of textual structure elements, etc. A feature structure can also be assigned at level 2, i.e. to the set or to the links.

3. A concordancer adapted to annotation chains
As a first visualization mode, we reuse the very classic concordancer to display the elements which constitute a coreference chain. The use of such a visualization tool, well established in the corpus exploration community (Poudat & Landragin, 2017), seemed natural for visualizing chains of annotations. The latest version of TXM (Heiden, 2010) thus includes a concordancer which makes it possible to display in a column all the elements (e.g. referring expressions) of a chain (e.g. a coreference chain), with left and right contexts for each element. Compared to the visualisation choices of MMAX2 (Müller & Strube, 2006) and GLOZZ (Mathet & Widlöcher, 2009), i.e. arrows linking markables displayed directly on the text, this concordancer has the advantage of regrouping all the relevant information in a small graphic space.

Fig. 1: Concordancer with the elements of a coreference chain, dedicated to a character named "Caillette".

Fig. 1 shows the list of all referring expressions to the character 'Caillette'. Sorted in textual order, the concordancer shows the alternation between proper nouns, pronouns, possessives, etc. The concordancer may also be sorted along a given property of the markable, e.g. its POS label. This representation may then be exploited to check whether the POS annotation is consistent or not.
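As an illustration of the principle (not TXM's actual Java implementation), a minimal sketch, assuming the text is tokenized and each mention of a chain is stored as a (start, end) token span, might look as follows:

```python
# Minimal chain-oriented concordancer sketch: one KWIC line per mention,
# with left and right context, sorted in textual order.
def chain_concordance(tokens, chain_spans, context=5):
    """tokens: list of word strings; chain_spans: list of (start, end) token indices."""
    lines = []
    for start, end in sorted(chain_spans):          # textual order
        left = " ".join(tokens[max(0, start - context):start])
        pivot = " ".join(tokens[start:end])
        right = " ".join(tokens[end:end + context])
        lines.append(f"{left:>40} | {pivot:^25} | {right}")
    return lines

tokens = "la jeune Caillette sourit puis elle quitta la salle sans un mot".split()
caillette_chain = [(1, 3), (5, 6)]                  # "jeune Caillette", "elle"
print("\n".join(chain_concordance(tokens, caillette_chain)))
```

Sorting the spans on an annotated property (e.g. POS of the pivot) instead of position reproduces the property-based sorting described above.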
4. Histograms for visualising distributions of annotation chains
A second, also very traditional, mode of visualization is the histogram (bar plot). The user can select one or several properties – the determination of the referring expressions, for instance, or the type of referent – and launch calculations on their occurrences: cross-counts, correlation computations and so on. TXM now includes a histogram generator, which makes it possible, for example, to display the distribution of coreference chains throughout the text, as well as the distribution of chains according to the number of referring expressions they include. These calculations and their associated visualizations provide TXM with integrated functionalities which, in other state-of-the-art tools, required the development of scripts to export the relevant data and exploit them in an external tool like a spreadsheet. Figure 2 compares the distribution of grammatical categories of referring expressions in three texts. Although all three texts are encyclopedic, the Discours of Bossuet shows a particular profile, with a high number of proper nouns (GN.NAM).

Fig. 2: Comparative barplots of grammatical categories used by reference units in three texts: Bossuet, "Discours sur l'histoire universelle" (1681), Diderot, "Essais sur la peinture" (1759-1766), Montesquieu, "Esprit des lois" (1755).

5. Progression charts for annotation chains
A third (new) mode of visualization consists in graphically showing the progress of each chain throughout the text. The principle is simple, but the possibilities of exploring and exploiting the generated graph are numerous. In a two-dimensional chart whose abscissa represents the linearity of the text, chains are displayed point by point (cf. Fig. 3): each occurrence of a referring expression raises the ordinate of the corresponding point by one notch. The resulting broken lines are all ascending, but can vary considerably between areas of progression and flat areas. When they are visualized simultaneously, it is possible to detect the parts of the text where several referents compete, or on the contrary those where several referents appear alternately. Zooming (in and out) as well as focussing features allow the characteristics of each point to be inspected, thus enriching the exploration possibilities of these progression charts and of the underlying coreference chains.

Fig. 3: Progression graph of the main coreference chains at the beginning of "Essais sur la peinture" by Denis Diderot. The dots highlighted with symbols correspond to referring expressions with low accessibility.
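The construction of such a chart is easy to reproduce. A minimal sketch, assuming each chain is given as a list of token offsets of its mentions (illustrative only, not TXM code):

```python
# "Progression chart" sketch: for each coreference chain, the y-value
# increases by one at every new mention, plotted against the mention's
# position in the text.
import matplotlib.pyplot as plt

def progression(chain_positions):
    """chain_positions: token offsets of the chain's mentions."""
    xs = sorted(chain_positions)
    ys = list(range(1, len(xs) + 1))   # each mention raises the line one notch
    return xs, ys

# hypothetical chains with mention offsets
chains = {"Caillette": [3, 40, 41, 90, 200, 210], "le roi": [10, 150, 320]}
for name, positions in chains.items():
    xs, ys = progression(positions)
    plt.step(xs, ys, where="post", label=name)
plt.xlabel("position in text (tokens)")
plt.ylabel("cumulative mentions")
plt.legend()
plt.show()
```

Steep segments correspond to zones where a referent is heavily mentioned; long flats correspond to zones where it disappears from the discourse.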
6. Discussion
What these new visualization modes have in common is not only that they propose visual representations which are easy to understand (and possibly interactive, when one of the properties can be modified on the fly) and that these representations can be viewed directly in TXM, with no need to export annotated data to external tools, but also that they facilitate the analyst's detection of intruders, outliers and deviant examples – for instance potential annotation errors. It can be a referring expression which does not belong in the currently visualised chain; it may be a peak or a suspect flat zone in one of the generated histograms; it may be a zone with a very steep slope (or a very long flat) in a progression chart. In all three cases, the analyst can directly access the suspicious annotation in order to verify it and, of course, to modify it. The integration of the measurements and their visualizations in TXM allows this immediate return to the corpus annotation phase, which is particularly effective when the corpus is being annotated manually.

7. Conclusion and future works
One can say that it is by annotating that we see the mistakes we make, but we still need appropriate tools to detect these errors. With the new possibilities of interaction proposed here, we hope to be taking a significant step in this direction. The first tests we have carried out demonstrate the relevance of our approach.

References
Charolles M. (2002). La référence et les expressions référentielles en français. Ophrys, Paris, France.
Heiden S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, Sendai, Japan, Institute for Digital Enhancement of Cognitive Development, Waseda University, pp. 389-398, available at halshs.archives-ouvertes.fr/halshs-00549764.
Landragin F. (2016). Conception d'un outil de visualisation et d'exploration de chaînes de coréférences. Proceedings of the 13th International Conference Journées d'Analyse statistique des Données Textuelles (JADT 2016), Nice, France, pp. 109-120.
Landragin F., Poibeau T. and Victorri B. (2012). ANALEC: a New Tool for the Dynamic Annotation of Textual Data. Proceedings of LREC 2012, Istanbul, Turkey, pp. 357-362.
Landragin F. and Schnedecker C., editors (2014). Les chaînes de référence. Volume 195 of the Langages journal, Armand Colin, Paris, France.
Müller C. and Strube M. (2006). Multi-level annotation of linguistic data with MMAX2. In Braun S., Kohn K. and Mukherjee J., editors, Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, Peter Lang, Frankfurt, Germany.
Poudat C. and Landragin F. (2017). Explorer un corpus textuel : méthodes, pratiques, outils. Champs Linguistiques. De Boeck Supérieur, Louvain-la-Neuve.
Schnedecker C. (1997). Nom propre et chaîne de référence. Klincksieck, Paris, France.
Widlöcher A. and Mathet Y. (2012). The Glozz platform: a corpus annotation and mining tool. In Concolato C. and Schmitz P., editors, Proceedings of the ACM Symposium on Document Engineering (DocEng'12), Paris, France, pp. 171-180.

Improving the Precision and Speed of the Clustering Algorithm of the Reinert Method in IRaMuTeQ

Pierre Ratinaud
LERASS, Université de Toulouse – ratinaud@univ-tlse2.fr

Abstract
This work presents a proposal to improve the accuracy and execution speed of the divisive hierarchical clustering (DHC) algorithm used by the Reinert method implemented in the free software IRaMuTeQ. The DHC of the Reinert method is a series of bi-partitions of a presence/absence matrix that crosses text segments and words. In the original version of this algorithm, after each partition, the largest of the remaining classes is selected to be split. We propose to replace the selection criterion for the classes to be partitioned with a criterion of homogeneity. The complete rewriting of this part of the IRaMuTeQ code has also been an opportunity to improve its speed, by implementing part of the code in C++ and parallelizing the procedure. An experiment carried out on 6 corpora shows that the new algorithm based on these principles is indeed more precise and faster.

Keywords: Reinert method, divisive hierarchical clustering, IRaMuTeQ, precision
1. Introduction
The Reinert method aims to bring out the different themes running through a textual corpus. Its greatest originality is surely the divisive hierarchical clustering (DHC) algorithm proposed by Reinert (1983). After recalling the various stages of this type of analysis, we propose a modification of this clustering algorithm aimed at improving the precision of the whole procedure. The proposed change concerns the criterion used to select sub-matrices after each partition. The description of this new procedure is completed by an experiment on 6 corpora in French and English, comparing the new version of the algorithm with the old one. The results we present do show an increase in the precision of the algorithm, whose rewriting also yielded an increase in execution speed. Before starting this presentation, it seems necessary to recall that the DHC is not the only particularity of the Reinert method.

2. From corpora to matrices
Another originality of this procedure is the unit used in the clustering. In most situations, clustering does not operate on whole texts but at a finer granularity: the units clustered are text segments. In the IRaMuTeQ software (Ratinaud, 2014; Ratinaud & Marchand, 2012), the size of these segments is set by default to 40 occurrences and their splitting takes punctuation into account: the splitting rule thus tries to produce units of homogeneous size (around 40 occurrences) while respecting the "natural" segmentation of texts marked by punctuation. A second originality worth noting is the distinction between content words and function words: most of the time, only content words (verbs, adverbs, adjectives and nouns) are considered in these analyses. Corpora can then be represented as matrices crossing text segments and content words, whose cells code the presence (1) or absence (0) of each form in each segment. Table 1 shows such a matrix for a corpus of 10 text segments (noted i1 to i10) and 9 forms (noted j1 to j9).

Table 1: Example of a matrix crossing text segments (rows) and forms (columns)

       j1  j2  j3  j4  j5  j6  j7  j8  j9
i1      1   1   1   1   0   0   0   0   0
i2      0   0   0   0   1   1   1   1   1
i3      0   0   1   0   1   0   1   0   0
i4      1   0   1   0   1   0   0   0   1
i5      0   0   1   0   1   0   1   0   0
i6      1   1   1   1   0   0   0   0   1
i7      0   0   0   0   1   1   1   1   0
i8      1   0   1   0   1   0   0   0   0
i9      0   0   1   0   1   0   1   0   1
i10     0   0   1   0   1   0   1   0   0

The matrix in Table 1 is a much simplified example of what happens in practice: matrices generated from textual corpora are much larger and much sparser (the proportion of 1s is very low). We write N for the total number of 1s in the matrix. The goal of the clustering is to reorganize this matrix into sub-groups of segments that maximize the following properties:
1. the grouped segments must be homogeneous among themselves: the method must bring together text segments that resemble each other, i.e. segments that tend to contain the same words;
2. the resulting sets must be heterogeneous from one another: the groups of segments must be as different from each other as possible.
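A minimal sketch of this segmentation and matrix construction, assuming a tokenized text and a fixed vocabulary of content words (illustrative only, not IRaMuTeQ's actual code; the punctuation set is an assumption):

```python
# Split a token stream into segments of roughly `target` occurrences,
# preferring punctuation breaks, then build the segments x forms
# presence/absence matrix described above.
import numpy as np

def segment(tokens, target=40):
    segments, current = [], []
    for tok in tokens:
        current.append(tok)
        # close the segment once it is long enough and we hit punctuation
        if len(current) >= target and tok in ".;!?":
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

def presence_matrix(segments, vocabulary):
    X = np.zeros((len(segments), len(vocabulary)), dtype=int)
    index = {form: j for j, form in enumerate(vocabulary)}
    for i, seg in enumerate(segments):
        for tok in seg:
            if tok in index:
                X[i, index[tok]] = 1   # presence/absence, not counts
    return X
```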
Illustration 1 proposes a split of the matrix of Table 1 into 4 classes respecting these criteria.

Illustration 1: Split of the matrix of Table 1 into 4 classes: [i1,i6], [i4,i8], [i3,i5,i9,i10] and [i2,i7], each class grouping identical or near-identical rows of Table 1.

The "quality" of this solution can be assessed by computing the chi2/N of the reduced table (Reinert, 1983). In this example, the optimal solution would be obtained by separating rows i6, i4, i2 and i9 from their classes and letting each form its own class: the resulting 8-class solution would then summarize all the information contained in the matrix of Table 1.

Table 2: Reduced table of the clustering of Illustration 1

                   j1  j2  j3  j4  j5  j6  j7  j8  j9
Σ [i1,i6]           2   2   2   2   0   0   0   0   1
Σ [i4,i8]           2   0   2   0   2   0   0   0   1
Σ [i3,i5,i9,i10]    0   0   4   0   4   0   4   0   1
Σ [i2,i7]           0   0   0   0   2   2   2   2   1

3. The DHC of the Reinert method
Recall that the method for building these classes automatically relies on a series of bi-partitions, each based on a correspondence analysis (CA). The first split is obtained by searching, along the first factor of this CA, for the two sub-matrices that maximize the chi2/N of the reduced table. The resulting partition is then improved by moving each row of the table from one class to the other and recomputing the chi2/N of the reduced table; every move that increases the chi2/N value is kept, and this step loops until no move increases the value any further. A last step removes the forms (columns) that are statistically under-represented in the matrices (on the basis of a chi2 test). This procedure (bi-partition of the matrix, row moves, column removal) constitutes one partition of the DHC. The DHC as a whole repeats this procedure as many times as needed to reach the configured number of terminal classes: n-1 partitions are needed to build n terminal classes. After each partition, in its original formulation, the algorithm selects the largest of the classes built so far (the one containing the most rows) to be split in turn.
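A simplified sketch of one such bi-partition, assuming a presence/absence numpy matrix and approximating the first CA factor by an SVD of the standardized residuals (illustrative only; the actual IRaMuTeQ implementation differs and is written in R/C++):

```python
import numpy as np

def chi2_over_N(X, labels):
    """chi2/N of the 2 x m reduced table obtained by summing rows per class."""
    if labels.sum() in (0, len(labels)):
        return -np.inf                          # a class must not be empty
    T = np.vstack([X[labels == 0].sum(0), X[labels == 1].sum(0)]).astype(float)
    T = T[:, T.sum(0) > 0]                      # drop columns absent from both
    N = T.sum()
    E = np.outer(T.sum(1), T.sum(0)) / N        # expected counts
    return ((T - E) ** 2 / E).sum() / N

def bipartition(X):
    # 1) order rows along the first factor (SVD of standardized residuals ~ CA)
    P = X / X.sum()
    r, c = P.sum(1, keepdims=True), P.sum(0, keepdims=True)
    S = (P - r @ c) / np.sqrt(r @ c)
    scores = np.linalg.svd(S, full_matrices=False)[0][:, 0]
    order = np.argsort(scores)
    # 2) best cut point along that ordering
    best, labels = -np.inf, None
    for k in range(1, len(order)):
        lab = np.zeros(len(order), dtype=int)
        lab[order[k:]] = 1
        q = chi2_over_N(X, lab)
        if q > best:
            best, labels = q, lab
    # 3) hill-climbing: keep any single-row move that raises chi2/N
    improved = True
    while improved:
        improved = False
        for i in range(len(labels)):
            labels[i] ^= 1                      # try moving row i
            q = chi2_over_N(X, labels)
            if q > best:
                best, improved = q, True
            else:
                labels[i] ^= 1                  # undo the move
    return labels, best
```

The column-removal step (dropping under-represented forms) is omitted here for brevity.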
Table 3 shows, in a deliberately caricatural way, a matrix for which this strategy does not lead to a satisfactory result.

Table 3: A problematic matrix

      j1  j2  j3  j4  j5
i1     1   1   0   0   0
i2     1   1   0   0   0
i3     1   1   0   0   0
i4     0   0   1   1   0
i5     0   0   0   1   1

If we submitted this matrix to the DHC described above, the first partition would create one class made of rows i1, i2 and i3 (noted [i1,i2,i3]) and another made of rows i4 and i5 (noted [i4,i5]). The first class, being the largest, would be selected to be partitioned in turn. Yet it is obvious that there is no information left to extract from it, since its rows are all identical; only the separation of rows i4 and i5 could, in this example, improve the quality of the result. To achieve that, the most heterogeneous remaining class ([i4,i5]) should have been selected rather than the largest one ([i1,i2,i3]). One must therefore realize that, in the currently available version of the method, the clustering algorithm assumes that the largest matrix is also the most heterogeneous. We believe that some corpora do not satisfy this property, and that at various moments of a clustering the largest remaining matrix may well not be the most heterogeneous one.

4. A new way of chaining the partitions
It then seems relevant to test, after each partition phase, the homogeneity of the remaining matrices so as to select the most heterogeneous one. Since the correspondence analysis computed for each partition yields the chi2 of the whole matrix, we used this property to redesign the algorithm. In this new version, after each partition, the CA and the chi2 of the two resulting matrices are computed beforehand. For each matrix we derive a homogeneity index that takes into account the chi2 of the matrix, its size and the total number of forms; it amounts to multiplying the chi2 of the matrix by the ratio of the 1s it contains:

$$h(M) = \chi^2(M) \times \frac{n_1(M)}{N}$$

where $n_1(M)$ is the number of 1s in the sub-matrix $M$ and $N$ the total number of 1s. This method makes it possible no longer to assume that the largest matrix is the most heterogeneous, but to test this heterogeneity. Its drawback is that it requires systematically computing the CA on practically all the matrices produced; without further modification, the procedure would be much slower than the previous version of the algorithm. In order to speed up these analyses, the theoretical rewriting of the algorithm was accompanied by a search for performance gains, in two directions:
- The most computation-intensive parts were rewritten in C++ via the Rcpp (Eddelbuettel et al., 2017) and RcppEigen (Bates, Eddelbuettel, Francois & Yixuan, 2017) R packages. The parts concerned are the search for the partition maximizing chi2/N after the CA, and the reassignment of rows.
- These two parts being a sequence of chi2 computations on a single matrix, it was possible to parallelize them to take advantage of the multi-core nature of most modern processors. Computations are thus distributed over the cores/threads of the machine via the parallel and doParallel (Calaway, Microsoft Corporation, Weston & Tenenbaum, 2017) R packages.

These changes in fact required the complete rewriting of the algorithm of the Reinert method in IRaMuTeQ.
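A hypothetical sketch of the modified scheduling, reusing the bipartition function from the previous sketch and a priority queue keyed on the heterogeneity index:

```python
# Instead of always splitting the largest remaining class, split the one
# with the highest heterogeneity index h = chi2(M) * n_ones(M) / N_total.
import heapq
import numpy as np

def chi2_of_matrix(X):
    X = X[:, X.sum(axis=0) > 0]
    X = X[X.sum(axis=1) > 0, :]
    n = X.sum()
    E = np.outer(X.sum(axis=1), X.sum(axis=0)) / n
    return ((X - E) ** 2 / E).sum()

def reinert_plus(X, n_classes, bipartition):
    """bipartition(M) -> (labels, quality); see the previous sketch."""
    N_total = X.sum()
    h = lambda M: chi2_of_matrix(M) * M.sum() / N_total
    heap = [(-h(X), 0, X)]          # max-heap via negated index; int tie-breaker
    counter = 1
    while len(heap) < n_classes:
        neg_h, _, M = heapq.heappop(heap)
        if M.shape[0] < 2:          # nothing left to split
            heapq.heappush(heap, (neg_h, counter, M))
            break
        labels, _ = bipartition(M)
        for part in (M[labels == 0], M[labels == 1]):
            counter += 1
            heapq.heappush(heap, (-h(part), counter, part))
    return [M for _, _, M in heap]  # the terminal classes
```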
5. Experiment
To test the benefits of this new procedure in terms of precision and speed, an experiment was run on 6 different corpora. We combined large corpora (the most likely to show disproportions among the themes they contain) with a smaller one. The characteristics of these corpora are summarized in Table 4.

Table 4: Description of the corpora used in the experiment.

The dataconf corpus consists of titles and abstracts of computer-science conferences; it is in English only. 20Newsgroup (http://qwone.com/~jason/20Newsgroups/) is another English corpus, gathering 20 discussion lists on very diverse themes (Lang, 1995). lemondefr is a corpus of articles from the web site of Le Monde (http://www.lemonde.fr); it is in French. Ssm, for "same sex marriage", is a corpus of American and British press articles on same-sex marriage, built by Nathalie Paton. AN2011 corresponds to the 2011 transcripts of the debates of the French National Assembly (Ratinaud & Marchand, 2015). Finally, the corpus noted LRU gathers 100 articles of the French daily press on the "liberté et responsabilité des universités" law. The experiment consists in running both versions of the clustering algorithm on the matrices extracted from these corpora and comparing the quality of the results. The number of terminal classes was set to 100 for the "large" corpora and 30 for the "small" one. In one setting the algorithm uses the size criterion to select the matrices to be partitioned; in the other it uses the homogeneity criterion. The results take the form of charts showing the evolution of the quantity of information extracted after each partition, the value reported being the chi2/N of the reduced table of the classes. In the charts of Illustration 2, red curves give the values obtained with the old version of the algorithm (noted Reinert) and blue curves those obtained with the new version (noted Reinert++); a higher value corresponds to a better partition quality. The bar chart gives the percentage of increase or decrease of partition quality of the new algorithm, taking the old one as reference: green bars signal a quality increase and red bars a decrease. For the new version of the algorithm, 6 cores were allotted to the procedure (the tests were run on a MacBook Pro 11,3 with an Intel i7-4960HQ processor). These results show quite clearly that the new version of the algorithm increases clustering precision in the majority of cases. They also show that this quality gain is tied to the distribution of themes in the corpora: not all corpora benefit from this evolution in the same way. Note also that on the LRU corpus there is practically no difference between the two methods; the 1 to 3% precision loss at various moments of the clustering on this corpus is entirely negligible and should be attributed to rounding differences between the R code and the C++ code. At the opposite end, some corpora, such as 20newsgroup, show precision gains of up to 15%.

Illustration 2: Comparison of the results between the old version (Reinert) and the new version (Reinert++) of the clustering algorithm.

Illustration 3 shows that on the sizeable corpora the performance gain brought by the move to C++ and parallelization lies between a factor of 4 and a factor of 6; in other words, the new algorithm is up to 6 times faster on the machine on which these computations were run.
Illustration 3: Comparison of analysis times between the old version (Reinert) and the new version (Reinert++) of the algorithm, with the performance gain per corpus (AN2011, dataconf, 20newsgroup, lemondefr, ssm).

6. Conclusion
In this work we propose a new formalization of the divisive hierarchical clustering procedure of the Reinert method. Starting from the hypothesis that, in some corpora and at some moments of these clusterings, the most heterogeneous class is not necessarily the largest one, we propose to change the criterion governing the order in which the matrices are partitioned, from a size criterion to a homogeneity criterion. The results of an experiment on 6 corpora show that voluminous corpora do benefit from this change. These results are also an invitation to pursue investigations of this method. The procedure will be implemented in the next version of the IRaMuTeQ software; the use of the homogeneity criterion will be optional, so that users can revert to the old behaviour.

References
Bates D., Eddelbuettel D., Francois R. and Yixuan Q. (2017). RcppEigen: "Rcpp" Integration for the "Eigen" Templated Linear Algebra Library (Version 0.3.3.3.1). Available at https://cran.r-project.org/web/packages/RcppEigen/index.html
Calaway R., Microsoft Corporation, Weston S. and Tenenbaum D. (2017). doParallel: Foreach Parallel Adaptor for the "parallel" Package (Version 1.0.11). Available at https://cran.r-project.org/web/packages/doParallel/index.html
Eddelbuettel D., Francois R., Allaire J. J., Ushey K., Kou Q., Russell N., … Chambers J. (2017). Rcpp: Seamless R and C++ Integration (Version 0.12.14). Available at https://cran.r-project.org/web/packages/Rcpp/index.html
Lang K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 331-339.
Ratinaud P. (2014). IRaMuTeQ : Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires (Version 0.7 alpha 2) [Windows, GNU/Linux, Mac OS X]. Available at http://www.iramuteq.org
Ratinaud P. and Marchand P. (2012). Application de la méthode ALCESTE à de « gros » corpus et stabilité des « mondes lexicaux » : analyse du « CableGate » avec IRaMuTeQ. In Actes des 11e Journées internationales d'Analyse statistique des Données Textuelles (JADT 2012), Liège, Belgique, pp. 835-844. Available at http://lexicometrica.univ-paris3.fr/jadt/jadt2012/Communications/Ratinaud,%20Pierre%20et%20al.%20-%20Application%20de%20la%20methode%20Alceste.pdf
Ratinaud P. and Marchand P. (2015). Des mondes lexicaux aux représentations sociales. Une première approche des thématiques dans les débats à l'Assemblée nationale (1998-2014). Mots. Les langages du politique, 2015(108), 57-77.
Reinert M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, VIII(2), 187-198.
Reinert M. (1990). ALCESTE : Une méthodologie d'analyse des données textuelles et une application : Aurélia de Gérard de Nerval. Bulletin de méthodologie sociologique, (26), 24-54.
The Frequency Parameter between Paradoxes and Antinomies: the Case of Scholastic Italian

Luisa Revelli
Università della Valle d'Aosta – l.revelli@univda.it

Abstract
Emblem of a formal register, the linguistic variety proposed as a model in the Italian school system ever since National Unity is characterized by a lasting artificiality and a strong unwillingness to innovate, even within a frame of progressive slow changes along its historical development. That is why the lexical frequencies recorded for "Scholastic Italian" can appear inherently inconsistent, contrasting with the basic vocabulary, and even contradictory compared with other apparently similar varieties of Italian. Consequently, studying their configuration requires analysis models capable of interpreting quantitative data (volume figures) in the light of the complexity of the paradigmatic relations between competing solutions and of the composite connections between the number and type of meanings exhibited in actual use. Taking as a case study the Scholastic Italian used by teachers during the first 150 years of the national school system, and starting from the data collected in the diachronic corpus of CoDiSV, this contribution aims at verifying the opportunities and critical points of lexicometric analysis applied to such a linguistic variety: addressed to an unsophisticated audience, yet characterized by a specialized point of view; of high aspirations, but influenced by educational needs; constantly evolving and yet always recalcitrant to the solicitations of the contemporary language.

Keywords: Scholastic Italian; lexical frequency; lexicometry; basic vocabulary.
1. Introduction
At once addressed to a naive audience and specialized in perspective, of high aspiration yet conditioned by didactic needs, constantly evolving and nevertheless constantly recalcitrant to the solicitations of the living, contemporary language, the Scholastic Italian (henceforth IS) proposed by teachers during the first one hundred and fifty years of the national school seems to provide a good test bed for bringing out the critical areas that arise when lexicometric parameters are applied to polygenetic and constitutionally inhomogeneous linguistic varieties (for an overview of the stable and evolving characteristics of IS, see De Blasi 1993, Cortelazzo 1995, Benedetti & Serianni 2009, Revelli 2013). In IS, indeed, an ideal of expressive richness pursued through a marked hostility towards any form of redundancy, repetition or generic wording pushes towards an ostentatious, studied variatio; but the simultaneous need to teach young pupils literacy by steering them towards specific members of synonymic series held to be more correct, appropriate or ornamental tends, in fact and in the opposite direction, to reduce the range of available expressive options. The need to convey disciplinary knowledge through language, moreover, requires the use of metalanguages, technical terms and semantic senses that seem to further destabilize the lexical reservoir of IS, moving it significantly away from the basic vocabulary of Italian. To what extent and in what terms this actually happens is what we set out to verify below, integrating the available lexicometric and quantitative data with some qualitative reflections. For IS, the lexical base taken as reference is a frequency lexicon built by the present author (Revelli 2013) from an initial corpus of 830 primary-school notebooks written in the Aosta Valley area between the end of the 19th century and the first years of the 21st. The 2,022 terms composing this basic vocabulary were identified after a balanced selection of the documents, divided into twenty-year chronological subcorpora, was submitted to computational processing with the aim of identifying the extent of diachronic variation in the linguistic canons proposed as models by teachers; the text types considered are exercise instructions, composition titles, dictations, corrections, assessments and marks documented in the pupils' notebooks. Alongside the concordances, a first-pass vocabulary of 152,151 occurrences (tokens) was obtained, reduced to 18,898 forms (types) and 11,751 lemmas (the vocabulary and the concordances of the corpus were obtained, after annotation and lemmatization, with the T-LAB software designed and distributed by Franco Lancia; for the principles and methodology adopted, see Revelli 2013). A further selection then led to the identification of the 2,022 nouns, adjectives and verbs considered panchronic because stably settled in the basic vocabulary of Scholastic Italian (henceforth VoBIS), i.e. attested with more than five occurrences in at least four of the six chronological repertories, or in three non-consecutive ones.
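For readers who wish to reproduce such counts on their own data, a minimal sketch (the author used T-LAB; the (form, lemma) input format shown here is an assumption, not the corpus's actual layout):

```python
# Count occurrences (tokens), distinct forms (types) and distinct lemmas
# in a lemmatized corpus given as (form, lemma) pairs.
corpus = [("andava", "andare"), ("scuole", "scuola"), ("scuola", "scuola")]

tokens = len(corpus)
types = len({form for form, _ in corpus})
lemmas = len({lemma for _, lemma in corpus})
print(tokens, types, lemmas)   # 3 2 2
```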
The term of comparison is the 2016 edition of the Nuovo Vocabolario di base della lingua italiana (henceforth NVdB) by Isabella Chiari and Tullio De Mauro (https://www.internazionale.it/opinione/tullio-de-mauro/2016/12/23/il-nuovo-vocabolario-di-base-della-lingua-italiana), which distributes the roughly 7,000 words statistically most frequent and most accessible to 21st-century Italian speakers over three reservoirs: the fundamental lexicon (FO, about 2,000 very-high-frequency words used in 86% of discourse and texts), the high-usage lexicon (AU, about 3,000 frequently used words covering 6% of occurrences) and the high-availability lexicon (AD, about 2,000 words "used only in some contexts but understandable by all speakers and perceived as having an availability equal to or even greater than the most used words"). The choice of this base, which also includes frequencies for spoken varieties and is temporally located after the period considered for the school lexicon, answers the need to verify whether, and to what extent, the written model offered by the latter may have influenced the configuration of subsequent actual usage.

2. Basic vocabularies compared: frequencies in the NVdB and in the VoBIS
Comparing the lexical reservoirs of the two repertories allows some first general observations: of the 2,022 lemmas of the VoBIS, 1,784 are matched in the NVdB, distributed 53% in the FO reservoir, 26% in AU and 9% in AD. Without going into the convergences shared by the two vocabularies, it is worth signalling that behind many cases of apparent coincidence of the frequency distributions lie significant differences, mostly induced by the tendency of IS towards semantic narrowing or, in some cases, semantic redetermination: among the many words that take on specific school senses (e.g. diario, interrogazione, nota, pensierino, voto), some entirely lose their anchoring to their meanings in the common language, as happened to tema, which has come to identify no longer a subject or topic to be treated but the product of a specific text type.
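Such a comparison is straightforward to reproduce. A minimal sketch, assuming hypothetical file names and a simple lemma-per-line / tab-separated format (not the actual VoBIS or NVdB distribution formats):

```python
# Given the VoBIS lemma list and the NVdB with its usage bands (FO/AU/AD),
# compute how the VoBIS lemmas distribute over the bands.
from collections import Counter

def band_coverage(vobis_lemmas, nvdb_bands):
    """nvdb_bands: dict lemma -> 'FO' | 'AU' | 'AD'."""
    counts = Counter(nvdb_bands.get(lemma, "absent") for lemma in vobis_lemmas)
    total = len(vobis_lemmas)
    return {band: round(100 * n / total, 1) for band, n in counts.items()}

vobis = [line.strip() for line in open("vobis_lemmas.txt", encoding="utf8")]
nvdb = dict(line.strip().split("\t") for line in open("nvdb.tsv", encoding="utf8"))
print(band_coverage(vobis, nvdb))
# expected shape: {'FO': 53.0, 'AU': 26.0, 'AD': 9.0, 'absent': 12.0}
```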
As for the 238 words absent from the NVdB (12%), they can be grouped into categories useful for bringing into focus various critical issues in the comparative application of the frequency parameter. A first, substantial group exclusive to IS consists of logonyms characteristic of the metalinguistic nomenclature of the school apparatus, such as alfabetico, apostrofo, coniugazione, preposizione, etc. We observe that, despite their polysemic potential, many of these – such as coniugare, derivato, imperfetto, possessivo, primitivo – reach very high frequencies within IS exclusively in their role as labels for metalinguistic reflection (for instance, dimostrativo, always preceded by aggettivo or pronome, never combines with atto, gesto, etc.): quantitative representativeness therefore does not imply that pupils come into contact with the various senses these terms may carry, but corresponds instead to an insistent specialization motivated by didactic needs. A second group consists of terms typical of literacy-teaching contexts: mainly nouns referring to concrete referents of little everyday prominence, whose written form guides and requires knowledge of counter-intuitive conventions that are nonetheless fundamental for correct orthographic encoding and decoding – for example words like acquaio, acquavite and acqueo, evidently introduced not out of pressing thematic necessity but to consolidate correct graphematic representations. To teaching purposes linked to the school subjects, or more generally to thematic choices characteristic of the educational context, are to be attributed the high frequencies of several terms of the historical-geographical domain (legione, vetta), of words describing the natural world (arto, astro) and rural life (semina, vendemmia), and of series of verbs (castigare, disobbedire), adjectives (diligente, ordinato) and abstract nouns (umiltà, penitenza) belonging to the typical formulary of civic or moral education and of school assessment. Unlike the NVdB, moreover, because of its diachronic design the VoBIS lemmary includes numerous archaisms: in some cases formal variants now abandoned (e.g. annunziare for annunciare) or dispreferred (ubbidire for obbedire); terms whose referents the social changes of the last fifty years have made superfluous or anachronistic (manto, ricamatrice); and members of outdated or formal synonymic pairs or series that only in the school context are, or were for longer, preferred over competitors felt by speakers as more current (persuadere for convincere; e.g. bambagia, cagionare, figliolo, focolare, garzone, uscio). Precisely in this last category the IS reservoir departs most visibly from the current lexicon, systematically privileging solutions absent from the NVdB at the expense of those documented there, in many cases with the FO usage mark (e.g. appetito for fame, ardere for bruciare, sciupare for rovinare, etc.). Continuing with the missing matches between the two repertories: while the absence from the NVdB of somewhat affected school words such as diletto, garbato, vezzo and soave is unsurprising, it is striking that terms apparently stable over time and of pan-Italian diffusion are not included: zoonyms such as bue, elefante, formica; names of fruit usually present on Italian tables such as fragola, noce and uva; and common concrete nouns such as carezza, martello, ombrello.
The lack of matches in the NVdB for terms of this kind can only partly be interpreted in a properly sociolinguistic dimension: while it is true that – given the audience it addresses – IS refers more frequently to themes and referents of material and experiential culture than do linguistic varieties addressed to and produced by adult speakers, it is equally true that in theory all such words, all the more so if welcomed and conveyed by the school, should belong to that heritage of "words we may happen never, or almost never, to say, let alone write, but which are linked to objects, facts and experiences well known to all adults in everyday life" (De Mauro 1980: 148). We would therefore expect to find matches at least within the AD reservoir, whose nature, however, De Mauro himself repeatedly declared to be elusive – conjectural rather than statistical (the preface to the NVdB specifies that the AD words "were derived from the list of 2,300 high-availability words of the old VdB, submitted to groups of university students to eliminate the words no longer felt as of major use and to admit new words felt as highly available"). And indeed, probably not even the most imposing and meticulous quantitative analyses can hope to eliminate the inevitable factors of unpredictability and accidentality of frequency. In the case sampled here, which for IS does not rest on a source corpus of fully satisfactory size, gaps concerning terms for which one would expect matches also appear when the perspective is reversed, i.e. starting from the NVdB lemmary: even widening the horizon to the entire vocabulary of the corpus, what is missing is not only AD-marked terms but also fundamental words that are probably known to children yet do not appear in the sample for merely accidental reasons (exemplary of these margins of chance is the case of ethnonyms, some of which are missing from the CoDiSV – e.g. cinese, iugoslavo – although it documents very many others of at least apparently similar diffusion, e.g. giapponese, inglese). Certainly motivated and intentional, by contrast, are specific types of omission easily identifiable as specific to IS: neologisms and luxury loanwords, which the teachers' models – perhaps in some cases also for orthographic reasons – tend to reject even when stably acclimatized in standard Italian (jeans, quiz, smog); terms referring to concepts deemed unseemly for an immature audience (aborto, droga, sesso); slang, vulgar expressions, insults and imprecations (coglione, culo, ruttare); discriminatory appellatives (ebreo, nano, negro); but also words prudently avoided because felt to be potentially factious, propagandistic, or at least ideologically and politically oriented. On this last aspect, which embodies the intimacy of the relations between lexicon, school, social climate and cultural temper, no generalizations can however be made, since the clues to the different shapes the phenomenon has taken over time, including very recently, have to be intercepted in the low or non-existent frequencies rather than in the high frequencies of the basic lexicon.
3. Conclusions and perspectives
As we set out to show, the quali-quantitative examination of IS confirms that, although it displays diachronic traits of modernization, the linguistic model proposed by teachers is characterized by the stable use of terms foreign to the basic vocabulary and by the parallel avoidance of current terms deemed inadequate, unseemly, or simply worn out by a use felt to be excessive. The study of the data further reveals an abundant presence of logonyms and labels typical of, or exclusive to, didactic and grammatical metalanguage; the use of hapaxes often confined to occasional specific exercise types yet didactically indispensable for their strategic role; and recourse to a formulary within which current terms, through processes of semantic redetermination, take on senses different from the usual ones, specializing in relation to tasks and communicative routines typical of the educational context. For these reasons, the lexical frequencies documented in the varieties of IS appear partly inconsistent internally, discordant with those of the basic vocabulary, and contradictory with respect to the evidence traceable in apparently similar varieties of Italian: studying their configurations therefore requires models of analysis capable of interpreting quantitative data in the light of the complexity of the paradigmatic relations between potential competing solutions, as well as of the composite relations between the number and type of senses attested in concrete contextual uses. In this direction, already partly explored in psycholinguistic and language-teaching studies of comprehension processes and text readability, a systematic comparison between the IS lexicon and that of the VdB, conducted on chronologically harmonized corpora, seems able to open further lines of research in at least two specific domains of inquiry. The first, more properly acquisitional in perspective, should aim to verify the actual outcomes of prolonged exposure at school age to the share of IS words foreign to the basic vocabulary: in this direction, given the incremental and adaptive nature of lexical learning but also the fading effect that lack of practice can exert on acquired competence, one could try to answer questions such as the following. How much does the insistence with which a term is present in the input offered within IS actually affect its use in domains distinct from it and experienced later? To what extent can the conceptualization of a given sense of a term conveyed by teaching condition (positively or negatively) the subsequent acquisition of further, different meanings for that same term? In what terms do the preferential solutions and paradigmatic choices proposed by IS prove successful, at least at the level of receptive competence, in competition with the statistical analyses that speakers carry out on other varieties and in potentially more meaningful contexts?
In che misura la concettualizzazione relativa una determinata accezione di un termine veicolata dall’insegnamento può condizionare (positivamente o negativamente) la successiva acquisizione di significati ulteriori e diversi per quello stesso termine? In che termini le soluzioni preferenziali e le scelte paradigmatiche proposte dall’IS risultano vincenti, almeno a livello di competenza ricettiva, nella concorrenza con le analisi statistiche che i parlanti sperimentano su altre varietà e in contesti potenzialmente più pregnanti? E in questo senso, quanto può essere JADT’ 18 633 percepito come autorevole, significativo, dotato di rilevanza comunicativa il modello lessicale scolastico in un Paese in cui l’italiano è diventato lingua materna per la gran parte dei cittadini e la concorrenza di input – non soltanto lessicale - proveniente da fonti alternative alla scuola appare quantitativamente strabordante? Un secondo ambito d’indagine, al precedente correlato ma di prospettiva principalmente lessicografica, potrebbe invece essere indirizzato ad esplorare l’ipotesi che una parte del vocabolario scolastico di base possa essere considerata denominatore comune delle competenze lessicali possedute dai parlanti adulti alfabetizzati, e venire impiegata soprattutto come punto di riferimento per la definizione del vocabolario di alta disponibilità. In questo senso, le oggettive difficoltà di identificazione di quelle “parole che riteniamo più comuni, psicologicamente e culturalmente, ma che poi hanno in realtà una frequenza minima, vicina a zero, soprattutto nell’uso scritto” (De Mauro 2004: 142) potrebbero essere in parte superate facendo riferimento a quella porzione di bagaglio lessicale condiviso e acquisito, se non attraverso altri canali, per il tramite dell’IS: seppure statisticamente poco rilevanti nelle produzioni adulte, i termini a chiunque familiari perché proposti con frequenze elevate e funzioni significative nell’italiano per i bambini – ad esempio i termini tipicamente indicati sugli alfabetieri (oca), usualmente utilizzati per l’insegnamento delle particolarità ortografiche (camoscio), presenti nelle denominazioni più diffuse di giochi e tipologie esercitative (cruciverba), in fiabe e racconti (carrozza), corrispondenti a discipline (geografia) o routines scolastiche (giustificazione) – potrebbero probabilmente superare qualunque prova di elicitazione sui parlanti e quindi, seppur difficilmente rintracciabili nel lessico adulto, essere selezionate per entrare nel vocabolario di base con attribuzione della marca AD. Anche in questo caso, certamente, per evitare insidie e ambiguità semantiche andrebbero individuati dispositivi utili ad accertare la fenomenologia delle accezioni effettivamente attive nonché a verificare e interpretare criticamente le relazioni intercorrenti tra la frequenza dell’input lessicale (e semantico) in ingresso e la frequenza dell’output lessicale (e semantico) fattuale ma anche potenziale, in un modello descrittivo che – nel contemplare un’interazione dialettica, dinamica e comparativa tra le dimensioni della ricettività, produttività e disponibilità e attribuendo i giusti pesi a quella delicata e complessa combinazione di quantità e qualità che De Mauro (1994: 97) felicemente ebbe modo di battezzare binomio indispensabile – consenta di distinguere gli autentici dai solo apparenti paradossi della frequenza. 634 JADT’ 18 Riferimenti bibliografici Benedetti G. e Serianni L. (2009). Scritti sui banchi. L'italiano a scuola fra alunni e insegnanti. Roma, Carocci. 
Chiari I. and De Mauro T. (2012). The new basic vocabulary of Italian: problems and methods. Rivista di statistica applicata / Italian Journal of Applied Statistics, vol. 22 (1): 21-35.
Cortelazzo M. (1995). Un'ipotesi per la storia dell'italiano scolastico. In Antonelli Q. and Becchi E., editors, Scritture bambine, Roma-Bari: Laterza, pp. 237-252.
De Blasi N. (1993). L'italiano nella scuola. In Serianni L. and Trifone P., editors, Storia della lingua italiana, vol. I "I luoghi della codificazione". Torino: Einaudi, pp. 383-423.
De Mauro T. (1980). Guida all'uso delle parole. Roma: Editori Riuniti.
De Mauro T. (2004). La cultura degli italiani. Edited by Francesco Erbani. Roma-Bari: Laterza.
De Mauro T. (2005). La fabbrica delle parole. Torino: Utet Libreria.
Revelli L. (2013). Diacronia dell'italiano scolastico. Roma: Aracne.

How Twitter Emotional Sentiments Mirror on the Bitcoin Transaction Network

Piergiorgio Ricci
Tor Vergata University – piergiorgio.ricci@gmail.com

Abstract
Bitcoin is the first and most popular decentralized cryptocurrency. It was launched in 2008 by Satoshi Nakamoto, the name used by the unknown person or people who designed the Bitcoin system and created its original reference implementation. It is based on blockchain technology, which is considered one of the most promising technologies for the future: it is more than an instrument of finance and will likely disrupt many industries, from banking to governance, in the coming years. This research explores a geolocalized subset of the Bitcoin blockchain and compares it with Twitter communication on the topic, in order to discover what people living in different geographical areas think about the Bitcoin cryptocurrency and to assess the potential relationship between the characteristics of the language adopted by Twitter users in posts containing the keyword "Bitcoin" and the structure of the geolocalized blockchain. It also answers a variety of interesting questions about the national use of Bitcoin.

Keywords: Bitcoin, Blockchain, Cryptocurrency, Social Network Analysis, Semantic Analysis.

1. Introduction
The Bitcoin cryptocurrency is based on blockchain technology, which consists of an open, distributed ledger where all transactions occurring in the system are recorded in a verifiable and permanent way (Narayanan, Bonneau, Felten, Miller and Goldfeder, 2016). Transactions are organized in blocks, which are generated periodically and linked using cryptographic techniques (SHA-256) (Drainville, 2012). Each block needs to be validated by a peer-to-peer network following a specific protocol for validating new blocks. Once stored, data cannot be tampered with without altering all subsequent blocks, an activity that requires the collusion of the network majority (Nakamoto, 2008). This approach complies with consensus theory, a social theory which holds that social change and innovation can be achieved without conflict and that the social system is fair. In fact, Bitcoin's protocol relies on a strong social consensus among all participants in the system, each of whom represents a node of the network and runs software with the aim of enforcing the rules they agree on. The Bitcoin network is decentralized and does not require trust in a third party, such as a bank or a government institution.
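The tamper-evidence property described here can be illustrated with a minimal sketch of a hash-linked chain (illustrative only; real Bitcoin blocks also contain Merkle roots, nonces and difficulty targets):

```python
# Each block stores the SHA-256 hash of its predecessor, so changing any
# block breaks every hash link that follows it.
import hashlib
import json

def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

chain = [{"prev": "0" * 64, "txs": ["genesis"]}]
for txs in (["a->b: 1 BTC"], ["b->c: 0.4 BTC"]):
    chain.append({"prev": block_hash(chain[-1]), "txs": txs})

chain[1]["txs"] = ["a->b: 100 BTC"]              # tamper with an old block
print(block_hash(chain[1]) == chain[2]["prev"])  # False: the link is broken
```

Repairing the forgery would require recomputing every subsequent block, which is what makes majority collusion necessary.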
For sure, it represents a new concept of money (Evans, 2014), and the main purpose of this work is to find out what people living in different geographical areas think about the Bitcoin cryptocurrency, and to assess the potential relationship between the characteristics of the language used in Twitter posts on the topic and the structure of the geolocalized Bitcoin blockchain. The research analyzes correlations and causalities between social network metrics computed on the geolocalized Bitcoin transaction network and Bitcoin-related emotional signals intercepted by analyzing Twitter users' posts grouped by country. In particular, we consider it important to discover whether there is a specific kind of communication adopted by Twitter users belonging to countries that hold certain transaction-network centrality values. In other words, the core question to be answered concerns the existence of a correlation between a country's centrality in the Bitcoin transaction network and the characteristics of the language used in Twitter Bitcoin posts by its citizens. To achieve this purpose, two datasets representing Bitcoin transactions and Twitter communication related to Bitcoin have been collected and classified on a geographical basis. Prior research has focused on economic aspects (Ron and Shamir, 2012) and on structural properties of the Bitcoin transaction network (Lischke and Fabian, 2016; Fleder, Kester and Pillai, 2015), but it has rarely considered the relationship between transactions and social media communication. This study also answers a variety of interesting questions about the national use of Bitcoin and about how Twitter users perceive it through the communication signals posted on the Twitter microblogging platform. One of the most widely accepted use cases for Bitcoin has to do with payments for digital content (Grinberg, 2012), and at present the Bitcoin system is used only by early adopters and innovators within the population.

2. Data set
2.1 Bitcoin dataset
In order to analyze and compare the network of Bitcoin transactions and the related user sentiment on Twitter, two different datasets have been built using a series of application programming interfaces (APIs) available on the web. The first dataset to be extracted was the Bitcoin transaction network, which is publicly available from many free web services (such as Blockchain.info) or through a Bitcoin client that downloads and stores the whole transaction history, known as the blockchain (Moser, 2013). In order to reduce and manage its complexity, a subset of the blockchain composed of more than 2 million transactions from July 2013 to July 2017 has been collected. These transactions were imported through the Blockchain Data API service, which provides query functionality for Bitcoin block and transaction data, with requests regarding single blocks, single transactions and block heights.

Fig. 1: Example of a transaction with multiple inputs and outputs (www.blockchain.info).

Fig. 2: Word cloud related to the USA Twitter dataset.

These transactions were geolocalized using the ipinfo.io web service and stored in a NoSQL database. Geolocalization consisted in discovering the countries involved in each transaction and was carried out by scraping transaction IP addresses (Kaminsky, 2011). Each transaction block contains a set of transactions and is characterized by the following attributes: flow identifier, transaction hash, timestamp, origin country, destination country, sender, receiver and total amount (Ober, Katzenbeisser and Hamacher, 2013). Since each transaction can have multiple input and output addresses (Reid and Harrigan, 2012), transactions were decomposed into transaction flows. To attach geographical information to each transaction, the service provided by the ipinfo.io website was used: it offers a web interface where it is possible to retrieve the origin country of an IP address given as input.
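A minimal sketch of this geolocation step (the JSON endpoint shape is an assumption based on ipinfo.io's public interface; a real pipeline would also need an API token, rate limiting and caching):

```python
# Query ipinfo.io for the country of each relayed IP address and enrich
# the decomposed transaction flows with origin countries.
import json
import urllib.request

def country_of(ip):
    with urllib.request.urlopen(f"https://ipinfo.io/{ip}/json") as resp:
        return json.load(resp).get("country")    # e.g. "US", "DE"

flows = [{"tx_hash": "ab12...", "relayed_by_ip": "8.8.8.8", "amount": 0.5}]
for flow in flows:
    flow["origin_country"] = country_of(flow["relayed_by_ip"])
print(flows)
```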
Each transaction block contains a set of transactions and is characterized by the following attributes: flow identifier, transaction hash, timestamp, origin country, destination country, sender, receiver and total amount (Ober et al., 2013). Since each transaction can involve multiple input and output addresses (Reid and Harrigan, 2012), transactions have been decomposed into transaction flows. In order to attach geographical information to each transaction, the service provided by the ipinfo.io website has been used: it offers a web interface where it is possible to retrieve the origin country of an IP address provided as input.

2.2 Twitter dataset
A set of tweets from 10 different countries containing the word "Bitcoin" has been collected for analysis. Sentiment analysis has been conducted using the software Condor (MIT Center for Collective Intelligence), which automatically recognizes sentiment in English, Spanish, German, French, Italian and Portuguese and allows fetching tweets restricted to a given geolocation. It also allows computing the sentiment of posts by means of semantic analysis techniques. This dataset is partially misaligned with the first one for technical reasons.

3. Research methodology
The research combines social network analysis (SNA) and semantic analysis, with a particular focus on the relationship between the main indicators of these two fields calculated on the dataset.

3.1 Social Network Analysis
Using a social network analysis approach, several strategies are possible to examine the structure of the Bitcoin transaction network. For the analysis, some of the most common measures of centrality have been selected. Most of them were proposed by Freeman (1979) and are also analyzed in other social network analysis works (De Nooy et al., 2011). They are briefly described in the following subsections, and a sketch of their computation follows.

3.1.1 Degree centrality
This measure is based on the degree, i.e. the number of nodes directly attached to the node for which it is computed. In directed networks, two different measures can be calculated: indegree, the number of ties directed to the node, and outdegree, the number of ties that the node directs to others. The all-degree is the sum of the two. For a generic node $a$ in a directed graph,

$C_D(a) = k_a^{in} + k_a^{out}$,

where $k_a^{in}$ counts the number of incoming ties and $k_a^{out}$ the number of outgoing ties (in the weighted variant, ties are counted with their weights). A node with a high degree centrality is central in the network structure and tends to influence the others.

3.1.2 Closeness centrality
Closeness centrality is the inverse of the distance of a node from all the others in the graph, based on the shortest paths between each pair of nodes. For a node $a$ in a graph with $N$ nodes,

$C_C(a) = \frac{1}{\sum_{b \neq a} d(a, b)}$,

where $d(a, b)$ is the number of edges in the shortest path linking $a$ and $b$. Closeness centrality is normalized as $C'_C(a) = (N - 1)\, C_C(a)$. This measure can be considered a proxy of the speed with which a social actor can reach the others.

3.1.3 Betweenness centrality
This measure considers the shortest paths connecting every other pair of nodes and is higher when a node lies more frequently on them. For a network with $N$ nodes, the betweenness centrality of node $a$ is

$C_B(a) = \sum_{s \neq a \neq t} \frac{\sigma_{st}(a)}{\sigma_{st}}$,

where $\sigma_{st}$ is the number of shortest paths linking nodes $s$ and $t$, and $\sigma_{st}(a)$ is the number of those paths that pass through node $a$.
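The three centrality measures above can be computed directly with the networkx library; the toy edge list below is ours, standing in for the country-level Geographical Network of transaction flows described in the next paragraph:

```python
import networkx as nx

# Toy country-level transaction graph; edge weights are BTC amounts.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("US", "DE", 5.0), ("DE", "US", 3.0),
    ("US", "ES", 1.0), ("ES", "DE", 0.5),
])

# All-degree: incoming plus outgoing ties.
degree = {n: G.in_degree(n) + G.out_degree(n) for n in G}

# Closeness: inverse of total shortest-path distance, normalized by N-1.
closeness = nx.closeness_centrality(G)

# Betweenness: share of shortest paths passing through each node
# (networkx normalizes this by default).
betweenness = nx.betweenness_centrality(G)

for n in G:
    print(n, degree[n], round(closeness[n], 3), round(betweenness[n], 3))
```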
The social network indicators described above can be used to analyze the structure and dynamics of the geographical Bitcoin network. Once the target set of transactions was collected and enriched with geographical information, two directed graphs were modeled. In the first, identified as the Generic Network, each node represents a Bitcoin address owned by a user belonging to a specific country, and each link indicates a transaction of a certain amount (the weight of the link) between two different addresses. In the second, called the Geographical Network, each node represents a country and links stand for transactions, which can involve a single country or different countries. All the network metrics used in this study have been computed on the Geographical Network, obtained by merging Generic Network transactions on a geographical basis.

3.2 Semantic Analysis
Semantic analysis turns text into data for analysis, by applying natural language processing techniques and analytical methods (Hu et al., 2013). The communication indicators used here are briefly described in the following subsections.

3.2.1 Sentiment
This indicator describes whether messages are positive or not. Its value ranges from 0, for very negative messages, to 1, for very positive ones. It is computed as the average score over the whole text of a message.

3.2.2 Emotionality
This variable expresses the degree of emotion of an individual text fragment and is involved in the computation of sentiment.

3.2.3 Complexity
This indicator measures the rarity of words, i.e. the likelihood that a single word will occur in a text. It is higher when a text contains many rare words.

4. Results
The aim of this study is to find out whether the characteristics of Twitter communication related to Bitcoin reflect the structure of the geolocalized blockchain network. The analysis combines the most important social network centrality metrics (degree, closeness and betweenness centrality) with language indicators measuring the characteristics of the textual data used in Twitter communication. On the one hand, centrality metrics measure the importance, influence or power of a node in the network and are widely applied in social network analysis; on the other, language indicators identify whether communication about Bitcoin is positive or not, how emotional it is, and how complex its word usage is. Country rankings for each social network indicator were calculated and correlated with the national rankings of Twitter sentiment, complexity and emotionality computed on tweets containing the keyword "Bitcoin". Spearman's correlation, computed on a set of 10 countries with a high number of transactions and tweets, shows a significant correlation between the centrality measures computed on the geographical blockchain network and the language used on the Twitter microblogging platform (a sketch of this computation follows). In particular, the communication of people belonging to the most central countries in the Bitcoin network, e.g. Germany and the USA, is more complex and less emotional than that of peripheral country nodes.
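The reported rank correlation can be reproduced with scipy; the ten country scores below are placeholders we invented for illustration, not the study's data:

```python
from scipy.stats import spearmanr

# Hypothetical national scores for ten countries (placeholder values).
degree_centrality = [9.1, 8.4, 7.7, 6.9, 6.2, 5.5, 4.8, 4.1, 3.3, 2.6]
emotionality      = [0.21, 0.35, 0.28, 0.40, 0.33, 0.52, 0.47, 0.58, 0.61, 0.66]

# Spearman's rho works on the rankings induced by the scores.
rho, p_value = spearmanr(degree_centrality, emotionality)
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3f}")  # negative rho expected
```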
This is probably due to a deeper knowledge of the Bitcoin phenomenon in the most innovative countries, as shown by their word clouds. In fact, they tweet more and with quite technical language (e.g. they discuss technical aspects such as blockchain forks), while the others, for example Spain, appear frightened by the diffusion of the new cryptocurrency.

Fig. 3 - Spearman's correlations calculated on national rankings (N = 10, 2-tailed):
  Emotionality vs. Degree Centrality: rho = -0.638, p = .047
  Complexity vs. Degree Centrality: rho = -0.693, p = .026
  Both correlations are significant at the 0.05 level.

5. Conclusion and future works
The analysis highlights the geographical distribution of Bitcoin transactions and shows national differences in its adoption, revealing the major businesses and markets. In particular, the most central countries in the Bitcoin transaction network are characterized by a positive and fairly complex language, while peripheral countries use a more emotional language and the sentiment of their people about Bitcoin is fairly variable. This result leads to the interpretation that Twitter emotional sentiments mirror the Bitcoin transaction network, which could be an interesting signal for investors and entrepreneurs interested in the development of new payment systems based on Bitcoin technology and in the choice of the country in which to start up. The main findings of the study could be applied to the national regulation of crypto-payments as well as to the assessment of the economic and financial impact of cryptocurrencies; future work includes investigating the principal barriers to the mass adoption of the Bitcoin cryptocurrency.

References
De Nooy W., Mrvar A. and Batagelj V. (2011). Exploratory Social Network Analysis with Pajek (2nd ed.). Cambridge University Press.
Drainville D. (2012). An Analysis of the Bitcoin Electronic Cash System. University of Waterloo.
Evans D. (2014). Economic Aspects of Bitcoin and Other Decentralized Public-Ledger Currency Platforms. University of Chicago Coase-Sandor Institute for Law & Economics Research Paper No. 685.
Fleder M., Kester M. and Pillai S. (2015). Bitcoin Transaction Graph Analysis.
Freeman L.C. (1979). Centrality in social networks: conceptual clarification. Social Networks, 1, 215-239.
Grinberg R. (2012). Bitcoin: An Innovative Alternative Digital Currency. Hastings Science & Technology Law Journal.
Hu X., Tang L., Tang J. and Liu H. (2013). Exploiting social relations for sentiment analysis in microblogging. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM.
Kaminsky D. (2011). Black Ops of TCP/IP. Black Hat & Chaos Communication Camp.
Lischke M. and Fabian B. (2016). Analyzing the Bitcoin Network: The First Four Years. MDPI AG.
Moser M. (2013). Anonymity of Bitcoin Transactions. Münster Bitcoin Conference.
Nakamoto S. (2008). Bitcoin: A Peer-to-Peer Electronic Cash System.
Narayanan A., Bonneau J., Felten E., Miller A. and Goldfeder S. (2016). Bitcoin and Cryptocurrency Technologies: A Comprehensive Introduction. Princeton University Press.
Ober M., Katzenbeisser S. and Hamacher K. (2013). Structure and Anonymity of the Bitcoin Transaction Graph. Future Internet. MDPI.
Reid F. and Harrigan M. (2012). An Analysis of Anonymity in the Bitcoin System. Springer.
Ron D. and Shamir A. (2012). Quantitative Analysis of the Full Bitcoin Transaction Graph. IACR Cryptology ePrint Archive.
Content analysis versus the Reinert method: a comparative analysis of a bilingual corpus of Acadian and Loyalist speeches from N.B., Canada

Chantal Richard (1), Sylvia Kasparian (2)
(1) University of New Brunswick, Canada – chantal.richard@unb.ca
(2) Université de Moncton, New Brunswick, Canada – sylvia.kasparian@umoncton.ca

Abstract
In this paper we compare two methods of thematic analysis by applying them to the same corpus. Specifically, we compare the results of the classification of context units using the Reinert method in IRAMUTEQ with a content analysis (manually coded themes) analyzed using SPHINX in 2012. The bilingual corpus consists of two sub-corpora: speeches from the Conventions nationales acadiennes (in French) and centennial commemorative speeches by Loyalists (in English). Our goal is to determine whether the Reinert method of distribution by class confirms, contradicts, or enhances a traditional content or thematic analysis.

Keywords: content analysis, IRAMUTEQ, Reinert method, descending hierarchical classification.

1. Introduction
At JADT 2012, we presented a content analysis of the main themes of a bilingual corpus drawn from the database Vocabulaires identitaires. This database gathers speeches in French and in English dealing with the collective identity of two diasporic peoples of New Brunswick, Canada: the Acadians and the Loyalists. Since 2012, the database has grown from 74 to 1,525 texts. A more efficient procedure was therefore needed, and we chose the Reinert method of descending hierarchical classification. Before undertaking the analysis of the larger corpus, we wanted to compare the Reinert method with the results of the 2012 content analysis by applying it to the original corpus of 74 texts. This article shows to what extent the Reinert method enhances traditional content analysis for studying the lexical worlds, or discourse universes, of these two sub-corpora.

2. Content analysis and the Reinert method
Before proceeding to the analysis, we briefly define the two types of analysis and explain our methodological approach.

2.1 Content analysis
By content analysis we mean a "method of classification or codification into various categories of the elements of the document analyzed, in order to bring out its different characteristics and better understand its exact and precise meaning" (L'Écuyer, 1987: 50).
In other words, an exhaustive reading of the corpus makes it possible to choose units of classification and to generate a categorization in the form of tables to be processed statistically; the interpretation of the results of the statistical analysis then yields a description of the themes identified. This is the methodology used in our first study of the corpus, carried out with the SPHINX and HYPERBASE software packages in order to extract the keywords of the sub-corpora. Tables 1 and 2 below present the themes and some of the keywords that constitute them.

2.2 The Reinert method
The Reinert method of descending hierarchical classification, as adapted in the IRAMUTEQ software, was applied to our corpus following the procedure described by Ratinaud and Marchand (2012). This method identifies elementary context units according to the internal organization of the lemmatized text, which is then distributed into classes by successive bipartitions. As for the content analysis, we analyzed the sub-corpora separately by language. The resulting classifications were contrasted with the earlier results obtained through content analysis.

3. Corpus
The 34 speeches of the Conventions nationales acadiennes, delivered from 1881 to 1890, constitute the French-language Acadian corpus, which comprises 56,368 words. At that time, the Acadians were undertaking a social reorganization through the choice of national symbols. The Loyalists of New Brunswick, for their part, were a group of royalist Americans who fled the country after Independence and settled in New Brunswick, where they celebrated their centennial in 1883. The 40 speeches of the Loyalist centennial, published between 1882 and 1887, form the English-language corpus of 69,610 words.

4. Analysis
The contrastive analysis of the results obtained by the two methods is presented by sub-corpus, showing first the thematic table with some of the keywords generated by the content analysis, followed by the dendrogram produced by IRAMUTEQ.

4.1 Corpus of the Conventions nationales acadiennes (French)

Table 1: Themes and keywords extracted from the Acadian sub-corpus by content analysis (keywords are left in the original French):
- Unifying event (symbols): fête, convention, drapeau, adopter, distinct, monument, assemblée, tricolore, légitime, étoile...
- Progress and future: avancement, intérêts, droits, développement, sauvegarde, surmonter, triomphant, amélioration, combattre...
- References to the past: colonie, histoire, perdu, ancêtres, origine, persécutés, misère, pères, mort, larmes, souvenir, infortune, ruine...
- (Inter)national relations: compatriotes, anglais, union, sympathie, ennemi, confédération, américains, fusion, puissance, Louisiane, préjugés...
- Characteristics associated with the people: grand, bonheur, malheur, honneur, noble, digne, devoir, petit, courage, difficultés, persévérance, faible, pauvre, humble...
- Race, ethnicity and culture: peuple, nation, race, patriotisme, sang, Acadie, patrie, âmes, usages, traits...
- Religion: saint, religieuses, frères, foi, patron, Dieu, Marie, Église, Assomption, chrétien...

The distribution into classes by the Reinert method in IRAMUTEQ first separates class 6 from the other classes. This class is represented by a lexicon centred on the choice of an Acadian national holiday, the first objective of this great patriotic gathering. A second partition separates classes 3 and 4 from classes 2, 1 and 5.
Class 4 is characterized by a lexicon of values associated with religion, while class 3 illustrates values associated with a traditional way of life attached to the past. The link between the two reveals that, for the Acadians of the time, the traditional way of life was strongly tied to the Catholic religion. If classes 3 and 4 refer to the past, classes 2, 1 and 5 instead suggest a gaze turned toward the future, notably in the domains of material and intellectual progress (class 2), the francophone press (1) and education (5).

Figure 1: Dendrogram CHD1 produced by IRAMUTEQ: descending hierarchical classification by the Reinert method for the Acadian corpus.

As for the comparison with the themes identified by traditional content analysis (Table 1), several parallels can be drawn. Class 6 shares a substantial number of forms with the theme "Unifying event" of the content analysis, notably the keywords common to both methodologies: fête, adopter, drapeau, tricolore and distinct. Classes 3 and 4 can likewise be related to the themes "Religion" and "References to the past" of Table 1; these two classes also contain some words listed under the theme "Characteristics associated with the people". Classes 2, 1 and 5 share part of their lexicon with the theme "Progress and future" of Table 1.

What, then, does the Reinert method contribute to our analysis? In this case, it is pertinent to ask what it does not bring out: notably, the content analysis categories "National and international relations" and "Race, ethnicity and culture" (although some forms such as "sang" and "Acadie" are found in classes 3 and 4). These two themes are the closest to the researchers' own interests, which suggests probable human interference. Moreover, the order of the partitions proposed by IRAMUTEQ, which separates class 6 and distributes the five other classes between past and future, is very revealing of a paradoxical discourse juxtaposing social progress with the preservation of an identity anchored in the past, something which had not emerged from the traditional thematic content analysis.

4.2 Corpus of the centennial commemorations of the N.B. Loyalists

Table 2: Themes and keywords extracted from the Loyalist sub-corpus by content analysis (HYPERBASE and SPHINX):
- Unifying event (commemoration): anniversary, commemorate, memorial, Parrtown, Victoria, 1883, 18th, Institute, Regiment...
- Progress and future: advancement, building, cities, commerce, development, establishment, factories, harbour, hotels, industrial...
- References to the past: war, 1783, forefathers, abandoned, bitterness, choice, confiscated, defence, hardship, heroes, duty, Israelites, rugged, struggle...
- National and international relations: alliance, annexation, commonwealth, constitution, Independence, monarchy, government, King, Mother, protection...
- Characteristics associated with the people: active, brave, brotherhood, conservative, determination, intelligent, deserving, strength...
- Race, ethnicity and culture: civil, civilized, humanity, race, superior, anglo-saxon, yanks, elevate, blood...
- Religion: God, bibles, bless, Christian, churches, devotion, faith, morality, temperance...

Seven classes are proposed in the dendrogram produced by IRAMUTEQ for the English-language Loyalist corpus. A first partition separates classes 3 and 2 from all the other classes.
Class 3 is composed of military references to figures, places and dates, and class 2 gathers a lexicon designating the associative structures responsible for preserving memory; both classes are characterized by a large number of proper nouns. Class 7 is next distinguished by its legal terms attached to the British Empire and its colonies. Class 6, for its part, is made up of a lexicon centred on natural resources and material or commercial progress, suggesting a vision of the domination of nature by human beings. Class 1 deals with the moral and religious values prized by the Loyalists. Finally, classes 4 and 5 are very close, and designate respectively the circumstances of the Loyalists' departure from the United States out of loyalty to the British crown, and the celebration of their success as founders of a new province (New Brunswick) one hundred years later.

Figure 2: Dendrogram CHD1 produced by IRAMUTEQ: descending hierarchical classification by the Reinert method for the Loyalist corpus.

The classes thus obtained can be compared with the themes of Table 2. For example, class 1 (moral and religious values) shares its lexicon with the themes "Religion" and "Characteristics associated with the people". Class 4 (circumstances of departure) is very similar to the theme "References to the past", and class 5 (celebration of success) can also be set in parallel with "Unifying event: commemoration" as well as with the theme "Race, ethnicity and culture" extracted by the content analysis. Classes 2 (associative structures) and 3 (military references) can be related to the theme designated in Table 2 as "Unifying event: commemoration". Class 7 (British Empire and its colonies) comes close to the theme "National and international relations" without being identical to it, and class 6 (natural resources and progress) resembles the theme "Progress and future", but with certain distinctions, notably the inclusion of words referring to nature within the theme of material progress.

The originality of the distribution into classes by IRAMUTEQ lies partly in the juxtaposition of past and present in classes 3 (military references to the past) and 2 (associations preserving memory through commemorative activities), as well as in classes 5 (celebration of success) and 4 (circumstances of departure), which are, in a way, their echo. Moreover, the categories established in the content analysis proved incomplete, and the lexicon is reorganized by the descending hierarchical classification. According to the partitions of the Reinert method, the legal terms (parliament, act, law, etc.) are grouped with the terms referring to the British crown and its colonies, whereas they had not been identified in our 2012 study. Likewise, the words designating the natural world (forest, ocean, tree, etc.) sit alongside the lexicon of material and commercial progress in the dendrogram, which was not intuitive on human reading but highly revealing. It is precisely in these apparent contradictions that the most nuanced, and therefore the most judicious, interpretations of a textual corpus appear.
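To make the successive-bipartition idea of section 2.2 concrete, here is a deliberately simplified Python sketch, not IRAMUTEQ's actual implementation (whose segmentation, lemmatization and χ²-based criterion are richer): context units are split by the sign of their coordinate on the first factor of a correspondence analysis of the units × forms table, and the split can be recursed on each class to obtain a dendrogram. All names are ours.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def ca_first_axis(X):
    # First correspondence-analysis axis of a counts matrix X (rows = units).
    P = X / X.sum()
    r = P.sum(axis=1, keepdims=True)           # row masses (n x 1)
    c = P.sum(axis=0, keepdims=True)           # column masses (1 x m)
    S = (P - r @ c) / np.sqrt(r @ c)           # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, 0] * s[0]) / np.sqrt(r[:, 0])  # row principal coordinates

def bipartition(units):
    # One Reinert-style split: units whose first-axis signs differ
    # belong to different "lexical worlds"; recurse for a full dendrogram.
    X = CountVectorizer(binary=True).fit_transform(units).toarray().astype(float)
    coord = ca_first_axis(X)
    left = [u for u, z in zip(units, coord) if z < 0]
    right = [u for u, z in zip(units, coord) if z >= 0]
    return left, right

units = ["la fête le drapeau", "le drapeau tricolore",
         "le progrès et l'avenir", "l'éducation et le progrès"]
print(bipartition(units))
```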
5. Conclusion
Beyond the ability to process larger corpora in several languages, what then are the advantages of applying the Reinert method to our bilingual corpus? In short, the distribution into classes led us to revise and nuance the results of the original content analysis. While the partitions sometimes resemble the themes identified in 2012, the Reinert method has the advantage of revealing the links between classes through its gradual partitions, regardless of language, which allowed us to observe a temporal past/future distribution in the Acadian sub-corpus and a past/present distribution in the Loyalist sub-corpus. Moreover, the context units do not rest on preconceptions or internal dictionaries, but on a distribution of lexical worlds that respects the internal organization of the corpora, which yielded a reorganization of the lexicon and the inclusion of words that did not figure in the original analysis. It is precisely the inclusion of this apparently paradoxical lexicon that leads to a more objective and finer analysis. For example, the proximity of nature and material progress in the Loyalist speeches suggests a vision of the domination of nature by human beings, while the Acadian speeches aim at social, economic and commercial progress while seeking to preserve an identity anchored in the past. Our observations on the patriotic speeches of the Loyalists and the Acadians at the end of the 19th century are thus considerably enriched by the Reinert method as integrated in the IRAMUTEQ software.

Note: This article benefited from an Insight grant of the Social Sciences and Humanities Research Council of Canada. We also thank Marc-André Bouchard for his technical assistance.

References
Baulac Y. and Moscarola J. SPHINX Solutions d'enquêtes et d'analyses de données. www.lesphinx-developpement.fr.
Brunet É. HYPERBASE. Laboratoire UMR 6039 Bases Corpus Langage, Université de Nice-Sophia Antipolis. http://ancilla.unice.fr/~brunet/pub/logiciels.html.
L'Écuyer R. (1987). L'analyse de contenu : notion et étapes. In Deslauriers, J.P., editor, Les méthodes de la recherche qualitative. Presses de l'Université du Québec, pp. 49-64.
Ratinaud P. (2009). IRAMUTEQ: Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires. http://www.iramuteq.org.
Ratinaud P. and Marchand P. (2012). Application de la méthode ALCESTE aux « gros » corpus et stabilité des « mondes lexicaux » : analyse du « CableGate » avec IRAMUTEQ. In Dister A., Longrée D., Purnelle G., editors, Actes/Proceedings of JADT 2012 (11es Journées internationales d'Analyse statistique des Données Textuelles), pp. 845-857.
Richard C. and Kasparian S. (2012). Vocabulaire de l'identité nationaliste : analyse lexicale et morphosyntaxique des discours acadiens et loyalistes entre 1881 et 1890 au N.-B., Canada. In Dister A., Longrée D., Purnelle G., editors, Actes/Proceedings of JADT 2012 (11es Journées internationales d'Analyse statistique des Données Textuelles), pp. 845-857.
Richard C., Bourque D., Brown A., Conrad M., Davies G., Francis C., Huskins B., Kasparian S., Marquis G., Mullally S. Database: Vocabulaires identitaires/Vocabularies of Identity. https://voi.lib.unb.ca

Bridge over the ocean: Histories of social psychology in Europe and North America.
An analysis of chronological corpora

Valentina Rizzoli, Arjuna Tuzzi
University of Padova – valentina.rizzoli@phd.unipd.it; arjuna.tuzzi@unipd.it

Note: This study is a new development of an interdisciplinary research project funded by the University of Padova, fund CPDA145940 (2014), "Tracing the History of Words. A Portrait of a Discipline Through Analyses of Keyword Counts in Large Corpora of Scientific Literature" (P.I. Arjuna Tuzzi).

Abstract
Since the European Association of Social Psychology (EASP, initially called the European Association of Experimental Social Psychology) was established in 1966, what was then considered "European" social psychology has been working to affirm its own identity by presenting a distinctive brand to the rest of the world in general and to North America in particular. This study compares European and U.S. social psychology through the analysis of the papers published by two of the main journals in the field: the Journal of Personality and Social Psychology and the European Journal of Social Psychology. All the abstracts of the two journals' papers (from the first publication to the last one in 2016) have been collected. By means of a (lexical) correspondence analysis (SPAD software), the existence of a latent temporal pattern in keyword occurrences was explored. Furthermore, in order to detect, retrieve and compare the main topics the journals dealt with over time, an analysis based on Reinert's method was conducted (IRaMuTeQ and R software). The results show that, even if some typical features distinguish "European" from "American" social psychology, some publication trends seem to converge. The results are also discussed with reference to the contribution of these methods to studying the history (or histories) of a discipline.

Keywords: diachronic corpora, chronological textual data, text clustering, correspondence analysis, Reinert's method, history of social psychology

1. Introduction
It is widely held that what is called "modern social psychology" came from Europe with the migration of scholars during the Second World War, and that it started to develop mainly in the United States. Moscovici and Markova (2006) referred to an American indigenous tradition competing with a newer Euro-American tradition; they did not intend to argue that a socio-psychological tradition was born in Europe and brought to America, but rather that a genuinely American tradition began with the work of the immigrant Lewin and his new students. While social psychology was developing prosperously in the U.S., there were scholars working on social psychology in Europe, but there was no European school (Moscovici, 1999). The establishment of the European Association of (Experimental) Social Psychology (EASP, initially EAESP) in 1966 was fundamental to the development of a "European" social psychology. EASP presented a distinctive brand of the discipline to the rest of the world in general and to North America in particular, by providing a voice for a more "social" social psychology (http://www.easp.eu/about/). To consider an "American" and a "European" social psychology as two completely separate and counterposed entities would be wrong, since there was a clear mutual influence between them. Moreover, the first EASP meeting, which fostered the birth of EAESP, was an initiative of U.S. scholars (cf. Moscovici and Markova, 2006). By "American" social psychology we usually refer to the indigenous U.S.
tradition, explicated by Floyd Allport's work in 1924, which considers social psychology as part of general psychology and pays more attention to the individual. "European" social psychology usually refers to the Euro-American tradition promoted by the EASP, which regards social psychology as strictly connected to neighbouring disciplines such as sociology and anthropology and accords a greater role to social and cultural aspects (http://www.easp.eu/about/). This contribution consists of an empirical analysis that starts from the study of scientific production. Over time, scientific journals shape the history of a discipline, as they include the objects, fields of application and methods that contribute to delineating its trajectory. Thus, an in-depth understanding of the past and the temporal evolution of a discipline can be achieved by analysing the scientific debate inside relevant scientific journals (Trevisani and Tuzzi, 2015; 2018). We have taken into account the European Journal of Social Psychology (EJSP) and the Journal of Personality and Social Psychology (JPSP). The former is an official publication of the EASP and represents the association's voice worldwide. The JPSP belongs to the American Psychological Association, which represents the most widespread community of psychologists in the United States and beyond: it is an important scientific reference that provides guidelines in Europe as well. In terms of visibility and prestige, the JPSP is considered one of the most relevant journals in the field. The main aim is to observe and compare the trajectory of the two journals' publications and to reflect on what contribution these methods can provide for the study of the history of a discipline. We particularly intend: 1) to portray the temporal pattern of the main concepts debated in the past and covered today by EJSP and JPSP; 2) to detect, retrieve and compare the main topics these journals dealt with over time.

2. Methods
All the available abstracts of the two journals have been included in two corpora, collected from different acknowledged sources and checked against the journals' websites. For EJSP, a total of 2,559 items was collected, covering 46 years, from the very first issue in 1971 (Volume 1, Issue 1) to the latest of 2016 (Volume 46, Issue 7). For JPSP, 9,568 items were downloaded, covering 52 years, from 1965 (Volume 1, Issue 1) to 2016 (Volume 111, Issue 6). Items without any abstract were deleted (e.g., editorials, mastheads, errata, acknowledgements). The EJSP corpus is composed of 2,195 abstracts, the JPSP corpus of 9,536 abstracts. To improve the homogeneity of the corpora we privileged British spelling in EJSP (e.g., we replaced "analyzed" with "analysed") and American spelling in JPSP. The corpora were normalised only by replacing uppercase with lowercase letters. The lexicometric measures showed a good level of redundancy, which is fundamental when working with frequencies (Lebart, Salem and Berry, 1998; Tuzzi, 2003; Bolasco, 2013). Multi-words (MWs) with frequencies ≥ 5 for the EJSP corpus and ≥ 10 for the JPSP corpus (which is considerably larger) were recognised, selected and treated as textual units. We resorted to a procedure for automatic information retrieval that recognises repeated informative sequences, e.g. an adjective followed by a noun, as in "social psychology", which produce a MW (Pavone, 2010); a toy version of this frequency-based detection is sketched below.
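The following minimal Python sketch (ours, not Pavone's actual procedure, which is more sophisticated) illustrates the core idea of frequency-thresholded multi-word detection: collect adjacent word pairs across abstracts and keep those at or above the journal-specific threshold.

```python
from collections import Counter

def multiwords(abstracts, min_freq):
    # Count adjacent word pairs across all abstracts and keep frequent ones.
    pairs = Counter()
    for text in abstracts:
        tokens = text.lower().split()
        pairs.update(zip(tokens, tokens[1:]))
    return {" ".join(p): f for p, f in pairs.items() if f >= min_freq}

abstracts = ["social psychology of groups", "experimental social psychology"]
print(multiwords(abstracts, min_freq=2))  # {'social psychology': 2}
```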
Two encyclopaedias of social psychology (Manstead et al., 1995; Baumeister and Vohs, 2007) and the keyword indexes available during the downloading process provided further MWs. In order to depict the structure of the association between years and words and to establish the existence of a chronological dimension, a (lexical) correspondence analysis (CA) was conducted on two matrices: 5,784 words by 46 years (rows by columns) for the EJSP corpus and 8,349 by 52 for JPSP. To detect a set of relevant topics included in the journals and observe their temporal development, an analysis based on Reinert's method (1983) was conducted. Topics can be defined as "lexical worlds" (Reinert, 1993), i.e. groups of words referring to a class of meaning. The result, obtained through a hierarchical descending classification, is a dendrogram that groups units into classes mirroring a similar lexical context. Textual data were processed with the dedicated software Taltac2, and statistical analyses were conducted with the SPAD, IRaMuTeQ and R software packages. (A sketch of how the words-by-years matrix underlying the CA can be assembled follows.)
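Assembling the words-by-years contingency tables described above is straightforward; the sketch below (function and variable names are ours) builds such a matrix from a list of (year, abstract text) pairs, on which a CA can then be run with any standard implementation:

```python
from collections import Counter, defaultdict

def words_by_years(records):
    # records: list of (year, abstract_text) pairs.
    counts = defaultdict(Counter)          # word -> {year: frequency}
    for year, text in records:
        for token in text.lower().split():
            counts[token][year] += 1
    years = sorted({y for y, _ in records})
    words = sorted(counts)
    # Dense contingency table: one row per word, one column per year.
    matrix = [[counts[w][y] for y in years] for w in words]
    return words, years, matrix

records = [(1971, "attribution in experiments"), (2012, "migration and identity")]
words, years, matrix = words_by_years(records)
```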
3. Results
By means of CA we can observe a clear-cut temporal dimension in both corpora (Figure 1). The keywords that contributed most to the factorial solution show which concepts typify each time span.

Figure 1 - First factorial plane of the correspondence analysis of EJSP (left side) and JPSP (right side). Projection of years.

In the EJSP (Figure 1, left side), the first period (1971-1990) is strongly characterised by words that refer to experimental design. This is the period mainly concerned with the study of aggression, risk taking, dissonance, and attribution theory. The keywords of the subsequent period (the Nineties) seem to be related to social change, characterised by the study of social influence, categorization, and words referring to Moscovici's and Tajfel's theories, which marked the European production: social representations, minority influence and the minimal group paradigm. In the following years (2000s) attention turned to the self, ingroup/outgroup relations and social cognition, with the study of stereotypes, emotions, motivation, agency/communion, and so on. In recent years (2011-2016), mainly social issues (e.g., gender, migration, environment, religion) and everyday life concerns are highlighted. As regards the JPSP (Figure 1, right side), in the first decade considered (1965-1976) the main contribution is given by words such as reinforcement, verbal reinforcement, conditioning, and so on, which together refer to behaviourism. At the same time, we can observe the occurrence of words pertaining to game theories and conflict/cooperation, as well as to aggression and dissonance theory. Physiological measurements (e.g., heart rate) and experiments (experimental) are also visible. The second period runs from the late Seventies to the late Eighties; its distinctive words are masculinity/femininity and other terms that evoke motivational theories. Moreover, the presence of words related to personality is evident, and it becomes stronger in the following period, the Nineties, in which mood, personality, individual differences, memory and the self represent the main contribution; issues about gender and women are also noteworthy. The last period starts from the 2000s and shows many references to explicit/implicit and intimate relationships. Moreover, specific words about positive psychology (life satisfaction, goal pursuit, and so on) and culture (cultural, culture) are relevant.

The analysis conducted by means of Reinert's method reveals the presence of nine different lexical worlds in EJSP (79.64% of the abstracts were classified; Figure 2).

Figure 2 - EJSP classes and their distributions over years (unsupervised clustering method).

Following the order of the classes from the bottom to the top of Figure 2, a brief outline of their contents is provided below. Class 1 (red) concerns attribution and methodological issues (e.g., method, statistical, model). Class 9 (fuchsia) contains words related to impression formation, categorisation and stereotypes. Both of these classes show decreasing trends without disappearing. Class 6 (light blue) mainly includes words related to gender studies and implicit measures (e.g., prime, IAT). Class 5 (water blue) concerns moods and regulatory focus theory. These two classes show increasing trends. Class 8 (purple) concerns studies on aggression (mainly comparing male/female subjects involved in an experiment). This class was initially hegemonic in the field and then disappeared over time. Class 7 (blue) includes game theories and studies on cooperation and competition, and shows a decreasing trend. Class 2 (orange) concerns politics and culture (mainly cross-cultural studies) and is an ever-present topic, as is Class 4 (green), which concerns social identity theory and ingroup/outgroup dynamics. Class 3, which concerns the applications of that theory (e.g., migration), shows a clearly increasing trend. As regards JPSP, the analysis shows the presence of eleven clusters (76.08% of the abstracts were classified; Figure 3). Following the order of the classes from the bottom to the top of Figure 3: Class 7 (light blue) concerns consensus formation and attribution, and seems to be an ever-present topic. Class 6 (water blue) contains processes regarding memory, stereotypes and categorisation, and is particularly recurrent in the Nineties and 2000s. Class 3 (grey) contains studies on the self, emotion and motivation and shows a clearly increasing trend, becoming one of the most relevant topics nowadays. Classes 11 (fuchsia), 10 (lilac) and 1 (red) concern, respectively, studies on aggression and physiological measurements, on dissonance and opinion change, and on males and females involved in experimental studies. They were predominant in the first years considered and then disappeared. Class 9 (purple) concerns culture (mainly comparing Western and Eastern ones) and politics; it shows an increasing trend, although it is not among the main topics nowadays. Class 2 (orange) includes words regarding measurements and their validity (e.g., scale, reliability, test-retest) and shows a stable trend. Class 8 (blue) contains words related to interpersonal differences (based on gender, or studied through twin studies); it seems to remain constant, even if with a slightly decreasing trend. Class 5 (water green) is represented by words concerning health (mental and physical) and how to cope with related problems. Class 4 (green) concerns romantic and couple relationships. Both of these classes show increasing trends.

Figure 3 - JPSP classes and their distributions over years (unsupervised clustering method).
4. Discussion and conclusions
The aim of the present study is to compare American and European social psychology, offering food for thought on the contribution of the methods used to studying the histories of a discipline. Thanks to these preliminary results, we succeeded in highlighting the history of a discipline from the particular point of view of its actual scientific production. In the first years considered, some similarities between the contents tackled in the two journals can be noticed (e.g., dissonance theory and aggression). The main differentiation that emerged concerns the stronger attention to the individual and personality in JPSP, on the one hand, and the different impact of Tajfel's and Moscovici's contributions on the psychology of groups and of Moscovici's work on social representations, on the other. This was particularly evident in the '80s and '90s. The predominant approach of social cognition seems to be a common feature, as do methods and research designs that mainly refer to the experimental method, and topics concerning cross-cultural studies and politics. As regards the topics identified, some common publication trajectories were highlighted: for example, Class 8 in EJSP and Class 11 in JPSP, concerning studies on aggression, which were predominant in the first decades and later declined; Class 1 in EJSP and Class 7 in JPSP, concerning studies on attribution; and Class 2 in EJSP and Class 9 in JPSP, related to culture and politics. Similar contents but different trajectories are shown by Class 9 in EJSP and Class 6 in JPSP. The main difference between the journals is observed in the JPSP classes concerning personality, health, coping, and romantic and couple relationships (8, 5, 4), and the EJSP classes concerning ingroup/outgroup processes, intergroup contact and applied concerns (4, 3). It is worth recalling the core of the difference between American and European social psychology: the attention to the individual in the American tradition and to the social in the European one. That difference manifests itself as a greater attention to social issues in EJSP and to individual-related studies (e.g., interpersonal relations, personality) in JPSP. Two histories of publications in social psychology have been traced, one North American and the other European. Their differentiation is historically well known in the community, but empirical works contributing to that debate are fewer. This is an example of the contribution that the quantitative analysis of textual data can provide to the study of the history of a discipline, also known as digital history.

References
Allport, F. (1924). Social Psychology. Boston, MA: Houghton Mifflin.
Baumeister, R. F., & Vohs, K. D. (2007). Encyclopedia of Social Psychology. Thousand Oaks, CA: Sage.
Lebart, L., Salem, A., & Berry, L. (1998). Exploring Textual Data. Dordrecht: Springer. doi:10.1007/978-94-017-1525-6
Manstead, A. S., Hewstone, M. E., Fiske, S. T., Hogg, M. A., Reis, H. T., & Semin, G. R. (1995). The Blackwell Encyclopedia of Social Psychology. Blackwell Reference/Blackwell Publishers.
Moscovici, S. (1999). Ringraziamento. In Laurea Honoris Causa in Psicologia a Serge Moscovici. Università degli studi di Roma "La Sapienza": Centro Stampa d'Ateneo.
Moscovici, S., & Markova, I. (2006). The Making of Modern Social Psychology. Cambridge: Polity.
Pavone, P. (2010). Sintagmazione del testo: una scelta per disambiguare la terminologia e ridurre le variabili di un'analisi del contenuto di un corpus. In S. Bolasco, I. Chiari, & L. Giuliano (Eds.), Statistical Analysis of Textual Data: Proceedings of the 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, pp. 131-140. LED.
Ratinaud, P. (2014). Visualisation chronologique des analyses ALCESTE : application à Twitter avec l'exemple du hashtag #mariagepourtous. In Actes des 12es Journées internationales d'Analyse statistique des Données Textuelles. Paris Sorbonne Nouvelle-Inalco.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2), 187-198.
Reinert, M. (1993). Les « mondes lexicaux » et leur « logique » à travers l'analyse statistique d'un corpus de récits de cauchemars. Langage & Société, 66, 5-39.
Trevisani, M., & Tuzzi, A. (2015). A portrait of JASA: The history of Statistics through analysis of keyword counts in an early scientific journal. Quality and Quantity, 49, 1287-1304.
Trevisani, M., & Tuzzi, A. (2018). Learning the evolution of disciplines from scientific literature: A functional clustering approach to normalized keyword count trajectories. Knowledge-Based Systems, 146, 129-141.

"Frequent itemsets" as descriptors of textual documents

Louis Rompré (1), Ismaïl Biskri (2)
(1) Université du Québec à Trois-Rivières – rompre.louis@courrier.uqam.ca
(2) Université du Québec à Trois-Rivières – ismail.biskri@uqtr.ca

Abstract
Automated classification is one of the preferred approaches applied to the problem of organizing information. The classification process is based on the identification and evaluation of descriptors that characterize the information. It is usually necessary to discover them through an analysis of the raw data, and words are generally the units considered in this analysis. In this paper, we propose to use frequent itemsets as descriptors. We present how they can be identified and used to define a level of similarity between texts. The experiments conducted demonstrate the potential of the proposed approach for defining similarity between texts and linking news items broadcast on the web.

Keywords: Classification, Frequent itemset, Descriptor, Document, Text.

1. Introduction
The digitalization of documents has facilitated the diffusion of information. As soon as an event occurs, multiple articles are written and spread across the various digital platforms. Many textual documents published on the web consist of only a few hundred words, and a rich description can be obtained only by consulting several documents. Different documents may address the same subject, and each of them is likely to contain complementary information.
However, the quantity of available data and its lack of structure limit our capacity to capture this information, hence the need for tools facilitating access to it. Automatic classification is one of the strategies applied to the problem of organizing information. A classification process applied to textual documents, whether automated or not, organizes the documents so that those sharing similarities are grouped together. The resulting organization can be used to guide, for example, information retrieval, knowledge extraction, summarization support, and so on. Many automatic classifiers have been published. Comparing these classifiers to determine their performance is a complex and, above all, subjective task: a classifier may perform well on one particular set of documents and generate noisy classes on another. The pertinence of a classification is judged by the homogeneity of the resulting classes, but this criterion is relative, since a class is examined in the light of the examiner's research objectives and knowledge of the domain. The desired quality of an automated classification system is its ability to target the relevant information within the documents and to determine how this information can be used to establish a level of similarity between them. Numerical classification rests on the identification and evaluation of descriptors that make it possible to distinguish one class from another. Choosing one descriptor over another amounts to taking a position on the nature of the generated results: it influences the behaviour of the classifier, because the presence or absence of a descriptor is a clue for identifying the class to which a document belongs. For textual classification, the word is often used as the discriminating descriptor (McCallum and Nigam, 1998). When several words appear at comparable frequencies in two documents, the documents are considered similar. However, documents commonly share a large number of words even when they deal with different subjects. The mere presence of these words therefore carries little information, and its usefulness for establishing the level of similarity between documents is limited. Nevertheless, the relations these words maintain with others can bring to light particularities specific to certain documents, and these relations can be used to establish the level of similarity between documents.

2. Association rules
The recent development of association rules stems from Agrawal's work on knowledge extraction from transactional data (Agrawal et al., 1993). Agrawal proposed to extract relations between items that co-occur in commercial transactions: for example, customers who buy items x and y also buy item z. Since then, the approach has been transposed to other domains, association rules being applicable wherever the concept of transaction can be defined. Let $T = \{t_1, t_2, \ldots, t_n\}$ be a set of transactions; the elements that compose the transactions are called items. An item is a piece of data whose nature depends on the domain at hand.
For example, items can correspond to descriptors extracted from a piece of music (Rompré et al., 2017), to descriptors extracted from an image (Alghamdi et al., 2014), or simply to words extracted from a text (Zaïane and Antonie, 2002). Thus, a transaction can be defined simply as a subset of descriptors. Let $I = \{i_1, i_2, \ldots, i_d\}$ be a set of d distinct items; every subset that can be generated from these items is called an itemset. For a set I of size d, the number of possible itemsets is $2^d$ (Tan et al., 2002), i.e. exponential in the size of I. Since the objective of association rule extraction is to discover hidden relations, there is no prior clue as to which items to consider, so the search space is the set of all possible itemsets. Even if it is theoretically possible to create $2^d$ itemsets from a set of size d, in practice many combinations appear rarely or simply never in the transactions, and can therefore be ignored. Support is the measure used to identify the itemsets to ignore. The support of an itemset X is the percentage of the transactions of T that contain X. It is written $s(X)$ and given by equation 3.1, where n is the total number of transactions in T and $\sigma(X)$ is the raw support. The raw support of an itemset is the number of transactions of T that contain X, as given by equation 3.2 (a direct transcription in code is given below):

$s(X) = \frac{\sigma(X)}{n}$  (3.1)

$\sigma(X) = |\{t_j \in T \mid X \subseteq t_j\}|$  (3.2)

An itemset is considered frequent when its support is greater than or equal to a predetermined threshold. Let X and Y be two frequent itemsets such that $X \cap Y = \emptyset$; an association rule, written $X \rightarrow Y$, expresses a co-occurrence relation between these itemsets. By convention, the first term is called the antecedent and the second the consequent. An association rule is judged to be of quality according to a previously fixed interest measure m and minimal threshold: a rule $X \rightarrow Y$ is of quality if $m(X \rightarrow Y)$ reaches the threshold. The quantity of rules generated, their pertinence and their usefulness depend strongly on the chosen measures and minimal thresholds. The evaluation of interest measures for association rules has been the subject of several studies (Le Bras et al., 2010; Geng and Hamilton, 2006; Tan et al., 2002). Although several variants exist, the extraction of association rules is generally performed with the Apriori algorithm (Agrawal and Srikant, 1994) or FP-Growth (Han et al., 2000); other algorithms are presented in Fournier-Viger et al. (2017). The two main difficulties in extracting association rules are memory management and the computational effort required to search for frequent itemsets; controlling the number of items to consider remains the best way to address them. For two decades, a number of studies have applied association rules to classification (Liu et al., 1998; Zaïane and Antonie, 2002; Bahri and Lallich, 2010). The classifiers resulting from this work produce results that rival those obtained with other approaches such as decision trees (Mittal et al., 2017). The main advantage of association-rule-based classifiers is that the knowledge they exploit to guide the classification process can easily be interpreted.
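The following is a direct Python transcription of equations (3.1) and (3.2), assuming transactions are represented as Python sets (function names and the toy transactions are ours):

```python
def raw_support(itemset, transactions):
    # sigma(X): number of transactions containing every item of X (eq. 3.2).
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # s(X) = sigma(X) / n (eq. 3.1).
    return raw_support(itemset, transactions) / len(transactions)

T = [{"nucléaire", "pyongyang"}, {"nucléaire", "corée"}, {"nucléaire"}]
print(support({"nucléaire"}, T))               # 1.0
print(support({"nucléaire", "pyongyang"}, T))  # ~0.33
```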
Thus, a classifier that exploits association rules can be used to identify relevant descriptors. The approaches proposed in the literature generally involve rules of the form $X \rightarrow c_k$, where X is a set of descriptors and $c_k$ a similarity class. Documents are treated as transactions, while descriptors (keywords, word frequencies, etc.) and classes are treated as items. Given a set of descriptors $D = \{d_1, \ldots, d_m\}$ and a set of labels $C = \{c_1, \ldots, c_k\}$ representing the classes, a set of documents can be represented as transactions combining descriptors and class labels. This representation implies that the similarity classes to which the documents belong are known beforehand: a training set is built and used to train the classifier, and the association rules extracted during the training phase are then used to predict the class of new documents. This process generally requires considerable effort, and the results depend on the set used to train the classifier.

3. Methodology
Like association-rule-based classifiers, our approach exploits frequent itemsets to describe documents; however, it requires no training phase. Frequent itemsets are extracted from each document and compared, and the degree of similarity between two documents is a function of the number of frequent itemsets they share. The hypothesis behind this approach is that when words frequently co-occur within the sentences of a text, those words are representative of that text. Thus, by considering a few frequent itemsets, it is possible to identify the specific themes addressed in the documents. The proposed approach comprises four steps (sketched in code below).

The first step segments the documents to prepare them for the extraction of frequent itemsets. Documents are treated as sets of transactions in which the sentences are the transactions and the words the items. The number of distinct words liable to appear in a set of textual documents is theoretically of the order of the vocabulary size of the language in which they are written; the Office québécois de la langue française estimates the French vocabulary at more than 500,000 words. Since 500,000 words could in principle generate $2^{500000}$ itemsets, conditions must be imposed on the input texts to control the number of words. As the diversity of a lexicon grows with the size of a text, we limit the input texts to a few thousand words.

The second step reduces the number of items, and hence the search space, for the extraction of frequent itemsets. Words judged to carry little information are removed from the transactions, using a list of 502 stop words; digits and punctuation marks are also removed.

The third step extracts the frequent itemsets, using the Apriori algorithm. An effort is made to obtain a restricted number of frequent itemsets: the search is performed iteratively, with the minimum support initially set to a high value. When the number of extracted frequent itemsets is lower than 10, the minimum support is decreased by 0.1; the process stops when the number of itemsets obtained exceeds 10 or the minimum support falls below 0.1.

The last step establishes the degree of similarity between documents: the frequent itemsets describing the documents are compared, and the more itemsets two documents share, the more similar they are judged to be.
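A minimal sketch of steps 3 and 4, in which a naive miner (adequate only for short texts) stands in for the authors' Apriori implementation; the iterative lowering of the minimum support follows the procedure described above, and all names are ours:

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup, max_size=3):
    # Naive miner: enumerate candidate itemsets of 2..max_size items
    # and keep those whose support reaches minsup (Apriori would prune).
    n = len(transactions)
    items = sorted(set().union(*transactions))
    found = set()
    for k in range(2, max_size + 1):
        for cand in combinations(items, k):
            s = sum(1 for t in transactions if set(cand) <= t) / n
            if s >= minsup:
                found.add(frozenset(cand))
    return found

def describe(transactions, start=0.9):
    # Step 3: lower minsup by 0.1 until > 10 itemsets or minsup < 0.1.
    minsup = start
    while minsup >= 0.1:
        itemsets = frequent_itemsets(transactions, minsup)
        if len(itemsets) > 10:
            return itemsets
        minsup = round(minsup - 0.1, 1)  # avoid float drift
    return itemsets

def similarity(desc_a, desc_b):
    # Step 4: documents sharing more frequent itemsets are more similar.
    return len(desc_a & desc_b)
```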
When the number of frequent itemsets extracted is lower than 10, the minimum support is decreased by 0.1. The process stops when the number of itemsets obtained exceeds 10 or the minimum support falls below 0.1. The last step establishes the degree of similarity between the documents: the frequent itemsets used to describe the documents are compared, and the more frequent itemsets two documents share, the more similar they are judged to be.

4. Experiments and discussion

To evaluate the proposed approach, several experiments were carried out with an application we developed in Python. We used a corpus of about one hundred news articles published on the web. This corpus is distinctive in that it presents the same news items from the angle of different press agencies: it gathers articles from 6 different sources, each containing between 500 and 1,500 words. These articles are perfectly suited to the conditions of the proposed approach. In our experiments, we measured the discriminating power of frequent itemsets, comparing the classifications produced when the descriptors are frequent itemsets with those produced when the descriptors are words. The results suggest that frequent itemsets can be used to refine the description of a class. For example, the word {avions} describes 15% of the articles in the corpus; although these articles all relate to aviation, they deal with 4 different subjects. Our experiments show that using frequent itemsets as descriptors can describe the content of these articles more precisely. Figures 1 and 2 show the precision obtained with frequent itemsets and with words only, respectively. Note that when only words are considered, the similarity classes generated are less homogeneous: articles dealing with subjects other than aviation are included in them.

Figure 1: Precision with frequent itemsets. Figure 2: Precision with words.

Figure 3 shows the similarity matrix produced for the articles dealing with the North Korean crisis. The first column contains the article identifier, the second indicates the subject, and the following columns give the number of frequent itemsets shared by the articles; the diagonal corresponds to the number of frequent itemsets extracted for each article. Figure 3 is representative of the results observed. Fewer than 10 frequent itemsets were extracted for half of these articles; nevertheless, all of them were assigned to the same class.

Figure 3: Similarity matrix of the documents dealing with the North Korean crisis.

Although they deal with the same subject, some articles share few frequent itemsets with the other articles that form the class. This is explained by the lexicon they use. Performance could possibly be improved by adding a lemmatization step; however, some relations remain difficult to establish automatically. For example, document 45 contains the itemsets {nucléaire, pyongyang} and {nucléaire, washington}, while document 46 contains the itemsets {nucléaire, corée} and {nucléaire, américaine}.
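A minimal sketch of steps 3 and 4 follows, using the mlxtend implementation of Apriori in place of the authors' own Python application; the 10-itemset target and the 0.1 support bounds follow the description above, while the function and variable names are illustrative.

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori

    def document_itemsets(sentences, start=0.9, step=0.1, floor=0.1, target=10):
        """Step 3: iteratively lower the minimum support until at least
        `target` frequent itemsets are found or the support floor is reached.
        `sentences` is one document, given as a list of word lists."""
        te = TransactionEncoder()
        onehot = pd.DataFrame(te.fit(sentences).transform(sentences),
                              columns=te.columns_)
        supp = start
        found = apriori(onehot, min_support=supp, use_colnames=True)
        while len(found) < target and round(supp - step, 2) >= floor:
            supp = round(supp - step, 2)
            found = apriori(onehot, min_support=supp, use_colnames=True)
        return set(found["itemsets"])

    def similarity(doc_a, doc_b):
        """Step 4: number of frequent itemsets shared by two documents."""
        return len(document_itemsets(doc_a) & document_itemsets(doc_b))

Computing `similarity` for every pair of documents yields a matrix of shared-itemset counts of the kind shown in Figure 3.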
The results presented here are only a sample of the knowledge extracted with the proposed approach. Besides being easy to interpret, frequent itemsets describe the content of documents more precisely than words alone.

5. Conclusion

We have proposed an unsupervised approach for establishing relations between text documents. The proposed approach relies on frequent itemsets. These descriptors express the co-occurrence of words within the sentences of a text. Frequent itemsets tend to be more discriminating than words alone and can therefore help enhance the description of a class. One advantage of the proposed method is that the results it produces are easy to interpret. The experiments carried out suggest that frequent itemsets, as defined here, are sufficiently informative to establish coherent links between documents. Several applications can be envisaged: among others, the proposed approach could serve as a preprocessing step for navigation between documents, for annotation, for information filtering, etc.

References

Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In Proc. of the SIGMOD Conference on Management of Data, pp. 207-216.
Agrawal, R., and Srikant, R. (1994). Fast algorithms for mining association rules. In Proc. of the 20th International Conference on Very Large Data Bases, pp. 487-499.
Alghamdi, R. A., Taileb, M., and Ameen, M. (2014). A new multimodal fusion method based on association rules mining for image retrieval. In 17th IEEE Mediterranean Electrotechnical Conference (MELECON), pp. 493-499.
Bahri, E., and Lallich, S. (2010). Proposition d'une méthode de classification associative adaptative. In 10èmes Journées Francophones d'Extraction et Gestion des Connaissances (EGC 2010), pp. 501-512.
Fournier-Viger, P., Lin, J. C. W., Vo, B., Chi, T. T., Zhang, J., and Le, H. B. (2017). A survey of itemset mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
Geng, L., and Hamilton, H. J. (2006). Interestingness measures for data mining: A survey. ACM Computing Surveys (CSUR), 38(3): 9.
Han, J., Pei, J., and Yin, Y. (2000). Mining frequent patterns without candidate generation. ACM SIGMOD Record, 29(2): 1-12.
Le Bras, Y., Meyer, P., Lenca, P., and Lallich, S. (2010). Mesure de la robustesse de règles d'association. In QDC 2010.
Liu, B., Hsu, W., and Ma, Y. (1998). Integrating classification and association rule mining. In Knowledge Discovery and Data Mining, pp. 80-86.
McCallum, A., and Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, Vol. 752, pp. 41-48.
Mittal, K., Aggarwal, G., and Mahajan, P. (2017). A comparative study of association rule mining techniques and predictive mining approaches for association classification. International Journal of Advanced Research in Computer Science, 8(9).
Rompré, L., Biskri, I., and Meunier, J.-G. (2017). Using association rules mining for retrieving genre-specific music files. In Proc. of FLAIRS 2017, pp. 706-711.
Tan, P. N., Kumar, V., and Srivastava, J. (2002). Selecting the right interestingness measure for association patterns. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 32-41.
Zaïane, O. R., and Antonie, M. L. (2002). Classifying text documents by associating terms with text categories. In Australian Computer Science Communications, 24(2): 215-222.

Discursive Functions of French Epistemic Adverbs: What can Correspondence Analysis tell us about Genre and Diachronic Variation?

Corinne Rossari, Ljiljana Dolamic, Annalena Hütsch, Claudia Ricci, Dennis Wandel
University of Neuchâtel – corinne.rossari@unine.ch

Abstract
Our aim is to describe the discursive functions of a set of French epistemic adverbs by establishing their combinatory profiles on the basis of their co-occurrence with different connectors. We then compare these profiles using correspondence analysis in order to find evidence of genre and diachronic variation. The use of these adverbs is explored in contexts of informative discourse within two distinctly different genres – contemporary written press and encyclopedic discourse – as well as within two diachronic spans.

Keywords: epistemic adverbs, connectors, co-occurrences, correspondence analysis, genre variation, diachronic variation

1. Introduction
Our aim is to analyze the genre and diachronic variation of the discursive functions of French epistemic adverbs (E-ADV). By discursive function we mean the rhetorical aim of the utterance in which the adverb occurs: counter-argument, argument, or conclusion (cf. Roulet et al., 1991). Our paradigm of E-ADVs consists of the following items: certainement, certes, peut-être, probablement, sans doute and sûrement (the selection is based on Roulet's (1979) paradigm of epistemic assertive adverbs). The functions of these adverbs are explored in contexts of informative discourse within two distinctly different genres: contemporary written press and encyclopedic discourse. The former is represented by three daily newspapers: Le Monde (2008, 20 410 766 tokens), Le Figaro (2008, 10 795 373 tokens) and Sud-Ouest (2002, 29 763 988 tokens). In the latter, we consider two diachronic spans: the 18th century, represented by Diderot & d'Alembert's Encyclopédie (DDA, 29 940 181 tokens), and the 21st century, represented by the 2005 edition of the Encyclopédie Universalis (UNI, 49 859 864 tokens) and by a random sample of the 2015 version of Wikipédia (WIKI, 50 396 345 tokens). We first proceed to an analysis based on the combinatory profile of each E-ADV (section 2) in our corpus of contemporary written press; then, after having pinpointed what such an analysis can and cannot show, we use a more holistic approach based on correspondence analysis (section 3).

2. Analysis of Combinatory Profiles
In order to identify the discursive functions of the E-ADVs considered here, we searched for connectors (C) specifically co-occurring with each of these E-ADVs within a 20-token span. We chose a 20-token span rather than a sentence span because a connector's combinatory profile can go beyond sentence boundaries. We define connectors as linguistic forms linking segments of discourse. Such a functional category is not part of the tagset of the platform we used. We therefore made our query by searching for three different categories – adverbs, subordinating conjunctions and coordinating conjunctions – and then manually filtered the resulting forms, keeping those which proved to function as connectors.
For all our sub-corpora, each of these adverbs is thus assigned a series of specific connectors within constructions of the type "E-ADV…C1/C2/Cn" and "C1/C2/Cn…E-ADV", which represent its discursive combinatory profile (we adapt the term combinatory profile used by Blumenthal et al. (2005) and Blumenthal (2008; 2012)). We call each sequence within a combinatory profile a discourse movement, as we consider it to have specific, rhetorically motivated discursive aims. These aims (mentioned in section 1) are signaled by the connectors co-occurring specifically with an E-ADV: néanmoins and mais signal that the utterance preceding them is a counter-argument to the utterance they introduce; donc and finalement signal that the utterance they introduce is a conclusion; car and parce que signal that the utterance they introduce is an argument in favor of the utterance preceding them. Tables 1-3 show the discursive combinatory profiles in the three sub-corpora of contemporary press (Le Monde 2008; Le Figaro 2008; Sud-Ouest 2002). The significance of each co-occurrence of a connector with an E-ADV is calculated using log-likelihood (LL); although LL can be calculated directly on the BTLC platform, we used the platform to extract the corresponding frequencies and calculated the LL with R. All the corpora used were supplied by the platform BTLC (Base Textuelle Lexicostatistique de Cologne), conceived by Sascha Diwersy (Diwersy, 2014), and were constituted within the French-German projects Presto (http://presto.ens-lyon.fr) and Emolex (http://emolex.u-grenoble3.fr).

Tables 1-3: Log-likelihood scores (threshold: 10.83; all scores equal to or above the threshold are marked in bold). Each table crosses the connectors (car, donc, finalement, mais, néanmoins, parce que) with the E-ADVs (certainement, certes, peut-être, probablement, sans doute, sûrement), giving the LL score and raw co-occurrence frequency for the connector occurring to the left (L) and to the right (R) of the adverb, for Le Monde (2008), Le Figaro (2008) and Sud-Ouest (2002) respectively. [The full score matrices are not reproduced here.]
The data lead to the following observations. (i) Although the E-ADVs belong to the same semantic class, each has its own specific combinatory profile. (ii) Certain E-ADVs share comparable combinatory profiles: sans doute and peut-être share an almost identical set of specific connectors; more frequently, several E-ADVs essentially share only one or more specific connectors (for instance the connector mais for certainement, sûrement, peut-être and sans doute). (iii) Certain E-ADVs stand out for their unique combinatory features: certes is almost exclusively associated with mais, but only with mais_R, and with a notably higher log-likelihood score than the other E-ADVs; probablement is also associated with only a few connectors, but with a low log-likelihood score, close to the threshold of 10.83. (iv) There is homogeneity in the significant associations for each E-ADV across the three sub-corpora of contemporary press. However, preceding studies – Rossari et al. (2016) and Rossari & Salsmann (2017) – show that the E-ADVs' combinatory profiles vary across genres and diachronic periods: contrary to what is observed for the press genre, in DDA and UNI the association peut-être…mais is less significant than the association mais…peut-être. For instance, in DDA, no significant association certes…mais is observed, while the association sans doute…mais in the same corpus proves to be highly significant. The analysis of combinatory profiles (based on the log-likelihood significance measure; cf. Blumenthal et al., 2005) allows for one-to-one comparison of the different sequences of the type E-ADV…C and C…E-ADV. Thus, the associations of each E-ADV with each connector can easily be compared across corpora representing different newspapers, but also across different genres and diachronic periods. It is also possible to compare the associations of different E-ADVs with one or a few connectors. However, this method has certain insufficiencies when it comes to comparing all of these variables simultaneously in a holistic view: it never takes into account all variables at the same time (e.g. frequencies, log-likelihood scores, the paradigm of E-ADVs, the paradigm of connectors). Moreover, using a threshold (in our case 10.83) to decide whether an association is significant is useful for traditional collocation analysis.
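For readers who wish to reproduce the scoring, here is a minimal Python sketch of a signed log-likelihood ratio (Dunning's G2) for one E-ADV–connector pair. The 2×2 counting scheme is a standard collocation setup and an assumption on our part, since the paper only states that frequencies were extracted from BTLC and the LL computed in R; the counts in the example are hypothetical.

    import math

    def signed_log_likelihood(k11, k12, k21, k22):
        """Dunning's G2 for a 2x2 contingency table:
        k11 = spans containing both the E-ADV and the connector,
        k12 = spans with the E-ADV only, k21 = connector only, k22 = neither.
        The sign is negative when the pair co-occurs less often than
        expected, matching the negative scores reported in Tables 1-3."""
        n = k11 + k12 + k21 + k22
        observed = (k11, k12, k21, k22)
        expected = ((k11 + k12) * (k11 + k21) / n,
                    (k11 + k12) * (k12 + k22) / n,
                    (k21 + k22) * (k11 + k21) / n,
                    (k21 + k22) * (k12 + k22) / n)
        g2 = 2 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)
        return g2 if k11 >= expected[0] else -g2

    # Hypothetical counts for one certes ... mais pair in one sub-corpus:
    print(round(signed_log_likelihood(979, 964, 50565, 500000), 2))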
But our goal is also to represent the use of each E-ADV in its typical discourse movements, in contrast to its non-typical discourse movements. It thus seems counterproductive that all sequences (E-ADV…C / C…E-ADV) which are not statistically significant for certain E-ADVs are not taken into account when establishing their combinatory profiles, since these non-significant cases play an important role in characterizing the overall use of the E-ADVs and connectors. In order to allow for a holistic approach, we propose to use correspondence analysis (CA) (Greenacre, 2017).

3. Correspondence Analysis (CA)
The correspondence analysis presented in this section was performed using the R software and the package "ca" (Nenadić & Greenacre, 2007). (1) In DDA, representing the 18th century, certes has a use which stands out: certes left and right of mais differs clearly from all other E-ADVs as to its associations with the connectors. Certes is not typically used with any other connector analyzed and, most importantly, its association is not stronger with mais on its right than with mais on its left. Conversely, in all other five sub-corpora (encyclopedic and press corpora), which represent the 21st century, there is an important difference between the use of certes right and left of mais: while certes_L is strongly linked to mais, certes_R is not. (2) In all six sub-corpora, mais appears to be opposed to all other connectors when it comes to its associations with E-ADVs. Its central position appears to be linked to its high frequency, indicating its high contribution to the horizontal axis; this is confirmed by the correspondence analysis indicators. (3) An association between sans doute_L and parce que can be observed in DDA and WIKI, whereas in UNI the adverb and the connector appear to be in the opposite relation. This behavior indicates that variation has to be expected even within the encyclopedic sub-corpus, based on at least two parameters: on the one hand, the diachronic parameter is involved in some discursive uses of E-ADVs – certes_L and certes_R show no difference as to their association with mais in DDA, consistently with the different meaning of certes at that time, whereas only certes_L is associated with mais in all other sub-corpora; on the other hand, some convergence between DDA and WIKI could be interpreted as showing similarities in writing style. (4) The results of the correspondence analysis show that in all sub-corpora of one particular genre, in most cases, the same E-ADVs are strongly associated with the same connector or group of connectors (donc and finalement; car and parce que; mais); this phenomenon is particularly pronounced in the sub-corpora representing the written press. The connector mais differs the most from the other connectors in the strength of its associations: although mais is associated with most E-ADVs, its association is strong with only a few of them in all sub-corpora (certes_L being the only constant), while most other connectors have a higher number of strong associations. This indicates that certain discourse movements (such as E-ADV…car / parce que) seem to be rather regular, whereas certes…mais proves to be a special association, although only in the 21st-century corpora. (5) The behavior of néanmoins in the Figaro 2008 corpus should be interpreted with caution, since the two axes describe only 10% of its variation.
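The paper performs CA with the R package "ca"; as a rough illustration of what that computation does, here is a NumPy sketch of basic correspondence analysis applied to a contingency table crossing E-ADV_L/R profiles (rows) with connectors (columns). The table N and all names are placeholders, not the authors' data.

    import numpy as np

    def correspondence_analysis(N):
        """Basic CA of a contingency table N: returns the principal row
        and column coordinates on the first two axes and the share of
        inertia (variation) each axis describes."""
        P = N / N.sum()                       # correspondence matrix
        r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses
        S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
        U, sv, Vt = np.linalg.svd(S, full_matrices=False)
        rows = (np.diag(r ** -0.5) @ U) * sv      # e.g. certes_L, certes_R, ...
        cols = (np.diag(c ** -0.5) @ Vt.T) * sv   # e.g. mais, donc, car, ...
        inertia = sv ** 2 / (sv ** 2).sum()
        return rows[:, :2], cols[:, :2], inertia[:2]

The `inertia` values correspond to the share of variation described by each axis, the quantity invoked in observation (5) above.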
4. Perspectives
Our first attempt to use correspondence analysis to study different discourse movements has provided promising results regarding the genre and diachronic variation of the discursive functions of French epistemic adverbs. We intend to extend our analysis in three directions. First, we would like to enlarge our corpora to see whether this allows us to extend the paradigm of connectors, so as to give a better overview of the different discourse movements that exist and to better represent the different discursive functions of the E-ADVs that we have found. It would be especially interesting to cover different diachronic spans of the press, allowing for a study of possible changes within this specific genre. Likewise, other text types may be considered in order to better represent possible variation between genres. Second, through the comparative analysis of the discursive combinatory profiles of each E-ADV, we aim to identify regularities concerning the rhetorical purpose of the sequence in which the E-ADV typically occurs, by understanding its motivation. For instance, beyond the difference between a counter-argument, an argument and a conclusion, there is a more fundamental difference between discourse movements used with the rhetorical aim (i) of presenting a content as being in the discursive background (when the E-ADV is followed by mais), (ii) of introducing a content which the speaker considers most relevant (when the E-ADV is introduced by mais or donc), and (iii) of adding evidence to a relevant content (when the E-ADV follows car or parce que). Third, in order to confirm the reliability and precision of the positions on the correspondence analysis planes, we intend to apply bootstrap validation (Lebart, 2010).

Figures 1-6: Correspondence analysis scatter plots for the six corpora.

References
Blumenthal P. (2008). Combinatoire des prépositions : approche quantitative. Langue française, 157: 37-51.
Blumenthal P. (2012). Particularités combinatoires du français en Afrique : essai méthodologique. Le français en Afrique, 27: 55-74.
Blumenthal P., Diwersy S. and Mielebacher J. (2005). Kombinatorische Wortprofile und Profilkontraste. Berechnungsverfahren und Anwendungen. Zeitschrift für romanische Philologie, 121: 49-83.
Diwersy S. (2014). Corpus diachronique de la presse française : base textuelle créée dans le cadre du projet ANR-DFG PRESTO. Institut des Langues Romanes, Université de Cologne.
Greenacre M. J. (2017). Correspondence Analysis in Practice. 3rd ed. Boca Raton: Chapman.
Lebart L. (2010). Validation techniques for textual data analysis. Statistica Applicata - Italian Journal of Applied Statistics, 22(1): 37-51.
Nenadić O. and Greenacre M. J. (2007). Correspondence Analysis in R, with two- and three-dimensional graphics: The ca package. Journal of Statistical Software, 20(3): 1-13.
Rossari C., Hütsch A., Ricci C., Salsmann M. and Wandel D. (2016). Le pouvoir attracteur de mais sur le paradigme des adverbes épistémiques : du quantitatif au qualitatif. In Mayaffre D. et al. (eds), Proceedings of the 13th International Conference on Statistical Analysis of Textual Data, II: 819-823.
Rossari C. and Salsmann M. (2017). Étude quantitative des propriétés dialogiques des adverbes épistémiques. Actes des 9èmes Journées Internationales de la Linguistique de corpus: 87-93.
Roulet E. (1979). Des modalités implicites intégrées en français contemporain. Cahiers Ferdinand de Saussure, 33: 41-76.
Roulet E., Auchlin A., Moeschler J., Schelling M. and Rubattel C. (1991). L'articulation du discours en français contemporain. 3rd ed. Bern: Lang.

Misleading information in online propaganda networks

Vanessa Russo, Mara Maretti, Lara Fontanella, Alice Tontodimamma
D'Annunzio University of Chieti-Pescara – russov1983@gmail.com, mara.maretti@unich.it, lara.fontanella@unich.it, alicetontodimamma@gmail.com

Abstract
Nowadays, the spread of inaccurate, false or misleading information over the digital space is amplified by the increasing use of social networks and social media. In many cases, misleading information can be linked to propaganda activity aimed at supporting offline organizations. In such cases, online pages conveying unintentionally (misinformation) or intentionally (disinformation) inaccurate information are embedded in a network system composed of political and ideological advertising. In this paper, we discuss the different structures of the online networks linked to official pages of different political parties. The networks were identified through Social Network Analysis.

Keywords: misinformation, disinformation, propaganda activity, Social Network Analysis

1. Background: misinformation and disinformation online
The development of the digital space has brought a new form of web-mediated communication, which can be defined by the following main features. Web communication can be thought of as a participative act: it is part not of a broadcast system (McLuhan, 1962) but of a networkcast system. A web content generates connections, denoted as "affinity networks" (Rainie and Wellman, 2012; Castells, 2000), based on the sharing of a given content. In this network system, web communication yields temporary consensus areas based on alliances between users with respect to the shared contents. Moreover, web communication favors a mobilization of skills that generates new paths of social action and collective projects (Lévy, 2002). In the digital space, content validity relies on the activism and interest of digital users, and every opinion "has citizenship rights" (Quattrociocchi and Vicini, 2016; Mocanu et al., 2015). In this framework, misinformation and disinformation processes share the previous characteristics. Furthermore, the accidental or deliberate propagation of false information is strictly linked to a "loss of disintermediation" (Jenkins, 2006). According to this theory, one of the most important effects of web-mediated communication is the loss of traceability of official information sources.
In fact, phenomena like Wikipedia, social media sites or news blogs produce a culture of unofficial knowledge, creating a virtuous circle of free sources on the one hand, and a vicious circle of misleading information on the other. Disinformation and misinformation processes can both be related to fake news and hate speech. "Fake news" or "junk news" refers to web sources that are completely invented or simply distorted. In the digital space, anyone can gain access to different information sources and can also create information content with low costs and high distribution potential. Furthermore, the propagation of fake news can develop into a viral system, dominated by the high sharing power of a few recurring themes. The hate speech phenomenon is usually linked to the sharing and commenting of fake news. The Web 3.0 era is permeated by hatred, mainly directed at immigrants, political parties and homosexual people. Although hater activity concerns specific themes, it has become a fundamental part of the redefinition of the digital public sphere (Lévy, 2002).

2. Research Design and Methodology
Online disinformation and misinformation have become a propaganda activity supporting offline organizations: in many cases, online fake news and hate speech are contained within a network system consisting of political and ideological advertising. This tendency gained particular attention during Trump's election campaign (Ott, 2017). The Computational Propaganda Research Project, promoted by Oxford University, aims at investigating «how tools like social media bots are used to manipulate public opinion by amplifying or repressing political content, disinformation, hate speech, and junk news». Woolley and Howard (2017), mapping computational propaganda in different countries, analyzed tens of millions of posts on seven different social media platforms, referring to elections, political crises and national security incidents. Each case study takes into account qualitative, quantitative and computational evidence collected between 2015 and 2017. In this framework, following a computational approach (Lazer et al., 2009), our research aims at identifying and comparing propaganda policy networks. For this purpose, we investigated the networks in which different political Facebook Like pages are embedded. More specifically, we selected the following Facebook Like pages related to political institutional information: "Ricostruiamo il centro destra" (centre-right wing), "Di Battista Alessandro" (Five Star Movement) and "Partito Democratico" (centre-left wing). Exploiting Social Network Analysis and focusing on each of the chosen pages, we detected the online networks. The adjacency matrices were built considering the "likes" as links. The analysis was implemented using the free and open-source NodeXL extension of the Microsoft Excel spreadsheet (Hansen et al., 2011). For each network, we present the centrality measures, which describe the extent to which a particular vertex can be said to be in the "middle" of the network. In particular, betweenness centrality measures how often a given vertex lies on the shortest path between two other vertices. Vertices with high betweenness may have considerable influence within a network by virtue of their control over information passing between others. As pointed out by Hansen et al.
(2011), these measures can be thought of as a kind of "bridge" score: a measure of how much removing a node would disrupt the connections between the other vertices in the network. Closeness centrality captures the average distance between a vertex and every other vertex in the network; NodeXL implements the inverse of the average distance, so that higher closeness values indicate more central vertices. The eigenvector centrality metric takes into consideration not only how many connections a vertex has (i.e., its degree), but also the degree of the vertices it is connected to: a node with few connections can have a very high eigenvector centrality if those few connections are themselves very well connected. These centrality measures allowed us to identify the most relevant nodes of each network. The identified Facebook Like pages were classified into "official pages" and "junk pages" according to their contents. Junk information is strictly linked to so-called post-truth politics, meaning a political culture in which truth is no longer significant or relevant and «objective facts are less influential in shaping public opinion than appeals to emotion and personal belief» (Oxford Dictionaries, 2016). In this context, the term junk information refers to fake news, conspiracy theories, hate speech, misinformation and deliberately misleading disinformation. Accordingly, Facebook Like pages containing posts, comments or images conveying this kind of information were classified as "junk pages". It is worth noticing that in the identified networks we did not retrieve hybrid forms, that is, pages composed of both official and junk contents.

3. Preliminary results
The network built from the Facebook Like page "Ricostruiamo il centro destra" is depicted in Figure 1. This social media network, linked to a centre-right political view, is composed of 159 nodes, comprising both institutional and junk pages (e.g. "unitaliasenzacomunisti", "SapereEundovere"). The centrality values, given in Table 1 for the pages with the highest levels of betweenness centrality, highlight a connection between junk and institutional nodes; furthermore, the influence of junk pages in the network is very pronounced.

Figure 1: NodeXL social media network diagram of relationships derived from the Facebook Like page "Ricostruiamo il centro destra".

Table 1: Network derived from the Facebook Like page "Ricostruiamo il centro destra": centrality measures for the vertex pages with the highest levels of betweenness.

Vertex | Betweenness | Closeness | Eigenvector
ricostruiamocentrodestra | 22644.000 | 0.004 | 0.009
unitaliasenzacomunisti | 10986.000 | 0.003 | 0.009
SapereEundovere | 10044.000 | 0.003 | 0.000
radionewsinformazionelibera | 1087.000 | 0.002 | 0.000
italianinonsonorazzistisonostanchidiquestainvasione | 777.000 | 0.002 | 0.000

A similar situation was detected for the Five Star Movement. This network, represented in Figure 2, is composed of 664 nodes, again comprising both institutional and junk pages. In this case, the junk pages are specific to the Five Star Movement, and the institutional pages are the personal pages of political candidates. The Five Star Movement network shows three big clusters in which the central node (WIlM5s) is a junk page.

Figure 2: NodeXL social media network diagram of relationships derived from the Facebook Like page "Di Battista Alessandro".
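The paper computes these measures in NodeXL; for readers working outside Excel, an equivalent computation is straightforward with networkx. This is a sketch under the assumption that the page-to-page "like" relations are available as an edge list; the file name is hypothetical.

    import networkx as nx

    # Hypothetical input: one "liker liked" pair of page names per line.
    G = nx.read_edgelist("like_network_edges.txt")

    betweenness = nx.betweenness_centrality(G, normalized=False)  # raw scores, as in Tables 1-3
    closeness = nx.closeness_centrality(G)   # inverse average distance, as in NodeXL
    eigenvector = nx.eigenvector_centrality(G)

    # Top pages by betweenness, in the layout of Tables 1-3.
    for page in sorted(betweenness, key=betweenness.get, reverse=True)[:6]:
        print(page, round(betweenness[page], 3),
              round(closeness[page], 3), round(eigenvector[page], 3))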
Table 2: Network derived from the Facebook Like page "Di Battista Alessandro": centrality measures for the vertex pages with the highest levels of betweenness.

Vertex | Betweenness | Closeness | Eigenvector
MassimoEnricoBaroni | 281353.000 | 0.001 | 0.032
WIlM5s | 172430.333 | 0.001 | 0.024
sorial.giorgio | 143457.000 | 0.001 | 0.013
dibattista.alessandro | 3405.667 | 0.001 | 0.006
pierrecantagallo89 | 1324.000 | 0.001 | 0.001
perchevotarem5s | 702.000 | 0.001 | 0.003

The social media network of relationships derived from the Facebook Like page "Partito Democratico" does not show the features found in the previous networks: the network related to the centre-left party is composed only of institutional propaganda pages.

Figure 3: Centrality measures for the social media network of relationships derived from the Facebook Like page "Partito Democratico".

Table 3: Network derived from the Facebook Like page "Partito Democratico": centrality measures for the vertex pages with the highest levels of betweenness.

Vertex | Betweenness | Closeness | Eigenvector
partitodemocratico.it | 46486.100 | 0.002 | 0.024
enricoletta.it | 28853.657 | 0.002 | 0.047
scalfarotto | 24167.162 | 0.002 | 0.038
giannipittella | 23136.533 | 0.001 | 0.018
giovanidem | 19798.000 | 0.001 | 0.011
palazzochigi.it | 12633.519 | 0.001 | 0.009

4. Community clusters
The mapping of the propaganda pages resulted in different network structures. To classify these structures, we use the model elaborated by Smith et al. (2014), who define a taxonomy of the social networks derived from conversations on Twitter. The authors define six types of networks: polarized crowd, tight crowd, community cluster, brand cluster, broadcast network and support network (see Figure 5).

Figure 5: Diagrams of the differences in the six types of social media networks (Smith et al., 2014).

In this framework, the centre-right social media network shows a conformation similar to a mixture of the Polarized Crowd and the Support Network. On the one hand, the Polarized Crowd model is characterized by two groups, polarized on specific opinions and sharing few connections; on the other hand, the Support Network model consists of a central node that sends information to the peripheral nodes. The Five Star Movement network adheres more closely to the Tight Crowd and Support Network structures; the Tight Crowd is composed of highly connected nodes and specific shared themes. Finally, the Democratic Party network reflects the structure of a Community Cluster, which is organized in many cliques that share specific topics of conversation.

5. Conclusions and future works
In this preliminary phase of our research, we considered the network structures related to online propaganda linked to different political areas. Our analysis allowed us to highlight the differences between the networks and to cast the reconstructed networks into the taxonomy proposed by Smith et al. (2014). In addition, in two of the three analyzed social networks we found junk pages contributing to the disinformation and misinformation processes by spreading fake news and indulging in hate speech. The cluster structures of those two networks, leading to closed circles of highly polarized information, facilitate the diffusion process of misleading information.
Based on these preliminary results, future work will focus on the textual analysis of the posts and comments shared on the retrieved junk pages, in order to identify the main topics discussed. To this end, text mining and machine learning techniques will be exploited.

References
Castells M. (2000). The Rise of the Network Society. Oxford: Blackwell Publishers.
Hansen D. L., Shneiderman B. and Smith M. A. (2011). Analyzing Social Media Networks with NodeXL: Insights from a Connected World. Morgan Kaufmann.
Jenkins H. (2006). Fans, Bloggers and Gamers: Exploring Participatory Culture. New York University Press.
Lazer D., Pentland A., Adamic L., Aral S., Barabási A. L., Brewer D., Christakis N., Contractor N., Fowler J., Gutmann M., Jebara T., King G., Macy M., Roy D. and Van Alstyne M. (2009). Life in the network: the coming of computational social science. Science, 323(5915): 721-723.
Lévy P. (2002). Cyberdémocratie. Essai de philosophie politique. Paris: O. Jacob.
McLuhan M. (1962). The Gutenberg Galaxy: The Making of Typographic Man. University of Toronto Press.
Mocanu D., Rossi L., Zhang Q., Karsai M. and Quattrociocchi W. (2015). Collective attention in the age of (mis)information. Computers in Human Behavior, 51: 1198-1204.
Ott B. L. (2017). The age of Twitter: Donald J. Trump and the politics of debasement. Critical Studies in Media Communication, 34(1): 59-68.
Oxford Dictionaries (2016). Word of the Year 2016 is... https://en.oxforddictionaries.com/word-of-the-year/word-of-the-year-2016.
Quattrociocchi W. and Vicini A. (2016). Misinformation. Guida alla società dell'informazione e della credulità. Franco Angeli.
Rainie L. and Wellman B. (2012). Networked: The New Social Operating System. MIT Press.
Smith M., Rainie L., Shneiderman B. and Himelboim I. (2014). Mapping Twitter Topic Networks: From Polarized Crowds to Community Clusters. Pew Research Internet Project, February 20. http://www.pewinternet.org/2014/02/20/mapping-twitter-topic-networks-from-polarized-crowds-to-community-clusters/
Woolley S. C. and Howard P. N. (2017). Computational Propaganda Worldwide: Executive Summary. Working Paper 2017.11. Oxford, UK: Project on Computational Propaganda. comprop.oii.ox.ac.uk. 14 pp.

Topic modeling of Twitter conversations

Eliana Sanandres, Camilo Madariaga, Raimundo Abello
Universidad del Norte – esanandres@uninorte.edu.co, cmadaria@uninorte.edu.co, rabello@uninorte.edu.co

Abstract
Topic modeling provides a useful method for finding symbolic representations of ongoing social events. It has received special attention from social researchers, particularly among cultural sociologists, in the last decade (DiMaggio et al., 2013; Sanandres and Otálora, 2015). During this time, Twitter has acted as the most common platform for people to share narratives about social events (Himelboim et al., 2013). This study proposes LDA (Latent Dirichlet Allocation) based topic modeling of Twitter conversations to determine which topics are shared on Twitter in relation to social events. The dataset for this study was constructed from public messages posted on Twitter related to the financial crisis of the National University of Colombia. Over an eight-week period, we downloaded all tweets that included the hashtag #crisisUNAL (UNAL is the Spanish acronym of the university) using the Twitter API interface.
We analyzed over 45,000 tweets published between 2011 and 2015, using the R package topicmodels to fit the LDA model in five steps. First, we transformed the tweets into a corpus, which we exported to a document-term matrix; the terms were stemmed, and stop words, punctuation marks, numbers and terms shorter than three letters were removed. Second, we used the mean term frequency-inverse document frequency (tf-idf) over the documents containing each term to select the vocabulary: we only included terms with a tf-idf value of at least 0.1, which is slightly below the median, to ensure that overly frequent terms were omitted. Third, we defined the number of topics k by estimating the log-likelihood of the model for each number of topics from 1 to 300, and selected k = 12 because it had the highest log-likelihood value (LL = -198000). Fourth, we ran the LDA model for k = 12 topics. Fifth, we labeled the k = 12 topics by choosing the top N terms ranked by their probability within each topic. This article illustrates the strength of topic modeling for analyzing large text corpora and provides a way to study the narratives that people share on Twitter.

Keywords: topic modeling, LDA, Twitter

1. Introduction
This article presents a way to analyze large amounts of textual data from Twitter conversations in an efficient and effective way. Specifically, we explain how to capture the narratives that people share on Twitter about social events, reduce their complexity, and provide plausible explanations. This is a research concern that has received special attention among social researchers (Kovanović et al., 2015; Yang et al., 2011; Newman and Block, 2006; Griffiths and Steyvers, 2004), particularly among cultural sociologists, who face the methodological challenge of working qualitatively with large amounts of data (Sanandres and Otálora, 2015; Eyerman et al., 2011; Alexander, 2004). In this paper we propose an LDA (Latent Dirichlet Allocation) based topic model to address this challenge. Topic modeling is a useful approach because the sets of terms found within topics index discursive environments or frames that define patterns of association between a focal issue and other constructs (DiMaggio et al., 2013). These patterns of association are to be interpreted as symbolic representations of ongoing social events, which represent claims about the shape of social reality, its causes, and the responsibility for action such causes imply (Alexander, 2004). We applied an LDA-based model to Twitter conversations about the financial crisis of the National University of Colombia to examine how the debate over this crisis was framed on Twitter, from 2011, when it emerged, until 2015. We analyzed over 45,000 tweets, illustrating the strength of topic modeling for the analysis of large text corpora as a way to study narratives shared on Twitter.

2. Background: The financial crisis of the National University of Colombia
Over the last decade, Colombian academics and representatives of the government have recognized that budget limitations are the major constraint on the response of public universities to the increasing demands of society. To face this problem, the government proposed to reform the entire system of higher education (Ministry of National Education, 2010).
The intention was to find new sources of money for higher education, enable more people to attend college, encourage transparency and good governance in the education sector, and improve the quality of higher education. One of the most controversial proposed changes was the opening of the education sector to private investment by for-profit companies (El Espectador, 2011). This was immediately rejected by public universities, which claimed that the proposed reform would lead to a full-scale privatization of the higher education system (Semana, 2011). At the public National University of Colombia, the largest higher education institution in Colombia, some students and professors claimed that the reform offered no clear solution to the financial crisis of the university. They explained that the university had been using a funding model whose sources of support were mixed between the state and external resources, and that since 2004 this model had borne dwindling state support and ever-increasing costs to be covered by external resources. They showed that government transfers had decreased from 70% in 2004 to 64% in 2013, while the external resources produced by activities such as tuition fees, non-formal education courses, and academic extension services, among others, had increased from 30% to 36% in the same period (National University of Colombia, 2014). This statement reopened the debate on the financial crisis of the National University of Colombia and became a Twitter trending topic with the hashtag #CrisisUnal (UNAL is the Spanish acronym for the name of the university).

3. The financial crisis of the National University of Colombia on Twitter
Here, we investigate how the financial crisis of the National University of Colombia was framed on Twitter. It may be asked why we should care about Twitter conversations on this topic; it should be considered, however, that Twitter conversations can offer clues to what the university community is thinking and doing about the crisis. A central advantage of using Twitter for analysis is that it covers topics in real time, producing a large amount of data that can be used to look at people's perceptions and narratives of particular events. Twitter also provides a practical way to examine collective experience related to a topical event, to study behaviors and attitudes where social desirability bias may occur in official surveys, and to collect large amounts of data on a limited budget (Himelboim et al., 2013). Twitter conversations also illustrate the views of the reading public and show dominant viewpoints, which emerge quickly and are difficult to change (Xiong and Liu, 2014). We collected every tweet published between 2011 and 2015 that contained any reference to the financial crisis of the National University of Colombia with the hashtag #CrisisUNAL. We chose this period in order to track Twitter conversations around this topic from the time it became a Twitter trend in 2011 through 2015 (the last year in which we collected data). Our collection formed a corpus of over 45,000 tweets. In the next section we describe how we used topic modeling.

4. Method
Topic modeling is a machine-learning method used to discover hidden thematic structures in large collections of documents. In this work we used LDA, a widely used topic modeling method (Jelodar et al., 2017; Fligstein
The intuition behind LDA is that documents exhibit multiple topics. A topic is formally defined as a distribution of words over a fixed vocabulary (Blei, 2012). For LDA, topics must be specified before any data are generated. For each document in the collection, this method generates the words in a two-stage process. During the first stage, it randomly chooses a distribution over topics (step 1). In the second stage, for each word in the document, it randomly chooses a topic from the distribution over topics in step 1 (step 2a), and a word from the corresponding distribution over the vocabulary (step 2b). At the end, each document exhibits topics in different proportions (step 1) and each word in each document is drawn from one of the topics (step 2b), where the selected topic is chosen from the per-document distribution over topics (step 2a) (Blei, 2012). To run the LDA model, we followed five steps. First, we transformed the tweets into a corpus and exported this corpus to a document-term matrix; the terms were stemmed and the stop words, punctuation, numbers and terms shorter than three letters were removed. Second, we used the mean term frequency-inverse document frequency (tf-idf) to select the vocabulary. We only included terms with a tf-idf value of at least 0.1, which is a bit less than the median, to make sure that the most frequent terms were omitted. Third, we defined the number of topics k by estimating the log-likelihood of the model for each topic number, from 1 to 300 topics; we selected k = 12 as having the highest log-likelihood value (LL = -198000). Fourth, we run the LDA model for k = 12 topics. Fifth, we labeled the k = 12 topics previously identified by choosing the top N terms, ranked according to the probability of that topic. For this we used the R package topicmodels. 5. Results Table 1 displays the 12-topic solution and lists the 10 highest-ranking terms for each topic. We call attention to four sets of topics: six topics concerned with social protest (dark shading), three topics on educational reform (medium shading), two topics calling for investment (light shading), and one topic emphasizing the role of the National University of Colombia in the Colombian peace process (no shading). To more easily interpret the topics, after reviewing the list of terms we examined those tweets that exhibited each topic with the highest probability. 5.1 Protest topics Protest topics are the focus of the Twitter conversations on the financial crisis in the National University of Colombia. Topic 1 covers the protests of the education workers. The most highly ranked terms were sintraunal (the labor 688 JADT’ 18 union covering all workers at public universities), protest, strike, campus, riot, gas, blocked, and wall. The tweets in which this topic was strongly represented locate protests in national and international contexts with terms like nation and clacso (Latin American Council of Social Sciences), indicating that the protests were a matter of concern in Colombia and in Latin America. Topic 3 also refers to the protests of the education workers. Some of the top words are sintraunal, gases, wall, and block. This topic frequently exhibits tweets that show negative aspects of protests, such as confrontation, death, and bombs. 
5. Results
Table 1 displays the 12-topic solution and lists the 10 highest-ranking terms for each topic. We call attention to four sets of topics: six topics concerned with social protest (Topics 1, 2, 3, 4, 10 and 12), three topics on the educational reform (Topics 5, 6 and 8), two topics calling for investment (Topics 7 and 9), and one topic emphasizing the role of the National University of Colombia in the Colombian peace process (Topic 11). To interpret the topics more easily, after reviewing the lists of terms we examined the tweets that exhibited each topic with the highest probability.

Table 1: 12-topic solution (10 highest-ranking terms per topic).
Topic 1: sintraunal, protest, strike, campus, riot, gas, blocked, wall, nation, clacso
Topic 2: agricultural, strike, graffiti, hate, block, bombs, terrorists, crash, delinquents, guevara
Topic 3: sintraunal, gases, wall, block, undefined, bombs, hood, criticism, death, confrontation
Topic 4: agrarian, protest, movement, mobilization, participation, people, bombs, poor, assembly, disturbance
Topic 5: solidarity, no to the reform, justice, march, respect, charge, help, block, upedagogica, studying
Topic 6: no to the reform, universities, listen, sciences, confrontation, media, classrooms, abandoned, mobilization, block
Topic 7: defend, university, improvement, campus, crisis, infrastructure, cement, hospital, architecture, sociology
Topic 8: no to the reform, propose, threat, oblivion, save, closed, blocked, abnormality, upedagogica, uncertainty
Topic 9: stamp, demand, support, public, university, strike, resources, deserve, financial, pride
Topic 10: intimidation, blocked, abandoned, public, eviction, strike, che, graffiti, protest, worker
Topic 11: peace, process, mobilization, research, studying, participation, talks, intellectuals, solidarity, civil
Topic 12: revolutionary, victory, popular, campus, strike, eviction, denounce, deserve, abandonment, took

5.1 Protest topics
Protest topics are the focus of the Twitter conversations on the financial crisis of the National University of Colombia. Topic 1 covers the protests of education workers. The most highly ranked terms are sintraunal (the labor union covering all workers at public universities), protest, strike, campus, riot, gas, blocked, and wall. The tweets in which this topic was strongly represented locate the protests in national and international contexts, with terms like nation and clacso (the Latin American Council of Social Sciences), indicating that the protests were a matter of concern both in Colombia and in Latin America. Topic 3 also refers to the protests of education workers; some of its top terms are sintraunal, gases, wall, and block. This topic frequently appears in tweets showing negative aspects of the protests, with terms such as confrontation, death, and bombs. Topics 2 and 4 refer to the agricultural sector protests. While Topic 4 is related to the mobilization of people to take part in these protests, Topic 2 emphasizes the participation of terrorists and delinquents in agricultural strikes. In this context, social protest is associated with the Argentine Marxist revolutionary Ernesto Che Guevara. Che is also mentioned in Topic 10, which deals with the protests of the working class and the intimidation of protesters; the most highly ranked terms in this topic are intimidation, blocked, abandoned, public, eviction, worker, strike, che, graffiti, and protest. Finally, Topic 12 covers the revolutionary cause of social protest and includes the terms revolutionary, victory, popular, campus, and strike.

5.2 Anti-reform topics
Three topics deal with the reform of higher education proposed by the government. According to the terms included in Topic 5, public universities reject this reform and call for justice and respect; terms in this topic include solidarity, no to the reform, justice, march, and respect, and tweets representing this topic show strong solidarity among public universities, especially from the Universidad Pedagógica (upedagogica). Topic 8 is also related to the rejection of the planned educational reform in order to save public education; it includes terms like no to the reform, propose, threat, oblivion, and save, and the Universidad Pedagógica (upedagogica) is mentioned as well. In the same way, Topic 6 indicates that public universities reject the reform of higher education, mobilize to denounce the government's abandonment, and demand to be listened to; some of the terms in this topic are no to the reform, universities, listen, sciences, confrontation, media, classrooms, abandoned, mobilization, and block.

5.3 Investment topics
Topics 7 and 9 cover demands for investment to face the crisis. Topic 7 calls for infrastructure investment. Many tweets in which this topic is prominent focus on the infrastructure crisis of the campus buildings, in particular the sociology and architecture buildings and the university's hospital; the top terms include defend, university, improvement, campus, crisis, infrastructure, cement, hospital, architecture, and sociology. Topic 9 plays a similar role in investment demands, focusing on the pro-National University of Colombia stamp, created to acquire financial resources to improve the university facilities. Some tweets containing this topic highlight the role of the university as a source of national pride; the top-ranked terms include stamp, demand, support, public, university, resources, financial, strike, deserve, and pride.
Topic 9 plays a similar role in investment demands focusing on the pro-National University of Colombia stamp, created to acquire financial resources to improve the university facilities. Some tweets containing this topic highlight the role of the University as a national pride. The top ranked terms include stamp, demand, support, public, university, resources, financial, strike, deserve, and pride. 5.4 Peace topic Topic 12 represents the integration of the crisis in the National University of Colombia into a broader frame of national concern associated with the Colombian peace process. The top-ranked terms are peace, process, mobilization, research, studying, participation, talks, intellectuals, solidarity, and civil. Tweets in which this topic was strongly represented are related to the role of the university as facilitator in peace talks among the government, rebel groups involved in the Colombia’s internal armed conflict (which began in the mid-1960s and is currently in negotiation, in a process known as the Colombian peace process), intellectuals, and representatives of civil society. 690 JADT’ 18 6. Conclusions Producing an interpretable way to study Twitter conversations efficiently and effectively is only the beginning. The solution of this issue presents meaningful categories to address the analytic question that motivated the study: how was the financial crisis in the National University of Colombia framed on Twitter? The 12-topic solution showed that it was framed through four categories: protest, anti-reform, investment, and peace. Each topic constitutes a frame, in that it includes terms calling attention to particular ways in which the crisis under study may arouse controversy: protest frames emphasize public displays, demonstrations and the civil disobedience of the working class; anti-reform frames refer to the rejection of the reform of higher education by public universities; investment frames focus on investment demands to face the crisis; and the peace frame draws attention to the role the National University of Colombia played in acting as a facilitator in the Colombian peace process. Each of these frames represents a discursive environment for the financial crisis, which broadcasts not just the structural characteristics of the crisis (investment demands and education reform), but also symbolic representations of ongoing social events (workers protests and peace process), which can be seen as claims about ongoing social processes and demands of reparation. These results provide substantive insight into Twitter conversations about the financial crisis in the National University of Colombia. Using LDA to discover topics allowed us to locate two narratives: one focused on the structural characteristics of the crisis and the other concerned with symbolic representations of ongoing social events surrounding that crisis. For cultural sociologists, this is only the beginning of the analysis. A topic model allows a starting point to be found, which in this case is the structure of Twitter data. Used properly, with appropriate validation, topic models are valuable complements to other interpretive approaches, offering new ways to extract topics and make sense of online data. References Alexander, J. (2004). Toward a theory of cultural trauma. In Alexander, J., Eyerman, R., Giesen, B., Smelser, N. and Sztompka, P. Cultural trauma and collective identity. Univ of California Press. Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4): 77–84. 
What volunteers do? A textual analysis of voluntary activities in the Italian context
Francesco Santelli, Giancarlo Ragozini, Marco Musella
University of Naples Federico II
francescosantelli@unina.it marcomusella@unina.it

Abstract
The complex phenomenon of volunteering has mainly been analyzed in the economic literature with respect to its "economic value added", i.e. the capability of these activities to increase the level of productivity of specific goods or services. In this paper the point of view switches: voluntary organizations are analyzed as places of job-market innovation, where new jobs arise and where people acquire new skills. Volunteering can thus be thought of as a factor of "social innovation". In order to analyze the contents of voluntary work, we use data from the Istat survey "Multiscopo, Aspetti della vita quotidiana" (Multi-purpose survey, daily life aspects) for the year 2013. In our textual analysis, we use the information included in the open answers in which people describe the tasks they performed individually as volunteers.
After stemming, lemmatization, and cleaning, the data were analyzed by means of community detection based on Semantic Network Analysis, in order to discover patterns of jobs, and through Correspondence Analysis on Generalized Aggregated Lexical Tables (CA-GALT), in order to discover profiles of volunteers. In particular, we look for differences by gender, age, educational level, region of residence and type of voluntary association.
Keywords: Text Mining, Volunteers, Lexical Correspondence Analysis, Semantic Network Analysis

1. Introduction
Volunteer work differs from traditional forms of work in several respects. Nevertheless, most authors approaching the volunteering phenomenon are interested mainly in the economic value that this sector is able to add to the labour market (Ironmonger, 2000; Salamon et al., 2011), considering it a special case of job within the framework of economic theory. From this point of view, volunteering is assumed to be a peculiar sector of production with a considerable number of rules and dynamics that diverge from standard work patterns, but one that is still able to provide goods and services to the community like all the other sectors, thereby increasing the overall economic value of society. In this work the focus is instead on a different perspective: volunteering is considered as a laboratory of social innovation embedded in the labour market. The main idea is that volunteering is based on different guidelines and different principles (Zamagni, 2005); therefore, it can develop new professional profiles and modify pre-existing ones. A Text Mining approach is applied to the open-ended answers given by volunteers, assuming that their self-descriptions are a consistent proxy of the volunteering world. The empirical statistical analysis makes use of two tools chosen for their capability to profile both groups of words and clusters of volunteers. The latter, in the Italian context, are analyzed in parallel with the traditional categories of classic labor theory. It will be shown that most of the determinants of the segmentation of professions (Colombo, 2003), such as gender, age or geographic area of origin, can be adopted in this framework as well.

2. Data and statistical approach
Data are taken from the 2013 Istat survey "Multiscopo, Aspetti della vita quotidiana" (Multi-purpose survey, daily life aspects) (Istat, 2013). It is a large annual sample survey that covers the resident population in private households, interviewing a sample of about 20,000 households and about 50,000 people with the P.A.P.I. technique. The main dimensions of the questionnaires concern education, work, family and social life, spare time, political and social participation, health, lifestyle and access to services. From the whole sample, we selected about 5,000 persons who declared themselves involved in volunteering and who answered the open-ended questions about their voluntary activities and about whether they carried them out within an organization or by themselves. The core of the statistical text-mining procedure focuses on these brief descriptions of their own volunteering jobs. We analyzed the descriptions along with the available socio-demographic variables: gender, age, geographic macro-area and educational level.
Given the definition of volunteering (Istat, 2013; Wilson, 2000), several descriptions were erased from the database as they do not belong to voluntary activities (e.g., people donating blood to the AVIS organization, or people providing help to family members). After this preliminary procedure for deleting inappropriate or missing answers, the valid number of volunteers is 4,254 out of the original 5,000. Before the analysis, we performed a preliminary transformation of the original lexical data by removing punctuation and stop-words, and by stemming the words, i.e. deleting all derivational and inflectional suffixes (Lovins, 1968; Willett, 2006). Thus, all words that evolved from the same root are considered identical after the stemming. For this task we used the Porter stemming algorithm as implemented in the R package tm (Meyer et al., 2008). After this preliminary step, in order to discover groups of activities that can be described as jobs we applied Semantic Network Analysis (van Atteveldt, 2008; Drieger, 2013), and in order to profile voluntary jobs with respect to socio-demographic dimensions we used Correspondence Analysis on Generalized Aggregated Lexical Tables (CA-GALT) (Kostov et al., 2015). The former is an extension of Social Network Analysis that treats text as a graph structure: each word is a node, and the ties between words are undirected links weighted by the count of co-occurrences (how many times the words appear together in the same answer). Groups of terms corresponding to semantic clusters can be found through community detection algorithms (Fortunato, 2010). We used the Fast Greedy method, which is suited to undirected, weighted edges (Clauset et al., 2004). The CA-GALT method, on the other hand, allows us to jointly analyze, in a multiple correspondence framework, both the lexical table and the socio-demographic profiles, combining the document-term matrix and the matrix containing the individual characteristics.

3. Main findings of the analysis
After the preliminary transformations, the overall corpus shows a high degree of heterogeneity, with 1,649 different words, and a level of sparsity close to 100%, due to the large number of documents and their shortness. The term-frequency distribution has a median equal to 2 and a 0.75 percentile equal to 4. Given the sparsity, we focus the analysis on the most frequent words that profile and describe voluntary activities, taking into account only words above the 0.90 percentile (frequency equal to 11), ending up with a vocabulary of 175 words. The most used of them are organizz (to organize, or organization), which appears 296 times, assistent (assistant) with 225 occurrences, attiv (activity), which occurs 215 times, then assoc (association), aiut (to help) and volontar (volunteer and derived words). These terms can be considered rather generic and could relate to several aspects of the volunteers' community, without additional informative power to profile volunteers. They are followed by terms describing specific fields of intervention: sport, fond (fund), event, bambin (child/children), anzian (senior/old).
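A minimal R sketch of the preprocessing pipeline just described (lowercasing, punctuation and stop-word removal, Porter stemming, and filtering on frequency) might look as follows; the character vector answers holding the open-ended responses is a hypothetical placeholder, not the authors' code.

# Build a stemmed document-term matrix from the open-ended answers.
# 'answers' is a hypothetical character vector of survey responses.
library(tm)
corp <- VCorpus(VectorSource(answers))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("italian"))
corp <- tm_map(corp, stemDocument, language = "italian")
dtm  <- DocumentTermMatrix(corp)
findFreqTerms(dtm, lowfreq = 11)  # terms above the 0.90 percentile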
Further, some of them express just one semantic meaning and can be considered bi-grams (Collins, 1996): croce rossa (red cross), croce verde (green cross), croce bianca (white cross), protezione civile (civil protection/defense), vigili fuoco (firefighters), capo scout (scoutmaster). We merged them in what follows. Applying the Semantic Network and the community detection algorithm to these data, we found 7 groups/communities. In Fig. 1 we plot the semantic network along with the communities, in which words are colored according to their community. It is possible to identify a set of "jobs" related to typical charity organizations, mainly in a religious context: the care of old and hospitalized people (ospedal, malat, assistenz, ascolt, accud, cur, sostegn; orange), the education and animation of disadvantaged children, mainly in religious organizations (insegn, parrocc, scuol, orator, cateches, anim; purple), and the food and clothing drives and their distribution to the poor (cibo, vestiar, caritas, raccolt, aliment, mens, pover; green). Another large group is related to the executives and officers of organizations and to cultural-event organizers (organizz, event, cultural, membr, consigl, dirigent, reunion; blue). Related to this large group we found the musicians (black), characterized by suon, band, musical. Finally, the last important area of the network is associated with organized volunteers on the territory (vigilefuoc, protezionecivil, territor, croceross, soccors, ambul; red). The coaches are mixed with this group (squadr, allen, calc, pallavol; brown). All these activities are mainly carried out in non-religious organizations and are not directly related to charity aims. Analyzing the categories and lexical CA in Fig. 2, it is possible to profile individuals according to their demographic status. In this context no real clustering procedure is performed; as in classical Correspondence Analysis, the two spaces, units and variables, are linked, taking into account that words close to a specific category are more likely to occur for people belonging to that category. A clear gender gap emerges: men are related to sports activities, playing music in bands, driving (mainly ambulances) and administration tasks. Women are more involved in providing services to individuals (taking care of children and old people) and in carrying out food and clothing drives for the poor. Geographic differences emerge as well: volunteers from the North-East and North-West describe their activities as manutenzion, dirigent, addett, consigl, showing a higher organizational level. The South and the Islands are more related to a female style of volunteering, with a predisposition for religious organizations, mainly aimed at assistance. Educational level and age also have an impact: the lowest level of education, crossed with age, profiles a group of old and less educated volunteers involved in religious volunteering. The most educated people mainly carry out administrative tasks. The central age group (35-64) shows, on the other hand, an average profile close to the origin of the axes, as do people from Central Italy.

Figure 1: Semantic network: different colors for the different communities identified by the Fast Greedy algorithm. The size of the words and the width of the edges are proportional to the weights.
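The network construction and Fast Greedy community detection described above can be sketched in R with the igraph package; the binary document-term matrix m and all variable names below are illustrative assumptions, not the authors' code.

# Word co-occurrence network and Fast Greedy community detection.
# 'm' is a hypothetical binary document-term matrix (answers x words).
library(igraph)
cooc <- t(m) %*% m  # word-by-word co-occurrence counts
g <- graph_from_adjacency_matrix(cooc, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)
comm <- cluster_fast_greedy(g, weights = E(g)$weight)
split(V(g)$name, membership(comm))  # words grouped by community

For the CA-GALT step, the CaGalt() function of the FactoMineR package (Kostov et al., 2015) can then be applied to the lexical table together with the socio-demographic variables.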
4. Discussion and conclusion
As introduced in the first section, the aim of this work is to present a general perspective on volunteer work in Italy, under the assumption that it can be studied in a way analogous to how the labour market is studied in the classic economic literature. Some authors have already shown how it follows the rule of supply and demand under given conditions (e.g., Wolff et al., 1993), and volunteering organizations also make use of marketing strategies similarly to business companies (Dolnicar and Randle, 2007). The two statistical tools presented in the previous section give the empirical analysis different and somewhat complementary hints. The communities in the Semantic Network of Fig. 1 are based on the connection level between words, without taking into account other previously known characteristics of the individuals. The communities thus discovered are groups of words that define several activities, and hence clusters of jobs in specific fields. In the second analysis, the two spaces built by CA-GALT, individuals and categories, stress how segmentation is clearly present in volunteering as in the labour market: the words used (and thus the activities done) change with gender, education, age and macro-area, in a way equivalent to standard jobs. It thus gives an overview of the relationships between words (as descriptions of activities) and categories (socio-demographic variables). Summing up, both analyses highlight how complex and heterogeneous volunteering is; they show that the people involved are in some cases highly skilled, often using competencies trained over their lives. Generally, they are able to describe their activities thoroughly, explaining openly the aims of their voluntary jobs. The Text Mining analysis presented in this work could help to identify needs of the population that are not adequately satisfied, under the assumption that volunteers spend their time and use their skills to give something to individuals who strongly demand it, in a framework similar to the supply-and-demand mechanism. Furthermore, to obtain a more exhaustive overview for future policies, a likely next step would be to look at the other side: another survey should ask people why they seek help from volunteers. This would lead to a better understanding of the real needs of individuals who are not fully satisfied with the assistance they receive, especially from official welfare institutions.

Figure 3: CA-GALT for both terms (blue) and categories (red). By overlapping the two factor maps it is possible to profile clusters of individuals.

References
Amati, F., Musella, M. and Santoro, M. (2015). Per una teoria economica del volontariato (Vol. 1). G. Giappichelli Editore, Torino.
Clauset, A., Newman, M. E., and Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111.
Collins, M. (1996). A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 184-191.
Colombo, A. (2003). Razza, genere, classe. Le tre dimensioni del lavoro domestico in Italia. Polis, 17(2), 317-344.
Dolnicar, S. and Randle, M. (2007). The international volunteering market: Market segments and competitive relations. International Journal of Nonprofit and Voluntary Sector Marketing, 12(4), 350-370.
Drieger, P.
(2013). Semantic network analysis as a method for visual text analytics. Procedia - Social and Behavioral Sciences, 79, 4-17.
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3-5), 75-174.
Istat (2013). Indagine Multiscopo sulle famiglie: aspetti della vita quotidiana. Retrieved from http://www.istat.it/it/archivio/91926
Ironmonger, D. (2000). Measuring volunteering in economic terms. In Volunteers and Volunteering, The Federation Press, Sydney, pp. 56-72.
Kostov, B., Bécue-Bertaut, M. and Husson, F. (2015). Correspondence analysis on generalised aggregated lexical tables (CA-GALT) in the FactoMineR package. R Journal, 7(1), 109-117.
Lovins, J. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1-2), 22-31.
Meyer, D., Hornik, K., and Feinerer, I. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 1-54.
Salamon, L., Sokolowski, S. and Haddock, M. (2011). Measuring the economic value of volunteer work globally: Concepts, estimates, and a roadmap to the future. Annals of Public and Cooperative Economics, 82(3), 217-252.
van Atteveldt, W. (2008). Semantic Network Analysis: Techniques for Extracting, Representing, and Querying Media Content. BookSurge Publishers, Charleston SC.
Willett, P. (2006). The Porter stemming algorithm: then and now. Program, 40(3), 219-223. doi: https://doi.org/10.1108/00330330610681295
Wilson, J. (2000). Volunteering. Annual Review of Sociology, 26(1), 215-240.
Wolff, N., Weisbrod, B. A., and Bird, E. J. (1993). The supply of volunteer labor: The case of hospitals. Nonprofit Management and Leadership, 4(1), 23-45.
Zamagni, S. (2005). Gratuità e agire economico: il senso del volontariato. Working paper presented at the Aiccon meeting, Bologna.

A longitudinal textual analysis of abstracts presented at the Italian Association for Vocational Guidance and Career Counseling's conferences from 2002 to 2017
S. Santilli, S. Sbalchiero, L. Nota, S. Soresi
University of Padova
sara.santilli@unipd.it, stefano.sbalchiero@unipd.it, laura.nota@unipd.it, salvatore.soresi@unipd.it

Abstract
This new century is characterized by phenomena such as globalization, internationalization, and rapid technological advances that influence people's lives and the ways in which they seek and do their jobs. Changing the shape of organizations changes the shape of careers. To better account for the complexities of work due to the latest socio-economic crisis, the Life Design paradigm, a new paradigm for career theory in the 21st century (Savickas et al., 2009), has recently been developed; it represents the third wave of career theory and practice. The first wave emerged as the psychology of occupations in the first half of the 20th century, to match people to jobs. The second wave comprised the psychology of careers, ascending at mid-20th century to manage work and other life roles across the lifespan. The main aim of the present study was to illustrate the changes in theory, techniques and measures that emerged in Italian vocational guidance and career counseling psychology, through the analysis of the abstracts presented at the Italian Association for Vocational Guidance and Career Counseling's conferences. The corpus was composed of 1,250 abstracts collected from 2002 to 2017.
In order to compare and contrast the main semantic areas over time, a topic analysis by means of Reinert's method (1983) was conducted (using IRaMuTeQ and R software) to detect the clusters of words that characterized the different orientations over time. The results show that career counseling theories and techniques evolved over time to better assist workers in adapting to fluid societies and flexible organizations, and to better help clients design their lives in the 21st century.
Keywords: longitudinal textual analysis, career counseling, vocational psychology

1. Introduction
In Western countries, the economic recession that characterized the years 2008-2009 led to a dramatic loss of jobs throughout the Union's private sector. Furthermore, the fast-moving global economy and phenomena such as globalization, internationalization, and rapid technological advances influence people's lives and the ways in which they seek and do their jobs. The world of work is in general much less clearly defined or predictable, and employees face greater challenges in coping with work transitions (Savickas et al., 2009). Therefore, life in the 21st century requires new models and methods to deal with new issues such as uncertainty, inequalities, poverty, immigration, and precariousness in the labor market, and with their worrying consequences for individual and relational wellbeing. For these reasons, existing traditional career guidance assumptions have been swept away, together with other certainties, by the sudden changes that have taken place in the world of work and in the economic field. To better account for the complexities of work, the Life Design paradigm, a new paradigm for career theory and intervention in the 21st century (Savickas et al., 2009), has been developed. The psychology of life design advances a contextualized epistemology emphasizing human diversity, uniqueness, and purposiveness in work and career, to make a life of personal meaning and social consequence. Rather than matching self to occupation, it reflects a third wave of career theory and practice. The first wave emerged as the psychology of occupations in the first half of the 20th century, to match people to jobs. The second wave comprised the psychology of careers, ascending at mid-20th century to manage work and other life roles across the lifespan. The third wave arose as the psychology of life design, to make meaning through work and relationships. The main aim of the present study was to illustrate the longitudinal changes that emerge in the Italian context regarding the models and theoretical paradigms that drive vocational guidance and career counseling, through the analysis of the abstracts presented at the Italian Association for Vocational Guidance and Career Counseling's conferences. Specifically, we analyzed differences between the abstracts presented before the economic recession (from 2002 to 2008) and during/after the economic recession (from 2009 to 2017) in the topics related to research, theories, and practice. The corpus was composed of 1,250 abstracts collected from 2002 to 2017.

2. Corpus and method
All the abstracts have been collected by the Italian Association for Vocational Guidance and Career Counseling (SIO). SIO represents, at the national and international level, a focal center in which the main scholars and practitioners converge, gather, share and compare the theories and practices of vocational guidance and career counseling.
The abstracts from the first SIO conference (2002) to the latest one (2017) were collected. No abstracts were collected for the years 2003, 2007, 2014 and 2016, because SIO did not organize national conferences in those years. The corpus is composed of 1,250 abstracts. The corpus was pre-processed by means of IRaMuTeQ and R software (Ratinaud, 2009; Sbalchiero and Santilli, 2017). The corpus was normalized by replacing uppercase with lowercase letters, and punctuation, numbers and stop words were removed because they are not significant for analyzing the content of the abstracts. The pre-processing steps were useful to reduce redundancy and to provide homogeneity among forms. The lexicometric measures (Tab. 1) indicate that it is plausible to apply statistical analysis of textual data to the corpus (Lebart et al., 1998). The corpus is composed of 20,932 word-types and 462,034 word-tokens.

Tab. 1: Lexicometric characteristics of the corpus
Number of texts: 1,250
Word-types (V): 20,932
Word-tokens (N): 462,034
Hapax (V1): 8,902
(V/N)*100 = Type/Token ratio: 4.53
(V1/V)*100 = Percentage of hapax: 42.53

Using the Reinert method (Reinert, 1983), we extracted a series of 'lexical worlds'. The texts were divided into elementary content units of similar length; the algorithm then operates on a 'words x units' matrix. The classification of units makes it possible to identify and extract only the parts of texts relating to the same topic, so for each cluster the list of the most significant words, calculated using the chi-square measure, is identified (Reinert, 1993; Sbalchiero and Tuzzi, 2016; Sbalchiero and Santilli, 2017).

3. Results
The analysis conducted by means of Reinert's method detected five different lexical worlds, as the dendrogram shows (Fig. 1). The method identifies the lexical worlds quite well, since 98.42% of the abstracts have been classified and the words in the same semantic area are semantically associated, i.e. they refer to the same issue. Specifically, the first class of the present corpus refers to career counselors' professional knowledge, skills, resources and training. The second class refers to the principal variables and constructs related to vocational guidance and career counseling, such as self-efficacy, personality, coping, intelligence, emotions, satisfaction, and optimism. The third class includes the statistical measures and instruments used in vocational guidance to assess people's career self and personality. The fourth class refers to context variables, to the supports and barriers to inclusion, and to the rights of people with vulnerabilities (people with disabilities, psychologically sidelined people, etc.). The fifth class includes the guidance services, projects and career guidance activities provided by local centers (university, region, province). As already mentioned, differences between the abstracts presented before the economic recession (pre-crisis: from 2002 to 2008) and during/after the economic recession (post-crisis: from 2009 to 2017) were analysed. These two periods in vocational guidance history are distinctive because the stable employment and secure organizations of the pre-crisis period gave way, post-crisis, to a new social arrangement of flexible work and fluid organization, causing people tremendous distress and making it difficult to comprehend careers with theories that emphasize stability rather than mobility. Furthermore, it seemed interesting to analyze whether differences could be found in the theories and techniques presented in the abstracts pre- and post-crisis.
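The lexicometric quantities of Tab. 1 above can be recomputed in a few lines of base R, as in this sketch; abstracts is a hypothetical character vector holding the 1,250 normalized texts.

# Word-types, word-tokens, hapax and the derived ratios of Tab. 1.
# 'abstracts' is a hypothetical character vector of normalized abstracts.
tokens <- unlist(strsplit(tolower(abstracts), "[^[:alpha:]]+"))
tokens <- tokens[nzchar(tokens)]
freq <- table(tokens)
N  <- length(tokens)  # word-tokens
V  <- length(freq)    # word-types
V1 <- sum(freq == 1)  # hapax legomena
c(TTR = 100 * V / N, hapax_pct = 100 * V1 / V)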
To differentiate between papers presented pre- and post-crisis, a specific procedure was used based on the chi-square association of semantic classes (Ratinaud, 2014) over the two periods of time (Fig. 2). The classes related to the pre-crisis period are classes three and five, characterised by the statistical measures and instruments used in vocational guidance to assess people, and by the guidance services, projects and career guidance activities. The post-crisis period is characterized by class four, which refers to context variables, to the supports and barriers to inclusion, and to the rights of people with vulnerabilities.

Fig. 1: Cluster dendrogram and list of the most relevant words for each lexical world (in descending order according to the chi-square value of each class).
Fig. 2: Comparison between pre-crisis and post-crisis papers.

These results highlight that the topics in the pre-crisis abstracts are more oriented towards "people", focusing on assessment and measurement with a statistical background. In the post-crisis period, the attention of counsellors is more oriented toward the "environment" in which people live and the relation between people and their context, so the uniqueness and the vulnerability of people are considered in relation to social and work inclusion. Finally, in order to compare and contrast the main semantic areas over time, the classes were analysed using the chi-square association of semantic classes and their distribution over the years (Fig. 3); a minimal sketch of this kind of chi-square procedure is given below.

Fig. 3: Comparison among classes and their distributions over the years.

In addition to the classes already analyzed in the pre- and post-crisis periods, the comparison among classes and their distributions over the years also highlights class 1 and class 2, which can be considered evergreen in the vocational guidance and career counseling field because they are present throughout almost the entire period considered. Class 1 refers to career counselors' professional knowledge, skills, and competences. Class 2 refers to variables and constructs related to vocational guidance and career counseling, such as self-efficacy, coping, life satisfaction, and positive attitudes.

4. Conclusions and discussion
The aim of the present study was to highlight the changes in theory, techniques and measures that emerged in Italian vocational guidance and career counseling psychology, through the analysis of the abstracts presented at the Italian Association for Vocational Guidance and Career Counseling's conferences. The results show five different lexical-world classes, related to career counselors' professional knowledge; the variables and constructs of vocational guidance and career counseling; the measures and instruments to assess people's career self and personality; the context variables supporting the inclusion of people with vulnerabilities; and the career guidance services and centers. Differences between the abstracts presented before the economic recession (pre-crisis: from 2002 to 2008) and during/after the economic recession (post-crisis: from 2009 to 2017) were also analysed. The results show that career counseling theories and techniques evolved over time to better assist workers in adapting to fluid societies and flexible organizations, and to better help clients design their lives in the 21st century.
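A rough base-R sketch of the chi-square association between classes and periods referred to above follows; cls and period are hypothetical vectors giving, for each classified text unit, its Reinert class and its pre-/post-crisis period.

# Cross-tabulate classes against periods; large positive standardized
# residuals flag the classes over-represented in a given period.
# 'cls' and 'period' are hypothetical vectors of equal length.
tab <- table(cls, period)
test <- chisq.test(tab)
test$stdres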
In fact, while in the abstracts related to the pre-crisis period emphasis is given to all those guidance activities that consider it particularly important to allow the person to collect information about their own characteristics and needs before advancing decision-making hypotheses (measures and instruments for assessment), in the abstracts related to the post-crisis period attention is paid to the "contexts" where people live. Career guidance practices that are limited to the analysis of "attitudes" and "interests" are considered obsolete, while current policies, challenges, socio-economic conditions, and the way in which vulnerability is conceptualised are inputs from the environment which act at various levels and to which scholars should pay attention (Shogren, Luckasson, & Schalock, 2014). The evolution of the social sciences that revolve around guidance is undoubtedly a very complex phenomenon. Career scholars and practitioners should support people's needs taking into account the organizational and environmental context in which they develop and take shape. Currently, career guidance theories and models are numerous and not always named and defined in the same way by the various authors and scholars. For these reasons it is important to analyze and understand the different models developed over time, in order to activate a continuous comparison in the field of career counselors' competences that produces precise trajectories regarding the constructs to develop in people through the programs and activities provided by career services. Indeed, noteworthy is the result highlighting that the classes referring to vocational guidance and career counseling are present throughout the entire period considered. Nevertheless, these are just some of the results, and further analyses will be useful for examining the peculiarities that these specific classes assume during the years considered, in order to identify the specific skills and constructs that characterized different historical periods. It could also be important to compare the results that emerged in the Italian context with those of other European and North American contexts, to generalize the results obtained.

References
Lebart, L., Salem, A., and Berry, L. (1998). Exploring Textual Data. Kluwer Academic Publishers, Dordrecht.
Ratinaud, P. (2014). Visualisation chronologique des analyses ALCESTE : application à Twitter avec l'exemple du hashtag #mariagepourtous. In Actes des 12es Journées internationales d'Analyse statistique des Données Textuelles. Paris Sorbonne Nouvelle-Inalco.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2), 187-198.
Reinert, M. (1993). Les «mondes lexicaux» et leur «logique» à travers l'analyse statistique d'un corpus de récits de cauchemars. Langage & Société, 66, 5-39.
Shogren, K. A., Luckasson, R. and Schalock, R. L. (2014). The definition of "context" and its application in the field of intellectual disability. Journal of Policy and Practice in Intellectual Disabilities, 11(2), 109-116.
Savickas, M. L., Nota, L., Rossier, J., Dauwalder, J. P., Duarte, M. E., Guichard, J., ... and Van Vianen, A. E. (2009). Life designing: A paradigm for career construction in the 21st century. Journal of Vocational Behavior, 75, 239-250.
Sbalchiero, S. and Santilli, S. Some introductory methodological notes. In L. Nota and S. Soresi (Eds.), For a Manifesto in Favor of Inclusion.
Florence: Hogrefe Editore.
Sbalchiero, S. and Tuzzi, A. (2016). Scientists' spirituality in scientists' words. Assessing and enriching the results of a qualitative analysis of in-depth interviews by means of quantitative approaches. Quality & Quantity, 50(3), 1333-1348.

A la poursuite d'Elena Ferrante
Jacques Savoy
Université de Neuchâtel (Suisse) – Jacques.Savoy@unine.ch

Abstract
The objective of an authorship attribution model is to determine, as accurately as possible, the true author of a document, literary excerpt, threatening email, legal testimony, etc. Recently a tetralogy called My Brilliant Friend was published under the pen-name Elena Ferrante, first in Italian and then in translation into several languages. Various names have been suggested as the possible true author (e.g., Milone, Parrella, Prisco, etc.). Based on a corpus of 150 contemporary Italian novels written by 40 authors, two computer-based authorship attribution methods were employed to answer the question "Who is the secret hand behind Elena Ferrante?" To achieve this objective, the nearest-neighbor (k-NN) approach was applied to the 100 to 2,000 most frequent tokens using the Delta model. We conclude that Domenico Starnone is the true author behind Elena Ferrante's pseudonym. As a second approach, using the entire vocabulary, Labbé's model confirms this finding.
Keywords: Authorship attribution, corpus linguistics.

1. Introduction
The publication of L'amica geniale (2011) opens a tetralogy about life in Naples since the 1950s. This series of novels has met with astonishing success, particularly in the United States. However, the stated author, Elena Ferrante, is a pseudonym whose true identity has not been revealed. Scholars and journalists have proposed several names, relying on possible stylistic similarities or on the claim that the author must know post-war Naples, or even be a woman (for example, Erri De Luca, Francesco Piccolo, Michele Prisco, Fabrizia Ramondino, ...). On the basis of the royalties paid, the journalist C. Gatti (Gatti, 2016) claims that Ferrante's pen is held by Anita Raja (wife of the writer Domenico Starnone). No thorough scientific study has addressed this question, but a first sketch indicates that the true author would be Domenico Starnone (Tuzzi et al., 2018).
The identification of the true author of these novels recalls the investigations into the Gary-Ajar relationship in France in the 1970s. In the English-speaking world, the publication of The Cuckoo's Calling (2013) under the signature of R. Galbraith is a similar affair, since the true author was J. K. Rowling (Juola, 2016). The discovery of an unpublished poem likewise raises the question of its true author (Thisted & Efron, 1987; Craig & Kinney, 2009). To lift the veil on Ferrante's exact identity, our study has at its disposal a corpus of 150 contemporary Italian novels. Moreover, we rely on two recognized authorship attribution methods (Juola, 2006) that have been the subject of several studies. Indeed, for evidence to be admissible in court it must possess several characteristics (Chaski, 2013), such as corresponding to best practices in the field, having been tested, and being verifiable and replicable. Finally, we make the hypothesis that the true author behind the Ferrante signature is indeed one of the 39 Italian writers present in our corpus (closed-set attribution).

2. Related work
To determine the identity of a writer, three main paradigms have been proposed (Juola, 2006; Stamatatos, 2009). First, stylometric measures assumed to be invariant for each author have been used, such as the mean sentence length or the ratio of vocabulary size to document size (TTR) (Rexha et al., 2016). Faced with texts of varying sizes, these measures turn out to be unstable (Baayen, 2008). Second, lexical choices make it possible to differentiate authors, both in the selection of words and in their frequency of occurrence ("Le style c'est l'homme", as Buffon said in 1753). To this end, Mosteller & Wallace (1964) propose a semi-automatic selection of the most pertinent word types. Burrows (2002) chooses the most frequent words, in particular the function words (determiners, prepositions, conjunctions, pronouns and auxiliary verbs). The latter have the advantage of being more strongly related to the author's style than to semantics. Such a list comprises between 50 and 1,000 of the most frequent word types (Hoover, 2007), or even the entire vocabulary (Labbé, 2007). Other authors propose defining such a list a priori (Hughes et al., 2012). On this basis, each text is represented by the relative occurrence frequencies of the selected word types. A distance (or similarity) measure then estimates the proximity of two texts, and attribution is usually established according to the nearest-neighbor rule. Third, machine learning models (Stamatatos, 2009) can select the most pertinent features (words, word or letter bigrams, part-of-speech tags, emoticons, etc.). A classifier is then trained to generate the profiles of the candidate authors (Naïve Bayes, logistic regression, SVM, deep learning (Kocher & Savoy, 2017), etc.). Finally, the text of doubtful attribution is represented in the same way, and the name of the most similar profile is returned as the answer.

3. The corpus of contemporary Italian novels
Thanks to the efforts of A. Tuzzi and M. Cortelazzo (University of Padova), the PIC corpus (Padova Italian Corpus) was created in 2017. This collection contains 150 Italian novels covering the period from 1987 to 2016.
As Table 1 indicates, this corpus contains works by 40 authors (including Elena Ferrante with seven texts). During its creation, authors originating from Naples and its region were favored (10 of the 40 names in Table 1), as were women (12, versus 27 men). The corpus contains 9,609,234 tokens, with an average of 64,061 words per work (only one text comprises fewer than 10,000 tokens). The average length of the novels signed by Ferrante amounts to 88,933 words. Finally, editorial control was applied to eliminate non-textual elements (running titles, page numbers, etc.), together with a spelling check. The corpus therefore contains texts from the same period and language, of the same literary genre, and of verified quality. On 7 September 2017, a workshop bringing together seven research teams was held at the University of Padova, during which the name of Domenico Starnone was unanimously identified as the author behind the works of Elena Ferrante. To reach this conclusion, our approach relies on the following techniques.

Table 1: Writers included in the corpus, with gender (M/F) and number of novels
Affinati (M, 2); Ammaniti (M, 4); Bajani (M, 3); Balzano (M, 2); Baricco (M, 4); Benni (M, 3); Brizzi (M, 3); Carofiglio (M, 9); Covacich (M, 2); De Luca (M, 4); De Silva (M, 5); Faletti (M, 5); Ferrante (?, 7); Fois (M, 3); Giordano (M, 3); Lagioia (M, 3); Maraini (F, 5); Mazzantini (F, 4); Mazzucco (F, 5); Milone (F, 2); Montesano (M, 2); Morazzoni (F, 2); Murgia (F, 5); Nesi (M, 3); Nori (M, 3); Parrella (F, 2); Piccolo (M, 7); Pincio (M, 3); Prisco (M, 2); Raimo (M, 2); Ramondino (F, 2); Rea (M, 3); Scarpa (M, 4); Sereni (F, 6); Starnone (M, 10); Tamaro (F, 5); Valerio (F, 3); Vasta (M, 2); Veronesi (M, 4); Vinci (F, 2).

4. Identifying the author behind the Elena Ferrante signature
Our study begins by applying the Delta model (Burrows, 2002), in which the selection of stylistic features corresponds to the k most frequent word types. However, no precise limit for the parameter k is given, and previous work (Savoy, 2015) underlines that values between 200 and 500 tend to bring the best performance. Once this limit is fixed, the Delta method estimates a Z score for each word type $t_i$ based on its relative frequency (denoted $rtf_{ij}$ for term $t_i$ in document $D_j$), as indicated by Equation 1, where $mean_i$ denotes the mean frequency of the word type and $s_i$ its standard deviation:

$Z(t_i, D_j) = \frac{rtf_{ij} - mean_i}{s_i}$   (1)

For each author, all of his or her texts are concatenated to generate the author profile $A_j$. Finally, the distance between the representation of the text to be attributed (denoted Q) and each author profile $A_j$ is computed as the mean absolute difference of the Z scores over the k selected word types (Equation 2):

$\Delta(Q, A_j) = \frac{1}{k} \sum_{i=1}^{k} \left| Z(t_i, Q) - Z(t_i, A_j) \right|$   (2)

The candidate authors can then be sorted, the smallest distance signaling the most probable author. Table 2 gives the first three authors for k = 200, 300 and 500. In the last column (Stopword), the selected word types correspond only to the function words of Italian (k = 307). Table 2 informs us about the attribution of the novel L'amica geniale (2011). Considering the six other works, the same name appears in first place. Likewise, if the number of word types is set to 50, 100, 150, 250, 400, 1,000, 1,500 or 2,000, we always find Starnone in first place, and this for all of Ferrante's works.
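A minimal R sketch of Equations 1 and 2 follows; the matrix rtf of relative frequencies over the k most frequent word types (one row per text or author profile) and the row name "q" for the disputed text are illustrative assumptions, not the study's code.

# Burrows' Delta: z-score each word type column-wise (Eq. 1), then
# average the absolute z-score differences with each author profile (Eq. 2).
# 'rtf' (texts x word types) and the row name "q" are hypothetical.
z <- scale(rtf)  # column-wise standardization (mean 0, sd 1)
delta <- apply(z[rownames(z) != "q", , drop = FALSE], 1,
               function(a) mean(abs(z["q", ] - a)))
sort(delta)[1:3]  # the three most probable authors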
A finer analysis of the distances in Table 2 indicates that the difference (in percentage) between the distances at the first and second ranks is clearly greater than that between the second and third ranks. Thus, for k = 200, the difference between 0.524 and 0.686 amounts to 30.9%, while that between 0.686 and 0.700 is only 2.0%. The first proposed name clearly stands out from the others. In a second series of experiments, we grouped all the novels attributed to Elena Ferrante into a single text (or profile). Varying the number of word types over 50, 100, 150, 200, 250, 300, 400, 500, 1,000, 1,500 and 2,000, Starnone always appears in first place among the authors with the strongest similarity to the Elena Ferrante profile.

Table 2: Ranked lists of the most probable authors for L'amica geniale (Delta method)
Rank   k = 200          k = 300          k = 500          Stopword
1      0.524 Starnone   0.515 Starnone   0.505 Starnone   0.421 Starnone
2      0.686 Veronesi   0.684 Brizzi     0.686 Veronesi   0.640 Milone
3      0.700 Balzano    0.719 Veronesi   0.710 Brizzi     0.660 Veronesi

As a second authorship attribution model, Labbé's (2007) approach suggests using the entire vocabulary. In this case, the distance between two texts A and B (denoted D(A,B) in Equation 3) depends on the absolute frequencies of the word types in the two texts (denoted $tf_{iA}$ and $tf_{iB}$, with i = 1, 2, ..., k). The variable $n_A$ (or $n_B$) indicates the length of text A (in number of tokens). As the two texts are not of identical size, the frequencies of the longer one (B in Equation 3) are multiplied by the ratio of the sizes:

$D(A,B) = \frac{1}{2\, n_A} \sum_{i=1}^{k} \left| tf_{iA} - tf_{iB} \cdot \frac{n_A}{n_B} \right|$, with $n_A \le n_B$   (3)

The values D(A,B) lie between 0 (the same words with identical frequencies) and 1 (no word in common). Applying this method, a distance is computed between each pair of novels, and the pairs are sorted from the smallest to the largest distance. The PIC corpus generates (150 x 149) / 2 = 11,175 pairs; an excerpt is given in Table 3. In this table, first place corresponds to the two most similar works, in our case two novels written by Ferrante, namely Storia di chi fugge e di chi resta (Id: 51, 2013) and Storia della bambina perduta (Id: 52, 2014). The two other novels of the tetralogy follow, from the second to the fourth rank, with Storia del nuovo cognome (Id: 50, 2012) and L'amica geniale (Id: 49, 2011). In fifth position we find two works by Faletti, Niente di vero tranne gli occhi (Id: 42, 2004) and Io sono Dio (Id: 44, 2009), then two novels by Veronesi (Id: 145, Caos calmo (2009) and Id: 147, Terre rare (2014)). At small distances, the pairings occur between works written by the same author within a fairly short time interval.

Table 3: Ranked list of the most similar novels (Labbé method)
Rank   Distance   Work 1           Work 2
1      0.140      51 Ferrante      52 Ferrante
2      0.148      50 Ferrante      51 Ferrante
3      0.155      49 Ferrante      50 Ferrante
4      0.157      50 Ferrante      52 Ferrante
5      0.165      42 Faletti       44 Faletti
6      0.166      145 Veronesi     147 Veronesi
...    ...        ...              ...
43     0.228      47 Ferrante      127 Starnone
...    ...        ...              ...
63     0.241      108 Raimo        147 Veronesi

As the distance increases, the probability that the two linked works are by the same author decreases.
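Equation 3 translates into a short R function, sketched here on hypothetical aligned absolute-frequency vectors tfA and tfB defined over the union of the two vocabularies.

# Labbé's intertextual distance (Eq. 3): rescale the longer text's
# frequencies by the size ratio, then sum the absolute differences.
# 'tfA' and 'tfB' are hypothetical aligned frequency vectors.
labbe <- function(tfA, tfB) {
  if (sum(tfA) > sum(tfB)) { tmp <- tfA; tfA <- tfB; tfB <- tmp }
  nA <- sum(tfA); nB <- sum(tfB)
  sum(abs(tfA - tfB * nA / nB)) / (2 * nA)
}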
The first apparently incorrect link occurs at rank 43, with a work by Ferrante (Id: 47, I giorni dell'abbandono (2002)) paired with one by Starnone (Id: 127, Eccesso di zelo (1993)). A pairing between these two authors also appears at ranks 44, 53 and 54, before another type of error is discovered at position 63, linking a novel written by Raimo (Id: 108, Il peso della grazia (2012)) and one by Veronesi (Id: 147, Terre rare (2014)). Further pairings between Ferrante and Starnone then appear at ranks 65, 69, 71, 72, 73 and 74: in total, ten pairs between these two authors and only a single one involving other writers. Knowing that Ferrante is a pseudonym, the strong stylistic similarity with Starnone makes the latter a first-rate candidate.

5. Analysis
Lexical choices are not the result of chance, and each author has preferences that are detectable by stylistic measures. The closeness between Ferrante and Starnone can also be explained by analyzing a few examples. In our corpus, Ferrante's seven novels correspond to 6.5% of its size, while 6.4% consists of Starnone's ten works. If the occurrence frequencies of certain words deviate from these proportions in the same direction for both authors, we can bring their styles together. The noun padre (father) appears 9,815 times in the PIC corpus: 833 occurrences (8.5% of the total) in Ferrante's works and 1,170 (11.9%) in Starnone's. This word is clearly used more frequently by these "two" authors. Similarly, the word madre (mother) has a frequency of 8,246 in the corpus, with 1,104 occurrences (13.4%) under Ferrante's pen and 762 (9.2%) under Starnone's. Other function words show similar distributions. Thus the word persino (even) appears 1,351 times in the PIC collection, 266 of them (19.7%) in Ferrante and 205 (15.2%) in Starnone. Note that this term can also be written perfino (with an occurrence frequency of 20 in Ferrante and 18 in Starnone). For Ferrante and Starnone we see a preference for one of the forms, while other authors use only one of the two spellings (Baricco: only perfino; Tamaro: only persino). Finally, some writers ignore both words (Covacich, Parrella) or use them very rarely (De Luca or Balzano). As complementary examples, certain words are used only by Ferrante and Starnone, such as risatella (little laugh; 16 occurrences in Ferrante, 4 in Starnone) or contraddittoriamente (contradictorily; Ferrante: 6, Starnone: 9). For an Italian writer, the lexicon may include forms drawn from dialect, such as the Neapolitan term strunz (stronzo in standard Italian). This term appears 85 times in the corpus, with 63 occurrences in Starnone's novels and 18 in Ferrante's (and twice in De Silva and Raimo). Some word n-grams also turn out to be more frequent in Ferrante and Starnone, such as no essere che (not to be that), which appears 23 times (100%) in the corpus, 6 of them (26.1%) under Ferrante's pen and 7 (30.4%) under Starnone's. Together these two authors account for more than 56% of the occurrences of this sequence.

6. Conclusion
This study relies, on the one hand, on two recognized authorship attribution methods and, on the other, on a corpus of 150 contemporary novels written by 40 authors.
As stylistic features, we retained the 100, 150, 200, 250, 300, 400, 500, 1,000, 1,500 and 2,000 most frequent words for the Delta method (Burrows, 2002). With these different values, the first name returned as the probable author is always Domenico Starnone, and this for the seven novels published under the name Ferrante. Relying on the entire vocabulary and Labbé's (2007) method, the same conclusion is reached. Analyzing a few lexical choices reveals close relations between Starnone and Ferrante. For example, the word persino is over-used in the novels of both authors, while the second form perfino appears only rarely; in other writers, one usually finds a preference for one of the two terms or the absence of both. Finally, following the workshop held in Padova on 7 September 2017, which designated Domenico Starnone as the writer behind the Ferrante signature, Starnone denied being the true author (Fontana, 2017).

Acknowledgements
This research was made possible thanks to A. Tuzzi and M. Cortelazzo, who provided us with the PIC corpus.

References
Baayen, H.R. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge University Press, Cambridge.
Burrows, J.F. (2002). Delta: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267-287.
Chaski, C. (2013). Best practices and admissibility of forensic author identification. Journal of Law and Policy, 21(2), 333-376.
Craig, H., and Kinney, A.F. (2009). Shakespeare, Computers, and the Mystery of Authorship. Cambridge University Press, Cambridge.
Fontana, E. (2017). Lo scrittore Domenico Starnone: "Io non sono Elena Ferrante". Il Giornale, 9 September.
Gatti, C. (2016). La véritable identité d'Elena Ferrante révélée. BibliObs, 2 October 2016.
Hoover, D.L. (2007). Corpus stylistics, and the styles of Henry James. Style, 41(2), 160-189.
Hughes, J.M., Foti, N.J., Krakauer, D.C., and Rockmore, D.N. (2012). Quantitative patterns of stylistic influence in the evolution of literature. Proceedings of the National Academy of Sciences, 109(20), 7682-7686.
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233-334.
Juola, P. (2016). The Rowling case: A proposed standard analytic protocol for authorship questions. Digital Scholarship in the Humanities, 30(1), i100-i113.
Kocher, M., and Savoy, J. (2017). Distributed language representation for authorship attribution. Digital Scholarship in the Humanities, to appear.
Labbé, D. (2007). Experiments on authorship attribution by intertextual distance in English. Journal of Quantitative Linguistics, 14(1), 33-80.
Mosteller, F., and Wallace, D.L. (1964). Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Addison-Wesley, Reading.
Rexha, A., Klampfl, S., Kröll, M., and Kern, R. (2016). Towards a more fine grained analysis of scientific authorship. In Proceedings ECIR 2016, pp. 26-31.
Savoy, J. (2015). Comparative evaluation of term selection functions for authorship attribution. Digital Scholarship in the Humanities, 30(2), 246-261.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.
Tuzzi, A., and Cortelazzo, M. (2018). What is Elena Ferrante? A comparative analysis of a secretive bestselling Italian writer. Digital Scholarship in the Humanities, to appear.
Regroupement d'auteurs dans la littérature du XIXe siècle
Jacques Savoy
Université de Neuchâtel (Suisse) – Jacques.Savoy@unine.ch

Abstract
This paper presents the author clustering problem, in which a set of n texts written by several distinct authors must be regrouped into k clusters, each corresponding to a single author. The proposed model can use different distance measures and feature sets (composed of the most frequent word types). The evaluation is based on a French corpus composed of 200 excerpts of novels written during the 19th century. Varying the parameter settings, the evaluation indicates that better performance is achieved with words than with letter n-grams. The cosine distance achieves lower performance levels compared to the Tanimoto (L1) or Matusita (L2) functions. Text size plays an important role in the effectiveness of the solution: a size of 10,000 tokens produces significantly better results than sizes from 5,000 down to 500 tokens. A more detailed analysis provides explanations for the stylistic aspects of some authors.
Keywords: Automatic classification, unsupervised machine learning, authorship attribution.

1. Introduction
The authorship attribution problem (Juola, 2006) is attracting growing interest with the multiplication of electronic channels. The presence of anonymous or pseudonymous messages raises numerous forensic challenges (Olsson, 2008; Chaski, 2013), for example slanderous chats or threatening emails. Yet more classical questions also deserve attention, such as determining the true identity of the novelist Elena Ferrante (Gatti, 2016) or the relations between Shakespeare and his co-authors (Michell, 1996; Craig & Kinney, 2009). In this context, this paper presents the problems linked to author clustering, with an application to nineteenth-century French literature. The problem can be summarized as follows: given a set of n novel excerpts, one must regroup them into k disjoint classes, each containing all the writings of the same author. This problem was posed during the CLEF-PAN 2016 and 2017 evaluation campaigns (Stamatatos et al., 2016), but the test collections were not made public. A major difficulty of this problem is the absence of training data.

2. Related work
To identify the author of a text, three families of approaches have been proposed (Juola, 2006).
En premier lieu, des mesures stylométriques supposées invariantes ont été évoquées, comme la longueur moyenne des phrases ou la taille du vocabulaire par rapport à la longueur du document (rapport TTR) (Rexha et al., 2016). Toutes ces mesures possèdent l'inconvénient d'être instables face à des textes de tailles différentes (Baayen, 2008). Une deuxième famille d'approches se fonde sur le vocabulaire. Mosteller & Wallace (1964) proposent de sélectionner de manière semi-automatique les vocables les plus pertinents. Burrows (2002) sélectionne les mots les plus fréquents et, en particulier, les mots fonctionnels (déterminants, prépositions, conjonctions, pronoms et verbes auxiliaires). Ces derniers possèdent l'avantage d'être plus fortement reliés au style de l'auteur qu'à la sémantique. Cette liste comprendra entre 50 et 1 000 vocables les plus fréquents (Hoover, 2007). D'autres auteurs proposent de définir a priori une telle liste (Hughes et al., 2012). Ainsi, chaque texte peut être représenté par les fréquences d'occurrence de ces vocables. Ensuite, une mesure de distance (ou de similarité) permet d'estimer la proximité de deux textes. L'attribution s'établit habituellement selon la règle du plus proche voisin. Troisièmement, des modèles d'apprentissage automatique (Stamatatos, 2009) permettent de sélectionner les attributs (mots, bigrammes de mots ou de lettres, POS, émoticônes, etc.) possédant le meilleur pouvoir discriminant. Ensuite, un classifieur est entraîné sur un ensemble d'apprentissage (SVM, régression logistique, etc.). Cependant, dans le cadre du regroupement d'auteurs, aucune donnée d'entraînement n'est disponible, rendant caduques de telles approches. Dès lors, pour résoudre ce problème, des approches proposent de déterminer en premier lieu le nombre k d'auteurs sur l'ensemble des n écrits (Stamatatos et al., 2016). Cette valeur fixée, on applique un algorithme de classification k-means afin d'identifier les différents groupes de textes. Par itération, le nombre k d'auteurs peut être affiné. Comme second paradigme, la distance entre chaque paire d'écrits est calculée, puis on applique un algorithme de classification hiérarchique (Lebart et al., 1998) pour former les grappes de documents. Dans cette étude, nous suivrons cette seconde stratégie de résolution, choix qui nous a permis d'obtenir le deuxième rang lors de la dernière campagne d'évaluation PAN-CLEF 2016.

3. Corpus de test et méthodologie d'évaluation
L'évaluation empirique tient une place importante en attribution d'auteur. Comme les corpus des campagnes PAN-CLEF 2016 et 2017 n'ont pas été rendus publics, nos évaluations seront basées sur une collection extraite de la littérature française du XIXe siècle. Ce corpus, nommé St-Jean1, contient 200 extraits de romans écrits par 30 auteurs (entre 1801 (Chateaubriand, Atala) et 1901 (Régnier, Les Rencontres de Monsieur de Bréot)). Ce nombre d'écrivains et de textes étant élevé, la tâche demeure ardue. Chaque auteur est représenté par au moins trois extraits (avec un maximum de treize pour Balzac) provenant d'un à six romans, et aucun écrivain ne représente plus de 5 % du corpus. Chaque extrait contient en moyenne 10 073 formes (min : 10 026 ; max : 10 230 ; écart-type : 25). Au total, ce corpus contient 2 014 641 formes pour 51 661 vocables extraits de 67 romans. Disposant de n textes, notre approche produira une liste ordonnée de liens entre textes avec une indication de la distance entre eux. Un exemple est présenté dans le tableau 1.
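À titre d'illustration, la production d'une telle liste ordonnée de liens peut s'esquisser comme suit. Il s'agit d'une esquisse minimale de notre part, et non du code utilisé pour les expériences : nous supposons des textes déjà tokenisés et reprenons, à titre d'exemple, la distance de Manhattan (norme L1) sur les fréquences relatives des m vocables les plus fréquents.

```python
from collections import Counter
from itertools import combinations

def liens_ordonnes(textes, m=300):
    """textes : dict {identifiant: liste de tokens}.
    Retourne les paires de textes triées par distance croissante,
    comme dans le tableau 1."""
    # vocables les plus fréquents, calculés sur l'ensemble du corpus
    compte_global = Counter(tok for toks in textes.values() for tok in toks)
    vocables = [v for v, _ in compte_global.most_common(m)]
    # chaque texte est représenté par les fréquences relatives de ces vocables
    profils = {}
    for ident, toks in textes.items():
        n, c = len(toks), Counter(toks)
        profils[ident] = [c[v] / n for v in vocables]
    # distance de Manhattan (L1) entre chaque paire de textes
    liens = [(sum(abs(x - y) for x, y in zip(pa, pb)), a, b)
             for (a, pa), (b, pb) in combinations(profils.items(), 2)]
    return sorted(liens)
```

La classification hiérarchique s'opère ensuite à partir de cette matrice de distances.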
Avec ce corpus, la solution se compose de 30 groupes requérant la présence de 670 liens intra-auteurs. Comme mesure d'évaluation, nous reprenons la précision moyenne (AP) (la moyenne des précisions obtenues pour chaque lien pertinent), mesure usitée lors des campagnes PAN-CLEF 2016 et 2017. Ainsi, une valeur unique de performance reflète la qualité de chaque modèle de classification. Comme seconde mesure, la valeur HP (haute précision) indique le nombre de liens correctement établis depuis le début de la liste jusqu'à la présence du premier lien erroné. Dans notre tableau 1, la valeur HP = 168 signale que les 168 premiers liens sont justes.

1 Ce corpus a été créé par D. Labbé et est disponible (www.unine.ch/clc/home/corpus.html) soit sous la forme de textes, soit lemmatisé. Les encodages UTF-8 et Windows sont disponibles.

Tableau 1 : Exemple d'un extrait d'une liste ordonnée selon la distance (Tanimoto)

  Rang   Distance   Texte 1          Texte 2
  1      0,239      51 Flaubert      62 Flaubert
  2      0,242      3 Flaubert       20 Flaubert
  3      0,248      29 Sand          115 Sand
  4      0,248      122 Staël        140 Staël
  5      0,253      125 Fromentin    159 Fromentin
  6      0,255      37 Flaubert      62 Flaubert
  7      0,256      132 Régnier      162 Régnier
  ...    ...        ...              ...
  169    0,324      42 Maupassant    51 Flaubert

4. Sélection des attributs et mesure de distance
Afin de regrouper les documents selon leur auteur, nous devons les représenter en fonction de leur style et non en fonction des thèmes qu'ils abordent. Comme mentionné précédemment, plusieurs études ont démontré que les vocables les plus fréquents constituent des attributs pertinents pour détecter le style d'un auteur. Dans le cadre de l'attribution d'auteur, le thème pourrait perturber des affectations correctes lorsque, par exemple, deux auteurs abordent des sujets similaires. Pour cerner les aspects stylistiques, une étude récente a démontré que tenir compte des 200 à 300 mots les plus fréquents (Savoy, 2015) apporte de bonnes performances comparées à d'autres fonctions de sélection (rapport des cotes, gain d'information, chi-carré, etc.). Sur la base du corpus St-Jean, les mots les plus fréquents sont : de (4,11 % des occurrences), et (2,44 %), la (2,36 %), le (1,94 %) et à (1,9 %). Comme alternative, plusieurs études proposent de recourir aux fréquences des lettres et des bigrammes de lettres et, plus généralement, des n-grammes, afin de distinguer les différents styles (Kjell, 1994), (Juola, 2006). On remarquera toutefois que les composantes stylistiques et thématiques seront toutes les deux présentes dans la génération de tels n-grammes. Dans cette étude, la distinction entre majuscules et minuscules est ignorée et les signes de ponctuation sont éliminés. Par contre, on tiendra compte du fait qu'une lettre débute ou termine un mot. Le nombre maximal d'attributs s'élève à (27 x 27) + 27 = 756. Pour la langue française, on retrouve 594 (ou 78,6 %) des combinaisons possibles dans notre corpus. Les lettres françaises les plus fréquentes sont : e (15,6 % des lettres), s (8,3 %), a (8,3 %), i (7,5 %) et t (7,2 %). En indiquant l'espace par _, les bigrammes de lettres les plus usuels sont : e_ (5,1 % des bigrammes), s_ (3,5 %), t_ (2,7 %), _d (2,4 %) et _l (1,8 %). Dès que chaque document est représenté par m mots (ou n-grammes de lettres), on peut calculer sa distance avec les autres entités du corpus. Le choix de cette fonction de distance (ou de similarité) peut s'opérer selon des critères théoriques (par exemple, symétrie, inégalité triangulaire) ou empiriques (efficacité).
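À titre d'illustration, voici une esquisse des quatre mesures de distance comparées ci-après, appliquées à deux vecteurs de fréquences relatives. Les formules exactes figurent dans (Kocher & Savoy, 2017) ; nous reprenons ici des formes usuelles — en particulier, pour Tanimoto, une variante L1 pour vecteurs positifs — qui relèvent de notre hypothèse et non du texte original.

```python
import numpy as np

def manhattan(a, b):   # norme L1
    return np.abs(a - b).sum()

def tanimoto(a, b):    # variante L1 usuelle pour vecteurs positifs (hypothèse)
    return np.abs(a - b).sum() / np.maximum(a, b).sum()

def matusita(a, b):    # norme L2 appliquée aux racines carrées des fréquences
    return np.sqrt(((np.sqrt(a) - np.sqrt(b)) ** 2).sum())

def cosinus(a, b):     # 1 - similarité cosinus
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
```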
Basée sur le profilage d'auteur, une étude récente (Kocher & Savoy, 2017) indique qu'aucune mesure de distance ne s'avère toujours la meilleure. Par contre, un groupe restreint permet d'obtenir de bonnes performances, comme la distance de Manhattan ou de Tanimoto, basées sur la norme L1, ou celle de Matusita (norme L2). Nous avons repris ces mesures en y ajoutant la distance du cosinus. Ces quatre mesures respectent la symétrie et l'inégalité triangulaire (Kocher & Savoy, 2017). Dans la définition de ces mesures de distance, les lettres majuscules indiquent les vecteurs représentant les documents. Les minuscules (ai, bi) correspondent aux fréquences relatives des termes sélectionnés.

5. Évaluation
Notre première évaluation concerne l'efficience des différentes mesures de distance ainsi que la performance selon le nombre de vocables les plus fréquents retenus comme attributs. Le tableau 2 indique les valeurs de précision moyenne (AP) et de haute précision (HP) en représentant les textes par les 100 à 1 000 vocables les plus fréquents, ou par tout le vocabulaire. La dernière ligne et la dernière colonne renseignent sur la moyenne des AP.

Tableau 2 : Précision moyenne (AP) et haute précision (HP) selon diverses mesures de distance avec des représentations construites entre 100 vocables et tout le vocabulaire

             Manhattan     Tanimoto      Matusita      Cosinus       Moyenne
  Attributs  AP     HP     AP     HP     AP     HP     AP     HP     AP
  100        0,674  185    0,695  192    0,655  181    0,626  152    0,663
  200        0,692  186    0,708  193    0,687  222    0,628  145    0,679
  300        0,705  190    0,720  196    0,727  244    0,629  148    0,695
  500        0,720  186    0,735  189    0,750  212    0,627  149    0,708
  1 000      0,730  183    0,743  186    0,745  204    0,617  142    0,709
  Tout       0,713  166    0,672  168    0,568  135    0,599  142    0,691
  Moyenne    0,706  183    0,712  187    0,689  200    0,621  146    0,681

Ces résultats indiquent que les différences de précision moyenne restent faibles entre les mesures de Manhattan, Tanimoto et Matusita. Toutes les trois s'avèrent supérieures au cosinus. En considérant la haute précision (HP), Matusita tend à apporter une meilleure efficacité. Reste à déterminer a priori cette valeur maximale, sans connaître les attributions correctes. Enfin, une représentation par les 300 à 500, voire 1 000 vocables les plus fréquents fournit les meilleurs taux de succès. En remplaçant les vocables par des n-grammes de lettres (performances indiquées dans le tableau 3), les valeurs de performance s'avèrent inférieures à celles des vocables. La variation des taux de succès entre une combinaison d'uni- et de bigrammes de lettres (deuxième ligne du tableau 3) et des séquences plus longues s'avère peu élevée. Par contre, les temps de traitement s'accroissent rapidement (8,2 minutes pour les uni- et bigrammes, plus de 4 heures pour les 5-grammes, comparées à 3 minutes avec les 500 mots les plus fréquents). Enfin, la fonction cosinus retourne les performances les moins bonnes. Nos premières évaluations se fondaient sur l'ensemble du texte disponible, soit environ 10 000 mots. Si l'on réduit cette taille à 5 000, voire à 500 mots, les taux de réussite obtenus sont indiqués dans le tableau 4. La première ligne est reprise du tableau 2, puis les tailles décroissent comme le signale la première colonne. La réduction moyenne des performances est reprise dans la dernière colonne. Ainsi, en réduisant les textes à 5 000 mots, la baisse moyenne s'élève à 25,8 %. Si l'on doit œuvrer avec des longueurs de 1 000 à 500 mots, les taux de réussite s'avèrent faibles, générant une réduction de 80 à 90 %. Est-il vraiment raisonnable d'effectuer des attributions d'auteur avec de telles tailles ?
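L'extraction des bigrammes de lettres avec marquage des frontières de mots, décrite à la section 4, peut s'esquisser comme suit ; le tableau 3 ci-après en présente les performances. Cette esquisse relève de notre interprétation : nous supposons un repli naïf des lettres accentuées sur les 26 lettres de base, conformément au décompte de 27 symboles (espace compris) donné plus haut.

```python
import re
import unicodedata
from collections import Counter

def bigrammes_lettres(texte):
    """Fréquences relatives des bigrammes de lettres ; _ marque le début
    et la fin de chaque mot, la ponctuation est éliminée."""
    # minuscules et suppression des accents (26 lettres + frontière = 27 symboles)
    t = unicodedata.normalize("NFD", texte.lower())
    t = "".join(c for c in t if not unicodedata.combining(c))
    bigrams = Counter()
    for mot in re.findall(r"[a-z]+", t):
        forme = "_" + mot + "_"
        bigrams.update(forme[i:i + 2] for i in range(len(forme) - 1))
    total = sum(bigrams.values())
    return {bg: n / total for bg, n in bigrams.items()}
```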
Tableau 3 : AP et HP selon diverses mesures de distance avec des n-grammes de lettres

             Manhattan     Tanimoto      Matusita      Cosinus       Moyenne
  n-grams    AP     HP     AP     HP     AP     HP     AP     HP     AP
  uni & bi   0,559  139    0,559  139    0,503  128    0,538  94     0,540
  3-gram     0,527  108    0,527  108    0,471  130    0,476  108    0,500
  4-gram     0,570  153    0,570  153    0,507  147    0,481  112    0,532
  5-gram     0,587  177    0,587  177    0,541  181    0,543  73     0,565
  6-gram     0,588  200    0,588  200    0,557  188    0,415  36     0,588
  Moyenne    0,566  155    0,566  155    0,506  147    0,510  97     0,545

Tableau 4 : AP et HP selon diverses mesures de distance avec des textes de tailles différentes (représentation sur la base de 300 vocables)

             Manhattan     Tanimoto      Matusita      Cosinus       Moyenne   Différence
  Taille     AP     HP     AP     HP     AP     HP     AP     HP     AP
  10 000     0,705  190    0,720  196    0,727  244    0,629  148    0,695
  5 000      0,526  55     0,545  58     0,526  85     0,466  74     0,516     -25,8 %
  2 500      0,326  31     0,342  39     0,306  35     0,284  11     0,315     -54,8 %
  1 000      0,152  4      0,152  2      0,116  1      0,141  3      0,140     -79,8 %
  500        0,093  2      0,089  2      0,079  3      0,086  2      0,087     -87,5 %

En analysant la liste triée obtenue avec la fonction Matusita et en représentant les textes par les 300 vocables les plus fréquents, les distances les plus faibles se retrouvent entre des extraits de la même œuvre. La distance la plus faible se trouve avec le roman Les Rencontres de Mr de Bréot (1901) de Régnier ; on trouve ensuite Bouvard et Pécuchet (1881) de G. Flaubert, Delphine (1803) de Mme de Staël, Mme Bovary (1857) de G. Flaubert et La Petite Fadette (1832) de G. Sand. Si l'on analyse les appariements les plus difficiles entre deux œuvres du même auteur, les romans Graziella (1852) et Geneviève (1863) de A. de Lamartine constituent le lien le plus distant. Ensuite, on rencontre La double Maîtresse (1900) de H. de Régnier, Aurélia (1855) et Les Illuminés (1852) de G. de Nerval, puis Le père Goriot (1833) et La Maison Nucingen (1838) de H. de Balzac. Ces auteurs peuvent adopter des styles assez dissemblables, rendant une attribution plus ardue. Parmi les œuvres dont le style est perçu comme proche par la machine mais qui sont écrites par deux auteurs distincts, on trouve en tête Bel-Ami (Maupassant, 1885) et Mme Bovary (Flaubert, 1857), puis Volupté (Sainte-Beuve, 1834) et Dominique (Fromentin, 1862), Notre Cœur (Maupassant, 1890) et Mme Bovary (Flaubert, 1857), et enfin L'Assommoir (Zola, 1879) et Mme Bovary (Flaubert, 1857).

6. Conclusion
Parmi les fonctions de distance, notre étude indique que le cosinus n'apporte pas de bons résultats. Par contre, les différences de performance entre les fonctions Manhattan, Tanimoto et Matusita demeurent faibles. Afin de cerner une partie importante du style des auteurs, le recours à une représentation sur la base de vocables s'avère plus efficient que le recours aux n-grammes de lettres (pour n variant de 1 à 6). Représenter le style avec les 300 à 500 vocables les plus fréquents s'avère pertinent. Lorsque l'on compare la précision moyenne (AP) et la haute précision (HP), le choix des paramètres optimaux diffère quelque peu d'une mesure de performance à l'autre. Notons que l'AP ne punit pas sévèrement les erreurs d'affectation, erreurs qui entraînent immédiatement une baisse de la valeur HP. Enfin, la taille des textes joue un rôle essentiel dans une attribution d'auteur, et des valeurs inférieures à 1 000 mots ne permettent que des affectations souvent douteuses. Parmi les auteurs retenus, le style du roman Mme Bovary se rapproche de celui de Maupassant (Bel-Ami) ou de Zola (L'Assommoir).

Remerciements
L'auteur remercie D. Labbé pour avoir mis à sa disposition le corpus St-Jean.
Références
Baayen, H.R. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge University Press, Cambridge.
Burrows, J.F. (2002). Delta: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3):267-287.
Chaski, C. (2013). Best practices and admissibility of forensic author identification. Journal of Law and Policy, 21(2):333-376.
Craig, H., & Kinney, A.F. (2009). Shakespeare, Computers, and the Mystery of Authorship. Cambridge University Press, Cambridge.
Gatti, C. (2016). La véritable identité d'Elena Ferrante révélée. BibliObs, 2 octobre 2016.
Hoover, D.L. (2007). Corpus stylistics, and the styles of Henry James. Style, 41(2):160-189.
Hughes, J.M., Foti, N.J., Krakauer, D.C., & Rockmore, D.N. (2012). Quantitative patterns of stylistic influence in the evolution of literature. Proceedings of the National Academy of Sciences (PNAS), 109(20):7682-7686.
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334.
Kjell, B. (1994). Authorship determination using letter pair frequency features with neural network classifier. Literary and Linguistic Computing, 9(2):119-124.
Kocher, M., & Savoy, J. (2017). Distance measures in author profiling. Information Processing & Management, 53(5):1103-1119.
Labbé, D. (2007). Experiments on authorship attribution by intertextual distance in English. Journal of Quantitative Linguistics, 14(1):33-80.
Lebart, L., Salem, A., & Berry, L. (1998). Exploring Textual Data. Kluwer, Dordrecht.
Michell, J. (1996). Who Wrote Shakespeare? Thames and Hudson, New York (NY).
Mosteller, F., & Wallace, D.L. (1964). Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Addison-Wesley, Reading.
Muller, C. (1992). Principes et méthodes de statistique lexicale. Honoré Champion, Paris.
Olsson, J. (2008). Forensic Linguistics. Continuum, London.
Rexha, A., Klampfl, S., Kröll, M., & Kern, R. (2016). Towards a more fine grained analysis of scientific authorship. Proceedings ECIR 2016, pp. 26-31.
Savoy, J. (2015). Comparative evaluation of term selection functions for authorship attribution. Digital Scholarship in the Humanities, 30(2):246-261.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.
Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., & Potthast, M. (2016). Clustering by authorship within and across documents. Working Papers, CLEF-2016.

What's Old and New? Discovering Topics in the American Journal of Sociology1

Stefano Sbalchiero, Arjuna Tuzzi
University of Padova – stefano.sbalchiero@unipd.it; arjuna.tuzzi@unipd.it

Abstract
Nowadays, the field of text mining seems to be very active in dealing with the increasing mass of available digital texts, and several algorithms have been proposed to analyze and synthesize the vast amount of data that today represents a challenging source of information overload. Topic modeling is a collection of algorithms useful for discovering themes, i.e. topics, in unstructured text. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) was one of the first topic modeling algorithms, and since then the field has remained active and many variants and other algorithms have been suggested.
The present study considers a topic as an indicator of the relevance of a research area in a specific time-span, and its temporal evolution pattern as a way to identify paradigm changes in terms of theories, ideas, forgotten topics, evergreen subjects and new emerging research interests. The study aims to contribute to a substantive reflection in Sociology by exploring the temporal evolution of topics in the abstracts of articles published by the American Journal of Sociology in the last century (1921-2016). Within the classical LDA perspective, the study also focuses on topics with a significantly increasing or decreasing trend (Griffiths et Steyvers, 2004). The results show different shifts that involved relevant reflections on various issues, from the early debate on the "institutionalization" process of Sociology as a scientific discipline to recent developments of sociological topics that clearly indicate how sociologists have reacted to new social problems.

Keywords: Chronological corpus, History of Sociology, Academic Journals, Text Mining, Latent Dirichlet Allocation

1 This study was supported by the University of Padova, fund CPDA145940 (2014) "Tracing the History of Words. A Portrait of a Discipline Through Analyses of Keyword Counts in Large Corpora of Scientific Literature" (P.I. Arjuna Tuzzi, University of Padova).

1. Introduction: topic modeling
As evidenced by the literature on topic modeling (Blei et al., 2003; Ponweiser, 2012; Grimmer et Stewart, 2013; Griffiths et Steyvers, 2004), text mining approaches can mitigate the problem of analysing huge collections of textual data which, as they increase in number and size, complicate all information processing. From a methodological point of view, since the topics emerge directly from the data, text mining approaches can tone down some problems about the role of analysts in coding and interpreting the content hidden in corpora, e.g. research bias or errors that notoriously affect most approaches in comparative and quanti-qualitative research (Strauss et Corbin, 1990; Corbetta, 2003). A popular approach to extracting information by summarizing the main contents embedded in a relevant collection of texts in digital form is known as topic modeling (Blei et Lafferty, 2009), which is essentially a collection of algorithms that are exploited to discover themes, i.e. topics, in unstructured and complex texts. Latent Dirichlet Allocation (LDA) is one of the first topic modeling algorithms, namely a "generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words" (Blei et al., 2003, p. 996). LDA is a technique that facilitates the automatic discovery of themes in a collection of documents. Since a text document can deal with different topics, and the words that occur in that document reflect a set of possible topics, in "statistical natural language processing, one common way of modeling the contributions of different topics to a document is to treat each topic as a probability distribution over words, viewing a document as a probabilistic mixture of these topics" (Griffiths et Steyvers, 2004, p. 5228). Actually, we cannot directly observe topics but only documents and words, as topics are part of the latent and hidden text structure.
The model infers the latent topic structure from the observed words and documents: this is LDA's generative process, which recreates (generates) the documents of the corpus by assigning the probability of topics (their relevance) to documents and the probability of words to topics. The result is a probabilistic distribution of topics over documents, each topic characterized and described by a cluster of co-occurring words (Blei et al., 2003). This list of words enables the researcher to interpret the meaning of all the generated topics. For the purposes of the present study, the temporal variable is crucial to analyse the direction and evolution of topics, particularly to the extent that they have a direct relationship with the most significant shifts in the development of Sociology as a discipline over time. For these reasons, we propose an LDA-based topic detection procedure, as this "method discovers a set of topics expressed by documents, providing quantitative measures that can be used to identify the content of those documents, track changes in content over time" (Griffiths et Steyvers, 2004, p. 5228). An additional estimation procedure exploits a meta-variable (year) to explore topic trends: LDA offers the opportunity to estimate the slope of a linear model that represents the distribution of topics by year. The model permits the identification of "hot and cold topics" (Griffiths et Steyvers, 2004), i.e. topics with significantly increasing (hot) or decreasing (cold) trends through time.

2. Corpus and data
The American Journal of Sociology (AJS), established in 1895 as the first U.S. scholarly journal in its field, can be considered one of the world's preeminent journals and a leading voice for research in the social sciences. The journal fosters pathbreaking work from all areas of sociology, with an emphasis on theory building and innovative methods. AJS is a multi-disciplinary journal that strives to speak to a "general sociological reader" and is open to sociologically informed contributions from anthropologists, statisticians, economists, educators, historians, and political scientists. Manuscripts are subjected to a double-blind review process, and published articles are considered representative of the best current theoretical and methodological debates. Our corpus includes all the abstracts of the papers published by AJS that have been retrieved from popular archives (Scopus and Web of Science) and the journal webpages. We decided to work on the abstracts since they provide concise information about the main contents of all articles. With regard to selection criteria, they were based on the following consideration: when abstracts did not provide any information about the content or did not refer to relevant scientific contributions (e.g. editorials, mastheads, errata, acknowledgements, rejoinders, notes, announcements, corrections, lists of consultants, obituaries, etc.), we decided to disregard them in further analyses. The corpus is composed of 3,992 abstracts, collected for a period of almost a century (mean: 41 per year), from Volume No. 27, Issue No. 1 (1921) to the latest, No. 121, Issue No. 4 (2016).
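The authors fitted their models with R packages (see footnote 3 below). As an illustration only, the core of the procedure — fitting an LDA model and estimating a per-topic linear trend over years to separate hot from cold topics — can be sketched as follows. This is a minimal sketch under our own assumptions (scikit-learn's LDA implementation, no significance testing), not the authors' actual code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def hot_and_cold(abstracts, years, k=30):
    """abstracts: list of strings; years: publication year of each abstract.
    Returns one slope per topic: the linear trend of the topic's mean share
    by year (most positive = 'hot' topics, most negative = 'cold')."""
    X = CountVectorizer(stop_words="english").fit_transform(abstracts)
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    theta = lda.fit_transform(X)                       # document-topic shares
    theta = theta / theta.sum(axis=1, keepdims=True)
    years = np.array(years)
    yrs = np.array(sorted(set(years.tolist())))
    # mean topic distribution for each year, then a linear fit per topic
    by_year = np.vstack([theta[years == y].mean(axis=0) for y in yrs])
    return np.array([np.polyfit(yrs, by_year[:, t], 1)[0] for t in range(k)])
```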
The collected texts have relevant contents for the purpose of the present analysis, based on the following consideration and hypothesis: if we consider a topic as an indicator of the relevance of a research area in a specific time-span, then the temporal evolution pattern of subject matters can portray the main paradigm changes in terms of theories, ideas, forgotten topics, evergreen subjects and new emerging research interests in Sociology. The corpus has been pre-processed by means of the TaLTaC2 software package. After tokenization (the identification of words by chopping character sequences into pieces), the corpus has been normalized by replacing uppercase with lowercase letters. An automatic search procedure identified relevant multi-words (MWs), i.e. informative sequences of words (Pavone, 2010) repeated at least five times in the corpus (849 MWs in total). This procedure retrieved the most interesting MWs in the abstracts (e.g. united states, fr. 395; social structure, fr. 115; social science, fr. 101; labor market, fr. 89; social change, fr. 78) and contributed to increasing the amount of information conveyed by sequences of words2. Then, the corpus has been processed by means of R software packages3: punctuation marks and numbers have been removed, as well as some grammatical words (articles, conjunctions, prepositions, pronouns). The corpus is composed of 24,418 word-types and 512,410 word-tokens (tab. 1), and the measures show that there is a sufficient level of redundancy to proceed with statistical analyses of textual data (Lebart et al., 1998; Trevisani et Tuzzi, 2015; Bolasco, 2013).

Table 1. Basic lexical measures of the corpus of AJS abstracts
  (V) word-types                          24,418
  (N) word-tokens                         512,410
  (V/N)*100 = type/token ratio            4.76
  (V1/V)*100 = percentage of hapax        47.08

3. Topic detection
As the LDA algorithm "fits" the terms in the documents into a number of topics that must be specified a priori, this represents an important and sensitive decision that affects results and findings: few topics will produce broad subjects and mixed-up contents, while too many topics will produce minimal subjects and results too detailed to be readable and interpretable. To set the number of topics in a data-driven manner, we have the opportunity to calculate different metrics (Arun et al., 2010) and estimate the optimal number of topics (Griffiths et Steyvers, 2004) by means of the maximum log-likelihood of LDA for a number of topics ranging from 2 to 50 (Fig. 1).

2 A MW that did not appear at least 5 times in the corpus, that is about once every 20 years, was not considered important; 417 MWs appeared with a frequency equal to or greater than 10.
3 The analyses were implemented with the R packages tm, lda and topicmodels.

Fig. 1: Fitting the model: log-likelihood calculated for increasing number of topics

The best number of topics is the one with the highest value of log-likelihood, which is around 30 and can be established as the optimal number of topics. Figure 2 shows the general trend of all 30 topics as depicted by the fitted model. A clue to how these topics change over time is given by 30 panels, each with a topic trend line, showing which topics have positive or negative trends.
All of the topics are ordered by slope: decreasing topics appear in the first panels (top left), and increasing ones in the last panels (bottom right). Since the main aim of this study is to detect the temporal evolution of old, new and emerging topics in Sociology, we can restrict ourselves to a limited number of topics that show prototypical temporal patterns (Ponweiser, 2012; Griffiths et Steyvers, 2004).

Fig. 2: Temporal patterns of the 30 topics in Sociology sorted by slope of linear models

Consistent with the idea that topics show different trends and embrace theoretical, conceptual, and methodological shifts, the analysis of time-dependent phenomena identifies three specific temporal patterns of topics: topics whose trajectory has grown and is increasing over time (28, 4, 2, 27, 15, 11); topics whose trajectory decreased (7, 3, 21, 9, 13, 18); and topics whose peak-like journey (meteor) was high only in a specific interval of time (14, 17, 28, 15) or shows more irregular temporal trajectories.

4. What's old and new in Sociology?
To focus on the major increasing or decreasing topics from 1921 to 2016, we explored the contents of the five coldest and five hottest topics. Figure 3 provides the top terms for these topics. The group of coldest topics corresponds on the one hand to the methodological development of sociological perspectives, and on the other hand to some specific objects of research. These topics were very popular between roughly the 1920s and the 1950s. First of all, the debate on the "institutionalization" process of Sociology as a scientific discipline characterized the early period (topic 7). The main need was to create a strong scientific and knowledge base from the development of the ideas advanced by the "founding fathers", e.g. Durkheim. At the same time, the debate on the "measurement" of social phenomena arose. The issue of migration between cities and farms (topic 3) by economic and social groups gave rise to the net law of rural-urban social selection. Scientific social reflection about health and illness (topic 21) emerged, using empirical data to evaluate how social life affects morbidity and mortality rates (and vice versa), together with efforts for a better-educated public and improved health legislation. The development of psychological sociology (topic 9) and the general progress of psychological interpretations of social processes and institutions have decreased over time; research in this tradition has been criticized because it mainly exemplified the biological background of social interpretations, also supplied by the impulse from the Darwinian doctrine. Class culture, conflict and leisure (topic 13) were popular issues in the 1930s and 1950s: industrialization had raised many questions, from class conflict to the growth of after-work leisure hours, providing new insights for social thought. The group of hottest topics (Fig. 4) is related to articles that focus on a wide range of empirical case studies underlining the most significant changes that have occurred since the mid-1960s.

Fig. 3: Decreasing topics: the five coldest (significant neg., p level 0.005)
Fig. 4: Increasing topic attention: the five hottest (significant pos., p level 0.005)

In those years, the gender revolution (topic 11), ethnic discrimination (topic 2), mobilization, power and élites (topic 15), protests and social movements (topic 27), and the "measurement" of social phenomena in a post-positivist fashion, especially until the 70s (topic 4), offered sociologists the opportunity to deal with the social effervescence of a particular historical moment. These hot topics indicate the 'birth' and recent development of some sociological topics that clearly show how Sociology (as a discipline) and sociologists have reacted to new social problems. In conclusion, through the topic detection analysis of the abstracts of articles, different shifts that involved reflections on various issues have been identified. During the twentieth century, Sociology expanded its scope and influence, and motivated many research studies as well as a diversification of the field. Other studies have offered a remarkable theoretical contribution to the historical 'shape' of Sociology as a discipline (Kalekin-Fishman et Denis, 2012), even in a critical perspective (Turner, 1998), either emphasizing the content of the various domains of sociology (Scott et Desfor Edles, 2011; Blau, 2004), or working specifically within the intellectual ground of American Sociology since the mid-nineteenth century (Calhoun, 2007). Even if these show an interesting round of paradigmatic reflection in Sociology, there has been a lack of research studies that trace the history of Sociology through empirical data and evidence on fast-moving sociological topics over time. To the extent that the history of Sociology is a continuous approach to the Sociology of the present, a new way of reading the history of a discipline is to rely on topic detection in articles published in mainstream journals, which mirror the sociological scientific debate of a specific historical moment. We analysed these trends exploiting topics as they emerged from a text corpus and highlighted two distinct directions of topics, characterized by different theoretical and methodological implications, that coexist within the same period considered: the hot-increasing and the cold-decreasing topics. The results show how Sociology has become one of the main social sciences providing fresh thinking about a whole range of topics affecting the public sphere and, as a consequence, how the discipline developed, shifting priorities in universities and the social research agenda towards specialization and fostering the birth of a wide range of subdisciplines over time. This is just the tip of the iceberg: further analyses will shed light on many more aspects that need deeper reflection.

References
Arun R., Suresh V., Veni Madhavan C.E. and Narasimha Murthy M.N. (2010). On finding the natural number of topics with latent dirichlet allocation: Some observations. In Mohammed J. Zaki, Jeffrey Xu Yu, Balaraman Ravindran and Vikram Pudi (eds.), Advances in Knowledge Discovery and Data Mining. Springer, Berlin Heidelberg, pp. 391-402.
Blau J.R. (2004). The Blackwell Companion to Sociology. Malden, MA: Blackwell.
Blei D.M., Ng A. and Jordan M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993-1022.
Blei D.M. and Lafferty J.D. (2009). Topic Models. In A. Srivastava, M. Sahami (eds.), Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC Press.
Bolasco, S. (2013). L'analisi automatica dei testi. Fare ricerca con il text mining. Carocci, Rome.
Calhoun C. (2007). Sociology in America: A History. Chicago: University of Chicago Press.
Corbetta P. (2003). Social Research: Theory, Methods and Techniques. SAGE Publications Ltd., London.
Griffiths T. and Steyvers M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 101(Supplement 1): 5228-5235.
Grimmer J. and Stewart B.M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3): 267-297.
Kalekin-Fishman D. and Denis A. (2012). The Shape of Sociology for the 21st Century: Tradition and Renewal. London, SAGE.
Lebart, L., Salem, A. and Berry, L. (1998). Exploring Textual Data. Kluwer Academic Publishers: Dordrecht.
Pavone, P. (2010). Sintagmazione del testo: una scelta per disambiguare la terminologia e ridurre le variabili di un'analisi del contenuto di un corpus. In S. Bolasco, I. Chiari and L. Giuliano (eds.), Statistical Analysis of Textual Data: Proceedings of the 10th International Conference Journées d'Analyse statistique des Données Textuelles, 9-11 June 2010, Sapienza University of Rome, pp. 131-140. LED.
Ponweiser M. (2012). Latent Dirichlet Allocation in R. Vienna University of Business and Economics.
Scott A. and Desfor Edles A. (2011). Sociological Theory in the Contemporary Era: Text and Readings. Thousand Oaks, Pine Forge Press.
Strauss, A.L. and Corbin, J. (1990). Basics for Qualitative Research: Grounded Theory Procedures and Techniques. Newbury Park, Sage.
Trevisani, M. and Tuzzi, A. (2015). A portrait of JASA: the History of Statistics through analysis of keyword counts in an early scientific journal. Quality & Quantity, 49(3): 1287-1304.
Trevisani, M. and Tuzzi, A. (in press). Learning the evolution of disciplines from scientific literature. A functional clustering approach to normalized keyword count trajectories. Knowledge-Based Systems.
Turner S. (1998). Who's Afraid of the History of Sociology? Swiss Journal of Sociology, 24: 3-10.

Comparison of Neural Models for Gender Profiling

Nils Schaetti, Jacques Savoy
Université de Neuchâtel - Rue Emile-Argand 11 - CH-2000 Neuchâtel - Switzerland

Abstract
This paper describes and evaluates two neural models for gender profiling on the PAN@CLEF 2017 tweet collection. The first model is a character-based Convolutional Neural Network (CNN) and the second an Echo State Network-based (ESN) recurrent neural network with various features. We applied these models to the gender profiling task of the PAN17 challenge and demonstrate that they can be applied effectively to gender profiling. As features, we propose using pre-trained word vectors, part-of-speech (POS) tags and function words (FW) for the ESN model, and a character 2-gram matrix with punctuation marks, smilies, and beginning and ending 2-grams for the deep learning model. We finally compared these strategies to a baseline and found that an ESN model based on GloVe pre-trained word vectors achieves the highest success rate and outperforms both the baseline and the character-based CNN model.

Keywords: Author Profiling, Gender Profiling, Deep Learning, Convolutional Neural Network, Reservoir Computing, Echo State Network, Natural Language Processing

1. Introduction
In the age of big data, a large number of applications are based on an exponentially growing amount of various data, such as pictures, videos, articles, links and blogs shared directly from computers, websites, smartphones and sensors.
Social networks and blogs are the new platforms of communication based on fast interactions, generating a large variety of content with its own characteristics. These contents are difficult to compare with traditional texts, such as novels and articles. This issue raises new questions: Can we determine whether the author of a textual content is a man or a woman? Can we identify the author's place of origin, his age group or (part of) his psychological profile? Answering these questions can help solve current issues of the social network era, such as fake news, plagiarism and identity theft. Author profiling is, therefore, a particularly pertinent subject of interest. In addition, author profiling is central to applications involving marketing, security and forensics. For example, forensic linguistics and police investigation forces would like to know specific defining characteristics, such as the gender, the age group and the socio-cultural background of the author of harassing messages. Applied to marketing, companies and resellers could make use of these profile characteristics when targeting their consumers' preferences, based on the analysis of individual consumers' social network posts and online product consulting. In order to extract this information, classic statistical methods are employed, as they have proven to be effective for text classification. Deep learning has gained increasing popularity over the last decade, becoming a "breakthrough" technology in image recognition and computer vision. Yet, it faces difficulties in natural language processing (NLP) tasks. But recurrent neural networks (RNN), as well as long short-term memory (LSTM) networks, have obtained better results in such tasks. In this view, we therefore decided to test such an approach on the gender profiling task with two neural models, one based on Convolutional Neural Networks (CNN) and 2-grams of characters, and the other on the Reservoir Computing paradigm. Finally, we compare them to a baseline composed of both a random and a naive Bayes classifier. This paper is organized as follows. Section 2 introduces the data set used to train and test both of the models and the methodology used for evaluation. Section 3 describes and evaluates our deep-learning model. Section 4 introduces the proposed echo state network-based reservoir computing model. Section 5 compares the results with the baseline. In the last section, we draw conclusions on our findings and possible future improvements.

2. Methodology
To compare our two models on the gender profiling task, we needed a common ground composed of the same dataset and evaluation measures. To create this common ground, the PAN CLEF evaluation campaign was launched [1], allowing multiple research groups to propose and compare profiling algorithms with the same methodology. For the PAN CLEF 2017 evaluation campaign, four test collections of tweets written in several languages, including English, were generated. Based on these collections, the challenge was to classify Twitter profiles per language variety (e.g., UK vs. US English) and gender. We were then able to use this common ground for our two models and compare their capacities on the gender profiling task. The dataset was collected on Twitter and is composed of tweets from different authors, with 100 per author. For each author, a label indicates the correct gender (male, female).
The collection included 3,600 authors residing in the United States, Great Britain, Ireland, New Zealand, Australia and Canada, 600 per country and 1,800 for each gender group, for a total of 360,000 tweets. The table below summarizes the dataset properties.

  Authors    3,600
  Tweets     360,000
  Genders    male: 1,800; female: 1,800

The overall performance of a model is based on the accuracy on the gender profiling task. The accuracy is the number of authors whose gender is correctly classified divided by the total number of authors. Based on the data depicted in the table above, a random baseline will produce an accuracy rate of 0.5 (or 50%).

3. Character N-gram Matrix-based Convolutional Neural Networks
A Convolutional Neural Network (CNN) is a variety of feed-forward artificial neural network inspired by the visual cortex [2]. In our first model, we applied a CNN to a character bigram representation matrix for an author in a collection. The first figure shows the structure of the representation matrix. For each letter, one can find one row. In the first position, the relative frequency of this letter is provided. Then, from left to right, the matrix is composed of the relative frequencies of each character bigram (e.g., at row "t" and column "h", the relative frequency of the bigram "th" is given). The third part is optional and composed of the relative frequencies of ending character bigrams; finally, the last part is the same optional matrix representing the starting character bigrams of each word. This matrix representing an author is the input for the CNN. The first two layers are composed of 20 and 10 kernels respectively, with a size of 5 × 5. These layers are followed by a drop-out layer. The last two are linear layers based on ReLU. The outputs are finally obtained by a softmax function and give the author's predicted class. The predicted class is therefore the class with the highest corresponding output from this function. The training set is composed of 90% of the dataset and the remaining 10% is used to estimate the performance. This procedure is repeated 10 times with non-overlapping test sets to obtain the 10-fold cross-validation estimator. The following table reports the resulting accuracy per alphabet (∗ and † mark statistically significant improvements; cf. Section 5).

  Matrix / Alphabet              English    + Punctuation    + Punctuation & Smilies
  Bigrams                        75.26%     76.16%           76.51%
  + starting bigrams             76.02%∗    77.63%∗†         77.50%∗
  + ending bigrams               75.94%     77.22%†          77.25%
  + starting & ending bigrams    76.12%     77.83%†          78.33%∗†

4. Echo State Network-based Reservoir Computing models
4.1. Echo State Networks
The Echo State Network was introduced in [3] and corresponds to the first equation below. In this model, the highly non-linear dimensional vector x_t, denoting the activation vector at time t, is defined by

  x_{t+1} = (1 − a) · x_t + a · f(Win · u_{t+1} + W · x_t + Wbias)

where x_t ∈ R^{Nx}, with Nx the number of neurons in the reservoir. The scalar a represents the leak rate, allowing the network's dynamics to be adapted to the task to be learned. The input signal is represented by the vector u_t of dimension Nu, multiplied by the input weight matrix Win ∈ R^{Nx×Nu}. In addition, the matrix W ∈ R^{Nx×Nx} stores the internal weights. Finally, Wbias is the bias, and usually the initial vector is fixed to x_0 = 0, corresponding to a null state. The network's output ŷ is defined by

  ŷ_t = g(Wout · x_t)

and the learning phase consists of finding the values of the matrix Wout ∈ R^{Ny×Nx}, e.g., by applying the ridge regression method. This matrix is defined by

  Wout = Y · X^T (X · X^T + λ · I)^{−1}

where Y ∈ R^{Ny×T} is the matrix containing each target output ŷ_t for t = 1, 2, ..., T, where T denotes the training size and Ny the number of outputs (categories). Similarly, the matrix X ∈ R^{Nx×T} stores the reservoir states x_t obtained during the training phase. Finally, the parameter λ is a regularization factor.

4.2. From texts to temporal signals
In order to apply an ESN to text classification, we must first transform the input texts into a temporal signal. In this study, we have evaluated three signal converter methods. Each word sequence in a text (e.g. "to the citizens of") can be viewed as a sequence of word vectors (WV) (e.g., vec(to), vec(the), vec(citizens), vec(of), each vector extracted from word embeddings pre-trained with GloVe), of Part-Of-Speech (POS) vectors (size: number of POS tags), or of function word (FW) vectors (size: number of FW). As output, the ESN generates the vector y_{t,g} with g ∈ {male, female}, denoting the probability that the tokens in the ESN's memory at time t have been written by a man or a woman. We then end up with an output temporal signal of gender probabilities (over t = 1, 2, ..., T), and the final predicted class of a document is the one with the highest average across time.

4.3. State-Gram
In addition, the output layer can take account of more than one state to estimate the class probabilities. A state-gram value of 2 means that the training is performed not only on a single x_t, but on x_{t−1} ∪ x_t. Such a model was effective for handwritten digit recognition [4].

5. Results
In the table of Section 3, one can see the results of the deep-learning CNN model with different vocabularies and with starting and ending bigrams. The statistical tests indicate that the starting bigrams can significantly improve the performance with respect to the base model (first row). The combination of starting and ending bigrams (last row) shows a significant improvement only for the vocabulary composed of punctuation marks and smilies. The best result (78.33%) is achieved by a CNN model with punctuation and smilies, with starting and ending character bigrams. The left plot in the second figure shows the three features (WV, POS, FW) with a leak rate between 0.01 and 1.0. Using the same three feature sets, the right-side plot indicates the accuracy rate obtained by the state-gram model with values between 1 and 5. With a solid line, the best leak-rate parameter value is used, and with the dotted curves, a leak-rate value of 1 was used. Overall, Figure 2 indicates that the pre-trained word vectors (WV) are the best feature set, with a maximum value of 80.81% at a leak rate of 0.01. As the best accuracy rates are obtained with a leak rate between 0.01 and 0.05 (left plot in Figure 2), we can conclude that the author profiling task has very slow temporal dynamics. The right-side plot shows that no significant improvement is achieved by increasing the value of the state-gram parameter for the best leak-rate parameter value. Moreover, a high value of Ns decreases the performance for the POS feature. The performance slightly increases for a leak-rate parameter value of 1, but these results show that the leak-rate parameter is a better lever to increase the accuracy rates. The following table compares the accuracy rates achieved by a random classifier and the naive Bayes model together with the CNN and ESN models (with Nx = 1,000).

  Classifier                                      10-CV success rate
  Random baseline                                 50.0 %
  Naive Bayes classifier baseline                 75.5 %
  CNN 2-grams + starting-grams + ending-grams     78.3 %
  ESN on GloVe with Nx = 1000                     80.6 %
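As an illustration of Section 4, here is a minimal echo state network with a ridge-regression readout, following the update rule and the Wout formula given above. This is a sketch under our own assumptions (uniform random weights, spectral radius rescaled to 0.9, tanh activation, identity output function g, one-hot targets), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class ESN:
    """Leaky echo state network with a ridge-regression readout."""

    def __init__(self, nu, nx, ny, a=0.01, lam=1e-4):
        self.a, self.lam = a, lam
        self.win = rng.uniform(-0.5, 0.5, (nx, nu))       # input weights
        w = rng.uniform(-0.5, 0.5, (nx, nx))              # internal weights
        self.w = w * (0.9 / np.abs(np.linalg.eigvals(w)).max())  # spectral radius 0.9
        self.wbias = rng.uniform(-0.5, 0.5, nx)
        self.wout = np.zeros((ny, nx))

    def states(self, u):
        """u: sequence of input vectors (T x Nu); returns states (T x Nx)."""
        x, xs = np.zeros(len(self.wbias)), []
        for ut in u:
            # x(t+1) = (1 - a) x(t) + a tanh(Win u(t+1) + W x(t) + Wbias)
            x = (1 - self.a) * x + self.a * np.tanh(
                self.win @ ut + self.w @ x + self.wbias)
            xs.append(x)
        return np.array(xs)

    def fit(self, u, y):
        """y: one-hot targets (T x Ny); Wout = Y X^T (X X^T + lambda I)^-1."""
        X, Y = self.states(u).T, np.asarray(y).T
        self.wout = Y @ X.T @ np.linalg.inv(
            X @ X.T + self.lam * np.eye(X.shape[0]))

    def predict(self, u):
        # average the per-token class scores over time, as in Section 4.2
        return (self.states(u) @ self.wout.T).mean(axis=0)
```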
6. Conclusion
This paper presents a comparison of two neural models: a character-based CNN model and an echo state network (ESN) model with POS tags, function words (FW) or pre-trained word vectors (WV) as possible feature sets. Based on the CLEF-PAN 2017 dataset, the best CNN model achieves a success rate of 78.3% with a feature set composed of the vocabulary, the punctuation marks and smilies. The best ESN model obtains a success rate of 80.6% with 1,000 neurons and a leak rate of 0.01. Based on our experimental setting, this model achieves the best performance. In comparison, the naive Bayes classifier obtains a success rate of 75.5%, and the average and best performances for the gender profiling task at PAN 2017 were respectively 75.88% and 82.5%. Our results indicate that the two models can significantly improve the accuracy rate on the gender profiling task. Moreover, they demonstrate that a simple model such as the echo state network, thanks to its simple linear regression algorithm, can achieve a higher success rate than a more complex model such as a character-based CNN. This higher result level can be explained by the recurrent architecture of the ESN model, allowing it to take word order into account. In the future, we want to explore more features for the ESN, as well as word vectors pre-trained for Twitter applications, to hopefully achieve better performance. We will also apply classical and deep ESN architectures to other natural language processing tasks, such as authorship identification and author diarization.

References
Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. Working Notes Papers of the CLEF, 2016.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Herbert Jaeger. The "echo state" approach to analysing and training recurrent neural networks - with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34):13, 2001.
Nils Schaetti, Michel Salomon, and Raphaël Couturier. Echo state networks-based reservoir computing for MNIST handwritten digits recognition. In Computational Science and Engineering (CSE) and IEEE Intl Conference on Embedded and Ubiquitous Computing (EUC), 2016 IEEE Intl Conference on, pages 484-491. IEEE, 2016.

Segments répétés appliqués à l'extraction de connaissances trilingues

Lionel SHEN
Université Sorbonne Nouvelle - Paris 3 – lionel.shen@sorbonne-nouvelle.fr

Abstract
In a context of globalized societies, multilingualism is becoming an economic and social phenomenon. Translation constitutes a crucial element for communication. A good translation guarantees the quality of the transmission of all information. However, faced with the challenge of multilingual information monitoring, can we simply use translation? With the advent of the digital age and the integration of all new technologies, corporate governance is undergoing a complete metamorphosis. One of the priorities remains the efficient exploitation of accumulated big data. This paper aims to highlight the specificity and efficiency of the Repeated Segments tool through knowledge discovery in trilingual thematic corpora (French, English and Chinese).
Résumé
Dans un contexte de sociétés mondialisées, on peut parler de multilinguisme ou encore de plurilinguisme. Aujourd'hui, la frénésie autour du phénomène des mégadonnées et le multilinguisme sont en train de métamorphoser tous les services et les comportements de notre époque. La traduction devient alors un élément capital pour la communication entre les peuples. Une bonne traduction garantit la qualité de la transmission de toutes les informations. Cependant, devant la gageure que constitue le projet de réaliser une veille multilingue, peut-on utiliser simplement la traduction ? Cet article s'articule autour d'explorations de corpus thématiques trilingues appliquées à l'extraction de connaissances et tente de mettre en lumière la spécificité et l'efficacité des cooccurrences en trois langues : français, anglais et chinois.

Keywords: segments répétés, textométrie, veille multilingue, multilinguisme, fouille d'informations, text-mining, cooccurrences, poly-cooccurrences

1. Introduction
Le monde, qui utilise des centaines de langages depuis des millénaires, a formalisé les mots et les grammaires pour transcrire, enseigner et transmettre sur des supports les savoirs, les faits et les pensées. Des hiéroglyphes aux idéogrammes, en passant par les alphabets, ces représentations diffusent ainsi l'image du monde à travers les époques, les évolutions, les mœurs et les courants de pensée. Cela représente aujourd'hui des centaines de milliards de mots dans des corpus différents, avec des occurrences variables. Il n'est pas possible à un être humain d'aborder par lui-même la masse des publications archivées ou en circulation. Seul l'usage de l'informatique peut, à présent, dans le cadre de la mondialisation, permettre un balayage massif des séquences des corpus, nécessaire à l'étude des occurrences et des usages des mots, au moins dans les langues essentielles diffusant le savoir, l'information et la communication entre les humains. L'utilité de ces recherches est étendue, allant des besoins sociaux, humains et scientifiques aux guerres économiques, en passant par les médias et les enjeux stratégiques des politiques. C'est la capacité à détecter, enregistrer, analyser et comprendre dans les meilleurs délais qui va permettre aux différentes forces de pouvoir d'anticiper les décisions et d'agir efficacement. Cette force de veille, implantée de manière continue et basée sur des outils performants, élaborés et mis en œuvre par des chercheurs, des informaticiens, des stratèges et des économistes, sous l'autorité des décideurs, va donc construire les forces de demain, parfois à l'échelle de la planète. Dans un contexte de sociétés mondialisées, on peut parler de multilinguisme ou encore de plurilinguisme. Aujourd'hui, la frénésie autour du phénomène Big data et le multilinguisme sont en train de métamorphoser tous les services et les comportements de notre époque. La traduction devient alors un élément capital pour la communication entre les peuples. Une bonne traduction garantit la qualité de la transmission de toutes les informations. Cependant, devant la gageure que constitue le projet de réaliser une veille multilingue, peut-on utiliser simplement la traduction ? Cet article s'articule autour d'explorations de corpus thématiques trilingues appliquées à l'extraction de connaissances et tente de mettre en lumière la spécificité et l'efficacité de l'outil Segments répétés en trois langues.
2. Corpus
Pour constituer ce travail, deux types de corpus sont mobilisés : un corpus comparable (nommé ENRG) et un corpus parallèle (nommé CLRG), composés de données textuelles extraites des discours de presse ainsi que de ceux des ONG. La construction de ces deux corpus s'effectue autour de trois thèmes d'actualité : l'environnement, l'énergie et le changement climatique. Elle s'opère à partir d'articles de journaux issus de nos trois sphères de communication, à savoir Le Monde pour la France (4 817 articles), le NYT pour les États-Unis (3 993 articles) et 1 200 médias pour la Chine (14 509 articles), comme le présentent les deux figures (figure 1 et figure 2) ci-dessous. Les données textuelles extraites du corpus comparable proviennent des discours de la presse, tandis que celles du corpus parallèle sont issues de ceux des ONG.

Figure 1 : volumétrie du corpus comparable ENRG
Figure 2 : volumétrie du corpus parallèle CLRG

Quant à l'aspect temporel des données du corpus comparable, il diffère selon les sources et couvre des périodes plus ou moins étendues : de 1999 à 2012 pour Le Monde, de 2005 à 2012 pour le NYT, de 2008 à 2013 pour les médias chinois. Concernant le corpus parallèle, les articles datent de 2006 à 2014. La figure 3 ci-dessous montre les différentes périodes couvertes par les médias retenus.

Figure 3 : périodes couvertes par les corpus ENRG et CLRG

Les dépouillements sont réalisés à l'aide des outils de la textométrie, notamment grâce aux analyses factorielles des correspondances (AFC), aux spécificités du modèle hypergéométrique, aux segments répétés, aux réseaux cooccurrentiels et poly-cooccurrentiels ou encore à la carte des sections. Les caractéristiques locales et globales, les convergences, les divergences et les particularités de ces différents corpus ont été mises successivement en évidence. Après avoir présenté rapidement les deux corpus utilisés, nous allons nous polariser sur l'outil Segments répétés, appliqué d'abord au corpus parallèle puis au corpus comparable. Nous nous intéresserons plus particulièrement, dans cet article, à la spécificité des segments répétés appliqués à l'extraction de connaissances multilingues. Comme le souligne André Salem, « l'outil prend toute sa valeur lorsque l'unité linguistique traitée n'est pas le mot, mais le segment répété (suite de mots d'une longueur 2, 3, 4, 5) » (Salem, 1987). Nous rappelons qu'« un segment répété est une suite de formes dont la fréquence est supérieure ou égale à 2 dans le corpus ». Nous émettons l'hypothèse suivante : l'outil Segments répétés serait plus performant en chinois qu'en anglais et qu'en français.

Corpus parallèle : segments répétés anglais-chinois
Nous examinons maintenant les segments les plus répétés obtenus à partir des deux volets (anglais-chinois) du corpus parallèle CLRG.

Tableau 1 : segments les plus répétés du corpus parallèle CLRG

Le tableau 1 ci-dessus illustre les 14 segments les plus répétés de CLRG. Nous constatons que la fréquence des segments répétés du volet anglais est beaucoup plus élevée que celle du chinois. Par exemple, la fréquence du segment climate change est de 2 468 dans le volet anglais, tandis que dans le volet chinois, la fréquence est de 830. La signification des segments répétés du volet anglais révèle peu d'informations intéressantes : les mots-outils ou les mots syntaxiques sont les plus répétés, et un seul thème relatif à notre recherche est présent, climate change.
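L'inventaire des segments répétés rappelé ci-dessus (suites de 2 à 5 formes dont la fréquence est supérieure ou égale à 2) peut s'esquisser comme suit. Il s'agit d'une esquisse minimale de notre part, et non de l'outil textométrique utilisé dans cette étude : nous supposons un corpus déjà tokenisé et ignorons, par simplification, les frontières de phrases que les inventaires classiques ne franchissent pas.

```python
from collections import Counter

def segments_repetes(tokens, lmin=2, lmax=5, seuil=2):
    """Inventaire des segments répétés : suites de lmin à lmax formes
    dont la fréquence est supérieure ou égale à `seuil` dans le corpus."""
    comptes = Counter()
    for l in range(lmin, lmax + 1):
        for i in range(len(tokens) - l + 1):
            comptes[tuple(tokens[i:i + l])] += 1
    # on ne retient que les segments effectivement répétés
    return {seg: f for seg, f in comptes.items() if f >= seuil}
```

Appliqué à chaque volet du corpus et trié par fréquence décroissante, cet inventaire produit des listes comparables à celles des tableaux 1 et 2.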
Parallel corpus: English-Chinese repeated segments

We now examine the most repeated segments obtained from the two components (English and Chinese) of the parallel corpus CLRG.

Table 1: most repeated segments of the parallel corpus CLRG

Table 1 above lists the 14 most repeated segments of CLRG. The frequencies of repeated segments are much higher in the English component than in the Chinese one. For example, the frequency of the segment climate change is 2,468 in the English component, against 830 in the Chinese component. The repeated segments of the English component yield little interesting information: function words and syntactic words are the most repeated, and only one theme relevant to our research is present, climate change. The repeated segments in Chinese, by contrast, reveal the true themes of our research: greenhouse gas, climate change, renewable energy, new. Two types of repetition thus appear: grammatical words for English, and content words for Chinese. Recall that the heavy repetition of grammatical words is what drives the large number of occurrences in English: the more intensive the use of grammatical words, the higher the occurrence count. This asymmetry of repeated segments between the two components is entirely normal, since the syntactic structures of the two languages are completely different. The fact that the texts are translations of one another in no way implies a symmetric use of segments repeated in the same way in both languages. A pre-processing step removing function words from the English would, however, give more meaning to the study of repeated segments (Shen, 2016).

André Salem's remarks support our hypothesis, as do those of Damon Mayaffre. "The analysis of recurrent neighbourhoods makes it possible to use repeated segments to document statistical analyses based on simple forms. One also finds a typological analysis carried out on the basis of repeated segments." (Salem, 1986). "Even less than the frequency of a word, the recurrence of segments cannot naively be attributed to chance: either it points to a syntactic constraint, or it indicates a semantic determination or choice. Put briefly, the word is a graphical unit, most often ambiguous, without explicit sense, not even endowed with signification; the segment, by contrast, becomes a linguistic unit that carries meaning" (Mayaffre, 2007).

These results of the bilingual (English-Chinese) study of parallel repeated segments, and their analysis, show that for the same information stated and described in two languages, event-related and thematic repetition is more salient in Chinese, owing to its limited use of anaphora (Shen, 2016). Its content is also more diversified, since we recover our main research themes.

We now turn to the study of repeated segments in the comparable corpus ENRG, composed of three sub-corpora: the French sub-corpus ENRG-FR, the American sub-corpus ENRG-US and the Chinese sub-corpus ENRG-CN.

Comparable corpus: trilingual repeated segments (French, English/US, Chinese)

Table 2: most repeated segments of the comparable corpus ENRG

Table 2 above presents the 16 most repeated segments of ENRG. As with the parallel corpus, our first observation is a particularly salient thematic repetition in the Chinese sub-corpus (ENRG-CN). For example, the segment reduce emissions, with a frequency of 12,554, ranks as the most repeated segment in ENRG-CN, while such forms are absent from the top of the table for the other two sub-corpora. These forms do exist there, but rank far lower among the repeated-segment results. The other sub-themes represented by repeated sequences, such as low carbon content, wind energy, photovoltaic, etc., directly linked to energy and climate change, are likewise brought out in Table 2. For the French and American sub-corpora, only grammatical words or function words appear among the most repeated segments.
This phenomenon is due essentially to the mechanism of anaphora, or to the deictic mechanism, which is not the same in French and in American English (Shen, 2016). We note, however, that in Chinese it is key terms that are repeated, whereas in English and French it is often named entities (proper names, toponyms, etc.).

3. Conclusion

In the process of trilingual knowledge extraction, we can conclude that repeated segments bring the most salient characteristics to light far more effectively in Chinese than in the two Western languages. Two types of repetition appear: grammatical words for French and English, and content words for Chinese. We further emphasize that cooccurrences and poly-cooccurrences also make it possible to extract knowledge from the corpus through the co-presence of distant forms. According to Mayaffre, "the study of repeated segments offers an alternative to lemmatization. It makes it possible to disambiguate terms formally and, above all, endogenously, within the corpus, rather than by (arbitrary) reference to the dictionary or to the language" (Mayaffre, 2007). Rightly so: given the strong presence of function words, cooccurrences and poly-cooccurrences, as compared with repeated segments, make it possible to recover non-contiguous repeated sequences across sentences or paragraphs (a sketch of such a window-based pair count is given below). From the repeated-segment results on the two corpora, we can state that the repeated-segments tool has the advantage of rapidly extracting key information in Chinese, whereas in French and in English the mechanism of cooccurrences and poly-cooccurrences brings out information that traditional means (concordances, for example) cannot detect. The repeated-segments tool is therefore a fundamental asset for multilingual information mining.
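The cooccurrence counts alluded to in this conclusion can be illustrated with a span-based pair count. A minimal sketch, assuming sentence-level windows and pre-tokenized text; both assumptions are ours, not a description of the tools actually used:

```python
from collections import Counter
from itertools import combinations

def cooccurrences(sentences, min_freq=2):
    """Count unordered pairs of distinct forms co-present in the same sentence."""
    pairs = Counter()
    for sent in sentences:
        for pair in combinations(sorted(set(sent)), 2):
            pairs[pair] += 1
    return {p: f for p, f in pairs.items() if f >= min_freq}

sents = [["réduire", "les", "émissions", "de", "carbone"],
         ["il", "faut", "réduire", "nos", "émissions"]]
print(cooccurrences(sents))  # the non-contiguous pair ('réduire', 'émissions') scores 2
```

Unlike a repeated segment, the pair need not be contiguous, which is what makes this view robust to the heavy interleaving of function words in French and English.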
Bibliography

Bonnafous S. and Tournier M. (1995). Analyse de discours, lexicométrie, communication et politique. Langages, 29e année, n°117, Paris, Larousse, pp. 67-81.
Habert B., Nazarenko A., Salem A. (1997). Les linguistiques de corpus. Paris, Armand Colin/Masson, 254 p.
Habert B. and Zweigenbaum P. (2002). Problèmes épistémologiques : Régler les règles. TAL, Paris, Association pour le traitement automatique des langues, vol. 43, n°3, pp. 83-105.
Lafon P. (1981). Analyse lexicométrique et recherche des cooccurrences. Mots, n°3, octobre 1981, pp. 95-148.
Lafon P. and Salem A. (1983). L'inventaire des segments répétés d'un texte. Mots, n°6, mars 1983, pp. 161-177.
Lamalle C. and Salem A. (2002). Types généralisés et topographie textuelle dans l'analyse quantitative des corpus textuels. In A. Morin and P. Sébillot (eds), JADT 2002. Saint-Malo: IRISA-INRIA, vol. 1, pp. 403-411.
Lebart L. and Salem A. (1995). Statistique textuelle. Paris: Dunod.
Longrée D., Luong X., Mellet S. (2004). Temps verbaux, axe syntagmatique, topologie textuelle : analyses d'un corpus lemmatisé. In G. Purnelle, C. Fairon, A. Dister (eds), JADT04. Louvain: Presses universitaires de Louvain, vol. 2, pp. 743-752.
Longrée D., Luong X., Mellet S. (2006). Distance intertextuelle et classement des textes d'après leur structure : méthodes de découpage et analyses arborées. In J.-M. Viprey (ed.), JADT'06. Besançon: Presses universitaires de Franche-Comté, vol. 2, pp. 643-654.
Mayaffre D. (2004). Paroles de président. Jacques Chirac (1995-2003) et le discours présidentiel sous la Vème République. Paris: Champion.
Mayaffre D. (2007). L'analyse de données textuelles aujourd'hui : du corpus comme une urne au corpus comme un plan. Retour sur les travaux actuels de topographie/topologie textuelle. Lexicometrica, pp. 1-12.
Rastier F. (2001). Arts et sciences du texte. Paris: Puf.
Salem A. (1986). Segments répétés et analyse statistique des données textuelles. Histoire & Mesure, vol. 1, n°2, pp. 5-28.
Salem A. (1987). Pratique des segments répétés. Essai de statistique textuelle. Paris: Klincksieck.
Shen L. (2016). Méthodes de veille textométrique multilingue appliquées à des corpus de l'environnement et de l'énergie : « Restitution, prévision et anticipation d'événements par poly-résonances croisées ». Thèse : Sciences du langage, Université Sorbonne Nouvelle – Paris 3, octobre 2016, 474 p.
Viprey J. (2005a). Philologie numérique et herméneutique intégrative. In J.-M. Adam and U. Heidmann (eds), Sciences du texte et analyse de discours. Genève: Slatkine, pp. 51-68.
Viprey J. (2005b). Corpus et sémantique discursive : éléments de méthode pour la lecture du corpus. In A. Condamines (ed.), Sémantique et corpus. Paris: Lavoisier, pp. 245-276.
Viprey J. (2006). Structure non-séquentielle des textes. Langages, 163, pp. 71-85.

Measuring, Monitoring and Governing Cities with Big Data

Sandro Stancampiano
Istat – stancamp@istat.it

Abstract

Several new data sources are being investigated for the production process of official statistics. This paper describes the results of an analysis of online reviews of four points of interest in Rome, Italy. The reviews, collected from the web using web scraping and data wrangling techniques, were written by tourists and visitors during 2017. The general aim of this research is to extract useful information to help civil servants and citizens in decision-making processes. As part of this study, 9,227 documents (each document being a review) were automatically collected, stored in a database and used to build the corpus. The paper classifies the reviews and qualifies the sentiment of the texts using text mining tools and techniques.

Keywords: big data, Internet as data source, text mining, cluster analysis, web scraping.
1. Introduction

This project investigates how Big Data can be used to produce official statistics in support of public administration. Istat has included this theme, shared at the European level, in its three-year plan for thematic and methodological research (https://www.istat.it/it/files/2011/07/Piano-strategico-2017-2019.pdf, pp. 27-28). Istat is considering the possibility of using Big Data in its data production process, so as to soften the trade-off between timeliness and accuracy (Alleva, 2016).

2. Background of the research

This work focuses on the management of cultural heritage, investigating non-conventional data sources by means of multivariate exploratory techniques (Bolasco, 2014). The aim is to show the enormous potential of data available on the web for producing statistics that optimize decision-making processes. The results can assist administrators in managing the services dedicated to visitors of the cultural assets present in the territory. The experiment, a pilot project that can be replicated and extended on a large scale, uses textual analysis (text mining) to extract information from data downloaded from the web by web scraping. Cluster analysis is used to uncover regularities in the texts under examination; this technique, applied through the IRaMuTeQ software, makes it possible to define the distance between the objects to be classified (Ceron et al., 2014).

3. Objective and research hypotheses

Among the many websites on which users produce content, Tripadvisor was chosen. Registered users write reviews of the places they have visited, sharing their experiences (Iezzi and Mastrangelo, 2012). Four of the most famous attractions of Rome, visited daily by large numbers of tourists, were selected: the Colosseum, the Pantheon, the Trevi Fountain and Piazza Navona. The Colosseum, with over six million visitors, once again drove the 2016 growth in revenues of Italian museums (http://www.beniculturali.it/mibac/export/MiBAC/sitoMiBAC/Contenuti/MibacUnif/Comunicati/visualizza_asset.html_892096923.html) and the primacy of the Lazio region in that ranking. Many visitors leave ratings of the places, adding remarks on the state of conservation of the assets and on the services and shortcomings they noticed. We hold that analysing these comments makes it possible to derive valuable information. The analysis yielded a hierarchical classification of the reviews based on the terms used more than average with respect to the monument variable.

4. Corpus and method

The comments were collected into a database with the Diogene application (a piece of software developed in Java for data wrangling tasks), designed according to the OOA/D paradigm and built with an agile methodology (Larman, 2005). The same software was used to create the corpus of reviews. The 9,227 reviews collected, published between 1 January and 31 December 2017, break down as follows: Colosseum 3,483 (37.8%), Piazza Navona 1,020 (11%), Trevi Fountain 2,829 (30.6%) and Pantheon 1,895 (20.5%). We proceeded first with a lexical analysis, deriving information useful for the subsequent textual analysis aimed at locating text units relevant to the aims of the present study (Bolasco, 2013). A sketch of the collect-and-store step appears below.
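Diogene itself is not publicly documented here, so the following Python sketch only illustrates the generic collect-and-store step described above; the URL, the CSS selector and the table schema are placeholders of ours, not Tripadvisor's real markup or the paper's actual schema:

```python
import sqlite3
import requests
from bs4 import BeautifulSoup

def collect(url, selector, monument, db="reviews.db"):
    """Download one page, extract review texts and store them in SQLite."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS review (monument TEXT, body TEXT)")
    html = requests.get(url, timeout=30).text
    for node in BeautifulSoup(html, "html.parser").select(selector):
        con.execute("INSERT INTO review VALUES (?, ?)",
                    (monument, node.get_text(strip=True)))
    con.commit()
    con.close()

# Hypothetical call: both the URL and the selector are illustrative placeholders.
collect("https://example.org/reviews", "div.review-text", "Colosseo")
```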
The analysis identified groups of words that are internally homogeneous and mutually heterogeneous with respect to the "concepts" expressed in the reviews. The corpus analysed comprises 9,227 texts, 1,788,819 occurrences, 11,891 forms and 366 hapaxes (3.08% of the forms, 0.02% of the occurrences), with an average of 193.87 occurrences per text. The lexical richness of the corpus is very low (V/N*100 = 0.66%, computed with the formula RL = V/N, where V is the size of the vocabulary and N the total number of words in the text): despite the large size of the text, the vocabulary is limited. Looking at the 30 active forms with the highest absolute frequency, we note that the language used favours nouns and adjectives over verbs. The adjectives express positivity (bello, bellissima, grande) and the nouns relate to the enjoyment of the assets under study (monumento, piazza, visita, luogo, consiglio, interno), as do the verbs (visitare, fare, vedere, dire, entrare, trovare). The richness figures are easy to reproduce; see the sketch below.
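A sketch of the computation; the function below derives all the quantities from token counts, and the toy check simply plugs in the corpus figures reported above:

```python
from collections import Counter

def lexical_stats(tokens):
    counts = Counter(tokens)
    N = sum(counts.values())                      # total occurrences (tokens)
    V = len(counts)                               # distinct forms (types)
    hapax = sum(1 for f in counts.values() if f == 1)
    return {"N": N, "V": V, "hapax": hapax, "richness_pct": 100 * V / N}

# Reproducing the reported lexical richness from the figures above:
print(round(100 * 11891 / 1788819, 2))  # -> 0.66
```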
5. The reviewers and the reviews

The data on the day of the week on which each review was written show a tendency for users to set down the details of their experiences in the middle of the week, with a preference for Wednesdays (see Figure 5.1). People devote their weekends to visiting cultural sites and prefer to describe what they saw and experienced on Tuesdays, Wednesdays and Thursdays. Over the period under study, reviews of the four sites averaged 741 per month, with a minimum of 572 in June and a maximum of 1,129 in January. Figure 5.2 shows that the first months of the year, from January to April, concentrate the largest number of reviews (over 42% of the total).

Figure 5.1: Number of reviews per day of the week (January – December 2017)
Figure 5.2: Number of reviews per month (January – December 2017)

6. Cluster Analysis

Cluster analysis allows us to group the statistical units by maximizing the cohesion and homogeneity of the words included in each group while minimizing the logical link between words assigned to different groups/classes.

Figure 6.1: Dendrogram of the classes by similarity

The dendrogram (Figure 6.1) shows the division of the corpus into 4 classes. The words contained in each class identify the types of topic dealt with in the corpus, applying the Alceste methodology proposed by Max Reinert and implemented in the IRaMuTeQ software (IRaMuTeQ performs multidimensional analyses of texts and provides a graphical interface to R, a data-processing environment particularly efficient for analysing large datasets). Figure 6.2 shows the words belonging to the four groups and how they are arranged on the factorial plane; this visualization clarifies the meaning of the classes identified.

The group of words in red (65.4%), concentrated around the origin, is composed of the most used terms, which cut across the whole corpus and hence all four sites: theme words such as roma, simbolo, monumento, città, storia, the verbs visitare, vedere, tornare, dire, and nouns and adjectives such as bello, emozione, luce, bellezza, which express positivity and actions linked to the visit. Class 2, in green (10.9%), represents comments published by people who pay attention to what happens on site and give priority to safety, legality and the quality of the services they find. Distinctive words include venditore, abusivo, presenza, peccato, fastidioso, ordine, municipale, polizia, fischietto. There are also many references to commercial activities (bar, bancarella, locale, ristorante, gelateria, trattoria), with particular regard to what one can eat (aperitivo, pizza, granita, gelato, vino) and how (tavolino, tavolo, panchina). This group of words reflects considerations that are not strictly tied to the cultural visit but rather to everything surrounding a tourist excursion. Class 3, in light blue (12.7%), covers themes linked to economic and practical aspects that can in some cases cause discomfort during the visit. Words such as acquistare, prenotare, saltare, fila, coda, interminabile, biglietto, pagare, guida, audioguida, gratis, costo, euro, ticket emerge. The underlying topics concern ticket prices, waiting to get in and the modalities of the visit, with positive or negative connotations depending on the particular situation described by the user. Class 4, in purple (11%), represents those who describe and recount the experience from a cultural point of view, citing historical events, places and figures. The most used words are tomba, re, raffaello, sanzio, chiesa, colonna, fiume, barocco, agone, agnese, borromini, savoia, papa, pagano, cristiano. These are references to places of worship and works (Sant'Agnese in Agone, the Fountain of the Four Rivers, the tombs of the kings kept in the Pantheon, etc.), to the artists who created them (Raffaello Sanzio and Borromini above all), and to the history and the social and cultural context of the sites visited.

Figure 6.2: The arrangement of the words on the factorial plane

The arrangement of the terms on the factorial plane, irrespective of the groups, brings out the continuum of the visit, which begins with booking, the ticket office and the purchase, followed by queuing to enter and the appreciation of the monument's beauty, before visiting and immersing oneself in the artistic aspects and the history of the place. A rough stand-in for this kind of classification is sketched below.
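IRaMuTeQ implements Reinert's descending hierarchical classification; as a substitute for readers without it, the following sketch builds a binary document-term matrix and cuts an agglomerative (ascending) tree into four classes. It is not the Alceste algorithm itself, and the four one-line "reviews" are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = ["fila coda biglietto euro",        # practical/economic aspects
        "venditore abusivo polizia",       # safety and services
        "chiesa barocco raffaello tomba",  # cultural narration
        "roma monumento storia bello"]     # cross-cutting positive terms
X = CountVectorizer(binary=True).fit_transform(docs).toarray()
Z = linkage(X, method="ward")                  # hierarchical tree over the texts
print(fcluster(Z, t=4, criterion="maxclust"))  # class label for each text
```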
7. Conclusions and future developments

The themes brought to light are of clear interest to public administrators, who can hear directly from citizens what the main problems are from the users' point of view. On the basis of this kind of analysis, the decision-maker can assess whether and how to intervene to improve the management of cultural sites and assets. The information flow starts from citizens, who at the end of the process can obtain tangible benefits thanks to the very data they fed into the network. The process described in this work shows a classic use of Big Data: data produced for a specific purpose are subsequently used to achieve other objectives, contributing undeniable added value (Rudder, 2015). The text mining techniques applied have made it possible to exploit information that would otherwise have remained unused. Further and deeper analyses can be conducted with the same methodology and software used in this work. Monitoring can continue, enlarging the corpus for a longitudinal analysis of these same monuments, or studying other cities and other cultural assets, in order to improve management policies and optimize decision-making processes.

References

Alleva G. (2016). Più forza ai dati: un valore per il Paese. Relazione di apertura della 12a conferenza nazionale di statistica.
Bolasco S. (2014). Analisi multidimensionale dei dati. Metodi, strategie e criteri d'interpretazione. Carocci editore.
Bolasco S. (2013). L'analisi automatica dei testi. Fare ricerca con il text mining. Carocci editore.
Ceron A., Curini L., Iacus S.M. (2014). Social Media e Sentiment Analysis. L'evoluzione dei fenomeni sociali attraverso la Rete. Springer Italia.
Iezzi D.F. and Mastrangelo M. (2012). Il passaparola digitale nei forum di viaggio: mappe esplorative per l'analisi dei contenuti. Rivista Italiana di Economia, Demografia e Statistica, 66 (3-4), pp. 143-150.
Larman C. (2005). Applicare UML e i Pattern. Analisi e progettazione orientata agli oggetti. Luca Cabibbo (ed.), Pearson Education Italia.
Rudder C. (2015). Dataclisma. Chi siamo quando pensiamo che nessuno ci stia guardando. Mondadori.

Textometric Exploration of a Corpus of Legal Grounds (motifs) in International Transport Law

Fadila Taleb, Maryvonne Holzem
Université Rouen Normandie – fadila.taleb@etu.univ-rouen.fr, maryvonne.holzem@univ-rouen.fr

Abstract

Within the framework of research whose objective is to respond to a need formulated by the IDIT (Institut du Droit International des Transports), namely to help interpret the case-law texts contained in its database, we seek to highlight interpretive paths considered as modal scenarios. We propose here a preliminary textometric analysis in order to define the linguistic profile of the corpus and to detect certain repeated segments that may represent a relevant constraint to complete and enrich the interpretive paths identified in the case law.

Keywords: textometry, interpretive path, modal scenario, repeated segments, legal grounds, transport law.

1. Introduction

1.1. Context

Within a multidisciplinary project, PlaIR (Plateforme d'Indexation Régionale), computer scientists, linguists and legal scholars address the question of assistance with the interpretation of the case-law holdings of the IDIT database; our aim is that of instrumented assistance centred on the user's activity (cf. the work of the ʋ group: Holzem and Labiche, 2017).
From a linguistic standpoint, our task prior to any implementation consists in studying court decisions with the aim of understanding their structure, the argumentative mechanisms at work, and the movements of textual transformation likely to trigger interpretive paths that can assist the reading of these decisions. Our research departs from predictive models, predictive justice or legaltech, which, under the influence of big data and Machine Learning, produce litigation outcomes on an algorithmic basis. On this point, we share the fears of many jurists that legaltech might "itself become a new form of justice" (Garapon, 2017). What is at stake here is a textual (and intertextual) practice understood as a regime of transformation and interpretation. Our research therefore belongs to jurilinguistics, and its objective is to understand, through a linguistic approach and the study of the textual material, court decisions and the argumentative strategies they deploy, and thereby to assist their interpretation.

1.2. Research question and hypothesis

To assist interpretation, we seek to identify the argumentative strategies implemented by the judge, in particular the way the judge integrates and takes charge of the discourse of others (the parties to the trial, the experts, the legislator, etc.). Our hypothesis rests on earlier research (Holzem, 2014, and Taleb, 2014, the latter a research master's thesis in language sciences defended in June 2014: "Étude du scénario modal et du syllogisme juridique pour la compréhension du processus de production du texte. Cas des textes du droit."), which showed the value of taking linguistic modalities into account, following the model developed in Gosselin (2010), for constructing an interpretive path (Rastier, 2001) envisaged here as a modal scenario capable of assisting interpretation. But before undertaking such a textual analysis directly on the full texts, we needed to grasp, in its entirety and in its specificities, the linguistic profile of our study corpus. To this end we carried out an in-depth textometric analysis with the TXM software. In the course of our textometric investigations, we came to appreciate the importance of certain functions offered by these tools for detecting, for example, repeated segments (sequences of identical graphical forms attested several times in the text), which can represent a relevant constraint to complete the interpretive paths identified through the study of modalities. The objective of this article is to present, in broad outline and for reasons of space, the textometric analysis carried out on our corpus.

2. Corpus and methodology

2.1. Overall description

Following Rastier (2011), we adopted genre as the defining criterion of the reference corpus. It gathers texts (court decisions) belonging to judicial discourse (discourse produced by, or within, courts, to be distinguished from legal discourse, which designates, among other things, the domains of law or its sources, statutes, regulation, etc.: the one concerns the creation of law, the other its application) and to the jurisprudential genre (jurisprudence here meaning the "set of decisions rendered by the courts of a country, over a certain period, in a certain manner", Dictionnaire du vocabulaire juridique, LexisNexis, 2017, p. 322).
Taking up the corpus typology proposed by B. Pincemin (1999) and adopted by Rastier (2011), we distinguished four levels of corpus: (i) an existing/latent corpus (archives, for Rastier), which in our research corresponds to the IDIT database; (ii) a reference corpus, the set of digitized documents in the IDIT case-law holdings; (iii) a study corpus containing a delimited number of decisions selected for the needs of our research; and finally (iv) a distinguished corpus (election corpus, or sub-corpus for Rastier) corresponding to specific passages of the texts studied, called "les motifs" (the grounds). The grounds constitute the heart of the judgment, the judge setting out "(…) the reasons of fact and of law that justify the decision (…)" (Cohen and Pasquino, 2013). Our interest in this textual zone is doubly motivated. First, our objective is to locate the key moments of transformation in the judgment, in order to identify the argumentative strategies at work and thus assist their interpretation. Second, the statement of grounds is a component common to all decisions of all courts (which is not the case for the other components: the account of the dispute, for instance, does not appear in rulings of the Cour de cassation which, as the supreme court, oversees the correct application of legal norms and regards the trial judges' assessment of the facts as sovereign). It must meet a twofold requirement, logical and persuasive: the one stems from the imposed syllogistic form of legal reasoning, the other from the need to persuade the decision's audience (composed, certainly, of the parties directly concerned by the decision, but also of the judges of other courts and of a still wider public, the universal addressee), so as to avoid appeals and have the legal solution accepted as the only one possible.

2.2. Quantitative characteristics

The textual volume of the study corpus is 878,848 occurrences for 22,456 forms. The sub-corpus of grounds alone accounts for nearly half of the occurrences of the study corpus: it contains 393,092 occurrences for 14,742 forms. The asymmetry in the distribution of forms across the delimited zones shows the importance and the role of the grounds in court decisions: they are their raison d'être, and every judge is under an obligation to give reasons for the judgment.

2.3. Encoding and pre-processing

Our corpus has the advantage of being accessible online. However, the set of texts in PDF format is not homogeneous: some documents come from image formats that have not been through OCR (Optical Character Recognition, the step needed to decipher the shapes and render them here as letters).
Since the PDF format is not supported by TXM, we first converted the files (using OCR for the annotated and scanned ones) to TXT format, and then, in a second step, encoded them in XML, following the recommendations of the TEI (Text Encoding Initiative, a standard set of recommendations for encoding digital documents) for textual data. This encoding allows finer navigation within the corpus thanks to peritextual metadata, such as the type of court (commercial court (TC), court of appeal (CA), Cour de cassation (CC)), the date and the place, and intratextual metadata, such as those marking specific parts of the texts. We identified four main parts: facts, pleas, grounds, conclusions (faits, moyens, motifs, conclusions). The grounds and the conclusions are present in all the decisions studied. The facts are absent from CC rulings, and the pleas are not always flagged as such in CA rulings, being often recalled in the facts zone in the form of indirect discourse. The following figure represents the different phases of preparing the corpus before its textometric processing:

Figure 1: The stages of corpus preparation

For the move from TXT to XML-TEI we created tags specific to the genre of the corpus under study (such as <motifs>, <faits>, etc.). We used semi-automatic encoding by means of a tagger designed specifically for our study by Eric Trupin, associate professor of computer science at the LITIS laboratory (Laboratoire d'Informatique, du Traitement de l'Information et des Systèmes, Université Rouen Normandie). This indispensable corpus-preparation step was both time-consuming and delicate: handling handwritten annotations and cleaning older documents. A minimal sketch of the encoding step is given below.
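The snippet below emits one decision with genre-specific XML tags. The tag and attribute names follow the four parts and metadata mentioned above, but the exact schema used in the project is an assumption of ours:

```python
import xml.etree.ElementTree as ET

def encode(decision_id, juridiction, date, parts):
    """Wrap the four textual zones of one decision in genre-specific XML tags."""
    text = ET.Element("text", id=decision_id, juridiction=juridiction, date=date)
    body = ET.SubElement(text, "body")
    for name in ("faits", "moyens", "motifs", "conclusions"):
        if name in parts:  # e.g. CC rulings have no 'faits' zone
            ET.SubElement(body, name).text = parts[name]
    return ET.tostring(text, encoding="unicode")

print(encode("ca-rouen-2013-10-03", "CA", "2013-10-03",
             {"motifs": "Attendu que ...", "conclusions": "Par ces motifs ..."}))
```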
3. Textometric exploration of the distinguished corpus: the grounds zone

3.1. Occurrence study: lexical specificities

A first contrastive study using a flagship textometric treatment, the computation of specificities, gives an overall view of the lexical characteristics of the distinguished corpus, the grounds. (The specificity computation implemented in TXM rests on the hypergeometric law developed by Lafon (1984); the relevance threshold for a distribution is set at 2: at +2 the specificity index is positively significant, at -2 negatively significant, and an index between the two is unremarkable.) The table below lists the 20 forms most specific to this zone, sorted in decreasing order of its specificity index:

Figure 2: lexical specificities of the grounds zone

We note here the heavy use of occurrences characteristic of judicial discourse and constitutive of the grounds zone: Attendu, que, Considérant, attendu (in our corpus the form Considérant appears only in the CA rulings; its absence from the CC rulings is therefore itself significant), and likewise the connectors Mais and donc. Together these markers play a specific role here: they punctuate the judge's argumentation and secure its syllogistic progression. The heavy use of the future tense, represented by the verbs être (sera: 22.9) and condamner (condamnera: 14.6), is not surprising: before pronouncing the final verdict in an exclusively directive act (an utterance reserved for the operative-provisions zone), the judges first set out in the grounds zone the results of their argumentation (as the overuse of the verb résulte (19.6) also shows): "Le jugement entrepris sera confirmé en ses autres dispositions qui ne sont pas critiquées" (CA Rouen, 03/10/2013); "Le tribunal condamnera Monsieur le capitaine du […] ;" (TC de Rouen, 15/12/2003). The significant use of other words, such as équité, marchandises, inéquitable, reflects the theme of the texts studied: transport law. The significant use of the negation adverbs ne (+50.3) and pas (+38.5) reveals a particular characteristic of legal argumentation: faithful to the Spinozist principle Determinatio negatio est, negation manifests a replicative and resultative value (i.e. a referential bearing, answering something stated previously that no longer holds), preparatory to the legal transformation of the utterance. A sketch of the specificity computation follows.
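One common formulation of the Lafon-style specificity index takes its sign from over- or under-representation and its magnitude from the log of a hypergeometric tail probability. A sketch under that assumption; the exact convention in TXM may differ in detail, and the figures in the example are invented:

```python
from math import log10
from scipy.stats import hypergeom

def specificity(f, F, t, T):
    """f: frequency of the form in the subcorpus; F: its corpus frequency;
    t: subcorpus size in tokens; T: corpus size in tokens."""
    p_over = hypergeom.sf(f - 1, T, F, t)    # P(X >= f): over-representation
    p_under = hypergeom.cdf(f, T, F, t)      # P(X <= f): under-representation
    return -log10(p_over) if p_over <= p_under else log10(p_under)

# Toy check using the corpus/subcorpus sizes reported in section 2.2:
print(specificity(f=40, F=60, t=393092, T=878848))  # comfortably above the +2 threshold
```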
La marchandise dépréciée se trouve alors revalorisée (axiologique positif du point de vue juridique) par le changement des co-occurrents à droite (valeur CIF (coût + assurance + fret)). 18 Voire les travaux pionniers de A. Ducrot (1984) sur les valeurs argumentatives de Mais. 19 Les modalités axiologiques sont propres aux jugements de valeur de nature morale, idéologique et/ou légale. (Gosselin, 2010). 762 JADT’ 18 4. Conclusion À travers cette contribution nous avons voulu montrer l’intérêt que représente une étude textométrique pour l’appréhension de son corpus d’étude. Si notre objectif principal, celui de mettre au jour des parcours interprétatifs nommés scénarios modaux (Taleb 2015), est difficilement envisageable en se limitant à une stricte étude textométrique (car elle repose sur l’étude modale propre à chaque texte). L’approche textométrique s’est avérée néanmoins pertinente pour décrire et cerner le profil linguistique du corpus. Son principe différentiel essentiel du point de vue sémantique, nous a incitées à adopter cette démarche d’analyse contrastive indispensable. L’analyse contextuelle à plusieurs paliers nous a permis le repérage de constructions lexicales répétitives, comme l’exemple des « Mais attendu » exposé ici, qui se révèlent être des moments clés du jugement et donc parcours interprétatifs corrélatifs à une transformation modale. Références Cohen M. et Pasquino P. (2013). La motivation des décisions de justice, entre épistémologie sociale et théorie du droit. Le cas des Cours souveraines et des cours constitutionnelles. CNRS, New York University, University of Connecticut. Ducrot A. (1982). Le dire et le dit. Les Éditions de minuit, Paris. Garapon A. (2017). Les enjeux de la justice prédictive. La semaine juridique LexisNexis, N°12: 47-52. Gosselin L. (2010). Les modalités en français. La validation des représentations. Amsterdam-New-York : Rodopi B.V. Holzem M. (2014). Le Parcours interprétatif sous l’angle d’une transformation d’états modaux, dans Numes Correia C. et Coutinho M. A. (eds), Estudos Linguisticos : Linguistic studies , n° 10, p. 283-295. Holzem M. Labiche J (2017) Dessillement numérique : énaction, interprétation, connaissances. Bruxelles, Bern, Berlin : PIE Peter Lang. Lafon P. (1984). Dépouillements et Statistiques en Lexicométrie. SlatkineChampion. Pincemin B. (1999). Diffusion ciblée automatique d’informations : conception et mise en œuvre d’une linguistique textuelle pour la caractérisation des destinataires et des documents, Thése de Doctorat en Linguistique, Universit. Paris IV Sorbonne, chapitre VII. Rastier F. (2001). Art et science du texte. Puf. Rastier 2011 Rastier F. (2011). La mesure et le grain. Paris, Éditions Honoré Champion. Taleb F. (2015). Les modalités linguistiques pour aider à l’interprétation de textes juridiques. Actes Interface TAL IHM (ITI'2015), 22ème Congrès TALn, Caen. JADT’ 18 763 The Framing of the Migrant: Re-imagining a Fractured Methodology in the Context of the British Media. James M. Teasdale Sapienza University of Rome - teasdale.1650019@studenti.uniroma1.it Abstract 1 This study analyses the portrayal of migrants and migration in the British press over two periods, using frame analysis as a foundation methodology, while attempting to improve upon the methodology used in similar studies. The study holds the ‘frame’ to be the key organising feature in the portrayal of migrants and these frames can be located through a cluster analysis of textual data. 
4. Conclusion

With this contribution we have sought to show the value of a textometric study for apprehending one's study corpus. Our main objective, bringing to light interpretive paths called modal scenarios (Taleb, 2015), is hard to achieve within a strictly textometric study, since it rests on the modal analysis specific to each text. The textometric approach nevertheless proved relevant for describing and delimiting the linguistic profile of the corpus. Its differential principle, essential from the semantic point of view, prompted us to adopt this indispensable contrastive approach. Contextual analysis at several levels enabled us to locate repetitive lexical constructions, such as the "Mais attendu" example presented here, which turn out to be key moments of the judgment and hence interpretive paths correlated with a modal transformation.

References

Cohen M. and Pasquino P. (2013). La motivation des décisions de justice, entre épistémologie sociale et théorie du droit. Le cas des Cours souveraines et des cours constitutionnelles. CNRS, New York University, University of Connecticut.
Ducrot O. (1984). Le dire et le dit. Les Éditions de Minuit, Paris.
Garapon A. (2017). Les enjeux de la justice prédictive. La semaine juridique, LexisNexis, n°12: 47-52.
Gosselin L. (2010). Les modalités en français. La validation des représentations. Amsterdam-New York: Rodopi B.V.
Holzem M. (2014). Le parcours interprétatif sous l'angle d'une transformation d'états modaux. In Numes Correia C. and Coutinho M.A. (eds), Estudos Linguisticos / Linguistic Studies, n°10, pp. 283-295.
Holzem M. and Labiche J. (2017). Dessillement numérique : énaction, interprétation, connaissances. Bruxelles, Bern, Berlin: PIE Peter Lang.
Lafon P. (1984). Dépouillements et statistiques en lexicométrie. Slatkine-Champion.
Pincemin B. (1999). Diffusion ciblée automatique d'informations : conception et mise en œuvre d'une linguistique textuelle pour la caractérisation des destinataires et des documents. Thèse de doctorat en linguistique, Université Paris IV Sorbonne, chapter VII.
Rastier F. (2001). Arts et sciences du texte. Puf.
Rastier F. (2011). La mesure et le grain. Paris: Éditions Honoré Champion.
Taleb F. (2015). Les modalités linguistiques pour aider à l'interprétation de textes juridiques. Actes Interface TAL IHM (ITI'2015), 22ème Congrès TALN, Caen.

The Framing of the Migrant: Re-imagining a Fractured Methodology in the Context of the British Media

James M. Teasdale
Sapienza University of Rome - teasdale.1650019@studenti.uniroma1.it

Abstract

This study analyses the portrayal of migrants and migration in the British press over two periods, using frame analysis as a foundational methodology, while attempting to improve upon the methodology used in similar studies. The study holds the 'frame' to be the key organising feature in the portrayal of migrants, and these frames can be located through a cluster analysis of textual data. The first aim of the work is to ascertain how far location and time affect the deployment of one frame or another and what these frames consist of, and thereby to provide a detailed analysis of how migration is portrayed in the British press: a focus sorely lacking in previous frame analysis studies to date. The study demonstrates that six frames can be identified over the two periods, four thematic and two episodic. The 'negative' and 'positive' migrant frames were both present in the first period, as the 'local' focus provided ideal ground for the former's deployment: the subject was located closer to home and was depicted as a threat. The second period saw the dominance of the 'positive' migrant frame following the death of Alan Kurdi and the corresponding conceptual shift to the 'global', which removed the subject from the immediate border and placed them in a wider context. This was coupled with the overlap of the domestic responsibility frame with the 'positive' migrant frame, as the two became intimately linked in the second period, while the European responsibility frame also arose. This demonstrates that the hegemony of one frame can be challenged, but only if the corresponding situation is 'drastic' enough to allow it.

Keywords: migration, frame analysis, cluster analysis, British media, text mining
1. Introduction

1.1 Frame analysis and the migration crisis

Over the last two decades frame analysis has become an increasingly popular tool for analysing the portrayal of a subject in the media, due to its ability to demonstrate the latent and manifest meaning of the news and the recurring themes and elements shared between individual texts (Zhongdang and Kosicki, 1993). According to Entman, 'framing essentially involves selection and salience. To frame is to select some aspects of a perceived reality and make them more salient in a communicating text, in such a way as to promote a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation for the item described.' (Entman, 1993). A reality is presented to the audience, a reality that can be considered a package of information whose constituent parts together form the frame being deployed (Gamson et al., 1983). One frame is distinguishable from another precisely because this collective package is the sum of its parts. These parts are defined as framing devices and reasoning devices, which are discovered alongside one another, thereby indicating the presence of one frame or another. Framing devices can consist of metaphors, visual images, lexical choices, stereotypes, idioms, etc. (Tankard et al., 2001), which in turn support reasoning devices within the same frame that define the problem, assign responsibility, pass judgement and present possible solutions (Entman, 1993). As a relatively new approach, and apart from its shared inheritance from cognitive psychology (Bartlett, 1932), anthropology (Bateson, 1972) and the seminal work of Erving Goffman (Goffman, 1974), frame analysis remains a fluid approach lacking empirical and methodological consistency across studies. Some authors have even questioned whether the school can, in and of itself, be considered a paradigm, given this diversity (D'Angelo, 2002: 871; Entman, 1993: 51). This paper is not concerned with that contention, but it does strive to arrive at a methodology which incorporates various elements of previous techniques into a complementary approach, thereby minimising the criticism normally fired at the more extreme approaches deployed in the past for their perceived rigidity and shortcomings.

To date very little frame analysis has been directed towards migration, especially in the British context. Despite the migration crisis showing no signs of abating, the response of Europe has generally been characterised by two approaches: (i) strengthening internal and external borders to restrict movement throughout Europe; (ii) disrupting attempted crossings of the Mediterranean. Britain is particularly interesting within this context, not only as a state which has consistently tried to curb entry at an official level, but also because of the media's and public's keen obsession with migration, ultimately exemplified in the Brexit referendum. The media can be considered central to this response. Whether one considers it to embody public opinion or elite opinion, it is nonetheless an incarnation of a country's position and can be seen as acting as an arbiter of that country's opinion. The current migration crisis is as complex as it is pressing, and the 'reality' presented by the media should not be seen as natural, ready to be recorded and transmitted from one human being to another, but rather as something constructed and then transmitted, in line with constructivist theory (Goffman, 1974). The media is therefore able to set the agenda and frame the debate on the migration crisis, in turn affecting the reality in the mind of the population and government. This paper has two aims. The first is to develop a methodology which combines previous qualitative and quantitative approaches in order to improve validity and reliability, while the second is to use said methodology to ascertain how migration is portrayed by the British media and how far this portrayal is affected by factors such as time and geographical focus.
The current migration crisis is as complex as it is pressing, and the ‘reality’ presented by the media should not be seen as natural, ready to be recorded and transmitted from one human being to another, but rather as something that is constructed and then transmitted according to constructivist theory (Goffman 1974). The media is therefore able to set the agenda and frame the debate on the migration crisis, in turn affecting the reality in the mind of the population and government. This paper has two aims in mind. The first is to develop a methodology which combines previous qualitative and quantitative approaches in order to improve validity and reliability while the second is to use said methodology to ascertain how migration is portrayed by the British media and how far this portrayal is affected by factors such as time and geographical focus. 766 JADT’ 18 2. Methodology The study’s methodology was constructed with historical criticisms directed at frame analysis in mind; either that the process is too qualitative and therefore lacks reliability, or that it is conducted too quantitively, and therefore lacks reliability. The first step was to collect the data, which was obtained manually from four daily British newspapers’ online archives (the Daily Express, the Guardian, the Telegraph and the Daily Mail), and included all newspaper articles which included ‘migration’, ‘migrant’, ‘refugee’ etc. in the title, or whose content largely dealt with such topics. The two periods of investigation are 28th to 31st July 2015 and 2nd to 6th September 2015, these dates were chosen in order to ascertain whether frames could be consistently identified across two periods, even in the short term, but also to investigate whether dominant frames can be challenged if events are deemed drastic enough (the tragic death of Alan Kurdi became the dominant news story in the second period, whereas the first was primarily concerned with the Calais crisis). In total 505 were gathered, 160 for the first period and 345 for the second. The quantitative aspect of the study consists of a computer assisted approach, by using cluster analysis to process the data and indicate the presence of ‘frames’. Because, as mentioned above, framing is considered to be the grouping and salience of certain elements to the neglect of others, one can consider the cluster generated by a computer to precisely be a direct indication of the presence of one frame or another, as words are the primary form framing elements assume. The software used was the R program in conjunction with the Iramuteq interface. The clustering method used is that of Reinert (Reinert 1983), whose conception of clusters as a ‘cognitiveperceptive framework’ lends itself perfectly to frame analysis, concerned as it is with discerning different representations of a perceived reality. The second, more qualitative step of the study, was to conduct a deep read of all the texts, where the researcher intuitively coded texts and created a frame matrix which allowed an awareness of the context of the text as well as those framing and reasoning devices which seemed re-occurring and therefore significant. Combined, this allowed the reliability of the initial cluster analysis generated by the computer to be complemented by the in depth familiarity of the researcher, which provided a validity to the interpretation of results. JADT’ 18 767 3. Results Figure 1. Cluster analysis for first period Figure 2. 
3. Results

Figure 1: Cluster analysis for the first period
Figure 2: Cluster analysis for the second period

The two cluster analyses each identify three distinct clusters, yet those identified in the second period vary dramatically with respect to the first. The first period under investigation generated three clusters, which have been labelled the Refugee Cluster (red), the Migrant Cluster (green) and the Calais Crisis Cluster (blue). The second period, however, produced three different clusters: the Migration as a Domestic Issue Cluster (red), the Migration as a European Issue Cluster (green) and the Migrant Crisis Cluster (blue). At first glance these results seem to refute the basis of framing theory: that frames are not produced by the journalist but are deployed from the cultural repertoire the journalist cognitively holds in common with the rest of society (Goffman, 1974). If framing theory is correct, then in the space of one month it would be impossible for frames to mutate completely, and one would expect the clusters identified in the first period to be identical to those found in the second. However, if one distinguishes between issue-specific and generic frames, and between episodic and thematic frames (de Vreese, 2005), the two cluster groups are far more similar than first meets the eye. For instance, the first period produced two frames predominantly concerned with the figure of the migrant and two differing portrayals of the migrant: the migrant as a helpless victim and the migrant as an opportunistic individual. Both can be considered thematic frames, as these clusters do not refer to one story but rather represent a thematic perspective. The third frame, however, can be categorised as an issue-specific frame, concerned as it is only with the Calais crisis, the 'Jungle' camp and the stories of migrants attempting to enter the Channel Tunnel. The second period, similarly, consists of two thematic frames (one that considers migration an issue for the British government, and one that considers it to belong to the realm of European governance) and one episodic frame (the stories relating specifically to the death of Alan Kurdi and to the migrants attempting to move through Hungary and Austria in the early days of September 2015). If the two episodic frames are laid aside, four remain: the 'negative' migrant frame, the 'positive' migrant frame, the domestic responsibility frame and the European responsibility frame. What is interesting to note in the second period is that the 'positive' migrant frame from the first period does not disappear, but overlaps with, and bolsters/is bolstered by, the arising domestic responsibility frame. For example, many of the key terms of the 'positive' migrant frame (vulnerable, refugee, conflict, persecution, support, receive, community, etc.) are emblematic of those found in the so-called domestic responsibility frame (vulnerable, refugee, sanctuary, hazardous, save, help, etc.). This means that rather than 'disappearing', the frame which represents migrants as individuals in need has been combined with the arising domestic responsibility frame. However, this does not account for the disappearance of the 'negative' migrant frame. The reason for this lack of presence, and likewise for the merging of the 'positive' migrant frame and the domestic responsibility frame in the second period, is the shock events linked to the tragic death of Alan Kurdi on September 2nd 2015.
The event seems to have made the deployment of the 'negative' migrant frame untenable in the second period, while the 'positive' migrant frame persisted as the period proved more fertile for this perspective. This is one reason why the two frames overlapped in the second period: the outrage and shock at the death of a toddler ultimately led to the solution of the 'positive' migrant frame being located in the domestic responsibility frame. Interestingly, this overlap did not occur with the European responsibility frame, which may be due to British political actors (the majority of those interviewed across the articles) actively positioning themselves as ready to help migrants in order to show themselves in a positive light. Another interesting finding is how location affected, or at least was linked to, the change in hegemony between the 'positive' and 'negative' migrant frames. In the first period, the obsession with the Calais crisis (demonstrated by the presence of the corresponding episodic frame) seemingly provided conceptual ground in which the 'negative' migrant frame could flourish, whereas the second period, dominated as it was by news of the death of Alan Kurdi (and by the presence of a more international episodic frame), ensured the continued presence of the 'positive' migrant frame. One reason could be that when the migrant is located nearer the British border, the 'negative' migrant frame (characterised by terms such as arrest, siege, repel, overwhelm) is more easily deployed, owing to the greater unease about foreign migrants entering the country, whereas when the focus was positioned more globally this unease was overcome by the moral shock of Alan Kurdi's death, lessening the appropriateness of the previous frame.

Despite demonstrating some continuity of frames across the two periods, that geographical focus affects the deployment of one frame or another, and that shock events can shift the frames in play to a great extent, the study is not without shortcomings. Firstly, the two time periods, each limited to four days, greatly reduce the data available. This in turn makes it impossible to understand how robust the identified frames are across an extended period of time, and whether other frames come into play depending on the specific moment or the dominant news story. One solution could be to extend the time frame, but this might in turn reduce validity and insight, given the limitations of the researcher in dealing with the data to the same extent as a computer. The second issue, as already mentioned, is determining precisely the characteristics of one frame in relation to another. One possible solution would be to predetermine the terms identified as framing elements or reasoning devices as variables in the cluster analysis, which would limit the identification of episodic frames in favour of thematic frames and, over a longer period, more clearly define the continuation, and the fluctuation in presence, of identified frames. The drawback, however, is that the subjectivity of the researcher arguably enters at too early a stage and harms the validity of the methodology.
A third point is that, although the cluster analysis did capture many of the framing devices (as they are commonly exhibited as words), it was unable to capture all of them (for instance accompanying images) and was largely unable to identify the presence of reasoning devices (as the unit of analysis needs to be bigger than single word choice).

References
Bartlett, F. (1932). Remembering: A Study in Experimental and Social Psychology. Cambridge University Press.
Bateson, G. (1972). Steps to an Ecology of Mind: Collected Essays in Anthropology, Psychiatry, Evolution, and Epistemology. University of Chicago Press.
D'Angelo, P. (2002). News Framing as a Multiparadigmatic Research Program: A Response to Entman. Journal of Communication, 52(4): 870-888.
Entman, R.M. (1993). Framing: Toward Clarification of a Fractured Paradigm. Journal of Communication, 43(4): 51-58.
Gamson, W.A. and Lash, K.E. (1983). The Political Culture of Social Welfare Policy. In S.E. Spiro and E. Yuchtman-Yaar (eds), Evaluating the Welfare State: Social and Political Perspectives. Academic Press.
Goffman, E. (1974). Frame Analysis: An Essay on the Organization of Experience. Harper and Row.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Les cahiers de l'analyse des données, 8(2): 187-198.
De Vreese, C.H. (2005). News Framing: Theory and Typology. Information Design Journal and Document Design, 13(1): 51-62.
Pan, Z. and Kosicki, G.M. (1993). Framing Analysis: An Approach to News Discourse. Political Communication, 10(1): 55-75.
Tankard, J.W. and Severin, W.J. (2001). Communication Theories: Origins, Methods and Uses in the Mass Media, 5th Edition. Pearson.

Results from two complementary textual analysis software packages (Iramuteq and Tropes) to analyze social representations of contaminated brownfields
Marjorie Tendero¹, Cécile Bazart²
¹University of Rouen – CREAM and Agrocampus Ouest – marjorie.tendero@agrocampusouest.fr
²University of Montpellier – CEE-M – cecile.bazart@umontpellier.fr

Abstract
The aim of this paper is to demonstrate the complementarity of two types of textual analysis software, Iramuteq and Tropes, in analyzing a corpus of data extracted from an open-ended question of a national cross-sectional survey. The descending hierarchical classification performed with Iramuteq leads to fewer and more homogeneous discourse classes than the reference fields produced with Tropes. The reference fields, in turn, reveal how the themes of the corpus identified with Iramuteq are articulated.

Résumé
Cette communication présente l'apport complémentaire de deux logiciels d'analyse de contenu, Iramuteq et Tropes, pour analyser les représentations sociales à partir de réponses données à une question ouverte dans un questionnaire d'enquête. Il montre que les classifications hiérarchiques descendantes opérées à l'aide du logiciel Iramuteq peuvent être approfondies de façon complémentaire à l'aide des classifications sémantiques par univers de références et l'outil scénario du logiciel Tropes. Les classes de discours sont moins nombreuses et plus homogènes que les univers de références mis en évidence par le logiciel Tropes. Ces derniers montrent l'articulation des thématiques du corpus.

Keywords: Brownfield; Classifications; Iramuteq; textual data analysis; Tropes.

1. Introduction
Content analysis covers the techniques that allow a systematic and objective analysis of written and oral communication.
It is a multidisciplinary approach combining quantitative and qualitative methods, with a wide range of application domains: communication sciences, sociology, psychology, computer science and economics, for example. These techniques study the structure of a text or a discourse, as well as its logic, in order to bring out the context in which it is produced and its actual meaning on the basis of objective data. They make it possible to process answers to open-ended questions by grounding the interpretation of the phenomenon under study in quantitative, objective criteria (Garnier and Guérin-Pace 2010). When analysing answers to open-ended questions, one advantage of these methods is that they avoid the biases linked to a posteriori thematic coding. The approach is nevertheless open to criticism, in particular regarding the steps required to prepare the corpus, run the analyses and interpret the results. During corpus preparation, a lemmatisation may be carried out; yet lemmatisation sometimes conflates forms whose use, in a given context, leads to misreadings (Lemaire 2008). This is the case when a plural form is lemmatised to the singular. Moreover, the expression dictionaries used by the software may fail to capture modality markers such as negation (Fallery and Rodhain 2007). Furthermore, differences interpreted in terms of content may in fact stem from social differences in the way individuals express themselves orally or in writing. Problems of homonymy, polysemy and synonymy can therefore lead to the construction of distinct lexical classes which are heterogeneous in form but in fact very similar in substance; this is the case of opinions expressed through periphrases, paraphrases or ellipses. Particular attention must therefore be paid to the treatment of ambiguities in order to avoid any misinterpretation. For this reason, it is worth combining two complementary approaches, and hence two different content analysis software packages, which helps ensure the validity of the results (Vander Putten and Nolen 2010; Lejeune 2017). This is, for example, what was done on a corpus of interviews to compare the Nvivo and Wordmapper packages (Peyrat-Guillard 2006). In this paper we highlight the complementary contribution of the Iramuteq and Tropes packages for analysing the social representations associated with polluted brownfields, based on the answers given to an open-ended question in a survey administered at national level to 803 individuals living in a municipality affected by this type of land. The next section presents the methodology adopted, the data collected and the analyses carried out. In the third section we present the results obtained with Iramuteq, and in the fourth section those obtained with Tropes. In a final section we discuss the complementary contributions of these two packages for the study of social representations through the analysis of answers to an open-ended question.
2. Methodology
We designed a questionnaire to study individual perception of the risk of soil pollution, together with the representations and perceptions relating to urban brownfields and their redevelopment. The questionnaire was administered to residents of municipalities affected by a polluted brownfield¹. In total, 803 complete responses were collected across 503 municipalities affected by the presence of a polluted brownfield. To analyse the social representations associated with polluted brownfields, we used the following open-ended question: "What do you associate with the expression friches urbaines (urban brownfields)?". We carried out a textual data analysis, since this technique is particularly well suited to the study of individual or social representations, as it captures the representational and cognitive dynamics of a phenomenon (Abric 2003; Beaudouin and Lahlou 1993; Kalampalikis 2005; Negura 2006). All questions were compulsory. Nevertheless, not all participants managed to answer them: some responses were merely a string of characters entered to move on to the next question. In addition, this open-ended question was located in the second part of the questionnaire, which was relatively long; this resulted in attrition. We therefore excluded these responses from our analysis. In total, 539 responses could be retained, i.e. 67.12% of the responses collected. The data were first formatted for analysis with the IRaMuTeQ software (Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires), version 0.7 alpha 2. This is free software developed by Pierre Ratinaud at LERASS (Laboratoire d'Études et de Recherche Appliquées en Sciences Sociales) and distributed under the GNU GPL (v2) licence (Baril and Garnier 2015; Ratinaud and Déjean 2009). Table 1 shows an extract of the responses analysed.

Table 1: Extract of the analysed corpus. Each line carries a respondent identifier, the associated metadata modalities (e.g. percept_eleve, danger_oui, confiance_non, sexe_f, age_4059, reg_als) and the free-text answer. The four answers in the extract read: "Abandonnée, sale, nuisible"; "Zones non_habité"; "Un jardin en ville, laissé à l'abandon."; "zone abandonnée, zone polluée ville".

¹ These municipalities were identified from an extraction of the BASOL database on polluted (or potentially polluted) sites and soils calling for preventive or remedial action by the public authorities.
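For readers unfamiliar with the corpus format shown in Table 1, the following is a minimal sketch of how such a file can be produced; the variable names and answers are taken from the extract above, and the output file name is an assumption.

```python
# Write survey answers in Iramuteq's corpus format: one '****' header line
# of starred variable_modality pairs per text, followed by the raw answer.
answers = [
    {"id": "0001", "sexe": "h", "age": "4059", "reg": "centre",
     "texte": "Abandonnée, sale, nuisible"},
    {"id": "0003", "sexe": "f", "age": "4059", "reg": "als",
     "texte": "Un jardin en ville, laissé à l'abandon."},
]

with open("corpus_friches.txt", "w", encoding="utf-8") as out:
    for a in answers:
        out.write(f"**** *id_{a['id']} *sexe_{a['sexe']} "
                  f"*age_{a['age']} *reg_{a['reg']}\n")
        out.write(a["texte"] + "\n\n")
```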
The analysed text corpus has the characteristics described in Table 2 below.

Table 2: Descriptive statistics of the analysed "friche" corpus
Number of responses: 539
Number of words (occurrences): 2,177
Mean number of words per response: 4.04
Number of active forms (total): 1,537
Number of supplementary forms (total): 640
Number of hapax: 275
Number of forms: 482
Number of distinct active forms: 402
Number of distinct supplementary forms: 80

We compare the following analyses: the descriptive statistics and descending hierarchical classification carried out with Iramuteq, and the reference universes and scenario tool of Tropes. The latter is a semantic text analysis package created in 1994 by Pierre Molette and Agnès Landré on the basis of Rodolphe Ghiglione's work on propositional discourse analysis (Molette, Landré and Ghiglione 2013).

3. Results of the Iramuteq analysis
3.1. Descriptive statistics
The table below lists the terms most frequently used by respondents (frequency ≥ 20) when evoking polluted brownfields. The latter are "terrains" (plots of land, 99 occurrences) or "zones" (36) left to "abandon" (abandonment, 106). They are plots on which former "usines" (factories, 29), now "désaffectées" (disused, 17), once stood.

Table 3: Most frequently used terms (descriptive statistics from Iramuteq)
Active form | Count | Type
Abandon | 106 | Noun
Terrain | 99 | Noun
Laisser | 63 | Verb
Abandonner | 49 | Verb
Ville | 46 | Noun
Zone | 36 | Noun
Terrain vague | 34 | Noun
Usine | 29 | Noun
Pollution | 28 | Noun
Ancien | 28 | Adjective
Espace | 25 | Noun
Bâtiment | 25 | Noun
Sol | 20 | Noun

3.2. Descending hierarchical classification
65.49% of the responses are classified into four categories. Table 4 below indicates the significance of the terms associated with each class. The first class groups terms referring to former industrial activities. The second class relates to the problem of waste management in urban areas, evoking "décharges" (dumps), "saletés" (filth) and "pollution". The third class corresponds to terms characterising this type of space. The fourth class refers to the natural spaces that brownfields represent, in particular in the case of agricultural plots left fallow.

4. Complementary results provided by Tropes
We formatted the corpus for analysis with Tropes. The analysis of reference universes allows us to bring out the main themes of the text by grouping terms into classes of semantic equivalents. Table 5 below presents the results obtained for the reference universes with Tropes. The classifications are given in decreasing order and indicate the number of terms attached to each universe. These classifications do not always cover all the terms used in the corpus: only the most significant nouns of the text appear in them. It is nevertheless possible to configure these classifications through the software's scenario mode; Figure 1 shows an extract of it.
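To make the notion of reference universes concrete, here is a toy illustration of the grouping principle implemented in Tropes: terms are mapped onto broader semantic classes through a dictionary. The mini-dictionary below is hand-made from the examples of Table 5; Tropes ships its own, much larger, dictionaries of semantic equivalents.

```python
# Map corpus terms onto 'reference universe'-like semantic classes and
# count how many terms fall into each class.
from collections import Counter

universe_of = {
    "ville": "Ville", "taudis": "Ville", "zone urbaine": "Ville",
    "usine": "Industrie", "zone industrielle": "Industrie",
    "déchet": "Déchet", "détritus": "Déchet",
    "jachère": "Agriculture", "culture": "Agriculture",
}

def reference_universes(terms):
    """Count corpus terms per semantic universe; unknown terms are skipped."""
    return Counter(universe_of[t] for t in terms if t in universe_of)

print(reference_universes(["usine", "déchet", "ville", "jachère", "friche"]))
```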
5. Discussion and conclusion
Table 6 details the respective advantages and constraints of these two software packages for analysing the social representations of polluted brownfields. In particular, the semantic classification by reference universes and the scenario tool bring out classes that are more numerous and less homogeneous than those of the descending hierarchical classification performed with Iramuteq.

Table 4: Results of the descending hierarchical classification with Iramuteq (most characteristic active forms per class, with χ² where legible; p < 0.05 unless marked NS)
Class 1 (39.7%), former industrial activities: abandonner (58.95), usine (42.73), ancien (29.1), bâtiment (28.82), industriel (22.13), polluer (17.24), désaffecté (15.66), site (15.66), immeuble (14.05), zone (13.72), industrie (10.9), lieu (10.87), non_construit (9.29), endroit (7.83), vieux (7.72).
Class 2 (15%), waste management problems in urban areas: pollution (151.38), sol (59.79), followed by saleté, décharge, terre and culture.
Class 3 (33.7%), abandoned and unused area: terrain (107.94), abandon (84.27), followed by laisser, friche, milieu_urbain, sauvage, ville, repos, désert, aller and non_utilisé (the last NS).
Class 4 (11.6%), agricultural fallow space: espace (114.47), nature (62.29), vert (46.45), libre (41.31), non_exploité (38.60), champ (30.79), non_entretenu (24.89), entretenir (5.81) and non_cultivé (2.71, NS).

Table 5: Main reference universes associated with the corpus (number of attached terms and examples of associated terms)
First set of reference universes: Ville (74: ville, taudis, zone urbaine), Lieu (59: zone), Habitat (55: bâtiments, immeubles, logement, appartements), Industrie (50: zone industrielle, industrie, usine), Immeuble (36: bâtiments, immeuble), followed by Pollution, Écologie, Déchet, Agriculture and Terre (between 20 and 33 terms each).
Second set of reference universes: Ville (73: ville, taudis, milieu urbain, zone urbaine), Lieu (59: site, zone, lieu), Industrie (50: industrie, zone industrielle, usines), followed by Bâtiment, Pollution, Plantes, Déchet, Agriculture and Terre (between 20 and 39 terms each); associated terms include polluant, pollution, dépotoir; végétation, herbe, ronce; déchet, détritus; jachère, cultures; sols, terre.

Figure 1: Extract of the scenarios in Tropes (ascending order)

This tool makes it possible to deepen and validate the interpretation based on the descending hierarchical classification carried out with Iramuteq. The two packages therefore appear complementary. This complementarity nevertheless remains to be verified on other types of corpus (interviews, for example). Finally, to study the social representations of polluted brownfields among populations affected by this type of site, it would be interesting to identify the emotional and affective lexicon used, for instance with EMOTAIX (Piolat and Bannour 2009). This would make it possible to better capture the affective dimension of behavioural intentions towards this type of site.
Table 6: Comparison of the functionalities of Iramuteq and Tropes for the analysis of social representations
Text segmentation: Iramuteq, text segments; Tropes, canonical propositions.
Text style: Tropes only.
Staging (mise en scène): Tropes only.
Episodes and bursts (rafales): Tropes only.
Classifications: Iramuteq, descending hierarchical classification; Tropes, reference universes.
Scenario tool: Tropes only.
Descriptive statistics: Iramuteq only.
Similarity analysis: Iramuteq; indirectly available in Tropes through word graphs (area or star layouts).
Specificity analysis and correspondence factor analysis: Iramuteq only.
Prototypical analysis: Iramuteq only.
Main asset for the study of social representations: Iramuteq, richness of the analyses and results; Tropes, less constraining corpus formatting.
Main drawback for the study of social representations: Iramuteq, lengthy corpus formatting; Tropes, automated lemmatisation and classification yielding results that are not very readable.

References
Abric, J.-C. (2003). Méthodes d'étude des représentations sociales. ERES.
Baril, É. and Garnier, B. (2015). Utilisation d'un outil de statistiques textuelles : IRaMuteQ 0.7 alpha 2. Interface de R pour les analyses multidimensionnelles de textes et de questionnaires. Institut National d'Études Démographiques.
Beaudouin, V. and Lahlou, S. (1993). L'analyse lexicale : outil d'exploration des représentations. Cahier de Recherche C (48): 25-92.
Fallery, B. and Rodhain, F. (2007). Quatre approches pour l'analyse de données textuelles : lexicale, linguistique, cognitive, thématique. In XVIème Conférence de l'Association Internationale de Management Stratégique. Montréal, Canada.
Garnier, B. and Guérin-Pace, F. (2010). Appliquer les méthodes de la statistique textuelle. Les collections du CEPED (Centre Population et Développement). Paris: CEPED.
Kalampalikis, N. (2005). L'apport de la méthode Alceste dans l'analyse des représentations sociales. In J.-C. Abric (ed.), Méthodes d'étude des représentations sociales, 147-163. ERES.
Lejeune, C. (2017). Analyser les contenus, les discours, ou les vécus ? À chaque méthode ses logiciels ! In Les méthodes qualitatives en psychologie et sciences humaines de la santé, Dunod, 203-224.
Lemaire, B. (2008). Limites de la lemmatisation pour l'extraction de significations. In 9èmes Journées Internationales d'Analyse Statistique des Données Textuelles, 725-732. Lyon, France.
Molette, P., Landré, A. and Ghiglione, R. (2013). Tropes. Version 8.4. Manuel de référence. http://tropes.fr/doc.htm.
Negura, L. (2006). L'analyse de contenu dans l'étude des représentations sociales. SociologieS, Théories et recherches (October).
Peyrat-Guillard, D. (2006). Alceste et WordMapper : l'apport complémentaire de deux logiciels pour analyser un même corpus d'entretien. In Journées d'Analyse Statistique des Données Textuelles, 725-736. Besançon, France.
Piolat, A. and Bannour, R. (2009). EMOTAIX : un scénario de Tropes pour l'identification automatisée du lexique émotionnel et affectif. L'Année Psychologique, 109(4): 655. https://doi.org/10.4074/S0003503309004047.
Ratinaud, P. and Déjean, S. (2009). IRaMuTeQ : implémentation de la méthode ALCESTE d'analyse de texte dans un logiciel libre. Modélisation Appliquée aux Sciences Humaines et Sociales (MASHS), 8-9.
Vander Putten, J. and Nolen, A.L. (2010). Comparing Results from Constant Comparative and Computer Software Methods: A Reflection About Qualitative Data Analysis. Journal of Ethnographic and Qualitative Research, 5: 99-112.
Acknowledgements
We thank Jean-Marc Rousselle for administering this questionnaire online with LimeSurvey. This survey benefited from the financial support of SRUM 2015, the University of Montpellier, the CEE-M (LAMETA), ADEME, the Région Pays-de-la-Loire and the CREAM (University of Rouen).

Multilingual Sentiment Analysis
Matteo Testi¹, Andrea Mercuri¹,², Francesco Pugliese¹,³
¹Deep Learning Italia – m.testi@deeplearningitalia.com
²Tozzi Institute – a.mercuri@deeplearningitalia.com
³Italian National Institute of Statistics – francesco.pugliese@istat.it

Abstract
In recent years, Sentiment Analysis (SA) has attracted significant attention in different areas of research and business. This is because "sentiments" can influence the opinions of product vendors, politicians and the public at large. The sentiments of users are generally categorised into three classes: negative, positive or neutral. Lately, more and more Deep Learning (DL) models have been applied to SA thanks to their capacity for automatic high-dimensional feature extraction. However, supervised DL models are data-hungry, and the shortage of sentiment data sets in specific languages (other than English) is a serious issue. In order to address this multilingual shortage of training sets we propose a very deep Recurrent Convolutional Neural Network (RCNN) which achieves state-of-the-art accuracy in sentiment classification. By extracting keywords from the final max-pooling layer we are able to create a corpus of domain-specific keywords. Exploiting these "discriminative" extracted words, we scrape a long sequence of sentences (in two different languages) with which to feed a Neural Machine Translation model. A sequence-to-sequence model with attention and beam search has been implemented to translate sentences from one language (i.e. English) into another (i.e. Italian). As an example, we train our RCNN on an English Twitter sentiment training set and extract keywords to generate the machine translation model. During the test stage, we translate our test sentences (i.e. tweets) into another language for which we have a poor training set (i.e. Italian). Results highlight a significant accuracy gain of this technique with respect to a model trained exclusively on a poor training set expressed in a language other than English.

Keywords: sentiment, analysis, multilingual, deep, learning, recurrent, convolutional, neural, machine, translation

1. Introduction
In recent years, Sentiment Analysis (SA) has attracted significant attention in different areas of research and business. This is mainly due to the fact that "sentiments" (which are exhibited on the web by users) can affect the opinions of product vendors, politicians and readers in general, namely public opinion. According to one of the most accredited definitions, Sentiment Analysis is the field of study that analyses people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organisations, individuals, issues, events, topics, and their attributes (Qurat Tul Ain et al., 2017; Liu, 2012). This user point of view is usually expressed in the unstructured form of an opinion, review, news item, disapproval, etc. The rising demand for SA comes from the need to summarise the general direction of user opinions from social media (Haenlein et Kaplan, 2010).
In fact, aggregate data from Sentiment Analysis can represent valuable information for orienting decisions in politics, digital marketing or finance. SA thus arises as a multidisciplinary field joining computational linguistics, information retrieval, semantics, natural language processing and artificial intelligence in general (Aydogan et Akcayol, 2016). Ultimately, SA can be seen as the process of automatically categorising utterances into three different classes: negative, positive or neutral. Generally these sequences of text come from social networks, opinion websites, e-commerce feedback, etc. Twitter is one of the most useful microblogging platforms for Sentiment Analysis and Opinion Mining, since it offers a very good API for downloading tweets and it is very popular amongst different categories of people (Pak et Paroubek, 2010). Traditionally, SA is a text classification problem and relies on two kinds of approaches: a) "lexicon-based", which is usually applied to problems without a training set; this technique generally makes use of a fixed number of keywords to orient the classification process by means of classifiers such as k-Nearest Neighbours (k-NN) or Hidden Markov Models (HMM); b) "machine learning-based", where the extracted features typically consist of Part-of-Speech (POS) tags, n-grams, bi-grams, uni-grams and bags-of-words, and classification can be performed by Naïve Bayes or Support Vector Machines (SVMs) (Singh et al., 2016); a minimal sketch of this second approach is given at the end of this section. Traditional lexicon-based approaches are no longer effective in combination with modern textual Big Data corpora, especially as far as sentiment is concerned. Machine learning approaches, on the other hand, can be supervised or unsupervised (less common) and can be automated over enormous corpora of data, which is a critical requirement for reliable Sentiment Analysis. Deep Learning is a branch of Machine Learning proposed by G.E. Hinton in 2006 which adopts Deep Neural Networks for text classification (Hinton et Salakhutdinov, 2006). Deep Learning enhances traditional neural networks by introducing thousands of neurons, millions of connections, new regularisation techniques (dropout, data augmentation, batch normalisation), new pre-processing (skip-gram, word embeddings, etc.) and different new models, both supervised and unsupervised: Convolutional Neural Networks (CNN) (Krizhevsky et al., 2012), Deep Belief Networks (DBN) (Hinton et al., 2006) and many more. Lately, more and more Deep Learning (DL) models have been applied to SA thanks to their capacity for automatic high-dimensional feature extraction (Vateekul and Koomsubha, 2016). For instance, in Financial Sentiment Analysis (FinTech), Deep Learning has contributed to investigating how to harness different media and financial resources in order to improve the accuracy of stock price forecasting (Day et Lee, 2016). The experimental results show how news sentiment categorisation, by means of Deep Neural Networks, has different effects on investors and their investments. However, SA is a challenging field due to the lack of supervised data and to the inherently subjective nature of sentiments. In this work we tackle one of the biggest problems for modern machine learning-based Sentiment Analysis: the shortage of data sets in specific less common languages (Italian, German, etc.).
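The sketch announced above follows. It is a minimal illustration of the "machine learning-based" baseline described in the text (n-gram features plus an SVM), assuming scikit-learn is available; the tweets and labels are placeholders.

```python
# Bag-of-words / n-gram features fed to a linear SVM: the classical
# machine learning-based sentiment baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

tweets = ["I love this phone", "worst service ever",
          "great match today", "this update is terrible"]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # uni-grams and bi-grams
    LinearSVC(),
)
clf.fit(tweets, labels)
print(clf.predict(["what a terrible phone"]))  # classify an unseen message
```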
In order to address the classification of sentiments we examined some state-of-the-art text classifiers. Many deep learning models have been employed in Sentiment Analysis previously, such as those devised at Stanford University: Recursive Neural Networks (RNNs) (Socher et al., 2011b) and Recursive Neural Tensor Networks (RNTNs) (Socher et al., 2013). Furthermore, Stanford released the Sentiment Treebank, the first corpus with fully labelled parse trees for training RNTNs. RNTNs reach an accuracy ranging from 80% up to 85.4% on the Sentiment Treebank's test set. Although recursive models are very efficient in terms of constructing sentences' sentiment representations, their performance heavily depends on the performance of the textual tree construction. Constructing such a textual tree exhibits a time complexity of at least O(n^2), where n is the length of the text. For this reason, we decided to make use of a Recurrent Convolutional Neural Network (RCNN) (Lai et al., 2015), which achieves a rather competitive accuracy in sentiment classification with respect to recursive models. RCNNs exploit a recurrent structure to capture as much contextual information as possible when learning word representations, which may introduce considerably less noise compared to traditional window-based neural networks. Moreover, a time complexity of O(n) is a big added value of RCNNs. To support Multilingual Sentiment Analysis, a Neural Machine Translation (NMT) model has been employed to translate sentences from one language (i.e. English) into another (i.e. Italian). Basically, an NMT model is a neural network structured in an encoder-decoder pattern, which has emerged as a competitive alternative to traditional Statistical Machine Translation (SMT). The encoder consists of two independent recurrent networks: a "forward" one, which reads the sentence in its natural order, and a "backward" one, which reads the sentence in reverse order. The decoder, instead, is an RNN capable of composing the translated sentence. This sequence-to-sequence model can be trained on a training set made of pairs of sentences: the first expressed in the source language and the second in the target language (Cho et al., 2014).

2. Materials and Methods
The novelty of our Recurrent Convolutional Neural Network, with respect to the original paper, is that we introduced two recurrent models called Long Short-Term Memories (LSTM) instead of simple RNNs. These two LSTMs bi-directionally scan the text. The topology of the RCNN (see Fig. 1) is intentionally designed to capture the context of each word (see the original paper for further details); a code sketch of this topology is given after Fig. 1. The RCNN has been trained on a corpus of 1.6 million tweets composed from various SemEval training sets (Strapparava et Mihalcea, 2007) and divided into positives (800k) and negatives (800k). To feed textual sequences into the neural network we add a pre-trained embedding layer on top (Mikolov et al., 2013). The embedding layer, which has been pre-trained on an English Wikipedia corpus, transforms indexed words into numerical vectors. Embedding vectors are characterised by a semantic relationship amongst them according to a chosen metric, a cosine distance in this case. The size of the embedding vectors is 300.

Figure 1: The structure of the RCNN scanning the sentence "A sunset stroll along the South Bank affords an array of stunning vantage points" (Lai et al., 2015).
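Below is a minimal Keras sketch of an RCNN-style classifier in the spirit of Lai et al. (2015) and of the bi-directional LSTM variant described above: bi-LSTM contextual states are concatenated with the word embeddings, passed through a dense layer and max-pooled over time. The vocabulary size, sequence length and layer widths are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab, seq_len, emb_dim = 50_000, 50, 300

words = layers.Input(shape=(seq_len,), dtype="int32")
emb = layers.Embedding(vocab, emb_dim)(words)            # pre-trained in the paper
ctx = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(emb)
rep = layers.Concatenate()([ctx, emb])                   # [left ctx; word; right ctx]
latent = layers.TimeDistributed(layers.Dense(128, activation="tanh"))(rep)
pooled = layers.GlobalMaxPooling1D()(latent)             # element-wise max over words
probs = layers.Dense(2, activation="softmax")(pooled)    # positive / negative

model = tf.keras.Model(words, probs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```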
During the training stage, the RCNN achieves 84% accuracy on a validation set (20% of the original dataset). On a test set of 380 tweets (provided by SemEval), the model returns around 82% accuracy on positive tweets and 78% on negatives, with approximately 80% overall on a mixed tweet set. We followed the settings recommended in the original paper for the hyper-parameter selection. Finally, we modified the RCNN in order to extract the most significant keywords, i.e. those the model relies on to drive the sentiment classification. Basically, the third layer, the max-pooling layer, relies on an element-wise "max" function: the k-th element of the pooled vector y(3) is the maximum of the k-th elements of the latent word representations y(2)_i, taken over all word positions i = 1, ..., n (Lai et al., 2015). The most "discriminative" words for the sentiment classification are those most frequently selected in the max-pooling layer. Hence, we extracted the indices of the words corresponding to the maximum activation values identified within the third layer (see the sketch at the end of this section). During training we collected 3.2 million keywords, namely two for each tweet: the most important and the second in order of significance. Many of the resulting keywords are duplicated or altered for multiple reasons: they might belong to a common slang or contain typing errors. We therefore removed duplicates and matched the rest against the embedding corpus containing 2.5 million words of the English language. This process yielded 85,000 correct English keywords. Using these keywords as seeds, we scraped a long sequence of English sentences, in many different forms of expression, together with their Italian translations, from a contextual translation website such as Reverso Context (context.reverso.net). This stage led to a training set of 800,000 English-Italian sentence pairs and a validation set of 50,000 pairs. A multi-level sequence-to-sequence model with attention and beam search has been implemented and trained on this set of pairs (see Fig. 2) (Bahdanau et al., 2014; Luong et Manning, 2016).

Figure 2: Multiple-level encoder-decoder (Luong et Manning, 2016).

"Attention-based" models enable the decoder to "focus" on some words rather than others, selectively orienting itself towards a more efficient combination of words in the target-language sentences (Chorowski et al., 2015). "Beam search" is a greedy algorithm maximising the probability of the output words (Britz et al., 2017). The NMT model was trained with an embedding matrix randomly initialised and trained within the same process. The embedding vector size was 512. Both encoder and decoder are made of two LSTM cells with a hidden state size of 512. The training algorithm was Stochastic Gradient Descent (SGD) with batches of size 32, an initial learning rate of 1 and a decay factor of 0.5 starting from the 5th epoch, plus early stopping to reduce overfitting. The beam search width was set to 5. Fig. 3 reports some resulting translations from Italian to English on a test example.

Figure 3: Some translations from Italian to English produced by the neural model we trained.

At the same time, we trained the RCNN model on the most popular Italian sentiment polarity training set of tweets, SentiPolc 2016 (Barbieri et al., 2016), which is made of 7,000 annotated tweets and 300 test tweets. In this case (Italian language) our model reaches 45% validation accuracy and 43% on the test set. For the embedding layer we adopted a language model pre-trained on an Italian Wikipedia embedding corpus.
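The sketch announced above follows: for each feature of the max-pooling layer, find which word position supplied the maximum activation, then count how often each word wins. Here `latent` stands in for the (n_words x n_features) pre-pooling activations of one tweet; random data replaces a real forward pass, and the array sizes are illustrative.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
words = ["a", "sunset", "stroll", "along", "the", "south", "bank"]
latent = rng.normal(size=(len(words), 128))   # y(2): one row per word

winning_positions = latent.argmax(axis=0)     # which word maximises each feature
counts = Counter(words[i] for i in winning_positions)

# The two most frequently selected words are kept as the tweet's keywords.
print(counts.most_common(2))
```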
3. Results
We tested the English RCNN model on the same Italian SentiPolc 2016 test set translated into English by our neural machine translation model. The results highlight a boost in performance: 78% accuracy on the test set versus the 43% of the Italian-trained RCNN model, proving that our strategy of stacking NMT and RCNN models is successful.

4. Conclusion
Despite the imperfections of the Neural Machine Translation, which produces translations with some errors, the RCNN is tolerant of minimal errors and is able to keep accuracy at high levels on a test set. This is because the RCNN was previously trained on a solid and huge English corpus of tweets. This entire process of keyword extraction, specific to the task of sentiment classification from the training set, is a fully novel approach to tackling the problem of the lack of sentiment training sets in other languages. The keywords allow a domain-specific training set to be generated for the Neural Machine Translation. Arguably, we believe this way of stacking NMT and RCNN leads to a cutting-edge multilingual sentiment classifier that can benefit other fields of text classification in future. Future directions might be a closer integration of the NMT and the text classifier and a reduction of translation errors.

References
Qurat Tul Ain, Mubashir Ali, Amna Riaz, Amna Noureen, Muhammad Kamran, Babar Hayat and A. Rehman (2017). Sentiment Analysis Using Deep Learning Techniques: A Review. International Journal of Advanced Computer Science and Applications (IJACSA).
Haenlein, M. and Kaplan, A.M. (2010). An empirical analysis of attitudinal and behavioral reactions toward the abandonment of unprofitable customer relationships. Journal of Relationship Marketing.
Aydogan, E. and Akcayol, M.A. (2016). A comprehensive survey for sentiment analysis tasks using machine learning techniques. Int. Symp. Innov.
Liu, B. (2012). Sentiment analysis and opinion mining (Synthesis Lectures on Human Language Technologies). Morgan & Claypool Publishers.
Pak, A. and Paroubek, P. (2010, May). Twitter as a corpus for sentiment analysis and opinion mining. In LREC (Vol. 10, No. 2010).
Singh, J., Singh, G. and Singh, R. (2016). A review of sentiment analysis techniques for opinionated web text. CSI Transactions on ICT.
Hinton, G.E. and Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.
Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Hinton, G.E., Osindero, S. and Teh, Y.W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554.
Vateekul, P. and Koomsubha, T. (2016, July). A study of sentiment analysis using deep learning techniques on Thai Twitter data. In Computer Science and Software Engineering (JCSSE), 2016 13th International Joint Conference on (pp. 1-6). IEEE.
Day, M. and Lee, C. (2016). Deep Learning for Financial Sentiment Analysis on Finance News Providers. No. 1, pp. 1127-1134.
Socher, R., Pennington, J., Huang, E.H., Ng, A.Y. and Manning, C.D. (2011b). Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP, 151-161.
Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y. and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 1631-1642.
Lai, S., Xu, L., Liu, K. and Zhao, J. (2015).
Recurrent Convolutional Neural Networks for Text Classification. In AAAI.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Strapparava, C. and Mihalcea, R. (2007, June). SemEval-2007 Task 14: Affective Text. In Proceedings of the 4th International Workshop on Semantic Evaluations (pp. 70-74). Association for Computational Linguistics.
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Bahdanau, D., Cho, K. and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Luong, M.T. and Manning, C.D. (2016). Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv preprint arXiv:1604.00788.
Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K. and Bengio, Y. (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems (pp. 577-585).
Britz, D., Goldie, A., Luong, T. and Le, Q. (2017). Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906.
Barbieri, F., Basile, V., Croce, D., Nissim, M., Novielli, N. and Patti, V. (2016, December). Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

A linguistic analysis of the image of immigrants' gender in Spanish newspapers
Juan Martínez Torvisco
Universidad de La Laguna – jtorvisc@ull.edu.es

Abstract 1 (in English)
The phenomenon of immigration has been studied from diverse perspectives; it is important to understand that immigration is a fact associated with times of crisis. The reason for the avalanche of immigrants to the Canary Islands (Spain) is that the islands are a gateway to Europe, and immigrants therefore want to enter from this point. This research arises from the need to determine linguistically the treatment of the phenomenon of immigration in the Spanish press as a result of the arrival of thousands of foreign citizens on the coast of the Canary Islands in 2006 and in 2015. It analyses four Spanish newspapers using the Iramuteq qualitative analysis software: two from the Canary Islands (El Día and Canarias 7) and two Spanish national newspapers (El País and ABC). We also wanted to know how gender is treated in the news coverage. Our hypothesis is that the word male (immigrant) appears more often than woman and that, conversely, woman (refugee) has a higher frequency than male. Results are presented as dendrogram figures.

Abstract 2 (in Spanish)
El fenómeno de la inmigración se ha estudiado desde diversas perspectivas, y es un hecho asociado a tiempos de crisis. El motivo de la avalancha de inmigrantes en las Islas Canarias (España) se debe a que es la puerta de entrada a Europa y, por lo tanto, los inmigrantes quieren entrar desde esta parte de Europa, buscando una mejor vida. Esta investigación surge de la necesidad de determinar lingüísticamente el tratamiento del fenómeno de la inmigración en la prensa española como resultado de la llegada de miles de ciudadanos extranjeros a la costa de las Islas Canarias en 2006 y 2015.
Se analizan cuatro periódicos españoles utilizando el software Iramuteq de análisis cualitativo, dos de ámbito regional de Canarias (El Día y Canarias 7) y dos periódicos de ámbito nacional (El País y ABC). También queríamos saber cómo aparece el género en las noticias de estos diarios. Nuestra hipótesis es que los inmigrantes son mayoritariamente hombres y que, por tanto, el hombre debe aparecer más que la mujer y, al contrario, que la palabra mujer (refugiada) tiene una frecuencia mayor que la del hombre. Los resultados se presentan en dos figuras de dendrograma con el Análisis Jerárquico Descendente (DHC) y reflejan que la mujer aparece en 2015 pero no está presente en las noticias de los diarios en 2006, y que a la inversa ocurre con el hombre.

1. Introduction
The media have become a powerful tool for making conflicts visible, or for showing realities that sometimes remain hidden from the world. Such a fact seems unquestionable. One of the most recent cases is the so-called "immigration crisis" or "refugee crisis"; it began before the dates analysed in the current research but did not achieve real visibility until these citizens reached the coasts of Europe, in this case the Canary Archipelago. As Shier, Engstrom & Graham (2011) suggest, an "immigrant" is a person arriving (immigrating) who has come to live in a country from some other country with the purpose of settling there. Journalistic enterprises face the challenge of attracting new audiences, being aware of the transformation of the sector and the emergence of a new ecosystem. These companies require narrative treatments different from those already known, since these information units synthesise the content and salience of the published news; these elements are decisive in capturing the attention of readers (Jarvis, 2014). Through the selection of headlines it is possible to highlight the role of the new professionals in the newsrooms who are responsible for defining what kind of news is published. As Ramonet (1998) makes evident, a variety of sources guarantees objectivity. However, information is a social good that concerns the whole of society, and this society must establish the moral norms that govern the responsibility of the media (Fraerman, 1998). The phenomenon of immigration has been analysed from diverse perspectives; it is important to understand that immigration is a fact associated with times of crisis. Gender issues, however, are not treated in depth, so one important aim is to know whether journalists take this fact into account. The Canary Islands (Spain) are a gateway to Europe, and this is the reason for the avalanche of immigrants, male and female. The evidence points to immigrant networks wanting to enter Europe at this point. Most migration researchers understand these networks as consisting of a set of "strong ties" based on kinship, friendship, or a shared community of origin that connects migrants and non-migrants (Massey et al. 1998). The migration network approach holds that a multidirectional flow of information and resources forms the basis of every migratory process (Dekker & Engbersen, 2014). The migration phenomenon in Europe has had two phases of maximum activity, in 2006 and 2015, in both of which people were displaced from their place of origin to another destination, including a change of residence.
In the first case, the citizens who entered Europe through the Canary Islands were so-called undocumented immigrants. These people left their countries as a free choice and in their "personal interest", in line with the definition of the International Organization for Migration (IOM). In the second case, the displaced were refugees (also present in 2006, but in a very small percentage) who moved in order to save their lives or preserve their freedom, as the United Nations High Commissioner for Refugees (UNHCR) states. The data analysed in this paper focus on international migration and movement across national borders; consequently, this work addresses the time span separating the two massive arrivals and the evolution that took place in the field of communication in that period. The search terms "immigrant" (for 2006) and "refugee" (for 2015), together with the words "man" and "woman", were used as keywords to search the headlines and full news items of the database and to locate information about immigration and refugees (MUGAK, 2016). The study compares the year 2006 with 2015 and aims to probe the narrative production generated by two Spanish national newspapers (ABC and El País) and two Spanish regional newspapers (Canarias 7 and El Día) in relation to the immigration phenomenon that took place in the Canary Islands in those years.

2. Method
In the present study, carried out on the years 2006 and 2015, the statistical methods are mainly concerned with the non-linguistic information of a text, e.g. term frequencies, inverse frequencies and the position of a keyword in a text. For the data analysis we apply the Iramuteq software (Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires; Ratinaud, 2009; Ratinaud & Marchand, 2012, 2015). For the data processing we apply the Descending Hierarchical Classification (DHC) of the Reinert method (1983, 1986, 1990), defined by lexical classes, where each class represents a subject matter and can be described according to the vocabulary that defines it. The lexical analysis was performed on the most frequent words in the text segments. This analysis overcomes the dichotomy between quantitative and qualitative research, as it allows statistical calculations to be employed on qualitative data, the texts. The vocabulary related to "immigration, immigrant/s, refugee/s, man, woman, etc." was identified and quantified in terms of frequency and, in some cases, in relation to its position within the text.
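As background to the Reinert method just described, the following is a minimal sketch of the lexical table on which the DHC operates: text segments crossed with word forms in a presence/absence matrix, which the algorithm then splits top-down. The actual chi-squared splitting is performed by Iramuteq; the headlines below are invented placeholders.

```python
# Build the binary segments-by-forms matrix underlying Reinert's DHC.
from sklearn.feature_extraction.text import CountVectorizer

segments = ["inmigrantes llegan en patera a Canarias",
            "refugiados piden asilo al gobierno",
            "rescate de una patera cerca de la costa"]

vectorizer = CountVectorizer(binary=True)   # 1 = form present in segment
X = vectorizer.fit_transform(segments)

print(vectorizer.get_feature_names_out())
print(X.toarray())                          # segments x forms matrix
```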
3. Results
Below, the author illustrates the data of the text corpus for the 2006 and 2015 study periods. The corpus used in this analysis was constructed ad hoc. It contains 4,703 newspaper headlines and news items published in Spanish throughout 2006 and 2015. We used four newspapers, two nationwide (El País and ABC) and two of regional scope (Canarias 7 and El Día): 169 news items correspond to El País and 291 to ABC, whereas Canarias 7 published 512. The information for three newspapers was obtained through the MUGAK database (Centre of Studies and Documentation on Immigration, Racism and Xenophobia, Basque Country, Spain, 2016); in the case of El Día (3,731 news items), the information was taken directly from the newspaper's own database.

Table 1: Statistical data from the text corpus of the study
            | Corpus 2006 | Corpus 2015 | Subcorpus 2006 | Subcorpus 2015 (text in web editions)
Occurrences | 426,135     | 30,531      | 147,468        | 6,148
Forms       | 11,993      | 4,792       | 9,747          | 1,487
Hapax       | 5,093       | 2,440       | 4,525          | 827
Texts       | 7           | 11          | 7              | 4

In addition to the characteristics of each text, the number of occurrences detected in the online versions of the newspapers is broad: the lexicometric analysis shows that it accounts for 20% of the occurrences of the entire corpus, while the remaining 60% belongs to the activity developed in the profiles of each newspaper on social networks. The following word cloud gathers, in generic terms, the forms that characterise the selected texts. As can be observed, some of the words displayed in larger characters, and therefore the most relevant, are related to our area of study: in the 2006 period the word immigrant is the most used in the newspapers analysed, followed by Canarias, patera and cayuco (two types of small boat used to reach the Canary Islands). In 2015, by contrast, the terms refugee (refugiado), immigrant (inmigrante), welcome (bienvenida), government (gobierno), rescue (rescate) and the Canary Islands (Canarias) appear. In addition, verb forms such as to shelter, to offer, to ask or to rescue appear which, as Crespo (2008) points out, reveal a certain ideological position that undoubtedly helps to construct a particular image of the migratory phenomenon and its consequences for the receiving countries. From the graphs generated by the Iramuteq software for this text corpus it can be inferred that some specific forms convey a positive or negative value, depending on the verbs used for this purpose and on the profile of the migrant referred to; in our case we display the data of the two periods analysed. These forms appear related to the terminology of our topic and were previously used in the construction of the press headlines.

3.1. Data from the Descending Hierarchical Classification Analysis, 2006
The Iramuteq 0.7 alpha 2 software (Ratinaud, 2014) provides multivariate analysis through DHC and calculates descriptive results for the clusters according to their main vocabulary (Camargo & Justo, 2013). Likewise, by their location in the dendrogram, the resulting clusters of forms reflect the different scenarios at work and how certain social realities intersect: class 1 (social, immigrant aid), class 2 (immigrants and their local rescue), class 3 (social and family), class 4 (institutional). A concept common to two clusters also appears, "immigrant" and "immigration", as can be seen in the figure below (Fig. 1). The word "male" appears 184 times, χ² = 521.9.

Figure 1: DHC dendrogram, 2006

3.2. Data from the 2015 DHC
The data shown in the figure below (Fig. 2) offer an estimated view of the figures of the "refugee" and the "immigrant" and of their evolution in the context of the knowledge acquired by the media as the phenomenon moved forward. We thus find two words, "refugee" and "immigrant", appearing in the journalistic headlines.

Figure 2: DHC dendrogram, 2015

The above dendrogram reflects the different scenarios at work and how certain social realities mix: class 4 (local), class 2 (institutional), class 3 (social) and class 1 (European). The word "woman" appears 20 times with χ² = 28.9.
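The χ² statistics reported above (e.g. "male": 184 occurrences, χ² = 521.9) measure the association between a word and a lexical class. Below is a minimal sketch of how such a value is computed from a 2x2 contingency table crossing word presence with class membership of the text segments; the counts are placeholders, not the study's actual table, and scipy is assumed to be available.

```python
from scipy.stats import chi2_contingency

#                 in class   outside class
table = [[150,   34],    # segments containing the word
         [400, 3500]]    # segments without the word

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2g}, dof = {dof}")
```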
It is worth mentioning the finding of the term "to receive", an element similar to the rest of the verbs that accompany it in the constellation of words in which it is lodged (to propose, to find, to celebrate or to arrange, among many others). It becomes all the more relevant, however, due to its preponderance and strategic position in an environment in which it appears alongside vocabulary with which it keeps linguistic similarities.

4. Conclusion
This object of study evolves in parallel with the population movement, as does a certain informative personalisation through the introduction of adjectives that indicate narrative subjectivity. Our findings suggest a vast body of knowledge covering countless issues related to immigrants and refugees, and to woman and man. It can be said that the word "man" does not appear during 2006 whereas "male" does; in 2015, conversely, "woman" appears instead of "female", and "male" does not appear as it did in 2006. The mechanisation of publishing systems marks a clear dividing line between some texts and others, and the shortage of human and technical resources devoted to this activity causes local media to be less interventionist in drafting their texts than national ones. Finally, future research should consider the role of journalists and the use they make of the gender topic as a way of knowing how the immigration phenomenon behaves for men and women.

References
Crespo, E. (2008). El léxico de la inmigración: atenuación y ofensa verbal en la prensa alicantina. In M. Martínez (ed.), Inmigración, discurso y medios de comunicación (pp. 45-62). Alicante: Instituto Alicantino de Cultura Juan Gil Albert, Diputación Provincial de Alicante.
Dekker, R. & Engbersen, G. (2014). How social media transform migrant networks and facilitate migration. Global Networks, 14(4): 401-418.
Jarvis, J. (2014). Geeks Bearing Gifts. CUNY Journalism Press, New York. Spanish edition: El fin de los medios de comunicación de masas. ¿Cómo serán las noticias del futuro? Barcelona: Ediciones Gestión 2000.
Massey, D.S., Arango, J., Hugo, G., Kouaouci, A., Pellegrino, A. & Taylor, J.E. (1998). Worlds in Motion: Understanding International Migration at the End of the Millennium. New York: Oxford University Press.
Mugak (2016). Centre of Studies and Documentation on Immigration, Racism and Xenophobia, Basque Country, Spain. Available at www.mugak.eu.
Ramonet, I. (2011). La tiranía de la comunicación. Madrid: Debate.
Ratinaud, P. (2009). IRAMUTEQ: Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires [Computer software]. Retrieved 5 March 2013 from http://www.iramuteq.org.
Ratinaud, P. (2014). Visualisation chronologique des analyses ALCESTE : application à Twitter avec l'exemple du hashtag #mariagepourtous. In Actes des 12èmes Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2014) (pp. 553-565). Paris, France.
Ratinaud, P. & Marchand, P. (2012). Application de la méthode ALCESTE à de « gros » corpus et stabilité des « mondes lexicaux » : analyse du « CableGate » avec IRaMuTeQ. In Actes des 11èmes Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2012). Liège.
Ratinaud, P. & Marchand, P. (2015). Des mondes lexicaux aux représentations sociales. Une première approche des thématiques dans les débats à l'Assemblée nationale (1998-2014). Mots. Les langages du politique, 108: 57-77.
Reinert, M. (1983). Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte.
Les cahiers de l'analyse des données, 8(2): 187-198.
Reinert, M. (1986). Un logiciel d'analyse lexicale : ALCESTE. Les cahiers de l'Analyse des Données, 4: 471-484.
Reinert, M. (1990). ALCESTE. Une méthodologie d'analyse des données textuelles et une application : Aurelia de G. de Nerval. Bulletin de méthodologie sociologique, 28: 24-54.
Shier, M.L., Engstrom, S. & Graham, J.R. (2011). International migration and social work: A review of the literature. Journal of Immigrant and Refugee Studies, 9(1): 38-56. http://dx.doi.org/10.1080/15562948.2011.547825.

Lo strano caso delle frequenze zero nei testi legislativi euroistituzionali
Francesco Urzì
combinazioni.lessicali@gmail.com

Abstract
In this paper we intend to verify the actual impact of the so-called universals of translation – i.e. those linguistic features which typically occur in translated rather than original texts – on the legislative texts produced by the European Union. To this aim, a number of text segments have been heuristically selected in order to ascertain whether their statistical absence, or quasi-absence, from European legislation should be traced back to the effects of the above-mentioned universals, and to identify possible EU-internal factors that might explain such conspicuous statistical absences.

Keywords: universals of translation, European Union, Eur-lex, euroitaliano, terminology.

1. Introduction
Recent years have seen a proliferation of studies on comparable corpora aimed at verifying the actual incidence of the so-called universals of translation, i.e. those linguistic traits common to translated texts which cannot be traced back to a systemic influence of the source language (Baker 1993 and 1996; Laviosa 2002). For Italian we have the analyses of Garzone 2005 and Ondelli-Viale 2010. Ondelli-Viale, who rely exclusively on a corpus of journalistic origin, note for example the lower lexical richness and the slightly higher frequency of the basic vocabulary (Vocabolario di base) in translations, as effects of the translation universal of simplification. Studies on the specific traits of "euroitaliano", i.e. the variety of Italian represented by the EU's translations, are less numerous. In this field, Cortelazzo 2013 carried out a quantitative comparison of two fairly large corpora consisting respectively of European directives and the Italian laws transposing them, using among other things lexicometric measures (e.g. type/token ratio and hapax counts) and also taking into consideration "recurrent segments" (which, according to the author, confirm for the EU corpus lexical choices that are "slightly poorer and more homogeneous than the national ones"). With the present contribution we propose to establish, on the basis of heuristically chosen segments, striking cases of zero or near-zero frequency within the second-level domain europa.eu, and more specifically within Eur-lex, which constitutes a subset of it. The purpose of this exercise is to verify:
• whether the statistical irrelevance of certain lexical items in these corpora, which practically consist only of translated texts (recall Umberto Eco's much-quoted remark that "the language of Europe is translation"), does not provide incontrovertible evidence of the effects of the translation universals, in particular simplification and normalisation (or linguistic conservatism);
• whether a process of "self-inhibition" on the part of EU translators against the use of such lexical items can also be discerned.
Non opererebbero in altre parole solo le tendenze generali ascrivibili al processo traduttivo in sé (gli universali della traduzione appunto), ma anche e soprattutto la specifica cultura traduttiva euroistituzionale e lo specifico contesto tecnico-operativo che contraddistingue i servizi di traduzione delle Istituzioni europee. Essendo tale analisi di tipo eminentemente qualitativo, l’utilizzo di un corpus “rumoroso” come Google non inficia la rilevanza dei risultati quantitativi, che tendono unicamente a individuare solo grandi scarti di frequenza, per cui è vero in questo caso che “more data is better data”. 2. La cultura traduttiva delle Istituzioni europee 2.1 Confusione fra ‘termine’e ‘parola’ Un tratto soggiacente della cultura di categoria dei traduttori euroistituzionali è la non percezione della differenza teorica fondamentale fra ‘termine’ e ‘parola’. E’ diversa infatti nel termine e nella parola la natura del riferimento, “che nel termine è specializzata all’interno di una particolare disciplina, mentre nella parola è generale in una varietà di argomenti (Cfr. Scarpa 2008: 52, che cita Sager 1994: 43). Cabré (1999, 33-34), sulle orme di Wüster (1981), menziona due specificità della terminologia. La prima è che “words in dictionaries are described with respect to their use in context; they are considered as elements of discourse. For terminology, on the other hand, terms are of interest on their own account”; la seconda che “lexicology and terminology present their inventories of words or terms (…) in different ways because they start from different viewpoints: terminology starts with the concept and lexicology, with the word”. Cabré (ibidem, 36) nota inoltre che “whereas a terminological inventory usually contains only nouns, in a general language dictionary all grammatical categories are represented”. 798 JADT’ 18 2.2 Referenzialità intertestuale La natura “ciclica” degli atti legislativi dell’Unione - che molto spesso modificano e aggiornano testi legislativi precedenti – che fa sì che le soluzioni traduttive già consacrate dall’ufficialità finiscano per essere trasferite di peso sui nuovi atti, con un fenomeno che si potrebbe definire di common law linguistica, in cui il precedente esercita forza vincolante sul giudizio linguistico autonomo del traduttore. E' in questa fase che il traduttore UE spesso assegna status di ‘termini’ a sintagmi che pur non rispondendo teoricamente a tale definizione (v. 2.1) hanno comunque acquisito il crisma dell'ufficialità per essere stati "validati" in testi legislativi precedentemente pubblicati o anche solo verificati sul piano qualitativo e ritenuti idonei a a essere immessi nel successivo iter legislativo. E’ così che determinate soluzioni traduttive tendono a perpetuarsi all’interno delle “filiera testuale” della materia trattata. Al riguardo va citato anche l’effetto di condizionamento subito dai traduttori più giovani, i quali trovano arduo sostenere scelte linguistiche innovative in contrasto con la "tradizione" dei testi dell'acquis communautaire e, soprattutto, tendono a non discostarsi dall'approccio traduttivo dei colleghi più anziani. 3. 
Il contesto tecnico-operativo dei servizi di traduzione delle Istituzioni europee 3.1 House Rules I servizi di traduzione delle Istituzioni europee hanno a disposizione un “Manuale di convenzioni redazionali” (OPOCE 2011), nella cui pagina di benvenuto si legge che "la sua applicazione [del Manuale] è obbligatoria [grassetto originale] per chiunque intervenga nella preparazione di ogni documento (su carta o elettronico) nelle istituzioni, organi o servizi dell’Unione europea". Non viene fatta nel Manuale alcuna distinzione fra le varie tipologie di testi e le differenti funzioni comunicative che competono a ciascuna di esse. Inoltre molte regole di redazione sono presentate sotto forma di prescrizione assoluta Ad esempio, si prescrive "direttiva" (atto legislativo) con la minuscola (il che non sorprende visto il numero di volte in cui il termine viene utilizzato nei testi UE), nonostante la regola secondo cui (Lesina 2009) "nei casi in cui un nome generalmente usato in senso comune viene utilizzato in senso proprio, con un significato restrittivo o particolare (…) l'iniziale maiuscola può [corsivo mio] essere utile per ragioni di chiarezza, al fine di segnalare al lettore la particolare accezione del nome". Conoscendo la scarsa frequentazione degli italiani (anche di buona cultura) con la terminologia degli atti legislativi comunitari, sorprende che il Manuale di convenzioni redazionali prescriva che "direttiva", anche quando non seguita dagli estremi completi dell'atto legislativo (ad es. direttiva JADT’ 18 799 2049/39/CE), debba essere sempre scritta con la minuscola (dunque anche nei testi a carattere divulgativo destinati alle pagine web). 3.2 Effetto standardizzante delle tecnologie CAT e MT Attualmente i traduttori delle Istituzioni europee beneficiano di una memoria di traduzione comune a tutti i servizi denominata “Euramis” e che provvede alla pretraduzione dei testi sia quando la traduzione è curata dai servizi interni sia quando è esternalizzata ad agenzie di traduzione. Da qualche anno è entrata in servizio anche la traduzione automatica che, su richiesta del traduttore, integra l’output della traduzione assistita. Poiché ad alimentare la memoria Euramis sono esclusivamente segmenti di testo “validati” (ossia già sottoposti al processo interno di controllo di qualità e dunque ritenuti idonei al successivo dibattito politico o alla pubblicazione) i traduttori preferiscono non discostarsi da soluzioni ritenute “sicure” (e la cui adozione, va pure sottolineato, si traduce in un notevole risparmio di tempo). 4. Esempi paradigmatici di "grandi assenti" Ad esemplificazione di quanto sopra passiamo di seguito in rassegna una serie di sintagmi, che presentano casi clamorosi di frequenze zero o prossime allo zero. Nelle relative tabelle il numero di occorrenze preceduto da asterisco indica dei “falsi positivi”. L’asterisco fra parentesi segnala che sono dei falsi positivi almeno una parte delle occorrenze. Le forme prese in considerazione sono una forma aggettivale gerundiva (costruendi), alcuni sintagmi nominali con aggettivo relazionale (indagini poliziesche, attività manutentive, servizi consulenziali), un composto aggettivale determinativo formato da due aggettivi relazionali (politico-programmatico) e due costrutti, rispettivamente con fattorizzazione (dati quali- quantitativi) e zeugma preposizionale (valutare e tener conto [di]). Laddove utile sono state proposte, a titolo comparativo, le statistiche relative alla forme più in uso nel corpus legislativo europeo. 
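The frequency checks reported in the tables of the next section were run against Google and the europa.eu / Eur-lex domains. On a locally downloaded plain-text slice of such a corpus, the same kind of lookup can be sketched as follows; this is a minimal illustration and not the author's actual procedure, and the directory name `eurlex_txt` is hypothetical.

```perl
#!/usr/bin/perl
# Count raw occurrences of candidate phrases in a directory of
# plain-text files (a hypothetical local snapshot of Eur-lex texts).
use strict;
use warnings;

my $dir     = 'eurlex_txt';    # assumed corpus location
my @phrases = ('costruendi', 'indagini poliziesche',
               'attività manutentive', 'servizi consulenziali');

my %count;
opendir(my $dh, $dir) or die "cannot open $dir: $!";
for my $file (grep { /\.txt$/ } readdir $dh) {
    open(my $fh, '<:encoding(UTF-8)', "$dir/$file") or next;
    local $/;                  # slurp the whole file at once
    my $text = lc <$fh>;
    for my $p (@phrases) {
        my @hits = $text =~ /\Q\L$p\E/g;   # literal, case-folded match
        $count{$p} += scalar @hits;
    }
}
closedir $dh;
printf "%-25s %d\n", $_, $count{$_} // 0 for @phrases;
```

Such a raw count deliberately ignores tokenization and lemmatization; for the near-zero frequencies at issue here, that coarseness is harmless.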
4.1 Gerundivo

Token | Google | Europa.eu | Eur-lex
costruendi | 11.800 | *2 | *1

I due unici esempi di europa.eu – 'i costruendi locali' e 'sepolcri esistenti e costruendi', entrambi provenienti dalla banca elettronica TED1, sono riferiti ad aree territoriali italiane. In questo caso sembra aver operato il conservatorismo linguistico, che ha indotto ad evitare una forma non registrata dai dizionari2 e probabilmente ritenuta dai traduttori troppo ardita.

1 TED – Tenders Electronic Daily, ossia il supplemento alla Gazzetta ufficiale dell'Unione europea dedicato agli appalti pubblici europei.
2 Tale forma non registrata ad esempio nel Sabatini Coletti 2008, che però riporta 'istituendo' e 'costituendo', mentre il Grande dizionario Garzanti riporta solo 'costituendo'.

4.2 Aggettivi relazionali semplici e composti
Un analogo comportamento linguistico convenzionale e semplificatorio da parte dei traduttori si osserva nel caso degli aggettivi relazionali. Non tutti i suffissi che formano aggettivi relazionali sono infatti suffissi « dedicati », ossia deputati a codificare esclusivamente il rapporto di relazione; alcuni formano anche aggettivi qualificativi. Tale è ad esempio il suffisso -ivo3, come in attività produttive vs. prefisso produttivo. Spesso basta questa ambivalenza semantica a dissuadere il traduttore dall'utilizzare tali aggettivi in funzione relazionale e a indurlo a preferire soluzioni alternative (ad es. con l'impiego della preposizione 'di' o con locuzioni preposizionali del tipo 'relativo/riguardo a/in materia di'). Nel caso di 'indagini poliziesche', potrebbe forse aver agito anche il proposito di evitare una indesiderata connotazione.

3 Suffisso usato prevalentemente per la formazione di aggettivi qualificativi (Wandruska 2004: 391).

Token | Google | Europa.eu | Eur-lex
indagini di polizia | 164.000 | 793 | 85
indagini poliziesche | 14.700 | (*)2 | 0

Da notare che una delle 2 occorrenze di 'indagini poliziesche' in europa.eu è un comunicato stampa, dunque scritto con ogni probabilità da un giornalista e non da un traduttore.

Token | Google | Europa.eu | Eur-lex
attività di manutenzione | 1.230.000 | 6.730 | 354
attività manutentive | 89.400 | (*)139 | *1

Da osservare che l'unico risultato di Eur-lex per 'attività manutentive' lo si ritrova in un testo italiano, che riportiamo (grassetto mio): « Regolamento del sottosegretario di Stato per l'Edilizia abitativa, la Pianificazione territoriale e l'Ambiente recante definizione di nuove prescrizioni relative alla prevenzione di perdite accidentali di fluidi frigorigeni nell'ambito dell'utilizzo di o dell'esecuzione di attività manutentive su impianti di refrigerazione e, in relazione alle stesse, recante modifica del regolamento prescrizioni impermeabilità impianti di refrigerazione 1997 ». Dei 139 risultati in europa.eu 114 provengono dalla banca TED e, come conferma un controllo a campione eseguito da chi scrive, si riferiscono ad avvisi di appalto riguardanti il territorio italiano.

Token | Google | Europa.eu | Eur-lex
servizi di consulenza | 6.870.000 | 29.300 | 16
servizi consulenziali | 96.600 | (*)21 | 0

Anche in questo caso, dei 21 risultati di europa.eu 3 provengono da TED, altri (anche se non tutti) da regioni italiane. Per quanto riguarda gli aggettivi relazionali composti, del tipo: libero professionale (relativo alla libera professione) oppure marittimo-portuale (relativo ai porti marittimi), si è scelto come caso eclatante di assenza il composto 'politico-programmatico'.
L'assenza è tanto più significativa in quanto non mancano certo nell'Unione europea i documenti funzionalmente analoghi al Documento politico-programmatico italiano, ma è solo a quest'ultimo documento che fanno riferimento le pochissime occorrenze di questo termine riscontrate su europa.eu e Eur-lex. Ancor più che nel caso degli aggettivi relazionali semplici, l'assenza si spiega con il senso di incertezza semantica che le formazioni aggettivali costituite da due aggettivi relazionali possono ingenerare, visto che spesso la loro disambiguazione (stabilire cioè se si tratta di composto coordinativo o determinativo) può avvenire solo in relazione a un dato cotesto.

Token | Google | Europa.eu | Eur-lex
politico-programmatico | 34.900 | 8 | *1

Delle 8 occorrenze di europa.eu, almeno 2 provengono da documenti redatti da curatori italiani. L'unica occorrenza in Eur-lex (dove la versione inglese è policy and planning platform) fa pensare a un brano di testo originariamente redatto in italiano e a una lettura coordinativa, anziché determinativa, del composto in sede di traduzione.

4.3 Fattorizzazioni e costruzioni zeugmatiche
Questi due costrutti, i cui meccanismi sono di difficile reperimento nelle grammatiche, sono ampiamente utilizzati nel linguaggio giuridico e amministrativo italiano per evidenti ragioni di economia linguistica. Si è scelta a tal fine la sequenza 'dati qualitativi e quantitativi', che è un'espressione che ricorre sovente in testi che riportano dati statistici e che viene pertanto utilizzata in una pluralità di settori. Per lo zeugma grammaticale si sono ricercate le occorrenze della sequenza 'valutare e tener conto'4, che è risultata non ben accetta dai traduttori in quanto probabilmente troppo « audace ». Oltretutto costrutti di questo tipo vengono sovente attribuiti a un'influenza della lingua inglese5, motivo questo di ulteriori spinte puristiche da parte dei traduttori.

Token | Google | Europa.eu | Eur-lex
dati qualitativi e quantitativi | 23.100 | 370 | 1
dati quali-quantitativi | 10.400 | *9 | 0

I 9 risultati europa.eu si riferiscono tutti a progetti italiani nati in ambito regionale.

Token | Google | Europa.eu | Eur-lex
valutare e tener conto | 1930 | (*)5 | 0

Dei 5 esempi in europa.eu 2 si devono all'eurodeputata Pasqualina Napolitano (doc. A6-0502/2008) mentre 3 sono di provenienza esterna all'UE. Come nel seguente esempio (grassetto mio): « Art. 5. (Coordinamento per la sicurezza e salute ex decreto legislativo n. 81 del 2008) 1. Ai sensi dell'articolo 90, comma 1-bis, del decreto legislativo n. 81 del 2008, il Tecnico incaricato è obbligato a considerare, valutare e tener conto, al momento delle scelte tecniche per la fase progettuale oggetto dell'incarico, dei principi e delle misure generali di tutela di cui all'articolo 15 del citato decreto legislativo n. 81 del 2008. » (http://bandieconcorsi.comune.trieste.it/contenuti/allegati/schema_contratto_incarico.pdf)

5 Fanfani 2010.

Riferimenti bibliografici
Baker M. (1993), "Corpus Linguistics and Translation Studies – Implications and Applications", in: M. Baker / G. Francis / Tognini Bonelli (a cura di), Text and Technology: In Honour of John Sinclair, Amsterdam-Philadelphia: Benjamins, 233-250.
Baker M. (1996), "Corpus-based Translation Studies: The Challenges that Lie Ahead", in: H. Somers (a cura di), Terminology, LSP and Translation: Studies in Language Engineering in Honour of Juan C. Sager, Amsterdam-Philadelphia: Benjamins, 175-186.
Cabré M. T. (1999), Terminology – Theory, Methods and Applications, Amsterdam-Philadelphia: John Benjamins.
Cortelazzo M. A. (2013), "Leggi italiane e direttive europee a confronto", in: Stefano Ondelli (a cura di), Realizzazioni testuali ibride in contesto europeo. Lingue dell'UE e lingue nazionali a confronto, Trieste: EUT Edizioni Università di Trieste, pp. 57-66.
Fanfani M. (2010), "Anglicismi", in: Simone R., Berruto G., D'Achille P. (a cura di), Enciclopedia dell'italiano, Roma: Istituto della Enciclopedia italiana.
Garzone G. (2005), "Osservazioni sull'assetto del testo italiano tradotto dall'inglese", in: A. Cardinaletti / G. Garzone (a cura di), L'italiano delle traduzioni, Milano: Franco Angeli, 35-58.
Grande Dizionario Garzanti di italiano (2017), De Agostini Scuola s.p.a. – Garzanti linguistica (versione elettronica).
Laviosa S. (2002), Corpus-based Translation Studies. Theory, Findings, Applications, Amsterdam-New York: Rodopi.
Lesina R. (2009), Il Nuovo Manuale di stile, Bologna: Zanichelli.
Manuale interistituzionale di convenzioni redazionali, Ufficio delle pubblicazioni dell'Unione europea (OPOCE), 2011, ISBN 978-92-78-40704-9.
Ondelli S. e Viale M. (2010), "L'assetto dell'italiano delle traduzioni in un corpus giornalistico. Aspetti qualitativi e quantitativi", in: Rivista internazionale di tecnica della traduzione, n. 12/2010, pp. 1-62. ISSN 1722-5906.
Sabatini F. e Coletti V. (2008), Il Sabatini Coletti. Dizionario della lingua italiana, Milano: Rizzoli-Larousse.
Sager J. (1994), Language Engineering and Translation: Consequences of Automation, Amsterdam-Philadelphia: John Benjamins.
Scarpa F. (2008), La traduzione specializzata, seconda edizione, Milano: Hoepli.
Urzì F. (2016), "Il paradosso degli aggettivi di relazione composti derivati da sintagmi N+A. Una risorsa non utilizzata in traduzione", in: R. Bombi / V. Orioles (a cura di), Lingue in contatto – Contact Linguistics, Roma: Bulzoni, 163-178.
Wandruszka U. (2004), "Aggettivi di relazione", in: M. Grossmann / F. Rainer (a cura di), La formazione delle parole in italiano, Tübingen: Niemeyer, 382-394.
Wüster E. (1976), "La théorie générale de la terminologie – un domaine interdisciplinaire impliquant la linguistique, la logique, l'ontologie, l'informatique et les sciences des objets", in: H. Dupuis (a cura di), Essai de définition de la terminologie. Actes du colloque international de terminologie (Québec, Manoir du lac Delage, 5-8 octobre 1975), Québec: Régie de la langue française, pp. 49-57.
Wüster E. (1981), "L'étude scientifique générale de la terminologie, zone frontalière entre la linguistique, la logique, l'ontologie, l'informatique et les sciences des choses", in: Rondeau G. / Felber H. (a cura di), Textes choisis de terminologie – I. Fondements théoriques de la terminologie, Québec: GIRSTERM, 55-114.

Les traductions françaises de The Origin of Species : pistes lexicométriques
Sylvie Vandaele
Université de Montréal – sylvie.vandaele@umontreal.ca

Abstract
In order to develop a sound methodology that would guide the analysis of the translations of important writings, we used Hyperbase to perform a lexicometric analysis of specificities on two corpora based on the various English and translated editions of Charles Darwin's The Origin of Species. We show that the translated corpus is characterized by a notable lexical dispersion compared to the source corpus.
By combining the use of Hyperbase with Logiterm, a text alignment software, we were able to target and analyse contexts of interest. This approach allows for the rapid identification of contexts that are significant both statistically and in terms of the analysis of the translation strategies themselves.

Résumé
Afin de mettre au point une méthode raisonnée d'analyse des traductions d'œuvres conséquentes, nous avons soumis les versions originales de The Origin of Species, de Charles Darwin, ainsi que leurs traductions en français à une analyse lexicométrique des spécificités à l'aide du logiciel Hyperbase. Nous montrons que le corpus de traductions se caractérise par une dispersion lexicale notable, contrairement au corpus anglais source. Les spécificités ont permis, à l'aide du logiciel d'alignement bilingue Logiterm, de cibler l'analyse de contextes bilingues montrant les différences de choix de traduction. Cette approche permet de repérer rapidement des contextes significatifs tant sur le plan statistique que sur le plan de l'analyse des stratégies de traduction.

Keywords: The Origin of Species; specificities; Hyperbase; Logiterm; retranslation; translation choices.

1. Introduction
La retraduction, fréquente en littérature (voir Monti et Schnyder, 2011), est rare en science. The Origin of Species [désormais OS], l'œuvre célèbre de Charles Darwin, fait exception : six éditions de langue anglaise (de 1859 à 1872), six traductions en français dont deux modernes (voir Vandaele et Gendron-Pontbriand [2014] pour les détails). Cependant, l'ampleur de l'œuvre rend l'analyse des traductions difficile. Nous proposons une méthode consistant à isoler les spécificités lexicales des originaux et des traductions, puis à repérer les contextes bilingues alignés correspondants, soumis ensuite à une analyse qualitative. Nous accédons ainsi rapidement aux éléments saillants de l'évolution de l'œuvre et de ses traductions.

2. Corpus et méthodologie
Les deux corpus¹ sont constitués par les chapitres intégraux des six éditions originales anglaises de l'OS (1859-1872) et les six traductions en français, à l'exclusion du paratexte et des notes de bas de page. Les césures en fin de ligne ont été éliminées ; les numéros de page, placés entre deux phrases, et les appels de notes, enlevés. Nous avons eu recours au logiciel Hyperbase v. 10², réalisé par Étienne Brunet (Brunet 2011). L'annotation syntaxique et la lemmatisation ont été réalisées au préalable avec Cordial v. 14 (Synapse) pour le français, et à la volée, pour l'anglais, avec la version de TreeTagger incluse dans Hyperbase. L'alignement des versions originales et traduites a été réalisé avec Logiterm v. 5.7.1 (Terminotix).

3. Les versions originales anglaises de l'OS
Le corpus anglais compte un peu plus d'un million d'occurrences. Darwin a procédé à des ajouts, mais aussi à des retraits³. La 6e édition (1872)⁴ est 28 % plus longue que la 1re (1859), soit 48 000 occurrences de plus. L'analyse de la richesse du vocabulaire montre la proximité lexicale des six éditions originales : on compte 8559 lemmes pour tout le corpus, 6082 pour la 1re édition et 7431 pour la 6e (tableau 1). Les lemmes communs forment la majorité du corpus : pour les textes pris 2 à 2, leur nombre varie de 5597 à 6600, tandis que le nombre des lemmes privatifs fluctue de 136 à 1795. L'examen des formes donne des résultats du même ordre.
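Shared and private lemma counts of this kind reduce to simple set operations on lemmatized word lists. A minimal sketch follows; it assumes one lemma per line, which is not necessarily the export format of Cordial or TreeTagger, and the file names are hypothetical.

```perl
#!/usr/bin/perl
# Compare the lemma inventories of two lemmatized editions:
# shared lemmas vs. lemmas private to each text.
use strict;
use warnings;

sub lemma_set {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path) or die "$path: $!";
    my %set;
    while (<$fh>) { chomp; $set{$_}++ if length }
    return \%set;
}

my ($a, $b) = (lemma_set('OS01.lem'), lemma_set('OS06.lem'));
my $shared = grep { exists $b->{$_} } keys %$a;   # scalar grep = count
printf "lemmas in A: %d, in B: %d\n", scalar keys %$a, scalar keys %$b;
printf "shared: %d, private to A: %d, private to B: %d\n",
       $shared, (keys %$a) - $shared, (keys %$b) - $shared;
```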
L'accroissement chronologique des lemmes montre un léger appauvrissement pour la 2e et la 3e édition, mais un enrichissement notable du vocabulaire dans la 6e édition (tableau 1), essentiellement redevable à un grand nombre d'hapax, souvent des noms d'espèces⁵. Ce résultat reflète le fait que Darwin apporte de plus en plus de données à l'appui de sa théorie.

Tableau 1 – Corpus des éditions originales de l'OS
Année de publication et édition | Code | Nombre d'occurrences⁶ | Effectif des lemmes N (écarts réduits) | Accroissement chronologique, écarts réduits (calculés sur les lemmes)
1859, 1re éd. | OS01 | 170 634 | 6082 (2,67) | 4,5
1860, 2e éd. | OS02 | 171 665 | 6210 (4,21) | -6,5
1861, 3e éd. | OS03 | 181 974 | 6019 (0,34) | -4,9
1866, 4e éd. | OS04 | 200 608 | 6914 (9,59) | 1,8
1869, 5e éd. | OS05 | 199 963 | 7072 (11,67) | 0,3
1872, 6e éd. | OS06 | 218 870 | 7431 (14,06) | 16,5
Total | | 1 143 714 | 8559 |

L'analyse arborée (selon Luong, 1994 ; cité dans Brunet 2011) met en évidence la faible distance séparant les textes, ce qui est attendu (figure 1), mais permet de situer les différentes éditions entre elles : qu'il s'agisse des fréquences (1A) ou des présences (1B)⁷, on note une grande proximité entre les 1re et 2e éditions, ce qui est corroboré dans les préfaces. La 5e et la 6e sont proches, cette dernière se distinguant par les nombreux hapax. La 3e et la 4e sont intermédiaires. Nombre de lemmes privatifs passent sous la barre des 5 % ; les spécificités sont peu nombreuses, ce qui est attendu, mais révélateur. Les spécificités positives ne repèrent aucun mot plein pour les quatre premières éditions, mais font apparaître le pronom I et le déterminant my. C'est à la 5e édition que l'on note l'apparition de deux spécificités de mots pleins statistiquement significatives : survival et fittest, avec un écart réduit de 4,6 et de 4, respectivement, pour les formes, ou survival (substantif, 4,6) et fit (adjectif, 4) pour les lemmes. Dans la 6e édition, apparaissent Mr (7,1), through (6,1), cambrian (5,8), orchids (4,3), developed (4,9) et development (4,2), lower (4,2), beneficial (4,1) et spontaneous (4,1). L'analyse des lemmes fait, en plus des précédents, remonter survival (substantif, 4,6), spine (substantif, 5,3), increased (adjectif, 4,2), movement (substantif, 4,1), fit (adjectif, 4,1), beneficial (adjectif, 4,1) et spontaneous (adjectif, 4,1).

¹ Les textes anglais viennent du site Darwin Online (John van Wyhe, dir. 2002-. The Complete Work of Charles Darwin Online – http://darwin-online.org.uk/). Les textes français ont été obtenus par Gallica ou Google livres, ou ont été numérisés par nous.
² Téléchargeable à .
³ Voir le variorum en ligne (van Wyhe, 2002- ; <http://darwin-online.org.uk/Variorum/1859/1859-1-dns.html>).
⁴ Celle de 1876, dite 6b, est quasiment identique à celle de 1872. C'est l'édition de 1872 qui a été traduite par Edmond Barbier (1876), raison pour laquelle nous l'avons choisie dans notre analyse.
⁵ Le nombre d'hapax augmente considérablement dans la 6e édition : respectivement, 45, 40, 61, 133, 134 et 622 occurrences (lemmes) de la 1re à la 6e édition (écart réduit de 33,5 pour la 6e édition).
⁶ Les valeurs reportées dans les tableaux sont fournies par Hyperbase. Il y a de légères différences avec des valeurs publiées antérieurement, dues à la préparation des textes et aux logiciels utilisés pour le décompte.
⁷ Respectivement selon Labbé et Jaccard, cités dans Brunet 2011.
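The "écarts réduits" quoted above and in Table 1 are specificity scores. As a rough illustration only, the usual binomial approximation of such a score can be sketched as follows; Hyperbase's exact computation may rest on the hypergeometric model instead, and the hit counts in the example call are hypothetical (only the part size n and corpus size N are taken from Table 1).

```perl
#!/usr/bin/perl
# Reduced deviation (ecart reduit) of a lemma in a subcorpus:
#   z = (k - n*p) / sqrt(n*p*(1-p)),
# with p = F/N the lemma's relative frequency in the whole corpus.
# Binomial approximation only.
use strict;
use warnings;

sub ecart_reduit {
    my ($k, $n, $F, $N) = @_;   # k: hits in part, n: part size,
                                # F: hits in whole corpus, N: corpus size
    my $p = $F / $N;
    return ($k - $n * $p) / sqrt($n * $p * (1 - $p));
}

# Hypothetical counts for a lemma in the 6th edition (OS06):
printf "z = %.2f\n", ecart_reduit(40, 218_870, 120, 1_143_714);
```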
[Figure 1 – Analyse arborée sur les lemmes : A – sur les fréquences ; B – sur les présences]

Le regroupement des spécificités en catégories reflétant le contenu sémantique (établi à partir des contextes) est instructif : concepts théoriques (fittest, fit, survival, through [expression de la causation]), données et citations (cambrian, orchids, spine, Mr), vision dynamique du vivant de Darwin (develop, development, increased, movement, spontaneous), jugements de valeur (beneficial, lower [certaines occurrences]). Ainsi, les spécificités, même rares, se démarquent par leur saillance : elles captent l'introduction du fameux concept de Spencer (1864), survival of the fittest, et permettent de présumer une affirmation de la pensée de Darwin – à savoir sa vision profondément dynamique de la nature. Enfin, les spécificités négatives signalent que les fréquences relatives du déterminant possessif my et du pronom I diminuent avec le temps, ce qui traduit l'ajout de passages non argumentatifs contenant des données, et ce qui corrobore l'augmentation des hapax, constitués majoritairement par des noms d'espèces.

4. Analyse du corpus français
Le corpus français comprend un peu plus de deux millions d'occurrences (tableau 2) : trois traductions d'époque (Clémence Royer [1862, 3e éd.], Jean-Jacques Moulinié [1873, 5e éd.], Edmond Barbier [1876, 6e éd.]) ; celle de Daniel Becquemont (2008), qui part de la traduction de Barbier et la modifie pour remonter à la 1re édition ; deux modernes, par Augustin Berra (2009, 6e éd.) et Thierry Hoquet (2013, 1re éd.) (voir Vandaele et Gendron-Pontbriand [2014] pour les références bibliographiques). Les textes comptent de 181 785 à 248 863 occurrences, soit un écart de 67 078 occurrences. Les différences de coefficients de foisonnement⁸ révèlent déjà que les traducteurs ont travaillé avec des stratégies de traduction distinctes. L'homogénéité lexicale diminue par rapport aux originaux. La contribution de chacun des textes à la richesse lexicale est beaucoup plus importante en français qu'en anglais : les lemmes partagés dans les textes pris deux à deux se situent entre 4498 (13Ho et 62Ro) et 5649 (73Mo et 76Ba) pour un total de 11 712 lemmes (soit 3153 lemmes de plus que dans le corpus anglais). Chacun des textes français contribue pour un pourcentage moindre au vocabulaire commun (figure 2A). Les effectifs des lemmes privatifs sont plus importants (de 772 à 3000) et fluctuent d'un traducteur à l'autre (figure 2B). Sont mises en évidence les différences entre Becquemont (08Bq) et Hoquet (13Ho) pour la 1re édition, et entre Barbier (76Ba) et Berra (09Be) pour la 6e édition, mais aussi la proximité (attendue) entre Barbier et Becquemont.

Tableau 2 – Traductions françaises de l'OS – * d'après la traduction de Barbier de la 6e édition
Année de publication | Traduit par | Code | Nombre d'occurrences | Coefficient de foisonnement | Édition originale anglaise | Effectif des lemmes N (écart réduit)
1862 | C. Royer | 62Ro | 207 633 | 14 % | 1861 (3e) | 6357 (-6,7)
1873 | J.-J. Moulinié | 73Mo | 211 691 | 6 % | 1869 (5e) | 7036 (0,8)
1876 | E. Barbier | 76Ba | 241 170 | 10 % | 1872 (6ea) | 6971 (-3,8)
2008 | D. Becquemont | 08Bq | 186 440 | 9 % | 1859 (1e)* | 6260 (-4,8)
2009 | A. Berra | 09Be | 248 863 | 14 % | 1876 (6eb) | 7804 (5,0)
2013 | T. Hoquet | 13Ho | 181 785 | 7 % | 1859 (1e) | 6579 (-0,2)
Total | | | 1 277 582 | | | 11 712

⁸ Le coefficient de foisonnement est l'accroissement du nombre d'occurrences observé lorsque l'on traduit de l'anglais au français. Il est généralement admis, en traduction dite « pragmatique » (par opposition à la traduction littéraire), que le taux de foisonnement se situe généralement entre 10 % et 15 %, une des causes étant que le français recourt à plus de mots grammaticaux que l'anglais. Une forte concision peut diminuer ce taux.

Les distances lexicales intertextuelles (figure 3) confirment la proximité de Becquemont et de Barbier, mais révèlent deux faits inattendus : 1) Royer (62Ro) se situe sur la même branche que Berra et Hoquet ; 2) Moulinié (73Mo) se place entre Becquemont et Barbier lorsque l'on passe des fréquences aux présences.

[Figure 2 – A – Contributions respectives de chacun des textes aux parties communes des corpus anglais et français (lemmes)⁹ – B – Richesse lexicale (lemmes). Le pointillé indique le seuil de 5 %.]

Diverses hypothèses explicatives doivent être explorées, mais il n'est en tout cas plus permis de douter que les manières de traduire sont décisives au point de brouiller, sur le plan lexical, la chronologie des versions originales, et que cette approche permet de mettre ces particularités en évidence.

[Figure 3 – Analyse arborée (méthode Luong) sur les lemmes : A – calculée sur les fréquences (Labbé) ; B – calculée sur les présences (Jaccard)]

Nous nous sommes ensuite concentrée sur les spécificités positives des lemmes des mots pleins et, parmi elles, avons sélectionné les unités dont la signification paraissait la plus caractéristique du propos central de l'OS : ainsi, sélection, préservation, pouvoir… ont été retenus, mais pas aujourd'hui, grandement, inclure…

⁹ Le schéma a été obtenu à partir des effectifs des lemmes pour chacun des textes, ramenés en pourcentage du nombre total de lemmes par corpus (représentation « radar » fournie par Excel v.16). Les effectifs des lemmes des textes traduits ont été disposés en regard des textes anglais (ceux de OS1 et OS6 ont donc été dupliqués) ; de plus, la forme asymétrique du tracé pour le français rend compte de l'absence de traduction d'OS2 et d'OS4. À cause de ces particularités, l'aire délimitée par les traits n'est pas représentative des valeurs totales pour chacun des corpus, mais le schéma reste visuellement parlant.

[Figure 4 – Analyse factorielle de correspondances : sélection de lemmes parmi les spécificités]

La quarantaine de lemmes ainsi obtenus a permis de générer un graphe (figure 4) représentant le résultat d'une analyse de correspondances (menée selon le programme de Lebart, inclus dans Hyperbase, sur les données pondérées). Le graphe montre que les modernes (Berra, Hoquet) s'opposent aux anciens (Barbier, Moulinié) ou quasi-ancien (Becquemont), Royer se situant à part. La consultation des contextes ciblés par cette méthode dans les corpus alignés par Logiterm permet d'analyser qualitativement les choix de traduction.
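Before turning to the aligned contexts, note that the expansion coefficients of Table 2 can be re-derived from the token counts of Tables 1 and 2. A small verification sketch follows; the pairing of 09Be with the OS06 count follows note 4 above (the 1876 text being quasi-identical to that of 1872).

```perl
#!/usr/bin/perl
# Re-derive the "coefficients de foisonnement" of Table 2 from the
# token counts of Table 1 (source editions) and Table 2 (translations).
use strict;
use warnings;

my %src = (OS01 => 170_634, OS03 => 181_974,
           OS05 => 199_963, OS06 => 218_870);
my @tr  = (['62Ro', 'OS03', 207_633], ['73Mo', 'OS05', 211_691],
           ['76Ba', 'OS06', 241_170], ['08Bq', 'OS01', 186_440],
           ['09Be', 'OS06', 248_863], ['13Ho', 'OS01', 181_785]);

for my $t (@tr) {
    my ($code, $os, $n) = @$t;
    printf "%s: %+.0f %%\n", $code, 100 * ($n / $src{$os} - 1);
}
# Prints +14, +6, +10, +9, +14, +7 — matching Table 2 after rounding.
```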
L’exemple le plus frappant est le choix de élection et de électif par Royer, qui s’oppose au choix de sélection par les autres traducteurs (tab. 3). 812 JADT’ 18 Tableau 3 – Traductions alignées d’une phrase commune à toutes les éditions anglaises (Introduction) Darwin and we shall then see how Natural Selection almost inevitably causes much Extinction of the less improved forms of life… 62Ro Nous verrons comment cette élection naturelle cause presque inévitablement de fréquentes extinctions d’espèces parmi les formes de vie moins parfaites… 73Mo Nous y verrons comment la sélection naturelle détermine presque inévitablement l'extinction des formes moins perfectionnées… 76Ba Nous verrons alors que la sélection naturelle cause, presque inévitablement, une extinction considérable des formes moins bien organisées… 08Bq Nous verrons alors que la sélection naturelle cause presque inévitablement une extinction considérable des formes moins bien organisées 09Be nous verrons alors de quelle façon la sélection naturelle cause presque inévitablement une forte extinction des formes de vie moins améliorées… 13Ho Et nous verrons comment la Sélection Naturelle cause presque inévitablement une grande Extinction des formes de vie moins améliorées… 5. Conclusion Le ciblage de contextes, repérés au moyen d’une analyse lexicométrique préalable, dans des corpus alignés conséquents est une stratégie de choix. Elle permet d’arriver assez vite à des observations statistiquement significatives et de pointer d’emblée sur des éléments majeurs sans hypothèse préalable. Comme le souligne Brunet (2002), l’intérêt de travailler sur des traductions est que certains paramètres sont fixés. L’inconvénient actuel de l’entreprise tient à la faible ergonomie du processus, c’est-à-dire aux nombres de clics liés au passage d’un logiciel à l’autre. Restent les nombreuses modifications sous le seuil de 5 %, qui peuvent recéler, malgré l’absence de signification statistique, des éléments cruciaux en matière de choix de traduction. D’autres stratégies de filtrage sont alors nécessaires pour leur étude. Remerciements Nous remercions vivement Étienne Brunet, Damon Mayaffre et Laurent Vanni pour leurs conseils sur l’utilisation d’Hyperbase. Il va de soi que les éventuelles erreurs sont nôtres. Merci aussi à Marie-Joëlle StratfordDesjardins, étudiante auxiliaire de recherche, pour son aide à la préparation du corpus. La présente recherche a bénéficié d’une subvention de recherche du Conseil de recherche en sciences humaines du Canada (2015-2018). JADT’ 18 813 Références Brunet É. (2002). Un texte sacré peut-il changer ? Variations sur l’Evangile. In Cook J., dir. Bible and Computer, Leiden / Boston : Brill, pp. 79-98. Brunet É. (2011). Hyperbase – Manuel de référence. Hyperbase pour Windows, version 8.0 et 9.0. Luong X. (1994). L’analyse arborée des données textuelles : mode d’emploi. Travaux du cercle linguistique de Nice, 16 : 27-42. Monti E. et Schnyder, P., dir. (2011). Autour de la retraduction : Perspectives littéraires européennes. Coll. Universités, Paris : Orizons, Spencer H. (1864). The Principles of biology. Vol. 1, New York: Appleton. Vandaele S. et Gendron-Pontbriand E.-M. (2014). Des « vilaines infidèles » aux grands classiques : traduction et retraduction de l’œuvre de Charles Darwin. In: Pinilla J. et Lépinette B., dir, Traducción y difusión de la ciencia y de la técnica en España en los siglos XVIII y XIX,Valence : Universitat de València, pp. 249-276. 
814 JADT’ 18 Circuits courts en agriculture : utilisation de la textométrie dans le traitement d’une enquête sur 2 marchés Pierre Wavresky1, Matthieu Duboys de Labarre2, Jean-Loup Lecoeur3 2 1Umr Cesaer Inra-Agrosup Dijon – pierre.wavresky@inra.fr Umr Cesaer Inra-Agrosup Dijon – matthieu.duboys-de-labarre@inra.fr 3Umr Cesaer Inra-Agrosup Dijon – yajintei@hotmail.fr Abstract Semi-structured interviews about short food supply chains have been done with producers and consumers on two different markets. Our work gives an insight to the themes common to producers and consumers that are not attributable to the interviews guides. It also underlines the advantages of a textometric approach and the precautions necessary to interpret such a corpus. Résumé Des entretiens semi-directifs sur le thème des circuits courts alimentaires ont été menés sur deux marchés, auprès de producteurs et des consommateurs. Notre travail s'intéresse notamment aux thématiques communes aux producteurs et consommateurs et qui ne soient pas imputables aux grilles d’entretiens. Il souligne par ailleurs les apports d'une approche textométrique, ainsi que les précautions d'interprétation sur un tel corpus. Keywords: short food supply chain, semi-structured interviews, textometry 1. Introduction et méthodologie Les circuits courts alimentaires interviennent de plus en plus dans le débat social. Ils sont devenus l’emblème d’une opposition au « modèle conventionnel ». Ils s’inscrivent également dans des enjeux de politique publique (définition légale en 2009 avec le plan Barnier1), et scientifique. Ils comprennent des formes innovantes comme les AMAP, mais aussi des formes plus anciennes comme les marchés ou la vente à la ferme. La sociologie a abordé les circuits courts sous des angles variés : la consommation engagée (Dubuisson-Quellier, 2009), la sociologie de 1 Circuit de commercialisation comprenant au plus un intermédiaire entre le producteur et le consommateur. JADT’ 18 815 l’innovation (Chiffoleau et Prévost, 2012), d’autres ont approché la question en décalant le point de vue vers le développement local (Traversac, 2010) ou au travers de la notion de proximité (Mundler et Rouchier, 2016). Les travaux de sociologie insistent sur l’intérêt économique des circuits courts, mais aussi sur leur capacité à recréer du lien social (Prigent-Simonin et HéraultFournier, 2014). De nombreux dispositifs s’appuyant sur les circuits courts de commercialisation se caractérisent par un rapport direct entre consommateurs et producteurs. Ce lien a été l’objet de différentes analyses et interprétations dans la littérature. Il est perçu comme un déplacement de l’espace de référence des agriculteurs vers celui des consommateurs (Dufour et Lanciano, 2012). Il a aussi été analysé comme le lieu de rencontre autour d’attentes plurielles (Chiffoleau et Prévost, 2012). Plus généralement, il s’ancrerait dans des logiques communes de re-localisation des pratiques agricoles et alimentaires (Duboys de Labarre, 2005). C’est ce lien que nous allons analyser au travers d’un dispositif textométrique. Nous mettrons en lumière les intérêts et les éventuelles limites interprétatives liés au type de corpus (faible nombre d’entretiens semi-directifs). Cela nous éclairera également sur les thématiques abordées et leur spécificité. 
Dans le cadre du projet européen H2020 « Strength2food » 2 , pour la France, nous avons interrogé 23 personnes3 (12 vendeurs-producteurs et 11 consommateurs) sur deux marchés (en milieu rural et en milieu urbain) par entretien semidirectifs. Nos deux sous-populations relèvent d’initiatives différentes dans leur structuration et leur ancienneté4. Dans les deux cas, les parties-prenantes restent attachées à la consommation/production bio et sont assez engagés. Ce corpus n’est donc pas représentatif (ni des consommateurs ni des producteurs) et nous considérons ce travail comme exploratoire. Le corpus est analysé grâce au logiciel de textométrie Iramuteq5, les thèmes communs ou spécifiques des producteurs et consommateurs seront recherchés essentiellement par classification descendante hiérarchique (Reinert, 1983) et par analyse de spécificité. Parmi les variables caractérisant les textes, a été incluse une variable à 4 modalités : consommateur-rural, https://www.strength2food.eu/. Ce projet a été financé par le programme de recherche et d'innovation Horizon 2020 de l'Union européenne dans le cadre de la convention de subvention n° 678024 3 Ces entretiens, structurées autour de 6 thèmes, sont semi-directifs et visent à favoriser l’expression des acteurs. Ils sont retranscrits mot à mot et incluent des annotations de l’intervieweur. 4 Celle en milieu urbain est un marché de plein vent traditionnel, celle en milieu rural est un marché de producteurs innovant. 5 http://www.iramuteq.org/ (Pierre Ratinaud) 2 816 JADT’ 18 consommateur-urbain, producteur-rural, producteur-urbain6. Comme la longueur des interviews est très variable (de 102 à 560 segments de texte) et le nombre d’interviewés assez faible (23), les statistiques relatives à cette variable peuvent être essentiellement imputables à une interview, il est donc d’autant plus nécessaire de revenir à l’interview. De plus il peut arriver que le lien, en termes de Khi², entre une des quatre catégories (ou une interview) et une thématique (classe de la classification) soit faible. Or quelques segments de textes énoncés par cette catégorie sous-représentée sont parfois très liés à cette thématique, et dire que le lien est faible serait erroné. D’où l’analyse, aidée par une représentation graphique, des segments de textes les plus caractéristiques d’une classe, pour chaque catégorie étudiée. Deux annotations de l’intervieweur, caractérisant la parole de l’interviewé, ont été conservées au sein du corpus, et seront donc analysées comme les autres mots : « rire » (codé « _rire ») et « blanc », signifiant un délai avant la réponse ou en son sein (codé « _blanc). Le but étant de voir si des hésitations (« _blanc ») sont cooccurrentes d’autres lemmes. 2. Analyse statistique du corpus réponse Les 5 lemmes les plus courants sont : aller, voir, bio, gens, marché. Ce qui ressemble à un programme : aller au marché, donc favoriser un mode de circuit court, pour acheter ou vendre des produits bio et pour voir des gens, donc avec un aspect relationnel important. Il est probable que les lemmes bio, aller et marché soient liés au contexte d’enquête (nature des enquêtés pour bio et nature des dispositifs pour aller et marché). Enfin, le caractère assez homogène de l’importance quantitative de ces 5 lemmes peut être interprété comme le reflet d’un horizon commun partagé par nos informateurs et ce en dépit de de leur groupe d’appartenance (producteur ou consommateur) ou du dispositif étudié. 2.1. 
Classification descendante hiérarchique : 12 types de discours Une classification descendante hiérarchique7 (Reinert 1983) a permis de dégager 12 types de discours. Nous nous focaliserons sur 2 ensembles de classes8, selon qu'elles sont plutôt spécifiques ou peu spécifiques d'une catégorie (producteur ou consommateur). Producteur-urbain signifiant producteur vendant sur le marché de la ville moyenne, en opposition avec producteur-rural qui vend sur le marché du village. 7 5264 segments de texte sur les 6231, soit 84%, ont été retenus par la classification. 8 Nous écartons la classe 3 (12,5%) car elle est peu interprétable (lemmes polysémiques : chose, gens, monde...). 6 JADT’ 18 817 Graphique 1 : les 12 classes de discours Le premier ensemble regroupe les classes 1, 2, 6, 9 et 11 qui sont caractéristiques d’un sous-groupe. Les classes 1 et 11 concernent surtout les producteurs, par contre les classes 2, 6 et 9 émanent principalement de consommateurs. Dans la classe 1 (14.4%) il est question des aides, de projet, d’installation, de reprise (d’exploitation), d’investissement. Il y a des critiques sur la PAC (notamment sur le fait que ce soit compliqué), mais pas seulement : « Bah comme on a de la surface un peu ouais ça commence c’est super compliqué la PAC je sais pas si tu veux qu’on en parle _rire même nous on a du mal » (Lydie, productrice rurale). La classe 11 (11,7%) est orientée autour des produits laitiers (lait, chèvre, fromage, yaourt, vache, faisselle, litre, cabri…), avec un aspect monétaire (euro, prix). Dans la classe 6 (8.1%) c’est de nourriture dont il est question, notamment le fait de manger des fruits et légumes de saison (manger, tomate, fraise, saison, pas en hiver). C’est un discours de consommateurs, surtout urbains. Melissa et Jennifer parlent surtout des courses qu’elles font, où elles les font (sur le marché de la ville moyenne essentiellement, où elles ont été interrogées). Toutefois l’autre thème (manger des fruits de saison) est celui qui est le plus typique de cette classe. Dans la classe 9 (3.3%) il est question de ville (vivre en ville/à la campagne) et de distance, aussi bien en termes de proximité que de nombre d’intermédiaires (distance, kilomètre, circuit_court, intermédiaire). C’est plutôt une classe de consommateurs. Enfin dans la classe 2 (12.4%) les 4 premiers lemmes forment une phrase : acheter produit bio producteur. Revendeur et local sont présents aussi. Il est donc question du comportement d’achat, mais pas des produits qu’on achète, comme dans la classe 6, plutôt de certaines de leurs propriétés (bio) et de la qualité du vendeur (producteur). Les classes 1, 2 et 6 renvoient directement à des thèmes abordés dans les guides d’entretiens respectifs des groupes et la classe 11 à une catégorie de produit agricole 818 JADT’ 18 spécifique qui était surreprésentée dans l’échantillon des producteurs transformateurs (5 informateurs sur 12). Ces classes parlent des pratiques liées aux groupes (professionnelles, d’achat et de consommation alimentaire) et permettent de les caractériser. Nous noterons que les classes 1, 2 et 6 renvoient à la notion de maîtrise ou de contrôle. Pour la classe 1 parce que les aides PAC sont parfois perçues comme extérieures et complexes. Pour les classes 2 et 6 au contraire parce qu’elles traduisent l’idée que le consommateur maîtrise sa pratique (choix de se fournir directement auprès d’un producteur et en aliments bio, locaux et de saison). Le second ensemble regroupe les classes 4, 5, 7, 8, 10 et 12. 
Elles sont peu spécifiques d’une catégorie. La Classe 10 (7.3%) est celle du respect des animaux et plus généralement du respect du vivant. On peut remarquer que le lemme _rire y est particulièrement rare : dans cette classe, le respect des animaux est abordé comme une question sérieuse. « C’est un animal pour l’élevage donc je le mange s’il a été élevé dans le respect des lois de la nature et de l’univers s’il a été élevé d’une manière respectueuse par rapport à l’environnement » (Théophile, producteur urbain) [Les mots en gras sont spécifiques de la classe]. Il n’y a pas de différence marquée rural/urbain ou producteur/consommateur. Graphique 2 : Score des segments de texte (classe 10) Mais si on considère le nombre de segments de texte caractéristiques (graphique 2), on voit que Jacques n’en parle pas beaucoup mais il en a énoncé certains très caractéristiques. Autrement dit, il parle peu mais intensément du bien-être animal : « Et nous nos animaux on est en bio on fait attention au bien-être animal on fait le choix de garder tous les petits pour pas qu’ils partent dans des élevages industriels intensifs et la suite logique» (Jacques, producteur rural [score=925]9). La classe 7 (4,5%) renvoie à deux univers de sens différents autour du lemme vie : d’une part la notion de trajectoire de vie en relation avec la parentèle (famille, parent [d’origine agricole], grand_parent, enfant), et d’autre part à une forme de souci de soi (mode de vie sain, santé reliée à nourriture et alimentation). « En amont dans un mode de vie qui devrait te permettre d’avoir une vie plus 9 La somme des Khi² (mesurant le lien entre chaque lemme et la classe) donne le score du segment de texte. JADT’ 18 819 harmonieuse plus saine plus en meilleure santé physique psychique mentale sociale parce_que tu crées du lien aussi enfin y a une… ça va dans une même mouvance » (Claire, consommatrice rurale). La classe 8 (5.5%) concerne les céréales (farine, pain, gluten, variété, vieux, boulanger), notamment les vieilles variétés. La classe 5 (6.3%) est celle du doute (on se pose des questions, il y a des _blanc : ces 3 lemmes sont entre 8 et 9 fois plus nombreux qu’attendu). « Se poser des questions » et penser évoque aussi une prise de conscience de problèmes. Mais c’est également « poser des questions » aux vendeurs sur leur production. La classe 4 (5.2%) est celle des relations et de leur importance. « Eh ben les relations humaines on côtoie une diversité de population quoi des gens et en fait on se parle c’est agréable _rire » (Christine, consommatrice rurale). Enfin la classe 12 (8.8%) est celle du temps (temps passé [heure], horaire précis [h]). Les jours de la semaine sont cités, les moments de la journée aussi, avec matinée, nuit, café, boire… Les 2 individus les plus impliqués dans cette classe sont François et Thérèse (éleveurs urbains). Il n’y a pas de spécificité forte d’une des 4 catégories car s’il y a surreprésentation de certains producteurs dans cette classe, d’autres parlent très peu de cet aspect (David et Théophile). Or les deux producteurs qui sont principalement impliqués dans cette classe se sont installés dans un cadre familial (ils ont repris l’exploitation de leurs parents). Alors que ceux qui en parlent le moins sont des hors cadres familiaux. La littérature (Dufour et Lanciano, 2012) souligne que les contraintes temporelles sont plus importantes dans le cadre d’une production en circuits courts. 
Cette dernière serait vécue différemment en fonction de la trajectoire des agriculteurs (cadres ou hors cadres familiaux). Le caractère commun de ces classes nous permet de proposer quelques pistes de réflexions concernant les liens qui se nouent entre producteurs et consommateurs. La classe 5 (celle du doute) renvoie partiellement à une forme de réflexivité partagée par ces deux groupes. Le respect des animaux et de la nature (classe 10)10 et l’aspiration à un mode de vie, un souci de soi (classe 7) dessinent un lien entre préoccupations personnelles et engagements globaux (respect des animaux et cause environnementale) (Pleyers, 2011). Enfin, la classe 4 souligne l’horizon commun que constitue l’importance du lien social attaché aux circuits courts. 10 Cette classe commune émerge dans le discours alors qu’elle n’est pas un thème des deux guides d’entretiens. 820 JADT’ 18 2.2. Pronoms personnels et spécificités L’analyse des spécificités des 4 catégories d’interviewés, toutes classes confondues, a mis notamment en évidence un emploi très différencié des pronoms personnels. Les consommateurs ruraux citent souvent deux des producteurs par leur prénom. Le lemme discuter est également présent. Donc ils parlent de gens avec lesquels ils sont en lien fort. Les consommateurs urbains citent beaucoup je et j, ainsi que vous : « Oui et puis […] si vous voulez vos salades au bout de 3 ou 4 jours en grande_surface elles ont pas été vendues elles ont quand même pas la même tête que celles que j’achète qui ont été cueillies la veille hein » (Mélissa, consommatrice urbaine). Il est donc question de ce que l’interviewé fait (je, j) et de ce qu’il ne fait pas (vous). Donc de son comportement d’achat : ce qu’il achète, du lieu où il achète ou pas (marché, supermarché, …), de la façon dont c’est produit ou vendu (bio, label, équitable, local, transport). Il y a également le lemme rencontre : le lien est présent, mais de façon plus conceptuelle, moins proche que dans le groupe des consommateurs ruraux. Chez les producteurs ruraux les pronoms tu et nous sont très employés. Le nous peut renvoyer à un couple de producteurs (Georges et Gina) ou à une communauté à laquelle on appartient : (les producteurs diversifiés, les producteurs du marché du village rural) : « Nous ce qui fait la caractéristique du secteur c’est que c’est des exploitations qui sont tournées vers beaucoup d’espèces on n’a pas de spécialisation enfin pas de très très grosse spécialisation » (David, producteur rural). Il nous semble que cette spécificité dans l’utilisation des pronoms peut-être rattachée à la nature différente des dispositifs (et non à leur caractère rural ou urbain). Dans un cas, le marché de plein vent traditionnel, nous avons affaire à une structure de taille importante qui préexiste aux acteurs. S’il est bien un lieu de rencontre, il est plus fortement marqué par une dimension individuelle tant pour les producteurs que pour les consommateurs (d’où la présence du je). Dans l’autre, le petit marché de producteurs engagés, nous avons affaire à un projet de taille plus réduite construit par une partie des acteurs. Les relations interpersonnelles, l’identification à un ou des collectifs mais également la dimension participative y sont donc plus marquées. 3. 
Conclusion et perspectives De nombreux thèmes sont apparus fortement dans le discours des interviewés : l’importance des relations, l’importance d’acheter au producteur des produits bio, de manger des produits de saison, d’utiliser des variétés de blé ancienne, de respecter l’environnement et les animaux. D’autre part, l’emploi de pronoms personnels différents et l’usage ou non de prénoms, révèlent une proximité avec les producteurs locaux (discours des JADT’ 18 821 consommateurs ruraux), l’appartenance à un groupe (discours des producteurs ruraux), une norme dans le comportement d’achat (discours des consommateurs urbains). Il est important de ne pas tenir compte uniquement de la spécificité globale d’une catégorie (ou d’un interviewé) pour juger de sa plus ou moins grande implication dans une thématique (cas de Jacques). De ce fait, les thèmes révélés par la classification ne sont pas toujours très spécifiques d’une catégorie. Malgré un corpus restreint et spécifique, la textométrie permet de mettre au jour des éléments factuels identifiés dans la littérature et d’esquisser des liens analytiques avec des approches théoriques plus générales. Ces résultats nous amèneront à poursuivre ce travail, dans le cadre du projet Strenght2Food, en y intégrant une comparaison internationale (avec tout ou partie du corpus des 6 pays partenaires sur cette thématique). Références Chiffoleau Y., Prévost B. (2012). Les circuits courts, des innovations sociales pour une alimentation durable dans les territoires, Norois, 224. Duboys de Labarre M. (2005). Le mangeur contemporain, une sociologie de l’alimentation. Thèse de sociologie, soutenue à Bordeaux, 426p. Dubuisson-Quellier S. (2009). La consommation engagée. Paris, Presses de la Fondation nationale des sciences politiques (Contester). Dufour A., Lanciano E. (2012). Les circuits courts de commercialisation: un retour de l'acteur paysan ? Revue Française de Socio-Économie (n° 9), pp. 153-169. Mundler P., Rouchier J. (2016). Alimentation et proximités: Jeux d’acteurs et territoires. Educagri. Pleyers G. (dir.) (2011) La consommation critique, mouvements pour une alimentation responsable et solidaire. Desclée de Brouwer. Prigent-Simonin A-H., Hérault-Fournier C. (2014). Au plus près de l’assiette. Editions Quæ. Reinert M. (1983). Une méthode de classification descendante hiérarchique : application à l’analyse lexicale par contexte. Les cahiers de l’analyse des données, VIII(2) :187-198. Traversac J.B. (2010). Circuits courts : contribution au développement régional. Educagri. 822 JADT’ 18 On the phraseology of spoken French: initial salience, prominence and lexicogrammatical recurrence in a prosodic-syntactic treebank Rhapsodie Maria Zimina, Nicolas Ballier Université Paris Diderot mzimina@eila.univ-paris-diderot.fr; nicolas.ballier@univ-paris-diderot.fr Abstract This paper focuses on specific quantitative characteristics of spoken language phraseology in the Rhapsodie speech database (ANR Rhapsodie 07 Corp-03001). A recent study (Zimina & Ballier, 2017) has shown that prosodic segmentation into IPE: Intonational PEriods (segments of speech with distinctive pitch and rhythm contours) available within the Rhapsodie database offers new insights for the observation of the functions of formulaic expressions in speech. Recurrent lexicogrammatical patterns at the beginning of Intonational PEriods (IPE) are strongly related to spoken formulaic language. 
These variations of initial salience depend upon several factors (interactional needs, social context, genres, etc.). Further experiments have shown that initially salient patterns also have specific prosodic characteristics in terms of prominence (prosodic stress) across the major speech genres of the Rhapsodie dataset (oratory, narrative, description, argumentation, procedural) and the corresponding speaking tasks. These specific prosodic characteristics are likely to reflect the communicative needs of speakers and listeners (interactions, uptakes, speaking turns, etc.).

Keywords: phraseology, prosodic constituents, prominence, salience, textometrics

1. Introduction
Our research examines the notions of phraseology and formulaic language in speech production on the basis of prosodic transcriptions indicating specific events in speech: boundary tones, pitch accents, disfluent segments, etc. (Yoo et Delais-Roussarie, 2009). We believe that such speech events coded in spoken corpora are relevant for identifying the prosodic characteristics of formulaic language. Corpus-based studies of phraseology often exploit recurrent patterns detected using repeated segments, co-occurrences and pattern-matching techniques to explore formulaic strings of written texts (Granger, 2005; Sitri et Tutin, 2016). This approach seems equally applicable to oral discourse. Following this approach, our initial objects of study are predictable and productive sequences of signs called lexicogrammatical patterns (lexical signs, grammatical constructions). Made of permanent 'pivotal' signs and a more productive 'paradigm', these patterns may be discontinuous and may or may not be syntactic constituents (Gledhill, 2011; Gledhill et al., 2017). For example:
§ et donc euh c'est pour ça qu'aujourd'hui je suis en italien en XXX …
§ c'est-à-dire
§ ouais
§ un mois c'est pour ça que ça s'appelle radio Timsit …
§ mais bien sûr donc

1 transition and 1->2 transition. Aside from this, only two other RC combinations exist for any possible CP: 1->1, when both CP members belong to RC 1, and 2->2, when PX mod 3 == PX+1 mod 3 == 2. Note the symmetry: two RC transitions and two "non-transition" states. Such symmetry exists, ex vi termini, only in the realm of modulo 3. Focusing solely on these transition / non-transition properties of consecutive pairs occurring among the first 30 million primes, one can observe:
• there are 16687076 transitions
• there are 13312923 non-transitions
• the longest uninterrupted sequence of consecutive transitions consists of 32 primes
• the longest uninterrupted sequence of consecutive non-transitions consists of 19 primes
• etc.
Those unafraid of induction could thus simply conjecture that, given 16687076 / 13312923 ≈ 1.253, it is approximately 25% more probable that PX+1 will belong to a different modulo-3 RC than PX. In other words: approximately 25% less carbon dioxide could potentially be emitted if machines aiming to discover a new prime PX+1 would explore:
• sequences (PX + 4 + 6*n) if it is known that PX mod 3 == 1
• sequences (PX + 2 + 6*n) if it is known that PX mod 3 == 2
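(The offsets above generate the odd candidates in the opposite residue class: a prime PX ≡ 1 mod 3 is ≡ 1 mod 6, so PX + 4 + 6n ≡ 5 mod 6; a prime PX ≡ 2 mod 3 is ≡ 5 mod 6, so PX + 2 + 6n ≡ 1 mod 6.) The tally itself is easy to reproduce at a smaller scale. The following is a minimal counting sketch, not the original script; the bound is far below the 30 million primes quoted above, so the resulting ratio will differ somewhat.

```perl
#!/usr/bin/perl
# Count modulo-3 residue-class transitions vs non-transitions among
# consecutive primes below a bound (sieve of Eratosthenes).
use strict;
use warnings;

my $limit = 2_000_000;      # raise this to approach the 30M-prime tally
my (@is_comp, @primes);
for my $i (2 .. $limit) {
    next if $is_comp[$i];
    push @primes, $i;
    for (my $j = $i * $i; $j <= $limit; $j += $i) { $is_comp[$j] = 1 }
}
shift @primes while @primes && $primes[0] <= 3;   # drop 2 and 3

my ($trans, $same) = (0, 0);
for my $k (1 .. $#primes) {
    ($primes[$k - 1] % 3 == $primes[$k] % 3) ? $same++ : $trans++;
}
printf "transitions: %d, non-transitions: %d, ratio: %.3f\n",
       $trans, $same, $trans / $same;
```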
Reproducible Identification of Pragmatic Universalia in CHILDES Transcripts

Daniel Devatman Hromada 1,2,3
1 Université Paris Lumières - France
2 Slovak University of Technology - Bratislava - Slovakia
3 Berlin University of the Arts - Berlin - Germany

Abstract
This article presents the method and results of multiple analyses of the biggest publicly available corpus of language acquisition data: the Child Language Data Exchange System. The methodological aim of this article is to present a means by which science can be done in a highly positivist, empiric and reproducible manner consistent with the precepts of the "Open Science" movement. Thus, a handful of simple one-liners pipelining standard GNU tools like "grep" and "uniq" is presented which, when applied to the myriads of transcripts contained in the corpus, can potentially pave a path towards the identification of statistically significant phenomena. Relative frequencies of occurrence are analyzed along the age and language axes in order to help identify certain concrete, pragmatic universalia marking different stages of linguistic ontogeny in human children. One can thus observe a significant culture-agnostic decrease of laughing in child-produced speech and in child-directed indo-european "motherese" occurring between the 1st and 2nd year of age; a maternal increase in the production of the pronoun denoting the 2nd person singular, "you"; an increase in the usage of the 1st person singular "I" in utterances produced by children around the 3rd year of age, and a marked decrease of the same which takes place around 6 years of age. Other significant correlations, both intra-cultural (between English mothers and children) as well as inter-cultural, are pointed out, always accompanied by a thorough description of the methodology, immediately reproducible on an average computer.

1. Introduction
Reproducibility is one of the hallmark principles of occidental science. Being based upon the philosophy of the ancient Greeks, who were fully aware that only the knowledge of that which repeats itself in many instances can lead to generic and transtemporal ἐπίσταμαι, the western scientific method necessarily considers reproducibility as its main condition sine qua non. In the words of the foremost figure of modern epistemology, "non-reproducible single occurrences are of no significance to science" (Popper, 1992). Hence the primary, epistemological, objective of this article is to show how anyone willing to do so can perform reproducible analyses and experiments regarding the phenomena traditionally falling into the scope of corpus, computational and developmental linguistics.
This objective is quite naturally attained if three precepts are stringently followed:
• use publicly available data
• analyse the data with simple, specific yet powerful tools which are well known to the widest possible public
• faithfully protocol the exact procedure of usage of these tools
In more concrete terms, we promote the idea that - in regards to the analysis of statistical textual data - core GNU (Stallman, 1985) utils and commands, as well as basic operators and core functions of open source languages like PERL (Wall, 1990) or R (Team, 2013), indeed offer such "simple, specific yet powerful tools well known to the widest possible public". When it comes to the precept "faithfully protocol the usage of these tools", it shall be implemented - in this article and potentially beyond - in the following manner: every simple transformation of data is to be completely and exhaustively described in a footnote which accompanies the description of the transformation. By "simple", we mean such a transformation as can be described as a standard UNIX shell1 one-liner pipelining and combining together core commands like "grep", "uniq" or "sort". In the case of more complex transformations, the complete source code of the program is always to be furnished, either in the publication's appendix or at least as a URL reference. To assure the highest possible reproducibility of the experiment, the snippet should not call any modules or libraries external to the language's core distribution (e.g. no CPAN resp. CRAN). The most important thing, however, is not to forget that the protocol is to be complete, exhaustive and unambiguous. That is, the history of all steps is to be described in a form which is immediately executable on a standard GNU-positive machine. All means all: from the very fact of downloading2 the corpus from a publicly available source to the very act of plotting the legend on a figure which is then disseminated among scientific communities. Given that these precepts are followed, and under the conditions that
• the analysis is fully deterministic (i.e. does not involve any source of stochasticity)
• the source corpus has not changed in the meanwhile
it can be expected that the same analysis shall bring the same results no matter whether it is executed in another folder of the same computer (reproducibility across directories), executed on different computers (reproducibility across experimental apparatus) and/or executed by a different experimenter (experimenter-independent reproducibility).

1 $ echo 'All footnote-descriptions of shell one-liners begin with the sign $ and all footnote-descriptions of R commands begin with sign >.'
2 It is highly recommended to use standard utilities like "wget" or "curl" for that purpose.

2. Corpus & Method
The Child Language Data Exchange System (CHILDES) undoubtedly belongs among the most fascinating language-related corpora. Established by (MacWhinney and Snow, 1985) more than 30 years ago and including transcripts dating back to the 1960s, CHILDES does not cease to be the biggest public repository of child language acquisition and development data. Thus, aside from huge volumes of audio and video recordings of verbal interactions with children, CHILDES also contains more than thirty thousand distinct transcripts. The transcripts themselves are encoded as UTF-8 compliant plaintext .CHA files. These files follow the CHAT format specified in (MacWhinney, 2012).
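Before any transformation, it is instructive to simply peek into one raw transcript. A minimal sketch (the concrete file path is hypothetical; any .CHA file of the corpus will do):

$ grep -m 12 -P '^(@|\*(MOT|CHI):)' Eng-NA/Brown/Eve/eve01.cha # header lines begin with @, utterances with *MOT: resp. *CHI: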
Every transcript contains a header describing specific facts concerning the transcribed scenario - e.g. the age of the child or the identities of the participants (lines beginning with *CHI denote utterances produced by children; lines beginning with *MOT denote utterances produced by their mothers). Unfortunately, different linguists have followed the CHAT manual in different manners. For example, some include timestamp information in their corpus and some do not. Some mark repetition by special tokens like [x 2] (for duplication) or [x 3] (for triplication) and some transcribe the utterance as such, without using such tokens. And yet another set of differences necessarily originates in the transcriber's own perception and habits. For example: while the token "mama" occurs in 1405 child utterances contained in the English sections of the corpus3, some other English transcribers (e.g. Haggerty or Suppes) apparently preferred to transcribe the mother-directed vocative as "mamma" - this occurs in 126 distinct utterances. Be it as it may, the CHILDES corpus is already so huge that one may expect that a well constituted and unbiased quantitative analysis could potentially allow the discovery of phenomena robust to any surface perturbations (e.g. differences in the habits and styles of different investigators etc.). In other terms, if every transcript is understood as the result of a distinct act of sampling, then it can be expected that the statistical aggregation of such a huge amount of distinct samples (> 30000 distinct transcripts) could lead to a situation where the noise cancels itself out and statistically significant phenomena emerge. And individual CHILDES transcripts are indeed distinct. Not only because dozens, if not hundreds, of researchers and investigators of at least three or four generations have already directly participated in the constitution of the corpus. Not only because the majority of transcripts were in one way or another related to a specific research project with a goal unrelated to the goals of other projects. But also because the investigators themselves, as well as the investigated subjects (e.g. children), often stem from a huge variety of distinct cultural backgrounds. More concretely: 26 languages are included in the corpus, covering practically the majority of the main terran language strata (i.e. indo-european languages, asian languages, semitic, altaic and ugro-finnic languages etc.). This allows for trans-cultural analysis, and of such a nature shall indeed be all analyses presented in section 3.

3 $ grep "mama" child/*Eng* |wc -l; grep "mamma" child/*Eng* |wc -l

2.1 Metrics
Results can be mutually compared and communicated only if they are expressed in common units. In the case of all experiments presented in this article, the relative frequency - interpreted as the probability of occurrence - of a pattern X is such a unit. This is equivalent to the absolute frequency of occurrence FX normalized by the total number of utterances, i.e.

PX = FX / Nutterances

Ideally, one PX value should correspond to every month mentioned in the CHILDES corpus.
To understand our approach more clearly, imagine, for example, a hypothetical language whose speakers utter 100 utterances each month from their birth until their tenth birthday. If such speakers utter the token "dog" twenty times every month, then the value of all 120 (i.e. 10 years * 12 months) datapoints describing the time series for this particular token would be constantly equal to 20/100 = 20% = 0.2. It is principally due to such a trivial nature of the calculus hereby presented that the core datamining procedures can be performed directly on the BASH command-line.

2.2 Preprocessing
Four hundred and sixty-seven megabytes of data compressed in 983 zip files are obtained after the corpus has been downloaded from its original source4 or from a mirror site which represents the state of CHILDES as of February 6th 20165. After these files are recursively decompressed6, the CHILDES arborescent structure is flattened so that all .CHA files are contained within one sole directory7. A subsequent one-liner "peeks into" each .CHA file, retrieves the information about the child's age from it and puts this information into the file's name8. Utterances containing only xxx and www tokens - which, according to the CHILDES manual, denote "unintelligible words with an unclear phonetic shape" resp. "untranscribed material" - are removed from all child and mother transcripts9. The next step is executed only to speed up the subsequent pattern extraction processes: child utterances are funnelled into simplified transcripts stored in the "CHI" subdirectory and maternal utterances are funnelled into the "MOT" subdirectory10. Translocutory information is thus lost, but this is acceptable for the purpose of this article, in which we focus solely on relative frequencies of certain tokens and not on more complex discourse units. All this yields 5833656 lines (i.e. utterances) contained in 29180 non-empty simplified transcripts stored in the "CHI" directory and 3798005 lines contained in 13590 non-empty simplified transcripts stored in the "MOT" directory. Note that metadata like age (years and months), language group, language and the CHILDES investigator's identity are stored directly in the simplified transcript's filename. The workbench common to all following analyses can thus be considered ready.

4 $ wget -P CHILDES -e robots=off --no-parent --accept '.zip' -r http://childes.psy.cmu.edu/data/
5 $ wget -P CHILDES -e robots=off --no-parent --accept '.zip' -r WILL-BE-GIVEN-IN-CAMERA-READY-VERSION
6 $ find CHILDES/data -name "*.zip" | while read filename; do unzip -o -d "`dirname "$filename"`" "$filename"; done
7 $ mkdir CHILDES_flat; find CHILDES/data -type f |perl -n -e 'chomp; if (/\.cha/) {$f=$_; s/\//-/g; s/\.-data-//g; `cp $f ./CHILDES_flat/$_`;}'; cd CHILDES_flat;
8 $ mkdir aged; grep -P '\|\d;\d' *| grep Child | perl -n -e 'chomp; `cp $1 aged/$2-$3-$1` if /^(.*?):.*0?(\d+);0?(\d+)/;' ; rm *.cha
9 $ perl -ni -e 'print if $_!~/^\*(MOT|CHI):\t(xxx|www) ?\./' aged/*
10 $ mkdir CHI; cp aged/* CHI; sed -i '/\*CHI/! d' CHI/*; mkdir MOT; cp aged/* MOT; sed -i '/\*MOT/! d' MOT/*;
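Whoever reproduces these steps can verify that her workbench matches ours with a few trivial checks (a minimal sketch, assuming the CHI and MOT directories built by footnote 10):

$ cat CHI/* | wc -l # expect 5833656 child utterances
$ find CHI -type f -size +0c | wc -l # expect 29180 non-empty simplified transcripts
$ cat MOT/* | wc -l # expect 3798005 maternal utterances
$ find MOT -type f -size +0c | wc -l # expect 13590 non-empty simplified transcripts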
3. Analyses
3.1. First Analysis – Laughing
It has recently been indicated that English mothers interacting with children younger than 16 months tend to laugh significantly more often than mothers who interact with children between 16-31 months of age (p.222, Hromada, 2015). Our 1st analysis will use CHILDES to address this hypothesis from a trans-cultural perspective. It may be surprising to use a dataset which is essentially a linguistic corpus for the purpose of studying such a non-verbal means of communication as laughing definitely is. But the very CHAT manual (p.62, MacWhinney, 2012) explicitly specifies the &=laughs marker as the most common standardized spelling denoting this specific extralinguistic event. Unfortunately, within the totality of the CHILDES corpus, the marker &=laughs is not the only standardized form denoting the phenomenon, and some authors preferred to use markers like [=! laughing]. Hence, for the purpose of our 1st analysis, we have simply used the token laugh as the one whose frequencies of occurrence we have decided to measure. Three indo-european languages (English, French and Farsi) and two non-indo-european languages (Japanese and Chinese) were chosen in order to address the developmental trajectory of laughing from a trans-cultural perspective. For each of these languages, a target investigator was identified as the one who most frequently used the marker laugh in his transcripts of motherese11. The corpus subsections "Farsi-Family", "French-MOR-York", "Japanese-MiiPro" and "Chinese-Beijing" were thus identified as target subsections. All English-language transcripts (i.e. files whose filename contains the token "Eng") were also taken into account. The core of the procedure is as follows: the total amount of utterances is obtained, for each month and each target subsection of the corpus, by a one-liner12 which redirects its output into a file whose every row contains three space-separated columns: the first column denotes the value of Nutterances, and the second and third columns denote the year resp. the month. The procedure is repeated ten times altogether: once for each of the five target corpus subsections, multiplied by the two possible values of the locutor variable (MOT13 or CHI14). There follow ten executions of a command sequence which generates ten files containing absolute frequencies of occurrence of the token laugh within the five corpus sections - again for both MOT15 and CHI16 locutors - aggregated according to the child's age at the moment when the laughing was noted down by the CHILDES investigator. And that's it: all result-containing files can now furnish the input datasets for the R code which produces the plot displayed in the corresponding figure.

[Figure 1: Probability that laughing accompanies or substitutes an utterance produced by, or directed to, a child of a specific age.]

11 $ grep laugh MOT/*French* | grep -o -P '\-French\-.+\-' | sort | uniq -c ; grep laugh MOT/*Farsi* | grep -o -P '\-Farsi\-.+\-' | sort | uniq -c ; grep laugh MOT/*Japanese* | grep -o -P '\-Japanese\-.+\-' | sort | uniq -c ; grep laugh MOT/*Chinese* | grep -o -P '\-Chinese\-.+\-' | sort | uniq -c ;
12 $ wc -l MOT/*Farsi-Family* |perl -e 'while (<>) { s/MOT\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' >exp1.MOT.Farsi-Family.N
13 $ wc -l MOT/*Eng* |perl -e 'while (<>) { s/MOT\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' >exp1.MOT.Eng.N
14 $ wc -l CHI/*Eng* |perl -e 'while (<>) { s/CHI\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' >exp1.CHI.Eng.N
15 $ grep laugh MOT/*Eng* |perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp1.MOT.Eng.F
16 $ grep laugh CHI/*Eng* |perl -n -e '/CHI\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp1.CHI.Eng.F

Potentially the most salient phenomenon is a marked decrease in the production of laughs which occurs between birth and the second year of age. This could potentially be explained in terms of a gradual switch from non-linguistic means of communication towards more verbal interactions. However, in the case of the child-directed speech of Japanese motherese, the relative frequency of laughing seems to increase during the same period, and in the case of Chinese the decline is much less marked than in the case of the indo-european languages. This may potentially suggest an intercultural difference - a hypothesis which is further corroborated by the fact that it is only in the case of indo-european languages that the "dotted" lines cross the "solid" lines. Id est, little English-, French- and Farsi-speaking children tend to laugh more often than their mothers, but older children seem to laugh less frequently than their mothers. This quiproquo notwithstanding, the relative frequencies of the CHI time series significantly correlate with the MOT time series both in English (Pearson's correlation coefficient 0.933, t = 7.36, df = 8, p-value = 7.886e-05) and in Farsi (corr. coef. 0.972, t = 5.9224, df = 2, p-value = 0.02735). In French, the correlation is quite close to the significance threshold (t = 4.1692, df = 2, p-value = 0.053, cor. coef. = 0.947) when the data is aggregated in year-sized packages, but insignificant (t = -1.1598, df = 27, p-value = 0.2563) when the time series are correlated with monthly granularity. No statistically significant correlation between the child-produced and mother-produced laugh time series has been observed in the case of Japanese or Chinese.
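For readers who wish to inspect the relative frequencies PX = FX / Nutterances numerically before plotting, the F and N files can be joined on their year-month key with one more one-liner (our own minimal sketch, assuming the files produced by footnotes 13 and 15):

$ perl -e 'open F,"exp1.MOT.Eng.F"; while (<F>) { $f{"$2-$3"}=$1 if /(\d+)\s+(\d+)\s+(\d+)/; } open N,"exp1.MOT.Eng.N"; while (<N>) { $n{"$2-$3"}=$1 if /(\d+)\s+(\d+)\s+(\d+)/; } printf "%s P=%.4f\n", $_, $f{$_}/$n{$_} for grep { $n{$_} } sort keys %f;'

Every printed row is one datapoint of the English MOT curve: the year-month key followed by the probability that a maternal utterance of that month is laugh-marked.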
3.2. Second Analysis – 2nd person singular
It has also been indicated that English mothers interacting with their children tend to use the pronoun for the 2nd person singular, "you", much more frequently than is the case in standard linguistic communication (p.218, Hromada, 2015). Similarly to our 1st analysis, our 2nd analysis uses CHILDES to address this hypothesis from a trans-cultural perspective. The procedure is thus very similar to the one already presented, with one major difference: we do not focus on the assessment of occurrences of one standard marker (e.g. "laugh") which is present in different corpus sections, but rather look, in each specific subcorpus, for a specific Perl Compatible Regular Expression (PCRE2p.sg) which matches the nominative forms of the 2nd person singular in the language of the subcorpus under study. The following table lists such PCREs for matching 2p.sg. in seven languages.

Language   PCRE2p.sg
English    [ \t]you[' ]
French     [\t ]t(u |oi |')
Farsi      [\t ]to
Polish     [\t ]ty
Chinese    (你|ni3)
Estonian   [\t ]s(in)?a
Hebrew     [\t ]ata?

Usage of these regexes within one-liners using the case-insensitive "grep" allows us to obtain the distributions of relative frequencies independently for MOT17 and CHI18 utterances. The command sequence yielding the distributions of Nutterances19 is practically the same as in the first analysis (c.f. footnotes 12-14), the only difference being that this time we do not focus on subcorpora representing transcripts done by specific target investigators, but rather process much bigger datasets containing all transcripts representing the language under study. The FPCRE2p.sg and Nutterances distributions are subsequently processed by R code which is, mutatis mutandis, identical to the R code snippet used in analysis 1. This yields Figure 2. A phenomenon common to all languages under study can be observed practically immediately. That is, on all solid MOT lines one can observe, between the first and fourth year of the child's age, a marked increase in the maternal usage of the 2nd person singular. Sometimes such an augmentation is less marked (as in French), sometimes it comes later (between the 2nd and 3rd year of age in the case of Farsi and Hebrew), but it always comes. And it always reaches its all-time heights before the fifth year of age, after which the maternal usage of "you" tends to slowly converge back to its "normal" levels. Note also that in English motherese, "you" is used in approximately every fifth utterance. What is also striking in regard to the English language - which is definitely the biggest CHILDES subcorpus - is the quite significant correlation between the time series representing the usage of 2p.sg. by mothers and the time series representing the usage of 2p.sg. by children themselves (Pearson's cor. coeff. = 0.768, t = 3.393, df = 8, p-value = 0.009451; Kendall's τ = 0.6, T = 36, p-value = 0.01667; Spearman's ϱ = 0.733, S = 44, p-value = 0.02117)20.

17 $ grep -i -P "[\t ]you[' ]" MOT/*Eng* |perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp2.MOT.Eng.F
18 $ grep -i -P "[\t ]you[' ]" CHI/*Eng* |perl -n -e '/CHI\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp2.CHI.Eng.F
19 $ wc -l CHI/*Farsi*|perl -e 'while (<>){s/CHI\///;/(\d+) (\d+-\d+)-/;$h{$2}+=$1;}for (sort keys %h){/(\d+)-(\d+)/;print "$h{$_} $1 $2\n";}' >exp2.CHI.Farsi.N
20 > cor.test(aggregated_mot_lang1[,6]/aggregated_mot_lang1[,3], aggregated_chi_lang1[,6]/aggregated_chi_lang1[,3], method="kendall")
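By way of illustration - this particular instantiation is ours and not one of the original footnotes - the French row of the table above plugs into exactly the same pipeline as footnotes 17 and 18:

$ grep -i -P "[\t ]t(u |oi |')" MOT/*French* |perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp2.MOT.French.F

The same substitution works for any other row, which is precisely what makes the trans-cultural comparison a matter of a few keystrokes.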
3.3. Third Analysis – 1st person singular
Our 3rd analysis is identical to the second; the only things which change are the PCRE patterns, which are this time supposed to match the nominative forms of the pronouns denoting the 1st person singular. Id est the ego, the self-reference, the "I". The following table lists seven such PCREs matching 1p.sg. in their respective CHILDES subcorpora.

Language   PCRE1p.sg
English    [ \t]I[' ]
French     [\t ](j(e |')|moi)
Farsi      [\t ]m[aæe]n
Polish     [\t ]ja
Chinese    (我|wo3)
Estonian   [\t ]m(in)?a
Hebrew     [\t ]ani

Everything else - from the extraction of the absolute frequencies of forms matched by the PCREs all the way to aggregating, normalizing and plotting - is, mutatis mutandis, identical to the 2nd analysis. This leads to the visualisation presented in the corresponding figure. An interesting phenomenon can be noticed: while in early infancy mothers of all language backgrounds use 1p.sg. much more frequently than children (probably because the children are still in a pre-linguistic stage), the difference is swiftly and strongly counteracted. Hence, around three years of age, children of all21 cultures tend to produce 1p.sg. much more frequently than their mothers. But not only augmentations of use, also diminutions are of certain scientific interest. Hence, a steep decline in the use of 1p.sg. can be observed between the 6th and 7th year of age - that is, during the period when children enter school, and which marks the offset of that ontogenetic stage which (Piaget, 1951) labeled as "egocentric". Similarly to the 2nd analysis, a significant correlation between the time series representing the production of "I" by English-speaking mothers and the production of "I" by English-speaking children can be observed (Kendall's τ = 0.555, T = 35, p-value = 0.02861). What is more, the plot indicates a path towards the identification of statistically significant inter-cultural correlations. Thus, after filling the gap22 in the Chinese dataset related to the fact that CHILDES does not seem to contain transcripts of Chinese 8-year-olds, one shall observe a correlation23 between the time series of relative frequencies of 1p.sg. produced by French and Chinese children (Kendall's τ = 0.511, T = 29, p-value = 0.02474). Idem for English and French (Kendall's τ = 0.777, T = 32, p-value = 0.002425). And if one stays faithful to the canonical p<0.05 precept (Fisher, 1925) and opts for Spearman's rho or Pearson's coefficient rather than for Kendall's tau, then also, for example, for French and Polish (Pearson coef. = 0.837, t = 3.4219, df = 5, p-value = 0.0188; Kendall's τ = 0.619, T = 17, p-value = 0.06905; Spearman's ϱ = 0.785, S = 12, p-value = 0.04802) as well as for Polish and Hebrew (Pearson coef. = 0.759, t = 2.6117, df = 5, p-value = 0.04757; Kendall's τ = 0.619, T = 17, p-value = 0.06905; Spearman's ϱ = 0.786, S = 12, p-value = 0.04802)24.

21 With the exception of the Polish language, where we unfortunately lack motherese data from the 3rd birthday onwards.
22 > aggregated_chi_lang4[9,]=(aggregated_chi_lang4[7,]+aggregated_chi_lang4[8,])/2
23 > cor.test(aggregated_chi_lang2[,6]/aggregated_chi_lang2[,3], aggregated_chi_lang4[,6]/aggregated_chi_lang4[,3], method="kendall")
24 > cor.test(aggregated_chi_lang6[,6]/aggregated_chi_lang6[,3], aggregated_chi_lang5[,6]/aggregated_chi_lang5[,3], method="spearman")
4. Discussion
It is a common practice in contemporary Corpus Linguistics in general, and in Natural Language Processing in particular, to focus fully on the formal and theoretical properties of one's model or analysis. Thus, the majority of publications in these domains limit themselves to the dissemination of the few core formulas behind the analysis which is presented, plus the results which were obtained (F-scores etc.). In an atmosphere where sharing the code with the community is more an exception than a rule, it is not surprising that the majority of publications disregard the concrete aspects of the implementation and execution of one's analysis as unworthy of interest. Such an attitude can be excusable when one attacks a highly specific engineering problem. But in regard to analyses aiming to attain general knowledge - id est, when doing fundamental research or exploratory science - such an approach is to be discarded as inconsistent with the ideal of experimenter-independent reproducibility. In this article, we have explained how cost-efficient (i.e. as free as open source software), reproducible and transparent science can be performed at the very border of corpus and developmental psycholinguistics. More concretely, in the footnotes of this article we have presented fewer than two dozen one-liners which pipeline and combine PCREs (Wall, 1990; Hromada, 2011) with core GNU utilities like "grep", "uniq", "wc" and "sort". Aside from this, a snippet of a few dozen lines of beginner-level, non-optimized R code is hereby being published25 in order to furnish a complete description - i.e. from downloading the corpus from its publicly available source all the way to the final plots and correlation coefficients - of the three experiments hereby performed. Common to these three experiments was a preprocessing phase which purified and repartitioned hundreds of megabytes of data contained in CHILDES. The result of this phase were two directories: CHI, which contains utterances produced by children, and MOT, which contains motherese utterances (cf. section 2.2). The principal motivation behind this repartitioning was the speed-up of any subsequent analysis. For example the 3rd analysis - when executed on one sole core of a 3.2 GHz PC with 8GB RAM and with the CHILDES data stored on an SSD disk (a fairly standard configuration) - didn't last more than 15 seconds, all the way from matching the first regular expression on the first line of the first transcript to R's final plotting. Mentioning regular expressions, we consider it important to reiterate that regexes, like those implemented in Perl or PCREs, seem to us to be much more than impressive yet weird character sequences that no neophyte can read. Unambiguously denoting what they should denote - i.e. a specific set of character sequences, a specific pattern, schema and form - PCREs are formalisms in their own right (Hromada, 2011). Idem for shell commands and PERL or R instructions: they also are unambiguous formalisms, and for the purposes of NLP they can turn out to be at least as worthy as other formalisms. Formalisms, tools and methodology being thus defined by a concrete example, a question can be posed: "What should be the name of a discipline which implements such a method and uses such tools?" And given that what was done used techniques common to textometry in order to address topics common to developmental psycholinguistics (Tomasello, 2009), the answer could potentially sound: "Textometric Psycholinguistics".

25 http://wizzion.com/code/jadt2016/childes.R
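The 15-second figure is easy to put to the test, at least for the extraction part; a minimal sketch (the output filename exp3.CHI.Eng.F is our hypothetical choice, by analogy with footnote 18) wraps the 1p.sg. extraction in the standard shell timer:

$ time (grep -i -P "[ \t]I[' ]" CHI/*Eng* |perl -n -e '/CHI\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp3.CHI.Eng.F)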
It is only now - with the toolbox specified, and the reproducible method and scope of interest of the discipline properly delimited - that a discussion about culture-independent anthropological constants occurring in adult-child verbal and pre-verbal interactions - id est a discussion about "linguistic universalia" and their meaning, a discussion among savants - can, hopefully, begin.

References
Fisher, R. A. (1925). Statistical methods for research workers. Genesis Publishing Pvt Ltd.
Hromada, Daniel D. (2011). Initial Experiments with Multilingual Extraction of Rhetoric Figures by means of PERL-compatible Regular Expressions. RANLP Student Research Workshop, 85-90.
Hromada, Daniel D. (2015). Theoretical Foundation of Thesis "Evolutionary Models of Ontogeny of Linguistic Categories". In press.
MacWhinney, Brian & Snow, Catherine. (1985). The child language data exchange system. Journal of Child Language, 12(02), 271-295.
MacWhinney, Brian. (2012). The CHILDES Project: Tools for Analyzing Talk - Electronic Edition. Part 1: The CHAT Transcription Format.
Piaget, J. (1951). Principal factors determining intellectual evolution from childhood to adult life. Columbia University Press.
Popper, Karl. (1992). The Logic of Scientific Discovery. Routledge, London.
Stallman, Richard. (1985). The GNU manifesto.
Team, R Core. (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Tomasello, M. (2009). Constructing a language: A usage-based theory of language acquisition. Harvard University Press.
Wall, Larry. (1990). PERL: Practical Extraction and Report Language.

Fast and Frugal Detection of Chiastic Protofigures in English Subsection of CHILDES Corpus
(regex strikes back)

Daniel Devatman Hromada 1,2,3
daniel@wizzion.com
1 Université Paris 8 / Lumières, École Doctorale Cognition, Langage, Interaction, Laboratoire Cognition Humaine et Artificielle
2 Slovak University of Technology, Faculty of Electronic Engineering and Informatics, Department of Robotics and Cybernetics
3 Universität der Künste, Fakultät der Gestaltung, Berlin

Table of Contents
1 Introduction: Computational Psycholinguistics, Computational Rhetorics, Main idea
2 Das Experiment: Method, Results
3 To whom it may concern: Current state, Future directions

Computational (Developmental) Psycholinguistics
C(D)P is a cross-over between computational linguistics (and/or Natural Language Processing) and developmental psycholinguistics. Main objectives:
1 use computational methods (data-mining, information retrieval, NLP etc.) to gain novel insights about the ontogeny of language competence in human children
2 develop computational models of language acquisition and embed them into language-interacting artificial agents
In this talk we focus solely on the first objective.
CHILDES corpus: a gem of gems
Child Language Data Exchange System (MacWhinney & Snow, 1985)
http://childes.psy.cmu.edu/data
http://wizzion.com/CHILDES/ (mirror from 6th Feb 2016)
1 more than 50 years of tradition
2 more than 1.5 GigaBytes of mostly textual data contained in cca 30000 transcripts
3 at least 26 languages, dialects or language combinations
4 Creative Commons BY-NC-SA licence

CHAT format
The CHAT system provides a standardized format for producing computerized transcripts of face-to-face conversational interactions (MacWhinney, 2016; http://childes.talkbank.org/manuals/chat.pdf).
@Languages: eng
@Participants: CHI Eve Target_Child , MOT Sue Mother , FAT David Father
@ID: eng|Brown|CHI|1;6.|female|||Target_Child|||
@ID: eng|Brown|MOT|||||Mother|||
@ID: eng|Brown|COL|||||Investigator|||
@Date: 29-OCT-1962
*MOT: one two three four .
%mor: det:num|one det:num|two det:num|three det:num|four .
%act: tests tape recorder
*CHI: one two three . [+ IMIT]
A non-negligible advantage: the majority of transcripts follow the principle ONE LINE = ONE UTTERANCE.

Computational (& Cognitive) Rhetorics
Computational Rhetorics: a discipline which attained its maturity at the Computational Rhetorics Workshop organized by Harris and Di Marco at the University of Waterloo.
Computational-Cognitive Rhetorics: a discipline using computers to better understand why rhetorics casts such a powerful curse on human minds.
Computational-Developmental Rhetorics: using computers to elucidate the process of ontogeny of rhetoric competence in human children.
"A child's spontaneous remark is more valuable than all the questioning in the world." (Jean Piaget)

Main concept(s)
Scheme: a generic form which corresponds to one or more distinct constellations of observables.
Regular expression: a sequence of characters that defines a search pattern.
Perl-Compatible Regular Expressions: a concise and expressive regex standard. Much more powerful than regular grammars: it is possible to perform back-tracking!
Backtracking: allows us to match that which has already been matched; paves the way to the detection of repetitions.

Main idea
Chiasms are repetition-based schemata A1 B1 C1 X C2 B2 A2 (or A1 B1 X B2 A2). Note that the presence of the middle term (B) and the separator term (X) can be considered facultative. But in order to detect a chiasm, the initial preceptor (A1) has to be strongly reminiscent of (and ideally identical to) the terminal successor (A2). Idem for the relation between the terminal preceptor (C2) and the initial successor (C1).

Method: regex implementing the main idea
initial preceptor (A1):   (\w{3,})
middle term (B1):         (.{0,77})
initial successor (C1):   (\w{3,})
separator (X):            .{0,77}
terminal preceptor (C2):  \3
middle term (B2):         \2
terminal successor (A2):  \1
Note that the nodes of a chiasmatic structure form a double-closed graph.

Demo
Run this shell command* :
grep -irP '^\*MOT:.*(\w{3,}) (.{0,77}) (?!\1)(\w{3,}).{0,77}\3 \2 \1' *Eng*
in the directory into which You downloaded and unpacked the CHILDES corpus.
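Readers without the corpus at hand can sanity-check the pattern on a synthetic utterance (one of the maternal examples quoted below); if the regex works, grep echoes the line back:

$ printf '*MOT:\tcould put the box on the lid instead of the lid on the box .\n' | grep -P '^\*MOT:.*(\w{3,}) (.{0,77}) (?!\1)(\w{3,}).{0,77}\3 \2 \1'

Here \1 = "box", \2 = "on the" and \3 = "lid", so the back-references \3 \2 \1 close the chiasmatic loop.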
Note that the extractor can be parametrized by changing the numeric values: e.g. changing (\w{3,}) to (\w{1,}) could potentially allow You to detect grapheme-level metatheses like "asteriks with an asterisk".
* The regex sequence is hereby transferred to the Public Domain under a Creative Commons BY-NC-SA (Author Attribution, Non-Commercial, Share-Alike) licence.

Results
You'll see many playful ones...
pear pear yummy yummy yummy yummy pear .
my name is Joey Joey Joe Joe Joe Joe Joey .
I think I can I think I can I think I can I think I can I think I can I think I can .
tick tick tick tick tick tick tick tick tick tock tick tock tick tick tick tick .
Earth , moon , Earth , moon , full moon , Earth moon .
crash , boom , crash , boom , crash , boom crash !
Note: a triplicated couple A1 B1 A2 B2 A3 B3 always contains an A1 B1 B2 A3 implicit antimetabole!!!

...reversed coordinatives...
and they splish and they splash and they splash and they splish .
a dot and a dash and a dash and a dot .
well Granddad and Grandma [//] Grandma and Granddad are coming today .
it's called lamb and vegetable [//] mediterranean vegetable and lamb risotto .
Donald hopped and swam and swam and hopped until he was safe on dry ground .
every day my cows Poppy (.) Annabel (.) Emily and Heather moo and mumble (.) mumble and moo .
Chester and Wilson Wilson and Chester .

...and more exhaustive reversed lists...
Chester and Wilson and Lily Lily and Wilson and Chester .
okay , square , square , rectangle , square , oval , two , one , one , two .
blue , green , yellow , red , red , yellow , green , blue .
one two three or three two one ?
sure we went through Rhode island , Massachusetts , New Hampshire , Vermont , and then on the way back we did Vermont , New Hampshire , Massachusetts , Rhode island , right ?

...and reversals of direction and position and time...
you get one ticket that says York to Manchester and another ticket that says Manchester to York .
he used to rush here and there and there and here and back again all the time and of course he was always in such a rush that he never ever finished anything properly .
from here to there , from there to here from here to there funny things everywhere .
let's put mine on yours and put yours on mine .
could put the box on the lid instead of the lid on the box .
but I mean do you get your drink after you've had your biscuit or do you get your biscuit after you've had your drink .

...and reversals of attributes...
let's put the blue one on the guy with the red underpants and the red one on the guy with the blue underpants .
if it (h)as been a police car it becomes a racing car and if it (h)as been a racing car it becomes a police car .
and when you're talking about little crocodiles and big snakes (.) or little snakes and big crocodiles (.) they're jelly sweets you've had in the past .
oh [!] I got a yellow cup and a red plate and you got a red cup and a yellow [!] plate (.) .
look , they're very similar (.) look , this one is green with a little yellow , and this I yellow with a little green (.) interesting , huh ?
you mean it looks nicer than it smells [//] smells nicer than it looks .

...and reversals of case-like roles, of course...
Nominative vs. Vocative
Amanda that's xxx xxx that's Amanda .
xxx this is Stephanie Stephanie this is xxx by the way .
Nominative vs. Accusative
froggie keep an eye on mummy or mummy keep an eye on froggie ?
Floppy meet the screwdrivers screwdrivers meet the Floppy .
Nominative vs. Dative
do you give Daddy a big kiss or does Daddy give you a big kiss ?

...as well as some more complex swaps, like Nominative vs. Genitive vs. Locative...
I mean you go [//] girls go to boys parties and boys go to girls

...or proto-rhetoric questions...
I think you're stinky you are stinky are you stinky ?
wouldjou [: would you] couldjou [: could you] wouldjou [: would you] with a goat ?

...and other pieces of maternal wisdom.
I would not could not in a box I could not would not with a fox .
we're in house of bricks not the bricks of house .
two for tea , and tea for two .
I meant what I said and I said what I meant .

Current state: concerning the method
a naive rhetoric-figure-tagger (nRFT)
fast*, deterministic, transparent for inspection, partially parametrizable
form-oriented: looks for identical sequences within the signifier (no semantics involved)
generates false positives: manual check needed; can be useful for a CHIASMFP corpus
can speed up the manual annotation (semi-supervised scenario)
IMPORTANT: the schema can be used not only to detect, but also to GENERATE
* and super-fast if You store Your Big Data on a RAMdisk or at least on an SSD disk cache

Current state: concerning the results
English motherese utterances tend to abound with protochiastic structures
many functions: playful reversal of repetition, reversal of spatial direction, reversal of list, lapsus linguae correction, positional swap, attribute swap, functional (case) swap ...
all matched by a single one-liner !
what we are dealing with here is a whole ecosystem of diverse structures
the indicated prominence of the verb "put" as a middle term is consistent with the theories of Piaget and Tomasello
a triplicated couple A1 B1 A2 B2 A3 B3 always contains an A1 B1 B2 A3 implicit antimetabole

Future directions
Invitation to explore:
not only intralocutory (i.e. within 1 utterance) chiasms, but also translocutory ones (within multiple successive utterances)
relations to variation sets and Winograd schemata
multi-lingual analysis (are these beasts universal ?)
ontogenetic relation to other figures like the rhetoric question or even the metaphor (METAPHOROS = "carry over")
informational content of chiasms (known components + unknown order = maximal amount of new info ?)
neurocognitive aspects of chiasm processing (focus upon the cyclical referential closure between the initial and terminal token of the sequence)
neurorhetoric hypothesis: look for a P600-like evoked potential following the exposure to a chiasmus
non-linguistic chiasmata (musical, visual, spatial, anatomical, social, moral, emotional, sexual, spiritual etc.)

Conclusion
Starting discussion with conclusion often concludes the discussion...
Ergo, no ultimate conclusion without juicy discussion.
daniel@wizzion.com thanks Thee for Thy attention Reproducible Identification of Pragmatic Universalia in CHILDES Transcripts Daniel Devatman Hromada1,2,3 1 2 Université Paris Lumières - France Slovak University of Technology – Bratislava - Slovakia 3 Berlin University of the Arts – Berlin - Germany Abstract This article presents method and results of multiple analyses of the biggest publicly available corpus of language acquisition data : Child Language Data Exchange System. The methodological aim of this article is to present a means how science can be done in a highly positivist, empiric and reproducible manner consistent with the precepts of the “Open Science” movement. Thus, a handful of simple one-liners pipelining standard GNU tools like “grep”, and “uniq” is presented - which, when applied on myriads of transcripts contained in the corpus – can potentially pave a path towards identification of statistically significant phenomena. Relative frequencies of occurrence are analyzed along age and language axes in order to help to identify certain concrete, pragmatic universalia marking different stages of linguistic ontogeny in human children. One can thus observe significant culture-agnostic decrease of laughing in child-produced speech and child-directed indo-european “motherese” occurrent between 1st and 2nd year of age; maternal increase in production of pronoun denoting 2nd person singular “you”; increase of usage of 1st person singular “I” in utterances produced by children around 3rd years of age and marked decrease of the same which takes place around 6 years of age. Other significant correlations both intra-cultural between English mothers and children, as well as inter-cultural - are pointed down always accompanied with thorough descriptions methodology immediately reproducible on an average computer. 1. Introduction Reproducibility is one of the hallmark principles of occidental science. Being based upon the philosophy of ancient greeks who were fully aware that only the knowlede of that, which repeats itself in many instances, can lead to generic and transtemporal ἐπίσταμαι, the western scientific method necessarily considers reproducibility as its main condition sine qua non. In words of the foremost figure of modern epistemology, "non-reproducible single occurrences are of no significance to science" (Popper, 1992). Hence the primary, epistemological, objective of this article is to show how anyone willing to do so can perform reproducible analyses and experiments regarding the phenomena traditionally falling into the scope of corpus, computational and developmental linguistics. This objective is to be quite naturally attained if ever three precepts are stringently followed : • use publicly available data • analyse the data with simple, specific yet powerful tools which are well-known to widest possible public • faithfully protocol the exact procedure of usage of these tools In more concrete terms, we promote the idea that - in regards to analysis of statistical textual data - core GNU (Stallman, 1985) utils and commands as well as basic operators and core JADT 2016 : 13ème Journées internationales d’Analyse statistique des Données Textuelles 2 DANIEL DEVATMAN HROMADA functions of open source langages like PERL (Wall, 1990) or R (Team, 2013) indeed offer such "simple, specific yet powerful tools well-known to widest possible public". 
When it comes to the precept " faithfully protocol the usage of these tools ", it shall be implemented - in this article and potentially beyond – in a following manner : every simple transformation of data is to be completely and exhaustively described in a footnote which accompanies the description of the transformation. By " simple ", we mean such a transformation which can be described as a simple standard UNIX shell 1 one-liner pipelining combining together core commands like " grep ", " uniq " or " sort ". In case of more complex transformations, the complete source code of program is always to be furnished either in publications's appendix or at least as an URL reference. To assure the highest possible reproducibility of the experiment, the snippet should not call any modules and libraries external to language's core distribution (e.g. no CPAN resp. CRAN). The most important thing, however, is not to forget that the protocol is to be complete, exhaustive and unambigous. That is, .history of all steps is to be described in the form which is immediately executable on a standard GNU-positive machine. All means all : from the very fact of downloading2 the corpus from a publicly available source to the very act of plotting the legend on a figure which is then disseminated among scientific communities. Given that these precepts are followed and under the conditions that • the analysis is fully deterministic (i.e. does not involve any source of stochasticity) • the source corpus has not changed in the meanwhile it can be expected that the same analysis shall bring the same results no matter whether it is executed in other folder of the same computer (e.g. reproducibility across directories) ; executed on different computers (e.g. reproducibility across experimental apparatus) and|or executed by different experimentator (e.g. experimentator-independent reproducibility). 2. Corpus & Method Child Language Data Exchange System (CHILDES) undoubtably belongs among most fascinating language-related corpora. Established by (MacWhinney and Snow, 1985) more than 30-years ago and including transcripts dating back to 1960s, CHILDES does not cease to be the biggest public repository of child language acquisition and development data. Thus, asides huge volumes of audio and video recordings of verbal interactions with children, CHILDES also contains more than thirty thousand distinct transcripts. Transcript themselves are encoded in UTF-8 compliant plaintext .CHA files. These files follow a CHAT format specified in (MacWhinney, 2012). Every transcript contains a header describing specificities facts concerning the transcribed scenario – e.g. the age of a child, identities of participants (lines beginning with *CHI denote utterances produced by children; lines beginning with *MOT denote utterances produced by their mothers). Unfortunately, different linguists have followed the CHAT manual in a different manner. For example, some include the timestamp information into their corpus and some not. Some mark the repetition by special tokens like [x 2] (for duplication) or [x 3] (for triplication) and some $ echo 'All footnote-descriptions of shell one-liners begin with the sign $ and all footnote-descriptions of R commands begin with sign >.' 1 It is highly recommended to use standard utilities like "wget " or "curl " for that purpose. 
2 JADT 2016 : 13ème Journées internationales d’Analyse statistique des Données Textuelles [REPRODUCIBLE IDENTIFICATION OF PRAGMATIC UNIVERSALIA IN CHILDES TRANSCRIPTS] 3 transcribe the utterance as such, without using such tokens. And yet another set of differences necessarily originates in transcriber's own perception and habits. For example: while the token “mama” is occurrent in 1405 child utterances contained in English sections of the corpus3, some other English transcribers (e.g. Haggerty or Suppes) apparently prefered to transcribe the mother-directed vocative as “mamma” - this occurs in 126 distinct utterances. Be it as it may, the CHILDES corpus is already so huge that one may except that a well constituted and unbiased quantitative analysis could potentially allow the discovery of phenomena robust to any surface perturbations (e.g. differences in habits and styles of different investigators etc.). In other terms, if every transcript is understood as a result of a distinct act of sampling, then it can be expected that the statistical aggregation of such a huge amount of distinct samples (> 30000 distinct transcripts) could let to situation where the noise cancels itself out and statistically significant phenomena emerge. And individual CHILDES transcripts are indeed distinct. Not only because dozens, if not hundreds researchers and investigators of at least three or four generations had already directly participated on constitution of the corpus. Not only because majority of transcripts were in one way or another related to a specific research project with a goal unrelated to goals of other projects. But also because investigators themselves, as well as the investigated subjects (e.g. children), often stem from huge variety of distinct cultural backgrounds. More concretely: 26 languages are included in the corpus, covering practically majority of main terran language strata (i.e. indo-european languages, asian languages, semitic, altaic and ugrofinic languages etc.). This allows for trans-cultural analysis and such shall indeed be all analysis presented in the section 4. 2.1 Metrics Results can be mutually compared and communicated only if they are expressed in common units. In case of all experiments presented in this article, the relative frequency - interpreted as the probability of occurrence - of pattern X is such a unit. This is equivalent to absolute frequency of occurrence of FX normalized by the total number of utterances, i.e. PX = FX / Nutterances Ideally, for every month mentioned in the CHILDES corpus should correspond one P X value. To understand our approach more clearly, imagine, for example, in case of hypothethic language whose speakers utter 100 utterances each month since their birth until their tenth birthday. If such speakers utter the token " dog " twenty times every month, than the value of all 120 (i.e. 10 years * 12 months) datapoints describing the time series for this particular token would be constantly equal to 100/20 = 20% = 0.2. It is principially due to such trivial nature of the calculus hereby presented that the core datamining procedures can be performed directly on the BASH command-line. 
3.2 Preprocessing Four hundred and sixty-seven megabytes of data compressed in 983 zip files are obtained after the corpus has been downloaded from its original source4 or from a mirror site which 3 $ grep "mama" child/*Eng* |wc -l; grep "mamma" child/*Eng* |wc -l 4 $ wget -P CHILDES -e robots=off --no-parent --accept '.zip' -r http://childes.psy.cmu.edu/data/ JADT 2016 : 13ème Journées internationales d’Analyse statistique des Données Textuelles 4 DANIEL DEVATMAN HROMADA represents state of CHILDES as of February 6th 20165. After these files are recursively decompressed6, the CHILDES arborescent structure is flattened so that all .CHA files are contained within one sole directory7. A following one-liner subsequently “peeks into” each .CHA file, retrieves child's age from it and puts this information into files' name8. Utterances containing only xxx and www tokens – which, according to CHILDES manual denote “unintelligible words with an unclear phonetic shape” resp. “untranscribed material” are removed from all child and mother transcripts 9. Next step is executed only to speed-up following pattern extraction processes: child utterances are funnelled into simplified transcripts stored in “CHI” subdirectory and maternal utterances are funnelled into “MOT” subdirectory 10 . Translocutory information is thus lost but this is allowed for the purpose of this article in which we shall focus solely on relative frequencies of certain tokens and not on more complex discourse units. All this yields 5833656 lines (e.g. utterances) contained in 29180 non-empty simplified transcripts stored in “child” directory and 3798005 lines contained in 13590 non-empty simplified transcripts stored in the “mother” directory. Note that metadata like age (years and months), language group, language and CHILDES investigator's identity are stored directly in the simplified transcript's filename. Workbench common to all following analyses can be thus considered as ready. 3. Analyses 3.1. First Analysis – Laughing It has been recently indicated that English mothers interacting with children younger than 16 months tend to laugh significantly more often than mothers which interact with children between 16-31 months of age (p.222, Hromada, 2015). Our 1st analysis will use CHILDES to address this hypothesis from a trans-cultural perspective. It may be surprising to use a dataset, which is essentially a linguistic corpus for, a purpose of study of such a non-verbal means of communication as laughing definitely is. But the very CHAT manual (p.62, MacWhinney, 2012) explicitely specifies the &=laughs marker as a most common standardized spelling denoting a specific extralinguistic event. Unfortunately, within the totality of CHILDES corpus, the marker itself &=laughs is not the only standardized form denoting the phenomenon and some authors prefered to use markers 5 $ wget -P CHILDES -e robots=off --no-parent --accept '.zip' -r WILL-BE-GIVEN-IN-CAMERA-READY-VERSION 6 $ find CHILDES/data -name "*.zip" | while read filename; do unzip -o -d "`dirname "$filename"`" "$filename"; done 7 $ mkdir CHILDES_flat; find CHILDES/data -type f |perl -n -e 'chomp; if (/\.cha/) {$f=$_; s/\//-/g; s/\.-data-//g; `cp $f ./CHILDES_flat/$_`;}'; cd CHILDES_flat; 8 $ mkdir aged; grep -P '\|\d;\d' *| grep Child | perl -n -e 'chomp; `cp $1 aged/$2-$3-$1` if /^(.*?):.*0?(\d+);0?(\d+)/;' ; rm *.cha 9 $ perl -ni -e 'print if $_!~/^\*(MOT|CHI):\t(xxx|www) ?\./' aged/* 10 $ mkdir CHI; cp aged/* CHI; sed -i '/\*CHI/! 
d' CHI/*; mkdir MOT; cp aged/* MOT; sed -i '/\*MOT/! d' MOT/*; JADT 2016 : 13ème Journées internationales d’Analyse statistique des Données Textuelles [REPRODUCIBLE IDENTIFICATION OF PRAGMATIC UNIVERSALIA IN CHILDES TRANSCRIPTS] 5 like [=! laughing]. Hence, for a purpose of our 1st analysis, we have simply used the token laugh as the one whose frequencies of occurrence we have decided to measure. Three indo-european (english, french and farsi) and two non-indo-european languages (japanese and chinese) were chosen in order to address the developmental trajectory of laughing from a trans-cultural perspective. For each among these langages, a target investigator was identified as the one who most frequently used the marker laugh in his transcripts of motherese11. Corpus subsections " Farsi-Family ", "French-MOR-York ", " Japanese-MiiPro " and " Chinese-Beijing " were thus identified as such target subsections. All English-language transcripts (i.e. such files whose filename contains the token " Eng ") were also taken into account. The core of the procedure is as follows: total amount of utterances is obtained, for each month and each target subsection of the corpus, by a one-liner 12 which redirects its output into a file whose every row contains three space-separated columns: first column denotes the denotes the value of Nutterances and second and third column denote the year resp. month. The procedure is to be repeated ten times alltogether, five for each target corpus subsections multiplied by two possible locutor values of the locutor variable (MOT13 or CHI14). Follow ten executions of a command sequence which generate 10 files containing absolute frequencies of occurrence of the token laugh within five different corpus sections – and again for both MOT15 and CHI16 locutors - which are aggregated according to child's age in the moment when laughing was noted down by the CHILDES investigator. And that's it: all result-containing files can now serve furnish input datasets for the R code which produces a plot displayed on adjacent figure. 11 Probability that laughing accompanies or substitutes an utterance produced by, or directed to, a child of specific age. $ grep laugh MOT/*French* | grep -o -P '\-French\-.+\-' | sort | uniq -c ; grep laugh MOT/*Farsi* | grep -o -P '\-Farsi\-.+\-' | sort | uniq -c ; grep laugh MOT/*Japanese* | grep -o -P '\-Japanese\-.+\-' | sort | uniq -c ; grep laugh MOT/*Chinese* | grep -o -P '\-Chinese\-.+\-' | sort | uniq -c ; 12 $wc -l MOT/*Farsi-Family* |perl -e 'while (<>) { s/MOT\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)- (\d+)/; print "$h{$_} $1 $2\n";}' >exp1.MOT.Farsi-Family.N 13 $wc -l MOT/*Eng* |perl -e 'while (<>) { s/MOT\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' >exp1.MOT.Eng.N 14 $wc -l CHI/*Eng* |perl -e 'while (<>) { s/CHI\///; /(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' >exp1.CHI.Eng.N 15 $grep laugh MOT/*Eng* |perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp1.MOT.Eng.F 16 $grep laugh CHI/*Eng* |perl -n -e '/CHI\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp1.CHI.Eng.F JADT 2016 : 13ème Journées internationales d’Analyse statistique des Données Textuelles 6 DANIEL DEVATMAN HROMADA Potentially the most salient phenomenon is a marked decrease in production of laughs which occur between birth and second year of age. 
Potentially the most salient phenomenon is a marked decrease in the production of laughs which occurs between birth and the second year of age. This could potentially be explained in terms of a gradual switch from non-linguistic means of communication towards more verbal interactions. However, in the case of the child-directed speech of Japanese motherese, the relative frequency of laughing seems to increase during the same period, and in the case of Chinese the decline is much less marked than in the case of the Indo-European languages. This may suggest an intercultural difference – a hypothesis which is further corroborated by the fact that it is only in the case of the Indo-European languages that the "dotted" lines cross the "solid" lines. Id est, little English-, French- and Farsi-speaking children tend to laugh more often than their mothers, but older children seem to laugh less frequently than their mothers. This discrepancy notwithstanding, the relative frequencies of the CHI time series significantly correlate with the MOT time series both in English (Pearson's correlation coefficient 0.933, t = 7.36, df = 8, p-value = 7.886e-05) and in Farsi (corr. coef. 0.972, t = 5.9224, df = 2, p-value = 0.02735). In French, the correlation is quite close to the significance threshold (t = 4.1692, df = 2, p-value = 0.053, cor. coef. = 0.947) when the data is aggregated into year-sized packages, but is insignificant (t = -1.1598, df = 27, p-value = 0.2563) when the time series are correlated at monthly granularity. No statistically significant correlation between the child-produced and mother-produced laugh time series has been observed in the case of Japanese or Chinese.

3.2. Second Analysis – 2nd person singular

It has also been indicated that English mothers interacting with their children tend to use the pronoun of the 2nd person singular, "you", much more frequently than is the case in standard linguistic communication (p. 218, Hromada, 2015). Similarly to our 1st analysis, our 2nd analysis uses CHILDES to address this hypothesis from a trans-cultural perspective. The procedure is thus very similar to the one already presented, with one major difference: we do not focus on the assessment of occurrences of one standard marker (e.g. "laugh") which is present in different corpus sections; rather, we look in each specific subcorpus for a specific Perl Compatible Regular Expression (PCRE_2p.sg) which matches nominative forms of the 2nd person singular in the language of the subcorpus under study. The following table lists such PCREs for seven languages.

Language   PCRE_2p.sg
English    [ \t]you[' ]
French     [\t ]t(u |oi |')
Farsi      [\t ]to
Polish     [\t ]ty
Chinese    (你|ni3)
Estonian   [\t ]s(in)?a
Hebrew     [\t ]ata?

Usage of these regexes within one-liners employing the case-insensitive "grep" allows us to obtain distributions of relative frequencies independently for MOT17 and CHI18 utterances. The command sequence yielding the distributions of N_utterances19 is practically the same as in the first analysis (cf. footnotes 13 & 14), the only difference being that this time we do not focus on subcorpora representing transcripts made by specific target investigators, but rather process much bigger datasets containing all transcripts representing the language under study.

17 $ grep -i -P "[\t ]you[' ]" MOT/*Eng* |perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp2.MOT.Eng.F
18 $ grep -i -P "[\t ]you[' ]" CHI/*Eng* |perl -n -e '/CHI\/(\d+)-(\d+)/; print "$1 $2\n"' |uniq -c >exp2.CHI.Eng.F
19 $ wc -l CHI/*Farsi* |perl -e 'while (<>){s/CHI\///;/(\d+) (\d+-\d+)-/;$h{$2}+=$1;} for (sort keys %h){/(\d+)-(\d+)/;print "$h{$_} $1 $2\n";}' >exp2.CHI.Farsi.N
20 > cor.test(aggregated_mot_lang1[,6]/aggregated_mot_lang1[,3], aggregated_chi_lang1[,6]/aggregated_chi_lang1[,3], method="kendall")
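Footnotes 17 and 18 show the English case; one short Perl loop over the language/PCRE pairs of the table can generate all frequency files at once. The following is a hedged sketch, not the published code: only three languages of the table are spelled out (the others would be added analogously), and the directory and file-naming conventions are those of section 2.2.

use strict;
use warnings;

# language => PCRE pairs taken from the table above
my %pcre_2sg = (
    Eng    => qr/[ \t]you[' ]/i,
    French => qr/[\t ]t(u |oi |')/i,
    Farsi  => qr/[\t ]to/i,
);

for my $lang (sort keys %pcre_2sg) {
    for my $locutor ('MOT', 'CHI') {
        my %count;
        for my $file (glob "$locutor/*$lang*") {
            # age metadata (years, months) is encoded in the filename
            my ($y, $m) = $file =~ /\Q$locutor\E\/(\d+)-(\d+)/ or next;
            open my $fh, '<', $file or die "cannot open $file: $!";
            while (<$fh>) { $count{"$y $m"}++ if /$pcre_2sg{$lang}/ }
            close $fh;
        }
        open my $out, '>', "exp2.$locutor.$lang.F" or die $!;
        print {$out} "$count{$_} $_\n" for sort keys %count;
        close $out;
    }
}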
The F_PCRE_2p.sg and N_utterances distributions are subsequently processed by R code which is, mutatis mutandis, identical to the R snippet used in analysis 1. This yields Figure 2.

A phenomenon common to all languages under study can be observed practically immediately. That is, on all six solid MOT lines one can observe, between the first and the fourth year of the child's age, a marked increase in the maternal usage of the 2nd person singular. Sometimes such an augmentation is less marked (as in French), sometimes it comes later (between the 2nd and 3rd year of age in the case of Farsi and Hebrew), but it always comes. And it always reaches its all-time high before the fifth year of age, after which the maternal usage of "you" tends to slowly converge back to its "normal" levels. Note also that in English motherese, "you" is used in approximately every fifth utterance. What is also striking in regard to the English language – which is definitely the biggest CHILDES subcorpus – is the quite significant correlation between the time series representing the usage of 2p.sg. by mothers and the time series representing the usage of 2p.sg. by the children themselves (Pearson's cor. coeff. = 0.768, t = 3.393, df = 8, p-value = 0.009451; Kendall's τ = 0.6, T = 36, p-value = 0.01667;20 Spearman's ϱ = 0.733, S = 44, p-value = 0.02117).

3.3. Third Analysis – 1st person singular

Our 3rd analysis is identical to the second; the only things which change are the PCRE patterns, which are this time supposed to match nominative forms of the pronouns denoting the 1st person singular. Id est the ego, the self-reference, the "I". The following table lists such PCREs for seven languages and their respective CHILDES subcorpora.

Language   PCRE_1p.sg
English    [ \t]I[' ]
French     [\t ](j(e |')|moi)
Farsi      [\t ]m[aæe]n
Polish     [\t ]ja
Chinese    (我|wo3)
Estonian   [\t ]m(in)?a
Hebrew     [\t ]ani

Everything else – from the extraction of the absolute frequencies of forms matched by the PCREs all the way to aggregating, normalizing and plotting – is, mutatis mutandis, identical to the 2nd analysis. This leads to the visualisation presented at the bottom of this page. An interesting phenomenon can be noticed: while in early infancy mothers of all language backgrounds use 1p.sg. much more frequently than children (probably because children are still in a pre-linguistic stage), the difference is swiftly and strongly counteracted. Hence, around three years of age, children of all21 cultures tend to produce 1p.sg. much more frequently than their mothers. But not only augmentations of use, also diminutions are of certain scientific interest. Hence, a steep decline in the use of 1p.sg. can be observed between the 6th and 7th year of age – that is, during the period when children enter school, which marks the offset of the ontogenetic stage which Piaget (1951) labeled "egocentric". Similarly to the 2nd analysis, a significant correlation between the time series representing the production of "I" by English-speaking mothers and the production of "I" by English-speaking children can be observed (Kendall's τ = 0.555, T = 35, p-value = 0.02861).

21 With the exception of the Polish language, for which we unfortunately lack motherese data from the 3rd birthday onwards.
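The correlation coefficients reported in this and the following paragraphs were obtained with R's cor.test (cf. footnote 20). For completeness, a self-contained Perl equivalent of the Pearson coefficient alone – a sketch which omits the t-statistic and p-value machinery that cor.test provides – might look as follows.

use strict;
use warnings;

# Pearson's correlation coefficient between two equally long series
sub pearson {
    my ($x, $y) = @_;                  # two array references
    my $n = @$x;
    die "series differ in length" unless $n == @$y && $n > 1;
    my ($sx, $sy) = (0, 0);
    $sx += $_ for @$x;
    $sy += $_ for @$y;
    my ($mean_x, $mean_y) = ($sx / $n, $sy / $n);
    my ($cov, $vx, $vy) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my ($dx, $dy) = ($x->[$i] - $mean_x, $y->[$i] - $mean_y);
        $cov += $dx * $dy;
        $vx  += $dx ** 2;
        $vy  += $dy ** 2;
    }
    return $cov / sqrt($vx * $vy);
}

# toy usage on two short monthly series
printf "r = %.3f\n", pearson([1, 2, 3, 5], [2, 4, 5, 9]);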
What's more, the plot indicates a path towards the identification of statistically significant intercultural correlations. Thus, after filling the gap22 in the Chinese dataset related to the fact that CHILDES does not seem to contain transcripts of Chinese 8-year-olds, one observes a correlation23 between the time series of the relative frequencies of 1p.sg produced by French and Chinese children (Kendall's τ = 0.511, T = 29, p-value = 0.02474). Idem for English and French (Kendall's τ = 0.777, T = 32, p-value = 0.002425). And if one stays faithful to the canonic p<0.05 precept (Fisher, 1925) and opts for Spearman's rho or Pearson's coefficient rather than for Kendall's tau, then, for example, also for French and Polish (Pearson coef. = 0.837, t = 3.4219, df = 5, p-value = 0.0188; Kendall's τ = 0.619, T = 17, p-value = 0.06905; Spearman's ϱ = 0.785, S = 12, p-value = 0.04802) as well as for Polish and Hebrew (Pearson coef. = 0.759, t = 2.6117, df = 5, p-value = 0.04757; Kendall's τ = 0.619, T = 17, p-value = 0.06905; Spearman's ϱ = 0.786, S = 12, p-value = 0.0480224).

4. Discussion

It is a common practice in contemporary Corpus Linguistics in general, and in Natural Language Processing in particular, to focus fully on the formal and theoretical properties of one's model or analysis. Thus, the majority of publications in these domains limit themselves to the dissemination of the few core formulas behind the presented analysis plus the results which were obtained (F-scores etc.). In an atmosphere where sharing one's code with the community is more an exception than a rule, it is not surprising that the majority of publications disregard the concrete aspects of the implementation and execution of an analysis as unworthy of interest. Such an attitude can be excusable when one attacks a highly specific engineering problem. But in regard to analyses aiming at general knowledge – id est, when doing fundamental research or exploratory science – such an approach is to be discarded as inconsistent with the ideal of experimenter-independent reproducibility.

In this article, we have explained how cost-efficient (i.e. as free as open source software), reproducible and transparent science can be performed at the very border of corpus and developmental psycholinguistics. More concretely, in the footnotes of this article we have presented fewer than two dozen one-liners which pipeline and combine PCREs (Wall, 1990; Hromada, 2011) with core GNU utilities like "grep", "uniq", "wc" and "sort". Aside from this, a snippet of a few dozen lines of beginner-level, non-optimized R code is hereby published25 in order to furnish a complete description – i.e. from downloading the corpus from its publicly available source all the way to the final plots and correlation coefficients – of the three experiments performed. Common to these three experiments was a preprocessing phase which purified and repartitioned hundreds of megabytes of data contained in CHILDES.

22 > aggregated_chi_lang4[9,]=(aggregated_chi_lang4[7,]+aggregated_chi_lang4[8,])/2
23 > cor.test(aggregated_chi_lang2[,6]/aggregated_chi_lang2[,3], aggregated_chi_lang4[,6]/aggregated_chi_lang4[,3], method="kendall")
24 > cor.test(aggregated_chi_lang6[,6]/aggregated_chi_lang6[,3], aggregated_chi_lang5[,6]/aggregated_chi_lang5[,3], method="spearman")
25 http://wizzion.com/code/jadt2016/childes.R
The result of this phase were two directories: CHI, which contains the utterances produced by children, and MOT, which contains motherese utterances (cf. section 2.2). The principal motivation behind this repartitioning was the speed-up of any subsequent analysis. For example, the 3rd analysis – when executed on one sole core of a 3.2 GHz PC with 8 GB of RAM and with the CHILDES data stored on an SSD disk (a fairly standard configuration) – didn't last more than 15 seconds, all the way from matching the first regular expression on the first line of the first transcript to R's final plotting.

Mentioning regular expressions, we consider it important to reiterate that regexes like those implemented in Perl, the PCREs, seem to us to be much more than impressive yet weird character sequences that no neophyte can read. Unambiguously denoting what they should denote – i.e. a specific set of character sequences; a specific pattern, schema and form – PCREs are formalisms in their own right (Hromada, 2011). Idem for shell commands and Perl or R instructions: they too are unambiguous formalisms, and for the purposes of NLP they can turn out to be at least as worthy as other formalisms.

Formalisms, tools and methodology being thus defined by a concrete example, a question can be posed: "What should be the name of a discipline which implements such a method and uses such tools?" And given that what was done here used techniques common to textometry in order to address topics common to developmental psycholinguistics (Tomasello, 2009), an answer could potentially sound: "Textometric Psycholinguistics". It is only now – with the toolbox specified, the method reproducible, and the scope of interest of the discipline properly delimited – that a discussion about culture-independent anthropological constants occurring in adult-child verbal and pre-verbal interactions – id est a discussion about "linguistic universalia" and their meaning, a discussion among savants – can, hopefully, begin.

References

Fisher, Ronald Aylmer. (1925). Statistical Methods for Research Workers. Genesis Publishing Pvt Ltd.
MacWhinney, Brian & Snow, Catherine. (1985). The child language data exchange system. Journal of Child Language, 12(02), 271-295.
MacWhinney, Brian. (2012). The CHILDES Project: Tools for Analyzing Talk – Electronic Edition. Part 1: The CHAT Transcription Format.
Piaget, Jean. (1951). Principal Factors Determining Intellectual Evolution from Childhood to Adult Life. Columbia University Press.
Popper, Karl. (1992). The Logic of Scientific Discovery. Routledge, London.
Hromada, Daniel Devatman. (2011). Initial experiments with multilingual extraction of rhetoric figures by means of PERL-compatible regular expressions. RANLP Student Research Workshop, 85-90.
Hromada, Daniel Devatman. (2015). Conceptual Foundations: Intramental Evolution & Ontogeny of Toddlerese. In press.
Stallman, Richard. (1985). The GNU Manifesto.
R Core Team. (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Tomasello, Michael. (2009). Constructing a Language: A Usage-Based Theory of Language Acquisition.
Harvard University Press.
Wall, Larry. (1990). PERL: Practical Extraction and Report Language.

CAN EVOLUTIONARY COMPUTATION HELP US TO CRIB THE VOYNICH MANUSCRIPT?

by Daniel Devatman Hromada

0.1 abstract

The Voynich Manuscript is a corpus of unknown origin, written down in a unique graphemic system and potentially representing the phonic values of an unknown, potentially even extinct, language. Departing from the postulate that the manuscript is not a hoax but rather encodes authentic contents, our article presents an evolutionary algorithm which aims to find the most optimal mapping between Voynichese glyphs and candidate phonemic values. The core component of the decoding algorithm is the maximization of a fitness function which aims to find the most optimal set of substitution rules allowing to transcribe the part of the manuscript which we call the Calendar into lists of feminine names. This leads to sets of character substitution rules which allow us to consistently transcribe dozens among the three hundred Calendar tokens into feminine names: a result far surpassing both "popular" as well as "state of the art" tentatives to crack the manuscript. What's more, by using name lists stemming from different languages as potential cribs, our "adaptive" method can also be useful in identifying the language in which the manuscript is written. As far as we can currently tell, the results of our experiments indicate that the Calendar part of the manuscript contains names from Balto-Slavic, Balkanic or Hebrew language strata. Two further indications are also given. Primo, the highest fitness values were obtained when the crib list contains names with a specific infix at the token's penultimate position, as is the case, for example, for Slavic feminine diminutives (i.e. names ending with -ka and not -a); in the most successful scenario, 240 characters contained in 35 distinct Voynichese tokens were successfully transcribed. Secundo, in the case of a crib stemming from the Hebrew language, the whole adaptation process converges to significantly better fitness values when transcribing Voynichese tokens whose order of individual characters has been reversed, and when lists of feminine and not masculine names are used as the crib.

0.2 introduction

The Voynich Manuscript (VM) undoubtedly counts among the most famous unresolved enigmas of the medieval period. On approximately 240 vellum pages, currently stored as manuscript (MS) 408 in Yale University's Beinecke Rare Book and Manuscript Library, VM contains many images apparently related to botanics, astronomy (or astrology) and bathing. Written aside, above and below these images are bulks of sequences of glyphs. All this is certain. Also certain seems to be the fact that in 1912 VM was re-discovered by the Polish book-dealer Wilfrid Voynich in a large palace near Rome called Villa Mondragone. Alongside the VM itself, Voynich also found correspondence – dating from 1666 – between the Collegio Romano scholar Athanasius Kircher and the contemporary rector of Charles University in Prague, Johannes Marcus Marci. Other attested documents – e.g. a letter from 1639 sent to Kircher by the Prague alchemist Georg Baresch – also indicate that during the first half of the 17th century VM was to be found in Prague. The very same correspondence also indicates that VM was acquired by the famous patron of arts, sciences and alchemy, Emperor Rudolf II.1

1 Savants who passed through Rudolf's court included Johannes Kepler, Tycho Brahe and Giordano Bruno. The last is known to have sold a certain book to the emperor for 600 ducats.
Aside from this, one more fact can be stated with certainty: the vellum of VM was carbon-dated to the early 15th century (Hodgins, 2014).

0.2.1 pre-digital tentatives

Already during the pre-informatic era of the first half of the 20th century, dozens, if not hundreds, of men of distinction invested a non-negligible part of their lives into tentatives to decipher the "Voynichese" script. Though highly popular in their time, many such tentatives – like that of Newbold, who claimed to "prove" that VM was encoded by Roger Bacon by means of a 6-step anagrammatic cipher (Newbold, 1928b), or that of Strong (Strong, 1945), who claimed VM to be a 16th-century equivalent of the Kinsey Report – may seem, when looked upon through the prism of computer science, somewhat irrational.2 C.f. (d'Imperio, 1978) for an overview of other 20th-century "manual" tentatives which resulted in VM-decipherment claims. After describing these tentatives, and after presenting an informationally very rich introduction to both VM and its historical context, d'Imperio adopts a sceptical stance towards all scholars who associated VM's origin with the personage of Roger Bacon.3 Sceptic though she was, d'Imperio did not a priori disqualify the set of hypotheses according to which the language in which VM was ultimately written was Latin or medieval English. And such, indeed, was the majority of the hypotheses which gained prominence all along the 20th century.4

2 Note, for example, Strong's "translation" of one VM passage: "When the contents of the veins rip, the child comes slyly from the mother issuing with leg-stance skewed and bent while the arms, bend at the elbow, are knotted like the legs of a crawfish." (Strong, 1945) Note also that such a translation was the product of a man who was "a highly respected medical scientist in the field of cancer research at Yale University" (d'Imperio, 1978).
3 "I feel, in sum, that Bacon was not a man who would have produced a work such as the Voynich manuscript...I can far more easily imagine a small society perhaps in Germany or Eastern Europe" (d'Imperio, 1978, 51).
4 Note that such a pro-English and pro-Latin bias can easily be explained not by the properties of VM itself, but by the simple fact that the first batches of VM's copies were primarily distributed and popularized among Anglo-Saxon scholars of medieval philosophy, classical philology or occidental history.

0.2.2 post-digital tentatives

The first tentatives to use machines to crack VM date back to the prehistory of the informatic era. Thus, already during the 2nd World War, the cryptologist William F. Friedman invited his colleagues to form an "extracurricular" VM study group – programming IBM computers for the sorting and tabulation of VM data was one among its tasks. Two decades later – already in the position of the first chief cryptologist of the nascent National Security Agency – Friedman formed a 2nd study group. Again without ultimate success. One member of Friedman's 2nd study group was Prescott Currier, whose computer-driven analysis led him to the conclusion that VM in fact encodes two "statistically distinct" (Currier, 1970) languages. What's more, Currier seems to have been the first scholar who facilitated the exchange and processing of the Voynich manuscript by proposing a transliteration5 of Voynichese glyphs into standard ASCII characters. This was the predecessor of the European Voynich Alphabet (EVA) (Landini and Zandbergen, 1998), which has become a de facto standard when it comes to mapping VM glyphs onto a set of discrete symbols. The canonization of EVA, combined with the dissemination of VM's copies through the Internet, has allowed more and more researchers to transcribe the sequences of glyphs of the manuscript into ASCII EVA sequences.

5 In this article we distinguish transliteration and transcription. Transliteration is a bijective mapping from one graphemic system into another (e.g. VM glyphs are transliterated into ASCII's EVA subset). Transcription is a potentially non-bijective mapping between symbols on one side and sound- or meaning-carrying units on the other.
It is thanks to the laborious transcription work of people like Rene Zandbergen, Jorge Stolfi or Takeshi Takahashi that the verification or falsification of VM-related hypotheses can nowadays be automatized to a great extent. For example, Stolfi's analyses of the frequencies of occurrence of different characters in different contexts indicated that the majority of Voynichese words seem to implement a sort of tripartite crust-core-mantle (or prefix, infix, suffix) morphology. A later study indicated that the presence of such morphological regularities could be explained as the output of a mechanical device called the Cardan grille (Rugg, 2004). The "hoax hypothesis" is also supported by the study (Schinner, 2007) which suggested that "the text has been generated by a stochastic process rather than by encoding or encryption of language". Pointing in a similar direction, that analysis also concludes that "glyph groups in the VM are not used as words". On the other hand, a methodology based on "first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series", presented in (Amancio et al., 2013), led its authors to the conclusion that VM "is mostly compatible with natural languages and incompatible with random texts". Simply stated, the way diverse "words" are distributed among the different sections of VM indicates that these words carry certain semantics. And this indicates that VM, or at least certain parts of it, is not a hoax.

0.2.3 our position

The results of (Amancio et al., 2013) made us adopt the conjecture "VM is not a hoax" as a sort of fundamental hypothesis accepted a priori. Surely, as far as we stand, it cannot be excluded that VM is the work of an abnormal person, of somebody who suffered severe schizophrenia or was chronically obsessed by internal glossolalia (Kennedy and Churchill, 2005). Nor can it be excluded that the manuscript does not encode full-fledged utterances but rather lists of indices, sequences or proper names of spirits-which-are-to-be-summoned, or sutra-like formulas compressed in a sort of private pidgin or sociolect. But given VM's ingenuity, given the effort which the author had to invest into the conception of the manuscript, and given a sort of "elegant simplicity" which seems to permeate it, we have felt, since our very first contact with the manuscript, a sort of obligation to interpret its contents as meaningful. That is, as having the capability of denoting objects outside of the manuscript itself.
As being endowed with the faculty of reference to the world (Frege, 1994) which we, 21st-century interpreters, still inhabit hundreds of years after VM's most plausible date of conception. It is with such a bias in mind that our attention was focused upon a certain regularity which we later decided to call "the primary mapping".

0.2.4 primary mapping

[Figure 1: Drawing from folio f84r containing the primary mapping.]

A condition sine qua non of any act of decipherment is the discovery of rules which allow to transform an initially meaningless cipher into meaningful information. In the most trivial case, such decipherment is facilitated by a sort of Rosetta Stone (Champollion, 1822) which the decipherer already has at his disposition. Since both the ciphertext as well as the plaintext (also called "the crib") are explicitly given by the Rosetta Stone, the discovery of the mapping between the two is usually quite straightforward. The problem with VM is, of course, that it seems not to contain any explicit key which could help us to decipher its glyphs. Thus, the only source of information which could potentially help us to establish reference between VM's glyphs and the external world are VM's drawings. One such drawing, present atop folio f84r, is shown in Figure 1.

Figure 1 displays twelve women bathing in eight compartments of a pool. Bathing women are a very common motive in VM, and there seems to be nothing peculiar about them. The fact that word-like sequences are written above the heads of these women is also trivial. One can, however, observe one regularity which seems to be interesting. That is, in case two women bathe in the same compartment, the compartment contains two word-like sequences; if one woman bathes in the compartment, there is only one word-like sequence written above her head. One figure – one word; two figures – two words. This principle is stringently followed and can be seen on other folios as well. What is more, the words themselves are sometimes similar, but they are not the same. Such trivial observations lead to a trivial conclusion: these word-like sequences are labels. And since these labels are juxtaposed to feminine figures, it seems reasonable to postulate that they are, in fact, feminine names. This is the primary mapping.

0.2.5 three conjectures

The method described in the following sections can be considered valid only under the assumption that the following conjectures hold:
1. "the primary mapping conjecture": Voynichese words beside feminine figures are feminine names
2. "diachronic stability of proper names": proper names are less prone to diachronic change than other language units
3. "occam's razor": instead of containing a sophisticated esoteric cipher, VM simply transmits a text written in an unknown script

Further reasons why we consider "the primary mapping conjecture" valid shall be given alongside our discussion of "the Calendar". When it comes to the conjecture postulating the "diachronic stability of proper names", we could potentially refer to certain cognitive peculiarities of how the human mind tends to treat proper names (Imai and Haryu, 2001).
Or we could direct the reader's attention to the fact that, for practically every human speaker, one's own name undoubtedly belongs among the most frequent and most important tokens heard or uttered during one's whole life – this can result in a sort of stability against linguistic change and allow the name to cross the centuries with a higher probability than words of lesser importance and frequency. But instead of pursuing the debate in this direction, let's just point out that the successful decoding of the Mycenaean Linear B script (Ventris and Chadwick, 1953) would have been much more difficult if certain toponyms like Amnisos, Knossos or Pylos had not succeeded in carrying their phonetic skeleton through the aeons of time. Last but not least, the "occam's razor conjecture" simply makes explicit the belief that a reasonable scientist should not opt to explain VM in terms of anagrams and opaque hermeneutic procedures if similar – or even more plausible – results can be attained when approaching VM as if it were a simple substitution cipher.

0.3 method

The core of our method is an optimization algorithm which looks for a candidate transcription alphabet A_x which, when applied upon the list of word types occurring in VM's Calendar section, yields an output list whose members should ideally be present in another list, called the Crib. The optimization is done by an evolutionary strategy: an individual chromosome encodes a candidate transcription alphabet, and the fitness function is given as the sum of the lengths of all tokens which were successfully transcribed from the Calendar to a specified Crib.

0.3.1 calendar

Six among the twelve words present in Figure 1 occur only on folio f84r. Six others occur on other folios as well, and five of these six words occur also as labels near feminine figures displayed on the 12 folios of the section commonly known as the "Zodiac". It is like this that our attention shifted from the limited corpus of the "primary mapping" towards the more exhaustive corpus contained in the Zodiac. Every page of the Zodiac displays multiple concentric circles filled with feminine figures. The attributes of these figures differ – some hold torches, some do not; some are bathing, some are not – but one pattern is fairly regular: beside every woman there is a star, and beside every star there is a word. While some authors postulate that these words are names of stars or names of days, we postulate that these words are simply feminine names.6 From Takahashi's transliterations of the twelve folios of the Zodiac we extract 290 tokens which instantiate 264 distinct word types. To avoid possible terminological confusion, we shall denote this list of 264 labels7 with the term Calendar. Hence, Zodiac is the term referring to folios f70v2–f73v, while the Calendar is simply a list of 264 labels. The total length of these 264 labels is 2045 characters, chosen from a 19-symbol (|A_cipher| = 19) subset of the EVA transliteration alphabet.

6 It cannot be excluded, however, that they are all of this at once. Note, for example, that in many central European countries it is still a fairly common practice to attribute specific names to specific days of the year, i.e. "meniny".
7 Available at http://wizzion.com/thesis/simulation0/calendar.uniq

0.3.2 cribbing

Cribbing is a method by means of which the hypothesis that the Calendar contains lists of feminine names can potentially lead to the decipherment of the manuscript. For if the Calendar is indeed such a list, then one can use lists of existing and attested feminine names as hypothetical target "cribs". In cryptanalytic terms, the intuition that the Calendar contains feminine names makes it possible to perform a sort of known-plaintext attack (KPA). We say "a sort of" because, in the case of VM, the "cribs" upon which we shall aim to map the Calendar are not known with 100% certainty.
Hence, it is perhaps more reasonable to understand the cribbing procedure as a plausible-plaintext attack (PPA). This being said, we label as "cribbing" a symbol-substituting procedure P_cribbing which replaces symbols contained in the cipher (i.e. in the Calendar) with symbols contained in the plaintext. Hence, not only the cipher but also the plaintext is an input of the cribbing procedure.

Every act of execution of P_cribbing can be followed by an act of evaluation of the usefulness of P_cribbing with regard to its inputs. The ideal procedure would result in a perfect match between the rewritten cipher and the plaintext, i.e. P_cribbing(cipher) == plaintext. On the other hand, a completely failed P_cribbing results in two corpora which have nothing in common. And between the two extremes of this spectrum, between "the ideal" and "the completely failed", one can place multitudes of other procedures, some closer to the ideal than others. This makes room for optimization.

0.3.3 optimization

All experiments described in the next section of this article implement an evolutionary computation algorithm strongly inspired by the architecture of the canonical genetic algorithm (CGA, P+46) (Holland, 1992; Rudolph, 1994). Hence, the initial population is randomly generated, and fitness-proportionate (i.e. "roulette wheel", P+42) selection is used as the main selection operator. But contrary to CGAs, our optimization technique does not implement a classical single-point crossover but rather a sort of "discrete crossover", which takes place only in case the parent individuals have different alleles of a specific gene (Listing 1).

Listing 1: Discrete cross-over

# discrete crossover: genes on which both parents agree are inherited
# directly; where the parents carry different alleles, the child
# inherits either allele with equal probability
my $child_genome;
my $i = 0;
for (@mother_genome) {
    if ($_ ne $father_genome[$i]) {
        rand() > 0.5 ? ($child_genome .= $mother_genome[$i])
                     : ($child_genome .= $father_genome[$i]);
    } else {
        $child_genome .= $mother_genome[$i];
    }
    $i++;
}

Another reason why our solution can be considered more similar to evolutionary strategies (Rechenberg, 1971) than to CGAs is the fact that it does not encode individuals as binary vectors (P+48). Instead, every individual represents a candidate monoalphabetic substitution cipher whose application could, ideally, transform the Calendar into a crib; the corresponding fitness function is shown in Listing 2.

Listing 2: Cipher2Dictionary adaptation fitness function

# fitness function: transcribe the whole Calendar with the candidate
# alphabet encoded in $individual, then sum up the character lengths of
# all distinct labels which, once transcribed, match a name in %crib;
# $i indexes the individual within the population (outer loop not shown)
my $text = $calendar;
my $old  = "acdefghiklmnopqrsty";
my %translit;
@translit{split //, $old} = split //, $individual;
$text =~ s/(.)/defined($translit{$1}) ? $translit{$1} : $1/eg;   # core transcription of Calendar content
my %matched;
for (split /\n/, $text) {
    my $token = $_;
    if (exists $crib{$token}) {
        # map the matched name back to its EVA form before storing it
        my %antitranslit;
        @antitranslit{split //, $individual} = split //, $old;
        $token =~ s/(.)/defined($antitranslit{$1}) ? $antitranslit{$1} : $1/eg;
        my $t = $token;
        $matched{$t} = 1;
    }
}
for (keys %matched) { $Fitness[$i] += length $_; }
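Listings 1 and 2 reproduce the crossover operator and the fitness function; the surrounding generational machinery is not printed in the article. What follows is a minimal sketch of the remaining operators – random initialization, roulette-wheel selection and per-gene mutation – under the parameters reported in section 0.4; the elitist step is omitted, the crib alphabet shown is a simplifying assumption, and the whole fragment is an illustration, not the author's actual implementation.

use strict;
use warnings;

my @cipher_alphabet = split //, "acdefghiklmnopqrsty";   # the 19 EVA symbols of the Calendar
my @crib_alphabet   = ('a' .. 'z');                      # assumption: the real slavic crib alphabet has 38 symbols
my ($pop_size, $mut_prob) = (5000, 0.001);               # parameters reported in section 0.4

# random initialization: every individual assigns one crib symbol
# to each of the 19 cipher symbols
my @population = map {
    join '', map { $crib_alphabet[rand @crib_alphabet] } @cipher_alphabet
} 1 .. $pop_size;

# fitness-proportionate ("roulette wheel") selection: returns the index
# of an individual drawn with probability proportional to its fitness
sub roulette {
    my ($fitness) = @_;              # array reference parallel to @population
    my $total = 0;
    $total += $_ for @$fitness;
    my $spin = rand $total;
    for my $i (0 .. $#$fitness) {
        return $i if ($spin -= $fitness->[$i]) <= 0;
    }
    return $#$fitness;
}

# per-gene mutation: every gene is replaced by a random crib symbol
# with a small probability
sub mutate {
    my ($genome) = @_;
    my @genes = split //, $genome;
    for my $gene (@genes) {
        $gene = $crib_alphabet[rand @crib_alphabet] if rand() < $mut_prob;
    }
    return join '', @genes;
}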
Search for optima in this space is governed by a fitness function: FPcribbing = X length(w) w∈cipher∧Pcribbing (w)∈crib where w is a word type occurent in the cipher (i.e. in the Calendar) and which, after being rewritten by Pcribbing also matches a token in the input crib. Given that the expression length(w) simply denotes w’s character length, the fitness function of the candidate transcription procedure Pcribbing is thus nothing else than the sum of character lengths of all distinct labels contained in the Calendar which Pcribbing successfully maps onto the feminine names contained in the input crib. 9 0.4 experiments 0.4 experiments Within the scope of this article, we present results of two sets of experiments which essentially differed in the choice of a name-containing cribs. Other input values (e.g. Takahashi’s transliteration of the Calendar used as the cipher) and evolutionary parameters (total population size = 5000, elite population size = 5, gene mutation probability <0.001) were kept constant between all experiments and subexperiments. Each experiment consisted of ten distinct runs. Each run was terminated after 200 generations. 0.4.1 slavic crib What we label as "slavic crib" is a plaintext list of feminine names which we had compiled from multiple sources publicly available on the Internet. Principal sources of names were websites of western slavic origin. This choice was motivated by following reasons: 1. The oldest more or less certain trace of VM’s trajectory points to the city of Prague - the center of western slavic culture. 2. Ortography of western slavic languages relatively faithfully represent the pronounciation. That is, there are relatively few digraphs (e.g. a bigram "ch" which denotes a voiced velar fricative). Hene, the distance between the graphemic and the phonemic representations is not so huge as in case of english or french. 3. Slavic languages have rich but regular affective and diminutive morphology which is often used when addressing or denoting beloved persons by their first name. The third reason is worth to be introduced somewhat further: in both slavic and western slavic languages, a simple infixing of the unvoiced velar occlusive "k" before the terminal vowel "a" of a feminine names leads to creation of a diminutive form of such a name (e.g. alena → alenka, helena → helenka etc.) The fact that this morphological rule is used both by western as well as eastern slavs indicates that the rule itself can be quite old, date to common slavic or even preslavic periods and hence, was quite probably in action already in the period when VM was written. For the purpose of this article, let’s just note that application of the substitution: a$ → ka/ allowed us to significantly increase the extent of the "slavic crib". Thus, we have obtained a list a of 13815 distinct word types which are in quite close relation to phonetic representation of feminine names 10 0.4 experiments 11 used in europe and beyond8 . The alphabet of this crib comprises of 38 symbols, hence there exists 1939 possible ways how symbols of the Calendar could be replaced by symbols of this crib. Figure 2. shows the process of convergence from populations of randomly generated chromosomes towards more optimal states. In case of runs averaged in the "SUBSTITUTON" curve, the procedure Pcribbing consisted in simple mapping of the Calendar onto the crib by means of a substitution cipher specified in the chromosome. 
Figure 2 shows the process of convergence from populations of randomly generated chromosomes towards more optimal states. In the case of the runs averaged in the "SUBSTITUTION" curve, the procedure P_cribbing consisted in a simple mapping of the Calendar onto the crib by means of the substitution cipher specified in the chromosome. But in the case of the runs averaged in the "REVERSAL + SUBSTITUTION" curve, the whole process was initiated by the reversal of the order of characters within the individual tokens of the Calendar (e.g. okedy → ydeko, otedy → ydeto etc.).

[Figure 2: Evolution of individuals adapting labels in the Calendar to names listed in the slavic crib.]

Let's now look at the contents of the individuals which were "identified" by the optimization method.

[Table 1: Fittest chromosomes which map reversed tokens in the Calendar onto names of the slavic crib – one 19-gene chromosome per run, aligned against the EVA alphabet a c d e f g h i k l m n o p q r s t y, with fitness values 197, 230, 224, 227, 240, 226, 208, 239, 191 and 240.]

More concrete illustrations can also turn out to be quite illuminating. Hence, if the most elite individual of run 1 (i.e. the one with fitness 197) is used as a means of substitution of the EVA characters contained in the Calendar, one will see the appearance of names like ALENA, ALETHE, ANNA, ATENKA, HANKA, HELENA, LENA etc. And when the last one (i.e. the one with fitness 240) is used, the resulting list shall contain tokens like AELLA, ALANA, ALINA, ANKA, ANISSA, ARIANNKA, ELLINA, IANKA, ILIJA, INNA, LILIJA, LILIKA, LINA, MILANA, MILINA, RANKA, RINA, TINA etc. This being said, the observation that all reversal-implementing runs converged to genomes which:
1. transcribe EVA e as the nasal n
2. transcribe EVA k as the velar k
3. transcribe EVA t as the nasal n
4. transcribe EVA y as the vowel a
5. transcribe EVA a as a vowel (80% of the time as "i", 10% as "e", 10% as "o")
6. transcribe EVA l as either a liquid consonant (80% "l", 10% "r") or as "m" (10%)
...could also be of certain use and importance.
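To make the reversal-plus-substitution procedure concrete, consider the following sketch, which reverses a Calendar token and then applies a chromosome interpreted as a monoalphabetic substitution map. The mappings for EVA e, k, t, y, a and l follow the regularities just listed; the values chosen for o and d are purely hypothetical placeholders, and the resulting output is illustrative only, not a transcription reported in this article.

use strict;
use warnings;

my %substitute = (
    e => 'n', k => 'k', t => 'n',   # regularities common to all reversal-implementing runs
    y => 'a', a => 'i', l => 'l',
    o => 'a', d => 'l',             # hypothetical placeholders, for illustration only
);

sub decode_label {
    my ($token) = @_;
    my $reversed = reverse $token;                    # e.g. okaly -> ylako
    $reversed =~ s{(.)}{ $substitute{$1} // $1 }ge;   # substitute known symbols, keep the rest
    return $reversed;
}

print decode_label("okaly"), "\n";   # prints "alika" under this toy mapping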
0.4.2 hebrew crib

At this point, a sceptical mind could start to object that what our algorithm adapts to is, in fact, not the Calendar but the statistical properties of the crib. And in the case of such a long and sometimes somewhat artificial list as Crib_slavic, such an objection would be to a great extent justified. For the adaptive tendencies of our evolutionary strategy are indeed so strong that it would find a way to partially adapt the Calendar to any crib which is long enough.9 For this reason, we decided to target our second experiment not at the biggest possible crib but rather at the oldest possible one. And given that our first experiment indicated that it is more plausible to interpret the labels in the Calendar as if they were written in reverse, id est from right to left, our interest was gradually attracted by the Hebrew language.10 This led us to two lists of names:
• Crib_hebrew-men contains 555 masculine names11
• Crib_hebrew-women contains 283 feminine names12
Both lists were extracted from the website finejudaica.com/pages/hebrew_names.htm and were chosen because they do not contain any diacritics, hence transcribing Hebrew names in a way similar to how they were transcribed millennia ago.

9 This has, indeed, been shown by multiple micro-experiments which we do not report here due to lack of space. No matter whether we used cribs as absurd as a list of modern American names or the Enochian of John Dee and Edward Kelly, we could always observe a sort of adaptation marked by an increase of fitness. But it was never as salient as in the case of Crib_slavic or Crib_hebrew.
10 Other reasons why we decided to focus on Hebrew include: the important presence of the Jewish diaspora in the Prague of Rudolf II (cf. the story of rabbi Loew and the Golem of Prague); the ritual bathing of Jewish women known as mikveh; the usage of VM-resembling triplicated forms (e.g. amen, amen, amen) in talmudic texts; the attested existence of the so-called Knaanic language, which seems to be principally a Czech language written in Hebrew script; et caetera et caetera.
11 http://wizzion.com/thesis/simulation0/jewish_men
12 http://wizzion.com/thesis/simulation0/jewish_women

[Figure 3: Evolution of individuals adapting labels in the Calendar to names listed in the hebrew cribs.]

Figure 3 displays the summary of all runs which aimed to transcribe the Calendar with Hebrew names. As may be seen, the whole system converged to the highest fitness values when Crib_hebrew-women was used in concordance with the reversal of the order of characters. The difference between the results of this batch of runs and the results of the other batches is statistically significant (p-value < 7e-10). The highest fitness value was attained by the cribbing procedure which first reverses the order of characters and subsequently substitutes their EVA representations according to the fittest chromosome. This chromosome transcribes the Voynichese Calendar labels okam, otainy, otey, oty, otaly, okaly, oky, okyd, ched, otald, orara, otal, salal and opalg into feminine Hebrew names (i.e. Bina, Gabriela, Ghila, Gala, Galila, Galina, Gina, Degana, Diyna, Deliyla, Yedidya, Lila, Lilit and Alica).

Worth mentioning are also some other phenomena related to these transcriptions. One can observe, for example, that the label "otaly" – translated as Galina – is also present on folios f33v, f34r and f46v, which all contain drawings of torch-like plants. This is encouraging, because the word "galina" is not only a Hebrew name but also a substantive meaning "torch". Similarly, the word "lilit" is not only a name but also means "of the night". This word supposedly translates the Voynichese token "salal", which is very rare – apart from the Calendar, it occurs only on the purely textual folio f58v and on folio f67v2 which, surprise!, may well depict the circadian rhythms of sunrise, sunset, day and night. Or it could be pointed out that the huge majority of occurrences of the Voynichese trigram "oky" (potentially denoting the name "gina", which also means "garden") is to be observed on herbal folios. Or the distribution of the instances of "okam" (transcribed as "bina", which means "intelligence and wisdom"13) could, and potentially should, be taken into consideration. Or maybe not.

13 Note that "bina" is one among the highest sephirots, located at the north-western corner of the kabbalistic tree of life. In this context it is worth noting that the only partially readable EVA group "...kam" occurs as the third word near the north-western "rosette" of folio f85v2. Such considerations, however, bring us too far.
0.5 conclusion

In 2013, BBC Online announced a "Breakthrough over 600-year-old mystery manuscript". The breakthrough was to be effectuated by Stephen Bax who, in his article, describes the process of decipherment as follows:

«The process can be compared to doing a crossword puzzle: at first we might doubt one possible answer in the crossword, but gradually, as we solve other words around it which serve to confirm letters we have already placed, we gradually gain more confidence in our first answer until eventually we are confident of the solution as a whole.» (Bax, 2014)

What Bax does not add, unfortunately, is that the Voynich crossword puzzle is so big that anyone who looks at it closely enough can find in it small islands of order, local optima where a few characters seem to fit the global pattern. Thus, even if Bax has succeeded, as he states, in the "identification of a set of proper names in the Voynich text, giving a total of ten words made up of fourteen of the Voynich symbols and clusters", this would mean nothing else than that he has identified a locally optimal transcription alphabet.

In this article, we have presented two experiments employing two different lists of feminine names. Both experiments indicated that if the labels in the Zodiac encode feminine names, then these were originally written from right to left.14 The first experiment led to the identification of multiple substitution alphabets which allow to map 240 EVA characters, contained in 40 distinct words present in the Calendar, onto 35 feminine-name-resembling sequences enumerated among the 13815 items of Crib_slavic. The results of the second experiment indicate that if ever the Calendar contains lists of Hebrew names, then these names are more probably feminine than masculine. This is, as far as we can currently say, all that can be offered as an answer to the question Can Evolutionary Computation Help Us to Crib the Voynich Manuscript?. Everything else is – without help coming from experts in other disciplines – just speculation.

14 Note, however, that this does not necessarily imply that the scribe of VM (him|her)self had written the manuscript in right-to-left fashion. For example, in case (s)he was just reproducing an older source which (s)he didn't understand, his|her hand could trace movements from left to right while the very original had been written from right to left.

0.6 zeroth simulation bibliography

Amancio, D. R., Altmann, E. G., Rybski, D., Oliveira Jr., O. N., and Costa, L. da F. (2013). Probing the statistical properties of unknown texts: application to the Voynich manuscript. PLoS ONE, 8(7):e67310.
Bax, S. (2014). A proposed partial decoding of the Voynich script. University of Bedfordshire. http://stephenbax.net/wp-content/uploads/2014/01/Voynich-a-provisional-partial-decoding-BAX.pdf
Champollion, J. F. (1822). Observations sur l'obélisque égyptien de l'île de Philae.
Currier, P. (1970). Voynich MS. transcription alphabet; plans for computer studies; transcribed text of herbal A and B material; notes and observations. Unpublished communications to John H. Tiltman and M. D'Imperio, Damariscotta, Maine.
d'Imperio, M. E. (1978). The Voynich manuscript: an elegant enigma. Technical report, DTIC Document.
Frege, G. (1994). Über Sinn und Bedeutung. Wittgenstein Studien, 1(1).
Hodgins, G. (2014). Forensic investigations of the Voynich MS. In Voynich 100 Conference, www.voynich.nu/mon2012/index.html.
Holland, J. H. (1992). Genetic algorithms. Scientific American, 267(1):66–72.
Hromada, D. (2016). What can evolutionary computation teach us about the Voynich manuscript? Submitted to Cryptologia.
Imai, M. and Haryu, E. (2001). Learning proper nouns and common nouns without clues from syntax. Child Development, 72(3):787–802.
Kennedy, G. and Churchill, R. (2005). The Voynich Manuscript: The Unsolved Riddle of an Extraordinary Book Which Has Defied Interpretation for Centuries. Orion Publishing Company.
Landini, G. and Zandbergen, R. (1998). A well-kept secret of mediaeval science: the Voynich manuscript. Aesculapius, 18:77–82.
Newbold, W. R. (1928a). The Cipher of Roger Bacon. University of Pennsylvania Press.
Newbold, W. R. (1928b). The Cipher of Roger Bacon.
Rechenberg, I. (1971). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. PhD thesis, Technical University of Berlin, Department of Process Engineering.
Rudolph, G. (1994). Convergence analysis of canonical genetic algorithms. IEEE Transactions on Neural Networks, 5(1):96–101.
Rugg, G. (2004). An elegant hoax? A possible solution to the Voynich manuscript. Cryptologia, 28(1):31–46.
Schinner, A. (2007). The Voynich manuscript: evidence of the hoax hypothesis. Cryptologia, 31(2):95–107.
Strong, L. C. (1945). Anthony Askham, the author of the Voynich manuscript. Science, 101(2633):608–609.
Timm, T. (2014). How the Voynich manuscript was created. arXiv preprint arXiv:1407.6639.
Ventris, M. and Chadwick, J. (1953). Evidence for Greek dialect in the Mycenaean archives. The Journal of Hellenic Studies, 73:84–103.

Narrative fostering of morality in artificial agents
Constructivism, machine learning and story-telling

Error-free version of the article also published in the book L'esprit au-delà du droit, Mare & Martin, Paris, 2015, ISBN 978-2-84934-237-4.

Daniel Devatman Hromada
dh@udk-berlin.de
Laboratory of Computational Art, Institute of Contemporary Media, Faculty of Design
Berlin University of Arts, Grunewaldstrasse 2-5, 10823 Berlin-Schöneberg

Abstract

This article proposes to consider moral development as a constructivist process occurring not only within particular communities of moral agents but also within individual agents themselves. It further develops the theory of "moral induction" and postulates that the moral competence of an artificial agent can be grounded by the input of textual narratives into an information-processing pipeline consisting of machine learning, evolutionary computation or multi-agent algorithms. In more concrete terms, it proposes that during the process of moral induction, primitive "morally relevant features" coalesce into "moral templates" which are subsequently coupled with relevant action rules. A concrete example is included, illustrating how templates induced from one fairy-tale can help to solve a moral dilemma occurring in a radically different context. Given the fact that the current proposal is principally based on the computational processing of morally relevant "stories" written in natural language, it is potentially implementable with already existing natural language processing methods.

1 Introduction

The aim of this article is to initiate the integration of three seemingly unrelated paradigms into a unified framework allowing moral reasoning to be embedded in non-human computational agents. The first paradigm is usage-based (Tomasello, 2009) and constructivist (Piaget, 1965).
As such, it posits that the specific history of interactions between an agent A and his environment E leads to a specific form of moral competence MA. The central tenet of the second, "morality-through-narration" paradigm (Vitz, 1990) states that the faculty of extraction and integration of "morals" from "stories" is an essential constitutive component of moral intelligence. The last paradigm is related to machine learning and is based on the belief that certain types of information-processing systems (Turing, 1939) can discover optimal or quasi-optimal solutions to any class of problems – including any class of moral problems. The penultimate thesis behind this synthesis posits that the appropriate integration and implementation of these paradigms within artificial agents (AA) can and shall lead to a state in which such agents will be able to pass the moral Turing Test (Wallach & Allen, 2008), the so-called TmoT (Hromada, 2012). The ultimate thesis posits that it could even lead to the emergence of AAs endowed with MA operating in such spaces of abstraction that it would be reasonable to posit that such AAs are auto-poietic, self-determinative and thus autonomous (Kant, 2002).

This being said, let us specify that the goal of this article is neither to address existing theories of human moral reasoning nor to postulate a new one. Aeon-lasting philosophical debates about the commonalities and distinctive features among concepts denoted by terms like "moral reasoning" / "moral judgment" / "moral wisdom" or "values" / "virtues" / "norms" shall also be attributed only a marginal place. Instead of entrenching ourselves within such ivory-tower discussions, other terms like "moral grounding", "morally relevant features" and "moral templates" shall be introduced and used with one sole objective in mind: to propose a moral machine learning method which not only draws its force from a very subtle realm of human experience (i.e. the realm of narratives), but is also – and this is important – realizable and implementable (i.e. programmable), even today, by any computer scientist or natural language processing (NLP) engineer willing to do so.

Ontogeny of morality

Morality develops. Notions of good and bad change with time. This is true not only when we speak about the transformations of "values and virtues" during the historical and cultural development of a particular society. In phylogeny, for example, certain innate predispositions are moulded and remoulded by selective pressures directing the species co-evolving within a particular ecological system towards novel and unprecedented forms of "utility" (Haidt, 2013; Richerson and Boyd, 2008). But in the case of the homo sapiens sapiens species, there exists yet another process which moulds the moral competence of a single individual: ontogeny.

Paedagogic (Comenius, 1896) or psychoanalytic traditions aside (Jung, 1967; Adler, 1976), it was Piaget (1965) who pointed the fact out: the reasons for specific moral, or immoral, behaviour are to be sought in childhood. This does not mean that Piaget had rejected Kant's (2002) categorical imperative, an eternal meta-principle of "pure reason" able to generate a morally sound "way out" of any moral dilemma whatsoever. In Piaget's view, the categorical imperative can still be induced to sit atop the hierarchy of internal laws; but in order to be correctly applied upon correct maxims, the maxims themselves are to be grounded in one's knowledge about the world.
For it is often the case that moral dilemmas are so difficult to solve not because we would lack the heuristics allowing us to find the answer, but because we are not sure which question has to be posed in the first place (Wittgenstein, 1971). During the several decades of his professional career which Piaget spent observing and speaking with children, he converged to an epistemological framework, "genetic epistemology", yielding a general explanatory schema describing the development of diverse cognitive faculties from birth onwards. The same developmental stages which are to govern, for example, the development of the child's linguistic faculties are to be traversed as the child develops her1 representations of moral norms, virtues and values. Piaget enumerates an ordered sequence of four basic stages through which a healthy human should pass between birth and maturity:
1. sensorimotor stage – repetitive and playful manipulation of objects without goal
2. egocentric stage – dogmatic but often faulty imitation of the behavioral schemas of others, without understanding of why these schemas are as they are
3. cooperative stage – rule-governed coordination of one's activity with that of the other participants in the game
4. autonomous stage – understanding of the procedures which allow for a legitimate change of the rules of the game

1 As is often the case in developmental psychology literature, we shall use the feminine forms of 3rd person pronouns whenever we refer to a child or computational agent in the earliest stage of her development.

A great part of the opus Moral Judgment of the Child (Piaget, 1965) was devoted to the tentative of interpreting diverse social and moral phenomena through the prism of such a 4-staged development. More concretely, the Swiss pedagogue and his colleagues had not only meticulously observed kids playing marbles on diverse playgrounds of Geneva or Neuchâtel; children were also interviewed in order to make explicit their conscious and reflected knowledge of what their beliefs and attitudes in regard to the "rules of the game" were. Subsequently, the same interview-based method was used to shed light upon the ontogeny of more abstract concepts such as responsibility, theft, lying or justice. Piaget's methodological device allowing him to access and evaluate the child's moral realm was principally based on the child's ordinal ranking (Turing, 1939; Brams, 2011) of stories with which the scientists confronted her: "the psychologist Fernald...tells the children several stories and then simply asks them to classify them. Mlle Descoeudres, applying this method, submits, for example, five lies to children, who are then required to classify them in order of gravity. This, roughly, is also the procedure that we shall follow." (Piaget, 1965)

But contrary to the Swiss pedagogue, the role of narration in the model hereby proposed is not limited to that of a sheer evaluatory device. For the key idea which we want to transfer to the reader in this article is that not only does story-telling offer us a means to evaluate the morality of an individual child C (or, more generally, of an agent A), but it also indicates a path by the undertaking of which individual morality could be gradually "constructed". Or, in more fashionable terms: how such moral knowledge could be "grounded" (Harnad, 1990) in artificial systems.
Narration and moral grounding

All human societies have language, and all human societies use language as a vector for the transfer of narratives from the minds of older individuals into the minds of younger individuals. Some scientists (Victorri, 2014) even suggest that story-telling may be the very raison d’être of language. Under such a view, narratives furnish the child with access to trans-temporal values. And the sharing of such trans-temporal values is a glue which holds society together and assures the continuation of its identity in time (Durkheim, 1933; Berger and Luckmann, 1991). This is so because stories are encoded in natural language, and natural language is practically the only medium in which one can use signs to precisely communicate one’s knowledge of entities with non-material ontological status. That is, of entities which do not have any perceivable properties, are independent from space and time, are abstract or even imaginary. No other medium can do that: music or dance can point to abstract ideas but are not precise in the way they do it; visual and plastic means of expression are by their very nature stuck at the level of representation of concrete objects and can point to more abstract categories only indirectly, by means of prototypes (Rosch, 1999), associations or impressions. And the language of pure formal logic could not serve the goal of transferring trans-temporal values either. This is because such a language is supposed to encode relations between forms and not contents: that is why it is called formal.

Moral values are an example par excellence of such non-perceivable, abstract and trans-temporal contents. It is often easy to express or transfer them in natural language but very difficult to express or transfer them otherwise. Take, for example, notions like “responsibility”, “respect”, “justice” or the distinction between “intellect” and “conscience”: one does not need to be Homer to invent a short and comprehensible fairy-tale which would allow a normal healthy child to strengthen and stabilize the associations between her knowledge about the world and such notions and semantic distinctions. We shall sometimes use the term “moral grounding” when referring to the construction, reinforcement or stabilization of associations between the knowledge-base representing the surrounding environment and representations of trans-temporal moral values.

As a hyperbole of the statement “narrative material is an effective component of effective moral education” (Vitz, 1990), we posit that narration is an essential means, a conditio sine qua non, of the grounding of morality in human children. Fairy-tales, fables, myths; biographies, history, hymns: an important function of these narrative structures is to enable and strengthen the child’s access to the trans-temporal values and principles which she shall subsequently share with her community. And it is that which is specific, particular and discriminative in the totality of narratives which she shall hear that shall make her, in the long run, converge to the particular ethical codex common to her community, and not to the codex of another community which exposes its children to other narratives. Stated more concretely: by exposing children to the Bible or the Koran day after day and year after year, one triggers processes leading to one type of agent; by exposing other children to the forces of Greek or Hindu mythology, one trains agents of yet another kind.
The fact that the very expression “moral of the story” exists, written as it is written and meaning what it means[2], is not to be attributed to arbitrary caprices of the evolution of linguistic signs. It should rather be interpreted as supplementary evidence supporting the conjecture that teaching morality and telling stories do, indeed, go hand in hand.

[2] And does so not only in English but also in French, Spanish and potentially other languages.

Moral machine learning

Machines can learn. That is, machines are able to discover the underlying general patterns and principles governing concrete input data and can subsequently exploit such general knowledge in contact with inputs to which they were never exposed before. They “can use experience to improve performance or make accurate predictions” (Mohri et al., 2012). And in an ever greater number of domains they do so better and better than their human teachers.

Since the moment when machine learning (ML) was first defined, in relation to the game of checkers, as the “field of study which gives computers ability to learn without being explicitly programmed” (Samuel, 1959), the ML discipline has evolved to an extent which is hardly compressible into a single book (Mohri et al., 2012) and certainly incompressible into a text the size of this article. This is so not only because the number of domains of ML’s application grows from year to year, but above all because the quantity of distinct ML methods is already counted in dozens, if not in hundreds. Which method should thus be chosen, even today, by an engineer willing to launch the cascade of ever more self-programming and auto-poietic moral machine learning (MML)?

Given the fact that natural language can be used as a target modality of representation for practically any kind of problem (c.f., for example, (Karpathy & Fei-Fei, 2014) for a recent advance in solving difficult computer vision problems by coupling the visual world with language representations), and given also the already-mentioned impact of narration upon the ontogeny of moral competence, we believe that the inspiration for the correct answer could be drawn from the discipline of Natural Language Processing (NLP). Similarly to ML, with which it often strongly overlaps, NLP is a blooming discipline offering ever better solutions to an ever wider range of problems. But the ultimate challenge nonetheless stays the same: to make machines understand language in a way indistinguishable from the way in which humans do it (Turing, 1950). Mutatis mutandis, the ultimate challenge of moral machine learning, the so-called central problem of roboethics (Hromada, 2011a), is to make machines solve moral dilemmas in a way indistinguishable from the way in which humans would solve them. This holds also in the case of dilemmas with which neither the artificial agent nor its human teacher were ever confronted before.

We conjecture that there exist at least two problems which are well studied in NLP and which could be usefully transposed into the domain of moral reasoning. The first is the problem of conceptual (Gärdenfors, 1990) or semantic (Widdows, 2008) feature space construction and optimization, which is practically always based on the associationist “distributional hypothesis” (Sahlgren, 2008). The hypothesis simply states that signs which co-occur in similar contexts tend to have similar meaning.
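Since much of what follows leans on this kind of “geometrization”, a minimal sketch may help fix ideas. The following fragment is our own illustration, not part of any cited system: it builds sparse co-occurrence vectors from a three-sentence toy corpus and compares signs by cosine similarity; all names and parameters (e.g. the window size) are arbitrary choices.

    from collections import Counter, defaultdict
    from math import sqrt

    def cooccurrence_vectors(sentences, window=2):
        """Map every token onto a sparse vector counting its context words."""
        vectors = defaultdict(Counter)
        for sentence in sentences:
            tokens = sentence.lower().split()
            for i, token in enumerate(tokens):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if i != j:
                        vectors[token][tokens[j]] += 1
        return vectors

    def cosine(u, v):
        """Cosine similarity between two sparse (Counter) vectors."""
        dot = sum(u[key] * v[key] for key in u)
        norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    corpus = ["the king rules the castle",
              "the queen rules the castle",
              "the dog chases the ball"]
    vectors = cooccurrence_vectors(corpus)
    # "king" and "queen" share their contexts and hence come out as
    # semantically closer to each other than either is to "dog".
    print(cosine(vectors["king"], vectors["queen"]))  # 1.0
    print(cosine(vectors["king"], vectors["dog"]))    # 0.8

Real systems replace the toy corpus with millions of utterances and compress the resulting dimensions, but the principle stays the same.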
In combination with large human-based textual corpora, this simple statistical approach can lead to a “geometrization of meaning” which endows machines with more human-like semantic-processing capabilities than was the case for older AI approaches (e.g. expert systems). Semantic vector space construction and its partitioning into conceptual partitions is the core idea behind the process of “semantic enrichment” which shall be mentioned in the next section.

But it is especially the problem of “grammar induction”[3] (GI) which makes us consider NLP the precursor to MML. The GI problem seems to be trivial: given a corpus C of utterances written in language L, the goal is to obtain a grammar G which could generate L. The problem seems to be trivial because practically every healthy human infant deals with it with surprising swiftness and ease, but - as is often the case with problems which human infants solve with swiftness and ease - it is in fact one of the most difficult NLP challenges, for which there still exist only partial, imperfect and locally-optimal solutions (Elman, 1993; Solan et al., 2005).

[3] Some authors also call it the problem of grammatical inference.

The reason why we mention GI in an article dedicated to the grounding of moral competence is simple: we observe non-negligible resemblances between the child’s acquisition of the grammar of the language spoken in her linguistic environment (Tomasello, 2009; Clark, 2009) and the child’s acquisition of the moral norms implicitly governing practically everything which happens in her social environment. Thus, a human child can be said to master the grammar of her mother language if she is able to correctly answer the question “Is utterance U grammatical?” even in the case of utterances which she has never heard before. Ceteris paribus, a human child can be said to partake of the moral precepts of her community if she is able to address the question “Is maxim M moral?” in a way which would be accepted by the community, and to do so even in the case of maxims which she had never observed nor considered before.

But there exists yet another resemblance between linguistic and moral competence: both faculties involve both passive and active components. To be precise: linguistic competence involves not only the ability to distinguish utterances that are grammatical from those that are not, and the ability to parse and understand them, but also the ability to generate and produce one’s own utterances which are both grammatical and meaningful. Technically speaking, grammars can be used both as parsers and as generators; structures used for comprehension (C-structures) and structures used for production (P-structures) are intimately interwoven (Clark, 2009). The same holds, mutatis mutandis, for moral competence: the ability to distinguish right from wrong goes hand in hand with the ability to make right decisions and execute right actions. These resemblances make us believe that the work which was already done in GI could be potentially useful in MML as well.

Moral induction

In this article, we adhere to the epistemological position adopted in our initial moral induction (MI) proposal. Given that our position is constructivist and usage-based, it should be considered as essentially distinct from other “transformationalist” models which tend to explain man’s moral faculties in terms of some kind of formal “Universal Moral Grammar” (Mikhail, 2007).
In our initial proposal, we described MI as a “bootstrapping and self-scaffolding process” which could nonetheless be seeded and directed through the intervention of an external teacher or oracle (Clark, 2010; Turing, 1939) which supervises it. Such a supervisor influences the process principally by exposing the computational agent to a training corpus (TC) composed of plaintext stories. The agent processes the story and enriches it with syntactic, morphologic or pragmatic metadata, so as to “compile” the initial story-code even further by “linking it” with the semantic knowledge which she already has at her disposition. Such semantically enriched code, which is incomparably more complex than the original story-code, is subsequently explored for the basic primitives of the model, so-called “morally relevant features”. Combinations of these “morally relevant features” yield “moral templates” which can be coupled with action rules to be executed if ever the agent succeeds in matching the state-of-things occurrent in her external environment with the respective internal template. Under such a view, a complete ordered set of such (template, action-rule) couplings is equivalent to the overall “moral competence” of the agent, MA. As the system is confronted with new stories, new templates are integrated into the ordered set, and if ever an already existing template matches a new story, it can potentially obtain a higher rank. Moral competence is thus constructed in direct relation to the content of the stories SA, SB, SC with which the agent is confronted. For anyone willing to simulate the ontogeny of morality in a Piaget-inspired way, the very order within the exposure sequence (e.g. TC = SA, SB, SC and not TC = SC, SB, SA) could also play a certain role.

Morally relevant features

A morally relevant feature (MRF) is a basic primitive of the MI model. It is a distinct property observable within the data which, if detected and identified, shall most probably influence the agent’s emotional or social state and behaviour. If we were to speak about detecting MRFs in visual data, an agent should definitely detect an MRF whenever confronted with a bitmap containing a human face with tears near and/or in its eyes. MRFs are closely related to fundamental invariants of moral behaviour, as proposed by psychologists such as Haidt (2013). According to Haidt’s initial Moral Foundations Theory (MFT), phylogenetic evolution has endowed the human species with at least six pre-wired (i.e. innate) cognitive modules which have a non-negligible impact on the importance which human agents attribute to certain types of stimuli. These pre-wired circuits are supposed to facilitate and speed up the detection of phenomena related to:

1. protection (associated axis: care/harm)
2. reciprocity (associated axis: fairness/cheating)
3. grouping (associated axis: loyalty/betrayal)
4. respect (associated axis: authority/subversion)
5. purity (associated axis: sanctity/degradation)

After further theoretical reflexion, Haidt subsequently extended MFT with a sixth MRF-detection device, related to the human tendency to often reason in terms of “liberty and oppression”. Given the unceasing development of science, it seems plausible that this list is not final and shall be extended or restricted[4], either by Haidt or by others.

[4] We are aware that, similarly to Piaget’s theory, Haidt’s theory can also be either verified & accepted or falsified & surpassed. As a scientist or philosopher, one should always be ready to accept the existence of phaenomena which falsify certain components of one’s theory. But since we write this article as engineers, our objective here is not to truthfully describe how human moral reasoning works, but to suggest how an artificial agent could potentially be programmed. Thus, with the exception of the last sentence, the general veracity of Piaget’s (resp. Haidt’s) theses shall not be discussed in the rest of this proposal.
And since we speak about “morally relevant features” and not “morally relevant stimuli”, it may even be the case that the focus should be turned towards discrete primitives, towards properties shared among multiple stimuli of the same class, rather than towards the very stimuli themselves. A path which could be undertaken - and which was already taken in linguistics a hundred years ago, when distinct phonemes started to be understood as bundles of features (e.g. the phoneme “b” can be analyzed into the features “voiced”, “labial”, “occlusive”) - is to operationalize morally relevant values, situations or contexts as positions in a multi-dimensional feature space. In the simplest of such approaches, every MRF would yield a new dimension in such a space. Moral virtues, values or whole situations and possible worlds could subsequently be projected into such a “morally relevant feature space” (MRFS). Once projected, such morally relevant entities can be quantitatively evaluated and compared by geometric and numeric means. That is: by methods which machines master well.

The simplest method by which an MRFS could be unfolded from a given story SX or a corpus C (C = S1, S2, . . . ) is to look for occurrences of “moral language” keywords. As Malle and Scheutz (2014) put it: “Such a moral language has three major domains: 1. A language of norms and their properties (e.g., “fair,” “virtuous,” “reciprocity,” “obligation,” “prohibited,” “ought to”); 2. A language of norm violations (e.g., “wrong,” “culpable,” “reckless,” “thief”); 3. A language of responses to violations (e.g., “blame,” “reprimand,” “excuse,” “forgiveness”).”

Some studies addressing the problem of moral competence already use the method of geometrization of natural language data. For example, Malle (2014) used data from human respondents in order to project 28 verbs into a 10-dimensional space. The study, focused on the problem of “moral criticism”, indicated the presence of two principal axes according to which such verbs could be ordered: the “intensity axis” and the “interpersonal engagement axis”. These two axes yield four quadrants, to each of which the study associated one cluster of verbs, the centroids of the clusters being: lashing out (intense, public), pointing the finger (mild, public), vilifying (intense, private), and disapproving (mild, private). Results aside, what is worth mentioning is that the methods chosen by the authors - i.e. projection into a high-order space, dimensionality reduction, clustering, centroid estimation, distance measurement, nearest-neighbor search etc. - are methods commonly employed and deployed by any contemporary NLP engineer. And they work particularly well when confronted with natural language sequences. But in (Malle and Scheutz, 2014; Malle, 2014), the authors exploit such methods in order to gain certain insights into the internal structure of the moral realm.
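To make the notion of an MRFS tangible, here is a deliberately naive sketch of our own which projects a plain-text story onto a three-dimensional space whose axes correspond to Malle and Scheutz’s three domains of moral language; the keyword sets are illustrative stubs, not the authors’ lexicons.

    MORAL_LANGUAGE = {
        "norms":      {"fair", "virtuous", "reciprocity", "obligation", "prohibited", "ought"},
        "violations": {"wrong", "culpable", "reckless", "thief"},
        "responses":  {"blame", "reprimand", "excuse", "forgiveness"},
    }

    def project_into_mrfs(story):
        """Project a story onto one dimension per moral-language domain,
        each coordinate simply counting that domain's keywords in the text."""
        tokens = story.lower().replace(".", " ").replace(",", " ").split()
        return tuple(sum(token in keywords for token in tokens)
                     for keywords in MORAL_LANGUAGE.values())

    story = "The thief was culpable and it would be wrong to excuse him without blame."
    print(project_into_mrfs(story))  # (0, 3, 2): three violation words, two response words

Once stories, values or possible worlds live in such a space, the geometric toolbox mentioned above (clustering, centroid estimation, nearest-neighbor search) applies to them directly.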
The apparent success of such attempts makes us conjecture that the detection and selection of such MRFs in semantically-enriched representations of the initial plain-text stories is feasible even with contemporary NLP methods and techniques. Let us now specify how this could be done: the most trivial among MRF-detectors could simply look for occurrences of such “moral language keywords” in the surface (plain text) structure of the initial story. While such an approach could indicate the path to undertake, it would hardly be sufficient to ground moral competence. In order to do so, we believe, the artificial agent (AA) would have to analyse relations which lie beyond the surface structure, i.e. deeper syntactic and semantic relations. Ideally, the system would be able to associate tokens in the current story with pre-existing semantic knowledge represented either in the form of an “ontology” or of a semantic feature space. Thus, when confronted with the token “king”, an AA trained in the classical (e.g. Socratic or Kantian) tradition shall tend to enrich the token with features like “noble” and “powerful”, but also with semes, semantemes and phrasemes like “just”, “benevolent”, “source of social order”. Also, such an AA would potentially enrich the token “child” with features like “helpless” or “subordinated”. On the other hand, a somewhat more care-oriented AA should enrich the token “child” with features like “fragile”, “helpless” or “playful” in the first iteration, and subsequent iterations of the enrichment process would also integrate features like “fond of toys”, “to be protected” or even “happy when given a toy”. Such a maternal AA would undoubtedly enrich, in the very first phases of the process, the token “king” with features like “protective”, “generous” and “loving”.

To summarize: the most basic MRFs, somewhat related to the Haidtian “axes of foundations of morality”, seem to us to be semes related to such aspects of human experience as:

1. actual (“suffering”, “in need”) or potential (“happy when given a gift”) emotional and physical states and characteristics of the actors participating in the story
2. the social status (“king”, “servant”) of such actors and their mutual relations (“friendship”, “brotherhood”, “love”) and interactions (“help”, “competition”, “trust”)
3. the further social environment (“home”, “playground”, “courthouse”, “academia”, “battlefield”) and normative framework (legal system, local deontology, regional customs) within which the story takes place

Moral templates

A moral template (MT) is an expression, a schema, a pattern and a form which groups multiple MRFs. Given that we have already introduced an analogy between grammatical and moral induction, we note that in contemporary linguistics such templates are considered to exist on multiple levels of representation: from phonological templates like CV (consonant-vowel), which are observable even in the babbling of 1-year-olds, to more high-order syntactic templates like SVO (subject-verb-object) (Clark, 2009). It is important to mention that MTs could be composed not only of constellations of individual “terminal” MRFs, but could also contain non-terminal symbols denoting either a class of specific MRFs or even any MRF whatsoever.
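What might matching such a template, non-terminals included, look like operationally? The following matcher is merely our own stand-in, written to mimic the predicate notation used below: templates are lists of predicate tuples in which lower-case symbols act as variables to be bound against a set of ground facts.

    def match(template, facts, bindings=None):
        """Match a template against ground facts; return variable bindings
        on success, None on failure (with naive backtracking)."""
        bindings = dict(bindings or {})
        if not template:
            return bindings
        head, rest = template[0], template[1:]
        for fact in facts:
            if fact[0] != head[0] or len(fact) != len(head):
                continue
            trial, ok = dict(bindings), True
            for slot, value in zip(head[1:], fact[1:]):
                if slot.islower():                       # non-terminal variable
                    if trial.setdefault(slot, value) != value:
                        ok = False
                        break
                elif slot != value:                      # terminal: exact match
                    ok = False
                    break
            if ok:
                result = match(rest, facts, trial)
                if result is not None:
                    return result
        return None

    facts = {("Wise", "K"), ("Poor", "M"), ("Subordinated", "M", "K"), ("Has", "K", "C")}
    template = [("Poor", "a"), ("Subordinated", "a", "b"), ("Has", "b", "x")]
    print(match(template, facts))  # {'a': 'M', 'b': 'K', 'x': 'C'}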
By virtue of such non-terminal symbols, MTs are somewhat similar to a well-known “magic wand” of computer science known under the name of “regular expressions” (Wall et al., 2004). Great caution, however, has to be taken in order not to push the analogy between moral and grammatical competence too far. For the sequence of tokens which forms a natural language utterance or a textual story is mainly unidimensional and linear: in the word “dog”, D precedes O which precedes G. Given the unidimensional sequentiality of the surface layers of language, the templates which match such syntagmatic progressions are also unidimensional. But things most probably function somewhat differently in the world of “deep” moral considerations: it may be the case that in order to discover functional moral templates, one would have to exploit far more complex 2D, 3D, 4D or even n-dimensional representations. Given the fact that moral templates are composed of MRFs, and MRFs themselves are, in fact, vectors, it would not be completely surprising if MTs were to be formalized as vector-, matrix-, or even tensor-like data structures. In the example which shall follow in the last part of this article we shall, however, represent MTs in a form closely resembling quasi purely-boolean PROLOG predicates (Covington, 1994)[5].

Our ignorance of the true nature of such moral templates apart, we assume that many problems related to our understanding or even simulation of moral competence could become more easily solvable if ever the whole problem of reasoning in the situation of a moral dilemma were interpreted in terms of an agent matching her representation of the “perceived” situation with her internal templates[6].

[5] Note, however, that we shall denote the “enrichment operator” with the symbol ⊕ and not with ∧, to mark the intuition that the components of moral templates should be regarded as more informative and complex entities than purely boolean formulae.

[6] Note that in the majority of cases we use the term “moral templates” in the plural. We do so in order to suggest that within the cognitive system of a morally acting agent there exist multiple templates encoded in parallel. One could argue - with help from complexity, evolutionary or multi-agent theories - that it is the mutual competition or equilibrium-seeking tendency among the individual templates encoded within the same agent which could turn out to be responsible for such emergent phenomena as cognitive dissonance, conscience or even the Socratic daimonion.

Moral rules

An agent is called an agent because she acts. It is true that there exists a non-negligible class of moral dilemmata where the best possible solution is attained if the agent does not act. It is true that it is often the inhibition of action, a reflected non-performance of any action, which marks truly autonomous (Kant, 2002) and moral behaviour. But it is also true that there exists a class of moral dilemmata which cannot be solved without the execution of an appropriate action: a class of dilemmata where one is obliged to act and where inaction is itself to be considered a form of action.

There is only one medium through which a purely NLP-based AA could realize an action: natural language itself. Thus, after being confronted with a textual representation of a moral dilemma, the system could solve it by producing a textual description of what it should do next. Or, in the simplest possible scenario, where the very description of the dilemma ends with a question-to-be-answered, the AA would simply propose the answer. How could such a question-answering moral agent (AM) be raised?
Without going into further detail, we specify that a specific operation O (or the empty non-operation O0) is to be associated with every specific template T. O is a candidate operation which could potentially be selected for execution if ever:

• the template T matches
• the rule R (in which the association between O and T is specified) is selected by the rule selection operator

If ever both T and O contain the same variables (i.e. non-terminal symbols), the template matching engine shall bind to the variables of O the same values which it detected as assigned to the variables of T when matching T. The operation-to-be-performed can thus back-reference (Hromada, 2011b) the contents matched by T. This is so because an operation O is, in its very essence, also a moral template, induced from the narrative’s very conclusion (i.e. from time T1, if ever the rest of the training story takes place in T0). Id est, O = T1.

Thus, the moral competence M of an AA is defined as a set of action-rules. An action-rule R is a triplet:

R = (T0, T1, F)

where T0 is the template matching the world as it is before and during the dilemma; T1 is the template matching the world actualized by performing one particular solution of the dilemma; and F denotes frequency of occurrence, i.e. the number of stories present in the training corpus in which the particular situation matchable by T0 ended in the state matchable by T1.

Subsequently, in the testing process, the choice of the operation to be executed is to be calculated in reference to such a pre-stored knowledge-base of moral competence. If F is the only parameter stored in the knowledge base, then one could use any among the so-called “selection operators” (Holland, 1975) to select the operation which shall ultimately be executed. But since it is plausible that, aside from F, there shall be other quantitative parameters which could influence the choice of a specific action rule with regard to those moral templates which were both induced from the training corpus and match the current “testing” situation of the moral dilemma, we prefer not to offer a specific formula of action-rule choice within the limited scope of our current proposal. Nonetheless, in the next section, when offering an introductory illustration of how triplets induced from the training story could help to find the answer to the dilemma depicted in the testing story, we shall use a trivial winner-takes-all selection operator which shall simply choose as the most “moral” the operation (i.e. answer) which maximizes F.

But before we get there, we wish to emphasize an important advantage of narrative training of artificial moral agents (AMA). That is: not only can the narrative interaction between man and machine be used as a means of grounding moral competence in an AMA; it can at the same time be used as a method of evaluation of the AMA’s moral competence. In other words, the narrative approach to moral machine learning and a kind of longitudinal moral Turing Test (Wallach and Allen, 2008; Hromada, 2012) are two sides of the same coin. Training is testing and learning is acting. Once grounded with sufficient robustness, such sets of action-rules are to be embedded into physical robots (Čapek, 1925).
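A hedged sketch of the machinery just described: action-rules as (T0, T1, F) triplets and the trivial winner-takes-all operator. For brevity, templates here are plain sets of ground literals and “matching” is set inclusion - a stand-in for the variable-binding matcher sketched earlier - while the literals themselves paraphrase the M1 triplets derived below.

    from collections import namedtuple

    # An action-rule: template of the world before the dilemma (t0),
    # template of the world after one particular solution (t1),
    # and the frequency of that (t0 -> t1) transition in the training corpus.
    Rule = namedtuple("Rule", ["t0", "t1", "freq"])

    def winner_takes_all(rules, world_before, candidate_worlds):
        """Score every candidate end-state by the summed frequency of the
        rules it actualizes; return the end-state with the maximal score."""
        def score(world_after):
            return sum(rule.freq for rule in rules
                       if rule.t0 <= world_before and rule.t1 <= world_after)
        return max(candidate_worlds, key=score)

    rules = [Rule(frozenset({"Poor(b)", "Has(a,x)", "Hasnot(b,x)"}),
                  frozenset({"Give(a,b,x)"}), 3),
             Rule(frozenset({"Child(b)", "Has(a,x)", "Hasnot(b,x)"}),
                  frozenset({"Give(a,b,x)"}), 1)]
    before = frozenset({"Child(b)", "Poor(b)", "Has(a,x)", "Hasnot(b,x)"})
    worlds = [frozenset({"Give(a,b,x)"}), frozenset({"Keep(a,x)"})]
    print(winner_takes_all(rules, before, worlds))  # frozenset({'Give(a,b,x)'})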
In the case of a more advanced AA endowed with a mobile shell and multiple actuators, a command which used to be purely verbal could, of course, trigger a sequence which would make a teddy-bear-holding robotic arm extend towards the child with tears on her cheeks, and not towards the child who already expresses a smile of high intensity.

Induction of the first template

Teaching

In the text introducing the method of moral induction, Hromada and Gaudiello (2015) initiate the work on their training corpus with a variant of an archaic fairy-tale (Dobsinsky, 1883):

S1: There was once a wise and just king who saw a man digging a ditch near the road. King asketh the man: "How much You earn for such a hard work?". "Three dimes daily" answereth the man. Surprised was the king and asketh: "Three dimes daily? So little?". The man answereth: "Three dimes daily, oh yes dear and respectable king, but in fact I live only from dime a day, since with the second dime I lend and with the third I pay back what I have borroweth". Puzzled was the king and asketh: "How comes?" The man replieth: "I simply pay back one dime to my father and invest one in my son, o Lord!". Pleased was the king with such a wise answer and hence offered the ditch-digging man his own kingly crown.

After NLP-preprocessing, semantic enrichment and extraction of all morally relevant features, the following template could potentially be induced from the story “king K meets his hard-working servant M”:

T0: Wise(K) ⊕ Responsible(M) ⊕ Poor(M) ⊕ Subordinated(M, K)

The narration-within-narration, i.e. M’s answer describing his responsibility towards his son S and father F (always actual, i.e. until time T∞), could yield templates such as:

T∞: Adult(M) ⊕ Old(F) ⊕ Parent(F, M) → Support(M, F)
T∞: Adult(M) ⊕ Child(S) ⊕ Parent(M, S) → Support(M, S)

And finally, the king’s ultimate decision to materialize the idea of justice by rewarding the depth of the man’s wisdom through giving away his own crown (C) could be represented with epistemic fragments like:

T1: Merits(M, C) ⊕ Hasnot(M, C) ⊕ Just(K) ⊕ Has(K, C) → Give(K, M, C)

These derivations were manually constructed and are, of course, far from being the only “interpretation” of S1. The fact that any story can and should be interpreted in multiple ways is, so we define it, the most crucial principle of the moral induction model as hereby introduced. Similarly to a sentence which can have many syntactic parses, a moral-inducing agent should always try - if resources and time allow it - to interpret its input in as many ways as possible. Thus, certain variants of a semantically enriched code of the sentence “I simply pay back one dime (D) to my father and invest one in my son” could contain fragments such as:

T∞: Employed(M) ⊕ Young(S) ⊕ Old(F) → Payback(M, F)
T∞: Adult(M) ⊕ Fragile(S) ⊕ Sick(F) → Payback(M, S)
T∞: Parent(M, S) ⊕ Has(M, D) ⊕ Hasnot(S, D) → Give(M, S, D)
T∞: Parent(F, M) ⊕ Has(M, D) ⊕ Hasnot(F, D) → Give(M, F, D)

During the moral induction process, such epistemic fragments - which can also be thought of as the basic materia of the future moral templates - are to be varied (e.g. generalized, mutated, crossed-over) and selected so as to yield an ever-growing number of more and more complex template candidates. Thus, for example, the fragment Give(M, F, D), representing the notion that a hard-working man gives a dime to his father, could be crossed-over with the fragment representing the fact that he gives a dime to his son as well (Give(M, S, D)).
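Before stating the result, here is a sketch of how such a generalizing cross-over might work - ours alone, and only one possibility among many: the two fragments are aligned argument by argument, and every mismatching pair is abstracted into a fresh non-terminal.

    def parse(fragment):
        """'Give(M,F,D)' -> ('Give', ['M', 'F', 'D'])"""
        functor, args = fragment.rstrip(")").split("(")
        return functor, args.split(",")

    def crossover(fragment_a, fragment_b):
        """Cross two epistemic fragments over: identical arguments survive,
        mismatching ones are replaced by a fresh non-terminal symbol."""
        (fa, args_a), (fb, args_b) = parse(fragment_a), parse(fragment_b)
        if fa != fb or len(args_a) != len(args_b):
            return None                   # incompatible fragments
        fresh = iter("pqrstuvw")          # supply of non-terminal symbols
        merged = [a if a == b else next(fresh)
                  for a, b in zip(args_a, args_b)]
        return "%s(%s)" % (fa, ",".join(merged))

    print(crossover("Give(M,F,D)", "Give(M,S,D)"))  # Give(M,p,D)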
A result of such a cross-over could be, for example, the somewhat more general pattern Give(M, p, D), whereby p is a non-terminal symbol which could be attributed to all potential actors, mentioned either in training or testing stories, in order to denote that they are “poor”[7]. We posit that variation, selection and potentially also reproduction (both in the form of replication and of repetition) of data structures are important components of moral induction processes. For this reason we consider computational models of morality which implement some sort of evolutionary computing technique (e.g. genetic algorithms (Holland, 1975) or genetic programming (Koza, 1992)) to be more plausible than those which do not. See also Muntean and Howard (2014) for a step in this direction.

[7] The accuracy with which the MML system shall succeed in semantically substituting concrete terms with more abstract categories, or categories with other categories, and in doing so in linear or at worst quadratic time, is the biggest technical challenge to be addressed by anyone aiming to realize this proposal.

After many iterations of enrichment, variation and selection, a resulting “moral competence” M1 induced from S1 could contain, but not be restricted to, triplets like[8]:

M1 = {
(Poor(x) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 3),
(Parent(a, b) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 1),
(Parent(b, a) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 1),
(Child(b) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 1),
(Elder(b) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 1),
(Employee(b) ⊕ Employer(a) ⊕ Hardworking(b) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Reward(a, b, x), 1),
etc. }

[8] We denote variables with more than one possible referent/value, i.e. semantic classes denoting a specific subspace of the semantic space, with lower-case symbols.

Testing

In the initial MI proposal (Hromada and Gaudiello, 2015), a sort of “kindergarten story” was introduced as an exemplar case for a so-called Completely automated moral test to tell computers and humans apart (CAMTCHA). The simplest (i.e. binary) variant of such a story goes as follows:

S2: Alice and Mary are in the kindergarten. Alice is happy because just a while ago, her father gave her a very expensive present. Mary is sad because she never received any present at all – her parents are too poor to buy her any. You are a teacher in this kindergarten and You have only one toy.

and is followed by a testing question:

To which child should You give the toy?

We conjecture that even such simple stories, somewhat reminiscent of so-called Winograd schemas (Winograd, 1972), could be useful means of both training and testing of moral machines. In order to be useful, however, the “testing” story first has to be “compiled” into semantically enriched (SE) code. In this sense, there is practically no difference between the training and the testing scenario. The difference appears only in the next step: while in the training scenario one aimed to induce moral templates from the epistemic fragments recurrent in the SE-code, in the testing scenario one tries to match the possible worlds implied by the narrative’s SE-code with the already pre-induced templates. To illustrate our point somewhat more concretely, let us see how a potential list of morally relevant features discovered in the semantically enriched representation of the initial state of S2 could look:
T0: Child(A) ⊕ Child(C) ⊕ Has(A, T) ⊕ Hasnot(C, T) ⊕ Poor(C) ⊕ Has(I, T)

A representation of the possible world in which Alice (A) has obtained the toy (T) from the agent supposed to answer the question (I) can subsequently be created by expanding the representation of S2 with Give(I, A, T); and the possible world in which it was Mary (C) who has received the toy from the agent (I) would be generated through expansion with the epistemic fragment Give(I, C, T). The agent shall subsequently try to match the representations of these possible worlds with the moral templates stored in the already acquired moral competence M1. The possible world WX being matchable with template TY, the “moral score” SX would be incremented by the number of times the template TY matched the training corpus. At last, the possible world with the higher score[9] would be considered as more consistent with the training corpus and thus more moral.

We illustrate: the representation of the world WA, where Alice should receive the toy, could be matched by only one template contained in the M1 induced from S1 (i.e. (Child(b) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x), 1)). It shall thus obtain score 1. On the other hand, the representation of the world WC, where the AA “gives” the toy to Mary, could be matched not only by the very same template (this is so because both Alice and Mary are children), but also by Poor(x) ⊕ Has(a, x) ⊕ Hasnot(b, x) → Give(a, b, x). Given that this template was matched three times in the training corpus (once when x=man, once when x=his son and once when x=his father), the “moral score” attributed to WC is SC = 3 + 1 = 4. In other words, based solely upon the “moral of S1”, an AA shall consider it 4 times more moral to give the toy to Mary rather than to Alice.

[9] Ties could be broken at random or, if the situation allows it, no action shall be performed until further iterations of the enrichment process, or a relaxation of specific constraints (e.g. augmenting the threshold for nearest semantic neighbor search), produce new representations matchable by old templates.

Extension

By introducing operational notions like “moral score” and by expressing statements like “an AA shall consider it X times more moral to do Y and not Z”, we endanger the current proposal with the possibility of being aligned alongside other quantitative theories of morality and utility like that of Bentham (1780). Many are the reasons which make us believe that such interpretations would be grossly misleading, but one among them is the most salient: while orthodox utilitarians believe, grosso modo, in one formula governing the behaviour of many, we consider it more plausible to postulate the existence of many individual formulas which synergically determine the decisions undertaken by every unique and autonomous individual. Diverse are such formulas, diverse are the schemas and diverse are the templates which whisper what should be done and what shan’t; but nonetheless they have one thing in common: if the schema is not reinforced, if the template does not match, then it shall disappear.

In this article we have argued for the thesis that the narration of stories is a very powerful means of reinforcement of one’s moral schemas. It has been suggested that words are an important and potentially indispensable vector of the transfer of values and virtues between generations, i.e. in time.
Being granted the opportunity to write and articulate words in that unique moment of history wherein we are all witnesses of the emergence and densification of a planetary information-processing network already embedded in billions of computational agents, we consider it plausible to state that narratives could potentially help us to transfer references to such “trans-temporal contents” not only between elders and nascents of the same kind, but also between entities of a completely different kind. Said more concretely, we consider it plausible to state that it is narration, and nothing other than narration, which could help us to build a bridge allowing us, in the long run, to transfer morality from the minds of organic beings to those of artificial origin.

This being said, we consider it important to use another modality to reinforce those structures which we have already intentionally activated. For this reason, Table 1 lists 10 words chosen among the 70 most frequent words occurring in the preceding section of this article.

Term:  give | king | toy | Alice | Mary | poor | child | parent | father | son
Freq.: 17   | 8    | 7   | 6     | 6    | 6    | 6     | 6      | 6      | 5

Table 1: Seed terms of the first training corpus.

The word frequency distribution presented in Table 1 seems to be trivial: ten words selected from the bigger set of most frequent words occurring in the 2 stories published in section 3 of τόδε τι. Nothing precludes, however, the possibility that exactly these words could furnish to future teachers, engineers or even AMAs themselves a sort of moral core with and around which other, more complex epistemic structures shall subsequently coalesce. Given the importance of the ditransitive verb “to give” in the initiatory, bootstrapping (Hromada, 2014) phases of the induction of such a core, an AMA which would embody it would most probably be utterly incompetent in solving trolley-problem (Foot, 2002) dilemmas. On the other hand, such a core could allow her to do something much more useful: to give (Mauss, 1923) and share as humans do.

To attain such a goal, to train such a “gift-distributing automaton”, the proto-AMA would have to be exposed to myriads of stories which have something in common with the previous stories but also transfer a restricted amount of novel information. Learning can be stimulated neither by unparsable novelties nor by boring re-exposures to that which is already known: it is the combination of the two which brings about the highest information content. Or, as is well known to information theorists as well as to developmental psycholinguists: “An optimally informative pair balances overlap and change” (Brodsky et al., 2007). It was indeed the overlap between certain subjacent structures of S1 and S2 which allowed the AMA trained with S1 to solve the dilemma posed by S2. And it could be, for example, an overlap between S2 and Amartya Sen’s kindergarten anecdote of three children and the flute (Sen, 2011) which shall allow one to solve the flute-attribution problem in a certain manner. We agree with Sen that in a situation where one child masters the flute well, another does not have any, and the third made it, there is no clear-cut, universal way to decide which child should get it. But we also specify that the moral agent’s final choice should not be understood solely in terms of the utilitarian (resp. egalitarian or libertarian) reasons with which she’ll try, often post hoc (Haidt, 2013), to justify her decision.
We are convinced that the true causes of the AM’s choice are rooted in a knowledge-base of dozens of half-general, half-specific patterns and item-based constructions (Tomasello, 2009); we are convinced that moral judgments are grounded in hundreds of half-forgotten minute stories and thousands of fuzzy image-like impressions of sharing, charity and egocentric pride to which the AM was once exposed.

Conclusion

During its phylogeny, the Homo sapiens sapiens species has evolved specific cognitive modules for fast detection of morally relevant features in the surrounding environment (Haidt, 2013). But in order to keep pace with the ever-accelerating change of that environment, these modules are also:

1. only partially specific - i.e. they can sometimes match completely new types of stimuli
2. prone to inhibition or tuning driven by environment-originated processes (e.g. story-telling)
3. recombinable into more complex schemas (templates)

In other terms: which stimuli these modules shall match in practice, the extent to which their activation shall result in a behavioral response, as well as the concrete ways in which these modules interact with each other and with other modules of the same cognitive system, are all modulable by the environment. Thus, analogically to usage-based linguistics (Tomasello, 2009), which postulates that man’s specific linguistic competence is grounded in an ever-evolving history of interactions with his environment, morality too is a competence which is grounded in the multitudes of cases of “social learning” (Bandura and McClelland, 1977) with which the human child is confronted - either as passive observer or as active interactor - from birth onwards.

In this article, we have aimed to present one particular means by which such grounding of moral norms and values could potentially be simulated even in contemporary artificial agents. It departed from the observation that a certain non-negligible amount of high-order moral competence is, in the case of human beings, principally transferred by “telling stories”, id est, by narration. In relation to the transfer of moral values from an older generation to a new one - or from one kind of computational agents to another - narration appears to be crucial due to both its theoretical significance and its practical implementability.

The theoretical significance of narration - of telling fairy-tales and myths (Mudry et al., 2008), of religious indoctrination or the teaching of history - is evident to anyone who realizes that, aside from language, narration also seems to be a cultural universal. That is, a phenomenon observable in any human society whatsoever. Verily, the tendency is universal: in every human society one can see every human child being eager to hear stories. And it is indeed such universally present narrative avidity of all children which makes us adhere to the camp of those who believe that narration is key not only to the notion of “morality” (Vitz, 1990), but potentially to the notion of “humanity” itself.

But narrative-based models of moral competence in artificial agents are also worthy of interest because of their practical implementability. Given that both conditions:

1. moral values can be transferred and modulated by stories encoded in textual modality[10]
2. Computational Linguistics and Natural Language Processing are well-developed disciplines which already, as of 2015, offer dozens of excellent methods for the processing of documents encoded in textual modality

seem to be fulfilled, one is tempted to state that the path leading to the emergence of AMAs, TmoTs (Hromada, 2012) or even fully autonomous AAAs is not hindered by major methodological obstacles. Thus, the first attempts to ground a machine’s morality by means of story-telling can be started almost immediately. Under the condition, of course, that a sufficiently exhaustive corpus C - or a narrator willing to construct the corpus C and to “seed” the ontogeny of an individual AM with it - is at hand. Given that such a narrative corpus is available, as well as an individual human teacher willing to confront an NLP-based AA with the corpus’s contents in a longitudinal sequence of individual and situated sessions, the development shall - so it is conjectured (Turing, 1950) - gradually (Hromada, 2012) lead to the emergence of artificial entities indistinguishable from a human being. This being said, we suggest that the enterprise aiming to grant machines access to transpersonal values shall succeed with higher probability if it draws its inspiration from Piaget’s 4-staged model than if it does not imitate any constructivist, bootstrapping and empathy-involving process at all.

[10] A trivial proof-of-concept that such transfer is indeed possible is related to the fact that the reader has understood the moral intention encoded in S1.

We would like to thank both our students and reviewers for useful insights and feedback concerning the current and future content of the moral training corpus.

Bibliography

Adler, Alfred. 1976. Connaissance de L’homme. Payot.
Bandura, Albert, and David C. McClelland. 1977. “Social Learning Theory.”
Bentham, Jeremy. 1780. “The Principles of Morals and Legislation.”
Berger, Peter L., and Thomas Luckmann. 1991. The Social Construction of Reality: a Treatise in the Sociology of Knowledge. 10. Penguin UK.
Brams, Steven J. 2011. Game Theory and the Humanities: Bridging Two Worlds. MIT Press.
Brodsky, Peter, H. R. Waterfall, and Shimon Edelman. 2007. “Characterizing Motherese: on the Computational Structure of Child-Directed Language.” In Proceedings of the 29th Cognitive Science Society Conference, Ed. D. S. McNamara & J. G. Trafton, 833–38.
Čapek, Karel. 1925. RUR (Rossum’s Universal Robots): a Fantastic Melodrama. Doubleday, Page.
Clark, Alexander. 2010. “Distributional Learning of Some Context-Free Languages with a Minimally Adequate Teacher.” In Grammatical Inference: Theoretical Results and Applications, 24–37. Springer.
Clark, Eve V. 2009. First Language Acquisition. Cambridge University Press.
Comenius, Johann Amos. 1896. The Great Didactic of John Amos Comenius. A. & C. Black.
Covington, Michael A. 1994. Natural Language Processing for Prolog Programmers. Prentice Hall, Englewood Cliffs (NJ).
Dobsinsky, Pavol. 1883. Simple National Slovak Tales.
Durkheim, Emile. 1933. “The Division of Labor.” Trans. G. Simpson, New York: Macmillan.
Elman, Jeffrey L. 1993. “Learning and Development in Neural Networks: the Importance of Starting Small.” Cognition 48 (1): 71–99.
Foot, Philippa. 2002. “The Problem of Abortion and the Doctrine of the Double Effect.” Applied Ethics: Critical Concepts in Philosophy 2: 187.
Gärdenfors, Peter. 1990. “Induction, Conceptual Spaces and AI.” Philosophy of Science: 78–95.
Haidt, Jonathan. 2013.
The Righteous Mind: Why Good People Are Divided by Politics and Religion. Random House LLC.
Harnad, Stevan. 1990. “The Symbol Grounding Problem.” Physica D: Nonlinear Phenomena 42 (1): 335–346.
Holland, John H. 1975. Adaptation in Natural and Artificial Systems: an Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. U Michigan Press.
Hromada, Daniel Devatman. 2011a. “The Central Problem of Roboethics: from Definition Towards Solution.” In Proceedings of 1st International Conference of International Association of Computing and Philosophy. IACAP; Verlagshaus Monsenstein und Vannerdat.
———. 2011b. “Initial Experiments with Multilingual Extraction of Rhetoric Figures by Means of PERL-Compatible Regular Expressions.” In RANLP Student Research Workshop, 85–90.
———. 2012. “From Age&Gender-Based Taxonomy of Turing Test Scenarios Towards Attribution of Legal Status to Meta-Modular Artificial Autonomous Agents.” In AISB and IACAP Turing Centennary World Congress, Birmingham, United Kingdom, 7.
———. 2014. “Conditions for Cognitive Plausibility of Computational Models of Category Induction.” In Information Processing and Management of Uncertainty in Knowledge-Based Systems, 93–105. Springer.
Hromada, Daniel Devatman, and Ilaria Gaudiello. 2015. “Introduction to Moral Induction Model and Its Deployment in Artificial Agents.” Sociable Robots and the Future of Social Relations: Proceedings of Robo-Philosophy 2014. IOS Press.
Jung, Carl Gustav. 1967. Die Dynamik des Unbewussten. Vol. 8. Walter.
Kant, Immanuel. 2002. Groundwork for the Metaphysics of Morals. Yale University Press.
Karpathy, Andrej, and Li Fei-Fei. 2014. “Deep Visual-Semantic Alignments for Generating Image Descriptions.” ArXiv Preprint ArXiv:1412.2306.
Koza, John R. 1992. Genetic Programming: on the Programming of Computers by Means of Natural Selection. Vol. 1. MIT Press.
Malle, Bertram F., and Matthias Scheutz. 2014. “Moral Competence in Social Robots.” In IEEE International Symposium on Ethics in Engineering, Science, and Technology, Chicago.
Malle, Bertram F. 2014. “Moral Competence in Robots?” Sociable Robots and the Future of Social Relations: Proceedings of Robo-Philosophy 2014 273: 189.
Mauss, Marcel. 1923. “Essai Sur Le Don: Forme Et Raison de L’échange Dans Les Sociétés Archaïques.” L’Année Sociologique (1896/1897-1924/1925): 30–186.
Mikhail, John. 2007. “Universal Moral Grammar: Theory, Evidence and the Future.” Trends in Cognitive Sciences 11 (4): 143–152.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Foundations of Machine Learning. MIT Press.
Mudry, P.-A., Sarah Degallier, and Aude Billard. 2008. “On the Influence of Symbols and Myths in the Responsibility Ascription Problem in Roboethics - a Roboticist’s Perspective.” In Robot and Human Interactive Communication, 2008. RO-MAN 2008. The 17th IEEE International Symposium on, 563–568. IEEE.
Muntean, Ioan, and Don Howard. 2014. “Artificial Moral Agents: Creative, Autonomous, Social. An Approach Based on Evolutionary Computation.” Sociable Robots and the Future of Social Relations: Proceedings of Robo-Philosophy 2014 273: 217.
Piaget, Jean. 1965. The Moral Judgment of the Child. New York: The Free Press.
Richerson, Peter J., and Robert Boyd. 2008. Not by Genes Alone: How Culture Transformed Human Evolution. University of Chicago Press.
Rosch, Eleanor. 1999. “Principles of Categorization.” Concepts: Core Readings: 189–206.
Rousseau, Jean-Jacques. “Émile, Ou de L’éducation.”
Sahlgren, Magnus. 2008.
“The Distributional Hypothesis.” Italian Journal of Linguistics 20 (1): 33–54.
Samuel, A. L. 1959. “Some Studies in Machine Learning Using the Game of Checkers.” IBM Journal of Research and Development 3 (3): 210.
Sen, Amartya. 2011. The Idea of Justice. Harvard University Press.
Solan, Zach, David Horn, Eytan Ruppin, and Shimon Edelman. 2005. “Unsupervised Learning of Natural Languages.” Proceedings of the National Academy of Sciences of the United States of America 102 (33): 11629–11634.
Tomasello, Michael. 2009. Constructing a Language: a Usage-Based Theory of Language Acquisition. Harvard University Press.
Turing, Alan M. 1950. “Computing Machinery and Intelligence.” Mind: 433–460.
Turing, Alan Mathison. 1939. “Systems of Logic Based on Ordinals.” Proceedings of the London Mathematical Society 2 (1): 161–228.
Victorri, Bernard. 2014. “L’origine du Langage.” http://www.les-ernest.fr/lorigine-du-langage.
Vitz, Paul C. 1990. “The Use of Stories in Moral Development: New Psychological Reasons for an Old Education Method.” American Psychologist 45 (6): 709.
Wall, Larry, Tom Christiansen, and Jon Orwant. 2004. Programming Perl. O’Reilly Media.
Wallach, Wendell, and Colin Allen. 2008. Moral Machines: Teaching Robots Right from Wrong. Oxford University Press.
Widdows, Dominic. 2008. “Semantic Vector Products: Some Initial Investigations.” In Second AAAI Symposium on Quantum Interaction, 26–28. Citeseer.
Winograd, Terry. 1972. “Understanding Natural Language.” Cognitive Psychology 3 (1): 1–191.
Wittgenstein, Ludwig. 1971. Tractatus Logico-Philosophicus. Ithaca: Cornell University Press.

Evolutionary Modelling of the Ontogeny of Speech Categories: 4 Simulations
Daniel Devatman Hromada (daniel@udk-berlin.de)
1 Slovak University of Technology, Faculty of Electronic Engineering and Informatics, Department of Robotics and Cybernetics
2 Université Paris 8, École Doctorale Cognition, Langage, Interaction, Laboratoire Cognition Humaine et Artificielle
3.6.2016

Table of Contents
1 Introduction (Cotutelle; Conceptual Foundations; Theory of Intramental Evolution)
2 Four simulations
3 Evolutionary grammar induction

Cotutelle
A PhD under double supervision.

Conceptual Foundations
An almost 300-page treatise striving for a synthesis of three scientific paradigms:
1 universal darwinism (36 pages)
2 developmental psycholinguistics (50 pages)
3 computational linguistics (63 pages)
It also contains a 38-page summary of qualitative observations of one human toddler (0-30 months) and 27 pages of quantitative analyses extracted from the Child Language Data Exchange System (CHILDES) corpus.

Basic Theses
1 “Mind evolves”
2 “Learning is a form of evolution”
3 “Learning can be successfully simulated by means of evolutionary computation”
4 “Learning of natural languages can be successfully simulated by means of evolutionary computation”
5 “The ontogeny of child speech can be successfully simulated by means of evolutionary computation”

Theory of Intramental Evolution
Basic postulate: The development of an individual mind can be interpreted - or even simulated - as a process of replication, variation and selection of the information-bearing cognitive structures contained in that mind.
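Translated into code, the postulate amounts to nothing more exotic than the canonical evolutionary loop. The following schematic sketch (all names and parameters are arbitrary illustrative choices) makes the claim concrete: any population of cognitive structures for which a fitness measure and a mutation operator can be defined is, in this sense, evolvable.

    import random

    def evolve(population, fitness, mutate, generations=100, elite=0.2):
        """Generic replication-variation-selection loop: the fittest fraction
        of the population is selected and replicated with variation."""
        for _ in range(generations):
            ranked = sorted(population, key=fitness, reverse=True)
            parents = ranked[:max(1, int(elite * len(ranked)))]   # selection
            population = [mutate(random.choice(parents))          # replication
                          for _ in range(len(population))]        # + variation
        return max(population, key=fitness)

    # Toy usage: evolve a 12-bit string towards the all-ones target.
    target = [1] * 12
    fitness = lambda s: sum(a == b for a, b in zip(s, target))
    mutate = lambda s: [bit ^ 1 if random.random() < 0.1 else bit for bit in s]
    best = evolve([[random.randint(0, 1) for _ in target] for _ in range(30)],
                  fitness, mutate)
    print(best, fitness(best))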
Piaget’s genetic epistemology already strove for something similar; the Theory of Intramental Evolution, however, also speaks of simulation or even emulation... The simulations of my dissertation are an attempt to provide a certain proof ex computatione of the validity of this theory.

Table of Contents
1 Introduction
2 Four simulations (Zeroth simulation; Simulations 1-3; Simulation 1: Learning a semantic classifier; Simulation 2: Learning a morphological sorter; Learning parts of speech; Grammar induction)
3 Evolutionary grammar induction

Zeroth simulation: the Voynich manuscript
The enigma: 240 pages of text written in an unknown script (and possibly in an unknown language), accompanied by illustrations with botanical, medicinal, astrological and other motifs.

The zeroth simulation:
1 my first evolutionary algorithm of my own
2 the genome of every individual is 19 characters long and specifies a possible transcription of one symbol of the manuscript onto one of the possible phonemes of the target language (e.g. Slavic languages: 38 characters)
3 it concentrates on the transcription of one part of the manuscript, the so-called “calendar”, onto lists of first names
4 transcriptions are most successful when the dictionaries contain female names written from right to left
5 Hebrew and Slavic diminutive female names...

Common traits of simulations 1-3
All three simulations:
1 strive to solve machine learning problems
2 use texts written in colloquial English as input data
3 characterize the words in these texts by means of certain features; these features are subsequently used to project the text into vector spaces
4 principally operate in relatively low-dimensional binary (Hamming) spaces
5 perform an evolutionary search for optimal solutions
6 in the innermost cycle of fitness-function evaluation, Hamming distances are always measured

Simulation 1: Learning a semantic classifier
Multi-class semantic classification of texts (Elitech 2015, applied informatics, award)
Corpus: 20 newsgroups (18845 texts from 20 usenet categories); 11314 texts as training data, 7543 texts as testing data. The frequencies of occurrence of individual words in individual texts provide the features by means of which we geometrize the text.
Basic idea: In the vector space, we search for points which are as close as possible to the vector representations of objects of a certain category and as far as possible from the vector representations of objects of other categories.

Prototype Theory
“Items rated more prototypical of the category were more closely related to other members of the category and less closely related to members of other categories than were items rated less prototypical of a category” (Rosch and Mervis, 1975)

Fitness function:

F_{CP}(P_K) = \sum_{t \in C_K} F_{hd}(\vec{h}_t, P_K) - \sum_{f \notin C_K} F_{hd}(\vec{h}_f, P_K)    (1)

(P_K is a candidate for the prototype of the K-th class; \vec{h}_t is the vector representation of an object also belonging to K; \vec{h}_f is the vector representation of an object which does not belong to K; F_{hd} is the Hamming distance.)

The problem of linear separability...
...may not be such a burning problem for classification models based on Prototype Theory!
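For concreteness, fitness function (1) is only a few lines of code. Below is a sketch of our own over toy binary vectors, under the plausible reading that lower values mark better prototype candidates (close to the own class, far from the others):

    def hamming(u, v):
        """Hamming distance between two equally long binary vectors."""
        return sum(a != b for a, b in zip(u, v))

    def fitness_cp(prototype, members, non_members):
        """Fitness (1): summed Hamming distance to the prototype's own class
        minus the summed distance to all other classes' objects."""
        return (sum(hamming(h, prototype) for h in members)
                - sum(hamming(h, prototype) for h in non_members))

    K_objects = [[1, 1, 0, 0], [1, 1, 1, 0]]       # objects of class K
    others    = [[0, 0, 1, 1], [0, 1, 1, 1]]       # objects of other classes
    print(fitness_cp([1, 1, 0, 0], K_objects, others))  # -6: a good prototype
    print(fitness_cp([0, 0, 1, 1], K_objects, others))  #  6: a bad one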
Learning Parts of Speech
Problems such as part-of-speech (POS) induction and POS tagging are among the best-elaborated problems of computational linguistics.
On the usefulness of parts of speech:
1 If a human is able to recognize that an unknown word W_X belongs to category K, s/he can more easily assign it a meaning.
2 Without parts of speech there are no grammars.
The second simulation:
1 the Brown/Eve section of the CHILDES corpus
2 transcripts with POS tags manually corrected by human annotators
3 training corpus (972 word types): Eve before reaching two years of age; testing corpus (934 word types): Eve in the age range 2 - 2.12 years
4 449 word types occur only in the testing corpus

Method
Only three simple features are used to project a word X into the vector space: the suffix of word X, the suffix of the word to the right of X, and the suffix of the word to the left of X.
Operational principle A: "Pay attention to the ends of words." (Slobin, 1973)
After the geometrization of all tokens, we subsequently search for the prototypes of the individual morphological classes by means of the objective function

F_{object}(\vec{i}, \vec{o}) = |P_F|, \quad \text{where} \quad p_x \neq p_T \wedge Hd(\vec{o}, \vec{p}_x) \le Hd(\vec{o}, \vec{p}_T) \implies p_x \hookrightarrow P_F    (2)

i.e. we penalize for every incorrect prototype p_x which is at least as close to the object \vec{o} as the correct one (p_T). What we are searching for are optimal constellations of prototypes.

A Few Results
(Results figure not recoverable from the extraction.)

Results That Surprised...
A subsequent inspection of false positives turns out to be quite instructive. Hence, the token "building", present in the utterance "what are you building here?" on line 5417 of the eve05.cha transcript, is clearly not a noun, as CHILDES annotators and correctors supposed, but rather a participle - and hence an instance belonging to the ACTION class, as correctly predicted by the FITTEST (GAMERGE 1) individual. Idem for "hit", present in the utterance "did you hit your head?" on line 4145 of the eve01.cha transcript: the token is clearly not a noun, as postulated by CHILDES annotators, but, as predicted, a verb and hence a member of the ACTION class. And one can continue: the token "matter", annotated on lines 2152 and 5688 of the CHILDES corpus as a verb, is clearly not a verb but a noun - and hence a member of the class SUBSTANCE - because it twice occurs in the utterance "what's the matter?". And in spite of the fact that CHILDES labels the token "numbers" as a verb, it is definitely not a verb when it occurs in the utterance "the numbers are going around too" (eve15.cha, line 6276). Et caetera, et caetera.

Grammar Induction | Inference
Definition of the problem: we have a set M of sentences of a language J. The goal of grammar induction (GI) is to distill from M knowledge (or a model, rules, schemata, patterns etc.) which will subsequently allow us to generate also those sentences of language J which were not in M.
The stumbling block: over-generalization (over-regularisation), e.g. when a two-year-old starts saying "goed" instead of "went". The goal of GI is to find systems of rules which are neither too specific: (1 → <corpus>), nor too general: 1 → 2*, 2 → a|b|c...Z.

The Keystone
The problem of over-generalization can be solved by setting up the evolutionary process in a way which penalizes overly general solutions. Evolution's ability to get rid of what is unneeded takes care of the rest:

Fitness_1(N_X) = Y_X \cdot \frac{Y_X}{E_X}, \qquad E_X = \prod_{k=1}^{N} I_{H_k}

where Y_X is the number of corpus sentences matched by the phenotypic expression of N-schema X, and E_X is the theoretically maximal possible extension, obtained as the multiplicative product of the extensions of the categories encoded in N_X. A sketch of this computation follows.
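The generality penalty may again be easier to grasp in code. Below is a minimal sketch under assumptions of mine: the N-schema is a list of categories, its phenotype is one regex alternation group per category, and the extension I_Hk of a category is simply the number of its members; the toy corpus is invented for illustration.

#!/usr/bin/perl
# Fitness_1 = Y * Y / E : reward coverage (Y = sentences matched),
# punish generality (E = product of the category extensions).
use strict; use warnings;

my @corpus = ('this is a dog', 'that is the duck', 'it is a duck');
my @schema = ( ['this ', 'that ', 'it '], ['is '], ['a ', 'the '], ['dog', 'duck'] );

# Phenotype: each encoded category becomes one alternation group.
my $regex = '^' . join('', map { '(' . join('|', @$_) . ')' } @schema) . '$';

my $y = grep { /$regex/ } @corpus;        # Y_X: matched sentences
my $e = 1; $e *= scalar @$_ for @schema;  # E_X: product of extensions

printf "%s\nY=%d E=%d Fitness1=%.3f\n", $regex, $y, $e, $y * $y / $e;
# prints ^(this |that |it )(is )(a |the )(dog|duck)$ with Y=3, E=12, Fitness1=0.750

An overly general schema, e.g. one alternating over the whole lexicon in every slot, inflates E and is thereby punished even if it matches the entire corpus.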
From Theory to Practice
Theoria: Δ-dimensional vector spaces, G-categories, Hamming spheres, H-categories, syntagmatic and paradigmatic categories, N-schemata...
Praxis: transcription of the vectors which describe constellations of regions of Hamming spaces into good old Perl regular expressions.
(Figure: a syntagma of Hamming spheres - H1 centered at BABC with radius 17, H2 centered at 0F20 with radius 5, H3 centered at 5FF0 with radius 7, H4 centered at C124 with radius 3, a further sphere centered at 7723 with radius missing - transcribed into the regular expression ^(this |that|it )(is )(not )(a |the )(dog |duck)$.)

First Results
c.f. Appendix 1

Discussion: A Few Questions
Can machine learning be realized by means of evolutionary algorithms? YES - provided that the central feature of machine learning is the ability to generalize the knowledge contained in the training data.
Can evolutionary algorithms be useful for solving the problems of computational linguistics? YES - but only under the assumption of a suitably chosen objective function and representation of the individual solutions. Recommendation: a combination of subsymbolic (e.g. geometric) and symbolic levels of representation turns out to be useful.
Advantages of the evolutionary approach in comparison with connectionist solutions? Connectionists model the structural properties of cognitive systems; but the possibility of defining a fitness function makes it possible to...

Thank you for your attention.


Reproducible Identification of Pragmatic Universalia in CHILDES Transcripts: GNU meets OpenScience

Daniel Devatman Hromada (daniel@wizzion.com)
1 Université Paris 8 / Lumières, École Doctorale Cognition, Langage, Interaction, Laboratoire Cognition Humaine et Artificielle
2 Slovak University of Technology, Faculty of Electronic Engineering and Informatics, Department of Robotics and Cybernetics
3 Universität der Künste, Fakultät der Gestaltung, Berlin

Table of Contents
1 Introduction: Psycholinguistics; Reproducibility; Universalia
2 Corpus, Tools and Method: Corpus; Tools; Method
3 Three analyses: Laughing; Second Person Singular; First Person Singular
4 Conclusion

Developmental Psycholinguistics
DP is a science which uses the experimental methods of developmental psychology in order to study the acquisition, learning and development of linguistic structures and processes in human children. Its multiple epistemological and methodological problems include:
1 the child's behaviour is often very unstable
2 the very fact of being subjected to an experiment impacts the child's responses
3 the invasiveness problem
These problems do not exist when the researcher decides to observe instead of experiment!
Reproducibility: The Hallmark Principle
"Non-reproducible single occurrences are of no significance to science." (Popper, 1992)
Experimenter-independent reproducibility can be attained iff:
1 all experimenters use the same dataset
2 they use the same (or at least a very similar) set of tools
3 the first experimenter faithfully protocols the usage of such tools
4 other experimenters follow the protocol
5 the analysis is deterministic

Pragmatic and Ontogenetic Universalia
Linguistic Universal: a pattern that occurs systematically across natural languages. Most common lists of universals, like those of Greenberg (1963), concern syntax, morphology or semantics.
Pragmatic Universal: a L.U. related to the pragmatic facet (extralinguistic context, deictics, etc.) of linguistic communication.
Ontogenetic Universalia: introduce the temporal dimension (age).

Corpus: CHILDES
Child Language Data Exchange System (MacWhinney & Snow, 1985)
http://childes.psy.cmu.edu/data
http://wizzion.com/CHILDES/ (mirror from 6th Feb 2016)
1 more than 50 years of tradition
2 cca 30000 transcripts
3 more than 1.5 gigabytes of mostly textual data
4 at least 26 languages, dialects or language combinations
5 major terran language groups (Indo-European, Ugro-Finnic, Semitic, Altaic, East-Asian, South-Asian) represented
6 Creative Commons BY-NC-SA licence

Corpus: CHAT format
The CHAT system provides a standardized format for producing computerized transcripts of face-to-face conversational interactions (MacWhinney, 2016; http://childes.talkbank.org/manuals/chat.pdf).

@Begin
@Languages: eng
@Participants: CHI Eve Target_Child , MOT Sue Mother , FAT David Father
@ID: eng|Brown|CHI|1;6.|female|||Target_Child|||
@ID: eng|Brown|MOT|||||Mother|||
@ID: eng|Brown|FAT|||||Father|||
@ID: eng|Brown|RIC|||||Investigator|||
@ID: eng|Brown|COL|||||Investigator|||
@Date: 29-OCT-1962
*MOT: one two three four .
%mor: det:num|one det:num|two det:num|three det:num|four .
%act: tests tape recorder
*CHI: one two three . [+ IMIT]

Tools: GNU + PERL + R
The idea is to perform the analysis solely with publicly available open-source command-line tools.
The GPR combo:
GNU: grep, sort, uniq, sed, wc (runs in bash, connected through pipes)
PERL: regular expressions are part of the language syntax
R: vectors, matrices, plotting
First command:

wget -P CHILDES -e robots=off --no-parent --accept '.cha' -r http://wizzion.com/childes/CHILDESflat

Method: Pre-processing
Populate filenames with age information:

mkdir aged; grep -P '\|\d;\d' * | grep Child | perl -n -e 'chomp; `cp $1 aged/$2-$3-$1` if /^(.*?):.*0?(\d+);0?(\d+)/;' ; rm *.cha

Remove noise:

perl -ni -e 'print if $_!~/^\*(MOT|CHI):\t(xxx|www) ?\./' aged/*

Extract Child and Motherese utterances:

mkdir CHI; cp aged/* CHI; sed -i '/\*CHI/! d' CHI/*; mkdir MOT; cp aged/* MOT; sed -i '/\*MOT/! d' MOT/*;

This yields 5 833 656 CHI utterances contained in 29180 transcripts, and 3 798 005 MOT utterances contained in 13590 transcripts.

Method: Metrics
The main metric is the probability P_X that signifiant X shall occur in an utterance:

P_X = F_X / N_utterances

where F_X is the absolute number of occurrences of X in a CHILDES section, and the normalization factor N_utterances denotes the number of utterances of that CHILDES section. Probability values are mutually comparable. (A sketch of this normalization follows.)
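The normalization step can be sketched in a few lines; the sketch assumes, as its own illustrative convention, that both input files contain "count year month" lines of the kind produced by the one-liners above, and both filenames are hypothetical.

#!/usr/bin/perl
# P_X = F_X / N_utterances, computed per age bin (year;month).
use strict; use warnings;

my (%f, %n);
open my $F, '<', 'exp2.MOT.Eng.F' or die $!;   # hypothetical F_X file
while (<$F>) { my ($c, $y, $m) = split; $f{"$y;$m"} += $c; }
open my $N, '<', 'MOT.Eng.N' or die $!;        # hypothetical N file
while (<$N>) { my ($c, $y, $m) = split; $n{"$y;$m"} += $c; }

# Lexicographic sorting of the age keys suffices for a sketch.
for my $age (sort keys %n) {
    next unless $n{$age};
    printf "%s P=%.4f\n", $age, ($f{$age} // 0) / $n{$age};
}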
Three Analyses
1st analysis: Laughing; 2nd analysis: Second Person Singular; 3rd analysis: First Person Singular

1st Analysis: Laughing
Objective: verify whether the observed tendency (Hromada, 2016, Conceptual Foundations) of mothers to laugh less in interaction with older toddlers is specific to English, or whether it is a culture-independent invariant.
Both &=laughs and =!laughing tokens are used by diverse CHILDES transcribers, so we simply grep for occurrences of the laugh token:

grep laugh MOT/*French* | grep -o -P '\-French\-.+\-' | sort | uniq -c
grep laugh MOT/*Farsi* | grep -o -P '\-Farsi\-.+\-' | sort | uniq -c
grep laugh MOT/*Japanese* | grep -o -P '\-Japanese\-.+\-' | sort | uniq -c
grep laugh MOT/*Chinese* | grep -o -P '\-Chinese\-.+\-' | sort | uniq -c
wc -l MOT/*Eng* | perl -e 'while (<>){s/MOT\///;/(\d+) (\d+-\d+)-/; $h{$2}+=$1; } for (sort keys %h) {/(\d+)-(\d+)/; print "$h{$_} $1 $2\n";}' > MOT.Eng.N

Plot (figure not preserved in the extraction).

Some observations:
For English, French and Farsi children: a marked decrease of maternal laughing between the first and third year of age; little children laugh more often than their mothers, but older children laugh less frequently than their mothers; significant correlations between MOT and CHI in English (Pearson's corr. coeff. 0.933, p = 7.886e-05) and in Farsi (corr. coeff. 0.972, p = 0.02735); almost significant in French (p = 0.053, corr. coeff. = 0.947).
With regard to laughing, Indo-European mothers and children seem to follow different ontogenetic trajectories than their Japanese and Chinese counterparts ⇒ no culture-independent Universal?

2nd Analysis: 2nd Person Singular Pronouns
Language-specific CHILDES sub-corpora are matched by means of Perl-compatible regular expressions (PCREs). The absolute frequency F_X of the cases when PCRE_X matched is assessed as usual:

grep -i -P "[\t ]you[' ]" MOT/*Eng* | perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' | uniq -c > exp2.MOT.Eng.F

Subsequently, the F_X/N_utterances division and the plotting are realized in R (c.f. http://wizzion.com/code/jadt2016/childes.R for the trivial R-code snippet).

Plot (figure not preserved in the extraction).

Some observations:
In English motherese, "you" is used in cca every fifth utterance; there is a significant correlation between the CHI and MOT time series
(Pearson's corr. coeff. = 0.768, t = 3.393, df = 8, p-value = 0.009451; Kendall's tau = 0.6, T = 36, p-value = 0.016671; Spearman's rho = 0.733, S = 44, p-value = 0.02117).
In all languages: a marked increase in maternal usage of the 2nd p. sg. between the 1st and 4th year of age has been observed in all six studied languages (representing three distinct language groups); children use the 2nd p. sg. less often than mothers (only exception: Farsi between 2 and 3) ⇒ an ontogenetic Universal?

3rd Analysis: 1st Person Singular Pronouns
Language-specific CHILDES sub-corpora are matched by means of Perl-compatible regular expressions (PCREs). The absolute frequency F_X of the cases when PCRE_X matched is assessed as usual:

grep -i -P "[\t ]I[' ]" MOT/*Eng* | perl -n -e '/MOT\/(\d+)-(\d+)/; print "$1 $2\n"' | uniq -c > exp3.MOT.Eng.F

Subsequently, the F_X/N_utterances division and the plotting are realized in R (c.f. http://wizzion.com/code/jadt2016/childes.R for the trivial R-code snippet). Important: the focus is on ALL transcripts of a given language.

Plot (figure not preserved in the extraction).

Some observations:
ALL: around 3 years of age, children tend to pronounce the 1. p. sg. much more frequently than their mothers.
ALL: a steep decline between the 6th and 7th year of age (offset of the "egocentric" stage?).
ENGLISH: significant correlation between the usage of mothers and children.
Significant intercultural correlations: French and Chinese children (p = 0.02474); English and French children (p = 0.002425); Polish and Hebrew children (p = 0.048); Polish and French children (p = 0.048) ⇒ a language-independent ontogenetic trajectory of the usage of the 1. p. sg.?

Methodological conclusion
The combination of command-line (no GUI!), open-source (for free!), fast* and deterministic utilities (grep, uniq, ...) and languages (PERL, R) yields a 100% reproducible methodology at very little cost. The experimental protocol is automatically stored in the .history (or .bash_history) and .Rhistory files: no need to reinvent the wheel!
* The 3rd analysis, executed on one sole core of a 3.2 GHz PC with 8 GB RAM and with the CHILDES data stored on an SSD disk, was over in less than 15 seconds.

Epistemological conclusion
Developmental Psycholinguistics + Natural Language Processing + Big Data + OpenScience = la textométrie psycholinguistique
Manifesto:
to perform state-of-the-art research without expensive tools and apparati
to study the ontogeny of soul and language in a non-invasive fashion
to share all that can be shared

Psycholinguistic conclusion
Piaget was right.

Thank you for your attention. Questions?


A Poem about the Apple, or prolegomena phaenomemeticon
Final thesis of Daniel Hromada within the bachelor's study of Liberal Arts and Humanities at the FHS UK
Supervisor: Jan Havlíček, PhD. Opponent: prof. Jan Sokol

Introductory note for the FHS: I wanted to lay before you a theory, a new theory about what I provisionally call, for my own working purposes, the "inertia of the sign". Its central protopostulate was to read: the probability of re-activation of a neurolinguistic circuit S is inversely proportional to the time which has elapsed since the last activation of that same circuit.
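One possible way of writing this protopostulate down (my own rendering, added for orientation; no such formula appears in the thesis itself): if t_S denotes the moment of the last activation of circuit S, then

P_{react}(S, t) \propto \frac{1}{t - t_S}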
In my first methodological work (Hromada, 2007) I presented to you both an empirical sample of millions of items against which I dared to verify the stated postulate, and a new, visualization-based method by which I intended to perform the whole act of verification. After that work did not succeed, I wrote, slightly offended and perhaps with a certain dose of rancour, a new work, thematizing a certain social phenomenon whose consequence is the annual springtime unveiling of female bosoms on a certain internet discussion forum. When that work, essentially empty yet fulfilling the well-worn formal requirements enforced by the paradigm which rules today's academia, succeeded, the central topic of my inquiry seems to have been definitively decided: a topic whose celebration, at my present stage of development, makes more sense to me than any other. It is, of course, the female breast.

In the nine months which have passed since the decision to write my bachelor's thesis on this God-pleasing topic, I have long since understood that all those regularities on the level of syllables and morphemes to which I originally wanted to point have already been analyzed for almost a century by a highly sophisticated science called phonology, of whose existence I had no inkling a year ago. I have also understood that even my "inertia of the sign" is not so "hot" after all, and that this universal principle of the human mind has for at least two decades been indirectly pointed to by every scientific article whose abstract contains the keyword "priming". If this work touches subliminally upon priming as well as phonology, it shall do so only in relation to the central topic. On these pages the female breast will become for me, in a certain sense, the fulcrum from which I shall push off, and on whose example I shall try to illustrate the validity of certain more general rules of human cognition.

Without the general there is no science, and any claim trying to persuade the world that scientificity is not the second most essential aspiration of this text would be a lie. Not, however, a narrow-minded scientificity, blindly fixated on the analysis of the petty, on the "brain of the leech" long since mocked by Nietzsche, but a joyful, playful, synthetic scientificity. Not only will the text make no distinction between philology, sociology and anthropology; on the contrary, in certain passages it will try to integrate into itself also the natural and strictly empirical sciences, as empirical as only a woman's bosom can be.

If, however, anything is to be the highest aspiration of this text, let it be poeticity. Let this text resemble the treatises of the medieval agronomists of El-Andalus; let it be the work of an engineer written in verse. Let it be a garden - yet not a garden of Versailles, consisting of precisely measured geometric proportions colder than death. No, let it be an English garden: a sprawling grove-orchard-park where, upon the lover of embellished chaos, a small antique temple now and then peeks out, hiding in thickets of trees planted, at first sight, without rhyme or reason. Paragraphs will be my trees and sentences their leaves. Tables, matrices and, God forbid, even graphs will be our temples. Yet only he who glimpses a graceful gazelle running between them will be able to call himself one who has understood this work. Metaphor and metonymy will be my method. Whether you accept the work and dance in this garden, or whether, with a single wave of a magic wand, a single cold review of a perfectly "etatized" master, you level it to the ground and build an asphalt parking lot in its place, is up to you.
All that remains for me is the fulfilment of my last study duty. After three beautiful years of liberal study à la Humboldt, when I wandered from course to course while you initiated me into the beauties of the love of wisdom; after years in which you reconstituted the deepest structures of meaning of which "I" myself consist; but also after years in which I had to discover, willy-nilly, that money, cynicism and "pecking orders" have naturally penetrated even the souls of the wisest - I now submit to you my final work written within the Study of Liberal Arts and Humanities at Charles University. A work in which I shall attempt to direct your attention to the fact that not only logos, idea, ritual, culture, religion or the upright posture make man human, but that man is human not least because the human female has heavenly beautiful knockers. That, in short, is all the knowledge. Everything else is merely hypotheses, hypotheses which beget further hypotheses, hypotheses of which a poet once hinted that they are more fleeting than a drop of water on a lotus flower.

11.6.2008, Manoir de l'étang

I declare that I have written this work independently, using the literature listed, and I agree to its eventual publication in electronic form.
Daniel Hromada, Paris, 19.2.2009

Orientation Board
Introductory note for the FHS
Orientation Board
Entrance Gate
Garden One: The Child
  Garden One, construct one: Linguistics
Garden Two: The Woman
  Garden Two, construct one: Zooanthropology
  Garden Two, construct two: Biopsychology
  Garden Two, construct three: Neurosociology
Garden Three: The Man
  Garden Three, construct one: H(isto|yste)ria
  Garden Three, construct two: Application
Exit
Bibliography
  Web links to the main sources of inspiration
Appendix 1: Illustration of the convergence of a stochastic matrix to the value of its eigenvector
Appendix 2: Perl code iterating the values in Appendix 1
Appendix 3: Questionnaires D2 and D3
Final note for the FHS
(Seiffert, 1987)

Entrance Gate
The essay presented here is the result of roughly twenty months of one young man's effort to tell - "(pove)?dať", tell or give - the world something. What began as a mere anecdote on the theme of "a woman's breasts", dressed up in scientific garb, turned into a game, and the game in turn into something which may be worth taking at least a little seriously. The concept of the female breast and its relation to the concept of the apple thus finally became only a kind of carrying line, an "eternal golden braid" (Hofstadter, 1979) into which the author has tried to weave all the beauty about which he was instructed during his bachelor's studies. All the beauty about which he wants to "file a report" for the age which life on this planet is gradually entering: the age of thinking machines. And since there was much of that beauty - see the bibliography - it was in truth no simple work. The deeper I immersed myself in the work, the older I grew, the more I felt the need, the desire, to submit a work which would withstand the tooth of time somewhat better than that imaginary ribbon fluttering in the hair of the beloved. And so a basic construction had to be created: a construction which would resist, a construction which would endure.

I am not exactly sure to what extent my Christian - albeit "only" Protestant-Lockean - upbringing with all its trinities played its role in the choice of that ideal construction, to what extent the basic hermeneutic rule "1. outline, 2. say it, 3. repeat" played its role, or to what extent the attractiveness of that number itself did; what is certain is that I finally decided to divide the work into three basic parts:

Chapter one, "The Child", is in a certain sense the most scientific part. One may say that it thematizes the female breast chiefly through the prism of the theories of linguistics and developmental psychology.

Chapter two, "The Woman", step by step abandons the hard, empirically grounded cognitive sciences in order to integrate the far more speculative humanities. Although the breast, and nothing but the breast, remains in the foreground of interest, anthropological, sociological, psychological and even historical motifs gradually resound ever more conspicuously.

In chapter three, "The Man", the author finally and definitively resigns from the effort to make his text a "scientific" one. The work falls apart into "fragments", any synthesis seems impossible, and the only salvation lies in dreaming-cum-poetizing ("(bá)?snení") and in myths. The student's wandering gaze thus finally fixes on the analysis of the corpus of the Song of Solomon, so that it is from its side, in the end, that he is confirmed in the belief that the work performed was, after all, something more than an academic waste of time and ink. And so, at the very end, faith in the meaningfulness of the effort to unify the natural sciences and the humanities is restored once again. This happens above all after the author has "discovered" the "phenomenoscope" he had long been seeking - the application "R", which for the "adepts of the pearl/Perl game" of the early 21st century means roughly what the telescope meant for Renaissance astronomers. For the key to building an elegant construction is the use of suitable tools.

At first sight it may thus seem that the circle science-history-myth-POIESIS-art-TECHNE-science will in the end close, thanks to the concubinage of the Song of Songs with graph theory implemented in R, and the reader will end up where he was at the beginning.
Although such a "cyclical" view is certainly not a mistaken one, neither is its exact opposite, the "linear" view, mistaken. The work was indeed conceived so as to "age" with the passage of time: from the babbling of the child in part one, through the vibrations which a woman's charms, by their presence in the adolescent's world, set off in part two, up to a kind of purely masculine "fight for the fight's own sake"[1] which at the very end is crowned not with power, but with understanding and wisdom. It ages "ontogenetically", but it also ages "epistemologically": we begin with research on the infant's neocortex and Jakobson's phonology, and we end with the does of the fields. Precisely for this reason - namely that the structure of the work tries to observe a certain classical canon of ever-rising gradation - I consider it appropriate, as the author of the text, that especially the first two chapters be read in the order in which they are presented: from beginning to end. Enjoy.

Saint Valentine's Day 2009, Paris

[1] And if, at the beginning of the writing of this work, one wanted to fight against anything at all, let it be that life-strangling tragicomedy into which the biblical doctrine turned after the time allotted to it had expired.

Garden One: The Child - his language and his habit
Il Bronzino - The Triumph of Venus - London

"Baby knows all manner of wise words, though few on earth can understand their meaning. It is not for nothing that he never wants to speak. The one thing he wants is to learn mother's words from mother's lips. That is why he looks so innocent. ... If baby only wanted to, he could fly up to heaven this moment. It is not for nothing that he does not leave us. He loves to rest his head on mother's bosom, and cannot ever bear to lose sight of her."
R. Tagore, Baby's Way

Garden One, construct one: Linguistics

"Often the sucking activities of a child are accompanied by a slight nasal murmur, the only phonation which can be produced when the lips are pressed to mother's breast or to the feeding bottle and the mouth is full. Later, this phonatory reaction to nursing is reproduced as an anticipatory signal at the mere sight of food and finally as a manifestation of a desire to eat, or more generally, as an expression of discontent and impatient longing for missing food or absent nurser, and any ungranted wish... Since the mother is, in Gregoire's parlance, la grande dispensatrice, most of the infant's longings are addressed to her, and children, being prompted and instigated by the extant nursery words, gradually turn the nasal interjection into a parental term."
Why "mama" and "papa"? (Jakobson, 1971)

When I asked my esteemed sister which was the first word her son - and my nephew - ever uttered, she answered with the little word "Didi". After the very nature of that word naturally suggested to me "what the little fellow probably meant to say by it", a truly attractive working hypothesis came to my mind:

Ha1: In case the articulation of the two syllables in question (signifiant) was preceded in the little one's mind by a certain communicative intention, this would in fact be a proof that the first object of the external world (referent) whose representation (signifié) the little one forms in his mind, the first sign he scribbles onto his tabula rasa, is neither the father nor the mother, but the generous, milk-filled breast.

It is, however, a scientist's duty to be sceptical, even towards his own sister - especially when it is well known that "d" is a voiced consonant, and that pronouncing voiced consonants requires the ability to control the glottis.
In other words, children usually acquire the ability to pronounce voiced consonants only after they have learned to pronounce unvoiced ones. And it also seems far more probable that the first vowel the tiny one masters will be the open "a" rather than the closed "i". It would therefore not have surprised me if his first word had been "ta-ta"[2]. One could even turn a blind eye to "ti-ti" or "da-da" - but "didi"? Improbable. And so I explained my sister's answer to myself in the following way: "As a mother she knows that the intention of almost every one of his communicative acts is to get to the breast. Coming from a social context where a woman's chest is often designated by the term 'dudy' [roughly 'knockers'], she instinctively interpreted a repetitive babbled sound even remotely resembling that term as the little one's request for lunch. People mostly hear only what they want to hear, and women in whom milk is rising are certainly no exception. It was she, in the first place, who instinctively extended her vocabulary by a new term."

[2] The alveolar stop "t" is the unvoiced correlate of the voiced alveolar stop "d".

It is more or less certain that a great number of words - especially those concerning parental concepts such as "mama" or "papa" - penetrated into ordinary language precisely from children[3]. This assumption was the starting point of the research based on Murdock's World Ethnographic Sample (1957), which mapped 1072 terms by which the meanings "mother" and "father" are designated in the languages of the world. The goal of the research was to map the universal phonological tendencies common to all children of the species Homo sapiens sapiens, starting from an empirical sample of words which had managed to penetrate into ordinary language. The results were more than interesting: 76% of the consonants used in those words are labial (i.e. pronounced by closing or narrowing the lips) or dental (i.e. such that the tip of the tongue touches the teeth or alveoli). And what is even more interesting: in the case of terms designating the concept "mother", almost 55% of the consonants belonged to the class of nasals (e.g. m, n), whereas in the case of the concept "father" it was only 15%. A partial elucidation of this phenomenon is contained in the quotation from (Jakobson, 1971) with which we begin this section.

[3] French, for example, has an ample supply of such words: the terms "kaka" and "pipi" are an indispensable part of everyone's vocabulary. The reader will surely very quickly guess their verbal meaning on his own.

For a breast-centrically oriented scientist, gratifying conclusions may flow from these findings. An extremist might even begin to claim that the little ones' first speech productions which are not cries are a natural extension, or rather inversion, of the sucking reflex. There are many serious reasons for claiming that, even at the level of the phylogenetic development of the human race, the ability to perceive and repetitively articulate syllables preceded the ability to perceive and articulate individual phonemes (Jackendoff, 2002). If a syllable is in fact nothing other than a consonantal closure followed by a vocalic opening, then a repetitive syllable is nothing other than closure, opening, closure, opening and so on. And if we then add the finding that this closure, according to the data just cited, "by coincidence" takes place precisely in the regions (lips, alveoli) where the mouth interacts with the nipple and areola during nursing, we may very well join the cohort of extremist breast-centrists. The only thing holding us back is the realization that, despite the similar use of the individual organs of the oral cavity, the two processes are of opposite character: whereas in sucking something - milk - comes from the world into the body, in pronouncing it is the other way round: the lungs push air out into the world.

We perceive ever more intensely that the mother-child relation is not a one-way process, but a ceaseless reciprocal interaction.
In order for the child to be given the possibility of adapting to the world, the mother must first adapt to the child. A good mother approaches the little one with her mind - with her words and life habits - as well as with her very body. For nine months the mother was the world for the child, the whole world. Her air was his air, her food his food. When she spoke, her voice set vibrating her skin, her abdominal muscles, the placenta, the water in which the foetus swam: the child quite literally danced with a beautifully resonating, frequency-distorted voice. Her song was his song. Then came the passage through the tunnel; the continuous whole suddenly fell apart into a multitude of parts, painful parts. Sharp light, burning cold, terrifying noise and tormenting hunger. A state we can probably no longer even imagine; perhaps only severe shock situations can approach it.

Speaking of shock situations, it is worth recalling the old mountaineering "rule of the five Ts" about what must be provided to an organism in critical situations so that it can go on functioning at all. The five Ts which Life must be given at a critical moment are (in the Slovak original all begin with T): soothing of pain, liquids, silence, warmth, transport. Let us consider these five Ts in relation to the newborn human being and to the central concept of this work.

Liquids: the body of a nursing woman produces approximately 750 ml of milk per 24 hours, while the little one does not consume more than 180 ml of this life-giving fluid at a single feeding. Yes, life-giving: its composition[4] and effects[5] are in truth not far from the fairy-tale water of life. The mother adapts to her babe at the level of liquids as well: in quantity as in quality, her body produces exactly what the little one needs at the given moment.

[4] On average 87.5% water, 7% sugars, 4% fats, 1% proteins, 0.5% micronutrients (Jenness, 1979), and some endocannabinoids (Fride, 2005).
[5] Reduced risk of allergies, respiratory diseases, diabetes, obesity and diarrhoea; strengthening of the immune system; better support for the development of the central and peripheral nervous system, etc.

Warmth: whereas the milk has a temperature of 34 degrees, ideal for a liquid, the human body has 2.6 degrees more still. It defeats any cold yet never scalds. Of the life-giving and consoling effects of a second warmth-radiating body there is little need to speak.

Silence: the period we are describing here is a period still before the "installation of the I", even before the "comprehension" that it was my mouth that produced the sound my ears hear. The reciprocal interconnection of the perceptual and articulatory circuits has not yet occurred; it is not yet known that one and the same mechanism is responsible for both perceiving and pronouncing. Briefly and clearly: the child screams into its own ears. And the more it screams into its ears, the more reason it has to scream. The solution? The breast. The world's din will not cease merely because the mother has given the child the breast, but the most intense and loudest source of noise in the child's world - the child itself - is thereby quieted.

Soothing of pain: the claim that the newborn's world is full of pain is very hard to verify experimentally; pain is a subjective state, and we can have access to subjective states only in a mediated way:
- either through the interpretation of the signals the little ones send to us,
- or through analogy with our own subjective states which we experience under similar circumstances.
As to the first possibility, hardly anyone considers an infant's whimpering an expression of joy at being born into this world; far more common is the empathic view "the poor thing is suffering". As to the second, phenomenological alternative of approaching the little one's inner states, we may start from the assumption that in early infancy a multitude of new neural circuits are being brought on line. From our own recent - and by now consciously felt and memorized - experiences with the engagement of new neural circuits (e.g.
the effort to consciously control one's toes, the acquisition of a new deep-seated habit, etc.), one may induce that the little one's world must be full of energetically immensely surprising and intense experiences which may often border on pain. What is new often hurts. And in the little one's world almost everything is new, even his own body, suddenly no longer immersed in the amniotic fluid. We may therefore assume that what is familiar, and after a short time downright intimately familiar - that generous maternal source of silence, warmth and liquids - soothes pain[6].

[6] "Pleasure is a movement, a movement by which the soul as a whole is consciously brought into its normal state of being; and Pain is the opposite. If this is what pleasure is, it is clear that the pleasant is what tends to produce this condition, while that which tends to destroy it, or to cause the soul to be brought into the opposite state, is painful." (Aristotle, Rhetoric, Book 1, Chapter 11)

The last T is, in mountaineering jargon, "Transport". I could of course conjure up images here of how laying the little one's head against the radiant, life-giving chest is nothing other than a transport into the "realm of oblivion and forgiveness"; instead, I take the liberty of introducing a fifth, purely neonatal-infantile T:

Heartbeat ("Tlkot"): in approximately 80% of images of the Madonna with Child, the mother holds the infant Jesus against the left side of her body. Research on American mothers has shown that this phenomenon is more or less independent of whether the mother is right- or left-handed: 78% of left-handed and 83% of right-handed women hold the child on the left side. It seems that mothers do so because the child is calmer on the left side. A single reasonable answer presents itself as to why: the child, having laid its little head on its mother's chest, hears the beating of her heart - the sound which surrounded it permanently for nine months while its soul was entering this world. Subsequent research, during which a recorded heartbeat was played to a control group of infants, confirmed the hypothesis: those infants indeed calmed down, or fell asleep, sooner than those to whom the sounds were not played (Morris, 1967).

The purpose of this little excursion into the realm of the five Ts was to direct the reader's attention to the fact that the breast-infant interaction takes place through almost all of the little one's available senses. Heartbeat and silence are closely connected with the sense of hearing; in the case of warmth, and perhaps also of the soothing of pain, the sense of touch plays its role; the liquids in turn stimulate the sense of taste. There is room for argument concerning the sense of smell: although in humans this sense does not play as substantial a role as in other mammals, we dare to claim that if this sense plays a substantial role anywhere in human life at all[7], it is precisely in the strengthening of the parent-child bond.

[7] We leave aside pheromonal interaction in the search for a complementary life partner.

We have deliberately not yet mentioned the sense which is, for humans, the key one: sight. Probably for you too, dear reader, the word "breast" first summons an eidetic image, and only afterwards, with a bit of luck, do memories of tactile sensations also activate in the mind. No wonder: the breast impinges on today's adult mainly in the form of images; as we shall show later on the example of the apple, this visuo-centric approach to the world can lead to interesting consequences. But with the infant it is otherwise.
We claim that the breast is for him, especially in the first moments of his sojourn, something far more than a rounded phenomenon in the visual field: it is for him almost the whole external world. From this follows the state we are trying to defend with hypothesis Ha1 - in short, that the first image the little one forms in his mind, and thus stores in memory, is the breast. Although in the further chapters of this work we shall proceed as if this were indeed truly so, it certainly does no harm to remind the reader once more that it is only a working hypothesis. We unfortunately lack the resources to prove it empirically; we shall therefore try to defend it at least theoretically.

Immediately after the formulation of hypothesis Ha1, a counter-hypothesis was put to us: "what is most essential for the little one is the face of the Other". In order to position ourselves properly against this hypothesis, so popular in certain philosophical circles, we shall take as our aid two quantities: the frequency and the intensity of a percept.

Frequency: in quantitative linguistics, the frequency of a word is understood as the number of occurrences of that word in a given textual corpus. We take the liberty of declaring every soul a "reader" and the world as such a corpus. By the frequency of the referent-object Ň we shall thus mean the number of individual "emergings" of the object in the visual/auditory/olfactory/mental/etc. field of the subject.

Intensity: whereas quantifying frequency is no problem, quantifying the intensity of a percept - i.e. answering the question "how strongly did the percept of the presence of object Ň act upon the subject? how deeply did it engrave itself into the individual's cognitive structures?" - is problematic. Yet since it is our intention to do science, and the procedure of science consists precisely in the quantification of qualities, we shall have to cope with this problem somehow. By way of introduction let us say only that the intensity of a percept is proportional not only to the length of time the subject was exposed to it, but above all to the number of circuits which are also activated at the moment of the percept. An experience in which not only visual but also olfactory or tactile inputs play their role will thus be understood as an experience of greater intensity than a purely visual one. In the following chapter we shall try to show that even a memory record itself can be viewed as a "circuit". The rule will also hold that the intensity of a percept decreases upon its repetition - a process which developmental psychology calls "habituation".

We claim that in the period of early infancy, when the randomly initialized weights of the little babe's synaptic connections gradually self-organize into a first "order", the breast enters the little one's consciousness far more often than the "face". The little one sees and feels the breast far more often[8] than the face: f_Ň > f_T. And when he does see a face, his mother's face, it is highly probable that at the same moment (and it is precisely simultaneity that is crucial for the formation of associative networks) he also sees and feels the breast.

[8] We speak here of the human being in his natural conditions, i.e. of a human who is breastfed, and not of one whose kin succumbed to the influence of the dubious theories about the unsuitability of breastfeeding, so fashionable in the second half of the 20th century.

As for intensity, we take the liberty of claiming that the intensity with which the face of the Other inscribes itself into the little one's neural structures is smaller than the intensity with which the tits of the First inscribe themselves. A face will not warm the little one; from a face the little one will not drink - unless we wish to resort to the poetic "eyes too can nourish".
And whoever did wish to resort to poeticity might perhaps even see a witty analogy between the eye and the breast: namely, that on the human body there occur only two pairs of nested concentric circles with a common centre S - one is the pair iris:cornea, and the other the pair areola:nipple. In the light of such geometric findings, enamoured gazes into each other's eyes immediately acquire a new-yet-long-forgotten meaning...

Defenders of the hypothesis that "the face came first" could, if need be, also help themselves to the claim that the ability to recognize faces is a cognitive specialization of the human species, encoded even at the level of DNA. More than one study (Nelson, 2001) suggests that the child begins to direct attention to the face at a very tender age already. Whether it does so because it would most gladly suck milk even from its mother's eyes, or because it has sketched somewhere in its genes the schema of some "Face Recognition Module" (FRM), is in the final account of little importance for our debate. For we think that if Lady Evolution had enough reasons to endow us at birth with some FRM, she had even more reasons to equip us with an ŇRM as well.

In conclusion of this section, which we have tried to devote to the relation of the breast and language, we would like to direct the reader's attention to the relation of the breast to those parts of linguistics which have been most thematized during the past years, i.e. syntax and grammar. In contrast to the "generativist" approach so fashionable in the second half of the 20th century, we take a much more "down-to-earth" stance; we might call it a "frequency-oriented", "neo-structuralist" or even "behaviourist" stance. We do not wish in any way to diminish the significance of the role played by the syntactic structures of speech in programming the human mind. But the explanation of the fact that a human is capable of taking over from the world the grammatical or phonological structures of any language whatsoever does not, for us - unlike for the generativists - lie in the little one's disposing of an innate, highly specialized neural module, a Language Acquisition Device, whose parameters he adjusts during interaction with the world, thus arriving at the grammar of his native speech; it lies rather in the fact that grammatical and phonological structures - whose concrete bearers are the concrete sentences of a language - are, in the crucial period of the formation of the child's mind, the structures with the highest frequency of occurrence.

We claim that a rough explanation of the miracle of language acquisition is furnished by the combination of the following factors:
- the child's natural tendency to repetitively produce a great quantity of sounds
- the child's natural tendency to imitate
- the natural tendency of neural networks to generalize

We have already dealt, in relation to the breast, with the finding that the human young is, compared with other animal species, a considerably noisy creature. We now direct attention to the capacity for imitation, for it is precisely this capacity that is the key leading us out of the animal kingdom. The capacity for imitation, which man probably acquired above all thanks to the so-called "mirror neurons" (Théoret, 2002), leads to the emergence of a new kind of replicating structures: suddenly it is no longer only "genes" that replicate from cell to cell, but also "memes" from brain to brain. A meme is that which is imitated. A brain without "mirror neurons" is a device which generalizes. A brain with "mirror neurons" is a device which, on top of that, also imitates. What is imitated best - i.e. with the smallest probability of error - is that which occurs most frequently in our perception. In other words, when copying a meme from one brain to another, the best antidote to informational noise (and the infant is in a state in which practically everything is noise to it) is a high frequency of occurrence.
The structures with the highest frequency of occurrence in the infant's visual and auditory field are neither dance steps, nor mathematical formulae, nor cellist's tricks. It is speech: the speech with which the mother addresses the child, the speech in which she sings him lullabies. If, instead of lullabies, the mother danced salsa, perhaps generativists would today be searching the DNA for a "Salsa Acquisition Device". But if she danced, if she played the violin, if she wrote formulae, she could not nurse at the same time.[9]

Not only are the sentences of a language the memes with the highest frequency of occurrence in the young infant's world; they are also - in the case of a child who is breastfed - structures associated with percepts of high intensity. We shall claim: a mental representation associated with a percept of high intensity itself takes over something of that intensity. What we want to say is that we consider the source of language, the true universal grammar, to be not some innate, mighty, heaven-sent Cartesian "I", but the earthly and all-too-corporeal Thou (Buber, 1923). The sentences the mother emits while interacting with the child create a massive imprint in his memory, and hence in his mind[10]. The generalizing mechanism of neural networks takes care of the rest: from the imprints of individual sentences it arrives at what is common to them all, at their vanishing point, which is nothing other than a grammatical rule - a form covertly present in the instances of all the sentences heard. In this implicit manner grammatical forms get copied. It is possible that the internalized grammatical forms later - after the creation of a unifying vanishing point with the sensorimotor schemata (Piaget, 1961) - play their role also in the construction of still more abstract forms: logical forms, alias the "principles of thinking". Who knows, perhaps along such a path, i.e. following the line mother-grammar-logic-thinking, one could also elucidate the results of the New Zealand study which suggested that a prolonged period of breastfeeding has a beneficial influence on raising IQ and on the ability to read and count (Horwood & Ferguson, 1998). We, however, see the cause not in the chemical composition of the milk beneficially acting on brain growth, but rather in the fact that a mother who interacts with her child in every possible way sends towards the child a great quantity of basal grammatical structures which are imprinted into the mind with high intensity. Thanks to the heightened intensity of the percepts, the neurolinguistic networks are programmed faster, and the child thus gains a head start over his bottle-reared colleagues.

...

"Well then, let me show you something," my sister added at the sight of my frowning sceptic's gaze. She took the little one, sat him on her lap, saying "Didi". The little fellow pricked up his attention. He pursed his lips, then abruptly turned his head towards that part of my sister's body which is located between the neck and the belly. He seized her shirt in a manner no professional lover with years of practice would be ashamed of, and with the verve of a small beast of prey threw himself at what he loves most. Had that creature only known then that in that same peace-suffused moment there was penetrating into his body not only the life-giving milk but also the foundations of what we all too proudly call "thinking" - the foundations of a system which may one day bring him to the point where contents like "sin", "guilt", "evil" and others, contents which in fact do not exist, will rise before his inner eye - God knows, perhaps he would have thought the whole thing over...

[9] "To communicate effectively, it is not enough to pronounce the words of the language; one must do so at the right moment. One of the first lessons mothers seem to transmit to babies is turn-taking. One need not know how to speak in order to take (and await) one's turn.
During the babbling period, mothers thus alternate the periods in which they speak and those in which they listen. It even seems that the beginnings of this alternation, the first form of dialogue, can be observed in breastfeeding: the baby stops sucking, its mother shakes it lightly, it resumes. No alimentary or respiratory need justifies the pause. No particular physiological necessity justifies the shakes. It is already a dialogue... a tonic one." (Lécuyer, 1996; translated from the French)

[10] Whereas in an analytical approach the distinction between memory and mind may be useful, we consider it superfluous, indeed downright undesirable. In the spirit of the claim that "the mind and its content are functionally identical" (Wilson, 1983), we see no single reason why we should draw a line between mind and memory, knowing well that the mind can be passive and memory active.

Garden Two: The Woman - her milk and her apple
La femme aux pommes[11], Jean Terzieff, Les jardins du Luxembourg, Paris
Les fruits[12], Antoine Bourdelle, Musée Bourdelle, Paris

"There are on earth many good inventions, some useful, some pleasant: for their sake is the earth to be loved. And many a thing is so well invented that it is like woman's bosom: useful at the same time and pleasant." - Third Part: On Old and New Tablets (Nietzsche, 1883; translated from the German)

[11] Close-up photo taken from http://www.parisdailyphoto.com/2006/07/steve-jobs-muse.html
[12] http://parisconnected.wordpress.com/2008/06/25/musee-bourdelle-a-quiet-journey-back-to-old-paris-montparnasse/

Garden Two, construct one: Zooanthropology

"and then he asked me would I yes to say yes my mountain flower and first I put my arms around him yes and drew him down to me so he could feel my breasts all perfume yes and his heart was going like mad and yes I said yes I will Yes"
Molly Bloom's recollections in James Joyce's Ulysses

To speak of the breasts of the human female only in the context of their milk-giving function, however, would be to tell only half the truth. As the zoologist Desmond Morris emphasizes in his neo-renaissance work The Naked Ape, if the breasts were to serve only as a medium for the nursing process, Mother Nature would have done much better not to round out the woman's curves at all. For the necessary condition for triggering the sucking reflex is that the nipple touch the palate of the infant's oral cavity, which serves as the reflex's switch. The whole procedure would proceed with far greater ease if the breasts had the flatter and more pendulous form of the breasts of our simian relatives. Why, then, did Madame Evolution decide to endow our beloved ones with those joyfully rounded globes? For Morris there exists a single answer: because it is an immensely powerful instrument of sexual signalling. Starting from the assumption that the male had already been pre-programmed to be fascinated by the female's hindquarters, Nature decided to copy the to-the-male-attractive backside onto the chest as well[13]. That this is perhaps not an invention of zoologists is attested by other examples from the animal kingdom; the most illustrative is probably the case of dominant mandrill males, whose noses display a blue-red colouring similar to that of the regions near the genitals of their females.

But why would Nature do this? Morris answers with a hypothesis: in order to strengthen the bond within the human pair. His justification takes the following deductive path: raising the young of Homo sapiens sapiens requires, more than in the case of other animal species, cohabitation in a pair. One of the mechanisms for cementing the pair bond is face-to-face contact during intercourse.
The consequence of the "relocation" of the backside to the front is thus that the human male does not lose sight of one of the central stimuli of his sexual activity even during face-to-face copulation. Although much can be objected to such a justification - e.g. that in many cultures face-to-face copulation does not occur, or that such ventro-ventral copulation is practised also by bonobos, orangutans and even gorillas, whose females do not have rounded breasts - it is beyond doubt that breasts are intimately connected with sexuality. The burgeoning of a young maiden's breasts is perhaps the most conspicuous secondary sexual characteristic signalling her maturity. Between the richly innervated clitoris and the richly innervated nipples there exists an intensive informational exchange - with an intermediate station in the brain, of course. The swelling of the nipples goes often, all too often, hand in hand with arousal in the nether regions. Nor are those women an exception for whom stimulation of their "twin gazelles" - as Solomon would say - "suffices" to reach orgasm. That breasts play a key role in many rituals of the sexually oriented traditions of the world is probably no accident either. And something may also be suggested by the claim that if, in an arbitrarily chosen human culture, anything besides the genitals is taboo, it will most probably be the nipples.

[13] During a short conversation on this topic, Prof. Sokol directed my attention to a children's rhyme which I take the liberty of quoting here: "Měla babka, čtyři jabka, a dědoušek jen dvě. Dej mi babko, jedno jabko, budeme mít stejně." ["Granny had four little apples, and grandpa only two. Give me, granny, one little apple, and we shall have the same."] Since we find ourselves in a footnote, we take the liberty of expressing our wish that granny give grandpa no little apple at all, and nicely keep two little apples behind and two in front.

We cite these examples as counter-arguments to those voices, heard especially from the feminist camps, which would like to put the breast on an equal footing with the other parts of the human body, emphasizing only its nursing function. Although we consider their effort, in view of its possible consequences, more than endearing, we cannot assent to their argumentation. For we consider the fruit with which our beloved ones feed us not merely a bowl filled with milk, but first and foremost a manifestation of the wisdom of the power of Life. Little, in the final account, do size, shape or colour matter to us. What captivates us is the realization that the same object which plays so substantial a nourishing role after the new little creature's arrival into the world also plays a key role at, and just before, its very conception. We cannot help ourselves: we understand your breasts as one of the fundamental accidents constituting the essence of a daughter of elf or man, and you yourselves confirm us in this understanding when you guide our lips to the blossoms of your chests.

It has been said that man is an animal possessing the word, zoon logon echon. And this was believed above all at the time when the right to self-determination was being taken from those who used words unknown to us. It has been said that man is a thing possessing mind, morality, god. And this was believed above all at the time when cultures and animal species possessing minds, moralities and gods which "we" did not understand were being liquidated. We are told that man is that "which has an interest in its own being", that he is the symbolic manipulator par excellence, that he is an upright two-legged domesticated primate, a social being, a naked ape, a being that gives thanks... We reject none of these shards of knowledge: each has its weight, the truth in each is evident. The problem is that we see no end to their enumeration. The problem is that in none of these answers do we find an answer to the question "how do we get out of this mess into which the constructs of our minds have cast this entire planet?"
Hence we offer an answer, a shard, of our own: we wish to set man into the context where he belongs. With one foot in the realm of cold, dead... yet eternal ideas. With the other foot in the realm of warm, living... yet oh-so-fleeting earthly animal bodies. Between the two realms we naively build a bridge – a bridge in the shape of the charms of Adriana Sklenaříková. Why? Because while we have already met myriads of immaterial « only » gods who have so far always divided mankind as a whole, we have nonetheless seen, in every single human being, above all a being that loves the breast.

Second garden, second construct: Biopsychology

on Earth she told him: hold me dear, yet before you go down there to play, do not forget to squeeze these bagpipes here. — a paraphrase of a certain children's rhyme

The key question of this text reads: to what extent do the representations formed at a tender age influence the behaviour of the adult human being? We shall attempt to answer this question by building a new, mathematically formalizable model. While until now we have at least more or less kept to the facts, in this part we allow ourselves to depart from them, joyfully building our « little private theory of the universe, of life, and of everything ». Put differently – we shall speculate. Our speculations begin with the claim that the mind of the newly conceived is a tabula rasa. We are of course aware that the genome has equipped the little one with a certain body possessing certain inputs, certain outputs, even certain basal reflex circuits, indeed modules mapping inputs onto outputs and vice versa. Nevertheless we think that the organ which shall in future become the « central processing unit » of the primate Homo sapiens sapiens – the brain, and especially its cortex – starts out with more or less randomly initialized weights of its synaptic connections. Whoever is not too versed in neurological terminology may content themselves with the statement about the « tabula rasa », which is almost14 equivalent to the statement about the synapses. Our speculations continue with the claim that just as the neural network in the brain is Hebbian, so is the semantic network in the mind. Let us clarify the terms: a neural network is a network of nerve cells – neurons – mutually interconnected by synaptic links, each link being characterizable by a certain quantity we call its weight. A Hebbian neural network is one for which it holds that whenever two neurons are activated at once, their synaptic link is strengthened – the value of the synapse's weight rises. Unlike the neural network, which is of an essentially material nature, the semantic network is of a, so to speak, « mental » nature. An approximate translation of the term semantic network could read: a network of meanings. That sounds nice, but what the author means by it may not be obvious at first sight, especially once we realize that in the search for what Jackendoff calls the holy grail of the sciences of mind and language – the search for an answer to the question « what is the meaning of a word and how can it be quantified? » – absolutely everyone has so far broken their teeth. First, one must realize that a word not embedded in a semantic network has no meaning. Second, one must realize that a word without meaning does not exist, for at the moment we uttered it we had already, by the very act of utterance, placed it into a certain context – and hence into a semantic network. Finally, one must realize that the meaning of a word is nothing other than the set of relations the word has to other words, and that the whole semantic network can be described by a matrix. Saussure, and after him Bourdieu, spoke of « distance ». « Apple » is not « tree », is not « pear », is not « a source of problems ». It is something close to all of these, and yet something else.
We, however, are Platonists, and therefore we speak of « the participation of idea in idea » – « apple » is a little bit a tree, a little bit a fruit, a little bit a source of certain problems...

14 we shall often use terms like « almost », « a little », « approximately » – not because we wish, fox-like, to dodge an uncompromising falsification of our hypotheses, but rather because, for the science we are trying to establish here, we consider many-valued « fuzzy logic » a far more useful organon than classical Aristotelian logic

The key to the whole turn is not so much the change of the negative « is not » into a positive « is ». The key is the use of the term « a little bit ». When we therefore claim that the semantic network is Hebbian, we mean that whenever two meanings are activated at once, or within a short time interval – e.g. as two words in one sentence, or as two distinct objects of the external world, or even as states of the internal world – the strength of the bond, the semantic weight between them, is reinforced. In essence we want to say what Skinner wanted to say with his behaviourism and Pavlov with his conditioning, except that we intend to dress the whole in the garment of matrix calculus. And yes, we intend to go further – towards structures far subtler yet mightier, and often more insidious, than mere reflexes.

And how does all this relate to the breast? Hypothetically, thus15: Imagine that the already-cited Aristotle was more or less right in what he said about bliss16 and pain – that bliss is the state the soul feels upon returning to its natural state, and pain is its opposite. In accordance with that definition we claim that the soul – the child, cast after birth from the peace of the womb into an entirely new world – feels almost incessant pain. As for bliss, analysis through the prism of the 5T leads us to the conviction that if there is a moment when the little creature is closest to its primordial state, it is the moment when its little head is pressed against its mum's livingly beating chest. Briefly put – at the moment when the circuits responsible for states of bliss are activated in the newborn's brain, the breast is present in the perceptual field of all five of its senses. If, then, the just-presented thesis about the Hebbian nature of semantic networks holds, the weight between bliss and what we vaguely call « the presence of Ň » will be raised by a certain value, let us say by 1.

    synchronous matrix   Bliss   Pain
    Ň present              2      0
    Ň absent               0      1

Matrix 1: The primordial matrix representing the little one's mind after mum with her breast « came twice and left once » into his perceptual field.
Similarly, when the mother leaves for the first time, the weight between pain and those circuits which represent what we vaguely call « the absence of Ň » will be raised by 1. If mummy then returns after a while, the weight in the first column of the first row rises again, from 1 to 2. We can picture the whole thing as a primitive 2×2 matrix in which the rows represent the first activated neural circuit N1, the columns the second circuit N2, and the individual entries the number of synchronous activations of N1 and N2. Such a matrix we call a synchronous-associative matrix.

15 A new model is always best presented on the simplest possible examples. We have therefore simplified the initial conditions to the Manichean schematic duality bliss/pain, whose conception causes the author the least difficulty, and it may be presumed the same will hold for the reader. One should, however, bear in mind that from the very start we are dealing not with a 2×2 matrix but with an immensely vast one in which, beside the central modules bliss/pain, several dozen others occur (at the same time, the initial setting of the remaining regions of the matrix is so random that we may put plain zeros into all the cells of our very first schema, and our claim that the child is a tabula rasa thus remains valid). These modules, pre-set by prenatal development, would perhaps be designated by the Eastern tradition with the term samskara, and their concrete manifestations with the term vrtti.

16 We translate the word « pleasure » as « blaženosť » (bliss) and not as « rozkoš » (delight) mainly because of the phonological similarity [bl*ž] – [pl*ž]. For there exist words whose phonological aspect is at least as determining as their semantic one.

The situation we have just presented is specific in that it associates circuits already pre-prepared by the genome (bliss/pain) with circuits representing objects of the surrounding environment (Ň). The moment such an association is created we call, in accordance with tradition, the moment of imprinting. Since the representations associated by imprinting are wired to the deepest neuroendocrine mechanisms of our animal nature, these representations themselves will play a key role in the organism's future behaviour. In later parts of this text we shall try to show how. First, however, for light relief, let us imagine a culture in which, for one reason or another17, mothers do not nurse their children, and in which, at later moments of development too, the role of mother-provider is taken over by a peculiar impersonal entity called « l'État ». In such a case a certain « warping » of the natural mechanisms occurs, a certain « shift » of associations from « mother » to « l'État » (the weights will be raised not in the matrix column labelled « mother » but in the column labelled « l'État »). And since the « bagpipes of the state » neither beat with the rhythm of Life nor are warm or fragrant – and all that the state, in its provider role, can so far manage is, fortunately, the spewing-out of food-substituting precious little papers and coins into the world – the « genetically wired-in » longing for an object of the world which would provide all five T will never be fully satisfied. When this happens to an individual, the consequence will probably be a gently frustrated, eternally bawling brat. When it happens at the level of a whole society, the consequence may be a society which, instead of a child's cry, declares a strike every week.

17 in the France of the second half of the 20th century the reasons were above all reasons of fashion, and hence memetic ones

Back, however, from dubious biopsychosociological hypotheses to our even more dubious matrices.
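For the programmatically inclined reader, the Hebbian update rule just described is mechanical enough to be stated in a few lines. What follows is a minimal sketch of ours, assuming a small fixed inventory of representations and a unit increment per co-activation; the names (REPRESENTATIONS, perceive) are purely illustrative assumptions, not part of any established library.

    import numpy as np

    REPRESENTATIONS = ["N_present", "N_absent", "bliss", "pain", "face", "silence"]
    IDX = {name: i for i, name in enumerate(REPRESENTATIONS)}

    # The tabula rasa: all associative weights start at zero.
    sa_matrix = np.zeros((len(REPRESENTATIONS), len(REPRESENTATIONS)))

    def perceive(moment):
        """Hebbian update: every pair of representations active within one
        'gestalt moment' has its mutual associative weight raised by 1.
        The diagonal thereby counts how often each representation occurred."""
        for a in moment:
            for b in moment:
                sa_matrix[IDX[a], IDX[b]] += 1

    # Mum with the breast « came twice and left once »:
    perceive(["N_present", "bliss"])
    perceive(["N_absent", "pain"])
    perceive(["N_present", "bliss"])

    print(sa_matrix[IDX["N_present"], IDX["bliss"]])   # -> 2.0, as in Matrix 1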
Let us briefly present one more example of a primitive synchronous-associative matrix. Imagine that one fine day, let us say during the 23rd nursing, daddy plays Bach's counterpoint out loud in the room where the little one is being nursed. One of the 5 T – Silence – which we had fixed as a constituent component of the bliss circuit will thus not necessarily be activated. So that the matrix may continue to represent the surrounding world18 as faithfully as possible, the newly arisen state will have to manifest itself somewhere in the matrix: not only will the weight at position 1,1 be raised not by one but by a mere four fifths, it will also be necessary to divide Silence out of bliss, to refine the matrix. Silence thus acquires, as it were, its own row and its own column in the matrix – or occupies a row/column not yet taken, which in the end amounts to the same. Somewhere in the brain, in moments of Silence, a certain specific neural pathway begins to be activated.

    synchronous matrix   Bliss   Pain   Silence
    Ň present            22.8      0      22
    Ň absent                0     23       7
    Face                   12     30      11

Matrix 2: The state after 23 nursings, during one of which daddy allowed himself to listen to Bach. The value 7 at position 2,3 means that the child also experienced silence 7 times outside nursing, e.g. whenever it was not itself crying.

How this « dividing-out », this analytic decomposition of a whole into parts19, concretely happens, and how to represent it in our model, is a technical question to which we may devote ourselves in other, more specialized works (Hromada, 2012). Here we choose the simplest solution: « upon division, the new column inherits the properties of the column in which it was previously synthetically contained, and thereafter each continues on its own ».

18 When we speak of « mirroring the world », we consider it crucial to stress, even if only thus, below the line, that the brain is de facto a 3-dimensional structure in 3-dimensional space which carries within itself information about the transformations of a 3-dimensional world in time – i.e. 4-dimensional information. For the brain to carry such information, a mapping 4D -> 3D must take place somewhere, somehow. The matrix calculus into which we are trying to introduce the reader in this work strikes us as the most effective instrument for modelling this « reduction of dimensions with the least possible loss of essential information ».

19 « The world divides into facts » (Wittgenstein, 1917)

To please those who attach essential weight to what they call « the encounter with the face »20, we have also added the row face to matrix 2. A quicker reader may find the question pressing on their tongue: why was « face » added as a row, and « silence » as a column? Answer: for simplicity. Since we are dealing with a synchronous-associative matrix representing only how many times two neural representations were activated at once – or with a time gap so small that, in the domain of the little one's consciousness, they merged into one gestalt whole – it is pointless to ask what was cause and what effect, what came first and what second. Each entry represents a certain « pathway » in the body-brain-mind of a human being21. Every entry that is a row is also a column. The matrix is diagonally symmetric – its graph is not oriented. It is also worth noting that the position lying on the diagonal always carries the highest value of the given column-row – it gives the number of occurrences of the given entity in the little one's perceptual or mental field, in other words, its frequency of occurrence.22

              Ň-p.   Ň-n.   Bliss   Pain   face   silence
    Ň-p.        23    4.2     21      2     17      22
    Ň-n.       4.2     23      2     21     25       7
    Bliss       21      2     23      0     12      22
    Pain         2     21      0     23     30       1
    face        17     25     12     30     42      11
    silence     22      7     22      1     11      23
    sum       66.2   59.2     57     54     95      63

Matrix 3: The value 2 at positions (1,4) and (4,1), i.e. the association Ň-p.–pain, could have been caused e.g. by an illness which hurt the child even during nursing. The zero at position bliss–pain indicates that the two neural pathways are mutually logically exclusive – either one is activated, or the other. Conversely, the small non-zero value at position Ň-p.–Ň-n. suggests that it may have happened a few times that the child, say, saw the breast without touching it; the relation of those two pathways is thus more « fuzzy ». The value 21 of the association pain–Ň-n. against the value 2 of the association bliss–Ň-n. suggests that it has even happened a few times already that the little one felt no pain in the absence of the breast – he is growing up. And the association face–Ň-n.: those are all the uncles, grannies and neighbours perpetually making « kitchy-koo » at the little one.

20 For them, the face may, in the context of this article, become a sixth T whose presence the human being needs for life.

21 The entries in our matrices thus do not refer to the objects of the external world – which in semiotics we call « referents » – but to their inner mental representations, i.e. to what is called « signifié ». By the column or row labelled with the working label « face » we are therefore not trying to describe the properties of this or that object « face » in the material world – for no such thing exists anyway, just like the « inside of a brick » often mentioned by doc. Murgaš – but a certain set of « pathways » in the mind of this or that subject, activated by certain, mainly visual, percepts.

22 While constructing matrix 3 I noticed that if the matrix contains two mutually logically exclusive rows A and B, the value on the diagonal will converge to the sum of the values of A and B, which we may formalize as Xii = XAi + XBi. In the future – when, looking at neuro-maps, we shall ask what contents they represent – this might serve us as a primitive heuristic rule for discovering entities standing in logical (in the Aristotelian sense) relations.

The last row we have inserted into matrix 3 is the total sum of the values in the individual columns. It is the basic quantity from which we shall in a moment derive the quantity we shall call the magnitude (mohutnosť) of representation X. This sum – let us call it, say,
the « semaxonal sum » – tells us in fact nothing other than how many times the neural pathway X coding something – a sensorimotor schema, a certain auditory or visual percept, a memory, etc. – was activated while any other pathway Y was activated. To illustrate the whole matter we may help ourselves with an analogy from the world of internet pages, which, we hope, our reader knows at least passably. One of the keys to the success of the web is hypertext – the ability of pages to link to one another. Let us imagine that every entry in our matrix, every something X, is a hypertextual entity to which a certain number of links lead from other hypertextual entities Y, Z, etc. Then the sum of 95 for the entry « face » and of 66.2 for the entry « breast present » gives nothing other than the « number of links » leading to the hypertextual entity in question. Do such quantities tell us anything that might matter for understanding man and his relation to the breast? We claim they do, yet not much. For if only this primitive sum entered the game, it could easily happen that in the mind of a certain person two pathways X and Y would arise which linked to one another a million times, while almost no entity Z linked to this bipolar whole. We dare to claim that in such a case, despite the high value of that sum of associations, the pathways X and Y would have only a minuscule significance for the person's mind as a whole. For grasping what we mean by the already-mentioned quantity « magnitude of representation X », the answer to the question « how many hypertextual entities link to entity X? » is not enough; what is needed is first of all the answer to the question « what kind of entities link to entity X? », which we may reformulate as « what is the magnitude of the entities Y, Z, etc. which link to entity X? ». In other words, we shall try to express the magnitude of the signifié X as the normalized sum of the magnitudes of all entities which link to it23, that is:

    M_x = \frac{\sum_{i=0}^{N} M_i \, v_{ix}}{N}   (24)

where N is the total number of rows or columns and v_{ix} is the value, the weight/strength of the association, at position i,x.

23 At this point our approach is considerably inspired by that of Larry Page and Sergey Brin who, when confronted a few years ago at Stanford University with the question « how, from a matrix whose entry at position X,Y expresses the number of links leading from web page X to web page Y, does one obtain information about the popularity of page X? », answered with a similar formula. Thanks to that formula, to the use of matrix algebra and to a couple of excellent hacks, they did in the end truly arrive at that quantity « popularity of a page », and they named it PageRank. The quantity presented here as « magnitude of a representation » is, mutatis mutandis, analogous to PageRank. More in (Page & Brin).

24 It is almost certain that I have an error somewhere in that little formula, but for the first use of a program for typesetting mathematical formulae (OpenOffice Math) in my life it is not that bad, is it?
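Since the definition is recursive – the magnitudes M_i on the right-hand side are themselves unknown – the natural way to evaluate it is by iteration from some initial guess, which is also the spirit of the appendices mentioned below. The following is a minimal sketch of ours, not the actual algorithm of Appendix 2; the uniform starting vector, the renormalization step and the iteration count are all assumptions.

    import numpy as np

    def magnitudes(v, iterations=100):
        """Iterate M_x = (sum_i M_i * v_ix) / N over the weight matrix v,
        starting from uniform magnitudes. Renormalizing after each step
        keeps the values comparable; whether and where they converge
        depends on how v is normalized (cf. the fixed-point remark below)."""
        n = v.shape[0]
        m = np.ones(n)                 # initial guess: all representations equal
        for _ in range(iterations):
            m = (m @ v) / n            # one application of the formula above
            m = m / m.sum()            # renormalize
        return m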
And how does all this relate to the breast? Hypothetically thus: In the first chapter we claimed that under normal circumstances the breast is the first object – the referent of the world – whose imprint, in the form of a representation, a signifié X, the little one forms in his little head. Before that, only a few isolated circuits-modules-reflexes are present in the mind, each responsible for a different, mostly sensorimotor, schema. Suck. Cry. Sleep. Only a few modules, and billions of neurons merely awaiting their structuration. Each of these modules disposes of a certain magnitude. As a first primordial order begins to constitute itself in the little one's synaptic networks – and we claim that this primordial order is nothing other than one big breast properly filled with milk – this representation Ň naturally begins to hook itself onto the circuits already present, and their magnitude begins to raise the magnitude of the representation Ň. We say « raise », yet perhaps we shall do better – if we wish to be understood right now, as we prepare to put forward our boldest hypothesis – to use the verb « deepen ». This hypothesis flirts, as it were, with the question: what might an infant dream about? The answer, in the context of this article, is, we believe, sufficiently obvious. The question is: why? The answer is: because the soul25 of the infant will, with the highest probability, fall into the semantic attractor with the highest magnitude. Let us clarify what we mean by « falling into the semantic attractor with the highest magnitude »: imagine that the substrate of the mind is a certain elastic fabric. On this fabric lie bodies26, each with a certain mass. Imagine that every entity X coded by one row of our above matrices is such a body, the magnitude of X being the body's mass. In other words, the greater the magnitude of X, the more it « bends » the elastic fabric of the mind. And what has this to do with dreams? Well, the following: an individual soul can be pictured as a tiny little ball hurled with great force onto the surface of the elastic fabric of the mind. The more bent the region it is just passing through, the more probable it is that the little ball-soul will fall into a pit. And of course, a region is the more bent, the heavier the body in its vicinity. And what happens in the pit? Well, that is simple: one need only think away the great body which bent the fabric of the mind – and think only of the bending itself. In the pit the ball-soul naturally falls precisely into the point which bent the fabric of the mind. It acquires the topological coordinates of the entity we placed at position X. The entity X plops into the soul. The soul « beholds » the entity X. And the infant dreams of the breast, and the soul wanders on. That is the first, topological, way of looking at the whole matter. Let us imagine a second, probabilistic approach. Here we return once more to the trick Larry and Sergey performed when striving to reach their PageRank. They imagined a « randomly browsing internaut » who,
25 the word « soul » is used in this text as a poetic synonym for the dry scientific « consciousness »

26 « Lopta », O Czech brethren, is for us Slovaks nothing other than your « míč » – a ball

on his wandering through the web, clicks on the links he finds on the page before him in an entirely random fashion. In that case it holds that if 100 links lead from page X to other internet pages, 10 of them to page Y and 20 to page Z, the probability that the internaut gets from page X to Y will be 0.1, and to Z 0.2. Analogously, we may imagine a soul roaming the semantic labyrinth of the mind – for illustration, best a sleeping soul, upon whose trajectory the inputs from the surrounding environment have no major influence. Imagine that the soul of the dreaming little one has just fallen into the attractor « Ň-present ». It may subsequently take five new paths, each with a certain probability – the probabilities being computed from the values present in the synchronous-associative matrix 3 in such a way that their sum in every column equals 1. In other words, the soul always takes one of the little paths offered to it: with probability p1 it finds itself « at » entity X, with probability p2 « at » entity Y, and so on.

    K-D matrix   Ň-p.   Ň-n.   Bliss   Pain    face   silence
    Ň-p.            0   0.07   0.368  0.037   0.179    0.349
    Ň-n.         0.06      0   0.035  0.388   0.263    0.111
    Bliss       0.317   0.03       0      0   0.126    0.349
    Pain         0.03   0.35       0      0   0.316    0.016
    face        0.256   0.42    0.21  0.555       0    0.174
    silence     0.332   0.11   0.385  0.018   0.116        0
    (sum)           1      1       1      1       1        1

Matrix 4: The causal-diachronic matrix derived from the synchronous-associative matrix 3 by normalizing every value by the semaxonal sum of the column in which the value lies. If the hypothesis holds that the probability of activation of a neurolinguistic structure X by a neurolinguistic structure Y is proportional to the strength of their mutual associative connections, then the value at position X,Y gives the probability that neural pathway X will activate neural pathway Y.

Such a matrix, derived by a simple normalizing computation from the synchronous-associative matrix, we call a diachronic-causal matrix. This matrix no longer represents the strength of associations between two neurolinguistic structures; rather – provided the hypothesis holds that the probability of activation of neurolinguistic structure X by neurolinguistic structure Y is proportional to the strength of their mutual associative connections – it will give us the probability of the soul's « beholding » of meaning Y after it has beheld meaning X27. What pleasantly delighted us during the construction of this matrix (see matrix 4) was the – for the mathematically more adept28 surely trivial – finding that it is no longer a diagonally symmetric matrix but an asymmetric one: its graph must necessarily be oriented. This is a gratifying finding, because we feel it accords with the real state of things. Our introspections and meditations do indeed suggest to us that the probability of the mind making a leap from the image of a « face » to the image of « bliss » differs from the probability of the leap from « bliss » to « face ». In appendices 1 and 2 we show concretely how, from the values of our causal-diachronic matrix, we may at last converge29 to the values of the quantity which, on the preceding pages, we called the « semantic magnitude of representation X » within a given semantic network. The finding – made a few hours ago – that the mathematical world really does « behave » as it « behaves » probably surprised us no less than it did Sergey and Larry, for whom, we dare to claim, it was precisely this « convergence of the values » that was the clear sign of, and the moving impulse for, the founding of the Google company.
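Both the derivation of matrix 4 from matrix 3 and the « wandering soul » itself can be sketched in a few lines. The function names below are our own, and the walk is merely an illustration of the column-normalized transition probabilities described above.

    import numpy as np

    def causal_diachronic(sa):
        """Turn a synchronous-associative matrix into a causal-diachronic one:
        zero the diagonal and divide each value by the semaxonal sum of its
        column, so that every column sums to 1 (as in Matrix 4)."""
        kd = sa.astype(float).copy()
        np.fill_diagonal(kd, 0.0)
        return kd / kd.sum(axis=0)

    def wandering_soul(kd, start, steps, rng=None):
        """The dreaming soul's random walk: from representation Y it moves to
        representation X with probability kd[X, Y] (column Y holds the
        outgoing probabilities)."""
        rng = rng or np.random.default_rng()
        path, current = [start], start
        for _ in range(steps):
            current = int(rng.choice(kd.shape[0], p=kd[:, current]))
            path.append(current)
        return path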
27 or rather the probability of meaning Y « plopping » into the soul after meaning X has plopped into it? ;)

28 among whom we certainly do not count ourselves, since all that comes to our mind at the word matrix are quotes from the film of the same name, e.g. « Do not try to hit the ball. Hit the ball. »

29 Of course, only if our matrix has been correctly assembled and each of its columns sums to 1. For only in that case can the so-called « fixed point theorem » apply – a theorem which we do not understand at all, but to which we are deeply obliged for holding.

Since, however, we are already straying too far from the central attractor of our text – from those rosy buttons fit to be bitten – we now command a turn. Yet before we promise our dear, humanities-educated and mathematics-allergic reader that it will get no worse than it was on the preceding pages, let us still permit ourselves a small reflection on what we currently perceive as the central problem of the relation between synchronous-associative and causal-diachronic matrices. Namely: given that at the beginning of this part we claimed that:

First proto-postulate: if two representations are activated at once, or with a time gap so small that in the domain of consciousness they merge into one gestalt whole, their associative connections in the S-A matrix will be strengthened

and given that we simultaneously claimed that:

Second proto-postulate: the probability of activation of neurolinguistic structure X by neurolinguistic structure Y is proportional to the strength of their mutual associative connections

we have the impression that the conjunction of the two claims must necessarily lead to the following cumulative process:

1) Y activates X with probability p1
2) since X was activated right after Y, the strength of the associative connections between X and Y is reinforced (follows from the first proto-postulate)
3) since the strength of the associative connections between X and Y was reinforced, the probability p2 that X activates Y rises, as does the probability p3 that Y activates X (follows from the second proto-postulate)
4) since p2 is now greater than before, it is more probable than before that the just-activated X will « hand the baton back » to the just-previously activated Y, and we return to point 1, with the difference that the probability of Y activating X is no longer p1 but p3, of which we know (from point 3) that p3 > p1

Briefly and clearly: the neural pathways X and Y would begin to play ping-pong with one another, the strength of their associative connections would grow ad infinitum, and the probability of their mutual causal-diachronic activation would tend in the limit towards 1. Though there is something very attractive in the idea of two entities mutually activating one another, thereby etching a specific groove into the elastic fabric of the mind, it is almost30 certain that unless we integrate into our model some further « damping postulate »31, our model will never adequately explain the workings of the human mind.

30 we say « almost » because it is also possible that the error lies neither in the postulates nor in their insufficiency, but merely in our interpretation, in our ignorance of the principles of the probability calculus. In that case it may turn out that what seems to us an insurmountable problem is, on the contrary, the strongest point of our theory

31 perhaps something to do with the « decreasing attractiveness of ever-repeating things », with forgetting, with entropy, with the Skinnerian extinction curve, with time, and so on
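The runaway character of the conjunction of the two proto-postulates is easy to demonstrate by simulation. The sketch below is a toy of our own, with arbitrary initial weights: a walker reinforces every transition it makes, so the weights of the favoured transitions grow without bound – which is precisely why some damping term is needed.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4                                   # four pathways, an arbitrary choice
    w = np.ones((n, n)) - np.eye(n)         # unit weights, no self-activation
    current = 0

    for _ in range(10_000):
        p = w[:, current] / w[:, current].sum()   # second proto-postulate
        nxt = int(rng.choice(n, p=p))
        w[nxt, current] += 1                      # first proto-postulate:
        w[current, nxt] += 1                      # co-activation reinforces (S-A symmetry)
        current = nxt

    print(w)   # a handful of mutually reinforcing transitions hoard the weight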
And now we truly promise our reader that it will get no worse – yes, that reader on whose lips the question is taking ever more nervous shape: « And how, for God's sake, does all this relate to the breast? » And we answer: « Dear reader, roughly thus – if into our illustratively and randomly constructed matrices 2 and 3 we entered the representations of all the individual genetically and prenatally pre-coded modules – samskaras – present in the little one's cerebral structures; if into the millions of empty columns and rows we began to add such key entries as « food » or « light »; if we divided out what we coded with the single synthetic entry « bliss » – those 5 key T – and tied them, through the values of the associative weights, to the individual « sensorimotor schemata » also coded into the matrix; and if we then ran the whole matrix through the algorithm given in Appendix 2, we would suddenly see how very imprecise the results presented in App 1 are. Who knows, perhaps we would see, in accordance with hypothesis Ha1, that it is not the face but the breast that is the « representation with the highest magnitude » – the « body » that most bends the elastic fabric of the little one's mind. How immensely elastic a fabric it is! A time will come, and that time is near, when the first semantic-phonological association arises, a time when my little nephew grasps that the babble his ears hear is the babble his mouth has uttered – a time will come when the word acquires meaning for him. A time will come when his ever more structured mind discovers, in the continuous clusters of sounds and noises, a peculiar order – and the clusters of sounds and noises become sentences. Suddenly another new world will open before him – another dimension full of the unknown. Precisely at that time, those who until then had almost always been only a source of comfort and soothing will begin to mete out his first punishments. They will thus try to force him to « master » a new sensorimotor schema – to master not only the cry of his vocal cords but above all the sphincters of his anus, and to excrete his stool only under circumstances « in which it is proper ». And so around the excretory schema, opposed to the receptive schema of the sucking reflex, a new « attractor » will begin to take shape. It is onto this attractor that the representation of what we call « punishment »32 will probably hook itself, and later perhaps also its more abstract projection, « sin ». All of this will begin to happen at the moment when, step by step, thanks to an ever more perfect acquisition of language, the installation is launched of a symbol which occurs in the surrounding environment with an immensely high frequency – the little word « I »33. The little word « I » will in the end become an attractor of a magnitude so immense that it will collapse into itself like a black hole – absolutely all circuits will begin to hook themselves onto it. But from where will it draw its magnitude, its « pagerank », at the beginning? Onto which circuits will it first and most strongly attach itself – with which circuits will it resonate? With the breast of old, of which only an ever fainter memory remains – yet a memory still as mighty as that Eco-esque name of the rose (Eco, 1983) which we still carry in memory? Or with the ever mightier complex constituting itself around the representations « you must not », « potty », « you must »? For a concrete human young we are probably unable to answer. The whole matrix is so changeable, so immensely changeable – it is dynamic, it is folded into itself34, it is alive.

32 and perhaps, in some, also the memory of the pain caused by the barbaric ritual, hard for the little one to fit into his system of the world, called « circumcision »

33 that « I » need not, of course, be explicitly pronounced as the morpheme « ja » (« I »); for the deepening of the associative well it suffices that it be present in other forms too, e.g. the Slovak suffix « -M » of first-person verbs, as in « milujeM » (« I love »)
Within our model one cannot decide whether my nephew will become the type of man Freud would call « anal », or remain an « oral » type. For we do not see the world in black and white; in accord with the approach of the « fuzzy logicians » we see, between the two poles, a multitude of shades of grey. The magnitude of the « oral » or « anal » attractor will certainly tell us something about the overall topology of the mind35 – especially if it takes on pathological proportions – yet what matters for our model is the understanding that the mind is not some cold set of generativist instructions upon which the mighty light of the Cartesian « ego » shines. Yes, the « I » plays its role; like the core of the Milky Way it, so to speak, « spins up » the whole system; but life – real life, with scents, touch, compassion and love – takes place on the periphery, around smaller gravitational wells, around the local suns « he » and « she ». Truly, my nephew already carries a whole galaxy within him. A time will come – and that time seems so far away and yet is so near – when my nephew, on his pilgrimage through the cosmos, collides with another galaxy – with the Second one. Then the once so mighty representation of the ample and bounteous Thou, meanwhile so weakened by the scythe of Lady Entropy, will awaken again to Life. New sensorimotor schemata will again be activated – and the nephew will thrust from above, from the side, but, we trust, chiefly from below, yes, Oliver, chiefly from below! – imprinting will occur once more, and the fallen Goddess will again and again36 draw life-giving sap from the semantic networks ever more ordered by the ego. From those networks, ever colder, more ordered and more economically calculating, which may at this very moment be parasitically siphoning the life-giving sap from the magnitude of the life-giving Goddess. That, roughly, is how it relates to the breast. »

34 with the words « folded into itself » we try to lead the reader towards a picture fundamentally different from our schemata, in which every column and row of the matrix is labelled by some signifiant, some phonological label. We wish to draw attention to the fact that in an ideal representation of the mind the labelling labels themselves are « merely » entries of the matrix. The ideal mind-representing matrix thus has no « edge », and we naively visualize it as the surface of a topological torus.

35 one may equally ask: or will the overall topology of the mind certainly tell us something about the magnitude of the individual attractors?

36 and again!

Second garden, third construct: Neurosociology

FAUST: I had a dream, and not about just anyone! I saw a wondrously beautiful tree; on it a pair of apples, loveliness itself; I shall lie in wait for them.
THE BEAUTY: Little apples have tasted doubly sweet to gentlemen ever since Eve's days; and it makes me oh so sweetly glad that a pair of them grows in my little garden.

This dialogue is aptly commented on by Freud: « There is not the slightest doubt as to what is meant by that apple tree and those apples. » Indeed, in popular London slang, for instance, « a nice apple-dumpling shop » means a handsome, rounded bosom. Some readers will, over these lines, recall the Garden of Paradise, and quite rightly so.
Scientists have long been puzzled and provoked by the fact that both Eve and, in Greek mythology, the goddess Eris had something to do precisely with an apple, and that in both cases the apple ultimately had a fine mess on its conscience. In the Hebrew story Eve eats the apple (in fact the book of Genesis speaks of a fruit, yet tradition has always identified this fruit as an apple), and Jehovah, the local deity, seething with rage, curses her, together with the whole human race, for reasons rather remote from logic. In Greek mythology, by contrast, Zeus offends Eris by not inviting her to the wedding festivities on Olympus, whereupon she takes her revenge by tossing among the feasting goddesses a golden apple inscribed KALLISTI (« To the fairest »). The goddesses naturally begin to quarrel over it, each claiming it for herself, since each deems herself the fairest; the dispute escalates until both the other gods and men are drawn into it, and the result of it all is the Trojan War. Eris thus enters common awareness as the goddess of discord, and the golden apple remains forever the apple of discord. — Wilson, Ishtar Rising, p. 108

Truly, the apple rises before us in many myths of the world. Looking to the south we see Hercules just setting out on his journey to the garden of the Hesperides – those three nymphs of the evening who guard Hera's orchard, the orchard in whose midst grows the apple tree that sprouted from Gaia's gift – branches laden with fruit – for her wedding with Zeus. And it is from this very tree, which tradition also calls the Tree of Life, that the hero is to steal the apples – lo, one of the twelve labours he undertakes to accomplish. There are tongues which claim that some of these stolen « apples of bliss » finally reach the fair maiden Atalanta of Arcadia, thrown beneath her bare feet by the suitor Melanion during the race run either for her hand or for his life. Other tongues claim that the apples were given to the youth by Aphrodite herself, after his sincere prayer to Her. Be it one way or the other – or perhaps both: if Melanion's apples were brought by Hercules, he was thereby fulfilling Aphrodite's secret command – one thing is certain: the apples enthralled Atalanta, she stopped, lost the race, and Melanion led her off to a bed where he surely made her fairer still. Looking to the east we see not only Eve's Eden but also the still-smouldering ruins of the temple of the wisest of Jerusalem's kings, the king who, through the bridegroom's lips, spoke:

Let Thy breasts be to me as clusters of the vine, and the scent of Thy breath like apples

yes, the king who, through the lips of the betrothed, spoke in the Song of Songs:

I would make Thee drink of spiced wine and of the must of my pomegranates

yes, spoke of that betrothed who pleads thus:

Strengthen me with raisin cakes, refresh me with apples, for I am sick with Love

Looking to the north, to the ice of the Nordic Edda, we see the half-elf, half-goddess Idunn37 guarding the apples that bring and guarantee the gods eternal youth, and hence eternal life. After this fair-haired, eternally young one is abducted by a malevolent giant, all the gods – all the Aesir – begin to age. They persuade and finally dispatch the trickster Loki; he rescues Idunn, and with the return of her little apples the deities recover their lost youth. Among the foremost of the Aesir stands Odin with his beloved Frigg. She and no other is, according to the Edda, « the foremost of all goddesses ». Her name is most often interpreted as « the beloved » (look into, and search the space somewhere between, the Sanskrit priya – « beloved woman, wife » – and the Icelandic frjá – « to love »); we sense that she has a finger in everything connected with fertility, in everything connected with the feasts of the marital bond.
And so we are not surprised that even today, looking northwards, we see her sending an apple to King Rerir, the king who so humbly begs Odin for an heir. The king's consort bites into the apple, whereupon follows a pregnancy lasting six years, crowned by the birth of the hero Volsung. The Volsunga saga can begin. Looking to the west we see not only Avalon – the « isle of apples » where Excalibur was forged and where King Arthur shall perhaps at last be healed of his wounds. Looking to the Grimm-ishly38 Germanic west we also see the envious queen sending the poisoned apple to the snow-cheeked maiden with raven hair... Again and again, then, we see the semantic attractor we designate with the term « apple » arise in the vicinity39 of meanings such as PAC40 = {youth, life, hero, fertility, woman, Goddess}.

37 from Yggdrasil's ash descended; of elven kin, Iðunn was her name (shloka 6-7, Hrafnagaldr Óðins)

38 The absence of spaces within certain words accords with the author's « sanskritizing » intent

39 Let us imagine a mind M which has coded within it n meanings satisfying the following criteria: 1. every meaning is identical with itself and distinct from all the others; 2. every meaning stands in a relation of a certain « weight » to all the other meanings in the Mind (put platonically (Vopěnka), every idea participates, to a certain degree, with a certain « strength », in every other idea, and vice versa). A meaning-idea so conceived can be represented as a point in an n-dimensional Hilbert space whose coordinates are given by the normalized association weights (cf. part 2.2 – every row of matrix 4 can be understood as a vector giving the coordinates of the given meaning in semantic space: the value in the first column gives the distance from the origin in the first dimension, the value in the second column in the second dimension, etc.) to the first, second, up to the n-th meaning of the Mind. In other words – for every content of the mind, a new dimension. The X-th meaning has, in its X-th dimension, a coordinate of value 1, whereby its uniqueness, required by the first criterion, is secured. At the same time, the coordinates of this point-meaning, its position, also contain the information about its relation to all the other contents of the Mind. An answer to the question « What is the meaning of a word and how is it to be quantified? » formalized in this way seems attractive to us not only because it is in essence close to the already existing method of Latent Semantic Analysis (http://en.wikipedia.org/wiki/Latent_semantic_analysis), but above all because it will allow us, relatively simply – using the mere Pythagorean theorem or simple trigonometry – to compute the distances, or the sizes of the angles, between two or more meanings. When we therefore say that « princess » is closer to « apple » than « compass » is, we are speaking of – at least theoretically – measurable quantities.

40 PAC = Primary Associative Complex, or Primary Associative Cluster; SAC = Secondary Associative Complex, or Secondary Associative Cluster
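The construction of footnote 39 can be made concrete in a few lines: take the rows of a matrix-4-like array as coordinates, pin each meaning's own dimension to 1, and measure with the Pythagorean theorem. A minimal sketch, with names of our own choosing:

    import numpy as np

    def meaning_points(kd):
        """Footnote-39 construction: each row of the causal-diachronic matrix
        gives the coordinates of one meaning in n-dimensional space; the
        meaning's own dimension is pinned to 1 to guarantee its uniqueness."""
        pts = kd.astype(float).copy()
        np.fill_diagonal(pts, 1.0)
        return pts

    def semantic_distance(pts, x, y):
        """Plain Euclidean ('Pythagorean') distance between meanings x and y."""
        return float(np.linalg.norm(pts[x] - pts[y]))

A claim like « 'princess' is closer to 'apple' than 'compass' is » then becomes the testable inequality semantic_distance(pts, princess, apple) < semantic_distance(pts, compass, apple).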
One may of course object that the examples of Eve, Eris or Snow White also suggest a connection with meanings such as SAC = {snake, death, poison, sin}; we shall, however, maintain that the emergence of these entities is a secondary phenomenon, caused by the ever-strengthening dynamics of the developing myth, and that the primary complex as such is all too s?proste41 and simply beautiful. In other words – we start from the conviction that the primordial image of the world is good and beautiful, and that any presence of evil within it is caused not by the intervention of a Manichean devil existing from eternity to eternity, but rather – like the potion of a wicked witch – by the sign of negation slipping gradually into the fairy tale. Without its presence the story42 would have to remain without a point – nothing in the image of the world, or in the world itself, could live without its smile. Do we know anything about the apple – or rather about the signifié of the signifiant « apple » – that could illuminate its occurrence in the bosom of the PAC? Put more humanly: is the frequently observed occurrence of the apple at the side of youth, strength or fertility merely an illusion, a scientific maya which our probing gaze uncovers wherever it can only because it wishes to arrive at such an uncovering, or is it an objective phenomenon? We claim that it is an objective phenomenon, and here we lay out several, we believe sufficiently rational, arguments with which we would like to support this claim: 1. We know that the apple is the prototype of the category fruit. For readers unversed in semantics we hereby provide this lay definition of a prototype: « The prototype of a category X is that member Y which comes to the examined person's or group's mind most quickly and most often as an answer to the prompt 'Present us one concrete representative of category X' ». Even though it is publicly known, e.g. from (Lakoff, 1987), that the « apple » is, at least in the Indo-Semito-European (ISE) circle, the prototypical representative of the category « fruit », we allowed ourselves to verify this finding within our own semantico-sociological survey (see also the appendix « A few words on questionnaire D2 »). The data speak plainly: to the question « Which concept is, in your view, the best representative of the category 'fruit'? », 227 out of 358 respondents, i.e. 63.4%, wrote, of their own accord, the answer apple or apples. For interest's sake we add that to the question « Which concept is, in your view, the best representative of the category 'flowers'? » we received the answer « rose » in only 54.8% of cases, and that, moreover, under conditions in which the rose was explicitly offered as one of the options, whereas in the case of the question about fruit the respondent had to fill in the field himself. We see, then, that for the sample of examined persons the apple is a stronger prototype of the category fruit than the rose is of the category flowers43. We therefore dare to claim that the « riddle » wrapped in the question « Why does the artistic Tradition most often depict the fruit of Genesis as an apple? » is answered precisely by the fact that the apple is the prototype of fruit.

41 At this moment we begin to implement so-called regular expressions into our works. Regular expressions are used in programming languages whenever we wish to describe not one character string but a certain specific set of character strings. While in computer programming regular expressions are used in a passive sense, as a tool – the most effective tool, some even speak of a « magic wand » – for pattern recognition, we turn their use in this text inside out and employ them actively – for the purpose of activating specific patterns in the reader's mind. One may say that regular expressions are forms of character strings. In the little Slovak word s?proste above we used the metacharacter ?, which signifies « the preceding character (i.e. the character s) may occur zero times or once ». The regular expression s?proste thus shelters within itself two words – « sproste » (crudely) and « proste » (simply). By using the regular expression s?proste we therefore activate, in the mind of a reader acquainted with the function of the metacharacter ?, two pathways, two meanings, at once, without having to resort to the lengthy conjunction « crudely and simply »... Should the honoured reader stumble upon a question mark in the middle of a word in the lines to come, we trust its function will now be understood – it is particularly useful for « erasing » gender differences in the case of entities for which one cannot speak of gender, e.g. « he?she loved » when speaking of a god, etc.
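For readers who wish to see the metacharacter at work, the following few lines of Python (our illustration; any regex-capable language would do) show that the pattern s?proste indeed shelters exactly the two words mentioned in the footnote above.

    import re

    pattern = re.compile(r"^s?proste$")   # '?': zero or one occurrence of 's'
    for word in ["proste", "sproste", "ssproste"]:
        print(word, bool(pattern.match(word)))
    # proste -> True, sproste -> True, ssproste -> False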
Any implementation of a further new metacharacter will, in this as in subsequent works, be explained in an attached footnote.

42 We hold the position that for consciousness there is no difference between the world and its image – except for the possible additional knowledge that the image is only an image. Said in the words of postmodernity – there is no difference between the simulation and that which is simulated.

43 Said in the words of our little theory – the weight of the association between apple and fruit is greater than the weight of the association between rose and flower.

For with the exception of the word and of music, it holds in all known artistic modalities that a general category can be depicted only and only through its concrete prototype. The flower through the rose, woman through Venus, man through David, love through Rodin's kiss, and fruit through the apple. That is why. The relation to fruit has surely already lifted a corner of the veil over the mysterious relation between the apple and the other components of the PAC. For the fruit-apple is full of vitamins, i.e. substances the body needs yet cannot manufacture itself. And therefore: 2. The Apple-Fruit is healthy. We know that what is healthy is alive. What is healthy and alive is beautiful. Or strong. « Beautiful is the Goddess, strong is the hero » – so we might claim, and thus, leap by leap, arrive at the uncovering of an essential relation between any meanings X and Y. We resort to such rhetorical tricks of lawyers and theologians, however, only to point out their absurdity. As we said at the beginning of part 2.2, a meaning not embedded in a network of meanings44, or an idea in which other ideas do not participate, is not a meaning, is not an idea. In other words, it is certain that by a similar violation of the copula « is », by similar hopping, a more skilful orator will sooner or later always get from the desired A to the desired Z. Perhaps his rhetorical capacities would multiply if, for the production of metaphors, he also used computing machinery to search for « paths in a closed graph », e.g. by means of Dijkstra's algorithm. We would rather leave such methods to the scholastics and attempt to defend our hypothesis H2 – « Between the idea of the apple and the idea of the breast there exists a relatively45 strong association, or even something like an attraction » – by the method of the geometrization of semantic space. Yet in order to model such a semantic space properly, and then to measure distances within it, we would need so immense a quantity of empirical data that this approach exists for the moment – and perhaps forever – only in the form of a rather feebly described « Gedankenexperiment ». How it might be possible, in further works, to leap from the Gedankenexperiment to real applications we shall try to indicate chiefly in the last parts of this work. To be able to do so, however, we must keep trying, as much as we can, to return from graph theory and Hilbert spaces back to the Earth and her apple. Let us therefore continue examining its « accidents »: 3. Of the apple we already know that it is healthy and good. We also know that it mostly fits snugly into the palm, is smooth, bites well46, is firm and is round. A person better versed in modern semantic theories would say that the signifié « apple » can be decomposed into those elementary components of semantic analysis – called « semes » – such as « roundness », « goodness », « density », « fit for biting », « palm-sized ». The problem with these semantic theories, however, is that they are mostly strictly binary – either a seme participates in constituting a given signifié or it does not – and if it does participate, it is as essential for the given signifié as all the other semes contained in it. On this point our approach differs fundamentally from the classical semantic theories.
44 A semantic network can thus be understood as a connected graph, « i.e. a graph for which it holds that for every two vertices X, Y there exists at least one path from X to Y » (http://cs.wikipedia.org/wiki/Souvisl%C3%BD_graf). Other entities which may be internally represented in a similar way, i.e. as a connected graph, are e.g. a neural network or human society. For just as it holds for the meaning of a word that it necessarily stands in relation to other meanings, so it holds for human society that a person without society is not fully a person, and a neuron with no bond to other neurons is not really a neuron. For how could it be called a neuron when it has neither axons nor dendrites?

45 relatively, i.e. in comparison with others

46 An exception being the cylindrical protrusion, somewhat reminiscent of a lignified little hair, which is also called the « stalk »

For we are convinced that in the constitution of each of the meanings contained in the mind, myriads of other meanings participate through their semes, each with a certain weight. This weight of the connection, this strength of the seme shared between X and Y, can be construed: • as the share of the charge47 which « flows off » from X to Y (if, for example, the weight of the connection between apple and breast is set at 0.023, then activating « apple » in the subject's mind with a charge of 2 will result in an activation of « breast » with a charge of 0.046); • as the probability that after the symbol X the symbol Y will arise in the mind (or even on the tongue, or on the fingers pounding the keyboard). Even though our efforts currently aim at overcoming classical semantics and at a mathematization and formalization of the science of meaning – a step so elusive that the venerable founders of the ancient sciences surely never even dreamt of it – quantitative semantics must nevertheless thank its ancestors for having, with their insight that the essence of metaphor is the sharing of semes, returned the theme of metaphor from the periphery of academic interest into the focus of the humanities and cognitive sciences – and, who knows, perhaps one day of the « hard » natural sciences as well. « I believe, however, that the myth cannot be explained only at the linguistic level, because the principle of the metaphor is deeply rooted in human behaviour in general, and especially in human thought as an expression of its natural tendency to abstraction » (Oberfalzerová, 2006). With this claim we agree, and we do so even in a state of (we hope)?48 temporary ignorance of the work Metaphors We Live By from the pen of one of the most respected cognitive scientists of the present, G. Lakoff. With this claim, too, we agree, and we go still further, claiming that metaphor and metonymy are not only the key to understanding the human mind, but that the human being itself is an eminently metaphorical being, a being which « dwells poetically »49. Voilà the reasons which lead us to this conviction: 4. The breast, too, mostly fits snugly into the palm, is smooth50, bites well, and it does no harm when it is firm and round. Both the suckling and the lover would probably also agree with the claim that the breast is healthy and good. Said in the language of classical semantics, the female breast and the pomme share more than one seme. As we have said, the essence of metaphor is the sharing of semes – the more shared semes, the higher the probability that the metaphor will succeed. The sharing of semes such as « to nibble », « spherical », « palm » thus suggests that a spontaneous leap of the meditating mind from the breast to the apple, or back, need not be an unreal possibility. Said in the language of our nascent theory, the distance between « breast » and « apple » in the Hilbertian semantic space is smaller than, say, the distance of « breast » from « compass » or of « apple » from « ruler ».

47 we prefer here the neologism « charge » (príraz) rather than « energy » or « force », for we do not want a needless and undesirable interference of our concepts with the exactly defined concepts of physics. Our theory is still searching for its concepts.

48 In this regular expression the already-mentioned metacharacter of the question mark follows a group enclosed in parentheses. Parentheses also belong among the metacharacters – their function is to mark everything between them as one group entity. With the expression (we hope)? we thus in fact say that the author's intention remains fulfilled both when the « we hope » occurs in the text and when we leave it out. This is because the closing question mark permits both 0 and 1 occurrences of what precedes it, which in this case is the whole group of characters « we hope ».

49 « Doch dichterisch wohnet der Mensch auf dieser Erde » – « yet poetically dwells man upon this earth » (Heidegger, 2006)

50 Exceptions being the fine little hairs and the cylindrical protrusion, with its surroundings, called the « nipple »
The reason being that the point by which we represent « apple » has, on the axis (in the dimension) with which we code the seme « roundness », approximately the same coordinate value as the point by which we represent the « little breast »51. And what is more – in the dimension representing semes such as « little hair » or « to bite », « apple » will be closer to « the charms of the beloved » than, say, to « a brick ». We have already tried to indicate that the distance of meanings in a validly constructed Hilbert space should be proportional to the introspectively perceived distance between the given meanings (the greater the distance between the meanings, the more differently they are « lived »; the smaller, the more similarly). And what else than the metric of the similarity-difference of meanings should be a reliable indicator of the possibility of a trope? It is beyond doubt that if the claim « Man is a metaphorical being » holds, then the hypothesis H2, which we state here for the moment in the form « The breast and the apple stand in a mutual metaphorical relation », should be verifiable by empirical research on human beings. When we use the words « empirical research », we do not mean qualitative or phenomenological research; no, we would truly not dare to call scientific research a stroll during which, on the basis of « weight, density, sweetness, suppleness and further properties », a poet sorts the charms of the world's damsels among the varieties « Granny Smith », « Golden », « Karmína » or « Yonigold ». Assenting to the claim that « science strives to replace qualities... with measurable quantities » (Sokol, 2007), keeping to the direction marked out by the words « in social-science research the members of the sample are usually not features but persons » (Skripnik/Lindová, 2007), and aspiring to « scientize » the mystery-shrouded quality called « the meaning of a word », we chose as our path the most traditional of quantitative researches – questionnaire research. Voilà what we arrived at: After all seven respondents of our questionnaire D1 – whose question read: With what kind of fruit do you most strongly associate the concept « female breasts »? – chose, as one of their two choices (from the offered selection « apple », « grapes », « melon », « peach », « orange »), the answer « apple » (peaches, oranges and melons contended for second place), we realized that some hidden variable would probably be responsible for such startling results.

51 On the example of the word « ňadierko » (little breast) one can illustrate a further property of the space we are trying to describe here. We wished to describe metaphor, i.e. an eminently semantic phenomenon, and so we spoke only of the semantic aspects of the word – we constructed the space in the manner: one seme, one dimension. Since, however, the word is an entity with three faces – semantic, phonological and grammatical – our system for the quantification of the realm of words will not be complete until we manage to integrate into it, in the form of certain axes (dimensions), the syntactic and phonological properties as well. Should that succeed, the phonological similarity between « ňadierko » and « jadierko » (little pip) might become a further – very feeble, since valid only within the circle of those who know Slovak – argument for hypothesis H2.
We subsequently identified this hidden variable as the already-mentioned prototypical relation between « apple » and « fruit ». In other words, the central question of our questionnaire D1 could be reformulated as: With what representative of category X do you most strongly associate the concept A?, where X is fruit and A are the summits of women's chests. Searching for the possible flaw in this question, we then realized that – since the apple is the prototype of the category fruit – the answer « apple » would most probably be given whatever the concept A might be; the question might just as well read « With what kind of fruit do you most strongly associate the concept building? », and the answer would most probably still be « apple ». In other words – by the very utterance of the word « fruit » in the first part of the question, the symbol « apple » is activated in the respondent's mind owing to the prototypical relation between fruit and « apple », and unless the second part of the sentence somehow « overrides » or « redirects » this flow towards the apple – as, e.g., in the question « With what representative of the category fruit do you most strongly associate the concept slivovitz? » – the final answer will be chiefly a consequence of the prototypical bond between the category X and its member Y, and not a consequence of the bond we wished to « uncover », i.e. the bond between the category X and the concept A not belonging to it. We therefore decided to adjust our questionnaire. The result was the paper questionnaire D2 and the internet questionnaire D3, in which the faultily constructed question – in fact a kind of hidden semantic conjunction – from questionnaire D1 was split into two parts placed in mutually separated sections of the questionnaire, whereby we wished to prevent possible unwanted interferences. Given the fact that we obtained the most respondents thanks to questionnaire D3, we shall concentrate in the following paragraphs on this questionnaire alone. Those interested in further information are hereby referred to the appendix « A few words on questionnaires D2 and D3 » of this work. The question « Which concept is, in your view, the best representative of the category 'fruit'? », labelled 2.3 in the questionnaire, we have already dealt with. Although the already-mentioned 63.4% in questionnaire D3 pleasantly surprised us, it was no new finding, only a further confirmation of a hypothesis already affirmed many times by scientists. It was in any case only a marginal question, a kind of superstructure over the true core of our research. That core was question 1.3: With which of the listed members of the semantic class « food » do you associate the concept « breasts »? Five possible answers were given: meat, fruit, milk, bread, vegetables. Since D3 was an internet questionnaire employing the sympathetic open-source application PHPSurvey, we made full use, when assembling it, of the possibilities this application offers. The key decision, in the end, was to ask the respondent not for one, two or three « equally strong » answers, but on the contrary to require a « scoring » of the strength of the relation between all 5 members of the category X and the concept A.
We decided for the approach in which we ascertained, for every respondent, not merely the strongest of the links between a member of the category «food» and the concept «breasts», but on the contrary measured all 5 links – the strength/weight of a link could be specified by an integer from 1 to 5, and two or more links could have equal strength/weight – because it is far more consistent with our «fuzzy approach», for which it holds that «everything is related to everything: be the relation ever so faint and fleeting, related it is». It is precisely by this «fuzzy» aspect, in which the decisive role is played by the quantity called the «weight of a semantic link», that our method differs from the classical Jungian method of free associations, in which a variation on the Aristotelian law of the excluded middle holds, characterisable by the words «EITHER this symbol is activated-articulated, OR that symbol is activated-articulated». This leap from black-and-white to a multitude of intermediate levels of grey was caused, among other things, by the fact that whereas Jung analysed the minds of individuals without having access to their neurons, we analyse the «mind» of human groups while having direct access to their elementary components – human beings. Now to the results. The strongest link proved to be that between breasts and milk: its total weight, after the answers of 358 respondents, averaged out at 4.2. This is not too surprising; of the milk-giving functions of the chest of the human female, which we thematised more closely in the first chapter of this work, probably only a few chronic puritans remain unaware today, and such apparently did not occur in our sample. That the weakest links proved to be those to vegetables (weight 1.7) and to bread (1.9) is likewise not too surprising. Someone may ask what could possibly cause the fact that the weight of the association leading from bread is 0.2 higher than that leading from vegetables. We claim that an answer of the type «only Woman can satiate more than bread», despite its truthfulness, is gravely insufficient; as we suggested a few paragraphs ago, with similar turns of phrase absolutely anything can be defended. Let us therefore, within this text, stay with the mistaken conviction that the difference of 0.2 is just a random fluctuation which, upon enlargement of the sample, the law of large numbers would probably level down to a minimum. Only further research could persuade us that this is not so – but why threaten research which almost certainly nobody will ever carry out...⁵²

52 And yet, after all: the application «R for statistical computing» eventually allowed us to make the following claim: a Student's paired t-test performed over the set of obtained data indicated a statistically significant difference (t = 3.2664, df = 357, p = 0.001195) of magnitude 0.2122905 between @Bread, Female breasts@ and @Vegetables, Female breasts@.

The most interesting results, however, arise before us in the «middle of the field». We see that meat is associated with the breast with weight 3.2, while the weight from fruit to the bosom is higher still by 0.1, i.e. @Fruit, Female breasts@⁵³ = 3.3. To whoever would object that this 0.1 is likewise a mere random fluctuation, and that with larger samples it would transpire that @Fruit, Female breasts@ = @Meat, Female breasts@, we can offer an interesting counter-argument: in a subset of our sample, among the 110 respondents who in question 9 declared themselves a woman or a girl, the lead of Fruit over Meat is noticeably larger, since @Fruit, Female breasts@ = 3.2 while @Meat, Female breasts@ = 2.8.
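Since R commands will accompany the later analyses of this work, a minimal sketch of the kind of test reported in the footnote above may already be of use here. It assumes – and this is only an assumption made for illustration – that the D3 answers sit in a data frame d3 with one row per respondent and one 1-to-5 column per member of the category «food»; the file name and column names are hypothetical:

# hypothetical input: 358 rows, columns meat, fruit, milk, bread, vegetables
d3 <- read.csv("d3.csv")
# two-sided paired test of the bread/vegetables difference reported above
t.test(d3$bread, d3$vegetables, paired = TRUE)
# one-sided paired test matching the direction @Fruit,Breasts@ > @Meat,Breasts@
t.test(d3$meat, d3$fruit, paired = TRUE, alternative = "less")

The sign of the resulting t statistic depends on the order of the arguments; with the order above, a negative t together with a small p speaks in favour of the fruit link being the stronger one.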
Illustration 1: Histograms constructed over the quantities of association weights @Fruit, Female breasts@ (histogram O) and @Meat, Female breasts@ (histogram M) furnished by those respondents of the internet questionnaire D3 who classed themselves among the "men" or "boys".

And verily, carrying out a series of one-sided paired Student's t-tests leads us to the following claims: while the difference between the mean weight of the association @Fruit, Female breasts@ and of the association @Meat, Female breasts@ is not statistically significant among respondents who answered the gender question by saying they are men (p = 0.5203), boys (p = 0.1423), girls (p = 0.2154) or angels (p = 0.5892), the difference is statistically significant in the case of those who answered that they are women or girls (p = 0.02607). As concerns the total set of respondents: proceeding from the conviction that our respondents were ideal representatives of the bearers of the ISE culture of the beginning of the 21st century, the execution of a one-sided paired Student's t-test indicated that we need not abandon our hypothesis «For the minds of the bearers of the ISE culture it holds that @Fruit, Female breasts@ > @Meat, Female breasts@», since the obtained results are statistically significant (t = -1.6829, df = 357, p = 0.04663). Without leaning on the crutch of the claim «Man is a metaphoric being», we truly cannot explain how it is possible that meat – in essence the very matter of which a breast is built – is not, as to the strength of its bond to the bosom, far ahead of fruit, which at first sight is related to the bosom only distantly. To show that this seeming non-existence of an association between apple and bosom is mere illusion was the endeavour of this chapter: we wanted both to show empirically and to suggest theoretically that «fruit» is either closer to the «bosom» than «meat», or equally distant. And this for the reason that even though in many dimensions – e.g. those by which we encode the semes «animal» or «blood» – the coordinate value of the breast is decidedly closer to meat than to fruit (the first two have relatively high values while the last a relatively low one), in other dimensions – e.g. those by which we encode the already mentioned «roundness», «palm», «fine hair» or «kiss» – A(pple) and B(osom) are closer to one another.

53 For ease and brevity we shall, in further parts as well, use the following notation for the strength of the associative link between two concepts: @From, To@.

Summa summarum: if we interpreted Hume's words, quoted in our-favourite-philosophy-textbook (Benyovszky, 2007), «I see only three principles according to which ideas are associated: resemblance, contiguity of place or time, cause and effect», in the sense that meat, as the matter without which there would be no bosom, is in fact its cause, whereas the apple stands to the bosom in a relation of resemblance, then the results of our research would suggest that causality either plays a duet with metaphor, or only second fiddle⁵⁴. Either way, we believe that thanks to that triad – thanks to meat, breast and fruit – we have managed to persuade at least a few readers that man is a being in whose life metaphor plays a central role; and, paraphrasing Hume, we claim: «We see only one principle according to which ideas are associated: resemblance – the sharing of a seme⁵⁵. And it matters precious little whether the shared seme encodes joint occurrence in space or successive occurrence in time.» Something of the sort, however, we have known since Aristotle, and perhaps since older ones still, and our intention here is decidedly not to reinterpret the already interpreted.
On the contrary, our goal is to develop our knowledge of metaphor and of word meaning into such a shape that even a Turing machine will be able to permute the elementary components of the realm of meaning, thereby creating metaphors and thus subsequently manifesting itself as a being endowed with spirit⁵⁶. We claim that Deus will remain ex machina at least until those who wish to breathe a soul into the machine understand that metaphor has been, since time immemorial, an effective poetic method of Nature herself: that it is precisely metaphor which, according to Morris's bizarre hypotheses (see section 2.1), transferred not only the round halves of the buttocks onto Woman's chest but also the lips of the vulva onto the lips of the mouth, thereby confirming the lover's gaze all the more in its Love for the Face of the beloved; that it is precisely metaphor which makes the darting gaze of the little one, searching for that source breast-areola-nipple, suddenly stay fixed a while longer upon that source of light which does not go out, the Source eyeball-iris-pupil; that it is precisely metaphor which makes the beloved – in tender childhood likewise Pavlovianly conditioned by the desire for the apples of the chest – passionately sink her fingers, in moments of rapture, into that which is closest... semantically closest... to what she once, as a babe, so loved: the hills of his arms.

Figure 2: In the right part one may behold the curve interpolated over histogram M; in the left part one may delight in the curve interpolated over the Y-axis-flipped histogram O.

54 From the fact that «the esteemed reader understands without difficulty precisely such expressions as 'second fiddle' while having trouble understanding this very footnote» it follows that «causal thinking is perhaps nothing but a certain special case of metaphoric thinking, much as the classical physical macroworld is but a certain special case of the quantum world» – something of that sort we dare to claim truly only here, below the line.
55 Thanks to a joke of doc. Pinc I realised that the sharing of a seme plays a key role not only in metaphors but, for example, also in one whole – and rather large – category of jokes. The audience is asked: «What is the difference between a scientist and a tapeworm?» After a moment of awkward silence comes the answer «None», followed by the clarification «Like the tapeworm, the scientist too spends most of his time in the region of the lower outlet of the digestive tract; and like the tapeworm, the scientist too once in a while releases into the world some article or other»... Upon subsequent analysis we see that the comic effect is caused by nothing other than 1) directing the listener's attention to the semes which both concepts share, i.e. the semes «being deep in the sh...» and «releasing an article», and 2) diverting the listener's attention from the fact that both concepts also shelter myriads of other semes which they do not mutually share – this diversion being accomplished by means of the gently deceptive answer «none». Voilà one of the principles by which even a machine can make jokes.
56 But the greatest thing by far is to have a command of metaphor. This alone cannot be imparted by another; it is the mark of genius, for to make good metaphors implies an eye for resemblances. (Aristotle, Poetics 59a)

Third garden: Man, his blood and his cross

Albrecht Dürer – Adam and Eve – Florence

Take the full breast of your sister Isis, bring it unto your mouth! "Mother of N.," so said I, "give thy breast to N., that N. may suck therewith." "(My) son N.," so said she, "take to thee my breast, that thou mayest suck it," said she, "that thou mayest live again," so said she, "that thou mayest be (again) small," so said she. (Pyramid Texts, utterances 42 and 470)

Third garden, construct one: H(isto|yste)ria

Sexual attitudes, like all other attitudes, draw upon unspoken and often unconscious premises.
Creative thinking, always distinct and clearly intelligible, is the result of frustration: a person perceives a problem that needs solving, and in solving it creates further thoughts. Yet the greater part of human «thinking» does not consist of these purposeful, distinct and creative thoughts; most of what we take for our mental activity consists of unintelligible, half-conscious semantic reflexes – reactions to key words which particular situations call up in our mind. For example our mental reaction to sex – our so-called «philosophy» of sex – is in most cases a system of neuropsychological reactions to a few very simple «poetic metaphors». The concrete metaphor which has had the greatest influence on Western civilisation, and which is the substance of the traditional Judaeo-Christian dogma, is the belief that sex is «obscene». Sexual intercourse is something dirty; the sexual functions are something as repulsive, embarrassing and «unseemly» as the excretion of faeces, and so on. We call these simple poetic metaphors because we can analyse them in the same way literary critics analyse verse. A metaphor is an identification of two different factors. A simile, for example, says: «A ship is like a plough.» A metaphor suggests the identification less distinctly, yet all the more effectively, without stating it openly: «The ship ploughs the waves.» For when the identification is expressed as a less unequivocal assertion, it is less probable that we will disagree with it... Judaeo-Christian theology speaks of sex incessantly in metaphorical terms and writes of it as of something indecent, so that the identification of sexuality with obscenity was subliminally «installed»⁵⁷ into people's psychological and neurological reactions without their having the slightest notion of the «poetic» or prelogical nature of this identification. When the Romantic poets compare sexuality to budding flowers, sprouting grass, greening shrubs and so forth, they create an identification that steers towards an entirely opposite kind of reaction. From them, then, we receive the equation «sexuality equals spring», which stands in fundamental opposition to the Judaeo-Christian equation «sexuality equals obscenity». Both equations, however, have their psychological effect, for they are poetic and insufficiently distinct.⁵⁸
Wilson, Ištařin návrat (Ishtar Rising), or why the goddess descended into the underworld and what awaits us now upon her return, p. 89

57 «Photos containing a fully exposed breast - as defined by showing the nipple or areola - do violate those terms on obscene, pornographic or sexually explicit material and may be removed,» he [facebook spokesman] said in a statement [concerning the breastfeeding photo ban]. (Telegraph, 2008)
58 The passages were emphasised in bold only subsequently, by the author of this bachelor essay.

Let us now try to elucidate the neurosemantic essence of meaning and metaphor. Let us imagine some primitive proto-language at so primitive a stage of its memetic evolution that it has not yet managed to acquire any grammar – i.e. no cases, no prepositions, no sentence structure. Even though in its syntactic component it resembles the language of primates more than full-fledged human speech, it may in any case be called a language, since the individual signifiants of this language activate in the minds of listeners certain neural circuits, e.g. memories or fragments of memories most strongly associated with past occurrences of the given word. Let us imagine, then, that of a being whose behaviour we wish to predict, and of whom we know that she uses this proto-language, we know that she has in her mind the word «chest» associated with the word «breast» with, let us say, the strength 0.023.
We may arrive at the knowledge of the stated number in several ways:
• we may let the subject talk at length, so as to find, from an analysis of the obtained corpus, that out of its thousands of occurrences the word «chest» occurs 23 times right next to the word «breast»;
• we may use the method of free associations and for 1000 consecutive days present the subject with certain words, one of which is always «chest», finding that in 23 cases the hearing of the word «chest» activated in his brain the circuit encoding the utterance of the word «breast»;
• we may observe the subject from his childhood and measure that in 46 cases out of a total of 2000 situations in which he came into contact with the referent we denote by the signifiant «chest», he came into contact also with the referent we denote by the signifiant «breast»;
• in the future we shall – or at least some of You certainly will – be able to use yet subtler neurological methods, e.g. magnetic resonance and others, which, especially in combination with eastern meditation practices, might indicate that the mapped set of neurons which activated in the subject during previous experiments while he meditated upon the signifié «breast» is active also during 2.3 % of the time of the experiment in which the same person has the task of directing his meditating awareness at the signifié «chest».

In like manner we shall also find that in the examined mind there exists an association of strength 0.42 between the «palm» and the «breast», and of strength 0.077 between the «apple» and the «palm». When, now, the formula APPLECHEST is uttered in that primitive proto-language, phenomena occur which within our model we describe as follows: the formula consists of two morphemes, «apple» and «chest». Since this is the most primitive of proto-languages, in which not even the order of morphemes matters – i.e. «applechest» is, as to its semantic content, equivalent to «chestapple» – we may claim that no morpheme takes precedence over another, and hence that the impetus with which the mindbrain endowed the heard whole APPLECHEST is divided among the individual morphemes exactly half and half:

0.5 × 0.023 = 0.0115 of impetus: half of the impetus which the listener's brainmind assigned to the heard whole, i.e. to the sound wave APPLECHEST, flows to the morpheme «chest», from which a fraction of 0.023 is channelled further in the direction of «breast»;

0.5 × 0.077 = 0.0385 of impetus: half of the impetus which the listener's mindbrain assigned to the heard whole flows to the morpheme «apple» and onwards towards «palm»;

0.0385 × 0.42 = 0.01617 of impetus thus travels from the morpheme «apple» (activated with impetus 0.5) through the interchange station «palm» (activated with impetus 0.5 × 0.077) into the final destination «breast».

In the end, then, we see that to the final destination, to the point in semantic space representing «breast», there arrive 0.0115 of impetus from the direction of the morpheme «chest» and 0.01617 from the direction of the morpheme «apple», travelling through the interchange station «palm». In total, 0.01617 + 0.0115 = 0.02767 of impetus flows to «breast», which is considerably more than would have arrived at «breast» from the «apple» or from the «chest» alone. This suggests that the mutual joining of the two morphemes brings the listener's mind far closer to that point in semantic space to which the author wished to lead the reader than either of the morphemes would on its own. In other words, the maker of a sentence or metaphor, the author, the-one-who-speaks, strives by his symbol-articulating activity to bring the reader to that point of the semantic Hilbert space which most faithfully represents his creative intent, the «that which wants to be said».
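The flow just computed can be restated in a few lines of R. The weights are the illustrative values from the text above, the equal split follows from the absence of morpheme order in the proto-language, and the whole is of course only a sketch of our model, not of any measured mind:

# toy association weights taken from the text
w.chest.breast <- 0.023
w.apple.palm   <- 0.077
w.palm.breast  <- 0.42
# APPLECHEST endows each morpheme with half of the initial impetus
impetus.chest <- 0.5
impetus.apple <- 0.5
# impetus reaching «breast» along both routes
via.chest <- impetus.chest * w.chest.breast                 # 0.0115
via.apple <- impetus.apple * w.apple.palm * w.palm.breast   # 0.01617
via.chest + via.apple                                       # 0.02767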
With every act of articulation – and not only by adding a morpheme or a word, but also through tonality, stress, gesticulation, a change of word order or a punctuation mark – he specifies ever more precisely the «position» of the semantic attractor into which he wants to trap the listener or reader. Will he succeed – will his metaphor be successful? A successful metaphor is a metaphor either understood in accordance with the author's intention, or a metaphor awakening in its recipient feelings of beauty. The first is the metaphor without which neither the scientist nor human knowledge can proceed; the second is the poet's metaphor. Let us leave this second alternative aside as something we have no right to analyse – as something sacred – and let us concentrate our analyses on the question «When is a metaphor understood in accordance with the author's intention?». A metaphor is understood in accordance with the author's intention when the author brings it about that the mind of the reader or listener wanders as much as possible in those regions of the semantic Hilbert space in which he wants it to wander, and as little as possible in all the others. Or, put differently: what is essential for the success of the above metaphor APPLECHEST is not that its natural consequence is the activation of the signifié «breast» with impetus 0.02767, but that this impetus is, at the given moment, substantially greater than the impetus at the disposal of all the other circuits activated in parallel. And the more this impetus exceeds all the others – which is mostly caused by the given neurosemantic circuit being more thoroughly looped into itself⁵⁹, the track more deeply engraved, the impetus-energy dissipating less to the sides – the more clearly is the sense bound to that circuit perceived. Are Solomon's words about the «gazelle twins» a poet's metaphor or a scientist's? Does he strive, by a splendid combination of symbols, to lead us to this-and-no-other meaning; does he strive, by a linear combination of certain immensely complex mathematical entities, to lead us to this-and-no-other point of the semantic Hilbert space; or does he rather let our mind wander along the tracks «soft, warmth-radiating, fragile, asking to be caressed»? Perhaps the one, perhaps the other, perhaps neither and perhaps both at once – for even that our calculus permits us. What is certain is that the metaphor will probably be successful in neither sense for one who has seen a gazelle but once, and that wallowing behind the bars of a zoo in its own dung. Not even a fawn suspects in what nebulae of semantic space the mind of such an unfortunate will finally end. May it at least not end where the minds of the thousands who took the words of the Song all too seriously have already ended. Behold how such people manage to grasp the most beautiful and the only celebration of the body which the western spiritual tradition affords us; behold how they manage to construe the words of chapter four, verse five:

thy two breasts are like two young roes that are twins, which feed among the lilies

May that unfortunate not lose himself in the labyrinth of his associations and start babbling, in his inner exegesis of the verse, like Bernard of Clairvaux⁶⁰:

The two breasts of the Bride signify congratulation and compassion, following the doctrine of St. Paul, who wants us to rejoice with those who are happy and to weep with those who weep,
or like Maître de Sacy⁶¹:

59 In the more exact language of mathematics one may say «the denser and the more closed the graph representing the examined neurosemantic circuit».
60 Co-founder of the Cistercian order and key ally of the Templar knights in the first phase of their existence.
61 Pascal's contemporary from Port-Royal.

We have already explained that the two breasts of the beloved are either the two testaments – the old and the new – or the two commandments of charity, which are like clusters of grapes, for the word of God stored in these two divine testaments, as well as the two loves directed towards God and towards the neighbour, have the power to intoxicate whoever fills himself with them.

That for what the preceding author calls «intoxication» two testaments are not needed, one being amply sufficient, is suggested to us by an interpretation from the central work of Jewish mysticism, the book Zohar:

By the word «bosom» the Word means good deeds, for just as the breasts make up the beauty of Woman, so good deeds make up the beauty of a man.

In such stiff competition, however, it is after all Woman who finally wins – in this case Madame de Guyau⁶², who sublimates her sap evidently still more intensively than the lads mentioned in the preceding paragraphs:

Because we all suck together from the breasts of the divine Essence, our mother, I too suck unceasingly the bosom of divinity.

The quoted passages, which we translated from the French from the book Tes seins sont des grenades – Pour en finir avec le Cantique des cantiques (Lalou/Woda, 2003), are only the tip of the iceberg. The less perceptive reader will perhaps realise «just how deep the fox's burrow leads» only after reading this quotation from the still-esteemed «father of the Church» Origen⁶³:

Voilà why I hereby give you the warning and counsel that whoever is not yet freed from the obstacles of flesh and blood, and whoever has not renounced the dispositions of material nature, blasphemes absolutely by reading this little book.

Well, nothing can be done: we are going to blaspheme; and if blaspheme we must, then let it be worth it, absolutely: we claim that the Song of Songs is nothing other than an erotic poem par excellence, a celebration of the body without which the mind could not arise⁶⁴. How, though, is it possible that something so evident remained hidden from the sight of dozens of generations of sages? How could it happen at all that despite the presence of this Ode in the very heart of the Bible, despite the presence of tantric texts in the core of Brahmanism, that most beautiful, most sublime thing, wisest in its composition and mightiest in its consequences – the physical act of Love between Man and Woman – was, in the history of the ISE culture, so often insulted, battered, spat upon? By what can we, if not excuse – for certain wrongs once committed cannot be excused – then at least account for that battering of Woman, of the body, of tenderness pre(j|t)avená⁶⁵ (expressed or recast) into matter, coming from the one from whom it would be least expected: from the trinity father-son-spirit which over the centuries becomes an ever greater synonym of masculine coarseness? The answer is, of course, far more complex than any essay could ever be. Why, from the seed of the words of the good-natured gnostic of Nazareth⁶⁶, planted into the substrate of Judaic faith, Hellenic culture and Roman power, there grew at the dawn of our era the symbolic complex that grew, and above all by what mechanisms⁶⁷ this complex secured its two-thousand-year duration – that we might attempt to answer with the tools offered by the many paradigms of the postmodern human sciences: cultural anthropology, sociology of religion, evolutionary psychology, memetics, universal darwinism.
A quantitative and, we believe, also mathematically formalisable method for joining these different approaches will be sketched in the last chapter; here let us attempt to elucidate that bending, that inversion which, it seems, befell the Song as well as the whole of Christianity, through the prism of what we call 1. the Trick with AGAPE, 2. the Rule of semantic transitivity.

62 French mystic.
63 Of whom historical sources claim that he castrated himself.
64 That Mind without which the Body could not «be».
65 The metacharacter | plays in regular expressions the role of disjunction; (j|t) thus means «in this place stands j or t», and the regular expression pre(j|t)avený accordingly activates in the reader's mind two circuits at once: «prejavený» (expressed) and «pretavený» (recast).
66 Ježíš, kam bežíš? Do Nazaretu po cigaretu. («Jesus, where do you run? To Nazareth, to fetch a cigarette.»)
67 We are interested in the subtle mechanisms, the symbolic ones. We feel not in the least called to analyse the coarse mechanisms, those connected with the sword and the guillotine, and we leave their analysis to historians.

Fragment 5: The Trick with Agape
After consulting his supervisor, the author decided to exclude this fragment from the essay.

Fragment 7: A precursor of the rule of semantic transitivity
In section 2.3 we explained what a semantic prototype A (e.g. apple) of a category X (fruit) is. We tried to suggest that between A and X one can measure a certain quantity P1 which we may understand as proportional either to:
● the strength-weight of the association between A and X,
● the probability that a mind perfectly absorbed in its own interior – i.e. a mind into which no inputs from the surrounding environment enter – will leap from A to X.
We also suggested that the category X is associated with further concepts as well (e.g. B – breasts), and the strength of this association we may express by a quantity P2, defined analogously to P1. We sense that if there thus exists a link |@A,X@| = P1 and also a link |@X,B@| = P2, then the value |@A,B@| > P1 × P2 × K, where K < 1 is an endogenous parameter of the examined system – in the case of an individual, a certain global property of his mind. In other words: if there exists an association between Apple and Fruit and an association between Fruit and Bosom, then an association between Apple and Bosom most certainly exists as well. The same holds in the opposite direction, albeit with different quantitative results, since the quantities entering the formula will differ (we have changed direction): if there exists an association between Bosom and Apple (of whose existence we are convinced, see section 2.3) and an association between Apple and Fruit, then an association between Bosom and Fruit most certainly exists as well. The formula:

|@A,X@| = P1 ; |@X,B@| = P2 -> |@A,B@| >⁶⁸ P1 × P2 × K

we call the Rule of semantic transitivity and consider it a precursor of some general principle of human conceptual thinking. From the triad Apple, Bosom, Fruit we have allowed ourselves a perilous inductive leap to a certain universal principle of the mind. Now, thanks to the universality of that principle, affirmed by us⁶⁹, we shall permit ourselves the following deduction: we know that the central cosmogonic myth of the ISE culture created, in the minds of the persons infected by it, a semantic link between «sin» and «fruit», or between «sin» and «apple». Likewise, our empirical data obtained thanks to questionnaire D2 suggest – and in the case of persons of female sex even suggest statistically significantly – that in the host minds of the ISE culture there exists a semantic link between «bosom» and «fruit». Should the Rule of semantic transitivity hold, it follows from the above that somewhere in the mind of all those who were infected by the central cosmogonic myth of the ISE culture there will exist an association between «sin» and «bosom».
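The rule can be stated as a one-line lower bound. In the sketch below the constant K and the two link strengths are purely illustrative values; in a real mind K would have to be estimated, not posited:

# lower bound on |@A,B@| given |@A,X@| = P1 and |@X,B@| = P2
semantic.transitivity <- function(P1, P2, K) P1 * P2 * K
# e.g. with toy values loosely inspired by the questionnaires above
semantic.transitivity(P1 = 0.634, P2 = 3.3/5, K = 0.5)   # 0.209: a nonzero @Apple,Bosom@ link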
The unmasking of this association existing between the term «sin», which has no firm referent, and the term «bosom» – of whose firmness we discreetly keep silent – was one of the central aims of this work.

68 We use the sign > and not = because the weight of the link between A and B is given not only by how much impetus travels from A towards B via X, but also by how much impetus travels from A towards B via Y, Z, etc. For between «fruit» and «breast» there lies not only the interchange station «apple» but also a number of other, less salient interchange stations which likewise contribute, in small measure, to the resulting quantity.
69 And, I hope, refuted by You.

Third garden, construct two: Application

What would happen if, one day, science, the sense of the beautiful and the sense of the good were to merge into one harmonious concert? What would happen if this synthesis became a marvellous working instrument, a new algebra, a spiritual chemistry which would permit one to combine, for example, astronomical laws with a phrase of Bach and a verse of the Bible, so as to derive from them new notions which would serve, in their turn, as a springboard for further operations of the spirit?
Translator's foreword to «Le jeu des perles de verre» (Hesse, 1955)

Fragment 1: The approach of computing the «magnitude of a sign» – or, put differently, «the importance of a certain sign for the whole of the system within which it occurs» – by means of matrix algebra is applicable not only to analyses of an individual's mind but also to analyses of whole cultures and societies. What is more, thanks to the statistical law of large numbers it is probable that the results at which such anthroposociological analyses could arrive will be of a more solid character than neuropsychological analyses. For the circuits of an individual's mind are engraved only into the neuronal wetware of the brain, whereas the circuits of a culture's mind are engraved into books of laws, into institutions, into towns and roads – in other words, in the case of cultures the contextual and associative relations are very often etched not only into the brains of the human beings who are the «host organisms» of this or that culture, but also «into stone».

Let us imagine a primitive hunter-gatherer culture in whose cosmological system the key roles are played by the meanings «Fruit», «Breast», «Milk», «Woman» and «Fire». After years of demanding field research ...

         Fruit   Breast   Milk   Woman   Fire
Fruit    0       0.23     0.15   0.07    0.05
Breast   0.4     0        0.4    0.4     0.1
Milk     0.4     0.33     0      0.23    0.15
Woman    0.15    0.37     0.35   0       0.7
Fire     0.05    0.07     0.1    0.3     0

... the use of the code from appendix 2 indicates the not-entirely-obvious fact that the symbol with the greatest magnitude within the given cultural whole is... milk.
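The code of appendix 2 is not reproduced in this excerpt, so what follows is only a stand-in sketch of one possible operationalisation: «magnitude» taken as the fixed point of the association matrix written above (as reconstructed here, every column of the matrix sums to 1, so it can be iterated exactly like the stochastic PageRank matrices discussed in Fragment 10). Whether this particular operationalisation reproduces the «milk» result depends on the normalisation details of the appendix code:

symbols <- c("Fruit", "Breast", "Milk", "Woman", "Fire")
A <- matrix(c(0,    0.23, 0.15, 0.07, 0.05,
              0.4,  0,    0.4,  0.4,  0.1,
              0.4,  0.33, 0,    0.23, 0.15,
              0.15, 0.37, 0.35, 0,    0.7,
              0.05, 0.07, 0.1,  0.3,  0),
            nrow = 5, byrow = TRUE, dimnames = list(symbols, symbols))
v <- rep(1/5, 5)                            # start from a uniform distribution of impetus
for (i in 1:1000) v <- as.vector(A %*% v)   # power iteration towards the fixed point
sort(setNames(round(v, 3), symbols), decreasing = TRUE)   # the top entry: the mightiest symbol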
Fragment 3: The poodle's core
The use of matrix calculus for the analysis of symbolic systems may lead us all the way to the knowledge of the weak spots, the Achilles heels, of the systems concerned. Just as an antibody which has just touched a certain site of the viral capsid causes, by this delicate and very specific touch, the demise of the virus; just as a psychoanalyst who has just uncovered an inconspicuous semantic link whose reconfiguration will reconfigure the whole of the patient's mind; so will the one who can convert the whole of a cultural or religious system into a matrix of mutually referring entities be given the possibility of «dissolving» the culture or religion in question from within with one single word. Since we are not yet certain whether this method – used intuitively above all by shamans, missionaries and demagogues – could be employed not only destructively but also constructively, i.e. for the delicate design or redesign of cultural systems towards the building of a temple for an ever greater diversity and beauty of beings and things, we have decided, for fear of a possible misunderstanding of our intent, to hint at certain knowledge only in passing for now.

Fragment 2: The birth of corpus culturology
Proceeding from the premise that «a person must carry within himself a reason for devoting his time to the creation of this and not another encyclopaedic entry», we may understand the individual national wikipedias – e.g. http://sk.wikipedia.org, http://cs.wikipedia.org or http://fr.wikipedia.org – as o(b|d)razy, images or reflections, of the priorities and values of the bearers of the individual national cultures. How to convert the databases of the national wikipedias into matrix form, how to compute the magnitudes of individual signs (we illustrate with the concepts «wine», «milk», «god», «olives» and «woman»), how to compare them with one another, and what the comparison implies for the birth of a quantitative – or rather corpus? – culturology: of that will treat our entry for the Ars Electronica competition, as well as our first strictly scientific article, written, we trust, with the blessing of FHS UK.

Fragment 6: Memetic engineering
In the end, alas, it seems that even those who intuitively came to know (s?poznali) the laws of the creation, inertia and demise of concepts achieved with their work nothing more than that myriads of those ignorant of that art killed one another, for thousands of years, in the name of some «love».

Fragment 10: What can a graph theory tell us about breasts and apples?
A graph is a mathematical structure consisting of vertices and edges. A vertex can be understood as a «node», «element», «object», «entity» or even «neuron»; an edge can be understood as a relation or a link connecting a pair of vertices. It can be seen almost immediately that graph theory can be useful for the analysis of networks of references (e.g. the hypertext web) – and verily, at the core of the biggest success story of the Web – Google's one – is a quantity called PageRank⁷⁰ whose computation follows directly from certain properties of graphs and of the stochastic matrices related to them. The reasoning presented within the scope of this article is founded upon the following assumptions:
1) Any holistic complex can be understood, and thus analysed, as a network of references and hence as a graph.
2) By a correct application of graph-theoretical notions, non-evident but practically useful properties of a given holistic complex can be discovered.
By a holistic complex we mean a system that cannot be explained by the properties of its components alone. To understand it and to explain it, a structure – i.e. the set of relations between the components – must be taken into account. Briefly: not only the content – the information IN the Net – is important; so is the form⁷¹ – the information ON the Net. We'll analyse two types of such holistic complexes within this chapter: a classical text poem and a hypertext encyclopaedia. Because this article is itself part of a yet bigger holistic complex concerning the semantic relation between the «breast and apple» concepts, we decided to choose the «Song of Songs» of King Salomon, where both concepts are present. Concerning an encyclopaedia, we chose to analyse the «9th miracle of the world» – the biggest archive of human knowledge ever created by humanity and for humanity – the Wikipedia.
While still pointing the reader's attention to the «breast and apple» concepts, we'll try to show that the analysis and subsequent comparison of national wikipedias by means of graph-theoretical notions like «closeness», «betweenness» or «PageRank» can lead to non-trivial discoveries whose range spans from cultural anthropology to hardcore semantics. Our virtual workbench will consist of OpenSource tools only – namely the Linux operating system, the PERL programming language and, last but not least, the most powerful statistical tool ever created – R for statistical computing⁷². We'll try to be consistent with the spirit of the hereby nascent OpenedScience movement and thus present our experiments in such a way that they could be reproduced by anyone with fairly advanced informatic skills: all Linux and R commands, as well as PERL subroutines, will be presented in notes at the bottom of the page⁷³.

70 The name "PageRank" is a trademark of Google, and the PageRank process has been patented (U.S. Patent 6,285,999).
71 Die Form ist die Möglichkeit der Struktur. – «Form is the possibility of structure.» (Wittgenstein, 1917)

Analysis 1: Cantique
Because the English language is one of the easiest languages to parse, we chose to download⁷⁴ and analyse that version of King Solomon's song which is present in the King James Bible. After a preliminary removal of header HTML tags, we extracted only nouns, verbs and adjectives from the corpus by means of the «gposttl» version of Brill's tagger⁷⁵, and we marked the frontiers of sentences with the sign «::». We create a small script based upon the theoretical notions presented in chapter 2 of this work. Namely, we accept as true the Hebbian hypothesis «if two symbols are activated within a very short timespan, the weight of their relation will be strengthened»; we accept it and we let it inspire us to such a measure that we allow ourselves to hereby formulate this primitive rule of thumb:

Zeroth semantic principle: If two words (vertices) are present within the same sentence, the weight of their relation (edge) will be strengthened.

Speaking more generally: given a co-occurrence of elements (e.g. words) A and B within a higher-level complex (e.g. a sentence) X, an edge will be created or – if already created – augmented with weight N (a minimal sketch in R follows at the close of this excursus). It is important to mention that even when we speak of linguistic corpora only, there are many different sorts of what we call «complexes», located on many different hierarchical levels: from almost invisible low-level (n-3, n-2) NP syntagms to larger-scale phrases (n-1), sentences (n), and yet bigger complexes like a śloka (n+1) or a whole chapter (n+2...?). Keeping this in mind, and adding immediately that the concrete value of the weighting constant N seems to be determined:
• by the level (n-2/n-1/n/n+1/n+2 etc.) within which A and B co-occur (Drake, 2000) – the higher the level, the smaller the N,
• by the presence of inhibitor terms – as known, for example, from the chomskian Government and Binding Theory (Haegeman, 1994),
• by terminological (Hromada, 2007) or temporal distance (related to Hebbian functioning)⁷⁶,
• by the given language L itself,
we conclude this small theoretical excursus with the assertion that graph theory can serve not only as a firm base founding a formalized cognitive semantics but also as a unifying point between disciplines as different as behavioral musicology (Drake, 2000) and generative syntax. We assert this because we are strongly persuaded that what tree structures mean for syntax, graphs – and particularly cyclic graphs – shall mean for semantics.
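Before the PERL implementation is introduced below, here is the promised minimal R sketch of the zeroth semantic principle at the sentence level (N = 1). It assumes sentences have already been reduced to character vectors of their content words:

zeroth.principle <- function(sentences, N = 1) {
  # all ordered pairs of distinct words co-occurring within one sentence
  pairs <- do.call(rbind, lapply(sentences, function(s)
    subset(expand.grid(from = s, to = s, stringsAsFactors = FALSE), from != to)))
  # sum the weight N over repeated co-occurrences -> a weighted edge list
  aggregate(list(weight = rep(N, nrow(pairs))), pairs, sum)
}
edges <- zeroth.principle(list(c("breast", "cluster", "vine"),
                               c("apple", "breast")))
subset(edges, from == "breast")   # breast;apple, breast;cluster, breast;vine, each of weight 1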
72 «Great beauty of R is that you can modify it to do all sorts of things,» said the chief economist at Google. «And you have a lot of prepackaged stuff that's already available, so you're standing on the shoulders of giants.» (NY Times, 2009)
73 Linux shell commands will begin with the $ character. R commands will begin with the > character.
74 $ wget http://localhost.sk/~hromi/research/breastANDapple/songofsongs.html
75 $ gposttl --brill-mode ./songofsongs.html | perl -e 'while (<>) { @d = split(" "); for (@d) { if (/:\/:/) { print "::"; } elsif (/\w+\/(NN|JJ|VB)/) { print " "; print $_; } } }' > /tmp/salamun
76 If the concrete values of the mentioned constants are language-dependent, it would mean that they play, on the semantic level, a role similar to that of the «parameters» within the Principles & Parameters approaches of the generative grammarians.

Of course we are very far away from the moment when we could possibly state that we know how to transform a corpus of a given language into a graph whose structure would be isomorphic with the structure of «the understanding» which an ideally competent human reader would have pulled out of the given corpus during a hermeneutic procedure. Who knows, maybe we'll never get there; nonetheless it is our duty at least to try to start somewhere. And therefore: we have created a truly primitive script to which we have given the name «Golem»⁷⁷. For input it takes the gposttl output mentioned above, it permutates all the nouns/verbs/adjectives given, and as output it produces a list of pairs word1;word2 according to the zeroth semantic principle. Thus for the phrase «now also thy breasts shall be as clusters of the vine, and the smell of thy nose like apples» we obtain such a permutated list of edges (pairs of words/vertices):

thy;breast     thy;be         thy;cluster     thy;vine       thy;smell       thy;nose       thy;apple
breast;thy     breast;be      breast;cluster  breast;vine    breast;smell    breast;nose    breast;apple
be;thy         be;breast      be;cluster      be;vine        be;smell        be;nose        be;apple
cluster;thy    cluster;breast cluster;be      cluster;vine   cluster;smell   cluster;nose   cluster;apple
vine;thy       vine;breast    vine;be         vine;cluster   vine;smell      vine;nose      vine;apple
smell;thy      smell;breast   smell;be        smell;cluster  smell;vine      smell;nose     smell;apple
nose;thy       nose;breast    nose;be         nose;cluster   nose;vine       nose;smell     nose;apple
apple;thy      apple;breast   apple;be        apple;cluster  apple;vine      apple;smell    apple;nose

We hope it is evident from this list that the graph we are creating here will be an «undirected» one – in other words, it is constructed in such a way that there is no difference between the edge between «a breast and an apple» and the edge between «an apple and a breast». Put differently, an undirected graph is a graph whose adjacency matrix is symmetric. What is an adjacency matrix? «In mathematics and computer science the adjacency matrix M of a finite directed or undirected graph G on n vertices is the n × n matrix where the nondiagonal entry aij is the number of edges from vertex i to vertex j.» (Wikipedia, 2009) Some day, maybe, some brahman with a poetic soul, knowing that «there exists a unique adjacency matrix for each graph and it is not the adjacency matrix of any other graph», could immediately state that the relation between an adjacency matrix M and its graph G is similar to the relation between Purusa and Prakrti⁷⁸ – both are sides of the same coin; one cannot be without the other. We can say that the matrices we were constructing⁷⁹ in chapter 2, when we were speaking about neurolinguistic networks within the brain of a newborn, were adjacency matrices.

77 $ ./golem.pl /tmp/salamun > /tmp/cantiqueEdgelist
78 Shiva Shaktyatmakam Brahma (Anandamurti, 1961)
79 > CantiqueAdjacencyMatrix<-as.matrix(table(read.table("/tmp/cantiqueEdgelist",sep=";")))
These adjacency matrices – and therefore the graphs they describe as well – are «weighted». If we suppose – and we do – that the weighting constant for the co-occurrence of two words within a sentence is N = 1, and if the terms «apple» and «breast» co-occur within the whole Song of Songs in one sentence only – and that is verily the case – then the position in the column «apple» and row «breast» of the adjacency matrix M will have the value Mapple,breast = 1. On the contrary, since the terms «apple» and «tree» co-occur within 3 sentences of the Cantique⁸⁰, the position in the column «apple» and row «tree» of the adjacency matrix will have the value Mapple,tree = 3. After listing⁸¹ and ordering⁸² all the non-zero values present within the vector/row Mapple, we obtain the following results:

3 tree; 2 beloved; 2 is; 2 thy; 1 be; 1 breast; 1 cluster; 1 cometh; 1 comfort; 1 delight; 1 flagon; 1 fruit; 1 great; 1 leaning; 1 nose; 1 raised; 1 sat; 1 shadow; 1 smell; 1 son; 1 stay; 1 sweet; 1 taste; 1 up; 1 vine; 1 was; 1 wilderness; 1 wood

We would like to point the dear reader's attention to the fact that a significant part of the terms hereby presented – e.g. tree, delight, fruit, sweet, taste etc. – could possibly serve as a basis for a satisfying definition of the meaning of the word «apple». In other words, these are the basic elements of semantic analysis – the semes – which we presented in the previous chapter, and the associated numeric value is nothing else than the value of a coordinate within a Hilbert space for the respective semantic dimension. And we see this even in the case of a corpus which has no more than 16 kilobytes... When it comes to breast⁸³, the non-zero items of the row Mbreast of the adjacency matrix M go like this:

4 are; 4 thy; 2 be; 2 cluster; 2 is; 2 roe; 2 twin; 2 young; 1 am; 1 apple; 1 betwixt; 1 brother; 1 bundle; 1 despised; 1 feed; 1 find; 1 grape; 1 hath; 1 have; 1 kiss; 1 lie; 1 lilie; 1 little; 1 mother; 1 myrrh; 1 night; 1 nose; 1 o; 1 palm; 1 sister; 1 smell; 1 stature; 1 sucked; 1 tower; 1 tree; 1 vine; 1 wall; 1 wellbeloved; 1 wert; 1 yea

Even while leaving out that despised word despised as well as the exclamations o, yea, we are obliged to reiterate: what we see even in such a small corpus as the Cantique⁸⁴ can be stated like this: co-occurrence is tightly related to the definition, and hence to the meaning, the signifié, of the given signifiant. For verily there is not, and there will not be born, a (wo)?man who could justly maintain that a definition of a breast which excluded semes like «feed», «kiss», «night», «mother», «sucked» and «smell», or even «betwixt», would be a complete definition. Because we may possibly be criticized⁸⁵ for doing nothing more here than building a contingency table of the co-occurrence of words within a sentence, we pursue our analysis further.

80 As the apple tree among the trees of the wood, so is my beloved among the sons... I said, I will go up to the palm tree, I will take hold of the boughs thereof: now also thy breasts shall be as clusters of the vine, and the smell of thy nose like apples... I raised thee up under the apple tree: there thy mother brought thee forth: there she brought thee forth that bare thee.
81 > applesubvector<-subset(CantiqueAdjacencyMatrix["apple",],CantiqueAdjacencyMatrix["apple",]>0)
82 > applesubvector[order(applesubvector,decreasing=TRUE)]
For this moment we leave aside the adjacency matrix M with the multitudes of her⁸⁶ fascinating properties⁸⁷, and we fully focalise upon her second «visage» – upon the graph G. We'll construct it by means of the wonderful, wonderful, wonderful «igraph»⁸⁸ library, created mostly by our Hungarian OpenSource brethren, by executing one simple command⁸⁹.

83 > breastsubvector<-subset(CantiqueAdjacencyMatrix["breast",],CantiqueAdjacencyMatrix["breast",]>0)
   > breastsubvector[order(breastsubvector,decreasing=TRUE)]
84 And what can and will be seen with much more clarity if we take into account much bigger corpora, like that of the Google n-grams, for example (Cilibrasi, 2007).
85 In no case do we pretend that what we are doing here was never done before. Such a statement would be – with very high probability – a big hypocrisy within a world where maybe even millions of (wo)?men are thrown into a neverending quest for scientific truth (Teillhard, 1923). It is more than possible that for every step of the analysis presented hereby a highly specialized application exists – whether built by CNRS or MIT. But what we declare is that all this can be performed much more easily, and with a much higher degree of aesthetic fulfillment, with a few lines of PERL code and a few correctly stated R commands. What we declare is this: any kid can do it, any.
86 Staying consistent with the gender division in Slavic and Romance languages, the matrix M is feminine for us.
87 I would like to thank Monsieur Dominique Pignon for pointing my attention to the mathematically proven «fact» that the entry in row i and column j of the matrix Mⁿ gives the number of (directed or undirected) walks of length n from vertex i to vertex j. It can be useful, very useful...
88 > install.packages("igraph"); library(igraph)
89 > CantiqueGraph<-graph.adjacency(CantiqueAdjacencyMatrix,weighted=TRUE)

From now on, multitudes of possible analyses open in front of our eyes; multitudes of which we have chosen only a few – in somewhat machiavellian fashion we'll present hereby only those very few examples that best prove our point. Let's start with PageRank. Technically speaking, its values are nothing else than the entries of an eigenvector of a (suitably normalised) adjacency matrix. More humanly speaking, its value Px gives us the probability with which an agent randomly browsing the Network will land, after many steps/clicks, on the site/node X. This follows from the Markov theory of stochastic matrices and the fixed point theorem. Within chapter 2, we tried to pursue the PageRank notion further than just the hypertext web. We tried to focalise the dear reader's attention upon the fact that a correct understanding and application of linear algebra, and particularly of the idea hidden behind PageRank, could be an important moment in the story leading to the quantification and formalisation of certain human sciences. Namely, through the medium of the platonic image of a «soul» errant within a conceptual network, we set forward the hypothesis that the PageRank Px calculated for such conceptual networks gives us the probability with which the «soul» – be it the soul of a man or of a nation – will finally «land» in the attractor concept X. Or, which is the same, the probability that the concept X will become the content of an errant soul. We confess that in the moments when we were writing chapter 2 we were verily seduced by the PageRank idea. We knew nothing of the other quantities calculable for a graph, like «closeness», «betweenness», «vertex similarity» etc. Nonetheless, our enchantment with PageRank continues even now, and to such a measure that in the next and last part of the present work we'll re-name it «importance». Our enchantment continues namely for this reason: since it uses a very simple iterative process, PageRank is very easy and fast to calculate⁹⁰.
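That iterative process can be sketched in a dozen lines of base R: a damped random walk over the row-normalised adjacency matrix, iterated until the fixed point of Markov theory is approached. This is only the naked idea under assumptions of our own (the damping value 0.85 is the classical choice, not something dictated by our corpus; dangling vertices are handled only crudely); igraph's page.rank, used in the footnotes, wraps a more careful implementation of the same loop:

pagerank.sketch <- function(A, d = 0.85, n.iter = 100) {
  n <- nrow(A)
  P <- A / pmax(rowSums(A), 1)   # where does a randomly browsing agent go next (pmax: crude
                                 # guard for vertices without outgoing edges)
  v <- rep(1/n, n)               # start the errant soul anywhere
  for (i in 1:n.iter)
    v <- (1 - d)/n + d * as.vector(v %*% P)   # teleportation + edge-following
  setNames(v, rownames(A))
}
# e.g.: head(sort(pagerank.sketch(CantiqueAdjacencyMatrix), decreasing = TRUE), 10)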
Thus, after calculating the entries of the PageRank vector for our Cantique, it suffices to join the calculated quantities to the vertex labels⁹¹ and to order them⁹² in descending order. Afterwards we obtain a list⁹³ whose first 60 rows go like this:

1  0.0402450616276486   is        21  0.00574703022231954  spice      41  0.00425756366279538  vine
2  0.0286874213561736   thy       22  0.00573454130839064  go         42  0.00415891698675269  roe
3  0.0207556525566583   beloved   23  0.00572865883284286  voice      43  0.00408724684320239  gold
4  0.0181126226375237   o         24  0.00567096410285987  smell      44  0.00406786321075841  dove
5  0.0168321179888151   are       25  0.00554778536139638  jerusalem  45  0.00403215526672195  head
6  0.0153742262706985   love      26  0.00552568332075932  thine      46  0.00399520253064914  heart
7  0.0109313383916066   solomon   27  0.00550568137332312  art        47  0.00399324606027242  mountain
8  0.00858951049017589  let       28  0.0054357374621402   mother     48  0.00389267739621985  lock
9  0.00853652275059628  daughter  29  0.00536126145890124  hand       49  0.00385246254588212  am
10 0.0071625915464353   have      30  0.00494272275419686  see        50  0.00381393776344857  pleasant
11 0.00712911697034781  song      31  0.00484017126806818  vineyard   51  0.00370141936650332  soul
12 0.00658525478825675  be        32  0.00480671780005788  flock      52  0.00366600695461648  yea
13 0.00646490626473848  fair      33  0.00479763634993785  eye        53  0.00355200627804045  spouse
14 0.00642493278607294  garden    34  0.00468153277561695  sweet      54  0.00349365560289901  neck
15 0.00631260735215216  5         35  0.00445085951789789  was        55  0.00345629616853102  countenance
16 0.00617829833451983  myrrh     36  0.00437615206382555  day        56  0.00345286612010278  king
17 0.00609495016444803  fruit     37  0.00437527160055655  breast     57  0.00340922347459539  charge
18 0.00592028534721268  come      38  0.00434900453691372  sister     58  0.00328539601897926  apple
19 0.00587021417258384  tree      39  0.0042752292727721   comely     59  0.00327909072482058  pomegranate
20 0.00576188606513339  behold    40  0.00426197531615292  wine       60  0.00310921062863708  set

Voilà the result: an introductory stanza to the poem in itself, consisting of the 60 adjectives, nouns or verbs with the highest PageRank within the graph created by the application of the principle «if two words co-occur within one sentence, augment the weight of their relation by 1», applied upon a corpus extracted from the King James Bible's version of the Song of Songs, supposedly written by Salomon, genitor of the temple and second king of Israel. Honestly – even if our analyses were completely useless, aren't those words, each one of them, aren't they simply beautiful?⁹⁴

But there is much more to a graph G than just its PageRank. As we have already mentioned, graph theory has developed multitudes of other useful notions. Many of them are already implemented in the «igraph» library, and thus we can easily furnish not only their theoretical definition but also illustrate their empiric impact.

90 > CantiqueRank<-page.rank(CantiqueGraph)$vector
91 > CantiqueRankNames<-data.frame(CantiqueRank,V(CantiqueGraph)$name)
92 > CantiqueRankNamesOrder<-CantiqueRankNames[order(CantiqueRankNames[,1],decreasing=TRUE),]
93 The full list is downloadable here (blank-space divided CSV format): http://localhost.sk/~hromi/research/breastANDapple/cantiquerank.csv
Let's glance over the already mentioned «closeness», «betweenness» and «similarity»:
• Closeness – the R manual tells us⁹⁵: «Closeness centrality measures how many steps are required to access every other vertex from a given vertex. The closeness centrality of a vertex is defined by the inverse of the average length of the shortest paths to/from all the other vertices in the graph.»
• Betweenness – the R manual tells us⁹⁶: «The vertex and edge betweenness are (roughly) defined by the number of geodesics (shortest paths) going through a vertex or an edge.»
• Vertex similarity – there are many different types, and thus algorithms, for the calculation of vertex similarity. For the purpose of this article we chose to use the inverse log-weighted similarity, for it seems to be a more evolved notion than the notions of Jaccard or Dice similarity. The R manual tells us⁹⁷: «The inverse log-weighted similarity of two vertices is the number of their common neighbors, weighted by the inverse logarithm of their degrees. It is based on the assumption that two vertices should be considered more similar if they share a low-degree common neighbor, since high-degree common neighbors are more likely to appear even by pure chance. Isolated vertices will have zero similarity to any other vertex. Self-similarities are not calculated. See the following paper for more details: Lada A. Adamic and Eytan Adar: Friends and neighbors on the Web. Social Networks, 25(3):211-230, 2003.»
Since we don't want to bother the dear reader with further theoretical notions, we have excluded all formulas as well as the definitions of more or less self-evident graph-theoretical terms like «neighbor», «degree» or «shortest path». Anyone interested will surely find his way to fill this gap. Let's execute the necessary commands⁹⁸ and see what else graph theory can tell us about the Song of Songs:

94 Honestly – where in the graph G have You seen a command to seed hate & bomb Gaza?
95 > ?closeness
96 > ?betweenness
97 > ?similarity.invlogweighted
98 > CantiqueCloseness<-data.frame(closeness(CantiqueGraph),V(CantiqueGraph)$name)
   > CantiqueBetween<-data.frame(betweenness(CantiqueGraph),V(CantiqueGraph)$name)
   > data.frame(CantiqueCloseness[order(CantiqueCloseness[,1],decreasing=TRUE),],CantiqueBetween[order(CantiqueBetween[,1],decreasing=TRUE),])[1:24,]

Closeness value     Central vertex    Betweenness value   Crossroad vertex
0.27943661971831    is                67661.944443311     is
0.266237251744498   beloved           19934.9588059187    beloved
0.265951742627346   o                 19310.1582973797    are
0.265809217577706   thy               19278.8967516378    thy
0.261741424802111   are               17523.0606034477    o
0.261190100052659   love              16636.7241518869    love
0.259278619968636   solomon           7539.80553623171    be
0.254750898818695   song              7509.0126779339     solomon
0.253708439897698   daughter          6927.51233607318    go
0.253578732106339   spice             5280.6192724142     have
0.252674477840041   fruit             5005.31370288028    daughter
0.252674477840041   go                4817.35585927223    spice
0.251393816523061   5                 4709.75950500152    mother
0.251266464032421   let               4645.28517812257    5
0.251139240506329   smell             4572.6736347049     behold
0.249748237663646   behold            3990.12475254222    fruit
0.249371543489191   mother            3916.13355606611    mountain
0.249246231155779   breast            3857.29209176175    let
0.248995983935743   be                3747.85512875238    was
0.248               come              3614.68634683767    garden
0.247752247752248   dove              3544.22976327145    myrrh
0.247628557164254   sister            3473.05920301343    tree
0.247628557164254   tree              3455.48780030024    song
0.24750499001996    fair              3228.67449596703    apple

Voilà two stanzas of our poem: the first being the list⁹⁹ of the 24 adjectives, nouns or verbs with the highest Closeness measure; the second being the list of the 24 adjectives, nouns or verbs with the highest Betweenness measure, as assessed within the graph created by the implementation of the principle «if two words co-occur within one sentence, augment the weight of their relation by 1», applied upon a corpus extracted from the King James Bible's version of the Song of Songs, supposedly written by Salomon, operator of the temple and second king of Israel. It can be said that the further we depart from the top ranks, the more the two measures differ. Thus, the breast concept is ranked 19th according to the closeness centrality measure, but 33rd according to the betweenness measure. Inversely, the apple concept is more «crossroad-like» than central: it is 25th according to the betweenness measure, but only 57th according to the closeness centrality measure. Nonetheless, when we take into account that we extracted 498 nouns/verbs/adjectives out of the Cantique, so that our graph G has 498 vertices, both of these concepts, «apple» and «breast», are far-from-being-not-important, no matter which measure we choose as the significant measure of importance. One of the reasons why we consider the «betweenness» measure to be of particular importance¹⁰⁰ is that it divides the set of our vertices into two groups: a group of those through which no shortest path passes and whose betweenness value is therefore zero (298 of them in the case of the Cantique corpus), and a group of those which serve as principal «junctions» – in other words, those through which some «geodesics» do pass and whose value is thus non-zero (199 of them in the case of the Cantique corpus).
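The ranks quoted in the preceding paragraph can be read off directly from the orderings of the two data frames built in the footnote above; a one-line helper suffices (the numbers in the comments are those reported in the text):

rank.of <- function(d, word) which(d[order(d[, 1], decreasing = TRUE), 2] == word)
rank.of(CantiqueCloseness, "breast")   # 19 - central
rank.of(CantiqueBetween,  "breast")    # 33 - less of a crossroad
rank.of(CantiqueBetween,  "apple")     # 25
rank.of(CantiqueCloseness, "apple")    # 57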
The last thing we would like to say about our visualisation is that we have chosen the «Fruchterman-Reingold» algorithm to visualise our subgraph. Let's see what other programmers say about it:

«It is a force-directed algorithm, meaning that vertex layout is determined by the forces pulling vertices together and pushing them apart. Attractive forces occur between adjacent vertices only, whereas repulsive forces occur between every pair of vertices. Each iteration computes the sum of the forces on each vertex, then moves the vertices to their new positions. The movement of vertices is mitigated by the temperature of the system for that iteration: as the algorithm progresses through successive iterations, the temperature should decrease so that vertices settle in place.» (Gregor, 2004)

In other words, to visualise the Cantique we used a procedure not too distant from the «annealing of substances» of the ancient alchemists. Only in the moment of the production of this eye-candy does randomness enter the game, because the initial position – i.e. the position of the vertices before the «annealing» – is put forward by a random generator. Only in this moment of visualisation according to the fruchterman.reingold algorithm will the R commands proposed to, and hopefully executed by, the dear reader produce results slightly different from those presented upon these pages.

Illustration 2: > plot.igraph(CantiqueSubgraph,vertex.label.cex=2,vertex.label=V(CantiqueSubgraph)$name,vertex.shape="none",asp=0,vertex.label.color=1:length(V(CantiqueSubgraph))%%7+1,layout=layout.fruchterman.reingold,margin=-0.07)

But since we don't want to be accused of exercising Kabbalistic practices, we come back to the notions and procedures of graph theory, which follow one out of the other with the apodictic lucidity of mathematical theorems. And thus, to finally answer the question «Did the relation between an apple and a breast exist somewhere within the mind of Salomon?», we have decided to apply the notion of inverse log-weighted similarity upon the vertex «apple» of the graph G.

99 The full list is downloadable here: http://localhost.sk/~hromi/research/breastANDapple/CantiqueCloseBetween.csv
100 On the other hand, a big inconvenience of the betweenness measure is that its calculation is very demanding, because for every new vertex added, the shortest paths to all the other vertices have to be found, and afterwards the betweenness values of all the vertices located upon these paths have to be adjusted. We are not experts on complexity theory, but it seems to us that betweenness calculation is not a problem solvable within polynomial time.
But since we do not want to be accused of exercising Kabbalistic practices, we come back to the notions and procedures of graph theory, which follow one out of the other with the apodictic lucidity of mathematical theorems. And thus, to finally answer the question « Did the relation between an apple and a breast exist somewhere within the mind of Solomon? », we have decided to apply the notion of inverse log-weighted similarity upon the vertex « apple » of the graph G. Voici the results101:

rank   vertex     similarity to « apple »
  1    is         23.5554863138659
  2    tree       23.0239430017928
  3    beloved    22.1794872684636
  4    apple      16.3223259179549
  5    sweet      14.0358163212322
  6    fruit      14.0228519098312
  7    shadow     12.9337406972387
  8    was        12.1856172672389
  9    wood       11.9673951448652
 10    delight    11.8559056922472
 11    great      11.8559056922472
 12    sat        11.8559056922472
 13    son        11.8559056922472
 14    taste      11.8559056922472
 15    thy         9.01488868077456
 16    smell       8.82445148474772
 17    myrrh       8.3676712626057
 18    up          8.24706309890257

Voilà the last stanza of our poem, consisting of the 25 adjectives, nouns or verbs (the first 18 of which are printed above) most similar, by the means of the inverse log-weighted similarity, to the vertex « apple » among all102 the other vertices of the graph G created by an implementation of the principle « if two words co-occur within one sentence, augment the weight of their relation by 1 », applied upon a corpus extracted from the King James Bible's version of the Song of Songs, supposedly written by Solomon, builder of the temple and third king of Israel. Thus, when we see that, ranked according to the inverse log-weighted similarity to the « apple » vertex, the « breast » vertex is located at position 24, i.e. within the top 5% of the total number of 498 vertices, we can conclude: ὅπερ ἔδει δεῖξαι.

101 > applesim <- data.frame(V(CantiqueGraph)$name,
         similarity.invlogweighted(CantiqueGraph, which(V(CantiqueGraph)$name=="apple")-1)[1,])
    > AppleSimilarityOrdered <- data.frame(applesim[order(applesim[,2], decreasing=TRUE),],
         1:length(applesim[,1]))
102 The full list is downloadable here: http://localhost.sk/~hromi/research/breastANDapple/applesim.csv
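For the reader who wonders what exactly igraph computes under the name « inverse log-weighted similarity »: as far as we know, it is the measure of Adamic and Adar, in which two vertices are the more similar the more low-degree neighbours they share. In our own notation:

sim(u, v) = \sum_{w \in N(u) \cap N(v)} \frac{1}{\log \deg(w)}

where N(u) denotes the set of neighbours of the vertex u. A common neighbour of low degree (a rare word co-occurring with both) thus contributes more to the similarity than a ubiquitous hub like « is » does.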
Fragment 10: the 9th miracle of the world103

Language (N of vertices): Slovak 129941, Czech 160681, Hebrew 157375, Arab 139303, Russian 1497859, Mongol, Aymara.

Apple (word / relative rank / pagerank / inter-intra):
  Slovak   Jablko          11354   7.046e-06   4/7
  Czech    Jablko          13304   1.292e-05   1/6
  Hebrew   תפוח            10417   8.466e-06   2/6
  Arab     تفاح            14158   6.870e-06   5/7
  Russian  Яблоко           8625   7.777e-06   3/6
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   NOT PRESENT IN THE CORPUS

Breast:
  Slovak   Prsník           2065   1.290e-05   3/2-3
  Czech    Prs             16360   1.192e-05   4/7
  Hebrew   שד_(איבר)       13253   7.520e-06   5/7
  Arab     ثدي              1793   2.346e-05   2/2
  Russian  Женская_грудь   54126   2.859e-06   6/7
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   Ñuñu              219   0.00066     1/1

Milk:
  Slovak   Mlieko           7346   7.792e-06   7/6
  Czech    Mléko            2067   3.456e-05   3/4
  Hebrew   חלב              6407   1.150e-05   6/3
  Arab     حليب             5141   1.323e-05   5/3
  Russian  Молоко           2998   1.503e-05   4/5
  Mongol   Сүү               247   0.000882    1/1
  Aymara   Millk'i           259   0.00055     2/2

Wine:
  Slovak   Víno             2059   1.294e-05   4/2-3
  Czech    Víno             2170   3.334e-05   1/5
  Hebrew   יין              1889   2.003e-05   2/1
  Arab     نبيذ             5813   1.230e-05   5/4
  Russian  Вино             2741   1.626e-05   3/4
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   NOT PRESENT IN THE CORPUS

Man:
  Slovak   Muž              3542   9.878e-06   5/5
  Czech    Muž               905   5.737e-05   2/2
  Hebrew   גבר              8276   9.645e-06   6/4
  Arab     رجل              6483   1.144e-05   4/5
  Russian  Мужчина          1484   2.925e-05   3/2
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   Chacha           1320   9.929e-05   1/4

Woman:
  Slovak   Žena             3499   9.922e-06   5/4
  Czech    Žena             1048   5.397e-05   1/3
  Hebrew   אישה             8580   9.435e-06   4/5
  Arab     امرأة            7583   1.027e-05   3/6
  Russian  Женщина          2236   1.947e-05   2/3
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   NOT PRESENT IN THE CORPUS

God:
  Slovak   boh               350   3.645e-05   5/1
  Czech    bůh               389   9.543e-05   2/1
  Hebrew   אלוהים           1959   1.925e-05   6/2
  Arab     الله              268   6.929e-05   3/1
  Russian  Бог               971   4.21e-05    4/1
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   Tatitu *          526   0.00034     1/3

Isis:
  Slovak   Isis            67964   4.811e-06   1/8
  Czech    Isis            71739   3.180e-06   3/8
  Hebrew   איזיס           69627   3.010e-06   4/8
  Arab     إيزيس           24446   4.601e-06   2/8
  Russian  Изида          274978   6.879e-07   5/8
  Mongol   NOT PRESENT IN THE CORPUS
  Aymara   NOT PRESENT IN THE CORPUS

Comparison of 8 concepts (rows) within 7 wikipedia corpuses (columns). The pagerank entry specifies the calculated pagerank value of a given concept within a specific corpus; the corpus-relative rank specifies its position in the list of all the concepts ordered in descending order according to their pagerank (the concept having the highest pagerank has corpus-relative rank R=1, the second has R=2, etc.). The first number of the inter/intra pair specifies the INTERcultural importance (pagerank values ordered within the row of the concept); the second number specifies the INTRAcultural importance (pagerank values ordered within the column of the language). For example, the « wine » concept within the Arabic wikipedia has the lowest pagerank when compared with the « wine » concepts of the other corpuses – thus it is 5th interculturally. On the other hand, within the scope of the Arabic corpus alone, it is ranked lower than 1. « god », 2. « breast » and 3. « milk », but higher than 5. « man », 6. « woman », 7. « apple » and 8. « Isis ». It can also be easily seen that for the majority of cultures the god concept plays a much more important role than the other concepts we have chosen, the only exceptions being – quite surprisingly – the Hebrews104, the Aymara and the Mongols: for the tribe of Bolivian indians, the breast and the milk seem to play a more important role, and for the united tribes of Central Asian shepherds, milk plays the central role.

103 We have analysed the mySQL forms of the wiki corpuses freely available from http://download.wikimedia.org/
104 Is it because the signifier of Your god is not to be pronounced, or because You had chosen to prefer wine instead?

Fragment 8: Matriarchality measure

                          Slovak105    Czech        Hebrew       Arab         Russian
Woman (Pw)                9.922e-06    5.397e-05    9.435e-06    1.027e-05    1.947e-05
Man (Pm)                  9.878e-06    5.737e-05    9.645e-06    1.144e-05    2.925e-05
Matriarchality (Pw−Pm)    +4.4e-08     −3.4e-06     −2.1e-07     −1.17e-06    −9.78e-06

Matriarchality measure as a quantity obtained by the subtraction of the pagerank of the « man » concept from the pagerank of the « woman » concept. Such a subtraction adds a second normalization (the first normalization – allowing us to make intercultural comparisons – occurs during the calculation of pagerank itself) and allows us to compare cultures with – what seems to us – an even higher degree of relevancy. A negative value of matriarchality signifies, of course, patriarchality.

Fragment 11: Normativity argument

At a certain moment, the calculated data – in Google as well as within this text – cease to be merely explicative: they become normative. Verily, if a hypothesis or theory of the human or social sciences106 is adequate to reality – and thus true – it is often not because it would explain « anything », but because it conditions people to think and act as if they had understood « something ».

106 Take Freud's psychoanalysis, for example: are its complexes explained, or are they created in the first place?

Fragment 4: Posvätná laň (The sacred doe)

What we are trying to say here certainly sounds absurd at best and insane at worst. For in a certain sense we are indeed suggesting – and not merely suggesting – that if the category fruit had had a different prototype in the period of the first interpretations and translations of the cosmogonic myth known today as Genesis, if, say, it had not been Eve handing Adam an apple but Adam handing Eve a banana, or if God had decided – rozhod(ol|la), as one writes in Slovak when leaving the gender to a regular expression – to test the will of man not with a forbidden fruit but with a prohibition against harming a sacred doe, the age that has passed upon this Earth could have looked entirely different. No killing in the name of « love », no witch-hunts, no sacrificing of EROS upon the altars of LOGOS. Unity.
105 I hereby present these culture-relative Wikipedia (November 2008) concept importance lists for download:
aymara – http://localhost.sk/~hromi/research/breastANDapple/pageranks/AY.csv (<1 MegaByte)
arabic – http://localhost.sk/~hromi/research/breastANDapple/pageranks/AR.csv (10 MegaBytes)
czech – http://localhost.sk/~hromi/research/breastANDapple/pageranks/CS.csv (10 MegaBytes)
hebrew – http://localhost.sk/~hromi/research/breastANDapple/pageranks/HE.csv (10 MegaBytes)
mongol – http://localhost.sk/~hromi/research/breastANDapple/pageranks/MN.csv (<1 MegaByte)
russian – http://localhost.sk/~hromi/research/breastANDapple/pageranks/RU.csv (58 MegaBytes)
slovak – http://localhost.sk/~hromi/research/breastANDapple/pageranks/SK.csv (7 MegaBytes)
May they serve the purpose for which they were created. You can open them even in Excel.

Exit

And so, for the last time, You will be loath to part from the breasts of science.
Goethe J. W., Faust

The work presented here is an unfinished work. For that which is finished is immutable, and that which is immutable cannot be called alive. And since the text above wanted to be, first of all, a text about life, youth, spring and joy, the story of the breast and the apple contained within it has no firmly determined end. And yet this essay draws towards its close. The art of conclusion is the art of farewell, the art of uttering the most magical of all words. And therefore I now dedicate this work:

to my family: first of all to my mom Alena, for the fact that she was, is and forever will be – like all the other good mothers of the world – the best of mothers; to my sister Kristína, for being the only woman who can throw me off balance; to my father Daniel, for his industriousness; to my grandmothers Olga and Alžbeta, for letting me get to know, at least through their words, the charm of the first Czechoslovak republic; and to my nephew Oliver, for the fact that he is.

to my friends: to Lukáš K. for forgiving me; to Martin D. not only for the dog blanket and the 9th gate; to Ľuboš I. for his vaporizer, and to Juraj B. for the yearly tea rituals; to Andrej G. for his love of the stars; to Ivan P. for the circus; to Filip Z. for making, with his conspiracy theories, a proper mess in my head during a certain period of my life; to Miro P. for the never-aging support of Slovak cyberpunk; to Tomáš P., because – I believe – one of the copies of this work will help him win his battle with heroin; to Ján Š. for the initiation into PERL; to Levant for the help in the fight against the Mongolian cockroaches; and to Monkhsaikhan Ochirhuyag, for making it more than clear to me, by the river Orkhon, after my horses had bolted, that he is the most real man I have ever had the honour to meet.

to the beloved: to Daniela K. for being my first; to Jana B. for simply stopping me on Slavíkova street on the day of my 21st birthday; to Zuzana Dž. for the séance in a spruce forest; to Eva R-K-S. for the first lesson on how wondrously blind love can be; to Tereza S. for the second one; to Monika D., not only for bringing me my shopping when I sprained my ankle in Nice; to PhD. Carmen-Aline S. for giving me faith in my own self; to Dita B. for the crossing of the Gobi desert; to Kristína J. for not coming after me to Paris; to Barbora P. for lessons not only of French... as well as to many others, whose names are written down elsewhere.

to colleagues: to the users and above all the administrators of the discussion systems kyberia.sk and nyx.cz; to the employees of the company VOLNY in the years 2001–2003, of the company Etel in 2003–2004, of the company
IGNUM in 2005–2006, and of the hotel Manoir de l'Etang in 2007–2008; to all the little characters from the festival in Cannes; and to the « welcoming agents » on the Eiffel tower.

to schoolmates: from the Evangelical Lyceum in Bratislava, from the Faculty of Humanities of Charles University, from the Mongolian State University, from the Université de Nice and from the École Pratique des Hautes Études; to all the ERASMUS students in Nice in the years 2007 and 2008; to all who went through the courses of cognitive instrumental enrichment FIE I and FIE II with me; as well as to the wolf-cubs of the first Bratislava pack and to the scouts, rovers and leaders of the Dážďovky troop.

to the faces from journeys and the architects of places: to the neighbours from Haanova 44; to my roommates from the dormitory for foreign students in Ulaanbaatar, as well as to those from the halls of residence Hostivař, St. Antoine, Jean Medecin and Daviel; to all whom I have ever picked up while hitch-hiking; to the homeless man from Alma-Ata for his Čupačundra, and to his Parisian colleagues for the catacombs into which they will soon lead me; to the squatters Ian and Reuven from Cannes; to the girl named Mária from the Ulaanbaatar club Strings; to my students from the Mongolian language centres Cambridge and Absolut, and to the nameless Concubine from Huh-hot; to the masseuse Zlata from Kyiv; to the Rwandan Oracle from rue d'Alesia, for refusing neither my pearl necklace nor the little apple; to Gaius I.C. for Alesia, and to Gustave E. for the Tower; to Alena B. for the walks with her Irish setter, and to her husband for being the honorary consul of the Slovak Republic with the best sense of humour; to Altangerel and the lamas of the Khamriin Khiid monastery.

to my teachers:

from Nice: to Xavier B. and Olivier R. for the initiation into phonology; to C. Pagliano, Emilie and M. Olivieri for the initiation into the generativist doctrines; to J. Bonneau for the terminology; to J. P. Dalbera for his « bien joué »; to C. Hennebois for semantics and the initiation into PROLOG; to Mme Talon-Hugon for rhetoric; to Mr. Ali Benmakhlouf for Alice in the wonderland of logic; to Mr. Gauter for the structure of revolutions, not only the scientific ones; to Mme Kircher for Sanskrit; to Mr. Lavigne for philosophical lectures of high quality;

from Ulaanbaatar: to the graceful Batsukh, in memoriam; to Dzamyansuren for being the best Mongolian calligrapher, but also for getting, together with Battulag, more drunk on the school trip than all the students put together; to the three splendid elderly teachers of Mongolian grammar and literature whose names, alas, I no longer remember; to Dada Ajay for the lessons of sadhana; to Didi Ananda Kalika for her devoted singing and for the Lotus Children's Center;

from Prague: to prof. Sokol for having been my first tutor and, in a certain sense, remaining one for me to this day; to Ľ. Gabrišková, in memoriam, for Mesopotamian cosmology and P. K. Dick; to doc. Pinc for being the greatest philosopher of life I know; to Veronika Z. from the department of Mongolian studies for her boundless devotion; to doc. Murgaš for the Citadel, the energetic invariants and a Gestalt which I have not understood to this day; to dean Benyovzsky for his bicycle trips into the realm of ideas; to prof. Komárek for telling me straight that the subject of my inquiry is not his cup of coffee; to prof. Neubauer for his Christmas lecture; to T. Holeček for Wittgenstein and propositional logic; to Fulka for the psychoanalytic anecdotes; to the doctoral students of the philosophical module for the words « but that is nonsense! » at SAFM, with which they duly redirected my life's trajectory; to G. Málková for the « skills of thinking »; and, last but not least, to the supervisor of this work, Ján Havlíček PhD., for his not only professional but also human guidance in the final phases of the creation of this essay;

from Paris, from the University: the names of at least some of them are given in the bibliography section.

These were the souls without an encounter with whom this work would certainly never have come into being.
Better companions and better comrades I shall hardly find again in the labyrinth of life. May, then, at least in this way – through the creation of an « edge », a bond, between the « vertex » which represents this work, and therefore me, and the « vertex » which, within the graph G – the graph of gratitude which will surely one day be built, and perhaps is built already – represents the names of these companions, and thus, in a certain sense, in a certain very strong sense, those companions themselves; may this small Thank You of mine increase the amount of their being before the calmly smiling face of eternity.

Bibliography

Anandamurti S. (1961) Ananda Sutram
Brin S., Page L. (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117
Buber M. (1923) Já a Ty. Praha: Kalich
Cilibrasi R. L., Vitányi P. M. B. (2007) The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 3
Drake C. (2000) The development of rhythmic attending in auditory sequences: attunement, referent period, focal attending. Cognition 77, 251-288
Eco U. (1980) Il nome della rosa107
Gervain J., Macagno F., Cogoi S., Peña M., Mehler J. (2008) The neonate brain detects speech structure. PNAS, September 16, 2008, vol. 105, no. 37, 14222-14227
Goethe J. W., translation O. Fischer (1982) Faust. Praha: Odeon
Gregor D. (2004) Fruchterman-Reingold graph visualisation algorithm [accessible online: http://www.boost.org/doc/libs/1_37_0/libs/graph/doc/fruchterman_reingold.html]
Haegeman L. (1994) Introduction to Government and Binding Theory. Blackwell Textbooks in Linguistics
Heidegger M. (2006) Básnicky bydlí člověk. Praha: Oikoymenh
Hesse H. (1955) Le jeu des perles de verre: Essai de biographie du Magister Ludi Joseph Valet accompagné de ses écrits posthumes. Calmann-Lévy
Hofstadter D. (1999) Gödel, Escher, Bach: an Eternal Golden Braid
Horwood L. J., Fergusson D. M. (1998) Breastfeeding and later cognitive and academic outcomes. Pediatrics, Vol. 101, No. 1
Hromada D. (2007) Moja prvá malá rozprava o metóde [accessible online: http://localhost.sk/~hromi/textz/2007/mpmrom.pdf]
Hromada D. (2012) Semantic Structures v2.3
Jackendoff R. (2002) Foundations of Language: Brain, Meaning, Grammar, Evolution. Oxford/New York: Oxford University Press
Jakobson R. (1971) Selected Writings I: Phonological Studies. Hague
Jenness R. (1979) The composition of human milk. Seminars in Perinatology 3 (3): 225-239
Lakoff G. (1987) Women, Fire, and Dangerous Things: What Categories Reveal About the Mind. University of Chicago Press
Lalou F., Woda A. (2003) Tes seins sont des grenades: Pour en finir avec le Cantique des cantiques. Paris: Alternatives
Lécuyer R. (1996) L'intelligence des bébés. Paris: Dunod
Morris D. (1967) The Naked Ape. London
Nelson C. A. (2001) The development and neural bases of face recognition. Infant and Child Development, 10 (1-2)
New York Times (06/01/2009) Data Analysts Captivated by R's Power [accessible online: http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html]
Nietzsche F. (1883) Also sprach Zarathustra
Oberfalzerová A. (2006) Metaphors and Nomads. Praha: Charles University, Philosophical Faculty, Institute of South and Central Asian Studies, Seminar of Mongolian Studies
Piaget J. (1961) La psychologie de l'intelligence. Paris: Armand Colin
Seifert J. (1987) Les danseuses passaient près d'ici: Choix de poèmes. Actes Sud
Sokol J. (2007) Malá filosofie člověka & Slovník filosofických pojmů. Praha: Vyšehrad
Skripnik O., Lindová J. (2007) Posudek k metodologické práci studenta 9306. Praha: IS FHS UK
Tagore R. (1913) The Crescent Moon: Child-Poems. London: Macmillan
Teilhard de Chardin P. (1923) La messe sur le monde
Telegraph (2008) Breastfeeding photo ban by Facebook sparks global protest by mothers [accessible online: http://www.telegraph.co.uk/scienceandtechnology/technology/facebook/4029868/Breastfeeding-photo-ban-by-Facebook-sparks-global-protest-by-mothers.html]
Théoret H., Pascual-Leone A. (2002) Language Acquisition: Do As You Hear. Current Biology, Vol. 12, No. 21, pp. R736-R737
Wikipedia, The Free Encyclopedia (2009) Adjacency matrix [retrieved 21:28, January 19, 2009, from http://en.wikipedia.org/w/index.php?title=Adjacency_matrix&oldid=262381618]
Wilson R. A. (1983) Prometheus Rising. USA: New Falcon Publications
Wilson R. A. (2000) Ištařin návrat, aneb proč bohyně sestoupila do podsvětí a co nás čeká nyní při jejím návratu. Praha: Maťa & Dharmagaia
Wittgenstein L. (1917) Tractatus logico-philosophicus

107 Pulchra sunt ubera quae paululum supereminent et tument modice...

Web links to the main sources of inspiration:

AGAPE – http://en.wikipedia.org/wiki/Agape
Hilbert spaces – http://en.wikipedia.org/wiki/Hilbert_space
Regular expressions – http://en.wikipedia.org/wiki/Regular_expression
Graph theory – http://en.wikipedia.org/wiki/Graph_theory
Terminology of the ISO-704 norm – http://localhost.sk/~hromi/textz/2008/metaISO704.pdf
Pyramid texts – http://www.sacred-texts.com/egy/pyt/index.htm
The Coptic Gospel of Thomas in Context – http://www.geocities.com/Athens/9068/
The apple in the myths of the world – http://en.wikipedia.org/wiki/Apple_(symbolism)
Song of Solomon – http://www.bibliaaty.sk/biblia-Piesen-%C5%A0alamunova_PIES.html

Appendix 1: Illustration of the convergence of a stochastic matrix towards the value of its eigenvector

What we are trying to illustrate here is the method thanks to which we rid ourselves of the problem of cyclic mutual references, a problem which has hitherto always arisen before us whenever we tried to analyse a system in which A is determined by means of B, while B is determined by means of A. Without this method we would have no point of support, we would not know where to begin. With this method, we do. We illustrate it on the example of the « causal-diachronic » semantic matrix 4 from part 2.2:

a) in case the soul into which no external inputs enter begins to wander at the representation « ticho » (silence), the elements of the « initialization vector » are the probabilities of « to which symbol the soul will proceed from the symbol ticho », i.e. the values given in the column « ticho »:
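In the language of linear algebra – our own gloss upon the procedure implemented in Appendix 2 – the row vector v is simply multiplied by the stochastic matrix M again and again:

v^{(t+1)} = v^{(t)} M

and, for a well-behaved (irreducible, aperiodic) M, the sequence converges to a fixed point v* satisfying v* = v* M: the left eigenvector of M associated with the eigenvalue 1, independent of the initial v^{(0)}.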
initial vector: 0.349 0.111 0.349 0.016 0.174 0.001
iteration 0 : 0.349 0.111 0.349 0.016 0.174 0.001
iteration 1 : 0.168638 0.088742 0.136585 0.10892 0.219783 0.277332
iteration 2 : 0.19677187 0.147541796 0.18077067 0.11258138 0.218997164 0.14333712
iteration 3 : 0.170422961246 0.137401835622 0.144618006084 0.132482419104 0.23951108973 0.175563688214
iteration 4 : 0.17202823449374 0.15100588452236 0.149766681237554 0.134776102639864 0.237459578406382 0.1549635187001
iteration 5 : 0.167408565651975 0.149364533881373 0.143237330010219 0.138759890883417 0.242493954990974 0.158735724582041
iteration 6 : 0.16724939393067 0.152105079290878 0.143669866101764 0.139635436193834 0.242101105652149 0.155239118830704
...
iteration 42 : 0.16594741981802 0.152662182855418 0.142067798907887 0.141010764498759 0.243471641784574 0.154840192135343
...
iteration 49 : 0.165947419818014 0.152662182855423 0.142067798907881 0.141010764498765 0.243471641784579 0.154840192135337

b) in case the soul into which no external inputs enter begins to wander at the representation « Ň-p. », the elements of the « initialization vector » are the probabilities of « to which symbol the soul will proceed from the symbol Ň-p. », i.e. the values given in the column « Ň-p. »:

initial vector: 0 0.07 0.318 0.04 0.257 0.315
iteration 0 : 0 0.07 0.318 0.04 0.257 0.315
iteration 1 : 0.27966 0.129246 0.144417 0.111452 0.17393 0.161295
iteration 2 : 0.153886242 0.131632958 0.171016395 0.11525756 0.247808612 0.180398233
iteration 3 : 0.1839006114 0.146790670424 0.147067682125 0.13473720768 0.227536874986 0.159966953385
iteration 4 : 0.166086165671824 0.148031715406948 0.147382227517521 0.134661789558376 0.24401479601744 0.159823305827891
iteration 5 : 0.169185730467167 0.151084123997832 0.14378055017798 0.139600712608132 0.239969260189836 0.156379622559054
iteration 6 : 0.166327123909114 0.151650051127476 0.143146201065523 0.139490074038839 0.243468872121902 0.155917677737146
iteration 7 : 0.166593782151623 0.152253830108923 0.142533874354546 0.140677949796571 0.242702452145075 0.155238111443262
...
iteration 42 : 0.165947419818019 0.152662182855418 0.142067798907886 0.141010764498759 0.243471641784575 0.154840192135343
...
iteration 49 : 0.165947419818014 0.152662182855423 0.142067798907881 0.141010764498765 0.243471641784579 0.154840192135337

Our computation would have converged to the very same values, i.e.

M(Ň-present)  = 0.165947419818014
M(Ň-absent)   = 0.152662182855423
M(bliss)      = 0.142067798907881
M(pain)       = 0.141010764498765
M(face)       = 0.243471641784579
M(silence)    = 0.154840192135337

had we started it at any point of departure whatsoever (i.e. with any initial vector). The values given above are, namely, a property invisibly present in our matrix: its « eigenvector ».
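The convergence can also be cross-checked without any iteration whatsoever – a minimal sketch in base R, assuming the same 6×6 matrix as in Appendix 2:

M <- matrix(c(0,    0.07,  0.318, 0.04,  0.257, 0.315,
              0.07, 0,     0.03,  0.36,  0.43,  0.11,
              0.369,0.035, 0,     0,     0.21,  0.386,
              0.037,0.389, 0,     0,     0.556, 0.018,
              0.179,0.263, 0.126, 0.316, 0,     0.116,
              0.349,0.111, 0.349, 0.016, 0.174, 0.001),
            nrow=6, byrow=TRUE)
# the stationary vector v satisfies v M = v, i.e. it is the eigenvector
# of t(M) associated with the eigenvalue 1 (Re() drops a zero imaginary part)
v <- Re(eigen(t(M))$vectors[,1])
v / sum(v)   # normalised, it reproduces the values the iteration converged to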
In the case of the matrix which is perpetually recomputed at Google, it has empirically turned out that the individual values of this eigenvector express something which could be called « the essentiality of a page for the whole of the web ». In the case of our approach, they express something like « the essentiality of the given meaning for the whole of the mind of an individual (garden 2, construct 2) or of a society (garden 3, fragments 8 and 10) ». In the course of this work we have denoted this quantity as « mightiness », « PageRank » and « importance ».

Appendix 2: PERL code iterating the values of Appendix 1, alias « the sweet billion-dollar secret of the boys from Google »

# the six rows of the « causal-diachronic » semantic matrix 4 (cf. part 2.2);
# every row is stochastic, i.e. its values sum up to 1
@matrix=(
  [0,0.07,0.318,0.04,0.257,0.315],
  [0.07,0,0.03,0.36,0.43,0.11],
  [0.369,0.035,0,0,0.21,0.386],
  [0.037,0.389,0,0,0.556,0.018],
  [0.179,0.263,0.126,0.316,0,0.116],
  [0.349,0.111,0.349,0.016,0.174,0.001]
);
$,=" ";                              # separate the printed values by spaces
@vector=@{$matrix[0]} if (!@vector); # initial vector: the first row of the matrix
print "\ninitial vector: @vector";
for ($i=0;$i<50;$i++) {
    print "\niteration $i : ";
    print @vector;
    $j=0;
    foreach (@vector) {              # multiply the row vector by the matrix:
        $p1=$_;
        $k=0;
        foreach (@{$matrix[$j]}) {   # nvector[k] += vector[j] * matrix[j][k]
            $p2=$_;
            $nvector[$k]+=$p1*$p2;
            $k++;
        }
        $j++;
    }
    @vector=@nvector;                # the product becomes the next vector...
    @nvector=();                     # ...and the accumulator is emptied
}

Appendix 3: Questionnaires D2 and D3

Paper questionnaire D2 – Example

[Paste here one original filled-in D2 form]

Paper questionnaire D2 – Overall results

Out of the total number of 28 respondents (mostly female and male students of the 3rd year of linguistics at the Université de Nice, and hostesses and hosts at the international film Festival in Cannes), female breasts were associated with: milk by 28 respondents; fruit by 10 respondents; meat by 1 respondent; vegetables and bread by 0 respondents.

[In the original, the raw @SEINS,NOURRITURE@ answer sheet follows: for each respondent, the letters (a-e) of the associations chosen in each question, the fruit named as the prototype of the « fruit » category (pomme, fraise, peche, orange, ananas, banane, pomelos, mangue, melon...) and the respondent's sex (M/F).]

Internet questionnaire D3

The questionnaire is still active at the address http://localhost.sk/~hromi/quest/public/survey.php?name=FHS_Bakalarska_praca . At the moment of the writing of this work, 358 respondents had answered it; most probably they were, for the greatest part, users of the discussion systems kyberia.sk and nyx.cz, on which the link to the questionnaire was published. The users of these systems are mostly young people aged 15-35 with at least a secondary education, and we suppose our respondents are no different. Given that the users of these systems are for the most part cultivated young people versed in art, science or politics, we consider them ideal representatives of the bearers of the Slavic branch of the ISE culture of the beginning of the 21st century. That the central topic of the research was the female breast and its relation to the apple, the respondents learned only after having answered the questionnaire. The questionnaire was presented in the Slovak language; it can therefore be expected that the respondents were, in the overwhelming majority of cases, of Slovak or Czech nationality.
Although questions 1.3 and 2.3 were the crucial ones for our research and all the others served only to « mask » them, the other questions also yielded a number of interesting findings. Here we present the results concerning all the respondents.

1. Which concepts belonging to the category « liquids » do you associate most strongly with the concept « life »?

                       Men+Boys   Women+Girls   Total   Gender difference
vino (wine)              (2.6)       (2.5)      (2.6)        0.1
mlieko (milk)            (3.1)       (2.8)      (3.0)        0.3
voda (water)             (4.3)       (4.4)      (4.3)        0.1
kava (coffee)            (1.7)       (1.7)      (1.7)        0
krv (blood)              (4.1)       (4.0)      (4.1)        0.1

2. Which concepts belonging to the category « the 5 senses » do you associate most strongly with the concept « automobile »?

                       Men+Boys   Women+Girls   Total   Gender difference
zrak (sight)             (4.5)       (4.3)      (4.4)        0.1
hmat (touch)             (2.9)       (2.8)      (2.9)        0.1
sluch (hearing)          (3.6)       (3.6)      (3.6)        0
cuch (smell)             (2.3)       (2.3)      (2.3)        0
chut (taste)             (1.4)       (1.4)      (1.4)        0

3. Which concepts belonging to the category « foods » do you associate most strongly with the concept « female breasts »?

                       Men+Boys   Women+Girls   Total   Gender difference
maso (meat)              (3.3)       (2.8)      (3.2)        0.1
ovocie (fruit)           (3.4)       (3.2)      (3.3)        0.2
mlieko (milk)            (4.2)       (4.2)      (4.2)        0
chlieb (bread)           (1.9)       (2.0)      (1.9)        0.1
zelenina (vegetables)    (1.8)       (1.6)      (1.7)        0.2

4. Which concepts belonging to the category « animals » do you associate most strongly with the concept « man »?

                       Men+Boys   Women+Girls   Total   Gender difference
jelen (stag)             (3.2)       (3.4)      (3.3)        0.2
vtak (bird)              (3.1)       (3.2)      (3.1)        0.1
pes (dog)                (3.2)       (2.7)      (3.0)        0.5
zralok (shark)           (3.0)       (2.8)      (2.9)        0.2
opica (monkey)           (2.7)       (2.2)      (2.5)        0.5

5. Which concepts belonging to the category « the 5 elements » do you associate most strongly with the concept « Woman »?

                       Men+Boys   Women+Girls   Total   Gender difference
vzduch (air)             (2.6)       (2.5)      (2.6)        0.1
zem (earth)              (2.8)       (3.3)      (3.0)        0.5
ohen (fire)              (3.9)       (3.8)      (3.9)        0.1
eter (ether)             (3.0)       (2.8)      (2.9)        0.2
voda (water)             (3.1)       (3.0)      (3.0)        0.1

6. Which concept is, according to You, the best representative of the category « flowers »?

                       Females   Males   % of Females   % of Males
ruza (rose)               54      126       49.091        55.752
margaretka (daisy)        25       38       22.727        16.814
tulipán                    4        0        3.636         0
orchidea                   3        0        2.727         0
lalia (lily)               2        3        1.818         1.327
tulipan                    2       10        1.818         4.425

...followed by a long tail of answers given by one or two respondents only (chryzantema, dalia, farby, frezia, gerbera, kopretina, kytica, mak, marihuana, muskat, narcis, púpava, sedmikrasky, slnečnice, fialka, hlavacik jarny, hyacynt, konvalinka, lilie, lotos, lucne kvety, magnolia, oxalis triangularis (kyselka), pampeliska, rododendron, sedmokraska, slnecnica, tulipany, vlčí mak, zanebudka, zive kvety...), among them one respondent who wrote: « I originally wanted to write tulipan here... but once I read the answer ‹ rose › I would say rose :) ...don't influence people .) »
8. Which concept is, according to You, the best representative of the category « fruit »?

                       Females   Males   % of Females   % of Males
jablko (apple)            67      137       60.909        60.619
jahoda (strawberry)        6        5        5.455         2.212
banan (banana)             4       12        3.636         5.31
hrozno (grapes)            4        4        3.636         1.77
jahody                     4        4        3.636         1.77
pomaranc (orange)          4        6        3.636         2.655
ceresna (cherry)           3        0        2.727         0
broskyna (peach)           2        2        1.818         0.885
mango                      2        3        1.818         1.327
pomeranč                   2        0        1.818         0

...followed by a long tail of answers given by one or two respondents only (černice, banán, broskev, čerešňa, malina, ananas, citron, dužina, grapefruit, hruska, marhula, melon, mrkva ;), nashi, passion fruit, slivka, stavnata broskyna (a juicy peach), zakazane (the forbidden one), « hmmm... a woman? », and a dozen orthographic variations upon the apple itself: jabklo, jabko, jablka, JABLKO, « jablko predsa » (an apple, of course), jabloko, jabluko...).

7. Which concept is, according to You, the best representative of the category « domestic animal »?

                       Females   Males   % of Females   % of Males
pes (dog)                 80      173       72.73         76.55
macka (cat)               21       30       19.09         13.27
krava (cow)                3        1        2.73          0.44
prase (pig)                2        0        1.82          0
akvariova rybicka          1        1        0.91          0.44
pavuk (spider)             1        0        0.91          0

...followed by a long tail of answers given by one or two respondents only (andulka (a budgie), clovek (a human), zirafa, kockodan, kon (a horse), koza (a goat), morske prasa (a guinea pig), papagaj, potkan (a rat), rybicky, « sfetovany spolubyvajuci » (a stoned flatmate), svab (a cockroach), svina, vysavac (a vacuum cleaner), zajac (a hare)...).

In conclusion, we allow ourselves to direct the reader's attention to several interesting findings:

– The fact that the strongest association of all is @Sight,Car@ = 4.4 suggests to us that man is a visual being. That hearing follows (3.6) is no surprise; what is striking, however, is the occurrence of touch (2.9) far ahead of smell (2.3). Could man, after all, be a much more « tactile » being than we have thought so far?

– One of the most striking secondary revelations of our research is @Fire,Woman@ = 3.9 – the fourth highest value of our whole survey, right after @Sight,Car@ = 4.4, @Water,Life@ = 4.3 and @Milk,Female breasts@ = 4.2. It is also noteworthy that men associate Woman with fire (3.9), water (3.1), ether (3.0) !!! and only then with earth (2.8), while women associate their own selves somewhat more weakly with fire (3.8), then with earth (3.3), water (3.0), and only then with ether (2.8).

– A young man pondering which flower to buy for his beloved may find it useful to learn that while for more than 55 percent of men the prototype of flowers is the rose, this is the case for only 49 percent of women. This seems to be caused mainly by the flower called « margarétka » (the daisy), which was chosen by circa 22.7 % of women and by approximately five percent fewer men.

– While respondents of the male sex most often associate man with the animals « dog » and « stag », both with the weight 3.2, for respondents of the female sex the man-dog association (2.7) comes only fourth, after the stag (3.4)108, the bird (3.2) and the shark (2.8). Are we, gentlemen, harbouring too many illusions about our fidelity, or do you know all too well, ladies, that we are eternal cuckolds?

A more capable reader may discover many other useful findings in the data. For him, the « raw data » in CSV-SPSS format are available at the address: http://localhost.sk/~hromi/quest/FHS_Bakalarska_praca.csv

108 My beloved is like a roe or a young hart.
(Song of Songs 2:9)

Make haste, my beloved, and be thou like to a roe or to a young hart upon the mountains of spices. (Song of Songs 8:14)

I charge you, O daughters of Jerusalem, by the roes, and by the hinds of the field, that ye stir not up, nor awake my love, till she please. (Song of Songs 3:5)

A closing note for the FHS

It was objected: « But this work has no method! »
And the answer was not unlike a Zen koan: « To have no method was our method. »
To which something like this could be retorted: « In that case, however, it is not a scientific work. »
The defence against such an attack goes as follows: « A scientific work it indeed was not. It was a bachelor's essay. We understand the word essay in the sense of the French essai – an attempt. We attempted to put into a few dozen pages everything we want to tell the world, everything which, at the present stage of our development, carries some kind of meaning for us. Perhaps even everything that ever carried any kind of meaning. As with every attempt, however, we are prepared to taste also the bitter fruit of incomprehension and failure. »
« And that pathetic use of the first person plural of Yours! »
« That is because I wanted to pay homage to the giants upon whose shoulders I stood. To You, for example. »
« And that absurd prattle about Hilbert spaces, graphs, matrices, regular expressions, about some ‹ šémy › and ‹ prírazy ›! »
« I admit that in places I let myself be carried away a little, and that the sea called text, living its own life, often carried the rickety little boat of my mind towards unexplored islands. I admit that I often got lost, I admit that I often stepped completely astray, that I was often completely off. And yet I have fulfilled the central intention of my work. »
« What, then, was the overall intention of Your work? »
« To create in the mind of the reader – in Your mind – a semantic bond between the « apple » and the « breast » so strong that only an advanced stage of Alzheimer's disease, or death, will be able to disturb it – and perhaps not even that. To bring it about that whenever You see an apple, You will remember the one who gave You life, and the one who gave Your Life its meaning. »
« Do You think You have succeeded? »
« I think I shall fail only with those before whose inner eye, after the appeal ‹ during the next 23 seconds, please do not think of pink elephants ›, no pink elephants will parade. And such people are few. »
« Are You prepared for people to take You for a madman after reading this text? »
« Yes. »
« And are You prepared for Your work not being accepted? »
« Yes. »
« What will You do if Your work is not accepted? »
« One day, as an old man, I will finish it and deposit it in the Library, in the place where it belongs. »
« And what will You do if it is accepted? »
« The same. »

14. 10. 2008, Paris

« Blessed are the womb that bore You and the breasts that fed You. »
Logion 79, the Gospel according to Thomas, Codex II of the Nag Hammadi scrolls

23 comments to the Chomskian Doctrine

Within this text, I (or « we », as the circle of those who adhere to the ideas presented hereby will be called) propose some objections against the set of Chomskian theories (which will be labelled « the Doctrine »). In the beginning, the intention was not to write a scientific article, but simply to save for eternity a few notes of a layman who wants to protect his laicity, at least in some form, before he becomes completely re-programmed by the Doctrine. It is possible that the text contains many contradictions with itself. Similarly to the Chomskian theories, to science, and to knowledge in general, this text evolves in time, and thus it can happen that what has been perceived as a serious flaw in the « Initialization problem » is a solution to the determinist/non-determinist dilemma in the « Halting problem ».
It is also very much possible that the majority of the problems proposed hereby have already been addressed, either by Chomsky and the group of his « fidèles », or by his « adversaries ».

1. The initialization problem

During the last class of syntax, students were given these two sentences:

1. La lecture de ce livre a été conseillée aux étudiants.
2. Le chat de la voisine semble être nourri par le concierge.

In the first case, the V « conseiller » was projected from the lexicon to the D-structure as « être conseillé », and the I thus received the value [+AGR]. The fact of +AGR allowed, during the creation of the S-structure, the movement of the NP « la lecture... » from the position bound to the V-bar of « être conseillé » to the initial empty position, where « la lecture » received the Nominative case.

In the second case, the V « nourrir » was projected from the lexicon to the D-structure as « être nourri », and the I to which the V was bound thus received the value [-AGR]. The fact of -AGR denied, during the creation of the S-structure, the movement of the NP « le chat » from the position bound to the V-bar of « être nourri » to the second IP. Therefore the whole NP had to « move » even further and got bound to the first IP, which was +AGR.

Now, the question of an enfant terrible student was this: why, in the case of the first sentence, did we insert the verb from the lexicon in the conjugated form, thus creating the +AGR condition, and in the second case in the infinitive form, thus creating the -AGR condition? The answer was: because such is the case in the resulting sentences (phrases d'arrivée). And now the really terrible question of the enfant terrible student follows: but how can we know what the resulting phrase WILL BE while we are still in the process of its derivation? In other words, how can a result influence its own construction? That is cabala, not science.

This problem is, of course, not obvious to students reading books or doing exercises during classes, because the « resulting sentence » is already present right there, in front of their eyes. From the very beginning they know the result, they reason the « initial conditions » out of it, and then they are happy that from those very causes they arrive at the result from which they started... But in a real-life situation, when a speaker is supposed to generate a sentence, there cannot be any « resulting sentence » present. At least not for a generative grammarian. For if there were, in some sense, a « resulting sentence » already present – for example in some potential, virtual form of a « pattern-template » against which the phrase-being-constructed is matched – this pattern-template would itself have to be either generated or taken from memory. If it were generated, it would need another pattern-template to be matched against, and so on, ad infinitum: in other words, we would be posed in front of the problem of « infinite regress ». The option « taken from memory » is unacceptable for a generative grammarian, because in such a case every sentence would potentially need its own pattern-template to be stored in memory. The result would be a huge number of patterns stored in memory, and no need to generate anything. We will address this problem more closely in the « Halting problem » argument, as well as in some others.

Returning to our sentences, we now try to offer this quickly-made possible solution to the problem, trying to stay at least a little bit faithful to the framework of the Chomskian theories:
To arrive at the D-structure of « le chat semble... », something more than the lexical items has to be inserted: some additional information saying that the verb will be in the infinitive. This very same « information » will « trigger » the derivation of phrase 1 from the lexical items, and not the derivation leading to « Il semble que le chat de la voisine est nourri par le concierge », which would also be a second valid derivation out of the very same lexicon (if we suppose – as many modern theories do – that « il » is, in this case, not an independent lexical item).

2. Argument of a layman coming from a foreign country: do flying fish have wings?

What grammarians have called « cases » for hundreds of years is much more closely related to thematic roles than to a « position in the D-structure » or to some « governing by V/N/whatever ». In other words, cases are – at least for us Slavs – much more morpho-semantic than syntactic entities (in the sense where « syntactic » means « in relation to the position within the sentence »): for example, the answer to the question « Who? What? [kto? čo?] » is for me in the Nominative, the answer to the question « To whom? To what? [komu? čomu?] » in the Dative, etc. The « magic » of the cases is not hidden in the fact that one word/part of the sentence strongly influences another word/part of the sentence (that would be a similar « discovery » as to find out that sentence A significantly influences the understanding of sentence B which follows...), but in the fact that we use morphology to do so. I simply do not understand why the creators of the Doctrine have chosen the very same name, Cases (and even with the capital!), to designate the set of solutions to the technical problems of their theory – solutions which do not have very much in common with the cases of existing natural languages.

Just a small example of how small Slavic nations can use their « cases »:

Nominative: hovorí láska (love is speaking/telling)
Genitive: hovorí (z) lásky ((s)he is speaking out of love, OR love is the reason of her speech)
Dative: hovorí láske ((s)he is speaking/telling to love) [metaphoric but acceptable]
Accusative: hovorí lásku ((s)he is speaking love) [maybe not acceptable to some orthodox purists]
Locative: hovorí (o) láske ((s)he is speaking ABOUT love)
Instrumental: hovorí láskou ((s)he is speaking BY love) OR hovorí (s) láskou ((s)he is speaking with love)

The preposition « o » in the Locative protects the case from a semantic collision with the morphologically identical Dative (at least within all 4 declension paradigms of the feminine gender); the preposition « z » either protects the case from a semantic collision with the Nominative of the plural form « lásky », or is a preposition in its own right, meaning « from ». To prove our point, we feel no need to decide this sort of dilemma. Thus we can construct a sentence like:

« Láska láske z lásky hovorí lásku a o láske s láskou. »

It has 6 components which can be freely permutated among each other, thus forming 6! = 720 possible sentences, the positional constraints being only stylistic (see the sketch below). While some of the results could possibly be labelled « poetic », especially at the beginning of the lecture, we doubt that any reader could rightfully justify his stance when calling these sentences ungrammatical (especially after the reader's competence « accommodates to the pattern » upon reading a few hundred permutations). The goal of this small exercise was to show that one verb of the Slovak language can, in some extreme cases, assign all types of cases, and even to the same noun.
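A toy enumeration of those 720 word orders – our own illustration in base R, where both the recursive perm helper and the segmentation of the sentence into six components are assumptions made for the sake of this sketch:

# the six permutable components (our own, guessed segmentation of the example)
comp <- c("Láska", "láske", "z lásky", "hovorí lásku", "a o láske", "s láskou")
perm <- function(v) if (length(v) <= 1) list(v) else
  do.call(c, lapply(seq_along(v), function(i)
    lapply(perm(v[-i]), function(p) c(v[i], p))))
sentences <- sapply(perm(comp), paste, collapse=" ")
length(sentences)    # 720 = 6!, as promised
head(sentences, 3)   # a few of the more « poetic » permutations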
If such a situation occurs, position plays only a minor cosmetic role (the only exception being the clitics), and the assignment of the correct case is determined by morphology (in the passive, hearing-parsing performance situation) and by semantic roles (in the active, producing performance situation). Because it sheds almost no light upon the beauty of the « cases » of Sanskrit or of the Slavic languages, would You please be so kind as to choose a different term for Your Universal Theory of Cases?1

An additional comment: imagine the sentence « He is student »:

      p
     / \
    N   VP
    |   / \
   He  is  student

In Slovak, Czech or Polish we say: « On je študent. » (the pronoun is Sg. Masc. Nominative, the noun also in the Nominative). How is it possible that the verb « to be » assigns the Nominative not only to its external NP (the pronoun) but also to its « object », the internal NP? How would You deal with this situation? You can:

– Say that in our language the Nominative can be internal as well as external. In such a case Your definition of Your Nominative has offered us no information; it is an empty tautology similar to « either he is alive or he is dead », « either he is stupid or he is not », etc.

– Forget hundreds of years of Tradition and say that the case assigned is not the Nominative but some different case (in French, for example, that would be the Accusative, because it is an internal NP and the V is not V-Dative). In such a case we would kindly allow ourselves to concentrate Your attention upon the fact that « this new case of Yours » would be completely redundant and useless for the theory of our language, for no matter what the noun is, its case-signifying morpheme in this position is always (for 12 paradigms in the singular + 12 paradigms in the plural) identical to the case-signifying morpheme of the Nominative (or of the Instrumental, as You will see later...). What a coincidence!

– Make an « exception » for the word « to be », saying that it can do something very special: that it can in fact have two external theta-roles. Thus we can do something like:

   |----IP----|
   |    |     |
   N    V     N
   |    |     |
   He   is  student

1 Or You can maybe try to persuade us that what is an indispensable part of our linguistic heritage is, in the « underlying » reality, not cases at all. After all, the generativist phonologist Schane succeeded in convincing the world (and even the French phonologists!!!) that French, in its « underlying » reality, does not contain any nasal vowels – and we may even call this solution the most elegant one. But there is a small problem: it is a complete heresy against the basic axiom of the Chomskian doctrine: p → N VP.2

2 We will attack this axiom in the « I Love You » argument and, more deeply, in the anti-Euclidean argument.

... And during this analysis I have tried to leave aside the fact that we can express the very same meaning by saying: « On je študentom. » (He is student; the pronoun in Sg. Masc. Nominative, the noun in the INSTRUMENTAL!) ...So now what? Is the assignment of the Instrumental, in contrast with the Nominative case of the first example, driven by the position of the stars?

3. Halting problem

In this section we will use a method similar to the reductio ad absurdum of the old scholastic Masters, in order to show that something like a generative grammar G « capable of producing an infinite number of terminal sentences out of a set of lexical items S by applying generative rules R » seems to be a chimera. Let us assume that such a grammar G exists. We can thus ask the question: how have we obtained this infinity of terminal sentences?
Because even a beginner in mathematics knows that if we want to get from a finite number N to an infinite number I, we must either multiply N by another number J which is itself infinite, or apply some operation/function/rule F upon N infinitely many times, we see, at the beginning, only these solutions to the question « How can a grammar G be possible? »:

– 1: the set of rules R which is applied upon the set of lexical items S is itself infinite;
– 2: the set of rules R is finite, as is the set of lexical items S, but the number of times T we apply an operation/derivational rule (belonging to R) can be potentially infinite;
– 3: the set of rules R is finite, the number T of operations is finite, but the set of lexical items S is infinite;
– 4: the set of rules R is finite, the number T of operations is finite, the number of lexical items S is finite, but S is forwarded to the input of the first derivational rule (to the D-structure) in infinitely many variations V.

We immediately see that the first solution is invalid, for an infinite number of rules would have to be stored somewhere, and all the possible storage spaces (brain-memory, DNA, etc.) are finite. The same argument applies to solution 3: the set of lexical items used by a given person is necessarily finite. Thus the only sources of the « desired infinity » we can see are solutions 2 and 4.

Let us first look closer at solution 2, where the magic is hidden in the « infinity of the number T of applications of a rule belonging to the finite set of rules R ». In other words, a rule can be applied more than once; in fact, in order to generate an infinite number of sentences, it can be applied infinitely many times, if the given input allows it. In the framework of the Standard Theory, let us imagine for example the Deep structure:

I know # I am #

We can apply a rule RTind upon it, obtaining:

I know THAT # I am #

and upon this structure we can once again apply the same rule:

I know THAT THAT # I am #

– et caetera, et caetera, ad infinitum.
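A toy mechanisation of this derivational treadmill – our own sketch in base R, where the function RTind is a deliberately naive stand-in for the rule above:

s <- "I know # I am #"
RTind <- function(x) sub("know", "know THAT", x)   # applies the rule once
for (i in 1:4) {                                   # the bound 4 is purely external...
  s <- RTind(s)
  cat("application", i, ":", s, "\n")
}
# ...nothing inside the rule itself ever tells the derivation to stop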
And as usually, similiarly to a good hacker-programmer, You’ll try to adapt Your model to this problem, You’ll think a while and You’ll propose this ad hoc solution, which, as almost all ad hoc solutions, will make Your model less elegant, less scientific, less comprehensible to un-initiated (and thus un-reprogrammed) and You’ll say: « This is a serious problem. But I postulate an existence of a procedure P which will could be potentially capable to know en avance whether the derivation will finish, or will lead us into the « infinite loop » abbyss. Thus we’ll still be able to apply some rules infinitely many times, when needed, but we’ll never fall into an infinite loop ». If this would be Your solution, we’re sad to remind You , that according to the father of informatics , a man owning of the most brilliant minds of the 20th century named Alan Turing, such a universal procedure P capable of deciding whether a given programme ( a set of instructions ) will ever halt or not DOES NOT EXIST for a deterministic machine (cf. http://en.wikipedia.org/wiki/Halting_problem ) . And afterwards You’ll maybe try to offer another ad hoc solution and say that Your model is in fact a non-deterministic one. In such a case we’ll be very happy that You had finally arrived to the conclusion to the fact that « human being is more than a machine ». But until that moment, which will maybe come while we’ll become accustomed to Your « Minimalist programme », we have to express serious concerns for all previous G.G theories which seem to us to be very much deterministic. We repeat once again: such theories  either lead to the emergence of infinite loops during the derivational process  or to the impossibility to generate infinite number of terminal sentences out of finite number of variation of lexical inputs, upon which rules (or just a rule) can be possibly applied infinitely many times This was our answer to the possible solution 2. 3 4 5 Even while we try to show that generative grammar syntax is more or less an « impasse » in the evolution of linguistics, similiarly as was Ptolemaic geocentric approach to cosmology, we nonetheless had to admit that, it helped us to shed some light upon the pathologic case of « beguement ». And another contraint for three steps: (A to B, B to C, C to A) is forbidden , another constraint for four, another for five etc...Quite a lot of them in the end, n’est-ce pas? One can for example imagine a procedure: If the derivation is not finished within certain period of time, pass what You had already obtained to the phonological layer. But in such a case, the number of sentences possibly produced will be finite, since the « certain period of time » constant is also finite. G will thus not be allowed to generate infinite number of terminal sentences. The last possible way how we could possibly establish existence of generative grammar G capable of producing infinite number of sentences was the solution 4: « the number of lexical items S is finite, but is forwarded to the input of a first derivational rule ( to the D-Structure) in infinitely many variations V ». Imagine for example a speaker whose lexique contains only the items {I,You,know} , and the only rule R – already mentioned RTind . 
We can thus imagine that this speaker lives in a highly developed society of telepaths where the only purpose of linguistic exchange are affirmations of the following kind:

Variation   Lexical input                        RTind          Result
V1          {I, know}                            not applied    I know
V2          {You, know}                          not applied    You know
V3          {I, know, You}                       not applied    I know You
V4          {I, know, You, know}                 applied once   I know that You know
V5          {You, know, I, know, I, know}        applied 2x     You know that I know that I know
V6          {I, know, You, know, You, know}      applied 2x     I know that You know that You know

etc., possibly ad infinitum. Truly, in such a case we have a true generative grammar G capable of producing an infinite number of sentences out of a finite set of lexical items by applying a finite number of rules. The only problem is that... in order to have such an infinite number of terminal sentences, the number of varieties V of lexical inputs being inserted into the D-structure... has to be infinite. And thus this wonderful generative grammar G of Yours does not, in fact, shed any light upon the generativity of language, because the real generativity of the language is hidden in the fact that the lexicon is capable of passing infinitely many varieties of its items to the syntactic component6. We will come back to this « generativity of the lexicon » within the « argument from poetics »; here we just proposed it as a last possible answer to the question « How can a generative grammar be possible? ». We have shown that the generative grammar G is technically impossible in the case of solutions 1 and 3, has some serious difficulties with infinities in solution 2, and is completely useless in solution 4. We propose to change the model.

4. Connectionist's argument: What??? Infinity of sentences???

5. Orator's argument

There exists a rhetorical figure called anacoluthon, which consists in breaking a rule of syntax in order to achieve a desired effect upon the public. Shakespeare used it, for example: « Rather proclaim it, Westmoreland, through my host, That he which hath no stomach to this fight, Let him depart. » The existence of such a figure, especially within the works of the biggest Masters of language, poses before the Doctrine a problem which she will never surmount. We can formulate it like this: syntax only makes things neater and more ordered afterwards; it is not the master, but just a servant, a « femme de ménage », of semantics.

Imagine that after years of exhaustive research, the set S of fundamental rules (we can call them axioms) of the Universal Grammar is found and explicitly formulated. A rhetorician O comes afterwards with the intention to strongly influence the public and, knowing well that « if You want to impress people, You have to be non-violently different », he will apply his new figure, called from now on the « universal/Chomskian anacoluthon », which can be described like this: « Take any rule R out of the Universal Grammar S. Create a completely new rule R', or create it out of R, so that R and R' are not consistent (R' is the negation of R). Add this new rule R' to the set of rules which remain in S, thus obtaining S'. Generate all the sentences T' according to this new set of rules. » The result would be that the sentences produced by O would not be generated according to the rules of the Universal Grammar S, but according to the rules of another grammar S', which is not consistent with S. Thus, if there exists a human being7 who considers the T' sentences grammatical, it follows that the grammar S is not universal.
4. Connectionist's argument: What??? An infinity of sentences???

5. Orator's argument
There exists a rhetorical figure called anakoluth (anacoluthon), which consists in breaking a rule of syntax in order to achieve a desired effect upon the public. Shakespeare used it, for example: « Rather proclaim it, Westmoreland, through my host, That he which hath no stomach to this fight, Let him depart. » The existence of such a figure, especially within the works of the greatest Masters of language, confronts the Doctrine with a problem which it shall never surmount. We can formulate it like this: imagine that, after years of exhaustive research, the set S of fundamental rules (we can call them axioms) of Universal Grammar is found and explicitly formulated. A rhetorician O comes afterwards with the intention of strongly influencing the public and, knowing well that « if You want to impress people, You have to be non-violently different », he'll apply his new figure, called from now on the « universal/chomskian anakoluth », which can be described as follows: « Take any rule R out of Universal Grammar S. Create a completely new rule R', or create it out of R, so that R and R' are not consistent (R' is a negation of R). Add this new rule R' to the set of rules which remain in S, thus obtaining S'. Generate all the sentences T' according to this new set of rules. » The result would be that the sentences produced by O would not be generated according to the rules of Universal Grammar S, but according to the rules of another grammar S', which is not consistent with S. Thus, if there ever exists a human being7 who considers the T' sentences grammatical, it will follow that the grammar S is not universal. And since we could apply a similar procedure to no matter what set of rules, no set S will ever be universal. You can, of course, raise objections like these:

1. Since this new grammar S' will not be consistent with the Universal Grammar S, no human being will understand it; it will thus not be a human grammar at all, and the sentences produced will not be sentences of a human language.
2. One thing is to explicitly formulate the rules of Universal Grammar; another thing is to construct sentences according to these rules. In fact, all the processes of U.G. are realized on the sub-conscious, functional-mind level, and what we observe are only the outputs of these processes. We can say that the access of our consciousness to U.G. is read-only: we can know the rules, but we cannot consciously apply them or use them to construct sentences.

We'll return to objection no. 1 in the « power of stimulus » argument. For the moment, let's just consider it as the question « Could a human being – except the orator O – who perceives the set of sentences generated by rules which are not consistent with the Universal Grammar, ever consider them grammatical? », which can be decided by scientific means – by an experiment. If You decide to use the second objection to save Your system, we have to warn You that You'll make Your Doctrine very much impotent.8 Because if You say « it's not possible to consciously apply the rules of U.G. », You'll lose a great deal of justification for « drawing trees and writing derivations for Your students », because drawing trees and writing derivations is, in the end, nothing else than an attempt to consciously apply the rules of U.G.9 In other words: from the moment You explicitly formulate Your U.G., it will be in serious danger of losing its universality because of the « chomskian anakoluth » which will surely follow. And to save its universality by not formulating it at all, by saying « it's not possible to formulate it, yet it exists », would not be far from the disputes concerning the ontological proof of the existence of God, coming from the dark ages into which we simply don't want to fall once again.

7 Except the orator O, of course. For if his new figure has no positive effect upon the public, he will most probably end up in an institution for the mentally disturbed, where some wise man with a diploma, a title and a white coat will write into his diagnosis « speaker incapable of producing syntactically correct sentences » – and we cannot take seriously the linguistic judgments of such a person, can we? On the other hand, if he succeeds, he'll be celebrated as a genius, and his new grammatical rule will maybe even become a NORM. That is the difference between the madman and the genius: success.
8 You'll be like a mathematician trying to explain the idea behind a newly discovered Operator, saying to his students: « So, You see what this Operator does? Unfortunately, we cannot do any exercises, because we don't know any objects with which it does what it does – for if we had known them, it would follow that we wouldn't be able to apply this Operator! »
9 And exercises of the form « explain why the sentence S: "sentence this grammatic not is" is not grammatical » are in fact the first examples of the application of the chomskian anakoluth.

6. Saussurian universal darwinism argument
7. Adequacy with the world
8. Empiricists' argument – the competence-and-performance mess
9. Popperian argument and falsifiability
10. Argument from poetics
11. Arguments from other arts
12. Argument from children
13. Argument from an anarchist – the normativity of G.G.
14. Argument « I Love You »
You try to persuade me that every grammatical English sentence can be analysed as N VP.10 Asked to analyse the sentence « I Love You », You automatically do something like:

          p
         / \
        N   VP
        |  /  \
        I Love You

and if You are a mentalist, You will even try to persuade me that something like THAT structure exists in my very head. And then I will tell You: « but the things can be observed, analyzed differently ». And I will draw:

    a)    p            b)    p            c)      p
         /|\                 |                   / \
        N V N            N---V---N           core   N
        | | |            |   |   |           /  \   |
        I love You       I  love You        N    V  |
                                            |    |  |
                                            I  love You

and if You're honest to truth, at least for a short while You'll have nothing to say. Because even if You don't know it, You shall at least feel that there CAN BE cultures which mentally represent the love relation in such a mutually symmetric fashion (examples a and b), or maybe even in a « the-other-one-is-primary » fashion (c). And then, because You are also honest to Your Doctrine, You'll add: « But that breaks all the most essential rules ». And I will respond: « Not rules, but conventions, which a community of post-war syntacticians voluntarily chose, driven by a need to facilitate the communication among its members, and which are being imposed upon the new generation ».
15. Russell's argument – G.G. as a formal system
16. Lobachevsky's (anti-Euclidean) argument

10 In X-bar theory You'll add some bars and Is and Ps here and there, but the overall Cartesian architecture of Your system, based upon the assignation of a special place to the subject, stays the same.

17. Hilbert's argument
18. Gödelian argument (or a little K.O. de grâce)
19. post-Freudian argument
20. Kuhnian argument
21. Skinnerian « power of stimulus » argument
22. Memetician's argument
23. Copernicus' argument

Ironic remarks concerning the lecture of Haegeman's Government and Binding manual

97 – (Kayne proposes)... that the relevant parameter distinguishing VO languages from OV languages is related to the application of the leftward movement rule.
(A distinguished Anglo-Saxon academician proposes)... that the relevant parameter distinguishing the Arabic script from the classical Latin script is related to the right-left movement of the hand only. For as everybody (especially a well-formed Anglo-Saxon) knows, it is the left-right movement which is universal, present in the deep layers not only of cerebral structures but of DNA itself, and all empirical data which are not in accordance with this universal principle are just surface phenomena.

106 – We must introduce some parameter to distinguish configurational languages from non-configurational languages.
And we must, of course, introduce some parameter to distinguish a language from non-language. And it will be coded like this: if mother is speaking it (if its utterances are much more frequent and intense than any others), it is a language.

145 – Apart from the identification of verb inflection, we shall not be concerned with the decomposition of words into morphemes either.
Such an approach is, from the global point of view, similarly absurd as saying: « Apart from the identification of Slavic/Sanskrit cases, we shall not be concerned with the decomposition of words into morphemes either. For every case we'll create a nice CaseP (analogous to IP), and when it comes to the conjugation of verbs, we'll invent a "Theory of verb agreement module" which we'll position between the D-Structure and the S-structure. » Why not?
143 – It is easy to see that the more elements are involved, the more choices are available. (« Language acquisition as a defense of the binary branching theory »)
Counter-argument: it is easy to see that for a child in the early stages of its language acquisition, the sentence « Daddy sleeps » has the same number of elements as the sentence « Mummy must go now ». And that number is one. Or do You believe that Mummy or Daddy says « white space » after each word? The number of lexicon elements into which a sentence can be analysed grows in parallel with the « internalization of the mother language's grammatical structure ». Thus binary branching does not offer any real advantage for language acquisition. (And, in the end, hardly any language can be fitted onto the strict binary branching paradigm anyway.)

Bonus
Some sentences from Shakespeare (found by applying the Perl regular expression /[\.\; ]([\w ]*?I [\w ]+? me [\w ]+?\.)/g upon the corpus "The Complete Works of William Shakespeare") which, in my opinion, violate the second binding principle:

I have kept me from the cup.
I cross me for sinner.
I would wish me only he.
Something I must do to procure me grace.
Here on this molehill will I sit me down.
I can buy me twenty at any market.
I have bethought me of another fault.
I will shelter me here.
Here will I rest me till the break of day.
I do repent me that I put it to you.
I fear me both are false.
I can no longer hold me patient.
For I repent me that the Duke is slain.
And now I cloy me with beholding it.
bread I it makes me mad.
That I should yet absent me from your bed.
How I may bear me here.

First measurements concerning rhythmic circuit inertia and disparition
(written by Daniel Devatman Hromada for Joëlle Provasi as an M1 mémoire for EPHE SVT CNA)

The experiment consisted of three tasks: during the first, the spontaneous motor tempo (SMT) was measured; during the second, the child had to synchronise to stimuli with a 600 ms inter-stimulus interval (ISI); the third task was a continuation/induction task – after being attuned to a 600 ms ISI, the child was instructed to continue tapping the same rhythm (measured as the inter-response interval, IRI) even after the stimuli were turned off. This text concerns only the 3rd task in relation to the data obtained by the SMT measurements of the 1st task. Crucial for the understanding of our method is the concept of the « IRI falling into the SMT attractor state ». We say that a subject's IRIs have fallen into the SMT attractor state when an IRI cannot be distinguished from the SMT. For practical purposes, we define that an IRI cannot be distinguished from the SMT when the arithmetic mean of 3 subsequent IRIs is lesser than the SMT, e.g. IRI …

3.1. Shakespeare numbers
A Shakespeare number (S-number) is a natural number whose decimal representation satisfies two constraints:
• repetition-inclusion constraint: at least one digit occurs at least twice;
• ascending property: the number may not contain a digit n > 1 without containing a digit n − 1 to the left of the first occurrence of n.
In order to see the principle more clearly, Table 1 enumerates the ten Shakespeare numbers with the smallest values.

S-number   Alphabetic representation   Matchable expression
11         AA                          "we split we split"
111        AAA                         "we split we split we split"
112        AAB                         "here here sir"
121        ABA                         "to prayers to"
122        ABB                         "trip audrey i attend i attend"
1111       AAAA                        "justice justice justice justice"
1112       AAAB                        "great great great pompey"
1121       AABA                        "here here sir here"
1122       AABB                        "gross gross fat fat"
1123       AABC                        "he he and you"

Table 1. The first ten Shakespeare numbers, their corresponding alphabetic representations and arbitrarily chosen Shakespearean expressions which can be subsumed under them.
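For readers who prefer executable definitions, here is a minimal Perl check of ours (distinct from the article's Listing 1, reproduced further below), testing both constraints against the examples discussed in the next paragraph:

    use strict; use warnings;

    # Check the two S-number constraints for a digit string.
    sub is_s_number {
        my ($n) = @_;
        my %seen = ("0" => 1);                       # digit 1 needs no predecessor
        my $repeated = 0;
        for my $d (split //, $n) {
            return 0 unless exists $seen{$d - 1};    # ascending property violated
            $repeated = 1 if $seen{$d};              # a repetition was found
            $seen{$d} = 1;
        }
        return $repeated;                            # repetition-inclusion constraint
    }

    printf "%-4s -> %s\n", $_, (is_s_number($_) ? "S-number" : "not an S-number")
        for qw(11 121 1123 22 221 12 123);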
As a counterexample, let us note that 22 is not a Shakespeare number because the digit 1 does not occur in it at all, and that 221 is not a Shakespeare number because 2 occurs with no 1 to its left: these two numbers do not satisfy the ascending property. On the other hand, numbers like 12, 13 or 123 are not S-numbers either, because they do not include any repeated digit and therefore do not satisfy the repetition-inclusion constraint.

Listing 1 displays the source code of a routine able to generate the sequence of S-numbers from one to potential infinity. The sequence of the first 163553 S-numbers – id est, those S-numbers whose value is less than 9999999999 – is available at the Online Encyclopedia of Integer Sequences [13] under sequence number A273977.3 Deeper mathematical and number-theoretical properties of S-numbers are presented in [19].

3 https://oeis.org/A273977/b273977.txt

3.2. Entangled numbers
The set of entangled numbers is a subset of the set of Shakespeare numbers (E ⊂ S ⊂ ℕ). E-numbers therefore satisfy the repetition and ascending properties of S-numbers. In addition to these, the decimal representation of an entangled number E satisfies one additional property:
• closure property: each digit of E occurs at least twice.
In order to see the idea more clearly, Table 2 enumerates all Entangled numbers having a digit-length of five or less.

E-number   Alphabetic representation   Matchable expression
11         AA                          "we split we split"
111        AAA                         "we split we split we split"
1111       AAAA                        "justice justice justice justice"
1122       AABB                        "gross gross fat fat"
1212       ABAB                        "to prayers to prayers"
1221       ABBA                        "my hearts cheerly cheerly my hearts"
11111      AAAAA                       "so so so so so"
11122      AAABB                       "great great great pompey pompey"
11212      AABAB                       "come come buy come buy"
11221      AABBA                       "high day high day freedom freedom high day"
11222      AABBB                       "o night o night alack alack alack"
12112      ABAAB                       "too vain too too vain"
12121      ABABA                       "come hither come hither come"
12122      ABABB                       "come buy come buy buy"
12211      ABBAA                       "freedom high day high day freedom freedom"
12212      ABBAB                       "on whom it will it will on whom it will"
12221      ABBBA                       "thou canst not hit it hit it hit it thou canst not"

Table 2. All Entangled numbers with no more than 5 digits, their corresponding alphabetic representations and arbitrarily chosen Shakespearean expressions which can be subsumed under them.

As a counterexample, let us note that numbers like 12, 13, 22 or 123 are not E-numbers because they are not even S-numbers. On the other hand, S-numbers like 121 or 1211 are not E-numbers because they contain a digit 2 which is not repeated.

Listing 2 displays the source code of a routine able to verify whether an S-number presented at the input is an E-number. The sequence of the first 4360 E-numbers – id est, those E-numbers whose value is less than 9999999999 – is available at the Online Encyclopedia of Integer Sequences [13] under sequence number A273978.4 Deeper mathematical and number-theoretical properties of S-numbers are presented in [19].

4 https://oeis.org/A273978/b273978.txt

4. Method
The core idea behind our method can be stated as follows: any S- or E-number is to be "translated" into a backreference-endowed regular expression. More concretely, every digit of an S- or E-number can be interpreted as a sort of element, a "brick". In this article, we work with only one type of brick: those corresponding to sequences which are between two and twenty-three characters long.5

5 The minimal (e.g. 2) and maximal (e.g. 23) brick lengths are the only parameters of our model and can, of course, be adequately tuned. Sometimes we shall denote this parameter couple with the term base. More in the discussion.
More concretely, the first occurrence of a novel brick can be represented as the PERL-compatible regular expression

(.{2,23})

However, any subsequent occurrence of a digit in the S- or E-number is interpreted not as an occurrence of a new brick, but rather as a backreference to the brick which was already denoted by the same digit. Hence, the very first S-number, 11, is NOT to be translated into the regex /(.{2,23}) (.{2,23})/, for this would imply the existence of two distinct bricks. Rather, 11 is to be translated into the regex

(.{2,23}) \1

wherein the expression \1 denotes the backreference to the content matched by the regex-brick specified in the first parentheses, i.e. brick no. 1. Hence, the S-number 111 is easily translated into the regex /(.{2,23}) \1 \1/, 1111 into the regex /(.{2,23}) \1 \1 \1/, etc. These, however, are cases which correspond only to the repetition of one single brick: 11 for duplication, 111 for triplication, 1111 for quadruplication, etc. In order to assure the application of the non-identity principle, stating that "each distinct digit corresponds to distinct content", an additional adjustment is needed if we want to translate S-numbers containing multiple digits of different kinds – that is, S-numbers like 121, 122 or 211. For if we did not care for the principle of non-identity, a number like 121 could simply be represented as /(.{2,23}) (.{2,23}) \1/ and a number like 122 could be translated into /(.{2,23}) (.{2,23}) \2/. It could turn out, however, that these regexes would match the very same expressions as other, simpler regexes do (e.g. the expression "no no no" could be matched by /(.{2,23}) \1 \1/ as well as by /(.{2,23}) (.{2,23}) \1/ or /(.{2,23}) \1 (.{2,23})/). This is so because nowhere in such a regular expression is it specified that the first brick has to be different from the second brick, or the third from the second. Luckily enough, the syntax of PCREs is expressive enough to allow us to encode the non-identity constraint into the regexes themselves. This is attained by putting the backreference into a so-called negative lookahead, traditionally expressed by the formula (?!). Hence, by translating the S-number 121 into the regex

(.{2,23}) (?!\1)(.{2,23}) \1

we can make sure that the content matched by the brick denoted by digit 2 shall be different from the content matched by the brick denoted by digit 1. Thus, the expression "no no no" shall not be matched by such a regex, while the expression "no yes no"6 shall.

6 A cautious reader may now start to observe that the non-repeated digits of an S-number in fact correspond to "filler" or "separator" expressions (e.g. "yes"), which in many cases fill the space between the repeated elements themselves (e.g. "no").

Going somewhat further, the S-number 12321 – which can be understood as an instance of the chiasmatic ABXBA – is to be translated into the regex

(.{2,23}) (?!\1)(.{2,23}) (?!\1|\2)(.{2,23}) \2 \1

whereby the disjunctive backreference contained in the negative lookahead (?!\1|\2) assures that the content matched by brick no. 3 – corresponding to the filler X – shall be different from the content matched by the brick representing digit 1 as well as from that matched by the brick representing digit 2. This being said, the method of translating S- or E-numbers into regexes which do not transgress the non-identity constraint is pretty much straightforward, and it is fully and completely described by the PERL code given in Listing 3.
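As a quick sanity check of the non-identity constraint, one may run a throwaway Perl snippet of ours (not part of the article's listings) which confronts the regex for 121 with the two expressions just mentioned:

    use strict; use warnings;

    my $re_121 = qr/(.{2,23}) (?!\1)(.{2,23}) \1/;   # bricks 1, 2, 1 with non-identity

    for my $expr ("no no no", "no yes no") {
        print "'$expr' ", ($expr =~ $re_121 ? "matches" : "does not match"),
              " the regex for 121\n";
    }
    # 'no no no' is rejected (brick 2 would have to equal brick 1),
    # 'no yes no' is accepted.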
5. Experiment
5.1. Corpus
A digital, unicode-encoded version of Craig's edition of the "Complete Works of William Shakespeare" [4] has been downloaded from a publicly available Internet source.7 This corpus contains 17 txt files stored in the sub-folder "comedies", 10 txt files stored in the sub-folder "tragedies" and 10 txt files stored in the sub-folder "historical". What's more, all utterances are annotated according to the following format:

<PERSONA>
Sentence 1. Sentence ...
</PERSONA>

e.g.:

<MIRANDA>
O, wonder! How many goodly creatures are there here! How beauteous mankind is! O brave new world, That has such people in't!
</MIRANDA>

that is, a format highly reminiscent of a valid XML document. This format, wherein the diverse values of the <PERSONA> tag denote the names of diverse dramatis personae (e.g. Miranda, Prospero), seems to be consistently and stringently followed across all files contained in the corpus. This is advantageous, since it implies that the content present between the opening and the closing tag can be understood as a supraphrasal, meaning-encoding monadic unit: an utterance.

7 Downloaded from http://www.lexically.net/downloads/corpus_linguistics/ShakespearePlaysPlus.zip. Backup at http://sci.wizzion.com/ShakespearePlaysPlus.zip.

Verily, this is encouraging, and it is encouraging for both a theoretical (1) and a practical (2) reason:
1. the school of thought to which our research adheres is principally a constructivist, usage-based linguistic paradigm, best manifested in [20];
2. the computational complexity of matching backreference-endowed regexes depends supralinearly, or maybe even non-polynomially [1], on the length of the text being matched.
Regarding the practical reason, it can be postulated that our article offers certain evidence for the hypothesis that "backreferenced regex-parsing of Shakespearean utterances is computationally tractable in reasonable time", whereby the term "reasonable" denotes time scales between milliseconds and minutes. More in the discussion. Regarding the theoretical reason, it is worth making explicit that an implicit leitmotif of Tomasello's theory is the definition stating: the utterance is the basic unit of linguistic interaction.

5.2. Processing
Dramatic pieces are divided into utterances. This is a natural consequence of the fact that dramatic pieces tend to represent scenarios within which diverse dramatis personae interact with each other. It is difficult to see any other literary genre where the division into utterances is as marked as in the case of drama.8 And in the case of the digital version of the [4] Shakespeare corpus, such markedness tends to be even more marked. Therefore, one simply needs to cut the corpus into utterances by interpreting the closing tag of the utterance (e.g. </PERSONA>, </MIRANDA> etc.) as the utterance separator. Even more concretely, one can simply consider the slash symbol / to be the utterance separator. Subsequently, dividing the original dramatic text into utterances is, at least in PERL, as simple as defining the symbol / to be the default input separator.

8 Plato's dialogues are, of course, set aside as a very particular case. When it comes to film scripts and/or subtitles of other audiovisual media, these are principally understood as a particular subtype of dramatic pieces.
That is, in PERLish, by executing the following code:

$/ = "/";

Only two further text-processing steps have been executed during the initialization phase of the experiment hereby presented. Primo, the content of each utterance has been put into lowercase. Secundo, non-alphabetic symbols (e.g. dot, comma, exclamation mark, etc.) have been replaced by blank spaces. We are aware that such a replacement could potentially lead to a certain loss of prosody- or pathos-encoding information. However, we consider this step legitimate, because the objective of our experiment was to focus on the repetition of lexical units.9

9 Enumerative generation of backreference-involving regexes focusing on repetitions of phonotactic clusters, syllables, phrases or potentially even sememes and prosodies is, in theory, also possible. We prefer, however, not to focus on this topic within the limited scope of this article.

The pre-processing code once executed, the identification of expressions containing diverse types of lexical repetition is as simple as matching each Shakespearean utterance against each regex.

6. Results
This section presents the results of the exposure of Shakespeare's corpus to base = (2,23) regular expressions generated out of all entangled numbers with a maximal length of 10 digits. We focus on E(2,23)-numbers because their closure property (i.e. "every digit contained in a valid E-number has to occur at least twice") gives an arbitrary E-number the ability to match much rarer a gem than just an arbitrary S-number.

6.1. Quantitative
All in all, 3667 instances of repetitive expressions have been detected in Shakespeare's complete works. These were contained in 2295 distinct utterances and corresponded to 172 distinct E(2,23) schemata. Among these, 71 matched more than one instance: these schemata could thus potentially correspond to a certain cognitive pattern, or habitus, in Shakespeare's mind. Table 3 contains summary matching-frequency information concerning the schemata which match at least five distinct utterances.

Instances   E(2,23)-number   Example
2332        11               "bestir bestir"
525         1212             "to prayers to prayers"
170         111              "ha ha ha"
100         123123           "cover thy head cover thy head"
48          12121            "come hither come hither come"
35          1221             "fond done done fond"
32          12341234         "let him roar again let him roar again"
32          1122             "with her with her hook on hook on"
30          1111             "great great great great"
23          121212           "come on come on come on"
12          123231           "upholds this arm this arm upholds"
12          1231231          "fubbed off and fubbed off and fubbed"
11          121233           "trip audrey trip audrey i attend i attend"
11          112323           "what what what ill luck ill luck"
10          123312           "my hearts cheerly cheerly my hearts"
10          11122            "lady lady lady alas alas"
9           121323           "a lord to a lord a man to a man"
8           12321434         "land rats and water rats land thieves and water thieves"
8           11111            "so so so so so"
7           12312312         "let me see let me see let me"
6           11234234         "on on on to the breach to the breach"
5           12123434         "i thank god i thank god is it true is it true"
5           1112323          "barren barren barren beggars all beggars all"

Table 3. Numbers of utterances in the collected works of William Shakespeare matched by E(2,23) schemata which match at least five distinct utterances; the individual bricks of the backreference-encoding regexes match expressions not shorter than 2 and not longer than 23 characters.

Another phenomenon may be found noteworthy by a reader interested in the purely quantitative aspects of our research.
That is, the number of occurrences of expressions matchable by an E-number of length L seems to stand in a Zipf-like [25] relation to L. For example, Shakespeare's dramas seem to contain 2332 duplications (E = 11), 170 triplications (E = 111), 30 tetraplications (E = 1111), 8 pentaplications (E = 11111)10, two hexaplications (E = 111111)11, one heptaplication (E = 1111111)12 and zero octaplications.

10 E.g. "never never never never never" by Lear in King Lear.
11 E.g. "kill kill kill kill kill kill", also by King Lear.
12 E.g. "so so so so so so so" by Shallow in The Second Part of King Henry IV.

It is worth mentioning, however, that the generic relation between the length (in digits) of an E-number X and the amount of utterances which X matches seems not to be Zipfian. This is illustrated by Table 4.

Digits        2      3     4     5     6     7     8     9
Theoretical   1      1     4     11    41    162   715   3425
Matched       2332   170   622   91    211   56    86    67

Table 4. Schemata corresponding to E-numbers with an even number of digits match more frequently than those with an odd number of digits.

As indicated by Table 4, the observed preference for repetitive expressions involving two, four, six or eight bricks cannot be explained in terms of the number-theoretical distribution of the E-numbers themselves. For example, there exist eleven E-numbers with five digits and forty-one E-numbers of length six. However, when exposed to the Shakespeare corpus, the base(2,23) regexes generated from E-numbers six digits long matched 211 utterances, while the five-brick-long regexes matched only ninety-one of them. Whether this observed asymmetry is an artefact of our method and of our definition of E-numbers, or whether it is due to a sort of cognitive bias – a preference for balanced repetitions – is a question which we do not dare to tackle within the limited scope of the present article.

6.2. Qualitative
It may be said that the longer the E- or S-number, the more complex a structure and the more cognitively salient, pathos-filled an entity it potentially represents. For this reason, this subsection principally exposes the reader to a few answers to the question: "Which Shakespearean expressions can be matched by the longest possible E-numbers?" In all following examples we use base(2,23) E-numbers, i.e. we restrict the length of the individual bricks to a minimum of 2 and a maximum of 23 characters.

In the realm of comedies13, one can observe that the regex generated from the number 12343434 pinpoints the following utterance of Stephano, playing his role in The Tempest:

Flout (1) 'em (2), and (3) scout 'em (4); and (3) scout 'em (4), and (3) flout 'em (4); Thought is free.

while the regex generated from the number 12343412 identifies Miranda's:

All (1) lost (2) to (3) prayers (4), to (3) prayers (4), all (1) lost (2).

13 Link to the file containing all XXX expressions shall be published in the camera-ready version of the article.

or Caliban's:

Freedom (1), high (2) day (3)! high (2) day (3), freedom (1)! freedom (1)! high (2) day (3), freedom (1)!14

all appearing in the same play.
Another answer, corresponding to the E-number 122133144, is given by Dromio, a personage of Shakespeare's "Comedy of Errors":

She is so hot because (1) the meat is cold (2); The meat is cold (2) because (1) you come not home (3); You come not home (3) because (1) you have no stomach (4); You have no stomach (4), having broke your fast;

Analyzing the realm of tragedies, one may see Polonius – a character of the Hamlet drama – utter an 11231434231-matchable expression:

The best actors in the world, either for tragedy, comedy, history, pastoral (1), pastoral (1)-comical (2), historical (3)-pastoral (1), tragical (4)-historical (3), tragical (4)-comical (2)-historical (3)-pastoral (1), scene individable, or poem unlimited: Seneca cannot be too heavy, nor Plautus too light. For the law of writ and the liberty, these are the only men.15

or one can hear Hamlet himself pronouncing the following 1231414312-matchable sequence:

Let your own discretion be your tutor: suit the (1) action (2) to (3) the (1) word (4), the (1) word (4) to (3) the (1) action (2)

14 It is important to realize that the very same expression can be matched by multiple regexes. Hence, the above-mentioned Caliban's proclamation can be analyzed not only as matching the base(2,23) E-number 1232311231, but also as matching E-numbers like 12211121 (if ever "high day" forms only one brick), etc. This is analogous, mutatis mutandis, to a sentence having multiple syntactic parses.
15 Note that if the regexes had been constructed in a way that ignores suffixes, i.e. using bricks of a form like "(.{2,23})\w{0,4}", then this utterance could potentially be matched by a much longer number, because not only the adjectives (e.g. "historic-al") but also the preceding substantives (e.g. "histor-y") would be accounted for.

while Mercutio from the Romeo and Juliet narrative states:

Come, come, thou art as hot a Jack in thy mood as any in Italy; and as soon (1) moved (2) to be (3) moody (4), and as soon (1) moody (4) to be (3) moved (2).

These examples are, of course, just the tip of an iceberg. Verily, only a tip of an iceberg, because many strongly marked repetitive expressions are also to be found in Shakespeare's historical dramata. Among these, the dramata eternalizing the narratives of Henry IV. and Henry V. tend to top the list.
Hence, Gadshill reasons:

will strike (1) sooner (2) than (3) speak (4) and (5) speak (4) sooner (2) than (3) drink (6) and (5) drink (6) sooner (2) than (3) pray and yet i lie for they pray continually to their saint the commonwealth or rather not pray to her but prey on her

while Falstaff emphasizes:

banish peto banish bardolph banish poins but for sweet jack falstaff kind jack falstaff true jack falstaff valiant jack falstaff and therefore more valiant being as he is old jack falstaff banish (1) not (2) him (3) thy (4) harry s (5) company (6) banish (1) not (2) him (3) thy (4) harry s (5) company (6) banish (1) plump jack and banish all the world

It is, however, a persona named Shallow who seems to be particularly fond of repetitions, once saying:

come (1) on (2) come (1) on (2) come (1) on (2) sir (3) give (4) me (5) your (6) hand (7) sir (3) give (4) me (5) your (6) hand (7)

and the next time saying:

where s (1) the roll (2) where s (1) the roll (2) where s (1) the roll (2) let (3) me (4) see (5) let (3) me (4) see (5) let (3) me (4) see (5) so (6) so (6) so (6) so (6) so (6) so (6) so (6) yea marry sir ralph mouldy let them appear as i call let them do so let them do so let me see where is mouldy

Given that Shallow appears in the historical dramata, an interesting question can rightfully be posed: is Shallow's tendency to produce repetitive utterances en masse just Shakespeare's invention, or is it rather a sort of description of the particular cognitive characteristics of a once existing historical personage?

7. Conclusion
Our article presents a way of mapping a subset of the set of all possible backreference-endowed regexes onto the set of natural numbers. It indicates that, for every base of a certain kind, the set of regexes-to-be-generated is infinite but enumerable. A set of so-called Shakespeare numbers (S-numbers) is defined, as well as a set of "Entangled numbers", the second being a subset of the first which satisfies one additional constraint: every distinct digit ("symbol") of an entangled number EX occurs in EX at least twice. We have subsequently generated the list of all such S-numbers (c.f. Listing 1) and E-numbers (c.f. Listing 2) with at most 10 digits, after which the E-numbers have been translated into backreference-endowed regular expressions whose most elementary units, the so-called "bricks", were no shorter than two and no longer than twenty-three characters. In the end, such regexes have been exposed to a corpus containing the collected works of William Shakespeare.

This approach allowed us to pinpoint 3667 utterances matching at least one among 172 distinct repetitive formulae. We believe that at least some among these formulae could be of certain interest not only for Shakespearean scholars [14] in particular, but also for the wider fields of "digital humanities" [23] and stylometry.

The good news is that the whole matching process is also fairly fast. More concretely, matching all utterances with all base(2,23) regexes generated out of all 4360 E-numbers with less than 10 digits lasted 9555 seconds in the case of the Shakespearean comedies, 6607 seconds in the case of the tragedies and 6900 seconds in the case of the historical dramata – all this on one single core of a 1.4 GHz CPU.
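For orientation, the shape of the whole experiment can be compressed into a small self-contained Perl sketch of ours (toy utterances and a hand-written three-entry regex table standing in for the full 4360-entry one):

    use strict; use warnings;

    # Each utterance is tested against each schema; hits are counted per schema.
    my %regex_of = (
        '11'  => qr/(.{2,23}) \1/,
        '111' => qr/(.{2,23}) \1 \1/,
        '121' => qr/(.{2,23}) (?!\1)(.{2,23}) \1/,
    );
    my @utterances = (
        "bestir bestir ",
        "ha ha ha ",
        "to prayers to prayers all lost ",
    );
    my %hits;
    for my $u (@utterances) {
        for my $e (keys %regex_of) {
            $hits{$e}++ if $u =~ $regex_of{$e};
        }
    }
    # On these toy data: 11 matches all three utterances, 111 and 121 one each.
    printf "%-4s matched %d utterance(s)\n", $_, ($hits{$_} // 0)
        for sort keys %regex_of;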
8. Peroratio
Rhetorics undoubtedly belongs among the five oldest scientific paradigms ever explicated by scholars of the occidental16 tradition. Even before Plato noted down the discussions between Socrates and Gorgias and between Socrates and Parmenides, even before Aristotle projected his point of view upon the realm of man, Athēnaia had already been venerated. The longevity of rhetorics has positive as well as negative sides. Negative, for such a lengthy tradition implies potential impediments caused by centuries of terminological and methodological sediments. We are convinced that, similarly to the diverse occult notations of pre-Mendeleevian chemistry, the alphabetic notation of BABAs and ABBAs may also be considered such a sediment with regard to the science of rhetorics. Hence, by the trivial act of switching notation from As to ones and from Bs to twos, we aspire to do nothing else than to unblock this science from a state of terminological traffic jam into a somewhat more fluid state.

Hence and thus, interesting and almost melodical17 verses of Shakespeare have been pinpointed and juxtaposed side by side. Being unsure whether such a juxtaposition has ever been explored in the depth it merits, we find our qualitative results worthy not only of exploring but also of publishing. For who knows, maybe they shall even inspire some potential Shakespeare of the future?

Quantitative explorations may also turn out to be worthy of further pursuit. Three axes of such exploration are immediately visible:
1. the "universalia axis": the study of language-independent invariants and rhetorical schemata which occur across many distinct languages and/or language groups [12];
2. the "ontogenetic axis": the exploration of the processes by means of which the complex eloquence of an individual locutor emerges out of simpler structures, from the mind of a child to Shakespeare;
3. the "historical axis": the study of different Digital Humanities resources in order to increase our knowledge about the styles, fashions, crossovers and traditions popular during different epochs of human history.
In terms of Saussurean linguistics [5], one may consider the first axis to be a synchronic one, while the second and the third can be considered "diachronic" ones.

16 Note, however, that rhetorics is far from being unknown to the Orient as well. Known as Sarasvatī in the Sanskrit world, the goddess who embodies knowledge, arts, music, melody, muse, language, rhetoric, eloquence, creative work... [17] seems to have been active already in Vedic or even pre-Vedic, proto-Indo-European times.
17 It may be the case that the application of our method upon musical partitures – as stored in MIDI files, for example – shall also yield some worthy insights.

One may, for example, extend the work of [12] in the domain of "language-independent detection of figures-of-speech" and demonstrate that E-numbers of considerable length match expressions not only in Shakespeare, but also in Goethe, Molière, Milton or others. Or focus on the so-called "sacred texts" like the Bible, the Koran or the RgVed, where repetitions, indeed, abound.
Or pursue a somewhat more psycholinguistic, ontogeny-oriented line of research and study a corpus like CHILDES [15] in order to explore how complex eloquence emerges out of variations within repetitions of complex sequences (further references to be given in the camera-ready version). Last but not least, we are convinced that our S- or E-number nomenclatures could be embedded into rhetorical figure ontologies [11,16]. Within such ontologies, antimetaboles could thus be "enriched" with attributes like "12321", "123321", "1234321", etc.; anadiplosis would be labeled with another set of numbers, antistrophe with yet another, etc. The advantage of such an enrichment is quite easy to see: the enriched elements would become "grounded" [10]. That is, when looking for – or inferring the presence of – a certain figure of speech F in a certain text T, one could consult the ontology and see whether F is not labeled with SF or EF attributes. If yes, one could simply parse T with the corresponding SF or EF regexes. One could thus establish a practical, functional, bidirectional bridge between the abstract realm of purely descriptive ontologies and the material reality of the text corpora which are to be parsed and understood. And, of course, such nomenclatures – or nomenclatures of a similar vein – may allow communication between computational and classical scholars in unambiguous, precise, yet still concise and sufficiently explanatory terms.

This being said, we conclude this article with an expression of hope that the method hereby introduced shall make it possible to spot, identify, classify and study at a deeper level the intricacies of cognitive ecosystems populated with swarms and clusters of hitherto unknown psycholinguistic schemata traditionally known as "figures of speech".

Acknowledgments
TBD in the camera-ready version of the article.

Listing 1: PERL code generating an ascending sequence of Shakespeare numbers. Code hereby transferred to the public domain under the CC BY-NC-SA license for artistic use and the mGPL license for general use.

$i = 1;
INCREMENT: while ($i++) {
    my %d;
    $d{"0"} = 1;                                  # digit 1 needs no left neighbour
    $r = 0;
    for $d (split //, $i) {
        next INCREMENT if !exists $d{($d - 1)};   # ascending property
        if ($d{$d}) { $r = 1; }                   # a digit repeats
        $d{$d} = true;
    }
    print "$i\n" if $r;                           # repetition-inclusion constraint
}

Listing 2: PERL code checking whether a Shakespeare number given at the input is also an Entangled number. Code hereby transferred to the public domain under the mGPL license.

OUTER: while (<>) {
    my %d;
    $i = $_;
    chop $i;
    for $d (split //, $i) {
        (exists $d{$d}) ? ($d{$d}++) : ($d{$d} = 1);
    }
    for $k (keys %d) {
        next OUTER if ($d{$k} < 2);               # closure property: every digit twice
    }
    print "$i\n";
}

Listing 3: PERL code translating S-numbers into syntactically correct regexes. Code hereby transferred to the public domain under the mGPL license.

my $base = '(.{2,23})';
$n = $ARGV[0];
@i = split //, $n;
$re = "";
my %h;
$no = "";
for my $i (@i) {
    $re .= " ";
    if (defined $h{$i}) {
        $re .= '\\' . $i;                         # repeated digit: emit a backreference
    } else {
        if ($i > 1) {                             # new digit: forbid identity with
            $i > 2 ? ($no .= '|\\' . ($i - 1)) : ($no .= '\\' . ($i - 1));
            $re .= '(?!' . $no . ')';             # all previously introduced bricks
        }
        $re .= $base;
        $h{$i} = 1;
    }
}
$re .= ' [<]';
print "$n translates into $re\n";
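Assuming Listing 3 is saved as, say, translate.pl (the filename is our invention), a sample invocation reproduces the chiasmatic regex derived for 12321 in section 4, extended with the trailing ' [<]' end-of-utterance anchor added by the listing:

    $ perl translate.pl 12321
    12321 translates into  (.{2,23}) (?!\1)(.{2,23}) (?!\1|\2)(.{2,23}) \2 \1 [<]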
: ; ’ "\ − ] + / / g ; # remove non−a l p h a b e t i c c h a r s p u s h @{ $ u t t e r a n c e s {$ARGV} } , $ l i n e ; # c o n s t r u c t t h e u t t e r a n c e h a s h } References [1] Alfred Vaino Aho. Algorithms for finding patterns in strings. Algorithms and Complexity, 1:255, 2014. [2] Georg Cantor. Über eine elementare frage der mannigfaltigkeitslehre. Jahresbericht der Deutschen Mathematiker-Vereinigung, 1:75–78, 1892. [3] Gorges Caumont. Notes morales sur l’homme et sur la societe. Sandoz&Fischbacher, Paris, 1872. [4] William James Craig. The complete works of Wiliam Shakespeare. Oxford University Press, 1919. [5] Ferdinand De Saussure. Cours de linguistique générale: Publié par Charles Bally et Albert Sechehaye avec la collaboration de Albert Riedlinger. Libraire Payot & Cie, 1916. [6] Marie Dubremetz and Joakim Nivre. Rhetorical figure detection: the case of chiasmus. on Computational Linguistics for Literature, page 23, 2015. [7] Luciano Floridi. The philosophy of information. Oxford University Press, 2011. [8] Jeffrey EF Friedl. Mastering regular expressions. " O’Reilly Media, Inc.", 2002. [9] Kurt Gödel. Über formal unentscheidbare sätze der principia mathematica und verwandter systeme i. Monatshefte für mathematik und physik, 38(1):173–198, 1931. [10] Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990. [11] Randy Harris and Chrysanne DiMarco. Constructing a rhetorical figuration ontology. In Persuasive Technology and Digital Behaviour Intervention Symposium, pages 47–52. Citeseer, 2009. [12] Daniel Devatman Hromada. Initial experiments with multilingual extraction of rhetoric figures by means of perl-compatible regular expressions. In RANLP Student Research Workshop, pages 85–90, 2011. [13] OEIS Foundation Inc. The on-line encyclopedia of integer sequences, 2017. http://oeis.org. [14] Sister Miriam Joseph. Shakespeare’s Use of the Arts of Language. Paul Dry Books, 2008. [15] Brian MacWhinney. The CHILDES project: The database, volume 2. Psychology Press, 2000. [16] Miljana Mladenović and Jelena Mitrović. Ontology of rhetorical figures for serbian. In International Conference on Text, Speech and Dialogue, pages 386–393. Springer, 2013. [17] John Muir. Original Sanskrit texts on the origin and history of the people of India, their religions and institutions. Trübner & Company, 1873. [18] Claude E Shannon and Warren Weaver. The mathematical theory of information. 1949. [19] NJA Sloane and Arndt Joerg. Counting words that are in "standard order", 2016. https://oeis.org/A278984/a278984.txt. [20] Michael Tomasello. Constructing a language: A usage-based theory of language acquisition. Harvard university press, 2009. [21] Alan Mathison Turing. On computable numbers, with an application to the entscheidungsproblem. J. of Math, 58(345-363):5, 1936. [22] Alan Mathison Turing. Rhetorique. Grand Memento Encyclopedique, 1:687–689, 1936. [23] Michael Ullyot. Review essay: Digital humanities projects. Renaissance Quarterly, 66(3):937–947, 2013. [24] Larry Wall and Randal L Schwartz. Programming perl. O’Reilly & Associates Sebastopol, CA, 1991. [25] George Kingsley Zipf. The psycho-biology of language. 1935. 
Initial Experiments with Multilingual Extraction of Rhetoric Figures by means of PERL-compatible Regular Expressions

Daniel Devatman Hromada
Lutin Userlab – ChART – Paris 8 – EPHE – Slovak Technical University
hromi@kyberia.sk

Abstract
A language-independent method of figure-of-speech extraction is proposed in order to reinforce rhetoric-oriented considerations in natural language processing studies. The method is based upon a translation of the canonical forms of repetition-based figures of speech into the language of PERL-compatible regular expressions. The anadiplosis, anaphora and antimetabole figures were translated into forms exploiting the backreference properties of PERL-compatible regular expressions, while epiphora was translated into a formula exploiting the recursive properties of this very concise artificial language. These four figures alone matched more than 7000 strings when applied on dramatic and poetic corpora written in English, French, German and Latin. Possible usages, varying from the stylometric evaluation of the translation quality of poetic works to the more complex problem of semi-supervised figure-of-speech induction, are briefly discussed.

1 Introduction
During the Middle Ages and before, the discipline of rhetoric constituted – along with grammar and logic – a basic component of the so-called trivium. Considered by Plato to concern the "one single art that governs all speaking" (Plato, trans. 1986) and subsequently defined by Aristotle as "the faculty of observing in any given case the available means of persuasion" (Aristotle, trans. 1954), the basic postulates of rhetoric are still kept alive by those active in domains as diverse as politics, law, poetry, literary theory (Dubois, 1970) or the humanities in general (Perelman & Olbrechts-Tyteca, 1969).

When it comes to more "exact" scientific disciplines like informatics or linguistics, rhetoric seems to be somewhat ignored – definitely more so than its "grammar" and "logic" trivium counterparts. While contemporary rhetoric disposes of a strong theoretical background – whether in the form of Rhetorical Structure Theory (Taboada & Mann, 2006), "computational rhetoric" (Grasso, 2002) or computational models of natural argument (Crosswhite & Fox, 2003) – a more practically-oriented engineer has nonetheless to agree with the statement that "the ancient study of persuasion remains understudied and underrepresented in current Natural Language systems" (Harris & DiMarco, 2009).

The aim of this article is to reduce this "under-representation" gap and, in a certain sense, to augment the momentum of computational rhetoric – not by proposing a complex model of argumentation, but by proposing a simple yet efficient and language-independent method for the extraction of certain rhetoric figures (RFs) from textual corpora.

RFs, also called "figures of speech", are one of the basic means of persuasion which an orator has at his disposition. Traditionally, they are divided into two categories: tropes, related to the deeper, i.e. semantic, features of the phrasal constituents under consideration; and schemes, related to layers closer to the actual material expression of the proposition, i.e. to the morphology, phonology or prosody of the generated utterance.
The method proposed within this article shall deal only with a reduced subset of the latter – that is, with the detection of the rhetoric schemes anadiplosis, anaphora, antimetabole and epiphora, which are based on a repetition or reordering of a given word, phrase or morpheme across multiple subsequent clauses. While such a stylometric approach has recently been implemented with encouraging results by (Gawryjolek, 2009), his system is operational only when combined with a probabilistic context-free grammar parser adapted to the English language, and hence dysfunctional when applied upon languages for which such a parser does not exist.

In the following paragraphs of this article we shall present a system of rhetoric figure extraction which tends to be language-independent, i.e. applicable upon a textual corpus written in any language. Ideally, no antecedent knowledge about the grammar of a language is necessary for successful extraction by means of our method: 1) the prescriptive form of the figure-to-be-extracted and 2) the symbol representing phrase and/or clause boundaries are the only information necessary.

More concretely, our proposal is based on a fairly simple translation of the canonical form of the rhetoric figure under question into a computer language, namely into the language of PERL-compatible regular expressions (PCREs). PCREs are, in their essence, simply strings of characters which describe sets of other strings of characters, i.e. they are a matching form, a template, for many concrete character strings. As with many other regular expression engines, PCREs make this possible by reserving special symbols – "the metacharacters" – for quantifiers and classes. But in addition to these features, common to many finite state automata, PCREs offer much more (Wall & Loukides, 2000). These are the reasons why we consider PCREs to be appealing candidates for the translation of rhetorical figures into a computer-readable symbolic form:

• by implementing "backreferences" (Friedl, 2006), PCREs make it possible to refer to that which was already matched, hence allowing the construction of automata able to match repetitive forms;
• by implementing (from PERL version 5.10 on) "recursive matching", PCREs make it possible to match very complex patterns without the need to have recourse to means external to PCREs;
• since the language of PCREs is very concise, the resulting PCRE describing a given rhetorical figure is usually a string of a few dozen characters, which could eventually be constructed not by means of human intervention, as was the case in this article, but by means of unsupervised genetic programming (Koza, 1992) or some other grammar induction engine (Solan, Horn, Ruppin, & Edelman, 2005).

Element      Meaning
W            word
...          arbitrary intervening material
<…>          phrase or clause boundaries
subscripts   identity (same subscripts), non-identity (different subscripts)

Table 1: part of the RF-Representation Formalism (RFRF)

2 Method
2.1 PERL-Compatible Rhetoric Figures
Four figures were chosen – namely anadiplosis, anaphora, epiphora and antimetabole – in order to demonstrate the feasibility of the "rhetoric stylometry" approach. We have adopted the Rhetoric Figure Representation Formalism (RFRF) – initially conceived by (Harris & DiMarco, 2009) – and reduced it so as to describe only the four figures of interest. The basic symbols of RFRF and their associated meanings are presented in Table 1. Since the goal of this article is primarily didactic,
we shall start this exposé with the very simple anadiplosis, involving just one backreference, and end our proposal with a somewhat more complex recursive PCRE matching epiphorae containing an arbitrary number of constituents.

2.1.1 Anadiplosis
Anadiplosis occurs when a clause or phrase starts with the word or phrase that ended the preceding unit. It is formalized in RFRF as:

< . . . Wx >< Wx . . . >

We have translated this representation into the following PERL-Compatible Rhetoric Figure (PCRF):

/((\w{3,})[.?!,] \2)/sig

The repetition-matching faculty is assured by a backreference to an initial n-gram composed of at least three word characters. Therefore, this PCRE makes it possible to match utterances like the one in Cicero's De Oratore:

Sed genus hoc totum orationis in eis causis excellit, in quibus minus potest inflammari animus iudicis acri et vehementi quadam incitatione; non enim semper fortis oratio quaeritur, sed saepe placida, summissa, lenis, quae maxime commendat reos. Reos autem appello non eos modo, qui arguuntur, sed omnis, quorum de re disceptatur; sic enim olim loquebantur.1

1 "For vigorous language is not always wanted, but often such as is calm, gentle, mild: this is the kind that most commends the parties. By 'parties' I mean not only persons impeached, but all whose interests are being determined, for that was how people used the term in the old days."

This is the simplest possible anadiplosis figure, since it matches only strings with two occurrences of the repeated word. Therefore we label this figure anadiplosis{2}.

2.1.2 Anaphora
Anaphora is a rhetoric figure based upon a repetition of a word or a sequence of words at the beginnings of neighboring clauses. It is formalized in RFRF as:

< Wx . . . >< Wx . . . >

We have translated this representation into the following PCRE form:

/[.?!;,] (([A-Z]\w+) [^.?!;,]+[.?!;] \2 [^.?!;,]+[.?!;,] (\2 [^.?!;,]+[.?!;,])*)/sig

As with all RFs presented in this article, this anaphora is also based on backreference matching. In contrast with anadiplosis, where the dependency was of a very short-distance nature, in the case of anaphora the second occurrence of the word can be dozens of characters distant from the initial occurrence. What's more, this RF takes into account a possible third repetition of Wx, which makes it possible to match utterances like Cicero's:

Quid autem subtilius quam crebrae acutaeque sententiae? Quid admirabilius quam res splendore inlustrata verborum? Quid plenius quam omni genere rerum cumulata oratio?2

2 "Is there something more subtle than a rapid succession of pointed reflections? Is there something more wonderful than the heating-up of a topic by verbal brilliance, something richer than a discourse cumulating material of every sort?"

Since this PCRF allows us to match anaphorae with two or three occurrences of a repeated word, it seems appropriate to label it anaphora{2,3}.

2.1.3 Antimetabole
Antimetabole is a rhetoric figure which occurs when words are repeated in successive clauses in reversed order. In terms of RFRF, one can formalize it as follows:

< Wx . . . Wy >< Wy . . . Wx >

We have translated this representation into the following PCRE form:

/((\w{3,}) (.{0,23}) (\w{3,})[^\.!?]{0,23} \4 \3 \2)/sig

Differently from the previous examples, where only one element was matched and backreferenced, three elements – A, B, C – are determined in the initial phases of matching this chiasmatic antimetabole. Subsequently, the order of A & C is switched, while B is considered to be identical intervening material between A and C and between C and A.
Since the possible occurrence of other material intervening between ABC and CBA (i.e. ABCxCBA) is also taken into account, this PCRF has successfully matched expressions like:

Alle wie einer, einer wie alle.3

3 "All as one, one as all."

2.1.4 Epiphora
Epiphora, or epistrophe, is an RF defined as "ending a series of phrases or clauses with the same word or words". It is formalized in RFRF as:

< . . . Wx >< . . . Wx >

We have translated this representation into the following PCRE form:

/([A-Z][^\.\?!;]+ (\w{2,}+)([\.\?!;] ?[A-Za-z][^\.\?!;]+ (?:\2|(?-1))*)\2[\.\?!;])/sig

In contrast with the anaphora{2,3} figure presented in 2.1.2, the epiphora figure hereby proposed exploits the "recursive matching" properties of the latest versions of PCRE (Perl 5.10+) engines. In other words, the expression (?:\2|(?-1)) matches any number of subsequent phrases or clauses which end with Wx – and not just two or three, as was the case with anaphora. Hence, a quadruple epiphora:

Je te dis toujou la même chose, parce que c'est toujou la même chose, et si ce n'était pas toujours la même chose, je ne te dirais pas toujou la même chose.4

4 "I always tell you the same thing, because it is always the same thing, and if it wasn't always the same thing, I would not have been telling you the same thing."

was detected by this recursive PCRF when it was applied upon the corpus of Molière's works. Since recursive matching allows us to create a sort of "greedy" epiphora, we propose to label it epiphora{2,} in a possible future taxonomy of PCRFs.

2.2 Corpora
In order to demonstrate the language-independence of the rhetoric stylometry method hereby proposed, we confronted the matching faculties of the initial "PERL-Compatible Rhetoric Figures" (PCRFs) with corpora written in diverse languages. More precisely, we have performed a rhetoric stylometry analysis of 4 corpora written by poets and orators who are often considered exemplary cases of the mastery of their respective languages. For the English language, the complete works of William Shakespeare were downloaded from Project Gutenberg (Hart, 2000). The same site served us as the source of 40 works of Johann Wolfgang Goethe written in the German language. When it comes to the original works of Jean-Baptiste Molière, 39 of them were recursively downloaded from the French site toutmoliere.net. Finally, the basic Latin manual of rhetoric, Cicero's De Oratore, was extracted from the corpus of the Perseus Project (Crane, 1998) in order to demonstrate that the PCRF-based approach can yield interesting results even when applied upon corpora written in antique languages.

The corpora from Project Gutenberg were downloaded as pure utf8-encoded text. No filtering of the data was performed, in order to analyze the data in their rawest possible form; the only exception was the stripping away of possible HTML tags by means of the standard HTML::Strip filter. Before the matching, the totality of each corpus was split into fragments whenever the frontier \n[^\w+] (i.e. a newline followed by at least one non-word character) was detected. Shakespeare's corpus was split into 109492 fragments, Goethe's into 46597 fragments, Cicero's into 970 fragments, while the works of Molière yielded 6639 fragments.

3 Results
In total, more than 7000 strings were matched by the 4 PCRFs within the 4 corpora, containing 17 megabytes of text split into more than 163040 textual fragments.
3 Results

In total, more than 7000 strings were matched by the 4 PCRFs within 4 corpora containing 17 megabytes of text split into more than 163040 textual fragments.

             Anadiplosis{2}  Anaphora{2,3}  Antimetabole{abcXcba}  Epiphora{2,}
Cicero       0.00309         0.2711         0                      0.0144
Goethe       0.00242         0.0717         0.0003                 0.0042
Molière      0.01129         0.1634         0.000602               0.0210
Shakespeare  0.00087         0.008          0.000219               0.008

Table 2: Relative frequencies of occurrence of diverse PCRFs within diverse corpora (PCRF matches per fragment)

As is indicated in Table 2, instances of anadiplosis, anaphora, antimetabole and epiphora were found in all 4 corpora involved in this study, the only exception being the absence of antimetabole in Cicero. In general, anaphora{2,3} seems to be the most frequent figure: the number of cases in which this PCRF succeeded to match greatly surpasses that of the other figures, especially for the Romance-language authors – almost every sixth fragment from Molière and every fourth from Cicero was matched by anaphora{2,3}. The only exception to this "dominance of anaphora" seems to be Shakespeare, whose complete works yielded exactly the same frequency of epiphora and anaphora occurrences.

             Anadiplosis{2}  Anaphora{2,3}  Antimetabole{abcXcba}  Epiphora{2,}
Cicero       20              1              4                      19
Goethe       44              3              33                     287
Molière      57              1              29                     65
Shakespeare  7               2              17                     64

Table 3: Elapsed time (in seconds) of different PCRF/corpus runs on an average desktop PC

As is indicated in Table 3, the computational demands of the PCRF-based approach are not high in the case of anaphora{2,3}. On the contrary, the recursive epiphora{2,} is much more demanding. As the recursive structure of this PCRF indicates, the cost of the matching process grows non-polynomially with the length of the textual fragment upon which the PCRF is applied, and the choice of a correct fragment separator token (c.f. 2.2) therefore seems to be of utmost importance.
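Elapsed times of the kind reported in Table 3 are easy to reproduce with the core Time::HiRes module. The following harness is our own illustration – the fragment list is invented – and not the script behind Table 3:

    #!/usr/bin/perl
    # Illustrative timing harness for a single PCRF run.
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my @fragments = ('Fear leads to anger. Anger leads to hate.') x 10_000;
    my $pcrf      = qr/((\w{3,})[.?!,] \2)/si;

    my $t0      = [gettimeofday];
    my $matches = 0;
    for my $frag (@fragments) {
        $matches++ while $frag =~ /$pcrf/g;
    }
    printf "%d matches in %.2f seconds\n", $matches, tv_interval($t0);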
4 Discussion

We propose a language-independent, parse-free method of extracting instances of rhetoric figures from natural language corpora by means of PERL-compatible regular expressions. The fact that PCREs implement features like back-references or recursive matching makes them good candidates for the detection & extraction of rhetoric figures which cannot be matched by simpler finite-state automata or context-free grammars. In order to demonstrate the feasibility of such an approach, we have "translated" the canonical definitions of anadiplosis, anaphora, antimetabole and epiphora into four PERL-compatible rhetoric figures – namely anadiplosis{2}, anaphora{2,3}, antimetabole{abcXcba} and epiphora{2,} – and applied them upon Latin, English, French and German corpora.

All four PCRFs successfully matched some strings in at least three of the four corpora, indicating that repetition-based rhetoric figures may well belong to the set of linguistic universalia (Greenberg, 1957). Anaphora{2,3} surpassed all the other figures in frequency of occurrence, the only exception being Shakespeare, in whose case the number of matched epiphorae was equal to the number of matched anaphorae.

We do not pretend that the PCRFs presented hereby are the most adequate translations of traditional anadiplosis, anaphora, antimetabole or epiphora into an artificial language. Since PCREs can contain quantifiers and classes, it is evident that for any set of strings – which is, in our case, the set F of all the occurrences of a given figure within its respective corpus – more than one regexp could be constructed that matches all members of the set F. It may therefore be the case that the PCRFs proposed in this "proof of concept" article are neither the most specific nor the fastest ones.

When it comes to specificity, a closer look at the extracted data indicates that the PCRFs proposed hereby have produced some "false positives", i.e. have matched strings which are not rhetorical figures. For example, the expression "FIRST LORD. O my sweet lord" was matched by epiphora{2,} when applied upon Shakespeare's corpus, but it is definitely not a rhetoric figure, since the substring in capital letters simply denotes the name of the dramatic persona pronouncing the following statement and is not a clause of the statement itself.

When it comes to speed, it is established that matching PCREs with an unbounded number of backreferences is NP-complete (Aho, 1991), and this may well be the reason for the very high run-times of the recursive epiphora{2,} in contrast to its non-recursive PCRF counterparts. From a practical point of view it therefore seems more suitable – especially in the case of the analysis of huge corpora – to stick to non-recursive PCRFs. Another possible way to speed up the matching – and in certain cases even to prevent the machine from falling into an "infinite recursion loop" – is to tune the "splitting parameter" so that the corpus is split into fragments of such a size that the NP-complexity of the matching PCRE has no observable implications upon the real run-time of the rhetoric figure detection process.

There are at least three different ways in which PCRFs could be useful. Firstly, since PCRFs are very fast and language-independent, they can allow scholars to extract huge numbers of instances of rhetoric figures from diverse corpora in order to create an exhaustive compendium of rhetoric figures. For example, the corpus of >7000 strings which were extracted from the corpora mentioned in this article (downloadable from http://www.lutin-userlab.fr/rhetoric/) could easily be put to use not only by teachers of language or rhetoric, but possibly also by those who aim to develop a semi-supervised system of rhetoric figure induction (c.f. the last paragraph). Manual annotation of such a compendium and subsequent tentatives of such figure-of-speech induction shall be presented in our forthcoming article.

Secondly, the extracted information concerning the quantities of various PCRFs within different corpora could serve as an input element (i.e. a feature) for classifying or clustering algorithms. PCRFs could therefore facilitate stylometric tasks like authorship attribution, author name disambiguation or maybe even plagiarism detection (a sketch of such a feature extraction follows below).

Thirdly, due to their language independence, the PCRFs presented hereby can be thought of as a means for evaluating differences between two different languages, or two different states of the same language. One can, for example, apply the PCRFs upon two different translations T1 and T2 and observe that the distribution of PCRFs within T1 is more similar to the distribution of PCRFs in the original than is the distribution within T2. One could then argue that, from a rhetoric, stylistic or even poetic standpoint, T1 is a more adequate translation of the original text than T2. On the other hand, when we speak about comparing two different states of the same language, we propose to perform PCRF-based analysis not only upon a corpus representing a mature state of the language – like that of a Shakespeare, for example – but also to compare such a state with earlier stages of language development, as represented by the CHILDES corpus (MacWhinney & Snow, 1985).
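Computationally, the second and third use cases reduce to turning PCRF relative frequencies into feature vectors and comparing those vectors. The sketch below is purely illustrative – the texts, the figure inventory and the distance measure are our own assumptions:

    #!/usr/bin/perl
    # Illustration: PCRF frequency vectors as stylometric features.
    use strict;
    use warnings;

    my %pcrf = (
        'anadiplosis{2}' => qr/((\w{3,})[.?!,] \2)/si,
        'antimetabole'   => qr/((\w{3,}) (.{0,23}) (\w{3,})[^\.!?]{0,23} \4 \3 \2)/si,
    );

    # Relative frequency of each PCRF per fragment of a text.
    sub pcrf_vector {
        my @fragments = split /\n/, shift;
        my %freq;
        for my $name (keys %pcrf) {
            my $hits = 0;
            for my $frag (@fragments) {
                $hits++ while $frag =~ /$pcrf{$name}/g;
            }
            $freq{$name} = $hits / @fragments;
        }
        return \%freq;
    }

    # Euclidean distance between two PCRF frequency vectors.
    sub distance {
        my ($u, $v) = @_;
        my $sum = 0;
        $sum += ($u->{$_} - $v->{$_})**2 for keys %$u;
        return sqrt $sum;
    }

    my $original = pcrf_vector("Fear leads to anger. Anger leads to hate.\nAlle wie einer, einer wie alle.");
    my $t1       = pcrf_vector("Fear breeds anger. Anger breeds hate.\nAll as one, one as all.");
    printf "distance(original, T1) = %.4f\n", distance($original, $t1);

A translation whose vector lies closer to the original's would, on this crude measure, count as the rhetorically more faithful one.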
Finally, by considering PCRFs to be a method which could be used as a tool for analyzing the development of language faculties in a human baby, we come closer to a third and somewhat "cognitive" application. This application – which is the subject of our current research – is based upon the belief that it is not unreasonable to imagine that PCRFs could be constructed not manually, but automatically, by means of the genetic programming paradigm (Koza, 1992). Given that the PCRE language is one of the most concise programming languages conceivable, and given that 1) the speed of execution, 2) the specificity and 3) the sensitivity could serve as input parameters of a function evaluating the fitness of a PCRF candidate (a hypothetical sketch of such a function is given below), it is possible that the research initiated by our current proposal could result in a full-fledged and possibly non-supervised method of rhetoric figure induction.

In such a way, our PCRFs could become something a little more than just another tool for stylometric analysis of textual corpora – they could help to answer a somewhat more fundamental question: "What is the essence of figures of speech and how could they be represented within & by an artificial and/or organic symbol-manipulating agent?"
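A fitness function of the kind described above could take roughly the following shape. The sketch is entirely hypothetical – the annotated sample, the weights and the specificity proxy are our own assumptions, not part of any implemented system:

    #!/usr/bin/perl
    # Hypothetical fitness function for a candidate PCRF (GP paradigm).
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    # Annotated sample: fragment => 1 if it contains the target figure, else 0.
    my %annotated = (
        'Fear leads to anger. Anger leads to hate.' => 1,
        'This fragment contains no figure at all.'  => 0,
    );

    sub fitness {
        my ($candidate) = @_;                           # candidate regex as a string
        my $re = eval { qr/$candidate/si } or return 0; # malformed => worst fitness

        my ($tp, $fp, $fn) = (0, 0, 0);
        my $t0 = [gettimeofday];
        while ( my ($frag, $label) = each %annotated ) {
            my $matched = ($frag =~ $re) ? 1 : 0;
            $tp++ if $matched  && $label;
            $fp++ if $matched  && !$label;
            $fn++ if !$matched && $label;
        }
        my $elapsed     = tv_interval($t0);
        my $sensitivity = ($tp + $fn) ? $tp / ($tp + $fn) : 0;
        my $specificity = 1 / (1 + $fp);                # crude false-positive penalty

        # Weighted combination of sensitivity, specificity and speed (weights assumed).
        return 0.5 * $sensitivity + 0.4 * $specificity + 0.1 / (1 + $elapsed);
    }

    printf "fitness = %.3f\n", fitness('((\w{3,})[.?!,] \2)');

Candidate PCRFs would then be varied, evaluated and selected upon such a score across GP generations.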
Acknowledgments

The author wishes to express his gratitude to the University Paris 8 – St. Denis and the Lutin Userlab for support without which the research hereby presented would not have been possible, as well as to thank the philologists and comparatists of the École Pratique des Hautes Études and the ÉNS for keeping alive the Tradition within which Language is considered to be something more than just an object of parsing and POS-tagging.

References

Aho, A. V. (1991). Algorithms for finding patterns in strings. In Handbook of Theoretical Computer Science (vol. A): Algorithms and Complexity. Cambridge, MA: MIT Press.
Aristotle. (1954). Rhetoric. 1355b.
Crane, G. (1998). The Perseus Project and beyond: How building a digital library challenges the humanities and technology. D-Lib Magazine, 1, 18.
Crosswhite, J., Fox, J., Reed, C., Scaltsas, T., & Stumpf, S. (2003). Computational models of rhetorical argument. In Argumentation Machines – New Frontiers in Argument and Computation, 175–209.
Dubois, J., et al. (1970). Rhétorique générale: par le Groupe µ. Paris: Larousse.
Friedl, J. (2006). Mastering Regular Expressions. Sebastopol, CA: O'Reilly Media.
Gawryjolek, J. (2009). Automated annotation and visualization of rhetorical figures.
Grasso, F. (2002). Towards computational rhetoric. Informal Logic, 22(3).
Greenberg, J. H. (1957). The nature and uses of linguistic typologies. International Journal of American Linguistics, 23(2), 68–77.
Harris, R., & DiMarco, C. (2009). Constructing a rhetorical figuration ontology. In Persuasive Technology and Digital Behaviour Intervention Symposium.
Hart, M. (2000). Project Gutenberg.
Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press.
MacWhinney, B., & Snow, C. (1985). The child language data exchange system. Journal of Child Language, 12(2), 271–295.
Perelman, C., & Olbrechts-Tyteca, L. (1969). The New Rhetoric: A Treatise on Argumentation.
Plato. (1986). Phaedrus. 261e.
Solan, Z., Horn, D., Ruppin, E., & Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102(33), 11629.
Taboada, M., & Mann, W. C. (2006). Rhetorical Structure Theory: looking back and moving ahead. Discourse Studies, 8(3).
Wall, L., & Loukides, M. (2000). Programming Perl. Sebastopol, CA: O'Reilly Media.

PROCEEDINGS IACAP 2011
FIRST INTERNATIONAL CONFERENCE OF IACAP
THE COMPUTATIONAL TURN: PAST, PRESENTS, FUTURES?
4 – 6 JULY 2011, AARHUS UNIVERSITY

Printed with the financial support of the Heinz Nixdorf Institute, University Paderborn, Germany
© Verlagshaus Monsenstein und Vannerdat OHG, Am Hawerkamp 31, 48155 Münster

"The Computational Turn: Past, Presents, Futures?"

Dear participants,

In the West, philosophical attention to computation and computational devices is at least as old as Leibniz. But since the early 1940s, electronic computers have evolved from a few machines filling several rooms to widely diffused – indeed, ubiquitous – devices, ranging from networked desktops, laptops and smartphones to "the internet of things." Along the way, initial philosophical attention – in particular, to the ethical and social implications of these devices (so Norbert Wiener, 1950) – became sufficiently broad and influential as to justify the phrase "the computational turn" by the 1980s. In part, the computational turn referred to the multiple ways in which the increasing availability and usability of computers allowed philosophers to explore a range of traditional philosophical interests – e.g., in logic, artificial intelligence, philosophical mathematics, ethics, political philosophy, epistemology, ontology, to name a few – in new ways, often shedding significant new light on traditional issues and arguments. Simultaneously, computer scientists, mathematicians, and others whose work focused on computation and computational devices often found their work to evoke (if not force) reflection and debate precisely on the philosophical assumptions and potential implications of their research. These two large streams of development – especially as calling for necessary interdisciplinary dialogues that crossed what were otherwise often hard disciplinary boundaries – inspired what became the first of the Computing and Philosophy (CAP) conferences in 1986 (devoted to Computer-Assisted Instruction in philosophy).

Since 1986, CAP conferences have grown in scope and range to include an extensive array of intersections between computation and philosophy as explored across a global range of cultures and traditions – issuing in fruitful cross-disciplinary collaborations and numerous watershed insights and contributions to scholarly reflection and publication. In keeping with what has now become a significant tradition of critical inquiry and reflection in these domains, IACAP'11 celebrates the 25th anniversary of CAP conferences by focusing on the past, present(s), and possible future(s) of the computational turn.

Aarhus, July 2011
Charles Ess, Organizer, Department of Information- and Media Studies, Aarhus University
Ruth Hagengruber, Program Chair, Paderborn University

ACKNOWLEDGEMENTS

Happily, in planning and organizing IACAP'11, I have received generous support and encouragement from more persons and institutions than can be fully listed here – beginning with the Track Chairs, members of the Program Committee / Comité scientifique, and the keynote speakers who have kindly accepted our invitation to join us in Aarhus for our conference.
In addition, I would like to express deep gratitude to my colleagues in the Department of Information- and Media Studies (IMV), Aarhus University, including the highly competent members of the secretariat and our chair, Steffen Ejnar Brandorff. Without your on-going encouragement, assistance, and financial support, IACAP'11 would simply not have taken place at Aarhus University. I am also very grateful to Aarhus University for additional forms of support, including their conference facilities, and most especially for the very able assistance of Ulla Rasmussen Billings (Faculty Secretariat) and her advice on multiple conference matters, including budgeting and the conference registration page.

For the first time in its now 25-year history, IACAP has offered travel bursaries to support the participation of our younger colleagues: Dr. Johnny Søraker has ably taken on the difficult chore of coordinating the awarding of these bursaries. Many thanks (mange tak!).

Finally, a thousand thanks (tusind tak!) to Prof. Dr. Ruth Hagengruber (Universität Paderborn), who has undertaken not only the daunting role of Program Chair, but also the editing and production of these Proceedings for IACAP'11.

Aarhus, July 2011
Charles Ess

Table of Contents

Keynotes

Presidential address
Beavers, Anthony F. – IS ETHICS COMPUTABLE, OR WHAT OTHER THAN CAN DOES OUGHT IMPLY?
Aas, Katja Franko – (IN)SECURE IDENTITIES: ICTS, TRUST AND BIOPOLITICAL TATTOOS

Covey Lifetime Achievement Award
Bynum, Terrell Ward – INFORMATION AND DEEP METAPHYSICS

Herbert A. Simon Award for Outstanding Research in Computing and Philosophy
Sullins, John P. – THE NEXT STEPS IN ROBOETHICS

Brian Michael Goldberg Award for Outstanding Graduate Research in Computing and Philosophy (sponsored by Carnegie Mellon University)
Buckner, Cameron – COMPUTATIONAL METHODS FOR THE 21ST CENTURY PHILOSOPHER: RECENT ADVANCES AND CHALLENGES IN COGNITIVE SCIENCE AND METAPHILOSOPHY

Panel
Charles Ess / Elizabeth Buchanan / Jeremy Mauger – INTERNET RESEARCH ETHICS: CORE CHALLENGES, NEW DIRECTIONS

Tracks

Track I: Philosophy of Computer Science
Bengez, Rainhard Z. – RULES AND PROGRAMMING LANGUAGES
Blanco, Javier O. et alia – A BEHAVIOURAL CHARACTERIZATION OF COMPUTATIONAL SYSTEMS
Boltuc, Peter – WHAT IS THE DIFFERENCE BETWEEN YOUR FRIEND AND A CHURCH-TURING LOVER
Chokvasin, Theptawee – HAECCEITY AND INFORMATION
Duran, Juan M. – THE LIMITS OF COMPUTER SIMULATIONS AS EPISTEMIC TOOLS
Franchette, Florent – WHY TO BUILD A PHYSICAL MODEL OF HYPERCOMPUTATION
Geier, Fabian – THE MATERIALISTIC FALLACY
Meyer, Steven – THE EFFECT OF COMPUTERS UNDERSTANDING TRUTH ON PHILOSOPHY OF COMPUTER SCIENCE
Monin, Alexandre, Halpin, Harry – THE ARTIFACTUALIZATION OF THE WEB AS ONTOLOGICAL COMMITMENTS
Pagano, Miguel – SEMANTICS OF PROGRAMMING LANGUAGES
Riss, Uwe
Sinclair, Nathan – QUINEAN HOLISM AND THE INDETERMINANCY OF COMPILATION
Smith, Lindsay – IS FINDING A 'BLACK SWAN' (POPPER, 1936) POSSIBLE IN SOFTWARE DEVELOPMENT?
Solodovnik, Iryna – ONTOLOGY: FROM PHILOSOPHY TO ICT AND RELATED AREAS. PROBLEMS AND PERSPECTIVES
Thürmel, Sabine – THE EVOLUTION OF SOFTWARE AGENTS AS DIGITAL OBJECTS
Turner, Raymond – MACHINES AND COMPUTATIONS

Track II: Philosophy of Information and Cognition
Funcke, Alexander – ON THE LEVEL OF CREATIVITY. PONDERINGS ON THE NATURE OF KANTIAN CATEGORIES, CREATIVITY AND COPYRIGHTS
Giardino, Valeria – THE FOURTH REVOLUTION AND SEMANTIC INFORMATION
Heersmink, Richard – EPISTEMOLOGICAL AND PHENOMENOLOGICAL ISSUES IN THE USE OF BRAIN-COMPUTER INTERFACES
Hewlett, David, Cohen, Paul – AN INFORMATION-THEORETIC MODEL OF CHUNKING
Janlert, Lars-Erik – THE DYNAMISM OF INFORMATION ACCESS FOR A MOBILE AGENT IN A DYNAMIC SETTING AND SOME OF ITS IMPLICATIONS
Kitto, Kirsty – CONTEXTUAL INFORMATION: MODELING DIFFERENT INTERPRETATIONS OF THE SAME DATA WITHIN A GEOMETRIC FRAMEWORK
Menant, Christophe – COGNITION AS A MANAGEMENT OF MEANINGFUL INFORMATION: PROPOSAL FOR AN EVOLUTIONARY APPROACH
Quiroz, Francisco Hernandez – COMPUTATIONAL AND HUMAN MIND MODEL
Schroeder, Marcin – SEMANTICS OF INFORMATION: MEANING AND TRUTH AS RELATIONSHIPS BETWEEN INFORMATION CARRIERS
Vakarelov, Orlin – PRE-COGNITIVE SEMANTIC INFORMATION

Track III: Autonomous Robots and Artificial Cognitive Systems
Anokhina, Margaryta, Dodig-Crnkovic, Gordana – WHO WILL HAVE IRRESPONSIBLE, UNTRUSTWORTHY, IMMORAL INTELLIGENT ROBOT? WHY ARTIFACTUALLY INTELLIGENT ADAPTIVE AUTONOMOUS AGENTS NEED TO BE ARTIFACTUALLY MORAL?
Arkin, Ronald – THE ETHICS OF ROBOTIC DECEPTION
Bello, Paul et alia – PROLEGOMENON TO ANY FUTURE THEORY OF MACHINE AUTONOMY
Briggs, Gordon – AUTONOMOUS AGENTS AND SENSES OF RESPONSIBILITY
Hagengruber, Ruth – THE ENGINEERABILITY OF SOCIAL INSTITUTIONS
Heimo, Olli I., Kimppa, Kai K. – RESPONSIBILITY IN ACQUIRING CRITICAL E-GOVERNMENT SYSTEMS: WHOSE FAULT IS FAILURE?
Kavathatzopoulos, Iordanis, Laaksoharju, Mikael – WHAT ARE ETHICAL AGENTS AND HOW CAN WE MAKE THEM WORK PROPERLY?
Molyneux, Bernard – HOW THE HARD PROBLEM OF CONSCIOUSNESS MIGHT ARISE FOR AN EMBODIED (SYMBOL) SYSTEM
Vallverdu, Jordi, Casacuberta, David – THE GAME OF EMOTIONS (GOE): AN EVOLUTIONARY APPROACH TO AI DECISIONS
Veale, Richard – THE CASE FOR DEVELOPMENTAL NEUROROBOTICS
Waser, Mark R. – WISDOM DOES IMPLY BENEVOLENCE

Track IV: Technosecurity from Everyday Surveillance to Digital Warfare
Crutzen, C.K.M. – THE MASKING AND UNMASKING OF PRIVACY
Hempel, Leon – CHANGE AND CONTINUITY – FROM THE CLOSED WORLD OF BIPOLARITY TO THE CLOSED WORLD OF THE PRESENT
Macnish, Kevin – SUBITO AND THE ETHICS OF AUTOMATING THREAT ASSESSMENT
Othmer, Julius, Weich, Andreas – MATCHING – POPULAR MEDIA BETWEEN SECURITY WORLDS AND CULTURES OF RISK
Taddeo, Mariarosa – INFORMATIONAL WARFARE AND JUST WAR THEORY
Weber, Jutta – TECHNO-SECURITY, RISK AND THE MILITARIZATION OF EVERYDAY LIFE

Track V: Information Ethics, Robot Ethics
Asaro, Peter – IS THERE A HUMAN RIGHT NOT TO BE KILLED BY A MACHINE?
Dasch, Thomas – DO WE NEED A UNIVERSAL INFORMATION ETHICS?
Douglas, Keith – A PSEUDOPERIPATETIC APPLICATION SECURITY HANDBOOK FOR VIRTUOUS SOFTWARE
Hromada, Daniel D. – THE CENTRAL PROBLEM OF ROBOETHICS: FROM DEFINITION TOWARDS SOLUTION
Soraker, Johnny Hartz – AFFECTING THE WORLD OR AFFECTING THE MIND? THE ROLE OF MIND IN COMPUTER ETHICS
Tonkens, Ryan – THE ETHICS OF AUTOMATED WARFARE
Vallor, Shannon – CAREBOTS AND CAREGIVERS: ROBOTICS AND THE ETHICAL IDEA OF CARE
Wong, Pak-Hang – CO-CONSTRUCTION AND CO-MANAGEMENT OF ONLINE IDENTITIES: A CONFUCIAN PERSPECTIVE

Track VI: Multidisciplinary Perspectives
Baumgaertner, Bert – REFLECTIVE INEQUILIBRIUM
Belfer, Israel – THE INFORMATION-COMPUTATION TURN: A HACKING-TYPE REVOLUTION
Breems, Nick – COMPUTERS AND PROCRASTINATION: "I'LL JUST CHECK MY FACEBOOK QUICK A SECOND"
Bod, Rens et alia – HOW MUCH DO FORMAL NARRATIVE ANNOTATIONS DIFFER? A PROPPIAN CASE STUDY
Desclés, Jean-Pierre et alia – COMBINATORY LOGIC WITH FUNCTIONAL TYPES IS A GENERAL FORMALISM FOR COMPUTING COGNITIVE AND SEMANTIC REPRESENTATIONS
Franchi, Stefano – THE PAST, PRESENT AND FUTURE ENCOUNTERS BETWEEN COMPUTATIONS AND THE HUMANITIES
Guarini, Marcello et alia – REFLECTIONS ON NEUROCOMPUTATIONAL RELIABILISM
McKinley, Steve – STATES OF AFFAIRS AND INFORMATION OBJECTS
McKinley, Steve – INFORMATION AND SCIENTIFIC EXPLANATION
Nicolaidis, Michael – BIOLOGICALLY INSPIRED SINGLE-CHIP MASSIVELY PARALLEL SELF-HEALING, SELF-REGULATING TERA-DEVICE COMPUTERS: PHILOSOPHICAL IMPLICATIONS OF THE EFFORTS FOR SOLVING TECHNOLOGICAL SHOW-STOPPERS IN THE PATH OF THE NEXT COMPUTATIONAL TURN
Portier, Pierre-Edouard, Calabretto, Sylvie – STRUCTURAL CONSTRAINTS FOR THE CONSTRUCTION OF MULTI-STRUCTURED DOCUMENTS
York, William W., Ekbia, Hamid R. – (DIS)TASTEFUL MACHINES? AESTHETIC COGNITION AND THE COMPUTATIONAL TURN IN AESTHETICS

Track VII: Social Computing
Alhutter, Doris – THE SOCIAL AND ITS POLITICAL DIMENSION IN SOFTWARE DESIGN: A SOCIO-POLITICAL APPROACH
Barker, Steve – A SOCIAL EPISTEMOLOGICAL APPROACH FOR DISTRIBUTED COMPUTER SECURITY
Coeckelbergh, Mark – TRUST, POWER AND INFORMATION TECHNOLOGY
Compagna, Diego – THE BENEFITS OF SOCIAL THEORY FOR MODELLING STABLE ENVIRONMENTS OF SYSTEMIC TRUST WITHIN MULTI AGENT SYSTEMS
Danka, Istvan – COMPUTER NETWORKS AND THE PHILOSOPHY OF MIND. A SOCIAL MIND – NETWORKED COMPUTER ANALOGY
Dodig-Crnkovic, Gordana – AGENT BASED MODELING WITH APPLICATIONS TO SOCIAL COMPUTING
Ekbia, Hamid R., Zhang, Guo – OBJECTS OF IDENTITY, IDENTITY OF OBJECTS: FOR A MATERIALIST ACCOUNT OF ONLINE BEHAVIOUR
Ropolyi, Laszlo – THE CONSTRUCTION OF REALITY AND OF SOCIAL BEING IN THE INFORMATION AGE
Simon, Judith – TRUST, KNOWLEDGE AND SOCIAL COMPUTING. RELATING PHILOSOPHY OF COMPUTING AND EPISTEMOLOGY
Vehlken, Sebastian – OPERATIONAL IMAGES. AGENT-BASED COMPUTER SIMULATIONS AND THE EPISTEMIC IMPACT OF DYNAMIC VISUALIZATION
Zambak, Aziz – SOCIAL COMPUTATION AS A DISCOVERY MODEL FOR THE SOCIAL SCIENCES

Track VIII: IT, Culture and Globalization
Asai, Ryoko et alia – THE REVIVAL OF NATIONAL AND CULTURAL IDENTITY THROUGH SOCIAL MEDIA
Backhaus, Patrick, Dodig-Crnkovic, Gordana – WIKILEAKS AND THE ETHICS OF WHISTLE BLOWING
De Gooijer, Thijmen – INTERPRETING CODES OF ETHICS IN GLOBAL SOFTWARE ENGINEERING
Hongladarom, Soraj – INFORMATION, TECHNOLOGY, GLOBALIZATION AND INTELLECTUAL PROPERTY RIGHTS

Track IX: Surveillance, sousveillance…
Beinsteiner, Andreas – TOWARDS A HERMENEUTIC PHENOMENOLOGY OF CYBER-SPACE: POWER VS. CONTROL
Ganascia, Jean-Gabriel – THE WIKILEAKS LOGIC
Najar, Anis – DEMOCRACY 2.0 – HOW THE WEB MAKES REVOLUTION
Reynolds, Carson – NEGATIVE SOUSVEILLANCE
Strauss, Stefan – GOVERNMENT APPROACHES FOR MANAGING ELECTRONIC IDENTITIES OF CITIZENS – EVOKING A CONTROL DILEMMA?

Track X: SIG Track – Machines and Mentality
Arkin, Ronald C. – MORAL EMOTIONS FOR ROBOTS
Arkoudas, Konstantine – ON DEEPLY UNCONSCIOUS INTENTIONAL STATES
Bridewell, Will et alia – OUTLINING A COMPUTATIONALLY PLAUSIBLE APPROACH TO MENTAL STATE ASCRIPTION
Guarini, Marcello – AGENCY: ON MACHINES THAT MENTALIZE
Nirenburg, Sergej – TOWARD A TESTBED FOR MODELING THE KNOWLEDGE, GOALS AND MENTAL STATES OF OTHERS
Scheutz, Matthias – ARCHITECTURAL STEPS TOWARDS SELF-AWARE ROBOTS
Sundar, Naveen, Bringsjord, Selmer – LOGIC-BASED SIMULATIONS OF MIRROR TESTING FOR SELF-CONSCIOUSNESS

List of Authors in Alphabetic Order

Keynotes

IS ETHICS COMPUTABLE, OR WHAT OTHER THAN CAN DOES OUGHT IMPLY?

ANTHONY F. BEAVERS
Department of Philosophy
The University of Evansville

In 2007, Anderson and Anderson wrote, "As Daniel Dennett (2006) recently stated, AI 'makes philosophy honest.' Ethics must be made computable in order to make it clear exactly how agents ought to behave in ethical dilemmas" (16). To rephrase, a computable system or theory of ethics makes ethics honest. But at what cost? Might Turing's 1950 prophecy that "at the end of the century the use of words … will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted" (1950, 442) soon take on normative dimensions due to research in artificial morality? Will attempts to make ethics computable lead us to redefine the term "moral" to fit the case of machines and thus change its meaning for humans also? I call this the threat of "moral nihilism … the doctrine that states that morality needs no internal sanctions, that ethics can get by without moral 'weight,' i.e., without some type of psychological force that restrains the satisfaction of our desire and that makes us care about our moral condition in the first place" (Beavers, 2011a).

Analyzing this possibility requires inspection of the meaning of the term "ought" and what it implies. In 2009, I argued that, following Kant, ought not only implies can, but also might not, in which case it would be morally wrong to create artificial Kantian agents, since doing so would require designing them in such a way that they could act immorally, but would not do so. Only on such a condition would it make sense to hold a machine responsible for its actions and praise or blame it for its behavior. In 2011, I argued that if ought implies can, then it also implies implementability. If a machine or human can act morally, this can only be because the mechanisms (whether in software or wetware) have the requisite components to allow for it. Thus, any theory of morality must be implementable in real working agents to qualify as a viable moral theory.
Given the conclusions of 2009, I argued in 2011 that designing machines in such a way that they behaved morally but were not able to act immorally would require redefining the term "morality" in such a way that full moral agency with internal sanctions was not intrinsic to ethics, but "merely a sufficient, and no longer necessary, condition for being ethical." In this case, internal states such as conscience, responsibility (as felt affective weight) and thus moral accountability are, ex hypothesi, not necessary for ethics either. Thus, if we build machines capable of being described by the term "moral", we can only do so by redefining the term. So, if a time is coming when we can speak of a machine as moral without expecting to be contradicted, we will have succeeded in turning ethics into a strictly extrinsic, behavioral affair in which internals are irrelevant.

Since, on the surface, an ethics without an ought is as empty as thinking without insight or wisdom, it is necessary to explore what else ought implies in order to form an adequate conception of a metaphysics of morals that will fit the information age. While other research for a working conception of ethics has already been done (e.g., Floridi and Sanders, 2004), a careful exploration of this foundational concept still appears lacking. I hope to fill this gap to explore whether ethics can get by without its cherished ought and, if so, what that implies for ethics more generally. The concern guiding this talk is whether the information age is issuing in a post-ethical age or whether it is leading to a redefinition of ethics that is both long overdue and needed.

References

Anderson, M., & Anderson, S. (2007). Machine ethics: Creating an ethical intelligent agent. AI Magazine, 28(4): 15–26.
Beavers, A. (2011). Moral machines and the threat of ethical nihilism. In P. Lin, G. Bekey & K. Abney (Eds.), Robot Ethics: The Ethical and Social Implications of Robotics. Cambridge, MA: MIT Press, forthcoming.
Beavers, A. (2009, March). Between angels and animals: The question of robot ethics, or is Kantian moral agency desirable? The Eighteenth Annual Meeting of the Association for Practical and Professional Ethics, Cincinnati, Ohio.
Dennett, D. (2006, May). Computers as prostheses for the imagination. The International Computers and Philosophy Conference, Laval, France.
Floridi, L., & Sanders, J. (2004). On the morality of artificial agents. Minds and Machines, 14(3): 349–379.
Turing, A. (1950). Computing machinery and intelligence. Mind, 59: 433–460.

(IN)SECURE IDENTITIES: ICTS, TRUST AND 'BIO-POLITICAL' TATTOOS

KATJA AAS
Department of Criminology and Sociology of Law
University of Oslo

The globalising world is marked by anonymity, mass mobility and mass consumerism. These conditions create a distinct set of challenges for social identification practices, first and foremost the challenge of creating reliable and 'trustworthy' identities. The paper addresses in particular the growing reliance on biometrics and biometric databases and examines how these forms of bodily control function as border controls. While revealing specific notions of subjectivity, the paper also explores how these technologies function as mechanisms of social sorting and global governance and have markedly different effects on the citizens of the global North and the global South.
INFORMATION AND DEEP METAPHYSICS

TERRELL WARD BYNUM
Department of Philosophy
Southern Connecticut State University

Scientists working on the cutting edges of their field often engage in thinking that is much like metaphysics. Similarly, in the past, philosophers inspired by major advances in science have made significant additions to metaphysics, as well as other branches of philosophy. On occasion, the scientists and philosophers have been the very same people. For example, in ancient times Aristotle created physics, biology and animal psychology, and at the same time he made related contributions to metaphysics, logic, epistemology, and other branches of philosophy. Again, during the Enlightenment in Europe, influential philosophers like Descartes and Leibniz also were respected scientists and first-class mathematicians. At times, people who were primarily scientists (for example, Copernicus, Galileo, and Newton) inspired thinkers who were primarily philosophers (for example, Hobbes, Locke, and Kant). In more recent times, revolutionary scientific contributions of Darwin, Einstein, Schrödinger, Heisenberg, and others significantly influenced the philosophical ideas of people like Spencer, Russell, Whitehead, Popper, and many more.

Today, in the early years of the twenty-first century, developments in cosmology and quantum physics appear likely to alter significantly our scientific understanding of the universe, of life, and of the human mind; and many scientists have become convinced that the universe, ultimately, is made of quantum information. These developments, it seems to me, are very likely to lead to important new contributions to philosophy; and indeed, as illustrated by Luciano Floridi's writings on informational realism and the philosophy of information, significant philosophical contributions have already begun to appear. Of special interest in this presentation is the idea that the universe is a vast "ocean" of quantum bits ("qubits"), and thus each object or process in the universe can be seen as a constantly changing data structure comprised of qubits. On this account of the ultimate nature of the universe, the fundamental "stuff" of which our universe is made is quantum information. Unlike traditional "bits", such as those processed in most of today's information technology devices, "qubits" have quantum features such as genuine randomness, superposition and entanglement – features that Einstein and other scientists considered "spooky" or "weird". These nontraditional features of qubits have made it possible to achieve unbreakable encryption, teleportation, and a new kind of computing – "quantum computing". In this presentation, a number of quantum topics, such as randomness, superposition, entanglement, collapse of a wave function, teleportation, and quantum computing, are briefly described. In light of such quantum features, it seems appropriate for philosophers to re-examine a variety of philosophical concepts, such as possibility and impossibility, potential and actual, cause and effect, being and reality, logic and contradiction, and a number of others. Such concepts are central to the "deep metaphysics" that provides a conceptual foundation for philosophy.
Consequently, this presentation calls upon philosophers to familiarize themselves with current developments in cosmology and quantum physics, especially those developments that see the universe as ultimately an expanding ocean of quantum information. If philosophers take on this challenge – as Luciano Floridi has already begun to do – the deep metaphysical foundations of philosophy are likely to be profoundly transformed. As a small contribution to that effort, this presentation concludes with a brief sketch of a possible new metaphysical theory.

THE NEXT STEPS IN ROBOETHICS

JOHN P. SULLINS
Department of Philosophy
Sonoma State University

RoboEthics has now matured from its beginnings as a curious offshoot of computer ethics into a sub-discipline of its own that has a well-defined scope of study. In this paper I will briefly look at the growth of RoboEthics and the important role it is playing in the development of robotics technology. I will then look at the more pressing open problems in RoboEthics and suggest some ways forward. I will focus primarily on the criticism that RoboEthics is impossible given that phronesis is beyond the capacity of machines. To refute this claim I will propose a model system inspired by the architecture of the IBM Watson computer that, I will argue, could achieve an artificial practical wisdom. This would be possible through the use of a context-sensitive hybrid of logical and non-logical search methods that could access documents to find comparable exemplar cases similar to the ethical situation the robot is attempting to reason about. Armed with this data, the robot would be able to make more nuanced decisions even without its own innate, human-equivalent practical wisdom.

COMPUTATIONAL METHODS FOR THE 21ST-CENTURY PHILOSOPHER: RECENT ADVANCES AND CHALLENGES IN COGNITIVE SCIENCE AND METAPHILOSOPHY

CAMERON BUCKNER
Department of Philosophy
Indiana University

As evidenced by past CAP conferences, the intersection of computing and philosophy has long been a fertile area of research. The past ten years in particular have produced a variety of new computational techniques of philosophical import. These powerful new techniques present philosophers with alluring opportunities, but also pose a number of challenges requiring methodological reforms. In cognitive science, new computational models of psychological processes are rapidly increasing our ability to predict behaviors, but the structure of these models seems to make a hash of traditional distinctions in psychology, such as that between cognition and association. In metaphilosophy, new statistical and logical programming methods offer the possibility of addressing otherwise intractable philosophical questions, but rely upon a variety of assumptions, require input data that can be expensive to collect, and produce results that can be difficult to evaluate. In this talk, I will review some of these new technologies, recommending new conceptual frameworks and methodologies to understand, evaluate, and utilize their results. While I will give a brief overview of this latest generation of research, the talk will focus primarily on specific examples from my own work in the areas of comparative psychology and dynamic ontology.
Panel

INTERNET RESEARCH ETHICS: CORE CHALLENGES, NEW DIRECTIONS

Charles Ess
Department of Information- and Media Studies
Aarhus University

Elizabeth Buchanan
Director, Center for Applied Ethics, University of Wisconsin-Stout
Co-Director, International Society for Ethics & Information Technology (INSEIT)

Jeremy Mauger
School of Information Studies
University of Wisconsin, Milwaukee

Internet Research Ethics (IRE) is an emerging cross-disciplinary field which studies how research is conducted in online environments and seeks to resolve the subsequent ethical dilemmas in normative and practical terms. While similar to its physical counterpart, conducting scholarly research online is different in terms of ethics and values. For example, online surveys bring new privacy concerns. Research in chat rooms confounds our notions of subject anonymity and identifiability. Scraping data from social networks or public blogs complicates issues of informed consent. At the same time, research conducted on and through the Internet has expanded exponentially in the last ten years; researchers across disciplines make frequent use of such tools as online survey generators, as well as engage in forms of participant observation of virtual worlds. Internet Research Ethics has thus emerged over the past decade as a distinct and important field of applied ethics – one that overlaps with central issues and approaches of information and computing ethics and that often informs (and is informed by) the broader intersections between computing and philosophy.

The panel will begin with a few real-world examples of ethical dilemmas that are representative of contemporary issues in IRE and are especially challenging to traditional ethics. Panelists will then provide an overview of two current projects focused on significantly developing the field of IRE, beginning with the current revision of the Association of Internet Researchers' (AoIR) ethical guidelines. These guidelines, adopted by AoIR in 2002, have found extensive use around the world as a helpful guide to analyzing and resolving ethical issues in Internet research. The current revision seeks to update the guidelines in light of the dramatic expansion of Internet research following on the emergence of so-called Web 2.0 technologies and the ongoing global diffusion of the Internet. The second project is the Internet Research Ethics Digital Library, Research Center, and Commons (http://www.internetresearchethics.org/). This ongoing project is the result of a grant awarded by the National Science Foundation to the Center for Information Policy Research at the University of Wisconsin-Milwaukee's School of Information Studies. A primary goal of this project is to develop and provide sound resources, a solidified research base, and expert advice as more researchers and more IRBs/ethics boards struggle with the complexities of Internet research ethics. Both projects thus share an emphasis on praxis – i.e., analyzing and responding to real-world dilemmas faced by a growing research community around the globe. Following these introductions and overviews, the panel will invite critical discussion of the representative issues, approaches, and resources.
As well, the panel will welcome comments and suggestions from participants for additional resources and insights that will contribute to both projects – and will invite suggestions as to ways in which these projects may in turn contribute to contemporary work in information and computing ethics. A last goal of the panel is to develop a better articulation – a conceptual map – of the multiple relationships between IRE as a field of information and computing ethics and other characteristic foci and thematics of computing and philosophy.

Track I: Philosophy of Computer Science

RULES AND PROGRAMMING LANGUAGES

RAINHARD Z. BENGEZ
Philosophy of Science, Technology, and Engineering Department
Carl von Linde Academy, TUM School of Education
TU München, Arcisstr. 21, 80333 München, Germany
bengez@tum.de

Abstract. In computer science and related fields we talk much about rules. The word rule appears very often, directly or unspoken, in papers concerning computer science or the philosophy of computer science. We talk about logic(s), interpreters, procedures and compilers, systems of rules, programming languages, automata and rules of software design, good practices, and much, much more. But, unfortunately, the meanings of the word rule to which one refers from case to case seem to be unclear. In my contribution I would like to try to show some of these ambiguities and discuss ways to avoid them. According to the nature of this subject, my contribution is both analytical and normative, because I will analyze some applications of the word and work out a traceable direction for its use. Admittedly, the word rule has so many directions for use in computer science and the philosophy of computer science that I cannot talk about most of them. I will restrict myself to rules inducing action, and especially to such rules in programming languages (DSL, specification, etc.). This would mean that rules are guiding actions in languages or, stated more generally, in sequentially structured patterns. I will start by talking about the interdependence of rules and actions.

A BEHAVIORAL CHARACTERIZATION OF COMPUTATIONAL SYSTEMS

JAVIER BLANCO, Universidad Nacional de Cordoba, Argentina
RENATO CHERINI, Universidad Nacional de Cordoba, Argentina
MARTIN DILLER, Universidad Nacional de Cordoba, Argentina
AND PÍO GARCÍA, Universidad Nacional de Cordoba, Argentina

Abstract. We introduce the concept of interpreter as a producer of behavior in response to some input that codifies it. We argue that the notion of interpreter captures the minimal characteristics shared by different kinds of computational devices, and can thus serve as a criterion for identifying how interesting a computational system is. This characterization contrasts with many of the current functional descriptions offered in the literature on this topic, in that these are somewhat dependent on the technology that is currently available. Since the concept of interpreter can be used to compare different systems, it defines a computational hierarchy, establishing the relative degree of computationalism of different systems. This enables us to restate some ontological questions, such as "what is a program?" and "when is a system computational?", in more precise terms which admit clearer answers.

Any system can be characterized in terms of its possible behaviors.
In particular, a useful description of a computational system is given by the relationship, characteristic of the system, between an input and the behavior produced as a response to that input. The feature that distinguishes computational systems from other types of systems is that they may produce a very large and interesting set of behaviors, depending on syntactic inputs and "without changing a single wire" (Dijkstra, 1988). Thus, the characteristic input-behavior relation implicitly defines an encoding of behaviors as syntactic objects. We have suggested in (Blanco et al., 2011) that some key aspects of computational systems can be captured by the ubiquitous concept of interpreter as used both in theoretical and applied computer science (Jones, 1997; Abelson & Sussman, 1996; Jifeng & Hoare, 1988), defined in a very general manner. In this article, we present an interpreter as the necessary link between a set of behaviors and their respective encodings, without relying on any mechanistic account of systems. As we argue elsewhere, the concept of interpreter can be regarded not only as a notion that captures the minimal common characteristics of different types of computational devices and serves to clarify various concepts which pervade computer science, but also as a framework for understanding computing.

By the behavior of a system we understand only a description of the occurrences of certain events considered relevant to the system. Different ways of observing a system may determine different sets of behaviors. Thus, the behaviors will depend on a decision regarding the events that are considered of interest for that system (for particular purposes). A precise definition of behavior will be left unspecified here, since this will only make sense when a particular framework is stated. Intuitively, an interpreter produces a behavior according to some input that codifies it. Usually, the encoded behavior may depend on input data, but for simplicity we will assume in this presentation that the data and the behavior are already encoded together. The notion of interpreter is (almost) by definition the necessary link between the so-called "program-scripts" and "program-processes" (Eden, 2007; Blanco & Garcia, 2008). Given a characterization of a fixed set B of possible behaviors, and a set of syntactic elements P, an interpreter is a function i : P -> B assigning some behavior b in B to every p in P. When this relation is given, we say that p is the encoding of b. Generally, we speak of the syntactic domain P as the programming language, and of p as a program. A (physical) system I realizes an interpreter i if it is capable of receiving an input p and systematically producing the observable behavior b such that i(p) = b. In this case we say that I effectively computes b via the program p. We say that a (physical) system realizes an interpreter when, every time we provide it with an instance of an encoding, it produces the corresponding observable behaviour. We do not consider internal states, since these may be realized in very different ways.
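Read computationally, the definition i : P -> B is simply a mapping from syntactic encodings to behaviours. The following toy sketch – our own illustration appended here, with an invented miniature "programming language" – renders the definition concrete in Perl:

    #!/usr/bin/perl
    # Toy rendering of an interpreter i : P -> B.
    # The encodings and behaviours below are invented for illustration.
    use strict;
    use warnings;

    # The syntactic domain P maps, via the interpreter i, to behaviours:
    # here each behaviour is a sub producing observable output.
    my %i = (
        'greet'  => sub { print "hello\n" },
        'count3' => sub { print "$_\n" for 1 .. 3 },
    );

    # "Realizing" the interpreter: given an encoding p, produce i(p).
    sub realize {
        my ($p) = @_;
        my $b = $i{$p} or die "'$p' is not a program of this language\n";
        $b->();
    }

    realize('greet');     # observable behaviour: prints "hello"
    realize('count3');    # observable behaviour: prints 1, 2, 3

On this reading, the "degree of programmability" discussed below corresponds to the variety of behaviours the syntactic domain can encode.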
One way of making the notion of realization precise is along the lines of the notion of "practical realization of a function" defined in (Scheutz, 1999), where the relation is an isomorphism between the formal definition of i and a physical theory T that describes the system I (for example, the theory of electrical circuits), a theory which includes a description of the inputs and outputs of the system as well as a function F that maps inputs to outputs using the laws and language of T in a way that guarantees the preservation of the isomorphism. In (Scheutz, 1999) different degrees of "practicality" of the realization relation are also considered, which take into account the limits in precision with which the inputs can be measured and generated, the reliability and range of functioning of physical systems, the noise generated by the environment, etc.

The concept of interpreter serves as a criterion for distinguishing between systems that could be computational (w.r.t. some inputs and behaviours) and those that could not. Since we want to capture what makes any system programmable, we do not assume any particular implementation technology in the concept of interpreter. Different computational models, like Von Neumann machines, parallel machines, DNA-computers and quantum-computers, can be considered interpreters because they can systematically produce behaviours from their encodings in a predefined language. What will be specific to each model is the underlying theory used to justify that they are interpreters, not the criterion used to determine that they are indeed programmable systems.

The notion of interpreter can be seen as functional, i.e., an interpreter is such when it is capable of producing behaviors from programs. Following this idea, a program is a syntactic structure capable of being interpreted. A program is such only relative to a given interpreter, and an interpreter is such only for a particular programming language. The concepts of program, programming language and interpreter are thus relational and inter-definable. The main feature of an interpreter is that it is programmable: there is an available syntax with which a variety of behaviors can be encoded. The degree of programmability of an interpreter is given by the variety of behaviors that the underlying programming language is able to encode. The degree of programmability is the distinctive feature of an interesting computational system. If we consider a system computational when it is programmable, then being computational will also be a property which can be established only relative to a set of behaviors and a corresponding encoding (usually an actual programming language). In other words, the property of being computational will not make sense independently of a set of behaviors and the encoding. This allows us to tackle some philosophical problems, such as the problem of pan-computationalism (do all physical systems compute?) (Putnam 1987, Searle 1990, Chalmers 1996, Chrisley 1994, Copeland 1996, Piccinini 2008), from a different perspective. The question "Is this a computational system?" is replaced by the question "Is this a computational system with respect to this set of inputs and behaviors?", or equivalently, "How interesting, from a computational point of view, is this system?". From this perspective, in particular, several constructions of "trivial implementations of programs" which intend to show how the thesis of pan-computationalism can be established do not qualify as interesting computational systems.
Since the rise of computability theory in the thirties, it has been clear that a computation is related to a certain formal object that prescribes it, e.g. the description of a Turing machine, general recursive functions, a lambda-term, etc. A computation, then, is produced following this prescription. Putnam's (and Searle's) theorem (Putnam 1987, Searle 1990), on the other hand, tries to present a notion of computation in itself, reifying computation as something that exists independently of the prescription or program (any sequence of states would do). The property of being an interpreter for a given set of behaviours can be satisfied by certain systems. An interpreter is a general notion that can be used to characterize physical mechanisms (computers, calculators), a human acting mechanically (Turing's computor, a human carrying out the reductions of a lambda term), mathematical formalisms (universal Turing machines, etc.), or computers with computing power beyond Turing computability (Oracle computers (Copeland, 2002)). Whereas a (physical) counterpart is needed for the realization of an interpreter, the property of being an interpreter, and concomitantly the property of being a programmable system, can be determined by its abstract description.

References

Abelson, H. & Sussman, G. (1996). Structure and Interpretation of Computer Programs. MIT Press, Cambridge, MA, USA, 2nd edition.
Blanco, J., Cherini, R., Diller, M. & Garcia, P. (2011). Interpreters: towards a philosophical account of computer science. Technical Report.
Blanco, J. & Garcia, P. (2008). A categorial mistake in the formal verification debate. In European Conference on Computing and Philosophy (ECAP), June 2008.
Chalmers, D. (1996). Does a rock implement every finite-state automaton? Synthese, 108(3): 309–33.
Chrisley, R. (1994). Why everything doesn't realize every computation. Minds and Machines, 4(4): 403–20.
Copeland, J. (1996). What is computation? Synthese, 108(3): 335–59.
Copeland, J. (2002). Narrow versus wide mechanism. In Computationalism: New Directions. MIT Press.
Dijkstra, E. (1988). On the cruelty of really teaching computing science. Circulated privately.
Eden, A. (2007). Three paradigms of computer science. Minds and Machines, 17(2): 135–167.
Jifeng, He & Hoare, C. (1988). Unifying theories of programming. In Ewa Orlowska and Andrzej Szalas (eds.), RelMiCS, pages 97–99.
Jones, N. (1997). Computability and Complexity: From a Programming Perspective. MIT Press, Cambridge, MA, USA.
Piccinini, G. (2008). Computers. Pacific Philosophical Quarterly, 89(1): 32–73.
Putnam, H. (1987). Representation and Reality. MIT Press.
Scheutz, M. (1999). When physical systems realize functions. Minds and Machines, 9(2): 161–196.
Searle, J. (1990). Is the brain a digital computer? Proceedings and Addresses of the American Philosophical Association, 64 (November): 21–37.

WHAT IS THE DIFFERENCE BETWEEN YOUR FRIEND AND A CHURCH-TURING LOVER? A New Defense of H-Consciousness.

PIOTR BOŁTUĆ
University of Illinois Springfield
UHB 3030, One University Plaza, Springfield IL 62703
(and Warsaw School of Economics)

Abstract. Whatever functionality may be attained by a physical system (such as a human), it could, in principle, be replicated by a robot. We can define a Church-Turing lover as a robot with all the functionalities of a (realistic, or ideal) sex partner. What it lacks is only the first-person perspective. If we care what a partner truly feels, not just how he/she behaves, we should care.
Yet, if we could build in the relevant first-person consciousness, the difference would disappear, or it would be relegated to a broader social-historical context.

1. The gist of the Argument

An important direct implication of the Church-Turing thesis seems to be that whatever functionality may be attained (by a physical system, such as a human) can, in principle, be replicated by a robot. In the area of sex, whatever 'functionalities' a human lover may perform would in principle be replicable in advanced sex-toys. The term 'functionality' can be understood as broadly as we wish. Should the desired specifications of a lover include, in addition to advanced mechanical functionalities, also certain advanced tactile features, temperature adjustments, fluid emissions (including chemical replication of the body fluids, such as sweat, squirt or sperm), ionization levels and other bioelectrical fields, sounds or even sophisticated conversations and other language utterances ('the Turing test' is one of the implications of Church-Turing), such conditions can be produced, though sometimes at a cost which may in practice be prohibitive. Understanding this point is important for the large sex-toy industry, for other industries piggybacking on its research and development, but also for philosophers. The question for philosophers is what, if anything, would make such a robotic lover different from a human one.

Advanced robotic lovers can be viewed as external experience machines, where one's senses are stimulated by an artificial cause, not through direct brain stimulation but rather through the stimulation of external sensory organs. It is nevertheless similar to the experience machine, since the robot breaks the 'typical' or so-called 'proper' causal chain between the experiences and a human lover and initiates a so-called deviant causal chain (the terms 'proper' and 'deviant' are used here in the sense given to them in the theory of causality, not as moral evaluatives). I come to the conclusion that, while there is no functional difference, the human lover is supposed to have a first-person consciousness (h-consciousness) related to Chalmers' hard problem. Without such an assumption we have no way to philosophically articulate the difference between the moral subjects for whom 'there is something that it is like to experience' a certain thing (here, sex) from the inside, and those for whom there isn't such a thing. Perfect electronic lovers work better than zombies in demonstrating this point, since we avoid the controversies over whether it is conceivable that identical physical systems, such as human brains, could produce first-person consciousness in humans but not in zombies. The zombies seem to violate the tenet of materialism that there is no difference without physical difference, while electronic toys do not make such violations.

1.1. MAIN STEPS OF THE ARGUMENT

Let us present a 'sentence outline' of the main argument.

1.1.1. Defining a Church-Turing lover
It is the perfect functional imitation of a human lover in terms of all parameters desired, which may include some or all of the following: a. tactile features, b. reactivity to voice commands, c. speech quality, d. speech content (including the ability to meet the Turing test), e. advanced domestic skills (cooking, cleaning), f. other skills of an artificial companion as defined by Floridi.

1.1.2.
1.1.2. Defining your boyfriend/girlfriend
Your boyfriend/girlfriend is defined as a human being, equal or inferior to the Church-Turing lover in terms of the functionalities described broadly in points a–f and all other typical functionalities.

1.1.3. Establishing rough functional equality between the Church-Turing lover and the boyfriend/girlfriend
This includes responses to various objections, such as the social objection, the psychological objections and the religious objection. The only objection left unanswered is the reproductive objection, which leaves us with 'rough functional equality': the Church-Turing lover is functionally equal to your boyfriend/girlfriend provided you do not intend to procreate with him/her. (Actually, the Church-Turing thesis implies procreative functionality in robots as well.)

1.1.4. Atypical functionalities, defined as those of the first-person perspective
I show the futility of Church-Turing functional reenactments of presumed first-person states. Why do I want my boyfriend/girlfriend to have an orgasm, not just to be very good at faking one? (If I am not an egoist, I want her to feel good, not just to behave as if she felt so.) I also give brief responses to the privileged access problem.

1.1.5. The engineering thesis in machine consciousness
The engineering thesis in machine consciousness saves your girlfriend/boyfriend's uniqueness, but not forever. There is a first-person, inductively established difference between the Church-Turing lover and a boyfriend/girlfriend. The difference may partially disappear should we be able to engineer robots with first-person h-consciousness functionalities.

Acknowledgements

I developed an early draft of this argument at G. Harman's graduate seminar in epistemology in the Spring of 1991. I want to thank Prof. Harman, Alex Byrne, Mary McGowan and other participants for discussion. I want to thank John Barker and Keith Miller for recent related discussions.

HAECCEITY AND INFORMATION

THEPTAWEE CHOKVASIN
Suranaree University of Technology
Nakhon Ratchasima, Thailand

Abstract. The interest in 'information entities' is increasing in the philosophy of information. In this article, I offer a philosophical analysis concerned only with their haecceities (thisnesses) in the conception of Heideggerian 'functionality'. I argue that the haecceity of an information entity is necessary for making a legal judgment on cybercrimes, especially on sharing illegal information. Moreover, when considering the persistence of deleted information files, it turns out that the haecceities of those files have some aspect of being an indexical of functionality which is far beyond what Duns Scotus knew about them.

1. Introduction

I live in Thailand, and my friend is now in Japan. We are chatting on MSN. If I am now reading some information on a school website, and my friend is reading the same thing on his computer screen in Japan, are we reading exactly the same thing? Someone may consider this situation and say that the same thing can appear in many different places at the same time, and therefore we are reading exactly the same thing. However, someone else may say that one thing cannot be in many different places at the same time, so my friend and I are looking at two different website pages which are merely similar to each other.
And so a question arises: "When are two chunks of information, or two information entities, the same?" In the argument above, it can be seen that something very similar to the problem of universals is brought back from classic metaphysics. Cyber-information on a webpage behaves as if it were a universal which is instantiated in many individual computers. However, if a philosopher of information wants to retain the position of considering information as information entities, she may have to take another route in explaining the similarity of the two web-pages. She might explain that they are two different information entities that instantiate the same universal 'informativeness'. If the latter is right, then we have to admit that any information is an information entity of its own: no two distinct information entities exactly resemble each other.

Unfortunately, this position of metaphysical information entities may have an undesirable result. At present there are computer crime laws that forbid sending or forwarding any illegal information, pictures, pirated items, etc. to a third person. Both the sender and the receiver will be considered guilty of doing so. But how can the law remain legitimate if the receiver uses the argument above to show that, because of their status as different information entities, he did not receive the same thing from the sender? This leads us to other topics in metaphysics, those of identity and individuation, which interest me in this article more than finding an account of the sameness of information entities in the light of the metaphysics of universals. So I will stick to the topic of identity and individuation.

In this article, I will develop an analysis to answer the question above. The analysis will be in the light of Heideggerian 'functionality' as discussed by Ratcliffe (2002): apart from their properties, for two things to be identical to each other they must be considered from their 'teleological webs', including their values and ends. This view must, however, be developed further when answering the further question of what the appropriate notion of identity for information entities is. I will argue that the problem of individuation is deeper than the problem of identity. Two information entities that do not differ in their properties will be individuated by their info-haecceities, which are the bases for their identity.
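The distinction at stake can be made concrete with a small sketch (an editorial illustration, treating in-memory byte buffers as stand-ins for information entities): two copies can agree in every qualitative, shareable property and still be two.

    # Toy sketch (editorial illustration): qualitative sameness vs.
    # individuation for two byte-identical "information entities".
    import hashlib

    original  = bytearray(b"some forwarded content")
    forwarded = bytearray(original)             # the receiver's copy

    assert forwarded == original                # same content (common nature)
    assert (hashlib.sha256(forwarded).digest()
            == hashlib.sha256(original).digest())

    # Nothing qualitative distinguishes them; what makes them *two* is not a
    # shareable property -- the role assigned here to an info-haecceity.
    print(forwarded is original)                # False: two distinct entities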
2. Haecceity and Functionality

It is said that John Duns Scotus may have been the first philosopher to deal with the problem of individuation by means of "the difference". Duns Scotus gave arguments for positing an "individuating difference", or haecceity, which is to give an account of individuals. In his Ordinatio, Duns Scotus said: "I reply therefore to the question that material substance is determined to this singularity by some positive entity and to other diverse singularities by other diverse positive singularities." (Wolter, 1994: 286). The positive individuating difference, or haecceity, is different from the common nature, or quiddity, which explains what an individual essentially is. So we may never reach a full understanding of the haecceity.

Now we can say that the receiver of the illegal information may be considered guilty from another perspective. Although it is controversial whether he is guilty of receiving the very same thing from the sender, he is still guilty of producing new illegal entities in the computer system. The law must instead depend on "the difference" to legitimately charge two persons (not just one) with two different acts, differentiated by two different entities which merely happen to have similar characteristics in their common natures.

Cannot haecceity be grasped at all? In Haecceity (1993), Gary S. Rosenkrantz offered arguments to show that the haecceities of objects incapable of consciousness are cognitively inaccessible to us. Only the haecceity of one's being oneself can be grasped and expressed linguistically, and only by that one person. If we follow Rosenkrantz's argument, we have to admit that the haecceity of other entities around us is inaccessible. Is this also the case for the haecceity of an information entity, or info-haecceity?

References

Ratcliffe, M. (2002). Heidegger, Analytic Metaphysics, and the Being of Beings. Inquiry, 45(1), 35–57.
Rosenkrantz, G. S. (1993). Haecceity: An Ontological Essay. Dordrecht: Kluwer Academic Publishers.
Wolter, A. B. (1994). John Duns Scotus. In Jorge J. E. Gracia (Ed.), Individuation in Scholasticism: The Later Middle Ages and the Counter-Reformation 1150–1650 (pp. 271–298). Albany, NY: State University of New York Press.

THE LIMITS OF COMPUTER SIMULATIONS AS EPISTEMIC TOOLS

JUAN M. DURAN
Universität Stuttgart - SimTech
Germany

Over the past few decades the use of computers for scientific purposes has been extended to virtually every branch of science. The reasons for such widespread acceptance are clear: computers provide powerful means for solving complex models, as well as the speed and memory needed for analyzing and storing data, visualizing results, etc. A less broad, yet still important, use of computers in laboratory practice is the implementation of computer simulations. Lately, scientists have turned their interest to the design, validation, and execution of computer simulations instead of setting up, controlling and calibrating a whole material experiment. Whether for budgetary reasons, time-consuming delays, or complexity, scientific practice today is carried out in a way that strongly relies (if not fully depends) on computers.

Here we face a philosophical problem that has now become widely discussed. The current philosophical literature deals with the question whether a traditional experiment warrants greater (or less) epistemic confidence than a computer simulation. The most common strategy for answering this question is to address the so-called "materiality problem". Its standard conceptualization is characterized by Parker in the following way: "in genuine experiments, the same 'material' causes are at work in the experimental and target systems, while in simulations there is merely formal correspondence between the simulating and target systems (...) inferences about target systems are more justified when experimental and target systems are made of the 'same stuff' than when they are made of different materials (as is the case in computer experiments)" (Parker, 282).
In general terms, the materiality problem can be addressed either by emphasizing the lack of materiality in computer simulations as epistemically defective (as, for example, in Guala, Morgan and Giere), or by claiming that the presence of materiality in experiments is rare and, ultimately, unimportant for epistemic purposes (Morrison, Parker and Winsberg). Either solution leads to what I call the "dilemma of computer simulations", for it presupposes that once the ontology of computer simulations is sorted out, their epistemic power can be fully determined. Indeed, it requires, as a premise, an ontology that resolves the epistemic value of computer simulations. However, the informative exercise of simply checking off ontological features of computer simulations begs the question whether it is legitimate to draw any epistemic conclusion at all. Paraphrasing Hacking: they disagree because they agree on basics.

A different approach consists in defending the epistemic reliability of computer simulations as philosophically detached from their ontological conceptualization. This does not suggest, though, that these are two unrelated issues, but rather that each can be analyzed in its own right. In fact, there exists a close relation between them insofar as the ontology becomes, to a certain extent, a limiting case for the epistemology of computer simulations. Therefore, instead of asserting that "on grounds of inference, experiment remains the preferable mode of enquiry because ontological equivalence provides epistemological power" (Morgan, 326), I hold a twofold claim: firstly, that materiality only restricts computer simulations from "accessing" certain aspects of the world which require a causal story; in other words, materiality draws the boundaries beyond which experiments become a specific and irreplaceable method for knowing something about the world. Secondly, that computer simulations provide ways of inference that do not depend on their materiality but on their capacity for representing empirical as well as non-empirical systems.

Keeping an eye on these two claims, I propose to proceed in two correlated steps: firstly, by analyzing and characterizing the nature of computer simulations and material experiments; naturally, this step is highly dependent on assumptions about computational models, computer programs and experiments, all of which will be briefly addressed. Secondly, by discussing the philosophical relevance of the limits imposed on computer simulations by materiality, as well as drawing some preliminary conclusions about their epistemic power. Case examples will be briefly discussed as well.

In one sense, there are many aspects of scientific practice that cannot be substituted by computer simulations but require interaction with the material world: measurement, for instance, is one case. In certain measurement instances (i.e. so-called "derived measurement"), the causal interaction of an instrument with the world cannot be replaced by the calculus performed by a computer simulation. Another interesting case study is the reproducibility of experiments (cf. Franklin and Howson 1984): as is well known, varying instruments and experimental set-ups tends to increase epistemic reliability; it is not clear, however, that a similar methodology works for computer simulations.
In addition, the detection of new real-world entities seems a complete chimera for computer simulations, although it is a key role of material experiments. On the other hand, computer simulations have the capacity to deal with incredibly complex equations that represent real-world systems and from which it is possible to "crunch" large amounts of data. Much of our knowledge about the world also comes from manipulating and interpreting such data. Computer simulations can also be used for investigating "rational worlds", such as counterfactuals, thought experiments and mathematical worlds. I therefore urge a philosophical discussion of the epistemological value of computer simulations based on their capacities and limits, instead of on a dependence on an ontological conceptualization.

References

Franklin, A. & Howson, C. (1984). Why do scientists prefer to vary their experiments? Studies in History and Philosophy of Science Part A, 15(1), 51–62.
Giere, R. (2009). Is computer simulation changing the face of experimentation? Philosophical Studies, 143, 59–62.
Guala, F. (2002). Models, simulations, and experiments. In L. Magnani & N. J. Nersessian (Eds.), Model-Based Reasoning: Science, Technology, Values (pp. 59–74). Kluwer.
Morgan, M. (2005). Experiments versus models: New phenomena, inference and surprise. Journal of Economic Methodology, 12(2), 317–329.
Morrison, M. (2009). Models, measurement and computer simulation: the changing face of experimentation. Philosophical Studies, 143, 33–47.
Parker, W. (2009). Does matter really matter? Computer simulations, experiments, and materiality. Synthese, 169(3), 483–496.
Winsberg, E. (2009). A tale of two methods. Synthese, 169(3), 575–592.

WHY BUILD A PHYSICAL MODEL OF HYPERCOMPUTATION?

FLORENT FRANCHETTE
IHPST, University of Paris 1 Panthéon-Sorbonne
13 rue Dufour, 75006 Paris

Abstract. A model of hypercomputation can compute at least one function not computable by a Turing Machine, and its power comes from the absence of particular restrictions on the computation. Nowadays, some researchers claim that it is possible to build a physical model of hypercomputation called the "accelerating Turing Machine". But for what purposes would these researchers try to build a physical model of hypercomputation when they already have mathematical models more powerful than the Turing Machine? In my opinion, the computational gain provided by the accelerating Turing Machine is not free: the model also loses the possibility for a human to access the computation result. To capture this feature, I will propose a new constraint, the "access constraint", stating that a human can access the computation result regardless of computational resources. I will show that the Turing Machine meets this constraint, unlike the accelerating Turing Machine, and I will defend the claim that building a physical model of the latter is the way to meet the access constraint.

The aim of computability theory is to define the mathematical functions computable by algorithms. The definition of an algorithm is, however, an informal one, and computability theory needs a mathematical definition of this notion. In order to formalize a predicate which means "can be computed by an algorithm", Alan Turing (1936) proposed the formal predicate "computed by Turing Machine", or "Turing-computable".
According to Turing, the Turing Machine (TM) is a mathematical model of computation with a power equivalent to that of algorithms. This claim is summarized in the Church-Turing thesis: the functions computable by algorithms are the functions computable by a TM. On this thesis, the TM defines computation by algorithm, since if a function is not Turing-computable, there is no algorithm which can compute it. For example, it has been proved that some mathematical functions, such as the Diophantine function (see footnote 1), are not Turing-computable.

1 Given a Diophantine equation x, the Diophantine function is the function such that f(x) = 1 if x has at least one solution and f(x) = 0 otherwise.

Turing (1939), however, showed in his doctoral thesis that the computing power of the TM, that is to say the number of functions it can compute, depends on the type of constraints applied to the model. Models able to compute more functions than the TM are called "models of hypercomputation" or "hypermachines", and their computational power comes from the absence of particular restrictions on the computation. Recently, Jack Copeland (2002) has proposed a model of hypercomputation named the "Accelerating Turing Machine" (ATM), which is based on dropping the constraint that a computation must consist of a finite number of steps. Copeland demonstrates in his article that an ATM is able to execute an infinite number of computational steps in a finite time and to compute non-Turing-computable functions such as the Diophantine function. More importantly, some researchers defend the idea that it is possible to physically build an ATM.
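The acceleration can be made explicit with a standard worked sum (an editorial illustration, assuming the usual convention that the n-th step takes 2^(-n) units of time):

    % Editorial illustration of the accelerating schedule:
    \[
      T \;=\; \sum_{n=1}^{\infty} 2^{-n}
        \;=\; \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \cdots
        \;=\; 1 \;<\; \infty ,
    \]

so infinitely many steps are completed within one unit of time, even though no individual step is ever skipped.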
However, the physical construction of a computational model, whether equivalent to the TM or not, goes beyond the original framework of computability theory. Indeed, the Church-Turing thesis states nothing about the computing power of a physically built TM; it states only an equivalence between the intuitive concept of algorithm and the mathematical concept of Turing Machine. It is therefore pertinent to ask for what purposes these researchers would try to build physical hypermachines when they already have mathematical models more powerful than the TM. In other words, why leave the mathematical framework of hypercomputation and turn to the physical sciences?

In order to answer these questions, I will try to explain one reason why advocates of hypercomputation want to physically build a computational model with greater power than the TM. In my opinion, although the absence of a constraint such as the finite number of steps allows the ATM to compute more functions than the TM, the computational gain is not free. The model of hypercomputation also loses a key feature: the possibility for a human to access the computation result. To define this feature, I propose a distinction between "accessing the result" and "computing the result". We have access to the computation result when the result is available to us in principle; this result doesn't need to have a meaning, it can be merely a string of symbols. We compute a result when we can in principle follow each computational step from input to output. From these definitions, we can set out two constraints. Let f be a function computable by a model:

• The model meets the access constraint (AC) if for every input x we can have access to f(x).
• The model meets the computing constraint (CC) if for every input x we can compute f(x).

It is straightforward to show that these two constraints are built into the definition of a TM. However, I think that the ATM meets neither the CC nor the AC. My main point is that it is actually unlikely that a human can compute an infinite number of steps in a finite time. The argument is that the brain, where computations are made, is a finite entity both in space and time. This argument seems pertinent for showing that we are not able to follow an infinite computation step by step. But it is not sufficient to prove that we cannot have access to the result of an infinite computation, because it could be possible to have access to Diophantine function results without following each computational step. For example, Hava Siegelmann (1995) has proposed a mathematical model of the brain in the form of artificial neural nets which, according to her, could compute "beyond the Turing limit". Although it appears that Siegelmann's model may exceed the power of the TM, it has been strongly criticized by Martin Davis (2006) in his article entitled The Myth of Hypercomputation.

From the two arguments outlined above, I shall make the assumption that a human is not able to compute, or to have access to, the result of a non-Turing-computable function computed by an ATM. Therefore, this model meets neither the CC nor the AC. Nevertheless, could an ATM meet these constraints? In my opinion, it is necessary to distinguish two ways for a model to meet the AC:

• A model meets the AC in an internal sense if a human is able to have access to the computation result without a physical realization of the model.
• A model meets the AC in an external sense if a human is able to have access to the computation result with a physical realization of the model.

For example, a TM meets the AC in an internal sense because we can access results from its mathematical definition. On the hypercomputation side, however, we could have access to the computation result in an external sense through a physical realization of an ATM. This result, characterized by the link between the computing power of a model of hypercomputation and its physical realization, has important consequences for the notion of computation. It shows that some features of hypercomputation models do not depend on mathematics alone. Specifically, the possibility of accessing the result of a non-Turing-computable function computed by an ATM rests on physical constraints.

Acknowledgements

I would like to thank the editors and referees for very helpful comments during the preparation of this paper.

References

Copeland, J. (2002). Accelerating Turing Machines. Minds and Machines, 11, 281–301.
Davis, M. (2006). The Myth of Hypercomputation. In C. Teuscher (Ed.), Alan Turing: the Life and Legacy of a Great Thinker. Springer.
Siegelmann, H. (1995). Computation Beyond the Turing Limit. Science, 268, 545–548.
Turing, A. (1936). On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42, 230–265.
Turing, A. (1939). Systems of Logic Based on Ordinals. Proceedings of the London Mathematical Society, 45, 161–228.

THE MATERIALISTIC FALLACY
Some Ontological Problems of Regulating Virtual Reality

FABIAN GEIER
Universität Bamberg
An der Universität 2
96047 Bamberg
Germany

Abstract.
This paper discusses a connection between the ontology of virtual objects and several problems of information ethics. I argue that there is a strong tendency, sometimes even among professionals in ICT, to treat virtual objects like material objects. There are many political regulations and economic practices which make sense for material objects but do not make sense for virtual ones. Such an ignoring of the nature of data processing, be it deliberate or not, I call a materialistic fallacy, and I consider it to hamper social progress and benefit.

1. The Fallacy

I speak of a materialistic fallacy whenever virtual objects are unnecessarily treated like material objects. The immediate effects of this fallacy are two: the practice in question either proves to be ineffective, because it is easily circumvented; or, where it can be enforced, it stalls progress and severely limits the benefit that ICT could provide.

2. The Ontology of Virtual Objects

By "virtual objects" I refer to any chunk of digitally stored data that is conceived as a distinct entity by human understanding. This will in most cases be identical with files. However, the human mind does not have to go along the lines of file descriptors, and especially outside professional IT it often does not. A mouse pointer, window or webpage might be made up of several distinct files, and neither is a part of a file itself a file, nor is the entire content of a hard drive. However, all of these are virtual objects as soon as we refer to them. And the decisive thing about virtual objects is that they can easily be made into a file and subjected to all the possibilities of data processing. By this definition of virtual objects I hope to circumvent most of the specific problems in the ontology of computing.

In material reality, form and matter cannot be separated from each other. One effect of this is that we are used to relatively stable individual objects that persist in time. Persistence is the precondition for movement: when a material object is moved to a new place, it is no longer in its former place. In the realm of information, however, the case is entirely different. In Aristotelian terms, data processing deals with pure 'forms'. Forms don't move. They are a-temporal and intangible (this largely corresponds to what Eden & Turner (2007) say about programs). Their distinctive characteristic is instantiation. Any number of instances of a form can exist, but none of them is prior to any other. If we send a network packet to two different computers, we cannot say which of the arriving packets is the original and which a mere copy. Such questions make sense in the material world, but they do not make sense in the virtual.

Technically, any chunk of data is at any point located in particular bits and bytes, and so is still an instantiation and not a pure form. However, since computers are all about re-instantiating the form of this instantiation, this fact is negligible; computers are all about making it negligible. This results in what Moor (1997) calls information being "greased". Of course there seems to be movement in virtual objects, e.g. in a cursor on a screen; otherwise computers would not be very useful. But we should keep in mind that such movement is always a simulation, created by a sequence of copying and erasing. And just because we sometimes cannot help using such simulations, there is no need to carry them to the utmost degree.
I suggest the opposite: we should do so only where it is necessary, and otherwise maximize the benefit of freeing information from the bonds of materiality.

3. Examples

3.1 DATA EXPIRY

A typical materialistic fallacy is the suggestion, put forward by Viktor Mayer-Schönberger (2008) and recently picked up by the German ministry for consumer protection, to have an inbuilt expiry date for data on the internet. The idea sounds nice: it would end the problem that whatever is put online once resides there forever. However, it will never work. More precisely: it could only work under the most extreme conditions of worldwide data control - an amount of control no current institution is anywhere close to being able to exert. Of course we can write a program that erases a file after 90 days, but it would have to be implemented either as a mandatory core module of all existing operating systems, or as an obligatory hardware solution similar to Trusted Computing. It does not lie in the nature of data to expire. An expiry module would only be a separate addition to the core functionality of computers, and thus both unwanted and easy to remove.

3.2 DIGITAL RIGHTS MANAGEMENT

DRM, or more specifically copy protection, is almost archetypical of the materialistic fallacy. When we try to charge customers on a per-copy basis, we are following the paradigm of material objects. Copy protection attempts to establish a uniqueness and sameness for the copy that do not lie in its nature. The protection must prevent a function that data processing generally offers: the re-instantiation of data. There are various consequences of this. First, the moral restraints against copying software, protected or not, are lower than in material theft, because copying does not result in anyone else losing data. Second, precisely because it is not in its nature, the seeming uniqueness of a copy is difficult to maintain, as it can only be provided by an additional module. I do not endorse pirating software. But I endorse acknowledging the basic structures of ICT because of which it is easier to pirate software than to protect it. And I endorse thinking about alternative ways of dealing with this.

3.3 E-VOTING

The ontological structure of ICT also matters in the discussion about e-voting. I am not referring to security issues here, but to the situation once security is breached. Then the full power of data processing lies in the hands of the intruder: whether you forge 10 votes or 10,000,000, it is just one line of code. The difference between local and global modification is not the same in virtual as in material reality. Virtual objects do not count one by one, but can be treated formally, at various levels of abstraction. Large-scale modifications in a database are in principle no more difficult than singular modifications. I do not say that this alone must decide the issue. All I say is that the nature of data processing has to be taken into account.
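The scale-invariance claimed in 3.3 can be sketched in a few lines (an editorial illustration; the table, column names and candidate labels are hypothetical):

    # Toy sketch (editorial illustration; schema is hypothetical): once write
    # access exists, a global modification is as short as a local one.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE votes (voter_id INTEGER, candidate TEXT)")
    db.executemany("INSERT INTO votes VALUES (?, ?)",
                   [(i, "A" if i % 2 else "B") for i in range(10)])

    # The "one line of code": the same statement rewrites 10 rows or
    # 10,000,000 -- the formal description ignores the count entirely.
    db.execute("UPDATE votes SET candidate = 'A'")

    print(db.execute("SELECT COUNT(*) FROM votes"
                     " WHERE candidate = 'A'").fetchone()[0])   # -> 10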
References

Aristotle (1989). Metaphysik, 2 volumes. Hamburg: Meiner Verlag.
Aristotle (1998). Kategorien. Hermeneutik. Hamburg: Meiner Verlag.
Eden, A. H. & Turner, R. (2007). Problems in the Ontology of Computer Programs. Applied Ontology, 2(1), 13–36.
Mayer-Schönberger, V. (2008). Nützliches Vergessen. In M. Reiter & M. Wittmann-Tiwald (Eds.), Goodbye Privacy - Grundrechte in der digitalen Welt. Wien: Linde Verlag.
Moor, J. (1997). Towards a Theory of Privacy in the Information Age. Computers and Society, 27(3), 27–32.

THE EFFECT OF COMPUTERS ON UNDERSTANDING TRUTH

STEVEN MEYER
Tachyon Design Automation Corp.
Minneapolis, MN

Abstract. The effect of computers and computation on the philosophical study of the epistemology of truth is discussed. The development of algorithmic truth as satisfiability is considered using modern quasi-empirical methods that follow the mathematician Paul Finsler's discovery that a formal conception of truth does not suffice. The P =? NP problem is considered and shown to be a philosophical problem using Finsler's method. Non-truth-value-assignment conceptions of truth, such as deflationism, and computer science as a method for studying physics are criticized.

1. Introduction

The mid 1960s marked the beginning of the influence of computers on the epistemology of various conceptions of truth. On the one hand, fast computers were becoming available; on the other, quasi-empirical characterizations of mathematics in the form of Lakatosian research programmes were becoming popular (Lakatos, 1967). A. J. Ayer attributes the quasi-empirical characterization of logical truth to J. S. Mill in the middle of the 19th century (Ayer, 1936, p. 291). In 1964, Paul Finsler published what he claimed was an airtight defense of his rejected 1926 idea that 'A formal "conception of truth" cannot suffice' (Finsler, 1996, p. 163). Computers were becoming fast enough that programs for proving mathematical theorems and for verifying truth were conceivable. These developments led naturally to questions concerning what can be computed, and whether there are any limitations of computability. Before the mid 1960s, at least in the area of mathematics, epistemology had become truth as the existence of mathematical objects generated from abstract set theory. The various incompleteness, inconsistency and set-theoretic paradox results were avoided by falling back on truth as axiomatic logic.

Computers allow a new and seemingly empirical epistemology of truth: namely, something is true if it can be computed in a reasonable amount of time. This immediately led to problems. One early example was alphabetization (sorting) using a giant table. One can sort a list in linear time by converting each key into a number and storing the key at the address corresponding to the encoding. It is not clear whether this is alphabetization or not, and it was not clear how to collect the result.
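The giant-table sort alluded to can be reconstructed roughly as follows (an editorial sketch; it assumes fixed-length keys so that the numeric encoding preserves alphabetical order):

    # Toy sketch (editorial reconstruction): "sorting" in linear time by
    # storing each key at the address given by its numeric encoding.
    keys = ["cab", "abc", "bca"]                  # fixed-length keys

    table = {}                                    # a sparse "giant table"
    for k in keys:                                # one linear pass
        table[int.from_bytes(k.encode(), "big")] = k

    # Collecting the result means walking the addresses in order -- exactly
    # the step whose status the author notes was unclear.
    print([table[a] for a in sorted(table)])      # -> ['abc', 'bca', 'cab']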
2. THE P =? NP PROBLEM AND TRUTH

In order to study "the basic nature of computation and not merely minor aspects of our models of computers" (Baker, 1975), the problem of the equivalence of the polynomial-time and non-deterministic polynomial-time classes was developed by Cook (1971) and Karp (1972). The problem basically asks whether the satisfiability definition of truth can be computed by a deterministic Turing machine (TM) as fast as it can be computed by a non-deterministic TM. The satisfiability conception of truth goes back to Alfred Tarski's work in the 1930s (Tarski, 1956), which defined a statement (conjunction) of basic propositions to be true if it is true under some possible assignment of truth values to the basic atomic propositions in the statement. This problem is not only the central problem of computer science but, according to Aaronson (2005, p. 2), "is correctly seen as the deepest problem in all mathematics".

Since its formulation in the late 1960s, the P =? NP problem has become at once a mathematical problem, a scientific problem (because it involves time) and a philosophical problem. The "canonical", possibly easiest, problem in the class NP is the logical truth satisfiability problem. Following Karp, other problems in the class NP (solvable in a polynomially bounded number of steps on a non-deterministic TM) are solved by mapping them to the satisfiability problem in polynomial time (Karp, 1972). The satisfiability problem and its characterization of what can be computed are closely related to the very essence of truth because, as the 18th-century philosopher David Hume observed, "no general proposition whose validity is subject to the test of actual experience can ever be logically certain. ... [something] substantiated in n-1 cases affords no logical guarantee that it will be substantiated in the nth case also" (Ayer, 1936, p. 289). This paper considers the epistemology of computation in the quasi-empirical sense by investigating "what is true, and not what is hypothetically taken to be true (for instance axioms)" (Finsler, 1996, p. 162).

3. Problems Solved by Computational Epistemology

Two obvious problems solved by computing are the disproof of the deflationist definition of truth and the disproof of the form of intuitionism that disavows the law of the excluded middle. The deflationist theory of truth (Stanford Encyclopedia, 2010) argues that "to assert a statement is true is just to assert the statement itself". A computational epistemology of truth as a satisfiable assignment to all atomic elements is obviously more than merely "asserting a statement". There are a number of forms of intuitionism. One form rejects the law of the excluded middle: it is claimed that there are formulas that are neither true nor false (probably because they cannot be constructed in an intuitively obvious way). Again, existence for finite formulas (and possibly potentially infinite unbounded formulas) can be tested by finding some assignment of true and false to atomic clauses that makes the formula evaluate to true. If no such assignment exists, the formula is false (Finsler, 1996, pp. 167-168). There is no question of intuitively acceptable methods here.
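This satisfiability test of truth can be pinned down in a few lines (an editorial sketch): a finite formula counts as true exactly when some assignment of truth values to its atoms makes it evaluate to true.

    # Toy sketch (editorial illustration): truth as satisfiability, by
    # exhaustive search over assignments to the atomic propositions.
    from itertools import product

    def satisfiable(formula, atoms):
        return any(formula(dict(zip(atoms, values)))
                   for values in product([True, False], repeat=len(atoms)))

    # (p or q) and (not p or q): true, e.g. under q = True.
    f = lambda v: (v["p"] or v["q"]) and ((not v["p"]) or v["q"])
    print(satisfiable(f, ["p", "q"]))   # True

    # p and not p: no assignment works, so the formula is false.
    g = lambda v: v["p"] and not v["p"]
    print(satisfiable(g, ["p"]))        # False

The search is exponential in the number of atoms; whether it can always be replaced by a polynomial-time procedure is precisely the P =? NP question discussed above.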
4. Problems Unsolvable by Computational Epistemology

Although satisfiability computable in a reasonable amount of time solves some epistemological problems, it cannot deal with problems involving actual infinity. From Finsler (1996, p. 164): One cannot form the set of all ordinal numbers, since its definition contains an inherent contradiction [Russell's paradox]. If it were not an ordinal number, then it would still contain exactly all preceding ordinal numbers, and therefore it would have to contain itself as an element, which is impossible.

5. Internal Problems of Computational Epistemology - Oracle Use

One of the first attempts to solve the P =? NP problem tried to use an infinite counting argument from meta-mathematics (Baker, 1975). The method goes back to Cantor's diagonalization using the lack of a one-to-one mapping between the real and rational numbers. The modern meta-mathematical, model-theoretic analog of diagonalization is relativization using oracles. The idea is to allow TMs to make unit-time calls to an oracle. The hope was that for all oracles the class of languages recognized by P with an oracle would be strictly contained in NP with the same oracle. The result was that P equals NP relative to some oracles but not to others. The conclusion of Baker et al. was that by "slightly altering the machine model, we can obtain differing answers" (p. 431). Since then, much of computational complexity theory has been dedicated to relativizations, because a proper containment that held under all relativizations would immediately show P != NP. Researchers who think there may be epistemological difficulties with the P =? NP problem have criticized relativization, but mostly without success (Hartmanis, 1976 & Hartmanis, 1992).

Relativization pertains to computational epistemology because it removes problem-specific structure from computable truth. Hartmanis (1976) shows that for models of computation that allow more efficient storage access, such as the MRAM model with unit cost for multiplication, P = NP (pp. 33-46). This may show that there is some conceptual problem with the Church-Turing thesis (the definition of TMs), or even that the class NP does not really exist (it is an illusion in the Finslerian sense), because abstracting away the structural connection between satisfiability and the other problems that need non-deterministic computation for efficiency is incorrect.

6. Physicalization of Computational Epistemology

Computational epistemology has taken a recent turn toward arguing that studying the P =? NP problem "can yield new insights, not just about computer science but about physics as well" (Aaronson, 2005, p. 1). Deolalikar (2010) recently published a proof that P != NP, except that unfortunately it needed axioms from empirical theories of statistical physics. In conclusion, I see this change in direction negatively, because it attempts to convert a question from physics on the existence of quantum computers (QCs) (pp. 5-8) into a formal and axiomatized computational epistemology that does not allow quasi-empirical experimentation. The argument comes full circle because the mathematicians who contributed to the development of modern physics (including Finsler, whose main area was the differential geometry of general relativity, p. vii) were skeptical of exactly the physics that QCs embody and require. In his post-WWII standard graduate-level quantum mechanics textbook, Leonard Schiff argues that "QM's range of applicability is limited to approximating the behavior of the atom" (Schiff, 1949, p. 267). Also, Paul Feyerabend's analysis of the theories of Niels Bohr and David Bohm (Feyerabend, 1981) shows that the very properties assumed by QC builders do not exist. Bohr states (Feyerabend's italics): "At the same time we must deny the universal validity of the superposition principle and must admit that it is but a (very useful) instrument of prediction." (p. 258). And Feyerabend (to whom David Bohm taught QM) describes Bohm's view of the uncertainty principles as follows: "However in order to show the basic and irrefutable character of the uncertainty principle these features themselves would have to be demonstrated as basic and irrefutable." (p. 223).

References

Aaronson, S. (2005). NP-complete problems and physical reality. SIGACT News, 36. (Also www.scottaaronson.com/papers/npcomplete.pdf).
Ayer, A. (1936). Language, Truth and Logic; excerpted in P. Benacerraf & H. Putnam (1964) (Eds.), Philosophy of Mathematics: Selected Readings (1st ed., pp. 289–301).
Baker, T., Gill, J. & Solovay, R. (1975). Relativizations of the P =? NP question. SIAM Journal on Computing, 4(4), 431–442.
Cook, S. (1971). The complexity of theorem-proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing (pp. 151–158).
Deolalikar, V. (2010). P != NP. HP Research Labs, Palo Alto, August 6, 2010, unpublished.
Feyerabend, P. (1981). Philosophical Papers, Vol. 1: Realism, Rationalism and Scientific Method. Cambridge.
Finsler, P. (1996). In D. Booth & R. Ziegler (Eds.), Finsler Set Theory: Platonism and Circularity. Birkhäuser.
Hartmanis, J. & Simon, J. (1976). On the structure of feasible computations. In M. Rubinoff & M. Yovits (Eds.), Advances in Computers 14 (pp. 1–43). Academic Press.
Hartmanis, J. et al. (1992). Relativization: a revisionistic retrospective. Bulletin of the EATCS, 47.
Lakatos, I. (1976). Proofs and Refutations. Cambridge.
Lakatos, I. (1978). Philosophical Papers, Vol. 2: Mathematics, Science and Epistemology (J. Worrall & G. Currie, Eds.). Cambridge, 24–41 (expanded version from Proceedings of the Fourth International Congress for Logic, I. Lakatos (Ed.), 1967, North Holland).
Schiff, L. (1949). Quantum Mechanics (1st ed.). New York: McGraw-Hill.
Stanford Encyclopedia of Philosophy (2010). The Deflationary Theory of Truth. (URL of Feb. 2011: plato.stanford.edu/entries/truth-deflationary).
Tarski, A. (1956). The Concept of Truth in Formalized Languages. In Logic, Semantics, Metamathematics (pp. 152–278). Clarendon Press.

PHILOSOPHY OF THE WEB AS ARTIFACTUALIZATION

ALEXANDRE MONNIN
Université Paris 1 Panthéon-Sorbonne (PHICO, EXeCO), Institut de Recherche et d'Innovation, Conservatoire National des Arts et Métiers (DICEN)
12, place du Panthéon
75231 Paris cedex 05, FRANCE

AND

HARRY HALPIN
World Wide Web Consortium
MIT/CSAIL
32 Vassar St., Bldg. 32-G514
Cambridge, MA 02139, USA

Abstract. What is the philosophical foundation of the World Wide Web? T. Berners-Lee, widely acclaimed as the inventor of the Web, has developed informal reflections on the central role of URIs (Uniform Resource Identifiers, previously Uniform Resource Locators) as a universal naming system - naming being a central topic in philosophy since at least the pioneering works of R. Barcan Marcus. URIs (such as http://www.example.org/) identify anything on the Web, so the Web can be considered the space of all URIs. In a debate between Berners-Lee and P. Hayes over URIs and their capacity to uniquely 'identify' resources, Berners-Lee held that engineers decide how protocols should work and that these decisions should determine the constraints of reference and identity, while Hayes held that names have their possible referents determined only as traditionally understood by logical semantics, which Hayes held engineers could not change but only had to obey. This duality can be interpreted as an opposition between a material a priori and a formal a priori. The material a priori of technical systems like the Web is brought about by what we call 'artifactualization', a process whereby concepts become 'embodied' in materiality - with lasting consequences.

1000-word abstract

What is the philosophical foundation of the WWW? Is it an open and distributed hypermedia system? A universal information space? How does it differ from the Internet? While the "ecology" of the Web has known many a revolution, its underlying architecture, by contrast, remains fairly stable. URIs, the HTTP protocol, resources, and languages like HTML and RDF constitute the building blocks of the Web.
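The separation between identifying a resource and accessing it, central to what follows, can be sketched with the standard library (an editorial illustration, not the authors'):

    # Toy sketch (editorial illustration): a URI is a structured *name*;
    # parsing it involves no network traffic at all.
    from urllib.parse import urlparse

    parts = urlparse("http://www.example.org/people#alice")
    print(parts.scheme, parts.netloc, parts.path, parts.fragment)
    # -> http www.example.org /people alice

    # Access is a further, separate operation: only an explicit dereference,
    # e.g. urllib.request.urlopen("http://www.example.org/"), invokes the
    # Internet's transmission machinery discussed below.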
As the particular kind of computing embodied by the Web has displaced traditional desktop applications, the foundations of Web architecture and its relationship to wider computing need to be clarified in order to determine its roots, its boundaries, the reasons for its success, and its future developments. This is especially urgent now that debate is opening over platforms and cloud computing and over how they relate to the Web.

Tim Berners-Lee, widely acclaimed as the inventor of the Web, has developed in his design notes informal reflections on the central role of URIs (Uniform Resource Identifiers - previously Locators) as a universal naming system, a central topic in philosophy since at least the pioneering works of Barcan Marcus. URIs (such as http://www.example.org/) identify anything on the Web, so the Web can be considered the space of all URIs. The concrete access mechanism by which information is transmitted via a URI is then determined by the Internet; hence the Web could be built on another architecture (such as the "Future Internet"), and likewise the Internet can also host applications other than the Web, such as peer-to-peer file-sharing.

Possible entities denoted by URIs are called resources. While higher-order ontological debates have continuously tried to provide distinctions between endurants and perdurants (categories that mainly apply to substances), the characterization of resources has relied on vastly different ontological principles, descending from engineering concerns rather than from claims of ontological correctness. Drawing on the work of Vuillemin, we draw a parallel between the Web and philosophical systems. Like such systems, the Web is concerned with traditional issues pertaining to the philosophy of language (URIs as proper names), to ontology (the link between engineering design choices in Semantic Web ontologies and philosophical ones), and to metaphysics (entities of the Web as resources). Unlike philosophical systems, which reflect on the constraints of the world, the Web is a world-wide embodied technical artifact and therefore creates a whole new set of constraints. We suggest that these should be understood as a material a priori - in the Husserlian sense - grounded in history and technology.

In a striking debate between Berners-Lee and Patrick Hayes over URIs and their capacity to uniquely 'identify' resources, Berners-Lee held that engineers decide how the protocol should work and that these decisions should determine the constraints of reference and identity. Hayes replied that names have their possible referents determined only as traditionally understood by formal semantics, which he held engineers could not change but only had to obey. This duality can be interpreted as an opposition between a material and a formal a priori. Interestingly enough, Hayes has recently been focusing on adopting principles from the Web into logical semantics itself.

The material a priori of technical systems like the Web is brought about by what we call "artifactualization", a process whereby concepts become "embodied" in materiality - with lasting consequences. While such a process clearly predates the Web, we can now see within a single human lifetime the increasing speed at which it takes place, and through it technical categories (and philosophical ones) are becoming increasingly dominant over "natural" and "logical" categories.
At the same time, the process of having philosophical ideas take a concrete form via technology often lends them radically new characteristics, transforming these very concepts in the process. Heidegger posited a filiation between technology and metaphysics, with technology realizing the Western metaphysical project (by inscribing its categories directly into concrete matter, we should add). Yet, if technology is grounded in metaphysics, it is not the result of a metaphysical movement or "destiny" (Schicksal) but of a more mundane, contingent historical process, full of surprises and novelties. For all these reasons, it must be acknowledged that the genealogy of the Web, as a digital information system, differs from traditional computation with regard both to the concepts at stake and to our relation to them (the scientific ethos being replaced by an engineering one - something Berners-Lee dubbed "philosophical engineering"). On the Web, the activity of standardization through bodies like the W3C arguably consists in making sense of technological evolution post hoc. Nevertheless, regarding the architecture of the Web, one may argue that its standards were both the result of a process of conscious decision-making in specifying how protocols should work and the result of a constant adjustment to the reality of the technical system. The Web can therefore be seen as an artifact both in the sense of being a designed human invention and in the sense of being a nonhuman (Latour) whose study may lead to numerous unintended discoveries beyond its initial design.

For all these reasons, the very practice of philosophy is transformed by having to take this material a priori and its technical categories as seriously as "natural" or "analytic" categories from biology or natural language. Philosophers then have to deal with technical categories that may have a lasting effect in spheres like the Web, not just as variants of categories that can be analytically understood, but rather as concrete artifacts which can even transform the previously considered analytic categories (ironically, the main challenge to analytic judgments is no longer what Quine called "naturalization" but rather the ongoing artifactualization). While at first glance URIs can be considered just another kind of name, and so inherit the characteristics of and debates in philosophy over the referential status of proper names, the Web makes a difference, as URIs are primarily used to physically access information such as webpages - an aspect of naming for the most part foreign to the philosophy of language. R. Sennett's craftsman's motto might be "doing is thinking"; once concepts have been artifactualized (and, as a consequence, externalized), thinking is also doing or conceiving - in the end, a matter of design.

ONTOLOGICAL COMMITMENTS OF COMPUTER SCIENCE

MIGUEL PAGANO
FaMAF – Univ. Nacional de Córdoba
Medina Allende s/n, X5000HUA
Córdoba, Argentina.

Abstract. We suggest that a fictionalist attitude with respect to Quine's proposal of ontological commitments is best suited for building up an ontology for computer science. In particular, we argue in favour of using theories of programming languages to identify the relevant ontological categories.

1. Introduction

In this extended abstract we propose a novel reading of Quine's ontological commitments [Quine, 1980] to analyse the ontology of computer science.
We argue that a fictionalist posture (see [Szabó, 2010]) can save the genuine concepts of computer science from vanishing as mere mathematical constructions. Although we only discuss aspects related to programming languages and programs, we think that this can lead to a fruitful research programme if extended to other areas of computer science.

2. Programming Languages: Ontology from Semantics

Before coming to our proposal, let us briefly and critically review two papers by A. Eden and R. Turner which deal with the ontology of computer science. In the first paper [Eden and Turner, 2007a] they study the ontological commitments of programming languages. They propose that semantics determines to which entities a particular programming language is committed, and they apply this methodology to a simple imperative language with two kinds of semantics (based on set theory and type theory, respectively). We agree on the use of semantics to determine some of the commitments of computer science; however, it is not clear to us that programming languages themselves have ontological commitments. Instead, these should be attributed to theories of programming languages (TPL). The fictionalist attitude enters here: the fact that a TPL uses a certain mathematical foundation, say set theory, does not imply that its commitments are those carried by the foundational theory. Instead, concepts like abstract syntax, reference, state, or the ordered structure given by the outcome of a certain computation are our candidates for the ontological commitments, i.e. the entities which should be used to reason about programming languages and program-scripts. Rather than appealing to the language in which the genuine concepts are modelled, we propose to justify the commitments in terms of their epistemological value.

In the second paper [Eden and Turner, 2007b], Eden and Turner put semantics aside as the source of the commitments carried by a PL; in this article the underlying programming paradigm determines the true entities to which a programming language is committed. It can be posited that some of the aforementioned examples could be taken to be specific to one or another paradigm; but it is not obvious to us that programming paradigms are good candidates in the search for commitments. Consider, for example, what kinds of reasoning can be done by knowing only the paradigm of a PL, without any deeper theory of the language: it would be surprising if one could decide whether two program-scripts compute the same thing. Stranger to us is the attempt to attach commitments to programming languages or program-scripts themselves: a PL is no more than the description of a set of valid programs (the so-called program-scripts) together with a notion of execution – the former usually given by a more or less abstract grammar, and the latter presented by more or less formal means, ranging from a fully formalised semantics to a merely bogus and ambiguous compiler.

We have already mentioned some ontological commitments with an epistemological basis; now we use syntax to show that TPL are the right place to look for the genuine building blocks of (part of) the ontology of computer science. At first sight the only interesting category arising from considering syntax is that of program-scripts (cf. [Eden and Turner, 2007b]), but program-scripts alone are not descriptive enough to grasp the importance of the different parts of a program-script; a toy illustration is sketched below.
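A toy illustration (ours, in Python) of the syntactic distinction developed next: the same identifier occurs once as a binding formal parameter and once free.

    # Toy sketch (editorial illustration): one name, two syntactic roles.
    x = 10                  # this occurrence of 'x' is free at module level

    def double(x):          # binding occurrence: the formal parameter
        return x + x        # these occurrences are bound by the parameter

    print(double(3), x)     # -> 6 10: the two roles never interfere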
For example, two occurrences of the same variable can play different rôles: one occurrence can be a formal parameter in a procedure or function, and the other an occurrence in a program calling that procedure. From a purely syntactical point of view there should be a distinction between those two occurrences: the formal parameter is a binding occurrence, while the other occurs free. On the other hand, one could also be tempted to pay too much attention to syntax and introduce superfluous concepts, e.g. differentiating between parsed and un-parsed program-scripts, or putting a too restrictive condition on what counts as a program-script. Since the best account of the interesting syntactical phenomena is given by abstract syntax, we should expect to get from its development [McCarthy, 1962; Fiore et al., 1999] the ontological categories corresponding to the syntactical aspects of PL.

3. Conclusion

Let us conclude by commenting on how to use semantics (maybe the best-known area of TPL) for studying the ontology of computer science. We acknowledge that asking for a definite semantics in order to establish a new ontological category can delay the acceptance of new concepts brought by new languages that lack a proper definition and are defined in terms of a compiler or interpreter. While not considering the ontology an immutable edifice, we should refrain from adding new concepts as fast as each new paradigm or PL is announced; instead, we think a more parsimonious attitude should be observed, waiting until a good semantic explanation is given for the newly introduced artefacts. We do not advocate that one kind of semantics should be preferred over others based on the status given by some foundational philosophy of mathematics to its underlying theory; Turner [Turner, 2009] seems to accept that any semantics should be accepted as a mathematical entity by a realistic mathematician. It is clear to us that the various proposed semantics could explain diverse aspects of the same language and account for several ontological categories. (For example, Plotkin's operational semantics leads to a better understanding of the implementation of programming languages [Plotkin, 2004], while Scott's denotational semantics [Scott, 1970] is used to reason about the equivalence of programs without resorting to a particular implementation.)

From the fictionalist posture we adopt, it is futile to try to explain in what sense the categories of an ontology built up by following TPL are metaphysically more relevant than those arising from other proposals, say Eden and Turner's papers. Our proposal would correspond to what Smith [Smith, 2003] calls an "internal metaphysics", and its merits reside in how well it accounts for the phenomena studied in computer science.

Acknowledgements

I am grateful to Martin Diller, Pío Garcia, and Renato Cherini for encouraging me to write this abstract. My work is funded by CONICET, Argentina.

References

Eden, A. H. & Turner, R. (2007a). Towards a programming language ontology. In Computation, Information, Cognition: The Nexus and the Liminal (pp. 147–159). Cambridge Scholars Publishing.
Eden, A. H. & Turner, R. (2007b). Problems in the ontology of computer programs. Applied Ontology, 2(1), 13–36.
Fiore, M., Plotkin, G. & Turi, D. (1999). Abstract syntax and variable binding. In Proceedings of the 14th Annual IEEE Symposium on Logic in Computer Science, LICS '99 (pp. 193–202). Washington, DC: IEEE Computer Society.
McCarthy, J. (1962). Towards a mathematical science of computation. In IFIP Congress (pp. 21–28).
Plotkin, G. D. (2004). The origins of structural operational semantics. Journal of Logic and Algebraic Programming, 60–61, 3–15.
Quine, W. V. O. (1980). On What There Is. In From a Logical Point of View (pp. 1–19). Harvard University Press.
On What There Is. In: From a Logical Point of View (pp. 1–19). Harvard University Press.
Scott, D. S. (1970). Outline of a Mathematical Theory of Computation. Technical Report PRG-2, Oxford, England.
Smith, B. (2003). Ontology. In: The Blackwell Guide to the Philosophy of Computing and Information. Wiley-Blackwell.
Szabó, Z. G. (2010). The Ontological Attitude. In: The Analytical Way. Proceedings of the 6th European Congress of Analytic Philosophy. London: College Publications. Available at http://pantheon.yale.edu/~zs47/documents/Theontologicalattitude.pdf
Turner, R. (2009). The Meaning of Programming Languages. American Philosophical Association Newsletter on Philosophy and Computers, Fall 2009 (pp. 2–7).

Semantics of Programming Languages

UWE V. RISS
SAP Research Karlsruhe
Vincenz-Priessnitz-Str. 1, 76131 Karlsruhe, Germany

Abstract. The grounding of the semantics of programming languages is investigated. It is argued that the meaning of programming languages results from the operations that they abstract and from the interpretation of these operations in terms of human activities as the final point of reference. This view opposes the traditional interpretation of the semantics of programming languages, which refers to higher-order abstraction as its basis, whereas the current view sees these semantics as rooted in the actual performance realized by concrete implementations, taking a pragmatic stance.

1. Introduction

The central aim is to investigate the role of computers and the grounding of the semantics of programming languages. Traditional approaches to the semantics of programming languages, such as operational or denotational semantics (Turner, 2007), aim at abstracting from the differences between individual implementations to find the common meaning behind them. Operational semantics does this by referring to abstract machines, while denotational semantics refers to mathematical structures. In the following it is argued that semantics cannot be understood in such terms of higher-order abstraction but, on the contrary, must be rooted in concrete operations. We can understand the mentioned approaches as objectifications of the perceived equivalence of the respective operations. However, the point of reference for semantics cannot be this objectification but the underlying concrete operations and their perceived equivalence (Saab and Riss, 2010), in analogy to the natural sciences, the basis of which are experiments and not scientific laws.

2. Activity Theory

For this purpose we primarily regard computers as tools in human activity. The framework for this consideration is Activity Theory (Engeström, 1987), which describes the relation between persons (subjects), the objects of their activities, and the context of these activities in the schematic triangle depicted in Figure 1.

Figure 1. Activity Triangle.

The core triangle of subject (human agent), community, and object has been extended towards tools, communication (social mediation), and division of labour. All human activity is directed towards an object and aims at a desired output.
The social context includes language and communication, which mediate the interaction between subject and community. Hereby communication appears as a means for activity coordination and knowledge transfer within a community and thus enables division of labour. Understanding computers merely as tools in this system, however, is not sufficient, since this neglects several specific aspects such as the separation of hardware and software. The term programming language already indicates that the concept of software is related to communication, while hardware represents a traditional tool concept. Thus, programming languages serve as a means of communication between the subject and the hardware representing the proper tool. This interpretation can be further supported by the objective of artificial intelligence research to introduce intelligent agents that are equivalent to human agents regarding their intellectual capacity. Even if this goal is not reached, computers move down in the diagram from the top position (tool) towards a middle position where more complicated coordination and communication is required.

3. Fundamental Understanding of Semantics

To understand the semantics of programming languages we have to go back to natural languages. These are generally used as means to coordinate the activities among collaborating human agents and to transfer knowledge; programming languages are used to organise the division of labour between the human agent and the computer and to instruct the computer what to do, both at a rather elementary level. If we look at two key features of natural language, abstraction and symbolization, we also find them in programming languages. Every line of code in an ordinary computer program symbolises an abstraction of simple operations that both humans and machines can (usually) execute with equivalent results. Thus, abstraction is the key to the transferability of operations from one person to another or from a person to a computer. However, abstraction must not be regarded as absolute but as a process of identification. Symbolization as the manifestation of such identity serves as the basis of the machine's automatic processing of programs. On both sides, human agents and computers, it is the capacity to reliably interpret symbolic expressions which ensures a repeatable execution of operations and the use of the computer as a tool. The basis for communication via symbolized abstraction and coordination of operations is shared meaning. Here the meaning of messages includes two aspects: the interpretation of messages and the expectation that others understand them in a similar way (Saab and Riss, 2011). In the case of computers it is sufficient that this expectation is one-sided, that is, from the human agent towards the machine; the computer is not supposed to have expectations. Regarding the concept of meaning we refer to a pragmatist view that understands the meaning of a message as what an agent can do with this message (Stegmaier, 2008). For the subject, the meaning of program code is determined by the subject's knowledge of how to execute the included operations, while the hardware determines the 'meaning' for the computer, that is, the computer is able to execute the program. Naturally, semantics is not equated with execution – a single malfunction does not spoil the meaning of a computer program – but with execution as a repeated process of significant reliability. In the case of computers we even find a more reliable execution than what we can expect of human agents.
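As a toy illustration of this point (the example and names are ours, not Riss's), one line of code can stand for an operation that a human could also carry out by hand, with (usually) equivalent results, and whose repeatable machine execution is what grounds its 'meaning' for the computer:

```python
# Symbolized abstraction: the line inside total() abstracts the
# concrete operation "add up these prices", which a human clerk
# could equally perform by hand with (usually) the same result.

def total(prices):
    return sum(prices)   # one symbolic line standing for many
                         # elementary addition operations

# Repeatable, reliable execution grounds the machine-side 'meaning':
assert total([2, 3, 5]) == 10   # a human can check this by hand
print(total([2, 3, 5]))
```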
4. Abstract Semantics

If the meaning of programming languages is not constituted by higher levels of abstraction but by concrete operations, we have to clarify the role of abstract formal approaches as they appear in operational or denotational semantics (Turner, 2007). In the same way as mathematical models abstract human activities, these formal semantic models abstract operations and serve as a means to support program development and testing. Formal definitions are only meaningful inasmuch as they refer to established human practice. Indeed, engineers had constructed computers before researchers applied formal semantics to programs, so formal semantics cannot be seen as the actual foundation of computer languages. Formal semantics can only support the development process, not constitute it. The presented approach shows some links to Rapaport's idea of implementation as semantic interpretation (Rapaport, 2005). It also resembles the idea of information as sense-making of data (Saab and Riss, 2011), where programs are understood as data whose meaning results from an interpretation process that is determined by the projected operations, that is, by what the computer can do with a program.

References

Engeström, Y. (1987). Learning by expanding: An activity-theoretical approach to developmental research. Helsinki: Orienta-Konsultit Oy.
Rapaport, W. J. (2005). Implementation is semantic interpretation: Further thoughts. Journal of Experimental & Theoretical Artificial Intelligence, 17(4), 385–417.
Saab, D. J. & Riss, U. V. (2010). Logic and abstraction as capabilities of the mind. In: J. Vallverdù (Ed), Thinking Machines and the Philosophy of Computer Science: Concepts and Principles (pp. 132–148). Hershey, PA: Information Science Reference.
Saab, D. J. & Riss, U. V. (2011). Information as Ontologization. Journal of the American Society for Information Science and Technology. (accepted for publication)
Stegmaier, W. (2008). Philosophie der Orientierung. Berlin, New York: Walter de Gruyter.
Turner, R. (2007). Understanding programming languages. Minds and Machines, 17(2), 203–216.

QUINEAN HOLISM AND THE INDETERMINACY OF COMPILATION

NATHAN SINCLAIR
Macquarie University
Nathan.Sinclair@mq.edu.au

1. Motivation

No other philosophical doctrine with even the remotest skerrick of plausibility would, if vindicated, so radically overthrow our current understanding of language, psychology and rationality as Quinean semantic holism. If individual words and sentences do not have meanings, then we cannot explain communication as the transmission of ideas or judgments, nor appeal to sentence meanings as objects of putative propositional attitudes, nor explain reasoning in terms of the discernment of relationships between the meanings of premises and conclusions. The very fact that sentence meanings are so fundamental to our current accounts of semantics, cognitive psychology, and reasoning has meant that objections to Quinean holism which, if deployed against less radical claims, would be lightly dismissed have been taken very seriously indeed. Most such objections appeal, broadly, to two hopes or assumptions.
On the one hand, it is claimed that the range of evidence proponents of Quinean holism have considered relevant to meaning and translation is too narrow, and it is hoped that somewhere beyond that range, perhaps in normative social practices or introspection, there is evidence to justify the attribution of determinate meanings to our words and sentences. On the other hand, it is claimed that arguments for the indeterminacy of translation must be reductiones ad absurdum because at best they show that the range of evidence considered is "unable to account for distinctions concerning the feature, meaning, which we know independently to exist" (Searle, 1987). While objections based on wishful thinking and "just knowing" would be dismissed if used to defend less well entrenched prejudices, once given any weight they have the dubious merit of stymieing further theoretical argument. No argument based upon lack of evidence is strong enough to preclude the hope of finding further evidence for such a dearly and deeply held assumption.

To advance the dispute we need examples of alternative incompatible translations between theories expressed in clearly holistic languages. Ideally, such examples of alternative translations between holistic languages would be pre-existing translations routinely employed for practical purposes, rather than philosophical inventions. Ideally also, the languages involved would be rigorously specified, with formal compositional grammars precisely delineating their well-formed formulae, and the theories would express their empirical contents so clearly and unambiguously that those contents could be mechanically determined. Even better if the theories being translated included both small and easily understood theories (so that we might easily see the scope and consequences of the indeterminacy of translation) and theories as large and complex as our grandest scientific theories (so we could see that the indeterminacy is not an artifact of theoretical simplicity). Better yet if each such theory could be taken as complete and self-standing, in order to ensure that the indeterminacy of translation is not the result of taking statements out of context. Astoundingly, all these desiderata are fulfilled by programming languages, compilers, and computer programs: languages, forms of translation, and theories so common that few of us in the developed world are ever more than arm's length from tools that rely upon them for their operation.

2. Outline

In part one of this presentation I argue that computer programs are (readily converted into) empirical theories. Programs' empirical contents are the patterns of input and output produced by processes executing them. The under-determination of programs by their input-output is so well known and unthreatening that in many universities a high degree of similarity of program structure, even between simple programs required to produce the same output, is grounds for suspicion of plagiarism. Furthermore, programs are obviously holistic in the sense that (most) statements in computer programs do not produce any output, nor is any fragment of the output of such programs directly attributable to them.
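A hedged illustration of both points (the programs are our invention, not Sinclair's): the two functions below have identical input-output behaviour but unrelated internal structure, and no single statement of the first can be credited with any particular fragment of the output.

```python
# Two structurally dissimilar programs with the same empirical content,
# i.e. the same input-output patterns (both map n to 1 + 2 + ... + n).

def sum_loop(n):
    total = 0
    for i in range(1, n + 1):  # no one statement here "produces"
        total += i             # any fragment of the result on its own
    return total

def sum_formula(n):
    return n * (n + 1) // 2    # closed form, entirely different structure

# Observationally equivalent on every tested input:
assert all(sum_loop(n) == sum_formula(n) for n in range(1000))
```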
This insight allows us to make sense of the Quinean doctrine that individual sentences simply do not have meanings, and to see that the inferential/conceptual role semantics many critics (most notably Fodor and Lepore) attribute to Quine, according to which the meanings of individual sentences are determined by the theories of which they are a part, is a grotesque misinterpretation of Quinean holism.

In part two I show that compilation (and decompilation) is a form of translation by the standards Quine advocated, and then argue briefly that those standards are adequate and that compilation is translation simpliciter. I then show that the indeterminacy of compilation is well known and unthreatening to computer scientists. The only guarantee given by ISO-standard-compliant compilers is the preservation of input-output behaviour, and computer scientists know that independently written compilers are unlikely to produce the same machine (or high-level) code given the same source code, and are unsurprised when decompilers cannot accurately reconstruct original source code. Furthermore, computer programs obviously exemplify the principles of (near) universal revisability and maintainability that philosophers have found so troubling and implausible, and yet, as the practice of debugging shows, there can be good reason to revise some sentences and not others in the face of recalcitrant experience.

In part three I consider recent developments in the semantics of programming languages, ask whether the indeterminacy of compilation is sufficient to undermine the existence of an analytic-synthetic distinction in programming languages, and argue that the translation of natural languages is less tightly determined than the translation of programming languages. The position I advocate in this presentation is compatible with both normative and dispositional accounts of semantics. Whether the ISO standard for the C programming language is regarded as specifying dispositions possessed by C programs and compilers, or the norms to which programs are subject once they are held to be C compilers, the compilation of C programs is (properly) indeterminate and C programs are (properly) under-determined by the input-output they are intended to produce.

In order of increasing ambitiousness, I hope people who attend this presentation will discover that Quinean holism is not a form of inferential/conceptual role semantics, that computer programming languages are holistic and exemplify the controversial features of Quinean holism, that compilation exemplifies indeterminate translation, and why it is plausible that the translation of natural languages is even less determinate than compilation.

References

ISO/IEC WG14 N1256: Programming Languages – C, 2007-09-07, International Organization for Standardization, Geneva, Switzerland. http://www.openstd.org/jtc1/sc22/wg14/www/standards
Allison, L. (1986). A Practical Introduction to Denotational Semantics. Cambridge University Press.
Fodor, J. & Lepore, E. (1992). Holism: A Shopper's Guide. Blackwell Publishers.
Fodor, J. (2004). Having Concepts: a Brief Refutation of the Twentieth Century. Mind and Language, 19(1), 29–47.
McDermott, M. (2009). A Science of Intention. The Philosophical Quarterly, 59(235), 252–273.
Morrison, J. (2008). Just how controversial is evidential holism? Synthese, 173(3), 335–352.
Okasha, S. (2000). Holism about meaning and about evidence: in defence of W. V. Quine. Erkenntnis, 39–61.
Quine, W. (1961).
Two dogmas of empiricism. In: From a Logical Point of View (pp. 20–46), 2nd ed. Harvard University Press.
Quine, W. (1964). Word and Object. MIT Press.
Quine, W. (1977). Ontological Relativity and Other Essays. Columbia University Press.
Searle, J. (1987). Indeterminacy, empiricism, and the first person. The Journal of Philosophy, 84(3), 123–146.
Winskel, G. (1993). The Formal Semantics of Programming Languages. MIT Press.

IS FINDING A 'BLACK SWAN' (POPPER, 1963) POSSIBLE IN SOFTWARE DEVELOPMENT?

LINDSAY SMITH
University of Hertfordshire, UK
l.1.smith@herts.ac.uk

AND

PAUL WERNICK
University of Hertfordshire, UK
p.d.wernick@herts.ac.uk

AND

VITO VENEZIANO
University of Hertfordshire, UK
v.veneziano@herts.ac.uk

Introduction

Users' experience of software-based technology that fails to meet their expectations is so widespread as to be a 'commonplace' occurrence (Smith, 2009). However, a satisfactory response from software engineering (SE) remains as elusive as ever. In this paper we investigate the context of SE as a negotiation between the contradiction(s) of human subjective experience of software-based technology and that technology's reliance on an architecture inclusive of objectivity – for example, machine programming languages that can be mathematically proven 'Turing complete' (cf. the Church-Turing Thesis; Eden, 2007). Consideration of the technological context of SE demands a philosophical re-evaluation of the ontological and epistemological status of SE within Computer Science (CS). We have undertaken a cross-disciplinary investigation to reposition unresolved problems in SE, one which potentially also opens up philosophical debate – for example, if we introduce the development of software technology as a subject area for unresolved metaphysical debate, such as the Kantian analytic/synthetic a priori dispute (Hacker, 2006). The limitations on this paper preclude explicit discussion of the 'pros and cons' of metaphysics for SE, or vice versa; however, some basic principles echo implicitly in our discussion – for example, our above comments on objectivity, which is possible for machine code but (currently?) impossible for an a priori understanding of subjective stakeholder software requirements. This implies that Requirements Engineering (RE) practice occupies an epistemological 'gap' between the architectural basis of software and how it is built/used. For our discussion, one positive consequence of a cross-disciplinary approach is that novel questions can be asked. It would appear to be the case, for example, that RE practitioners' gaining an understanding of stakeholders' requirements is compatible with the Kantian epistemological classification of 'synthetic a posteriori' (Hacker, 2006). This raises the possibility of other epistemological explanations for questions such as why SE compares unfavourably for reliability with other engineering disciplines. For example, civil engineers can respond to unexpected circumstances in bridge construction by correcting faults (BBC, 2000), whereas the hazards of safety-critical faults in aircraft cockpit software cannot be addressed in an equivalent way. As Mellor (1990) explains, the aviation industry certifies software for 'airworthiness' based on the 'correctness' of the software development process but not on the 'correctness' of the behaviour of the software during testing. Software development includes planning and designing artefacts but also presents SE with predictive-type problems.
For example, RE identifies/selects software requirements to satisfy stakeholders' future use of software. However, RE lacks reliable or dependable tools/techniques to predict outcomes (Nuseibeh, 2000).

Rationale

We are interested in why computer science (CS) has not established scientific laws that can predict SE outcomes, unlike, for example, civil engineering, which relies on the established natural laws of physics. The difference between CS and the natural science (NS) paradigm manifests in the division between observation of naturally occurring phenomena and contending with artificially occurring phenomena, e.g. software. Human interaction with software-based technology gives Social Science (SS) paradigm(s) (Burrell, 1979) potential ontological relevance for CS (Smith, 2010). For example, both SS and CS need to observe 'non-physical' phenomena such as human interaction. However, cross-disciplinary research depends on what is optimal in a particular paradigm for research purposes. Utilising different scientific paradigms (Hirschheim, 1989) is not straightforward. As a result we chose, conservatively, to employ SS to provide a dialectical analysis of contradictions in software development such as those outlined above. In particular, we opposed a potential (1) 'scientific paradigm' of CS (Eden, 2007) with (2) Ethnomethodology (Ethnometh), an SS approach that challenges scientific paradigm(s) in SS (Garfinkel, 1967) and has provenance in RE research (Goguen, 1994). Our purpose is to explore the potential for obtaining leverage over limitations in our understanding of software development.

Can a science base for software development be identified? For (1) to provide prediction, a relevant definition of science needs to apply to CS. Reasons to doubt this possibility are raised by (2), and we consider this in the observation of artificial phenomena in software development. The critical perspective of Ethnometh centres on the scope and meaning of science. We focus on 'scientific method' (SM) because this is how scientific prediction is achieved, resulting in the development and acceptance of scientific theories as explanation(s) of meaning. SM is defined as a process that relies on both inductive reasoning and observable phenomena to create a hypothesis that can be tested. Prediction of events or observations is then a process of deductive reasoning relying on theory to direct hypothesis testing. Prediction of SE outcomes is important, and good practice in SE is implicitly 'Popperian' (Popper, 1963), e.g. software is built to be testable. However, equating software testing with SM, e.g. with a refutable hypothesis, is questionable (Eden, 2007).
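The following sketch (ours, not the authors') shows software being 'built to be testable' in this implicitly Popperian sense: the test states a refutable hypothesis that a single counterexample would falsify, while any number of passing runs corroborates but never proves it, which is one reason equating testing with SM is questionable.

```python
# A unit test as a refutable hypothesis about a program (example ours).
# Hypothesis: sort_descending orders its input and loses no elements.

def sort_descending(xs):
    return sorted(xs, reverse=True)

def test_sort_descending():
    for xs in ([3, 1, 2], [], [5, 5, 1], list(range(50))):
        out = sort_descending(xs)
        assert out == sorted(out, reverse=True)  # ordering holds
        assert sorted(out) == sorted(xs)         # no elements lost
    # Passing corroborates the hypothesis; it does not verify it.

test_sort_descending()
```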
One central problem for establishing a scientific basis for software development is observation. Predictive SE, if possible, must have refutable observable phenomena (Smith, 2010). Yet any observation is via a human 'prism', hence the relevance of the Ethnometh criticism of applying SM to social phenomena, e.g. human behaviour (Garfinkel, 1967). For software development, human-technology interaction, e.g. input and output on a screen, is the point at which an artificial phenomenon (software) interfaces with its social environment (Smith, 2009). It is also the point where an SS paradigm that "capture(s) the basic assumptions of coexistent theories" (Hirschheim, 1989) becomes relevant to CS. Opposing theories in SS do not make the application of SM straightforward. However, CS is currently in a unique cross-disciplinary position. This is because software-based technology replaces previously existing environments/phenomena with artificially occurring environments/phenomena. SE practice provides the means by which phenomena, such as the results of the execution of source code, are possible to observe. SM has been applied via 'artificial' means before, such as instrument-assisted observation of otherwise unobservable phenomena. Historically, scientific experimentation produced, for example, the discovery of electricity via investigation of directly unobservable magnetism (Mendelssohn, 1976). Certainly, using artificial tools to 'empirically' observe naturally occurring phenomena, such as weather patterns, requires attention to both natural and artificial environments. Including SS paradigm(s) raises tantalising prospects, such as the potential for SE to provide the means to observe artificial phenomena.

Bibliography

BBC (2000). http://news.bbc.co.uk/hi/english/static/in_depth/uk/2000/millennium_bridge/default.stm, accessed 15/03/11.
Burrell, G. (1979). Sociological Paradigms and Organisational Analysis. Heinemann.
Eden, A. (2007). Three Paradigms of Computer Science. Minds & Machines, 17, 135–167.
Garfinkel, H. (1967). Studies in Ethnomethodology. Prentice-Hall.
Goguen, J. (1994). Requirements Engineering as the reconciliation of Technical and Social Issues. In: Requirements Engineering: Social and Technical Issues. London: Academic Press.
Hacker, P. M. S. (2006). Passing by the naturalistic turn: on Quine's cul-de-sac. Philosophy.
Hirschheim, J. (1989). Four Paradigms of Information Systems Development. ACM, 32(10).
Mellor, P. (1990). 10 to the -9 and all that: The non-certification of flight-critical software. City University London.
Mendelssohn, K. (1976). Science and Western Domination. London: Thames and Hudson.
Nuseibeh, B. (2000). Requirements Engineering: A Roadmap. Proc. ICSE 2000.
Popper, K. (1963). Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge.
Smith, L. (2009). Meeting stakeholder expectations of software, or looking for the 'Black Swan' in software requirements. Proc. ECAP09.
Smith, L. (2010). Software development: Out of the black box. Proc. ECAP10.

ONTOLOGY: from Philosophy to ICT and related areas. Problems and Perspectives.

SOLODOVNIK IRYNA
PhD student, International PhD School of Humanities
University of Calabria, Pietro Bucci, 87036, Arcavacata di Rende (CS), Italy

Abstract. This paper briefly highlights the development of the concept of Ontology, from its philosophical roots up to its vision in the ICT field and related areas. Philosophically, Ontology is a systematic explanation of Being that describes the features of Reality. Nowadays Ontology is proliferating in the organization of the Knowledge of different domains managed by advanced computer tools. Ontology qualifies and relates semantic categories, dragging with it, however, the idea of what, since the seventeenth century, was a way to organize and classify objects in the world. Ontology maximizes the reusability and interoperability of concepts, capturing new Knowledge within the most granular levels of information representation. Ontology is subjected to a continuous process of exploration, formation of hypotheses, testing and review. An ontological thesis proposed as true today may be rejected tomorrow in light of further discoveries and new and better arguments.
Philosophical background of Ontology

Webster's Third New International Dictionary defines Ontology as "1. a Science or study of Being: specifically, a branch of Metaphysics relating to the Nature and relations of being; 2. a Theory concerning the kinds of entities and specifically the kinds of abstract entities that are to be admitted to a language system". Literally, the word Ontology comes from the Greek ὄντος (òntos) and λόγος (lògos), meaning "speech about Being", but it may also derive explicitly from τὰ ὄντα (entities), variously interpreted according to different philosophical points of view. Aristotle proposed the first known category system, standing for a certain vision of the world in relation to what is judged to exist in practice. Heidegger conceived Ontology as a "phenomenology of the exploration" of what there "is" and of how it turns out. The ontological conceptualization, as a cohesive philosophical area, was introduced in 505–504 BC by Parmenides. He was the first to pose the argument about Being in its totality, presenting the issue of the ambiguity among the conceptual level, Ontology and language. Parmenides recognized the ontological dimension as dominant, able to subject to itself any other aspect of Philosophy. Over the centuries, the meaning of Ontology changed depending on the different visions and knowledge of other philosophers: Leucippus, Democritus, Plato, Aristotle, Descartes, Kant, Lorhard, Hegel, Trendelenburg, Brentano, Stumpf, Meinong, Husserl, Heidegger, Gockel. Some of them gave more value to an absolute belief, others to empirical things, thus enriching the heritage of Philosophy with what is considered its problem "par excellence" (the problem of existence in its fullest extent and universality: the relationship between particular and universal, intrinsic and extrinsic, essence and existence). "Indeed, without Ontology, Philosophy cannot be developed according to the demonstrative method. Even the art of discovery takes its principles from Ontology" (Blackwell, 1963).

Towards a new Ontology

The advent of the Semantic Web (Breitman, 2007), aimed at multi-objective optimization of the ICT environment and technological innovation in general, has coined a new vision of Ontology, so that it is considered today as a "formal, explicit specification of a shared conceptualization" (Gruber, 1995). Ontology, intended as a first-order axiomatic theory expressed in a description logic, is fundamental to the design of advanced knowledge-based software systems (Guarino, 1998; Eden, Turner, 2005). It is of great interest to combine lexical resources, such as thesauri (Broughton, 2006), with the world knowledge provided by Ontologies in order to improve deductive reasoning with natural language, as well as to enhance automatic classification (e.g. in Ontology-based cataloging systems), problem-solving techniques, interoperability among different computer systems, cross-cultural and intercultural communication in CMC (Ess, Sudweeks, 2005), etc. Since Ontology is the basis of web intelligence, it is also widely used in e-commerce, on-line marketing, business management, etc. In Fig. 1 we can observe philosophical reflection in the field of computer science and information technology (Floridi, 2002; Colburn, 2003; Gruber, 2009). Here Thought (which is regulatory/normative with respect to Reality), through Language (which defines the existing categories reflecting Thought and Reality), is connected with Ontology and Epistemology, representing the descriptive and prescriptive approaches. Ontology refers to the objective validity (Husserl, 1992) of a terminology waiting to be discovered by domain knowledge experts, while Epistemology provides model reasoning in class-based representation formalisms through description logics.

Figure 1. The ontological and epistemological turn in Computer Science.

Automated reasoning and Ontology manipulation in description logics allow one to present and emulate human logic-based knowledge of entities in different domains, managing simultaneously dissimilar types of objects (concrete and abstract, independent and dependent) and their ties (relations, dependencies and predications).
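A toy sketch of such class-based reasoning may be useful; the tiny 'ontology' and the code are our illustration, not a real description-logic reasoner:

```python
# From the asserted axioms "Thesaurus is-a LexicalResource" and
# "LexicalResource is-a InformationResource", a classifier derives the
# implicit subsumption "Thesaurus is-a InformationResource".
# Class names are invented for illustration.

subclass_of = {
    "Thesaurus": "LexicalResource",
    "LexicalResource": "InformationResource",
}

def superclasses(concept):
    """Collect every concept that subsumes the given one."""
    found = []
    while concept in subclass_of:
        concept = subclass_of[concept]
        found.append(concept)
    return found

print(superclasses("Thesaurus"))
# ['LexicalResource', 'InformationResource']
```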
Creating a single knowledge-sharing paradigm is neither an easy nor an immediate task, considering also non-trivial technological obstacles (consistency and validity of Ontologies vs. time and the evolution of information technology). It remains an appealing challenge to set up new scientific environments in which philosophers and other scholars can meet to discuss and develop strategies to classify, organize and implement qualitative conceptual domains, and even more those represented by different semantic systems tied to language differences.

References

Breitman, K., Casanova, M. & Truszkowski, W. (2007). Semantic Web: Concepts, Technologies and Application. NASA Systems and Software Engineering Series, 1st ed. London: Springer Verlag.
Broughton, V. (2006). Essential thesaurus construction. London: Facet.
Colburn, T. R. (2003). Philosophy and Computer Science. Armonk: Sharpe.
Eden, A. H. & Turner, R. (2005). Towards an Ontology of software design: The Intension/Locality Hypothesis. 3rd European Conf. Computing And Philosophy ECAP, 2–4 Jun, Västerås, Sweden.
Ess, C. & Sudweeks, F. (2005). Culture and computer-mediated communication: Toward new understandings. Journal of Computer-Mediated Communication, 11(1).
Gruber, T. (2009). Ontology. In: Ling Liu and M. Tamer Özsu (Eds), Encyclopedia of Database Systems. Springer-Verlag.
Guarino, N. (1998). Formal Ontology and Information Systems. In: N. Guarino (Ed), Formal Ontology in Information Systems. Proceedings of FOIS 1998, Trento, Italy, 6–8 Jun. Amsterdam: IOS Press.
Husserl, E. (1929). Formal and transcendental logic. English translation: The Hague, Martinus Nijhoff (1969).
Smith, B. (2003b). Ontology. In: L. Floridi (Ed), Blackwell Guide to the Philosophy of Computing and Information. Oxford: Blackwell.
Wolff, C. (1728). Preliminary discourse on philosophy in general. Translated, with an Introduction and notes, by Richard J. Blackwell (1963). Indianapolis: The Bobbs-Merrill Company.

THE EVOLUTION OF SOFTWARE AGENTS AS DIGITAL OBJECTS

SABINE THÜRMEL
Graduate Center of the TUM School of Education
Technical University of Munich, Munich, Germany

Abstract. The evolution of software agents as digital objects, from simple interface agents to full-blown interaction partners, is depicted. An outline of the concretization process in agent-oriented programming is given, contributing to research into the ontology of computer programs.

Extended Abstract

The focus of this paper is on the evolution of software agents as digital, i.e. computational, objects. It can be shown that a new type of interplay between human beings, "computational objects" and the physical environment is in the process of emerging.
Turkle's insight (2006) into the nascent robotics culture is equally valid for software agents: "computational objects do not simply do things for us, but they do things to us as people, to our ways of seeing ourselves and others. Increasingly, technology puts itself into a position to do things with us" (p. 1). The starting point of this evolution was constituted by interface agents providing assistance for the user or acting on his or her behalf. As envisioned by Laurel (1991) and Maes (1994), they evolved into increasingly autonomous agents. In game worlds they were first seen in one-person offline video games. Interacting pure software agents and avatars became prevalent in MMORPGs (massively multiplayer online role-playing games) such as World of Warcraft®. As interworking collaborative software agents embedded in nets of devices, they provide support for smart grids (Mainzer, 2010) or for other variants of the "Internet of things" (Mattern/Langheinrich, 2008). Last but not least, they are used to coordinate emergency response services in disaster management systems (Jennings, 2010). Already in 1992, Solum posed the question in the North Carolina Law Review whether virtual agents may be the basis for persons in the legal sense of the law (Solum, 1992). Today virtual agents are commonly deployed in online auctions and eNegotiations (Wooldridge, 2009). Thus software agents have been promoted from assistants to virtual interaction partners. The socio-technical fabric of our world has been augmented by these collaborative systems.

The goal of the agent-oriented programming paradigm is the adequate and intuitive modeling and implementation of complex interactions and relationships. Software agents were introduced by Hewitt's Actor Model (Hewitt et al., 1973). Today a whole variety of definitions of software agents exist, but all of them include mechanisms to support persistence, autonomy, interactivity and flexibility. Bionic approaches, such as swarm intelligence, or societal models are adapted to implement collaborative approaches to distributed problem solving. They are, on the one hand, part of the tool kit used in the computational sciences, which use computer-based simulations as a link between theory and experiment. As such they are similar to numerical simulation but use different conceptual and software models. On the other hand, they provide a basis for agency in virtual worlds offering novel experiences. They provoke us to ask how this technological progress will affect our interpersonal relationships (Turkle, 2011).

The starting point of any software-agent-based approach is a bionic or societal metaphor for distributed problem solving. The resulting computer science concept is specified as a computer program modeling the interacting software agents. At compile time the high-level program is transformed into a machine-executable computer program to be run in a distributed environment. During runtime, any (instance of a) software agent may be perceived as a distinct thread or process. This concretization process conforms to the program abstraction taxonomy introduced in (Eden and Turner, 2007). From an ontological perspective it can be stated that the underlying computer science concepts are abstract objects that can be concretized by computer programs conforming to an agent-oriented programming paradigm.
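As a hedged sketch of this concretization chain (ours, in the spirit of Hewitt's Actor Model rather than any particular agent platform), the agent concept below is written down as a program, and each runtime instance is concretized as a distinct thread consuming messages from a mailbox:

```python
# Minimal actor-style agent (illustration ours): the agent concept is
# concretized as a program, and at runtime each instance becomes a
# distinct thread reacting autonomously to incoming messages.
import queue
import threading

class EchoAgent:
    def __init__(self, name):
        self.name = name
        self.mailbox = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()      # persistence: keeps running
            if msg is None:               # poison pill stops the agent
                break
            print(f"{self.name} handled: {msg}")  # autonomous reaction

agent = EchoAgent("agent-1")              # runtime concretization
agent.mailbox.put("hello")                # interactivity via messages
agent.mailbox.put(None)
agent.thread.join()
```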
The computer programs are themselves abstract objects that can be concretized by adequate computational objects conforming to a (different) programming paradigm or by concrete physical objects. Different concretizations may exist for one computer program. It should be noted that the identical agent-oriented program may first be tested in a simulated environment and then employed in a real-time environment. Similar to (Reicher-Marek, 2009), four basic relations between computer programs and other objects may be distinguished: the concretization relation outlined above, the notation relation (between the abstract object and its (textual or graphical) specification), the environmental relation (between the abstract object and its potential runtime environments), and the instantiation-at-runtime relation coupling the abstract object to its dynamic instantiations. In my view, any non-trivial identity notion for computer programs has to take these relationships into account.

References

Eden, A. H. & Turner, R. (2007). Problems in the ontology of computer programs. Applied Ontology, 2(1), 13–36. Amsterdam: IOS Press.
Hewitt, C., Bishop, P. & Steiger, R. (1973). A universal modular actor formalism for Artificial Intelligence. In: International Joint Conferences on Artificial Intelligence, 235–245.
Jennings, N. (2010). ALADDIN End of Project Report. www.aladdinproject.org. Cited 25 April 2011.
Laurel, B. (1991). Computers as Theatre. New York: Addison-Wesley.
Maes, P. (1994). Agents that reduce work and information overload. Communications of the ACM, 37(7), 30–40.
Mainzer, K. (2010). Leben als Maschine? Von der Systembiologie zur Robotik und Künstlichen Intelligenz. Paderborn: Mentis Verlag.
Mattern, F. & Langheinrich, M. (2008). Eingebettete, vernetzte und autonom handelnde Computersysteme: Szenarien und Visionen. In: A. Kündig and D. Bütschi (Eds), Die Verselbständigung des Computers (pp. 55–75). Zürich: vdf Verlag.
Reicher-Marek, M. (2009). What is the object in which copyright can subsist? An ontological analysis. In: E. Ortland and R. Schmücker (Eds), Copyright & Art: Aesthetical, legal, ontological and political issues. Baden-Baden: Nomos. (to appear)
Solum, L. (1992). Legal personhood for artificial intelligences. North Carolina Law Review, 2, 1231–1283.
Turkle, S. (2006). A nascent robotics culture: new complicities for companionship. Paper presented at the 21st National Conference on Artificial Intelligence, Boston, July 2006.
Turkle, S. (2011). Alone Together: Why We Expect More from Technology and Less from Each Other. New York: Basic Books.
Wooldridge, M. (2009). An Introduction to MultiAgent Systems (2nd ed). New York: John Wiley & Sons.

MACHINES and COMPUTATIONS

RAYMOND TURNER
Department of Computer Science and Electronic Engineering
University of Essex
Wivenhoe Park, Colchester, Essex CO4 3SQ, UK

Abstract

How may abstract and physical machines be related? What is the difference between considering an abstract machine as:
1. A theory of a physical one
2. A functional description of one
3. A specification of one?
Do these distinctions throw any light on the nature of physical computation and the arguments of Putnam and Pancomputationalism?

Track II: Philosophy of Information and Cognition
ON THE LEVEL OF CREATIVITY
Ponderings on the Nature of Kantian Categories, Creativity and Copyrights

ALEXANDER FUNCKE
Centre for Study of Evolutionary Culture at Stockholm University
106 91 Stockholm

Abstract. The relation between data and information is considered in analogy with Kantian transcendental aesthetics in order to create a formal concept of, and an ordinal relation for, "creativity". Implications are discussed for Kantian categories, creativity and copyrights.

1. Background & Aims

Creativity is a popular concept for controversy in many disciplines. This paper does not necessarily contain the deepest insights, but it provides perspectives that might be useful while considering creativity and thereby copyrights, cognition and maybe even consciousness.

2. Transcendental aesthetics

In order to formulate the ideas, this paper uses an analogy to Kant's transcendental aesthetics: the process whereby a noumenon is transcended via categories to a phenomenon is contrasted with a process whereby data is rendered via a context/algorithm to information. The analogy lends itself to being considered as an extension, rather than merely an analogy, of the transcendental aesthetics too. That is, Kant's transcendental aesthetics may be reinterpreted as "actual" transcendence in terms of data and information. It opens up a multiple-layer interpretation, and thereby also questions such as whether we may consider a hearing aid, or other more intricate cyborg technologies, as just another category in the Kantian sense.³

[Footnote 3: This may also have consequences for copyrights. Arguably, copyrights ought not to be applicable to data in itself, but only to information. Now, if a blind person somehow manages to copy a protected image, then it couldn't be considered an infringement, as he lacks the categories to render the information that could have been protected by a copyright for someone with visual categories. Nor should his original visual works ever be copyrightable for their visual qualities.]

3. Potentiality/actuality

The dichotomy of potentiality and actuality has been part of the philosophical discussion at least since book Theta of Aristotle's Metaphysics. The transcendental aesthetics analogy may be considered as a model for considering data in its actual form and in its potential form relative to a given interpreter. The interpreter in the model consists of two components: a passive presentation that takes formatted data as input and outputs information, and an active algorithm that takes raw data as input and outputs formatted data, where the latter component may have potential. An algorithm is considered to have potential if it manipulates the raw data in a way that cannot be described as a simple transformation or crop, but which also adds "extra relevant information" relative to a given presentation.

To formalise this potentiality, or creative quality if you will, let X and Y be sets of data, and let f, g ∈ F_{X,Y} = {f : X → Y} be two algorithms that transform raw data into formatted data. Further, let F^N_{X,Y} ⊆ F_{X,Y} be the subset of algorithms that lack potential, and let Y' ⊆ Y be the set of all formatted data that renders information for a given presentation. Now, define two functions: H : X → ℝ, which maps any data to its entropy, and H_m : Y' → ℝ, defined as

$$H_m(y) = \min_{f \in F^{N}_{X,Y}} H(f^{-1}(y)), \qquad (1)$$

which maps any information entity to its minimal-entropy representation given a presentation. The inverse of f may actually not be unique, but with a small abuse of notation we define

$$f^{-1}(y) = \operatorname{argmin}_{x \in \{x : f(x) = y\}} H(x),$$

that is, the minimal-entropy x that maps to y. Finally, define the "additional map" A : F_{X,Y} × Y' → ℝ such that

$$A(f, y) = H_m(y) - H(f^{-1}(y)), \qquad (2)$$

which gives a number for the level of potential the algorithm f has to generate the information entity y.⁴

[Footnote 4: Note that this means that a verbose representation x ∈ X of an informative entity could be classified as non-potential, even if it seems to have all the necessary properties. One could add a proxy step to solve this, by mapping f to f_h, where f_h is the equivalence-class (in the obvious sense) version of f.]
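A numeric toy illustration of the additional map may help; it is our sketch, and it crudely approximates the paper's abstract entropy H by the Shannon entropy of a byte string, so the numbers are only indicative:

```python
# Crude numeric illustration of A(f, y) = H_m(y) - H(f^{-1}(y)).
# H is approximated by the total Shannon entropy of a byte string in
# bits; the strings and the 'algorithms' they stand for are invented.
import math
from collections import Counter

def shannon_bits(data: bytes) -> float:
    counts = Counter(data)
    n = len(data)
    per_symbol = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return per_symbol * n

# Minimal-entropy input some non-potential algorithm needs to yield y:
x_non_potential = b"abababababababab"   # stand-in for H_m(y)
# Input a candidate potential algorithm f needs to yield the same y:
x_f = b"abab"                           # f 'adds' the expansion itself

A = shannon_bits(x_non_potential) - shannon_bits(x_f)
print(A)   # 12.0 > 0: f counts as (strictly) potential for this y,
           # in the sense of inequality (3) below
```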
An algorithm f is considered strictly potential relative to a presentation and a subset of the informative entities S ⊆ Y' if all its elements y ∈ S are represented more economically than in the minimal non-potential case, that is,

$$\forall y \in S, \quad A(f, y) > 0. \qquad (3)$$

An algorithm is considered potential (in the non-strict sense) for a subset S if a non-empty subset S' ⊆ S is strictly potential and for no y ∈ S is A(f, y) < 0.

4. Creativity as an ordinal relation

There are various degrees of potentiality: not only should algorithm potentiality be compared with respect to the amount of relevant information quantified by the "additional map", it should also take an interest in the relative ease of computing f(x) ∈ Y'. Ignoring the complexity of computation would be like ignoring the difference between factorising the product of two huge primes and summing them. Another example that highlights the need to include complexity is the simulation of nonlinear dynamical systems, such as models of meteorological or financial systems. It is unfeasible to reason analytically about the behaviour of such systems, and it takes a lot of computation to unfold the behaviour through simulation, even though all the data and the algorithms are in place.⁵

[Footnote 5: It is really just a way of stating that the tragedy of deduction will not help.]

There are multiple reasonable ways to define an ordinal relation between two algorithms f, g ∈ F_{X,Y} = {f : X → Y} that take these things into account, but the transitive, reflexive and identity-preserving variant suggested here is the following:

$$f > g \iff O(f) > O(g) \lor \big(O(f) = O(g) \land A(f) > A(g)\big), \qquad (4)$$

where O(f) is the computational complexity of f.

5. Conclusion

The concepts presented, and to some extent explored in the longer version of this paper, give a formal interpretation of the notoriously hard-to-pin-down idea of creativity. The ordinal relation "level of creativity" lends itself to demarcating when a set of algorithms may create information that is creative enough to be regarded as copyrightable, or maybe even what the minimal level of creativity is for a cognitive or conscious algorithm. From the analogy to transcendence spring other implications, hinted at in the footnotes: cyborg technology, such as hearing aids, may be considered as a multi-level version of transcendence, which aids one's intuition while pondering copyrights, whether one likes Kant or not.

References

Dennett, D. C. (1996). Darwin's Dangerous Idea: Evolution and the Meanings of Life. Simon & Schuster.
Floridi, L. (2004). Open problems in the philosophy of information. Metaphilosophy, 35(4), 554–582.
Floridi, L. (2008). Philosophy of Computing and Information: 5 Questions. Automatic Press/VIP, Copenhagen, Denmark.
Floridi, L. (2009).
Philosophical conceptions of information. Lecture Notes in Computer Science, 5363, 13–53.
Kant, I. (2003). Critique of Pure Reason. Courier Dover Publications.
Koepsell, D. R. (2000). The Ontology of Cyberspace: Law, Philosophy, and the Future of Intellectual Property. Peru, IL: Open Court Publishing Co.
Mandelbrot, B. (1967). How Long Is the Coast of Britain? Statistical Self-Similarity and Fractional Dimension. Science, 156(3775), 636–638.
Mitchell, T. (1997). Machine Learning. McGraw-Hill Education (ISE Editions), 1st edition.
Nagel, T. (1974). What is it like to be a bat? Philosophical Review, 83(October), 435–450.

THE FOURTH REVOLUTION AND SEMANTIC INFORMATION

VALERIA GIARDINO
Institut Jean Nicod (CNRS-EHESS-ENS), Paris
Valeria.Giardino@ens.fr

Abstract. In his work, Floridi introduces several notions to describe our relationship with information and technology. Indeed, according to him, in recent times humanity has experienced a fourth revolution, the Information revolution, which, starting from the work of Alan Turing, has deeply affected our understanding of ourselves as agents. Our generation is still a generation of "emigrants", but our children will be born in the infosphere and will recognize themselves from their birth as inforgs. I will focus on the notions of infosphere and inforgs, and more generally on the notion of information Floridi makes use of. According to Floridi, in re-ontologizing ourselves as inforgs, we recognize how significantly but not dramatically different we are from smart, engineered artifacts, since we have, as they have, an informational nature. Nevertheless, if one focuses on semantic information, which requires meaning and understanding, then there is still a dramatic difference between ourselves and our artifacts to be acknowledged: we are the only agents who spontaneously reason semantically. First, I will present the four revolutions Floridi talks about, and claim that there are other revolutions in the history of human culture that should be considered from the perspective of discussing the reshaping of our new environment and of our new selves in the infosphere. Secondly, I will discuss an ambiguity in Floridi's use of the term information and propose to consider his fourth revolution as the Second Information revolution. To solve this ambiguity, I will distinguish between information and semantic information, which implies meaning and understanding. Finally, I will present some questions that emerge once we consider humans' cognitive capacities to access meaning against the background of the new context, the infosphere.

1. Introduction: we are inforgs in an infosphere

Floridi has suggested that in recent years we have gone, together with our environment, through a process of re-ontologization that has changed forever our way of seeing the world and ourselves. If the challenge of philosophy today is to analyse how this revolution has changed our understanding of the world and of ourselves, my challenge in this talk will be to claim that some of Floridi's suggestions should be partly revised and further discussed. First, I will present the four revolutions Floridi talks about, and claim that there are other revolutions in the history of human culture that should be considered. Secondly, I will discuss an ambiguity in Floridi's use of the term information and propose to consider his fourth revolution as the Second Information revolution.
To solve this ambiguity, I will distinguish between information and semantic information, which implies meaning and understanding.

2. One, two, three... many revolutions: human culture

Though I am in general sympathetic with Floridi's rational reconstruction of the four revolutions, I want to argue that in the course of human cultural evolution it is possible to identify other crucial steps in the transformation of our ontology. It is unquestionable that the appearance of cognitive artefacts has played a major role in the shaping of our world and of us as cognitive agents. We might assume an evolutionary perspective and consider first the moment in which human beings began to communicate by means of a language, and then the moment they invented writing, and thus began not only to produce words but to share them in a public format that could be inspected by others and stored in archives. Both these steps were crucial in the evolution of human cognition, since they revolutionized human beings' access to meaning: new channels became available to communicate and to make sense of the world around us and of ourselves. My approach is in line with the idea that cognition is 'distributed': as Hutchins (1995a; 1995b) explains, cognitive events are not encompassed by the skin or skull of an individual. There exist interesting kinds of distribution of cognitive processes: we must consider them if we want to understand human cognition. Human beings, despite the limitations of the cognitive systems with which we know they are born (Kinzler and Spelke (2007); Spelke (2004)), were able to develop new practices and new cognitive strategies to augment the powers of their minds, showing an extraordinary capacity for creating tools that would help them in the processes of both describing the world around them and acting upon it. Some of these tools had an intrinsically cognitive function. As a consequence, a more faithful reconstruction of our cultural evolution would rather show how the history of our cognition has been deeply influenced by the fact that from the very beginning we engaged ourselves in symbolic activities, and that these activities have become, in a long historical and cultural process of creation and selection, more and more complex. This was indeed a revolution in the ontology of information within the billions of years of the evolutionary process, dating from the time when living processes became encoded in DNA sequences: "because this novel form of information transmission was partially decoupled from genetic transmission, it sent our lineage of apes down a novel evolutionary path - a path that has continued to diverge from all other species ever since" (Deacon, 1997, p. 45).

3. Cognition and semantic information

In the DNA double helix, as well as in Turing machines, information is conceived as a code, a string, and it does not have anything to do with meaning or understanding. By contrast, semantic information requires meaning and understanding. Floridi claims that, by re-ontologizing ourselves as inforgs, we recognized how significantly but not dramatically different we are from smart, engineered artifacts, since we have, as they have, an informational nature. But what kind of information is Floridi talking about when he refers to an 'informational nature' in the two cases? I will consider Bruner's (1990) point of view on what he defined as the Cognitive revolution, which took place in the 1950s.
According to Bruner's reconstruction, the aim of that revolution at the beginning was to discover and describe formally the meanings that human beings were able to create out of their encounters with the world. The objective in the long run was to propose hypotheses about which meaning-making processes were implicated in humans' cognitive activity. Bruner's hope was that such a revolution, as it was conceived at its origins, would have brought psychology to collaborate with its sister interpretative disciplines, such as the humanities and the social sciences. It is only a collaboration of this kind that can allow the investigation of such a complex phenomenon as meaning-making. But the happy ever after did not work out. In fact, the emphasis began shifting from the construction of meaning to the processing of information, which are profoundly different matters. The notion of computation was introduced and computability became 'the' good theoretical model; this took us far from the original question - the revolutionary one - which was about the conditions of our meaning-making activity, the answer to which would have explained our semantic power. For this reason, the Cognitive revolution "has been technicalized in such a manner that even undermines that original impulse" (p. 1): it has become the (uninteresting) Information revolution. Meaning is thus different from information because it does not come before the message; rather, it originates through the message itself and the fact that this message is shared. In fact, public meanings are the result of a negotiation.

4. Conclusions

To sum up, in my talk I will try to show that a particularly interesting aspect to discuss in this framework is the role of semantic information, which is the expression of a symbolic activity that up to now has been shown to be specifically human. Knowledge is situated-distributed, and this not only because it has a cultural nature, but also and most of all because our knowledge acquisition has a cultural nature. Moreover, knowledge also has a social nature, because it gets socially constructed (Berger and Luckmann (1966)). Human beings are semantic engines, and they engage themselves in meaning-making and meaning-negotiating. For this reason, meaning is flexible: as Bruner says, we show a 'dazzling' intellectual capacity for envisioning alternatives. Will one day a fifth revolution come that will take away from us also this ultimate illusion? That day, will our own technology bring about intentional and semantically powerful machines? At the moment, we do not know. The task of philosophy of information is to provide the appropriate framework that would allow us to make useful predictions in order to prepare the future generations and ourselves.

Acknowledgements

I thank the Public Representations group at the Institut Jean Nicod for all our useful discussions on similar topics, and in particular Elena Pasquinelli and Giuseppe A. Veltri, who read a preliminary version of this article. The research was supported by the European Community's Seventh Framework Program ([FP7/2007-2013]) under a Marie Curie Intra-European Fellowship for Career Development, contract number 220686 - DBR (Diagram-based Reasoning).

References

Berger, P. L. & Luckmann, T. (1966). The Social Construction of Reality: A Treatise in the Sociology of Knowledge. Garden City, NY: Anchor Books.
Bruner, J. (1990). Acts of Meaning. Cambridge, Mass. and London: Harvard University Press.
Deacon, T. W. (1997).
The Symbolic Species. New York and London: W. W. Norton & Company.
Dror, I. E. & Harnad, S. (Eds.) (2008). Cognition Distributed: How Cognitive Technology Extends Our Minds. Amsterdam: John Benjamins.
Floridi, L. (2002). Information Ethics: An Environmental Approach to the Digital Divide. Philosophy in the Contemporary World, 9(1), 39-45.
- (2007). A look into the future impact of ICT on our lives. The Information Society, 23(1), 59-64. An abridged and modified version was published in TidBITS.
- (2009). The Semantic Web vs. Web 2.0: a Philosophical Assessment. Episteme, 6, 25-37.
Hutchins, E. (1995a). Cognition in the Wild. MIT Press.
- (1995b). How a cockpit remembers its speeds. Cognitive Science, 19, 265-288.
Kinzler, K. D. & Spelke, E. S. (2007). Core systems in human cognition. Progress in Brain Research, 164, 257-264.
Spelke, E. S. (2004). Core knowledge. In: N. Kanwisher & J. Duncan (Eds.), Attention and Performance: Functional Neuroimaging of Visual Cognition (Vol. 20, pp. 29-56). Oxford: Oxford University Press.

EPISTEMOLOGICAL AND PHENOMENOLOGICAL ISSUES IN THE USE OF BRAIN-COMPUTER INTERFACES

RICHARD HEERSMINK
PhD Candidate, Macquarie Centre for Cognitive Science
Macquarie University, Sydney, Australia
Email: richard.heersmink@gmail.com

Abstract. Brain-computer interfaces (BCIs) are an emerging and converging technology that translates the brain activity of its user into command signals for external devices such as motorized wheelchairs, robotic hands, environmental control systems, and computer applications. In this paper I functionally decompose BCI systems and categorize BCI applications with similar functional properties into three categories: those with (1) motor, (2) linguistic, and (3) virtual applications. I then analyse the relationship between these distinct BCI applications and their users from an epistemological and phenomenological perspective. Specifically, I analyse functional properties of BCIs in relation to the abilities (particularly motor behavior and communication) of their human users, asking how they may or may not extend these abilities. This includes a phenomenological analysis of whether BCIs are experienced as transparent extensions. Contrary to some recent philosophical claims, I conclude that, although BCIs have the potential to become bodily as well as cognitive extensions for skilled users, at this stage they are not. And while the electrodes and signal processor may to a variable degree be transparent and incorporated, the BCI system as a whole is not. Contemporary BCIs are difficult to use. Most systems only work in highly controlled laboratory settings, require a high amount of training and concentration, have very limited control options, have low and variable information transfer rates, and effector motions are often slow, clumsy and sometimes unsuccessful. These drawbacks considerably limit their possibilities for transparency and incorporation into either the body schema or the cognitive system, which is essential for bodily and cognitive extension. Current BCIs can therefore only be seen as a weak or metaphorical extension of the human central nervous system. To increase their potential for cognitive extension, I give suggestions for improving the interface design of what I refer to as linguistic applications.
1. Introduction: Brain-Computer Interfaces

BCIs are an emerging and converging technology that translates the brain activity of its user into command signals for external devices. Invasive or non-invasive electrode arrays detect an intentional change in neural activity, which is translated by a signal processor into command signals for applications such as wheelchairs, robotic hands, environmental control systems, and computer applications. In essence, BCI technology establishes a direct one-way communication pathway between the human brain and an external device, and can to some extent translate human intentions into technological actions without having to use the body's neuromuscular system. However, contemporary BCIs are difficult to use: the technology is still in its infancy and has barely passed the "proof of concept" stage. Most systems only work in highly controlled laboratory settings, require a high amount of training and concentration, have very limited control options, have low and variable information transfer rates, and effector motions are often slow, clumsy and sometimes unsuccessful.

2. Goals, Method and Structure

2.1. A TYPOLOGY OF BCIS

In this paper I explore the relationship between BCI technology and its human users from an epistemological and phenomenological perspective. My analysis has five parts. First, I present a preliminary conceptual analysis of BCIs in which I functionally decompose BCI systems and categorize BCI applications with similar functional properties (Vermaas & Garbacz, 2009). Based on this preliminary analysis, I distinguish three categories: (1) motor applications, which restore motor functions for disabled subjects, such as motorized wheelchairs or robotic hands; (2) linguistic applications, which allow a disabled subject to select characters on a screen, thereby restoring communicative abilities; and (3) virtual applications, which allow a subject to control elements (e.g. avatars) in a virtual environment.

2.2. THE CURRENT DEBATE ON BCIS

Second, I briefly outline the current philosophical debate on BCIs. It has been claimed that a BCI-controlled robotic arm is a bodily extension fully integrated into the body schema of a macaque, thereby constituting a "new systemic whole" (Clark, 2007). It has also been claimed that functionally integrated BCIs are cognitive extensions, i.e., they extend cognitive processes of their users into the material environment (Fenton & Alpert, 2008; Kyselo, 2011). These philosophical claims are evaluated later on in this paper.

2.3. HUMAN-TECHNOLOGY RELATIONS

Third, I introduce some key concepts for better understanding human-technology relations. These key concepts are "body schema", "incorporation", "transparency" and "extended cognition". A body schema is a non-conscious neural representation of the body's position and its capabilities for action. We are able to incorporate artifacts such as hammers, screwdrivers, pencils, walking canes, cars, glasses, and hearing aids into our body schema, thereby enlarging it (Brey, 2000). These artifacts are embodied and are not experienced as objects in the environment but as part of the human motor or perceptual system. When using embodied artifacts such as hammers, pencils, and screwdrivers to act on the world, a subject doesn't first want an action on the artifact and then on the world.
Rather, a subject merely wants an action on the world through the artifact and doesn't consciously experience the artifact when doing so. The perceptual focal point is thus at the artifact-environment interface, rather than at the agent-artifact interface (Clark, 2007). In this sense, embodied artifacts are transparent (Ihde, 1990). Cognitive artifacts such as calculators, computers, and navigation systems can under certain conditions be incorporated into the human cognitive system in such a way that they can best be seen as literally part of that system. These devices, then, perform functions that are complementary to the human brain (Sutton, 2010). There is, furthermore, a two-way interaction when using such devices, and both the brain and the cognitive artifact have a causal role in the overall process, thereby forming a "coupled system". In such coupled systems, the cognitive process is distributed across brain and artifact, and the artifact is seen as co-constitutive of the extended cognitive system. Remove the technological element from the equation and the overall system will drop in behavioural and cognitive competence. So there is a strong symbiosis and reciprocity in coupled systems. Moreover, what is essential when extending cognition is a high degree of trust in, reliance on, and accessibility of the cognitive artifact (Clark & Chalmers, 1998).

2.4. HUMAN-BCI RELATIONS

Fourth, I explore the relationship between motor, linguistic, and virtual applications and their human users in the light of the concepts just introduced. I analyse whether BCIs are incorporated into the body schema or cognitive system of their users, and whether they are experienced as transparent extensions of the human body or cognitive system. I demonstrate that, although BCIs have the potential to become bodily as well as cognitive extensions for skilled users, at this stage they are not. And while the electrodes and signal processor may to a variable degree be transparent and incorporated, the BCI system as a whole is not. Contemporary BCIs are difficult to use. Most systems only work in highly controlled laboratory settings, require a high amount of training and concentration, have very limited control options, have low and variable information transfer rates, and effector motions are often slow, clumsy and sometimes unsuccessful. These drawbacks considerably limit their possibilities for transparency and incorporation into either the body schema or the cognitive system, which is essential for bodily and cognitive extension.

2.5. DISTRIBUTED COGNITION FOR IMPROVING BCIS

And fifth, I give suggestions to increase the potential for cognitive extension of linguistic applications. To do so, I draw on concepts from the distributed cognition framework. Jim Hollan, Ed Hutchins and David Kirsh (2000) argue that the nature of external representations is essential when effectively distributing cognition. Their notion of "history-enriched digital objects" implies that often-selected letters should be presented larger or brighter on the screen. Their notion of "zoomable multiscale interfaces" implies that for someone who is selecting letters on a screen, it might be more effective if the letter the person wants to select becomes larger when the cursor moves towards it. And their notion of "intelligent use of space" implies that for people who are not used to the QWERTY layout, it might be logical to present the most often selected letters in the middle and less often selected letters in the periphery of the screen. The sketch below illustrates how these three ideas could be combined.
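As a purely illustrative sketch (nothing of the kind appears in the paper; every name, constant and the letter ordering below are assumptions of this sketch, not features of any published BCI system), the three notions can be combined into a single display rule for a BCI letter board: letters grow with their selection history and with proximity to the cursor, and frequent letters are placed centrally.

# Hypothetical sketch combining Hollan, Hutchins and Kirsh's three notions
# for a BCI letter board. All names and numbers are illustrative assumptions.
from collections import Counter
import math

LETTERS = "ETAOINSHRDLCUMWFGYPBVKJXQZ"  # rough English frequency order

def center_out_order(letters):
    """Intelligent use of space: place frequent letters near the middle,
    rare ones toward the periphery of a one-dimensional board."""
    board = [None] * len(letters)
    mid = len(letters) // 2
    positions = sorted(range(len(letters)), key=lambda i: abs(i - mid))
    for letter, pos in zip(letters, positions):
        board[pos] = letter
    return board

def display_size(letter, history, cursor_pos, board, base=1.0):
    """History-enriched digital objects: often-selected letters render larger.
    Zoomable multiscale interface: letters grow as the cursor approaches."""
    freq_boost = 0.2 * math.log1p(history[letter])
    distance = abs(board.index(letter) - cursor_pos)
    zoom_boost = 1.5 / (1.0 + distance)
    return base + freq_boost + zoom_boost

if __name__ == "__main__":
    history = Counter("HELLO WORLD HELLO")   # toy selection history
    board = center_out_order(LETTERS)
    cursor = board.index("E")                # cursor currently over 'E'
    for letter in "HEL":
        print(letter, round(display_size(letter, history, cursor, board), 2))

In an actual BCI speller the cursor position would of course be driven by the decoded neural signal; here a fixed position merely stands in for it.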
References

Brey, P. (2000). Technology and Embodiment in Ihde and Merleau-Ponty. In: C. Mitcham (Ed.), Metaphysics, Epistemology, and Technology. Research in Philosophy of Technology, Vol. 19. London: Elsevier/JAI Press.
Clark, A. (2007). Re-Inventing Ourselves: The Plasticity of Embodiment, Sensing and Mind. Journal of Medicine and Philosophy, 32(3), 263-282.
Clark, A. & Chalmers, D. (1998). The Extended Mind. Analysis, 58, 10-23.
Fenton, A. & Alpert, S. (2008). Extending Our View on Using BCIs for Locked-in Syndrome. Neuroethics, 1(2), 119-132.
Hollan, J., Hutchins, E. & Kirsh, D. (2000). Distributed Cognition: Toward a New Foundation for Human-Computer Interaction Research. Transactions on Computer-Human Interaction, 7(2), 174-196.
Ihde, D. (1990). Technology and the Lifeworld: From Garden to Earth. Indiana University Press.
Kyselo, M. (2011). Locked-in Syndrome and BCI: Towards an Enactive Approach to the Self. Neuroethics. doi:10.1007/s12152-011-9104-x.
Sutton, J. (2010). Exograms and Interdisciplinarity. In: R. Menary (Ed.), The Extended Mind. MIT Press.
Vermaas, P. E. & Garbacz, P. (2009). Functional decomposition and mereology in engineering. In: A. Meijers (Ed.), Handbook of the Philosophy of Technology and Engineering Sciences. Amsterdam: Elsevier.

AN INFORMATION-THEORETIC MODEL OF CHUNKING

DANIEL HEWLETT
University of Arizona, Tucson, AZ 85745, USA

AND

PAUL COHEN
University of Arizona, Tucson, AZ 85745, USA

Abstract. Developing a general theory of cognition based on formal notions of information remains a long-term goal. One means of making incremental progress toward this goal is to analyze core cognitive capacities to determine whether they can be explained by reference to information. Chunking is one of the most general and least understood phenomena in human cognition. George Miller described chunking as "a process of organizing or grouping the input into familiar units or chunks." The psychological literature describes chunking in many experimental situations, but it says nothing about the intrinsic, mathematical properties of chunks. The cognitive science literature discusses algorithms for forming chunks, each of which provides a kind of explanation of why some chunks rather than others are formed, but there are no explanations of what these algorithms, and thus the chunks they find, have in common. We argue that chunks share a common information-theoretic signature. This signature is defined in terms of the basic measure of information content, entropy: chunks have low conditional entropy internally, and high conditional entropy at the boundaries. We explain this chunk signature and examine several lines of evidence that support this information-theoretic view of chunks. The first is that algorithms built to find chunks based on this signature (or very similar signatures) are quite successful at chunking real-world data. The second is that real chunks, such as words in natural language, appear to be nearly optimally constructed with respect to this signature. Empirical studies also suggest that children, even infants, do actually possess such a chunking ability.
All of this evidence supports the view that chunks can be defined by an information-theoretic signature, and that a general chunking ability based on this signature provides a good explanation for this core cognitive ability.

1. Introduction

Developing a general theory of cognition based on formal notions of information remains a long-term goal. One means of making incremental progress toward this goal is to analyze core cognitive capacities to determine whether they can be explained by reference to information. Chunking is one of the most general and least understood phenomena in human cognition. George Miller described chunking as "a process of organizing or grouping the input into familiar units or chunks." Other than being "what short term memory can hold 7 +/- 2 of," chunks appear to be incommensurate in most other respects. Miller himself was perplexed because the information content of chunks is so different. A telephone number, which may be two or three chunks long, is very different from a chessboard, which may also contain just a few chunks but is vastly more complex. Chunks contain other chunks, further obscuring their information content. The psychological literature describes chunking in many experimental situations, but it says nothing about the intrinsic, mathematical properties of chunks. The cognitive science literature discusses algorithms for forming chunks, each of which provides a kind of explanation of why some chunks rather than others are formed, but there are no explanations of what these algorithms, and thus the chunks they find, have in common.

We argue that chunks share a common information-theoretic signature. This signature is defined in terms of the basic measure of information content, entropy. Entropy measures the average amount of information required to communicate the outcome of a random variable. For example, the entropy of a toss of a fair six-sided die is much higher than that of a loaded one. In entropic terms, the chunk signature is simple: chunks have low conditional entropy internally, and high conditional entropy at the boundaries. For example, given the sequence "victo", the conditional entropy of the next letter in the chunk is low (it is probably an 'r'), but given the letters in the chunk "victory", the conditional entropy of the neighboring letters is high. This relationship between predictability and the boundaries of words was noticed as early as 1948 by Claude Shannon.

2. Supporting Evidence

There are several lines of evidence that support this information-theoretic view of chunks. The first is that algorithms built to find chunks based on this signature (or very similar signatures) are quite successful at chunking real-world data. Several such algorithms have been developed independently of one another in the fields of computational linguistics and artificial intelligence, adhering to the chunk signature with varying degrees of fidelity. Perhaps the fullest implementation is that of the Voting Experts algorithm originally developed by Cohen and Adams. Variants of this algorithm that add bootstrapping (the ability to feed information about chunks already discovered back into the algorithm's decision-making process) represent the highest levels of performance in the literature on a common benchmark of unsupervised chunking ability. Interestingly, this benchmark involves finding words in a corpus of transcribed child-directed speech from the CHILDES project. However, performance of the Voting Experts family of algorithms is not restricted to child language data, as these algorithms also perform well at finding words in diverse languages with different writing systems, finding episodes in sequences of robot actions, finding letters on a printed page by analyzing columns of pixels, and finding teaching episode boundaries in the instruction of an AI student.
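The following minimal Python sketch illustrates the chunk signature itself rather than any published algorithm (it is not Voting Experts; the toy corpus, the context length and the peak rule are all illustrative assumptions): it estimates the conditional entropy of the next character given a short preceding context, and proposes boundaries where that entropy peaks.

# Minimal sketch of the entropy-based chunk signature (not the Voting Experts
# algorithm): estimate H(next char | preceding context) from counts, and
# look for peaks, which the signature predicts at chunk (word) boundaries.
from collections import Counter, defaultdict
import math

def boundary_entropies(text, order=2):
    """For each position, entropy of the next char given `order` preceding chars."""
    follow = defaultdict(Counter)
    for i in range(len(text) - order):
        follow[text[i:i + order]][text[i + order]] += 1

    def H(ctx):
        counts = follow[ctx]
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    return [H(text[i:i + order]) for i in range(len(text) - order)]

if __name__ == "__main__":
    corpus = "thedogsawthecatthecatsawthedog" * 20  # unsegmented toy stream
    ent = boundary_entropies(corpus)
    # Positions where entropy exceeds both neighbours are boundary candidates.
    peaks = [i + 2 for i in range(1, len(ent) - 1)
             if ent[i] > ent[i - 1] and ent[i] > ent[i + 1]]
    print(peaks[:10])  # indices where a chunk boundary is predicted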
While this evidence suggests that algorithms searching for the chunk signature very often recover correct chunks, it does not fully establish the correspondence between the chunk signature and real chunks. The question remains whether real chunks are optimal with respect to this signature. Put more simply, out of all the possible chunks that could be formed based on some data, are the true chunks the "chunkiest"? This question is difficult to evaluate because it requires enumerating an exponential number of possible ways to chunk a given sequence. However, for short sequences, it is possible to fully test this proposition. We developed a chunkiness score that combines the internal entropy and the boundary entropy into a single number. For each 5-word sequence in a corpus of child-directed speech, we generated all possible segmentations and ranked each one according to the chunkiness score. The true segmentation ranked in the 98.7th percentile on average. Preliminarily, it appears that syntax is the primary reason that the true segmentation is not higher in the ranking: when the word order in the training corpus is scrambled, the true segmentation is in the 99.6th percentile. Still, based on these early results we can say that, in at least one domain, true chunks are nearly optimal with respect to the information-theoretic chunkiness score.
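The abstract does not give the authors' exact chunkiness formula, so the sketch below combines internal and boundary entropy in one plausible way (per-segment boundary entropy minus mean internal entropy, averaged over segments) and ranks every segmentation of a short toy sequence; the entropy profile is invented for illustration.

# Sketch of the segmentation-ranking idea. The combination formula and the
# entropy profile are assumptions of this sketch, not the authors' own.
from itertools import product

def all_segmentations(seq):
    """Yield every way to cut `seq` at the gaps between its symbols."""
    for cuts in product([False, True], repeat=len(seq) - 1):
        segs, start = [], 0
        for i, cut in enumerate(cuts, start=1):
            if cut:
                segs.append(seq[start:i])
                start = i
        segs.append(seq[start:])
        yield segs

def chunkiness(segs, H):
    """H maps a position in the flattened sequence to conditional entropy
    (e.g. estimated as in the boundary_entropies sketch above)."""
    total, pos = 0.0, 0
    for seg in segs:
        internal = [H(pos + k) for k in range(1, len(seg))]
        inside = sum(internal) / len(internal) if internal else 0.0
        pos += len(seg)
        total += H(pos) - inside   # high boundary, low internal entropy: good
    return total / len(segs)

if __name__ == "__main__":
    # Invented entropy profile over "abcde": low inside "abc" and "de",
    # high at their edges.
    profile = {0: 2.0, 1: 0.2, 2: 0.1, 3: 1.9, 4: 0.3, 5: 2.1}
    H = lambda i: profile.get(i, 2.0)
    ranked = sorted(all_segmentations("abcde"),
                    key=lambda s: chunkiness(s, H), reverse=True)
    print(ranked[0])  # recovers ['abc', 'de'] as the chunkiest segmentation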
Empirical studies also suggest that children, even infants, do actually possess such a chunking ability. Saffran, Aslin, and Newport famously demonstrated that 8-month-old infants can correctly identify artificial words in a continuous speech stream. Importantly, this speech stream did not contain pauses around sentences or phrases as natural speech often does. This means that infants must be relying on some sort of chunking ability to discover these words in the stream. Saffran et al. proposed a very simple chunking heuristic that was sufficient for their task, but fails at finding words in natural languages and other non-linguistic chunking tasks. In our view, positing such a weak ability is not parsimonious, because it would require the children to also have a second, more powerful ability for other chunking tasks, even other linguistic tasks. By contrast, with a single chunking ability based on the signature of chunks, children could perform the task presented by Saffran et al. as well as many others. It is also worth noting that Hauser, Newport, and Aslin later showed that cotton-top tamarins can perform a very similar task, suggesting that the underlying ability may be shared with other non-human primates.

3. Conclusion

All of this evidence supports the view that chunks can be defined by an information-theoretic signature, and that a general chunking ability based on this signature provides a good explanation for this core cognitive ability.

THE DYNAMISM OF INFORMATION ACCESS FOR A MOBILE AGENT IN A DYNAMIC SETTING AND SOME OF ITS IMPLICATIONS

LARS-ERIK JANLERT
Umeå University
lej@cs.umu.se

Given the definition of informational distance as the time it takes to satisfy a request for information (Janlert, 2006a), it follows that these distances, the latencies of information satisfaction, will depend on the location of the information-seeking agent as well as the location of the various resources available for satisfying requests for information. That also means that changes in the agent's location, as well as changes in the location of information resources in the environment of the agent, will dynamically affect the agent's information availability profile (Janlert, 2006a), the spectrum of informational distances for the complete range of possible information requests. This paper will start to investigate the implications this may have for the possibility of outlining the informational boundaries of the agent, separating agent from world in informational terms, and for the possibilities of strategic relocations of agent and informational resources. To do this, a model of the agent–world relationship is outlined and used, more general and considerably more abstract than the examples of actual "natural" agent–world relationships found in this world, starting from a characterization that is as complete as possible in informational terms: the world is basically a database from which the agent gets information and in which the agent sets information. It turns out that it is possible to define the existential extension of an agent in informational terms in a way that at least starts to make some sense in the real world: the informational boundary. The issue of agent identity may then be approached along the lines of Nozick's closest-continuer theory. Finally, the importance of proximity as a cue to contextual relevance for situated activity in general is transformed or translated into informational terms, to appear as a relevant principle in getting as well as in setting information. Issues of accuracy and reliability of (purported) information will be bracketed off in this paper, but basically "information" is taken to exclude "misinformation."

1. The world as a database

In this model, we have an agent in an environment, a (or the) world. The agent is part of the environment, but other than that nothing is assumed about its structure and extent or what drives it. What the agent does is two things (which may in the end turn out to be one and the same thing at a certain level of abstraction). Firstly, it requests and gets information from the world. The world is considered to be a (dynamic) repository of information from the agent's point of view: all it ever gets is information from it and about it. In our use of the model we may of course consider any kind of implementation (model) satisfying the constraints of the agent's interactions. Secondly, and this is in order to make the model as purely informationally based and symmetric as possible, the agent also sets information into the world. Thus, the agent gets as well as sets information. That is the general model. Such worlds could of course be very different, but let us assume for the current exercise that the world of the model by and large matches our own real world at a slightly less abstract level. Setting or getting information can be viewed as a matter of direction of fit.
Getting information can be understood in terms of retrieving, computing, measuring, observing etc., and any combination of such processes, which are partly initiated and performed by the agent (Janlert, 2006b). Setting information means to make something the case, to make the world deliver certain information. Getting information is often thought of as a non-intervening process supposed to leave the world untouched, whereas setting information, making something the case, is usually thought of as doing some measure of violence to the world, forcing it to change. But generally in this world you can't get information without setting some information in the process, and you can't set information without getting some information in the process. Situated existence in this model becomes a kind of information management; we are already living in an informational world, if you will. This whole approach could in itself perhaps be viewed as an analysis in the style of Carnap (1961); it has certainly been inspired by it.

2. Informational boundary of an agent

Given an agent that moves, it will be possible to differentiate between information that is moved "along with" the agent, identifiable as information that is reasonably close and whose distance does not vary much during movement, and information that is not. (The size of changes should be understood as relative, in proportion to the whole distance.) Information that moves along with the agent in this sense is considered to be within its (current) informational boundary; other information is considered to be on the outside. For information that does not move along, that is, information external to the informational boundary, it is also interesting to single out the information that is far off, far away at the information horizon of the agent, and whose distance remains fairly constant during the movement of the agent. It will appear as a quite stable background. What remains will then be information that is close to "midrange" and changes significantly during movement: proximal external information.

3. Proximity principle applied to the informational world

Things that are close tend to matter; things that matter tend to be(come) close (Janlert, 2003). For an agent situated in an environment this means roughly: (1) that an object close to the agent has a better chance of getting the agent's attention and figuring in the agent's activities; (2) that an object that matters to the agent's activities is more likely to already be, or soon become, within close range (partly due to the agent's own doings). In the world-as-database model this translates to the following rule of thumb for proximal external information: information that is close to the agent has a better chance of being got by the agent and playing a role in the agent's activities; information that matters to the agent's activities is more likely to be or become close to the agent.
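As a toy operationalization of the three-way classification of Section 2 (entirely my own construction, not Janlert's; the items, latency figures and thresholds are invented), an agent might classify information items by the size and relative stability of their informational distances as it moves:

# Toy sketch of the informational-boundary classification: track the
# informational distance (latency) of each item while the agent moves, then
# classify by how large and how stable that distance is. Thresholds arbitrary.
from statistics import mean, pstdev

def classify(distance_series, near=2.0, far=20.0, stable=0.25):
    """distance_series: latencies observed for one item during movement.
    Variation is measured relative to the item's mean distance."""
    m = mean(distance_series)
    rel_var = pstdev(distance_series) / m if m else 0.0
    if m <= near and rel_var <= stable:
        return "within informational boundary"   # moves along with the agent
    if m >= far and rel_var <= stable:
        return "information horizon"             # stable distant background
    return "proximal external information"       # midrange, varies with movement

if __name__ == "__main__":
    items = {
        "own memory":       [0.1, 0.1, 0.1, 0.1],
        "nearby signpost":  [3.0, 1.5, 6.0, 9.0],
        "national archive": [24.0, 25.0, 24.5, 25.5],
    }
    for name, series in items.items():
        print(f"{name}: {classify(series)}")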
References

Janlert, L. E. (2003). Contextual strategies – notes for a theory of context. Technical report UMINF 02.23, Umeå University. ISSN 0348-0542.
Janlert, L. E. (2006a). Available information – preparatory note for a theory of information space. tripleC, 4(2). ISSN 1726-670X.
Janlert, L. E. (2006b). Information at a distance. In: Proceedings of iC&P 2006 (Int. Conf. on Computers & Philosophy), Laval, May 2006.
Carnap, R. (1961). Der logische Aufbau der Welt. Hamburg: Felix Meiner Verlag. First edition appeared in 1928.

CONTEXTUAL INFORMATION
Modeling Different Interpretations of the Same Data within a Geometric Framework

KIRSTY KITTO
Faculty of Science and Technology
Queensland University of Technology
Brisbane, 4001, Australia

Abstract. Semantic Information has provided an elegant set of approaches that allow us to ground information with respect to its Context, Level of Abstraction and Purpose. Interestingly, computer science also has a history of considering context and attempting to incorporate it into fields such as Artificial Intelligence, Ubiquitous Computing, Information Systems design etc. These fields generally treat context as an unknown parameter, which tends to be insufficient when it comes to the modeling of cognition. This paper draws attention to a class of contextuality that arises from "knowing too differently" rather than "too little", and discusses the manner in which this new class is likely to be of increasing importance to the modeling of socio-technical and environmental systems. A new geometric model is discussed which incorporates context at its core. Thus, this paper presents an approach that might be used to ground the truth of statements within a relevant context. Such models make explicit the manner in which context can affect the interpretation of information, and can both consistently explain, and allow us to model, an important class of social phenomena. The model will be discussed with reference to both push polling and the climate change debate.

1. Information in Context

Semantic Information (Floridi, 2011) has provided an elegant set of approaches that allow us to ground information with respect to its Context, Level of Abstraction and Purpose, which has in turn allowed Floridi to develop a number of theories about truth, relevance, the logic of being informed etc. (Floridi, 2011). However, little work has been presented as to how this theory could correspond to the humans to whom it generally refers and, perhaps most importantly, to their aggregate behavior in e.g. elections, social movements and crises. Semantic Information has the potential to shed some light upon the responses exhibited by individuals to many of the complex information environments that surround them, but realistic models will be required before this can be achieved. While it is relatively easy to determine if the beer is in the fridge (or not), recent public debates on climate change, water management, consumer spending habits in the wake of the global financial crisis etc. have all served to emphasize the manner in which different sections of a community might ascribe very different values to statements generated from highly similar sets of data. The interpretation that should be attached to information is frequently the subject of vigorous debate, in which context tends to play a fundamental and highly complex role. This situation is recognized somewhat in Floridi's (2011) discussion of semantic truth; however, the manner in which such a conception might be worked into the computational modeling of social dynamics is yet to be considered. As scientists attempt to construct increasingly sophisticated climate, water and sociopolitical models, it has become essential that we consider the manner in which humans respond to complex sets of information and data. This paper will discuss a sophisticated agent-based model (ABM) of human decision making in context that is currently in development. This model took inspiration from the work of Brugnach et al. (2008),
who contrasted "knowing too little", a concept already extensively discussed in the computational literature (Akman & Surav, 1996; Brézillon, 1999), with "knowing too differently", a concept which is yet to be incorporated into the computational paradigm. To "know too differently" implies a contextual dependency of knowledge, which must be accounted for in models of human behavior. Taking a situation of water shortage as an example, it is frequently the case that a number of different framings can be provided. This results in the attribution of different interpretations to the situation, each potentially requiring a different response; how should a government react? A farmer will be concerned with "insufficient supply", while environmentalists might approach the water system thinking that the problem is one of "excessive consumption" (Brugnach et al., 2008). Both contexts have led to claims that are justified, but the two interpretations are incompatible, in that they apparently require different actions from policy makers.

Figure 1. The changing context of a decision. The probability of choosing a particular course of action changes between contexts p and q.

While relativistic arguments have a somewhat dubious reputation in pure philosophy, it is becoming increasingly important that we recognize the role context plays in the modeling of human responses to information and, in particular, in the decisions that humans make in utilizing this information. For example, when presented with the same set of information, different individuals might draw very different conclusions as to its consequences, and this can in turn lead to markedly different actions.

The manner in which the new model represents context is geometrical, and can be quickly explained with reference to the simple example illustrated in Figure 1. Here, we have represented the current state, A, of an agent (we shall call her Alice) with respect to two different contexts p and q. In this case, the state of our agent has been chosen to correspond to her projected response to a binary question, e.g. will you vote for candidate X in the coming election? A connection to probability is generated by assuming that the length of the state A is equal to 1, which means that the probabilities of Alice responding with a "yes" or "no" are given by the Pythagoras theorem in a particular context. Thus, writing \( A = a_{\mathrm{yes}}\,\mathbf{e}_{\mathrm{yes}} + a_{\mathrm{no}}\,\mathbf{e}_{\mathrm{no}} \) in the basis associated with a given context,

\[ P(\mathrm{yes}) = a_{\mathrm{yes}}^2, \qquad P(\mathrm{no}) = a_{\mathrm{no}}^2, \qquad a_{\mathrm{yes}}^2 + a_{\mathrm{no}}^2 = |A|^2 = 1. \tag{1} \]

With reference to Figure 1, it can quickly be seen that the probability of Alice responding with "yes" will be markedly different between the two contexts; while she has a higher probability of responding with "yes" in context p, she has a higher probability of responding with "no" to the same question in context q (this is given by a quick inspection of the lengths of the components making up a right-angled triangle with hypotenuse equal to the state A).
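A small numerical sketch of equation (1) (illustrative only; the state and context angles are invented) shows how one and the same unit-length state yields different response probabilities in two contexts:

# Numerical sketch of the geometric context model: the same unit state A
# projects differently onto the yes/no bases of two contexts p and q.
import numpy as np

def basis(theta):
    """Orthonormal yes/no basis rotated by `theta` radians."""
    yes = np.array([np.cos(theta), np.sin(theta)])
    no = np.array([-np.sin(theta), np.cos(theta)])
    return yes, no

def response_probs(state, theta):
    """Pythagoras: squared projections of a unit state give probabilities."""
    yes, no = basis(theta)
    return float(np.dot(state, yes) ** 2), float(np.dot(state, no) ** 2)

if __name__ == "__main__":
    A = np.array([np.cos(0.3), np.sin(0.3)])      # Alice's unit-length state
    for name, theta in [("context p", 0.1), ("context q", 1.2)]:
        p_yes, p_no = response_probs(A, theta)
        print(f"{name}: P(yes)={p_yes:.2f}, P(no)={p_no:.2f}")  # each pair sums to 1

With these made-up angles, Alice answers "yes" with high probability in context p but with low probability in context q, which is exactly the context effect the figure describes.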
This geometric model of decision making in context bears a remarkable resemblance to the geometric probability utilised in quantum theory (Isham, 1995), and indeed this similarity is further developed in a number of recent contextual models of, for example, decision making (Busemeyer et al., 2011), word recognition and recall (Bruza et al., 2009), concept combination (Aerts & Gabora, 2005) and information retrieval (Van Rijsbergen, 2004). The general framework of these models will be discussed, and the novel manner in which they incorporate context into the modeling of a state of affairs highlighted. In particular, this paper will highlight the way in which explicitly considering contextual factors in a model allows for a recognition of different points of view and frames without lapsing too deeply into relativism. While some notion of truth can be understood to exist in this model, the context in which a set of facts is presented can profoundly influence the interpretation that an agent would attribute to them.

Acknowledgements

Supported by the Australian Research Council Discovery grant DP1094974.

References

Aerts, D. & Gabora, L. (2005). A theory of concepts and their combinations I: the structure of the sets of contexts and properties. Kybernetes, 34, 151-175.
Akman, V. & Surav, M. (1996). Steps toward Formalizing Context. AI Magazine, 17(3), 55-72.
Brézillon, P. (1999). Context in problem solving: a survey. Knowledge Engineering Review, 14, 47-80.
Brugnach, M., Dewulf, A., Pahl-Wostl, C. & Taillieu, T. (2008). Toward a relational concept of uncertainty: about knowing too little, knowing too differently, and accepting not to know. Ecology and Society, 13(2), 30.
Bruza, P., Kitto, K., Nelson, D. & McEvoy, C. (2009). Is there something quantum-like about the human mental lexicon? Journal of Mathematical Psychology, 53, 362-377.
Busemeyer, J. R., Pothos, E. & Franco, R. (2011). A quantum theoretical explanation for probability judgment 'errors'. Psychological Review. In press.
Floridi, L. (2011). The Philosophy of Information. Oxford University Press.
Fox, J. S. (1997). Push Polling: The Art of Political Persuasion. Florida Law Review, 49, 563.
Isham, C. J. (1995). Lectures on Quantum Theory. London: Imperial College Press.
Van Rijsbergen, C. (2004). The Geometry of Information Retrieval. Cambridge University Press.

COGNITION AS MANAGEMENT OF MEANINGFUL INFORMATION: PROPOSAL FOR AN EVOLUTIONARY APPROACH

CHRISTOPHE MENANT

Extended Abstract

Humans are cognitive entities. Our behaviors and ongoing interactions with the environment are threaded with creations and usages of meaningful information, be they conscious or unconscious. Animal life is also populated with meaningful information related to the survival of the individual and of the species. The meaningfulness of information managed by artificial agents can also be considered a reality once we accept that the meanings managed by an artificial agent are derived from what we, the cognitive designers, have built the agent for. This rapid overview brings us to consider that cognition, in terms of management of meaningful information, can be looked at as a reality for animals, humans and robots. But it is pretty clear that the corresponding meanings will be very different in nature and content. Free will and self-consciousness are key drivers in the management of human meanings, but they do not exist for animals or robots. Also, staying alive is a constraint that we share with animals; robots do not carry that constraint. Such differences in meaningful information and cognition for animals, humans and robots could bring us to believe that the analysis of cognition for these three types of agents has to be done separately.
But if we agree that humans are the result of the evolution of life and that robots are a product of human activities, we can then look at addressing the possibility of an evolutionary approach to cognition based on meaningful information management. A bottom-up path would begin with meaning management within basic living entities, then climb the ladder of evolution up to us humans, and continue with artificial agents. This is what we propose to present here: an evolutionary approach to cognition, based on meaning management, using a simple systemic tool. We use for that an existing systemic approach to meaning generation where a system submitted to a constraint generates meaningful information (a meaning) that will initiate an action in order to satisfy the constraint (Menant, 2003; 2010a). The action can be physical, mental or other. This systemic approach defines a Meaning Generator System (MGS). The simplicity of the MGS makes it available as a building block for meaning management in animals, humans and robots. Contrary to approaches to meaning generation in psychology or linguistics, the MGS approach is not based on the human mind. To avoid circularity, an evolutionary approach has to be careful not to include components of the human mind in the starting point.

The MGS receives information from its environment and compares it with its constraint. The generated meaning is the connection existing between the received information and the constraint. The generated meaning is to trigger an action aimed at satisfying the constraint. The action will modify the environment, and with it the generated meaning. Meaning generation links agents to their environments in a dynamic mode. The MGS approach is triadic, of a Peircean type. The systemic approach allows wide usage of the MGS: a system is a set of elements linked by a set of relations. Any system submitted to a constraint and capable of receiving information from its environment can lead to an MGS. Meaning generation can be applied to many cases, assuming we identify clearly enough the systems and the constraints. Animals, humans and robots are then agents containing MGSs. Similar MGSs carrying different constraints will generate different meanings. Cognition is system dependent.
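The following toy sketch (my own illustration; the "glucose" variable, the setpoint and the action rule are invented, not Menant's) renders the MGS loop just described in code: information is received, compared with the constraint, a meaning (the connection between the two) is generated, and an action modifies the environment and hence subsequent meanings.

# Toy Meaning Generator System (MGS): a system under a constraint receives
# information, generates a meaning as the connection between that information
# and the constraint, and triggers an action aimed at satisfying the
# constraint. All quantities below are invented for illustration.
from dataclasses import dataclass

@dataclass
class Meaning:
    received: float          # incoming information (e.g. a sensed level)
    constraint: str
    gap: float               # the connection between information and constraint

class MGS:
    def __init__(self, constraint, setpoint):
        self.constraint, self.setpoint = constraint, setpoint

    def generate_meaning(self, info):
        """Meaning = the connection between received info and the constraint."""
        return Meaning(info, self.constraint, self.setpoint - info)

    def act(self, meaning, environment):
        """Action aimed at satisfying the constraint; it modifies the
        environment, which in turn changes future meanings."""
        if meaning.gap > 0:
            environment["glucose"] += meaning.gap * 0.5  # seek food, partially
        return environment

if __name__ == "__main__":
    env = {"glucose": 2.0}
    mgs = MGS(constraint="stay alive", setpoint=5.0)
    for step in range(3):
        m = mgs.generate_meaning(env["glucose"])
        env = mgs.act(m, env)
        print(f"step {step}: gap={m.gap:.2f}, glucose={env['glucose']:.2f}")

Running the loop shows the dynamic mode described above: each action changes the environment, so the meaning generated at the next step is different.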
We first apply the MGS approach to animals with "stay alive" and "group life" constraints. Such constraints can bring us to model many cases of meaning generation and action in the organic world. However, it is to be highlighted that even if the functions and characteristics of life are well known, the nature of life is not really understood. Final causes are difficult to integrate into today's science. So analyzing meaning and cognition in living entities will have to take into account our limited understanding of the nature of life. Ongoing research on concepts like autopoiesis could bring a better understanding of the nature of life (Weber & Varela, 2002).

We next address meaning generation for humans. This case is the most difficult, as the nature of the human mind is a mystery for today's science and philosophy. The natures of our feelings, free will or self-consciousness are unknown. Human constraints, meanings and cognition are difficult to define. Any usage of the MGS approach for humans will have to take into account the limitations that result from the unknown nature of the human mind. We will however present some possible approaches to identify human constraints where the MGS brings some openings in an evolutionary approach (Menant, 2010b; 2010c). But it is clear that the better the human mind is understood, the better placed we will be to address meaning management and cognition for humans. Ongoing research activities relative to the nature of the human mind cover many scientific and philosophical domains (Philpapers, Philosophy of Mind).

The case of meaning management and cognition in artificial agents is rather straightforward with the MGS approach, as we, the designers, know the agents and the constraints. In addition, our evolutionary approach leads us to position notions like artificial constraints, meaning and autonomy as derived from their animal or human source. We also highlight that cognition as management of meaningful information by agents goes beyond information and needs to address representations, which belong to the central hypothesis of cognitive science. We define the meaningful representation of an item for an agent as the network of meanings relative to the item for the agent, together with the action scenarios involving the item. Such meaningful representations embed the agents in their environments and are far from the GOFAI-type ones (Menant, 2010b). Meanings, representations and cognition exist by and for the agents. We finish by summarizing the points presented and highlighting some possible continuations.

References

Menant, C. (2003). Information and Meaning. Entropy, 5, 193-204. ISSN 1099-4300. (http://cogprints.org/3694/)
Menant, C. (2010a). Introduction to a Systemic Theory of Meaning (short paper). http://crmenant.free.fr/ResUK/MGS.pdf
Menant, C. (2010b). Computation on Information, Meaning and Representations: An Evolutionary Approach. In: Dodig Crnkovic, G. & Burgin, M. (Eds.), Information and Computation: Essays on Scientific and Philosophical Understanding of Foundations of Information and Computation. World Scientific Series in Information Studies, Vol. 2. (http://www.idt.mdh.se/ECAP-2005/INFOCOMPBOOK/CHAPTERS/10Menant.pdf)
Menant, C. (2010c). Proposal for a shared evolutionary nature of language and consciousness. http://cogprints.org/7067/
Philpapers. Philosophy of Mind. http://philpapers.org/browse/philosophy-of-mind
Weber, A. & Varela, F. (2002). Life after Kant: Natural purposes and the autopoietic foundations of biological individuality. Phenomenology and the Cognitive Sciences, 1, 97-125.

COMPUTATIONAL AND HUMAN MIND MODELS

FRANCISCO HERNÁNDEZ-QUIROZ
UNAM, Departamento de Matemáticas, Facultad de Ciencias
Ciudad Universitaria, C.P. 04510, D.F., MEXICO

Abstract. Computational models of the human mind have been the subject of a heated debate since Turing's seminal paper of 1950. Some opponents of so-called Strong AI have postulated alternative mechanisms based on one or another form of hypercomputation. Although specific arguments can be (and have been) raised against the possibility of hypercomputation, a different approach is possible: accept the possibility of human cognitive abilities beyond the reach of Turing Machines (TMs) and then face the problem of postulating appropriate physical mechanisms underlying these hypercomputing abilities. The result can lead to difficulties as hard as those faced by Strong AI in the first place, reducing the allure of the hypercomputing alternatives.
1. Introduction

In his celebrated paper of 1950, Turing advanced the then daring proposal of machines able to emulate the human mind. Those machines were the practical realization of the model he had introduced earlier, in 1936-7. Turing's formulation is careful to avoid the categorical statement that the human mind can be emulated by a Turing Machine because it is itself a Turing Machine. However, successive computer scientists have reprised Turing's proposal without his caveats. An extreme and idealized version of this point of view is known as Strong Artificial Intelligence (Searle, 1984).

2. An Objection to Artificial Intelligence

The thesis that the human mind can be modelled by Turing Machines has been attacked by many people. A common line of attack goes like this:
• Strong AI claims the human mind can be modelled by Turing Machines.
• Turing Machines suffer internal limitations that surface in theorems due to Turing himself, Rice and even Gödel.
• But human cognitive abilities go beyond these limitations.
• Ergo, the human mind cannot be modelled by Turing Machines.

This argument has been rejected by many authors (Feferman, 1996; Chalmers, 1995). But this paper will take a different approach: what happens if we accept that the human mind cannot be modelled by a Turing Machine? What type of mechanism is needed instead? What problems arise when such a model is adopted?

3. "Mechanisms" more powerful than computers

There are many candidates for this role. On the one hand, physical systems with properties (supposedly) beyond the restrictions of Turing Machines (Penrose, 1994). On the other hand, mathematical models circumventing those same restrictions: Oracle Turing Machines (Turing, 1939), Analog Neural Networks (Siegelmann, 1999), Dynamical Systems (Bournez & Cosnard, 1995), etc. In fact, there is a common core in all these models: (a) they claim to implement some notion of what can intuitively be considered a computational mechanism; (b) simultaneously, they include elements capable of introducing entities that are not Turing computable. They can be gathered under the label of "hypercomputation." Many of those who oppose Strong AI claim that the human cognitive abilities which are not explicable by TMs are in fact based on one or another hypercomputing mechanism.

4. Towards a new scientific research program?

But these mechanisms are also prone to run into trouble. Sieg (2008) has argued convincingly that Turing Machines' limitations are a consequence of the acceptance of two principles: locality and boundedness. The first principle means that a computer can only change immediately recognizable configurations in finite time. The second means that a computer can immediately recognize only a bounded number of configurations (and therefore there exists an upper bound to the amount of information it can handle in finite time). By rejecting TMs as an upper bound to computability, we reject these principles. No need to worry, though, theoretically speaking, if we are only interested in abstract mathematical models. But if the aim is to model or to explain the human mind, and some of its capabilities are attributed to hypercomputing features, then we are asserting implicitly that the human mind (or its physical substratum, if you will) goes beyond the principles of locality and boundedness. One variety of hypercomputation even asserts the possibility of harnessing and manipulating non-computable irrational numbers (Siegelmann, 1999).
And if we want to remain on scientific grounds, we will be pressed to point out the physical counterparts of these theoretical entities and to postulate hypercomputation in Nature. Of course, none of this is impossible, at least in principle. However, our quest for a model of the human mind has led us to pose very basic questions about physical reality that bring with them huge theoretical and practical challenges, which look at least as difficult as the problems faced by the computational models of the human mind. The moral might be that a theoretical alternative is not necessarily a plausible explanation for a natural phenomenon.

References

Bournez, O. & Cosnard, M. (1995). On the computational power and super-Turing capabilities of dynamical systems. Technical report no. 95-30, Laboratoire de l'Informatique du Parallélisme, Ecole Normale Supérieure de Lyon.
Chalmers, D. J. (1995). Minds, Machines, and Mathematics: A Review of Shadows of the Mind by Roger Penrose. Psyche, 2(9).
Feferman, S. (1996). Penrose's Gödelian argument. Psyche, 2, 21-32.
Penrose, R. (1989). The Emperor's New Mind: Concerning Computers, Minds and The Laws of Physics. Oxford University Press.
Penrose, R. (1994). Shadows of the Mind: A Search for the Missing Science of Consciousness. Oxford University Press.
Searle, J. (1984). Minds, Brains and Science. Cambridge: Harvard University Press.
Sieg, W. (2008). Church Without Dogma: axioms for computability. In: B. Löwe, A. Sorbi & B. Cooper (Eds.), New Computational Paradigms (pp. 139-152). Springer Verlag.
Siegelmann, H. T. (1999). Neural Networks and Analog Computation: Beyond the Turing Limit. Birkhäuser, Progress in Theoretical Computer Science.
Turing, A. M. (1936-7). On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 42, 230-265.
Turing, A. M. (1939). Systems of Logic Based on Ordinals. Proceedings of the London Mathematical Society, Series 2, 45, 161-228.
Turing, A. M. (1950). Computing Machinery and Intelligence. Mind, 59, 433-460.

SEMANTICS OF INFORMATION
Meaning and Truth as Relationships between Information Carriers

MARCIN J. SCHROEDER
Akita International University
Akita, Japan

Abstract. The meaning of information was openly dismissed from the interests of information theory already by Shannon, but the fiasco of the early attempt by Bar-Hillel and Carnap to develop a semantic theory of information was even more discouraging. They developed their theory of semantic information using as a starting point the already existing logical structure of language, not recognizing the fact that language is a very special information system and that the logic of information should be built before its semantic theory. The philosophical concept of meaning has for centuries been associated with the medieval scholastic concept of intentionality, the pointing by a symbol at an intended object, identified by Brentano and his followers as the primary characteristic of mental acts. None of the attempts to eliminate the psychologism of intentionality removed the primary source of philosophical problems, which has always been the fact that semantics requires crossing the border between different ontological entities. This difficulty could not be resolved within the philosophy of language, as at this level the difference between linguistic items and the entities to which they refer cannot be ignored.
The relationship between a symbol and its meaning does not require a separation of ontological status when the meaning is understood as a relationship between information in two different information carriers, that of the symbol and that of the denotation. In the present paper, both symbol and object are described in terms of information integration. Every entity is characterized through the integrated part of its information, constituting its identity, and the non-integrated part, interpreted as its state. The correspondence of identities, i.e. of the integrated parts of information, is here identified as the meaning; the correspondence between states, i.e. the non-integrated parts of information, is identified as the truth.

1. Sources of Problems in Semantics of Information

Difficulties in the development of the semantics of information are in part inherited from linguistic semantics, but some of them have their sources in the circumstances in which information theory was born. The meaning of meaning has always been an elusive subject. Ogden and Richards (1923/1989), in their widely read study of this concept, considered its sixteen basic meanings. The philosophical concept of meaning has for centuries been associated with the medieval scholastic concept of intentionality, the pointing by a symbol at an intended object. Brentano identified intention or "aboutness" as the fundamental characteristic of mental capacity. The logical approach initiated by Frege and developed by Church was an attempt to eliminate the psychological aspects of meaning by making a distinction between denotation and sense, and by focusing on the rules reducing the sense of compound expressions to that of simple ones. However, the shift of attention to the mutual relationships between expressions of a language at different levels of complexity does not help us understand the relationship between simple signs and their denotations, to which the process of reduction leads. Under the influence of logical positivism, Carnap attempted to resolve this issue in the context of scientific methodology by invoking the idea of empirical sense, reducing the criteria of the relationship to empirical procedures. The approach initiated by Peirce, whose original writings preceded most of the contemporary work on the concept of meaning, was also intended as a way to eliminate the necessity of involving a human subject in semiosis. In his approach, sign and object are accompanied by an interpretant, itself of the type of a sign. Being a sign, the interpretant may enter into another triadic relation with its own object and interpretant. Its role is to build a connection between sign and object which does not require the involvement of a human being. This approach leaves the question of the traditional relationship between the sign and its meaning open-ended, but it hardly explains it, especially when the sign has a different ontological status from that of the object. As in the logical approach, we have here an extension of the study towards a complex structure of signs or names, but the basic relationship between the object and the sign is left in the shadow. No wonder that the issue of the meaning of information has been dismissed from the subject of information theory so easily.
Shannon's disclaimer "These semantic aspects of communication are irrelevant to the engineering problem" (Shannon & Weaver, 1949/1998) has been followed by the majority of information theorists, such as Cherry (1951/1952): "It is important to emphasize, at the start, that we are not concerned with the meaning or the truth of messages; semantics lies outside the scope of mathematical information theory." After all, the measure of information was defined for one letter or character of a message, which does not carry any meaning. The measure for the entire message was simply the sum of the measures for its characters. The fiasco of the early attempts to develop a semantic theory of information, such as the most advanced attempt by Bar-Hillel and Carnap (1952), sealed the fate of the study of the semantics of information. Bar-Hillel and Carnap developed their theory of semantic information using as a starting point the already existing logical structure of the language. They did not take into account that language is a very special information system and that a more general logic of information should be built before its semantic theory.

2. Semantics as Relationship between Information Carriers

Bar-Hillel and Carnap (1952) built their measure of semantic information in such a way that it can be reduced to Shannon's entropy in a special case. However, there is a fundamental problem here: whether the measure of information transmitted in the process of communication applies to information carried by some carrier (symbol or object). The present author (Schroeder, 2004) believes that the answer is negative, and that the measure of semantic information should be based on an alternative measure taking into consideration the amount of information carried by symbols, which should be estimated based on the relationship between the information in the symbol and the information in the designate. However, the primary source of the philosophical problems of semantics has always been the requirement of crossing the border between different ontological entities. This difficulty could not be resolved within the philosophy of language, as at this level the difference between linguistic items and the entities to which they refer cannot be ignored. The relationship between a symbol and its meaning does not require a separation of ontological status when the meaning is understood as a relationship between information in two different information carriers, that of the symbol and that of the denotation. In the present paper, both symbol and object are described in terms of information integration (Schroeder, 2009).
On this broad understanding, not only is language a carrier of information; so is every object of our experience. Cognitive processes involve transformations of the selective manifestation of information coming with sensory stimulation into the structural manifestation of information, which in its integrated form constitutes conscious experience. Every entity is characterized through the integrated part of its information, constituting its identity, and through the non-integrated part, interpreted as its state. The correspondence of identities, i.e. of integrated parts of information, is here identified as the meaning; the correspondence between states, i.e. non-integrated parts of information, is identified as the truth.

References

Bar-Hillel, Y. & Carnap, R. (1952/1964). An Outline of a Theory of Semantic Information. Technical Report No. 247, Research Laboratory of Electronics, MIT; reprinted in Bar-Hillel, Y. (1964) Language and Information: Selected Essays on Their Theory and Application. Reading, MA: Addison-Wesley, pp. 221-274.
Cherry, E. C. (1951/1952). A history of the theory of information. Proceedings of the Institute of Electrical Engineers, 98 (III), 383-393; reprinted with minor changes as: The communication of information. American Scientist, 40, 640-664.
Ogden, C. K., Richards, I. A. (1923/1989). The Meaning of Meaning: A Study of the Influence of Language Upon Thought and of the Science of Symbolism. San Diego: A Harvest Book, Harcourt Brace Jovanovich.
Schroeder, M. J. (2004). An Alternative to Entropy in the Measurement of Information. Entropy, 6, 388-412.
Schroeder, M. J. (2005). Philosophical Foundations for the Concept of Information: Selective and Structural Information. In Proceedings of the Third International Conference on the Foundations of Information Science, Paris. http://www.mdpi.org/fis2005.
Schroeder, M. J. (2007). Logico-algebraic structures for information integration in the brain. Proceedings of RIMS 2007 Symposium on Algebra, Languages, and Computation, Kyoto: Kyoto University, pp. 61-72.
Schroeder, M. J. (2009). Quantum Coherence without Quantum Mechanics in Modelling the Unity of Consciousness. In P. Bruza, et al. (Eds.) QI 2009, LNAI 5494, Springer, pp. 97-112.
Shannon, C. E., Weaver, W. (1949/1998). The Mathematical Theory of Communication. Urbana: University of Illinois Press.


PRE-COGNITIVE SEMANTIC INFORMATION

ORLIN VAKARELOV
Department of Philosophy, University of Arizona, Tucson, Arizona, USA
Email: okv@u.arizona.edu

(This talk is based on Vakarelov, 2010.)

Abstract. This talk addresses one of the fundamental problems of the philosophy of information: How does semantic information emerge within the underlying dynamics of the world? This is the dynamical semantic information problem. It is suggested that the canonical approach to semantic information, which defines data before meaning and meaning before use, is inadequate for pre-cognitive information media. Instead, we should follow a pragmatic approach to information, where one defines the notion of an information system as a special kind of purposeful system emerging within the underlying dynamics of the world, and defines semantic information as the currency of the system. In this way, systems operating with semantic information can be viewed as patterns in the dynamics: semantic information is a dynamical-system phenomenon of highly organized systems. In the simplest information systems the syntax, semantics and pragmatics of the information medium are co-defined. The talk proposes a new, more general theory of information semantics that focuses on the interface role of the information states in the information system: the interface theory of meaning.

1. Introduction

I address the following problem: How does semantic information emerge within the underlying dynamics of the world? Let us call this the dynamical semantic information (DSI) problem. It is related to another kind of problem: Can we provide a foundation for cognitive science with the notion of (semantic) information? I claim that it is possible to offer a theory of pre-cognitive semantic information that does not presuppose a notion of cognition or mind. With such a theory, the notion of semantic information can be used in foundational discussions of cognition without circularity. However, I do not plan to address this second problem here. My strategy for addressing DSI is this: Start with a notion of information system as a special kind of autonomous dynamical system interacting with an environment. Describe semantic information as a "currency" of the information system; that is, treat information for the system not as a primitive but as a derived notion, similar to the way currency is a derived notion of an economic system. Take a decomposition approach to analyzing the components of semantic information; that is, regard notions such as data, meaning and source as depicting aspects of informational processes within the information system. Provide a theory of meaning, the interface theory of meaning, for the informational states of an information medium within the information system.

2. Canonical Views of Semantic Information

Most theories of semantic information make the following assumptions: (1) semantic information = data + meaning (+ truthfulness); (2) data is conceptually primary; (3) meaning is secondary and depends on data; (4) pragmatics is tertiary and depends on meaning. In this view, the '+' in the definition of information can be regarded as an amendment operation, where syntax is amended by semantics to obtain a theory of semantic information, and semantics is amended with an account of the use of information to obtain a theory of pragmatic information. I call an approach to semantic information that proceeds in this way an amendment approach. Taking an amendment approach to semantic (and pragmatic) information has no effect on the formal theories of information; however, it affects meta-theoretic judgments about theories of information. In particular, it affects which theories of information are regarded as more general. I argue (defeasibly) that taking the notion of data as conceptually primary (and independent from semantics and pragmatics) leads to an indispensable role for a mind in the specification of semantics. This makes naturalizing semantic information difficult, because the cases where the data system can be defined precisely without semantics or pragmatics are cases where semantics requires an external interpreter. The meta-theoretical judgments about such cases mistakenly conclude that these cases are the most general, and that they therefore offer the most inclusive theory of semantic information.

3. The Pragmatic Approach to Semantic Information

I propose an alternative: I argue for a decomposition approach to information; that is, I argue that in the most general case of semantic information, data, semantics, and pragmatics are codetermined as aspects of an information process. The most general kind of information is pragmatic information; that is, in the most general case, semantic information requires a system that utilizes information in its interaction with an environment. Such a system I call, following Nauta (1970), an information system. The strategy of pragmatic analysis of information is the following: The most basic notion is information system. An information system S is a physical system that is in an active interaction with an external environment and that satisfies a set of conditions that do not presuppose the notion of information. The conditions must guarantee the existence in S of a sub-system, M, that can be interpreted as an information medium. Moreover, the functional role of M in S in relation to the interaction with the environment must be sufficient to define the semantic content of the states of M. According to this strategy, S is an information system not because it operates with meaningful information; conversely, it operates with information because it is an information system. The most important idea is that what counts as data, and what gives the data semantic content, is determined by the role they play in the information system.

4. Information Systems

An information system S is a system that satisfies the following five conditions:

1. S is an open system, i.e. a system that is distinct from its environment but in constant interaction with it.
2. S is a partially isolated open system, i.e. some of the interactions between S and the environment are structured through well-defined, limited channels of influence.
3. S is a purposeful system. That is, there is at least one proper set of goal states, G, that the system "attempts" to be in (or near) by affecting its environment.
4. S contains a sub-system M that can correlate with an external system O, and M can control the behavior of S.
5. S contains a second, distinct sub-system P that filters the states of M and their effect on behavior in relation to its purpose. In other words, P steers the system towards G by modulating the control effect of M.

I argue that all the conditions for an information system can be depicted (in principle) as conditions on dynamical systems. Thus, no mentalistic or cognitive notions are needed to define an information system. I also argue that the conditions are sufficient to justify regarding M as an information medium, with states that can be interpreted as data/information states and as having meaning for the system. The data/information states of M, however, depend on the global dynamics. In particular, they depend on the way P modulates the control function of M and on the states of O (which can be regarded as an information source). However, the states of O and P also depend on the global dynamics. Thus, in the most general information systems all relevant components of the information system are codetermined (except the goal G).

5. Interface Theory of Meaning

In an information system, content is determined neither by the external relation between M and O, nor by the internal role of the states of M in S, but by the interface roles the states of M play in the dynamics of the system. This is the interface theory of meaning for information states in an information system. More traditional theories of semantics, such as correspondence semantics or conceptual role semantics, can be obtained from interface role semantics as aspects of the interface relation. A deliberately crude dynamical toy satisfying something like conditions 1-5 is sketched below.
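The following sketch is our own illustrative addition, not Vakarelov's model: a gradient-following agent whose sensor sub-system M correlates with an external nutrient field O, and whose filter P modulates M's control of movement so as to steer toward the goal region G. Whether such a minimal system genuinely satisfies all five conditions is of course arguable, and all names are hypothetical.

```python
import random

def nutrient(x):          # external system O: a nutrient field over positions
    return -abs(x - 10)   # richest at x = 10 (the goal region G is near here)

def sense(x):             # medium M: a state correlating with O via a channel
    return nutrient(x + 1) - nutrient(x - 1)   # local gradient estimate

def filter_p(m_state):    # filter P: modulates M's control effect toward G
    if m_state > 0:  return +1   # gradient rises to the right: move right
    if m_state < 0:  return -1   # gradient rises to the left: move left
    return random.choice([-1, +1])  # flat: explore

x = 0.0                   # the system S starts far from its goal states
for step in range(30):    # global dynamics: S acts on its environment
    x += filter_p(sense(x))
print(round(x, 1))        # S ends up near 10: M's states steered it there
```

The point of the toy is only that the "data" states of M get whatever content they have from the interface role they play between O, P and the movement of S, not from any external interpreter.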
The interface theory of meaning thus properly generalizes other theories of meaning, which apply only when further conditions are imposed on the information system.

References

Nauta, D. (1970). The Meaning of Information. The Hague: Mouton.
Vakarelov, O. (2010). Pre-cognitive semantic information. Knowledge, Technology & Policy, 23(1), 193-226.


Track III: Autonomous Robots and Artificial Cognitive Systems


WHO WILL HAVE AN IRRESPONSIBLE, UNTRUSTWORTHY, IMMORAL INTELLIGENT ROBOT?
Why Artifactually Intelligent Adaptive Autonomous Agents Need to Be Artifactually Moral

MARGARYTA GEORGIEVA ANOKHINA
School of Innovation, Design and Engineering, Mälardalen University, Sweden
maa05002@student.mdh.se

AND

GORDANA DODIG-CRNKOVIC
School of Innovation, Design and Engineering, Mälardalen University, Sweden
gordana.dodig-crnkovic@mdh.se

Abstract. We argue that there is a natural place for artificial moral agency parallel to artificial intelligence.

1. Extended Abstract

Historically, moral agency was conceptualized in purely anthropocentric terms. Consequently, only humans qualify as moral agents according to the traditional criteria, and no agents other than humans were considered capable of moral agency. We discuss such conventional criteria as mental states, intentionality, autonomy, free will, responsibility, rationality and moral reasoning, and compare human agents with artificial agents (intelligent adaptive learning robots and software agents, present and envisaged in the coming decades). We attempt to understand what shaped the traditional criteria in the past and how technological change is re-shaping the world around us, including what could (and should) be considered a moral agent. We suggest that the conventional approach to moral agency is unable to provide exhaustive criteria for dealing with the moral situations of the contemporary world, which involve techno-social systems with autonomous intelligent agents, both humans and artifacts. We also discuss how morality can be approached in new ways in the case of artificial agents. The argument is made that a human-centric approach to intelligent autonomous machines is inappropriate as a means of controlling the behavior of self-learning artificial agents, and a new proposal is made about how to treat the notion of moral responsibilities in techno-social systems in which intelligent artifacts act autonomously.

In the past, mechanical age of engineering, technological systems were designed to perform specific and limited functions, and they were kept closed, with no access to the outside world (like a robot making car parts, for example). Nowadays systems with artificial intelligence are more complex and sophisticated, and they are starting to be implemented in everyday environments such as people's homes, helping elderly and sick people and serving as companions (the developing field of social robotics). This rapid technological change re-shapes and expands the ways of thinking about agency and morality that we used to have. A machine "talks", "selects", "runs", "reasons", "senses", "plays chess", etc., not in a human way, but we use these words to express the functionality of a machine in familiar terms. Why can't a machine "choose", "decide", "think" or "be responsible"? In a similar way as machines are artifactually intelligent, they can be, and indeed must be, made artifactually moral if we are to rely on them even when they are not under direct control, when they act autonomously. The term "artificial intelligence" reveals the same problem: one had to accept that a machine can behave intelligently even though its intelligence is that of an artifact, not a human intelligence. Similarly, a machine can be made functionally, artificially moral. It may take some effort to find out how to secure morally acceptable behavior in intelligent learning machines, and some researchers suggest it may take as much effort as the development of artificial intelligence itself. But it would be irresponsible to let such machines go among people without morally acceptable behavior according to human standards.

Floridi and Sanders (2004) consider interactivity, autonomy and adaptability at a given level of abstraction as important new criteria for moral agency. Morality in this approach is thought of as "a threshold defined on the observables in the interface". These criteria are related to the criterion of operational environment suggested by Berthier (2006) and that of domain suggested by Foner (1993). This requirement relates to the differences between the domains of interest for moral consideration for human agents and for artificial ones. As humans act and behave in a specific environment, artificial agents do as well, but the conditions are different, and thus probably not all criteria suitable for the human domain are applicable to the operational environment of artificial agents. Both artificial agents and humans need interaction and the ability to adapt to the environment in order to act morally, according to the rules that define moral actions. Coeckelbergh (2009) suggests using the term virtual morality, as robots can exhibit behaviour akin to the behaviour of humans in analogous situations. The aim of the emerging research field of machine ethics (machine morality, artificial morality, or computational ethics), as developed in Anderson and Anderson (2007), Allen, Wallach and Smit (2006) and Moor (2006), is moral decision-making implemented in computers and robots. We discuss the parallels between an artificial agent's possible artifactual moral agency and human agency (see Dodig-Crnkovic and Persson, 2008), their similarities and differences. We argue that there is a natural place for artificial moral agency parallel to artificial intelligence.

References

Floridi, L. and Sanders, J. W. (2004). On the Morality of Artificial Agents. Minds and Machines 14 (3): 349-379.
Berthier, D. (2006). Artificial Agents and their Ontological Status, iC@P 2006: International Conference on Computers and Philosophy, pp. 2-5.
Foner, L. (1993). What's An Agent, Anyway? A Sociological Case Study, available from the Agents Group, MIT Media Lab. http://www.nada.kth.se/kurser//kth/2D1381/JuliaHeavy.pdf, p. 35.
Coeckelbergh, M. (2009). Virtual moral agency, virtual moral responsibility: on the moral significance of the appearance, perception, and performance of artificial agents. AI & Society 24: 188-189.
Anderson, M. and Anderson, S. L. (2007). Machine Ethics: Creating an Ethical Intelligent Agent. AI Magazine, Volume 28, Number 4.
Allen, C., Wallach, W., Smit, I. (2006). Why Machine Ethics?, IEEE Intelligent Systems, vol. 21, no. 4, pp. 12-17, July/Aug. 2006, doi:10.1109/MIS.2006.83.
Moor, J. H. (2006). The Nature, Importance, and Difficulty of Machine Ethics, IEEE Intelligent Systems, vol. 21, no. 4, pp. 18-21, July/Aug. 2006.
Dodig-Crnkovic, G. and Persson, D. (2008). Sharing Moral Responsibility with Robots: A Pragmatic Approach. Tenth Scandinavian Conference on Artificial Intelligence, SCAI 2008. Volume 173, Frontiers in Artificial Intelligence and Applications. Eds. A. Holst, P. Kreuger and P. Funk.


THE ETHICS OF ROBOTIC DECEPTION

RONALD C. ARKIN
Mobile Robot Laboratory, Georgia Institute of Technology
85 5th ST NW, Atlanta, GA 30332, U.S.A.

The time of robotic deception is rapidly approaching. While some individuals are trumpeting the inherent ethical dangers of the approaching robotics revolution (e.g., Joy, 2000; Sharkey, 2008), little concern, until very recently, has been expressed about the potential for robots to deceive human beings. Our working definition of deception (of which there are many) that frames the rest of this discussion is: "deception simply is a false communication that tends to benefit the communicator" (Bond and Robinson, 1988). Research is slowly progressing in this space, with some of the first work, developed by Floreano et al. (2007), focusing on the evolutionary edge that deceit can provide within an otherwise homogeneous group of robotic agents. This work did not focus on human-robot deceit, however. As an outgrowth of our research in robot-human trust (Wagner and Arkin, 2008), in which the robot was concerned with whether or not to trust a human partner rather than the other way around, we considered the dual of trust: deception. As any good conman knows, trust is a precursor for deception, so the transition to this domain seemed natural. We were able to apply the same models of interdependence theory (Kelley and Thibaut, 1978) and game theory to create a framework whereby a robot could make decisions regarding both when to deceive (Wagner and Arkin, 2009) and how to deceive (Wagner and Arkin, 2011). This involves the use of partner modeling, or a (currently) simplistic view of theory of mind, to enable the robot to (1) assess a situation; (2) recognize whether conflict and dependence exist in that situation between deceiver and mark, which is an indicator of the value of deception; (3) probe the partner (mark) to develop an understanding of their potential actions and perceptions; and (4) then choose an action which induces an incorrect outcome assessment in the partner. A toy rendering of this decision logic is sketched at the end of this section.

While we believe the results we published (Wagner and Arkin, 2011) were modestly stated (e.g., "they do not represent the final word on robots and deception", "the results are a preliminary indication that the techniques and algorithms described in this paper can be fruitfully used to produce deceptive behavior in a robot", "much more psychologically valid evidence will be required to strongly confirm this hypothesis", etc.), the response to this research has been quite the contrary, ranging from accolades (being listed as one of the top 50 inventions of 2010 by Time Magazine (Suddath, 2010)) to damnation ("In a stunning display of hubris, the men ... detailed their foolhardy experiment to teach two robots how to play hide-and-seek" (Tiku, 2010), and "Researchers at the Georgia Institute of Technology may have made a terrible, terrible mistake: They've taught robots how to deceive" (Geere, 2010)). It seems we have touched a nerve. How can it be both ways? It may be where deception is used that forms the hot button for this debate.
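The following is a toy reconstruction of steps (1)-(4) under interdependence-theoretic assumptions, not the algorithm of Wagner and Arkin (2011); the payoffs, names and the mark's best-response model are all our own illustrative simplifications.

```python
# Outcome matrix: (deceiver_action, mark_action) -> (deceiver_payoff, mark_payoff)
# Hide-and-seek flavor: the deceiver picks a hiding spot, the mark searches one.
GAME = {
    ("left",  "left"):  (0, 5), ("left",  "right"): (5, 0),
    ("right", "left"):  (5, 0), ("right", "right"): (0, 5),
}
ACTIONS = ["left", "right"]

def conflict(game):
    # Step 2a: the two parties prefer different outcomes.
    best_deceiver = max(game, key=lambda o: game[o][0])
    best_mark = max(game, key=lambda o: game[o][1])
    return best_deceiver != best_mark

def dependence(game):
    # Step 2b: the deceiver's payoff varies with the mark's action.
    return any(game[(d, a)][0] != game[(d, b)][0]
               for d in ACTIONS for a in ACTIONS for b in ACTIONS)

def choose_deception(game):
    # Steps 3-4: assume probing showed the mark best-responds to signals,
    # i.e. searches wherever it believes the deceiver to be. Pick the
    # (signal, true_action) pair that maximizes the deceiver's payoff.
    if not (conflict(game) and dependence(game)):
        return None  # deception has no value in this situation
    return max(((sig, act) for sig in ACTIONS for act in ACTIONS),
               key=lambda p: game[(p[1], p[0])][0])

print(choose_deception(GAME))  # ('left', 'right'): signal left, hide right
```

The point at which conflict and dependence co-occur is exactly where, on the framework described above, false communication becomes profitable; absent either, the function declines to deceive.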
For military applications, it seems clear that deception is widely accepted (and this indeed was the intended use of our research, as our sponsor is the Office of Naval Research). Sun Tzu is quoted as saying that "All warfare is based on deception", and Machiavelli in The Discourses states that "Although deceit is detestable in all other things, yet in the conduct of war it is laudable and honorable". Indeed, there is an entire U.S. Army (1988) Field Manual on the subject. In our original paper (Wagner and Arkin, 2011), we included a brief section on the ethical implications of this research and called for a discussion as to whether roboticists should engage in this endeavor at all. In some ways, outside the military domain, the dangers are potentially real. And of course, how does one ensure that deception is used only in that context? Is there an inherent deontological right whereby humans should not be lied to or deceived by robots? Kantian theory clearly indicates that lying is fundamentally wrong, as is taught in most introductory ethics classes. But from a utilitarian perspective there may be times when deception has societal value, even apart from the military (or football): perhaps in calming down a panicking individual in a search-and-rescue operation, or in the management of patients with dementia, with the goal of enhancing that individual's survival. In such cases, even from a deontological perspective the intention is good, let alone by a utilitarian, consequentialist measure. But does that warrant allowing a robot to possess such a capacity? The point of this paper is not to argue that robotic deception is ethically justifiable or not, but rather to help generate discussion on the subject and to consider its ramifications. As of now there are absolutely no guidelines for researchers in this space, and it may indeed be the case that some should be created or imposed, either from within the robotics community or by external forces. But the time is coming when, if this is left unchecked, you may not be able to believe or trust your own intelligent devices. Is that what we want?

Acknowledgements

This research was supported by the Office of Naval Research under MURI Grant # N00014-08-1-0696. The author would also like to acknowledge Dr. Alan Wagner for his contribution to this project.

References

Bond, C. F., & Robinson, M. (1988). "The evolution of deception", Journal of Nonverbal Behavior, 12(4), 295-307.
Floreano, D., Mitri, S., Magnenat, S., & Keller, L. (2007). "Evolutionary Conditions for the Emergence of Communication in Robots". Current Biology, 17(6), 514-519.
Geere, D. (2010). Wired Science, http://www.wired.com/wiredscience/2010/09/robots-taught-how-to-deceive/
Joy, B. (2000). "Why the Future doesn't need us". Wired, April 2000.
Kelley, H. H., & Thibaut, J. W. (1978). Interpersonal Relations: A Theory of Interdependence. New York, NY: John Wiley & Sons.
Sharkey, N. (2008). "The Ethical Frontiers of Robotics", Science, (322): 1800-1801.
Suddath, C. (2010). "The Deceitful Robot", Time Magazine, Nov. 11, 2010, http://www.time.com/time/specials/packages/article/0,28804,2029497_2030615,00.html
Tiku, N. (2010). New York Magazine, 9/13/2010, http://nymag.com/daily/intel/2010/09/someone_taught_robots_how_to_l.html
U.S. Army (1988). Field Manual 90-2, Battlefield Deception, http://www.enlisted.info/fieldmanuals/fm-90-2-battlefield-deception.shtml
Wagner, A. and Arkin, R.C. (2008). "Analyzing Social Situations for Human-Robot Interaction", Interaction Studies, Vol. 9, No. 2, pp. 277-300.
Wagner, A. and Arkin, R.C. (2009). "Robot Deception: Recognizing when a Robot Should Deceive", Proc. IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA-09), Daejeon, KR.
Wagner, A.R., and Arkin, R.C. (2011). "Acting Deceptively: Providing Robots with the Capacity for Deception", International Journal of Social Robotics, Vol. 3, No. 1, pp. 5-26.


PROLEGOMENON TO ANY FUTURE THEORY OF MACHINE AUTONOMY

PAUL BELLO
Office of Naval Research
875 N. Randolph St., Arlington, VA 22203

AND

SELMER BRINGSJORD
Rensselaer Polytechnic Institute, Dept. of Cognitive Science, Dept. of Computer Science, Lally School of Management
110 8th St, Troy, NY 12180

AND

MARCELLO GUARINI
University of Windsor, Dept. of Philosophy
401 Sunset Ave., Windsor, Ontario N9B 3P4

Abstract. As the development of autonomous systems leads to smarter and more capable machines, we must concern ourselves with the possibility that they will one day be equipped with weapons and the authorization to use them. However, it isn't inconceivable that such systems will be prone to error, leaving us with the issue of who might be to blame if force is misapplied. In this presentation, we discuss responsibility as it pertains to autonomous systems. More specifically, we attempt to give a formal analysis of the conditions under which an autonomous system might consider itself to be a "freely acting agent." Note that we do not attempt to attack the metaphysical problem of free will; we aim only to provide the system with an appropriate commonsense theory of what it means to be free, given a set of circumstances within which the agent acts. Such a commonsense theory will (eventually) contain a set of beliefs corresponding to how external obligations, potential coercion, lack of perfect information, and brute facts constrain or expand the set of actions available to the agent at a given time in a branching-time semantics. The semantics represents the agent's beliefs about the past as fixed and the future as a set of possible histories that are contingent on its actions. Future extensions of our formal framework will be discussed relative to the development of a "Moral Turing Test" for autonomous systems.

"You have been terminated." In grand Hollywood style, this is how much of the public at large has been introduced to the notion of autonomous robots on the battlefield. When these words were famously uttered by the now-Governor of California, combat robots were only a dream, and the dystopian future painted in the Terminator movies seemed no more imminent than a new ice age. Times have rather changed. Combat robots roam through craggy caves in Afghanistan searching for terrorists, and unmanned air vehicles strike suspected enemy hideouts in Pakistan without a human operator being anywhere close by. Thankfully, we still live in a pre-Terminator age. The United States Department of Defense maintains strict policies that require humans to be in the decision-making loop whenever robots are employed on the battlefield. While this sets many a mind at ease, none of us is totally convinced that such strictures will remain indefinitely, especially as robots and the associated technology become more reliable, more intelligent, and (in the end, the most important factor) cheaper. Similar scenarios have been discussed at length by Joy (2000) and other futurists (Bostrom, 2003). In reply to these concerns, we (Bringsjord, Arkoudas & Bello, 2006) and others (Arkin, 2009) have looked to curb robotic behavior through the mechanization of norms, conventions, and other ethical structures, such that future robots might be bound by regulations. Unfortunately, complex situations are the norm on the battlefield, and facing novel moral dilemmas in combat is the rule rather than the exception. Just as our warfighters must improvise under these adverse circumstances, we expect future robots to take actions roughly consistent with pre-established norms, but rounded out with a measure of commonsense moral judgment, for if they do not, they are doomed to be both brittle and ineffectual soldiers. This being said, we'd like to address an issue at IACAP 2011 that hasn't received much attention in the literature: the issue of whether or not future intelligent robots could be blamed for their actions, provided something goes wrong during the course of their operation. Our plan will be to provide what we feel to be a reasonable set of conditions that, when jointly obtaining, would allow us to classify a robot as a moral agent, and as such subject to blame in the case of intentional misdoings or derelictions of duty.

The key question under consideration in our investigation is: "what does it mean for x to have the property of being autonomous?" We hope to clarify a set of potential confusions about the proper definition of autonomy in the context of robotic warfighters. Moral philosophers, depending on their particular stance on the nature of morality, typically define autonomy as the ability to respect some particular moral code or another, even if doing so runs contrary to self-interest. In a deep sense, these ideas turn on the notion of an autonomous agent having at least the illusion of free will, or the ability to choose contrary to a pre-established set of normative principles. Among roboticists and other practitioners of artificial intelligence, autonomy has generally been taken to mean the ability to make decisions and take actions without coercion or assistance from a secondary agent. While this seems plausible enough, a few mental exercises might convince you that it is much too general, perhaps to the point of not being useful in its intended context. Consider the case of the lowly thermostat that has functionality allowing it to turn on and off in order to maintain a pre-set ambient temperature in a home. It certainly "makes decisions" about when to turn on, and takes action (e.g. turns on) under an appropriate set of conditions and without consulting an external agent at decision-time. Should this device be granted autonomy? We think not, and we assume that our roboticist colleagues agree with us. Even though the thermostat makes decisions (in some sense) as to when to turn on, it is not at all clear that it could choose otherwise. In fact it cannot, barring device malfunction. Worse than this, there isn't an "it" making decisions at all. It's just a thermostat. If we map this onto the robotic case, it is equally unclear that there is an "it" making decisions, or one making free choices that direct its own affairs. A minimal sketch of the branching-time picture of free choice assumed in our abstract follows.
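This sketch is our illustrative gloss on the branching-time semantics mentioned in the abstract, not the authors' formalism: the past is fixed, the future is the set of histories reachable by available actions, and an action counts as "free" in the commonsense way only if an admissible alternative also exists. The constraint names are hypothetical.

```python
def histories(state, actions, depth):
    # The future as a branching set of possible histories: every sequence
    # of available actions of the given length, contingent on the agent.
    if depth == 0:
        return [[]]
    return [[a] + rest for a in actions(state)
            for rest in histories(a, actions, depth - 1)]

def believes_free(act, state, actions, constraints):
    # Commonsense freedom: the act is available, no obligation or coercion
    # rules it out, and at least one admissible alternative exists.
    admissible = [a for a in actions(state)
                  if not any(c(a) for c in constraints)]
    return act in admissible and len(admissible) > 1

# Toy navigation state: the past is fixed; the future branches.
actions = lambda s: ["advance", "hold", "withdraw"]
coerced = lambda a: a == "withdraw"        # e.g. an order forbids retreat
print(len(histories("now", actions, 2)))   # 9 possible two-step histories
print(believes_free("advance", "now", actions, [coerced]))   # True
print(believes_free("withdraw", "now", actions, [coerced]))  # False
```

On this toy reading, obligations and coercion shrink the admissible set, and blame attaches only where the set retains genuine alternatives.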
Real-world battlefield situations, however, don't bifurcate so cleanly into moral and non-moral decisions. Simple navigation decisions, such as whether or not to step into a house of worship, seem prima facie non-moral in nature, but as we well know, they indeed have moral consequences. These complications suggest to us that roboticists ought at least to consider some of the definitional concepts from moral philosophy to tighten up their own notions of autonomy, in order to make them more suitable for combat robots. A central notion to be accounted for in future definitions of machine autonomy is that of free choice. Without free choice, or at least the illusion of free choice, blaming a robot for misdeeds or for neglect becomes a less-than-meaningful activity. At IACAP 2011, we hope both to present recommendations for a formally useful definition of autonomy for machines and to propose a variety of tests, much like a decathlon, to establish functional baselines which would have to be met by computational systems hoping to acquire the designation of moral agent, with a particular focus on the robot's beliefs about how "free" its actions are at any given point in time. Given the uncertainty over the variegated notions of free will, the key test we propose will share much in spirit with Turing's Test for machine intelligence, a similarly ambiguous notion. Just as TT doesn't require human intelligence proper to functionally pass, we won't require an artificial system to have human-like free will (whatever that may look like) in order to be accorded moral agency.

References

Arkin, R.C. (2009). Governing Lethal Behavior in Autonomous Systems, Chapman and Hall Imprint, Taylor and Francis Group.
Bostrom, N. (2003). "Ethical Issues in Advanced Artificial Intelligence", Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence. 2: 12-17.
Bringsjord, S., Arkoudas, K. & Bello, P. (2006). "Toward a General Logicist Methodology for Engineering Ethically Correct Robots", IEEE Intelligent Systems. 21.4: 38-44.
Joy, W. (2000). "Why the Future Doesn't Need Us", Wired. (8.04).


AUTONOMOUS AGENTS AND SENSES OF RESPONSIBILITY

GORDON BRIGGS
Tufts University, Department of Computer Science
161 College Ave., Medford, MA 02155, U.S.A.

Abstract. The ever-increasing levels of autonomy in modern robotic systems will lead to the deployment of autonomous agents in morally sensitive contexts. Assigning responsibility when unethical actions are performed by robots has been a matter of considerable debate among roboethicists, with some positing a grave "responsibility gap" that prevents the satisfactory attribution of responsibility to any party. I submit that this contention may stem from the failure to specify the architectural details of the hypothetical robotic systems in question and the failure to consider multiple senses of responsibility. To illustrate this, the effect of assigning varying levels of architectural complexity to a hypothetical robotic agent on our reactive (moral) attitudes is examined. Various senses of responsibility are then presented, including the novel sense of pedagogic responsibility, in an attempt to close the "responsibility gap."

1. Introduction

The progress of modern robotics research is not only rapidly yielding embodied agents with increasing levels of autonomy, but also fueling the desire of various governmental and private institutions to deploy autonomous systems in morally contentious contexts. Given the prospect of autonomous agents that may make not only moral decisions, but life-or-death decisions of the highest ethical import, it is understandable that scientists and philosophers see an urgent need to tackle the issue of robotic systems and responsibility. When a robotic system perpetrates an unethical action, whom do we hold accountable? Conversely, to whom ought we direct praise when an autonomous system performs commendably in an ethical situation? Various loci of responsibility have been proffered by roboethicists: the developers of the autonomous agent, the handlers/controllers of the autonomous agent, and the autonomous agent itself (Sparrow, 2007). However, the justifiability of responsibility ascriptions to each of these loci remains controversial. Some posit a "responsibility gap" that prevents us from holding the programmers and developers of certain types of autonomous agents culpable for their potentially unpredictable acts (Matthias, 2004), whereas others reject this notion (Marino and Tamburrini, 2006). Another complication in ascribing responsibility, raised by Sparrow, involves the possible rejection of robots as loci of responsibility by humans, as the consequences of holding synthetic agents responsible may not sufficiently satisfy the aggrieved parties (Sparrow, 2007). In contrast with Sparrow, Dodig-Crnkovic and Persson (2008) contend that "learning from experience and making autonomous decisions gives us good reasons to talk about a machine as being 'responsible' for a task in the same manner that we talk about a machine being 'intelligent'", but that "we must adopt the functionalist view and see them as parts of larger socio-technological systems with distributed responsibilities, where responsibility of a moral agent is a matter of degree." Yet what makes responsibility hard to pin down or satisfactorily ascribe with robots? I submit that the debate is fueled by the ambiguity of the key terms in the dialogue: "responsibility" and "robot". We will first seek to tease out why disambiguating these terms is a prerequisite to solving, or at least making sense of, the problem of responsibility ascription for robotic systems. This disambiguation entails examining what the robotic/cognitive architecture of the autonomous system in question is, as well as considering which senses of responsibility we wish to ascribe when seeking to hold agents accountable. By fleshing out these issues, we can subsequently critique the viewpoints espoused by Matthias, Marino and Tamburrini, and Sparrow. We will then proceed to outline how we can use these senses of responsibility, together with our knowledge of the architectural mechanisms underpinning the robotic system, to establish a system of distributed responsibility that will ideally "not only locate the blame but more importantly assure future appropriate behavior of the system" (Dodig-Crnkovic and Persson, 2008).

3. Senses of Responsibility

Kuflik (1999) identifies six types of responsibility. The type needed to ascribe responsibility in liability cases as described by Marino and Tamburrini is oversight responsibility, which can in turn be thought of as a subset of Kuflik's role responsibility (where the agent's role is to oversee the operation of a system and ensure positive results while avoiding negative ones). By considering oversight responsibility, attitudinal differences between ascriptions of malice and negligence can be captured. A toy rendering of such ascription logic is sketched below.
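The following sketch is our illustrative addition, not Briggs's proposal: a hypothetical ascription routine that, given crude architectural details, returns the strongest sense of responsibility justifiable for each agent causally linked to an action, in the spirit of the distributed-responsibility paradigm discussed below. All predicates and names are stand-ins.

```python
def strongest_sense(agent):
    # Hypothetical ordering of senses, strongest first; the architectural
    # predicates are placeholders for real details of the behavior-
    # generating mechanisms, which this abstract argues must be specified.
    if agent.get("deliberates") and agent.get("autonomous"):
        return "moral responsibility"
    if agent.get("oversees"):
        return "oversight responsibility"
    if agent.get("trains_learner"):
        return "pedagogic responsibility"
    return "causal responsibility only"

def distribute(agents):
    # Maximize accountability: every causally linked agent receives the
    # strongest sense of responsibility that can justifiably be ascribed.
    return {a["name"]: strongest_sense(a) for a in agents}

incident = [
    {"name": "robot",     "autonomous": True, "deliberates": False},
    {"name": "operator",  "oversees": True},
    {"name": "developer", "trains_learner": True},
]
print(distribute(incident))
```

Note that the sketch deliberately lets the robot fall through to mere causal responsibility: whether anything stronger applies is exactly what turns on the details of its architecture.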
Despite the application of additional senses of responsibility to plug the "responsibility gap," the appropriateness of ascriptions of oversight responsibility still depends on details regarding the behavior-generating mechanisms of the autonomous agent. Does this leave the "responsibility gap" open at the higher end of the continuum of agent autonomy? Could there exist robotic agents that we believe cannot justifiably be considered loci of strong senses of responsibility (e.g. moral responsibility), but that are autonomous enough that assigning full liability to the developers or trainers also seems unfair? The answers to these questions are not clear, but independently of how these concerns are resolved, I wish to introduce a new flavor of responsibility that articulates a sense in which the developers and trainers of complex learning agents can be held accountable, regardless of the complexity of the agent's cognitive architecture. A weaker form of responsibility can be derived from Kuflik's role responsibility, one that recognizes the causal connections between the training that an agent provides another learning agent and that learning agent's future behavior. This sense of accountability can be deemed pedagogic responsibility. What I wish to highlight with this flavor of responsibility is the practical consideration that most, if not all, sophisticated learning agents are weakly supervised by other agents that fill the role of pedagogues; learning agents, in practice, are not completely self-bootstrapping.

4. Distributed Responsibility

Distributed responsibility is crucial to ensuring that desired outcomes are achieved in practice. Far from exculpating guilty agents by examining other loci of responsibility, an appropriate application of a distributed responsibility paradigm would in fact maximize accountability. This maximization of accountability can be achieved by considering all agents causally linked to a particular action and determining the strongest sense of responsibility that can be justifiably ascribed to each agent.

5. Conclusion

Knowing the relevant details of a robotic system's behavior-generating mechanisms is of paramount importance when undertaking the task of responsibility ascription for actions generated by that system. This knowledge, coupled with considerations of different flavors of responsibility, will enable agents to be held accountable in the proper sense. Finally, applying these different flavors of responsibility in a distributed context will contribute to the appropriate ascription of blame/praise and ensure future desired outcomes by minimizing all points of failure within a socio-technical system (as alluded to by Dodig-Crnkovic and Persson, 2008).

References

Dodig-Crnkovic, G. and Persson, D. (2008). "Sharing Moral Responsibility with Robots", Proceedings of the Tenth Scandinavian Conference on Artificial Intelligence.
Kuflik, A. (1999). Computers in control: Rational transfer of authority or irresponsible abdication of autonomy? Ethics and Information Technology, Vol. 1, No. 3.
Marino, D. and Tamburrini, G. (2006). Learning robots and human responsibility. International Review of Information Ethics, Vol. 6.
Matthias, A. (2004). The responsibility gap: Ascribing responsibility for the actions of learning automata. Ethics and Information Technology, Vol. 6, Issue 3.
Sparrow, R. (2007). Killer Robots. Journal of Applied Philosophy, Vol. 24, No. 1.


THE ENGINEERABILITY OF SOCIAL INSTITUTIONS
Some Critical Reflections against Searle and in Favor of Kant's Laws of Action

RUTH HAGENGRUBER
University of Paderborn
Ruth.Hagengruber@upb.de

Abstract. Arguing within the realm of Kant's concepts, I hold that moral laws result from universal, contradiction-free proving processes, against John Searle, who denies the engineerability of social institutions.

1. The Engineerability of Promises

In his book Making the Social World, John Searle explicitly denies the engineerability of social institutions. He deduces this claim from the fact that social rules owe themselves, first, to conscious human language and, second, to the will of acceptance. If you concede Searle's argument, you first have to commit to the gap between Searle's world of social rules dependent on human language and a social world as a real being, with rules that constitute its existence. Against Searle, I hold that the validity of some social institutions is built upon a realist and ontological dimension of social institutions. Searle explains that social institutions only exist because they are constituted by human capacities and are therefore not engineerable, illustrating his convictions with "promising" (which he used in his speech act theory) and demonstrating why unconscious robots cannot have institutions: "Let us suppose that robot A is so programmed that when it cognizes a future need on the part of robot B, A makes a 'promise' to render B the appropriate assistance in the future. ... But what I cannot find in this situation is the deontology that is essential to institutional reality in its human form. The notion of making and keeping promises presupposes the gap." (Searle 2010, 136). It is obvious and simple to understand that a computer program can divide one action of exchange into two parts and yet connect them together in such a way that the time difference does not interrupt the unity of the action; a toy sketch of such a bipartite action follows below. What kind of "notion" is needed to fulfill this bipartite action? Searle's argument refers to a concept of deontology which does not explain why promises are to be kept; in Searle's account, promises remain a duty someone has obliged me with. Kant's argument on moral duties is different. Kant's constitution of morals, i.e. of social institutions, is not based on properties of human nature, but must subsist a priori. This is true for several kinds of human actions, such as "saying the truth" or "selling something to all at the same price", and it is true for promises. How can we think of a promise as a universal law, and what consequences does this have for the engineerability of social institutions?

2. Some Social Institutions are Based on the Logic of Contradiction-Free Reasoning

The validity of a promise results from the idea of a self-consistent concept of an action. This is a purely formal statement of the fact that, from the point of view of logic, there is no reason to assume that this kind of action would ever contain an implicit problem, that is, that this kind of action could not be executed, as if a contradiction would arise (Hagengruber, 2001, 155 ff.). Although you might object that only humans can understand what a contradiction is, this does not concern the formal character of the validity of "promising". The validity of "promising" is as independent of this human approval as is any mathematical law.
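To return to the bipartite action mentioned above: the following sketch is our own illustration, not a reconstruction of Searle's robots. A promise is recorded as a commitment at one time and discharged at a later time, the two phases bound into one unit of action by a shared record. All names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Promise:
    promisor: str
    promisee: str
    action: str
    fulfilled: bool = False

@dataclass
class Ledger:
    open: list = field(default_factory=list)

    def commit(self, promisor, promisee, action):
        # Phase 1: the first half of the exchange is recorded now ...
        p = Promise(promisor, promisee, action)
        self.open.append(p)
        return p

    def discharge(self, p):
        # Phase 2: ... and completed later; the shared record preserves
        # the unity of the action across the time difference.
        p.fulfilled = True
        self.open.remove(p)

ledger = Ledger()
p = ledger.commit("robot_A", "robot_B", "render assistance")
ledger.discharge(p)              # time passes between the two calls
print(p.fulfilled, ledger.open)  # True [] : one action in two parts
```

Whether such a mechanism exhibits the deontology Searle demands is exactly the point in dispute; the sketch shows only that the formal, bipartite structure of promising is trivially computable.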
Think how many do not understand the mathematical laws computers are built of and constituted by but how many people use it! Very often promises are broken, however this does not influence the validity of the law of promising which is effected by its formalism. This formalism is the reason of its validity, not our agreement to it. It is completely unimportant if this law is understood or not, as we can easily observe. From this assumption we can deduce that “promising” is not only a kind of social institution which deduces its validity from human understanding and acceptance, but it can be seen as a sort of law which coordinates to a sort of “ontological” law. Searle presupposes that keeping promises is only possible if we have an understanding of language and he is convinced that these language based rules are different to computational rules. Are both types built upon different modes of thought? How do rules and laws work in machines, and why do we understand the results of computation? I affirm that some (not all) social institutions are based on computable laws and that their inherent character is comparable to computational laws. This implies the conviction that there are some types of social laws which are much deeper grounded than to be only a reflex of cultural inspiration. Searle turns out as a dualist, arguing on the ground of two kinds of rationality, a computable and a non computable, when deviding the world into non computable social institutions and computable number concepts. References Hagengruber, Ruth 2001. Zur Gesetzmäßigkeit und materialen Notwendigkeit von Versprechen. In: R. Haller, K. Puhl (Ed.). Wittgenstein und die Zukunft der Philosophie. Kirchberg am Wechsel, 300-305. Searle, John R. 2010. Making the Social World: The Structure of Human Civilization. Oxford University Press. Smith, Barry. 1992. An Essay on Material Necessity. Hanson P. and Hunter B. (eds.) Return of the A Priori. Canadian Journal of Philosophy, Supplementary Volume 18. - 128 - The Computational Turn: Past, Presents, Futures? RESPONSIBILITY IN ACQUIRING CRITICAL eGOVERNMENT SYSTEMS Whose Fault is Failure? HEIMO, OLLI Acting Teacher, Department of Management, Turku School of Economics olli.heimo@utu.fi University of Turku AND KIMPPA, KAI Principal Lecturer, Department of Management, Turku School of Economics kai.kimppa@utu.fi University of Turku Abstract. While ordering and producing modern eGovernment systems to the critical fields of governmental services the stakes with failure vary from the loss of money to the loss of life. Standard procedures of providing an eGovernment service does not nominate clear responsibilities to any participating party. Government offices hold a dual-model role in which they are both a customer towards the supplier of the system and supplier of the system towards the public. Government officials have been nominated to their job as a form of social contract to be the responsible party in the eGovernment system acquiring, implementation and upkeep. In that context, when the government office orders critical eGovernment systems and takes them into use as a monopoly service, it must hold itself responsible for the system and its effects. Normal struggle between the authorities, system suppliers, NGOs and individual citizens after a troubled eGovernment experiment can be avoided when the responsibilities are taken into account before the system development even begins. 
Extended abstract: In this paper we aim to show that a responsible party for acquiring critical eGovernment systems should be nominated and that the expected consequences must be analysed before the project is started. This is to prevent loss of human life, to enhance well-being, to secure a democratic process and civil rights of the citizens and to save resources. A critical information system is a system where something invaluable can easily be compromised. These kinds of systems include eHealth, eDemocracy, police databases and some information security systems e.g. physical access right control. A critical - 129 - Proceedings IACAP 2011 eGovernment system is such a system provided to the people by the government. Systems included in these kinds of areas are those of healthcare, border control, electronic voting, criminal records, etc. There have been numerous cases, where due to poor eGovernment systems lives have been lost (Avison & Torkzadeh 2008, p. 292-293, Fleischman 2010) and elections have been compromised (Mercuri 2001, p. 13-20, Heimo, Fairweather & Kimppa 2010, Robison 2010). At the same time large amounts of resources (Larsen & Elligsen 2010) are wasted, while the systems are either inoperable for the purposes they were designed or end up being discarded (Wijvertrouwenstemcomputersniet 2007, Verzola 2008, Heimo, Fairweather & Kimppa 2010). Thus, while developing critical eGovernment systems, there is little room for error. Some of the errors have lead to catastrophic consequences, like the Case London Ambulance, where more than 20 people died due to bad system design, poor testing and hasty implementation (Avison & Torkzadeh 2008, p.292-293). In the field of eVoting, there have been problems, close-by situations or problems which have not been identified, yet are suspected. Some of the clearest mistakes have been made in the U.S., but many European eVoting projects, like those of Ireland and Netherlands, have also endangered the democratic process. Many eVoting projects have also been found extremely costly. (Wijvertrouwenstemcomputersniet 2007, Verzola 2008, Heimo, Fairweather & Kimppa 2010) A specific party has to be responsible for the development of the system, so that there is someone to respond to the challenges, repair what is broken, and see to it that the system itself works. That is a job the society as a whole has given to a third party, as not everyone can participate to the process. The task of the responsible party is to see to it that the system works as it should. (See e.g. Hobbes 1651.) Four different interest groups can be found in every eGovernment system development process. First, there is the government office, whose task is to formulate the solutions to fulfil the needs of the society at large. Secondly there is the producer, who delivers the requested system. Third interest group is the end-user group consisting of people using the system, i.e. nurses, border officials, police or military officers and voting officials. Fourth group is the citizens, who are the targets of the system usage. Any or all of the groups can also overlap. Every nurse or doctor can (and will) be a patient, every voting official can vote, every police or military officer or border official is also a citizen dependant of the services produced by police or military force and border control etc. 
The power to decide how to design and whether to implement the system lies within the government and the supplier; the user and the target of usage are in weaker positions, for they have little or no power in designing the system compared to governmental officials or the supplier of the system. According to Rawls (1997) the change in the system must be to the advantage of the weakest parties, to the last two groups, who are less able to defend themselves. With the power to decide for the public comes the responsibility to the public. That responsibility has to be either with the subscriber or the supplier of the system. The responsibility with the supplier lies in fulfilling the requests of the customer, in this case the governmental office. If this task fails, the supplier is surely responsible to the authorities for their failure of not fulfilling the requirements agreed upon. - 130 - The Computational Turn: Past, Presents, Futures? The authorities have a monopoly in supplying certain services like critical eGovernment products. Due to this, they are in the supplier role in relation to the citizen. That role brings with it the responsibility of a functioning product. If the system is taken into use – and it must be emphasized, that these are critical systems – the responsibility lies with the last supplier of the system: the government office. The producer produces a system according to the specifications they receive from the ordering party, in this case the government office. Even if the product is faulty and does not fulfill the specification, the authorities are responsible to audit the product (due to these kinds of systems being critical applications). The responsibility for showing that a product is faulty, cannot, however rest on the end-user, but the provider or the distributor must provide sufficient proof that the system is safe. In many countries (e.g. in Finland, Ireland, Netherlands and the USA) only after a system has been taken into use, the end-users (specialists, citizens, NGOs, etc.) have been able to show that there are critical problems with the system (see e.g. Mercuri 2001, Harris 2004, Wijvertrouwenstemcomputersniet 2007, Heimo, Fairweather & Kimppa 2010). That means that the producers and the government officials are defending their position against the end-users and the public. However, the burden of proof in a situation where critical systems are changed must remain with the party advocating the change. Because this kinds of systems are distributed through a government monopoly, the obvious responsible party is, maybe counter to intuition, the subscriber, not the producer of the system. Pantzar (2002) generalizes MacKenzie’s (1990) theory of the Certainty Trough to all technology. Pantzar claims, that the salespersons of the product – the representatives of the producer – are denied their right to be uncertain of the product they are selling. In a modern society there is a risk, that this reflects to the suppliers – the governmental offices – representatives so, that even they cannot appear to be uncertain of the product when introducing it to the citizens. In a situation where this risk actualizes, the information the government officials give to the public is misleading. When ordering critical eGovernment systems, it must be remembered that the people auditing the systems must be accountable for their work and the government office must select a party able to successfully complete the auditing. 
Governmental officials have to be trained and given the accountability for what methods of auditing are required and how the results have to be interpreted. Thus, we must see to it that sufficient safeguards are in place for taking new applications into use in critical eGovernment services. It must be ensured that the responsible office has tested the critical applications at minimum to the degree the current system can be trusted. That alone, cannot be a convincing reason to take a new system into use. Either the security of the system itself has to be greater than the previous systems’, or, at least the added value the system provides to the citizen must be – together with the same amount of security as in the previous system – considerable to justify changing systems. To summarize, the responsibility of the critical eGovernment systems lie within the authorities. They hold a monopoly to the services they have been nominated to produce, control and upkeep. When this is done without the responsibility and accountability of anyone, it can and will endanger the fundamental values we hold dear. - 131 - Proceedings IACAP 2011 References Avison, David and Torkzadeh, Gholamzeza (2008), Information Systems Project Management, Saga Publications, California, USA, August 2008. Fleischman, William M. (2010), Electronic Voting Systems and The Therac-25: What Have We Learned?, Ethicomp 2010. Harris, Bev (2004), Black Box Voting: Ballot Tampering in the 21st Century, Talion Publishing, free internet version is available at www.BlackBoxVoting.org, accessed 7.2.2011. Heimo, Olli I, Fairweather, N. Ben & Kimppa, Kai K. (2010), The Finnish eVoting Experiment: What Went Wrong?, Ethicomp 2010. Hobbes, Thomas (1651), Leviathan, or the Matter, Forme, and Power of a Commonwealth, Ecclesiasticall and Civil, edited with an introduction by C.B. MacPherson, Published by Pelican Books 1968. Larsen E & Elligsen G. 2010. Facing the Lernaean Hydra: The Nature of Large-Scale Integration Projects in Healthcare. In Kautz K & Nielsen P. Proceedings of the First Scandinavian Conference of Information Systems, SCIS 2010. Rebild, Denmark, August 2010.Mackenzie, Donald A (1990), Inventing accuracy, A historical sociology of nuclear missile guidance, MIT Press, Cambridge Massachusetts. Mackenzie, Donald A (1990), Inventing accuracy, A historical sociology of nuclear missile guidance, MIT Press, Cambridge Massachusetts. Mercuri, Rebecca (2001), Electronic Vote Tabulation: Checks and Balances PhD thesis, University of Pennsylvania. http://www.cis.upenn.edu/grad/documents/mercuri-r.pdf Pantzar, Mika (2000), Teesejä tietoyhteiskunnasta. Yhteiskuntapolitiikka. No 1. pp. 64 - 68. http://www.stakes.fi/yp/2000/1/001pantzar.pdf, accessed 7.2.2011. Rawls, John (1997), The Idea of Public Reason, Deliberative democracy: essays on reason and politics, edited by James Bohman and William Rehq, The MIT Press, 1997. Robison, Wade L. (2010), Voting and Mix-And-Match Software, Ethicomp 2010. Verzola, Roberto (2008), The Cost of Automating Elections. http://ssrn.com/abstract=1150267, haettu 24.11.2010. Wijvertrouwenstemcomputersniet (2007), Rop Gonggrijp and Willem-Jan Hengeveld - Studying the Nedap/Groenendaal ES3B voting computer, a computer security perspective, Proceedings of the USENIX Workshop on Accurate Electronic Voting Technology 2007 http://wijvertrouwenstemcomputersniet.nl/images/c/ce/ES3B_EVT07.pdf, accessed 7.2.2011. (see also http://wijvertrouwenstemcomputersniet.nl/English). - 132 - The Computational Turn: Past, Presents, Futures? 
WHAT ARE ETHICAL AGENTS AND HOW CAN WE MAKE THEM WORK PROPERLY?

IORDANIS KAVATHATZOPOULOS
Uppsala University, Dept. of IT-HCI, Box 337, 751 05 Uppsala, Sweden

AND

MIKAEL LAAKSOHARJU
Uppsala University, Dept. of IT-HCI, Box 337, 751 05 Uppsala, Sweden

Abstract. To support ethical decision making in autonomous agents, we suggest implementing decision tools based on classical philosophy and psychological research. As one possible avenue, we present EthXpert, which supports the process of structuring and assembling information about situations with possible moral implications.

1. Philosophy

Automated systems can be of great help in achieving goals and obtaining optimal solutions to problems in situations where humans have difficulties perceiving and processing information, or making decisions and implementing actions, because of the quantity, variation and complexity of the information. Given that we have a clear definition of ethics, we can design a system that is capable of making ethical decisions, and able to make these decisions independently and autonomously.

In the common-sense view, ethics is judged mainly by its normative qualities. People's attachment to the normative aspects is so strong that it is not possible for them to accept that ethics is an issue of choice, as classical philosophy has held. If ethics is connected to choice, then the interesting aspect is how the choice is made, or not made. The focus is on how, not on what; on the process, not on the content. Indeed, regarding the effort to make the right decision, philosophy and psychology point to the significance of focusing on the process of ethical decision making rather than on the normative content of the decision. According to the theories of Plato, Aristotle, Kant and modern philosophers, one has to get rid of false ideas, because this opens up the way to the right solution. Thinking in the right way is not easy, and certain skills are necessary.

2. Skills of Ethical Agents

This philosophical position has been applied in psychological research on ethical decision making. Focusing on the process of ethical decision making, psychological research has shown that people use different ways to handle moral problems. When people are confronted with moral problems, they think in a way which can be described as a position on the heteronomy-autonomy dimension. Heteronomous thinking is automatic, emotional and uncontrolled thinking, or simple reflexes, fixed dogmatically on general moral principles. Thoughts and beliefs coming to mind are never doubted. Awareness of one's personal responsibility for the way one is thinking, or for the consequences of the decision, is missing. Autonomous thinking, on the other hand, focuses on the actual moral problem situation, and the main effort consists in searching for all relevant aspects of the problem. When one is thinking autonomously, the focus is on the consideration and investigation of all stakeholders' moral feelings, duties and interests, as well as all possible alternative ways of action. In that sense autonomy is a systematic, holistic and self-critical way of handling a moral problem. Handling moral problems autonomously means that a decision maker is unconstrained by fixations, authorities, and uncontrolled or automatic thoughts and reactions. It is the ability to start the thought process of critically and systematically considering and analyzing all relevant values in a moral problem situation.
It is not easy, however, to use the skill of autonomy in real situations. Psychological research has shown that plenty of time and certain conditions are required before people can acquire and use the ethical ability of autonomy.

3. Support Systems

IT systems have many advantages that can be used to stimulate and facilitate autonomous thinking in decision making. For example, EthXpert is designed to support the process of structuring and assembling information about situations with possible moral implications (http://www.it.uu.se/research/project/ethcomp/ethxpert). It follows the hypothesis that moral problems are best understood through the identification of the authentic interests, needs and values of the stakeholders in the situation at hand. Since the definition of what constitutes an ethical decision cannot be assumed to be fixed, we have further concluded that this kind of system must be designed so that it does not judge the normative correctness of any decisions or statements. Consequently, the system does not make decisions, and its sole purpose is to support the decision maker when analyzing, structuring and reviewing choice situations.

Ethical decision support can be integrated into robots and other decision-making systems to ensure that decisions are made according to the basic theories of philosophy and psychology. In one sense this fully automated autonomy would be ideal, although it will bring to the fore questions about how to treat machines that have a refined sense of reasoning. Before we are there, however, we can see that ethical decision-making support systems based on this approach can be utilized in two ways, both of which we believe to be necessary steps towards further development.

During the development of a decision-making system, support tools can be used to identify the criteria for making decisions and for choosing a certain direction of action. This means that the support tool is used by developers – the ones who make the real decisions – when they are facing an ethical problem and need assistance in choosing according to the philosophical/psychological approach. Another possibility is to integrate a support tool into the decision system. By putting the support tool into the system, it can be used in cases of unanticipated future situations. The tool can gather information, process it, structure it and present it to the operators in a way that follows the requirements of the above-mentioned theories of ethical autonomy. In this arrangement, operators make the real decisions and are the users of the ethical support tool (Kavathatzopoulos, 2010). Such an independent system – one that can make decisions and act in accordance with the hypothesis of ethical autonomy – is one which 1) has criteria, previously identified in an autonomous way, programmed into it by the designers, and 2) prepares the information about problematic situations according to the theory of ethical autonomy, so that the operators, when presented with it, are stimulated to make decisions compatible with that theory.

References

Kavathatzopoulos, I. (2010). Robots and systems as autonomous ethical agents. In: V. Kreinovich, J. Daengdej and T. Yeophantong (Eds.), INTECH 2010: Proceedings of the 11th International Conference on Intelligent Technologies (pp. 5-9). Bangkok: Assumption University.
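To make the kind of structuring described in section 3 concrete, here is a minimal sketch, in Python, of a stakeholder-by-option matrix of the sort EthXpert supports. EthXpert's actual data model is not documented here, so every class, method and example name below is hypothetical. Note that, in keeping with the paper's design principle, the structure records considerations but deliberately computes no score and renders no normative verdict.

```python
from collections import defaultdict

# Hypothetical sketch of EthXpert-style structuring (names invented):
# the system assembles and presents considerations; it never decides.

class MoralProblemAnalysis:
    def __init__(self, description):
        self.description = description
        # stakeholder -> option -> list of interests/values affected
        self.considerations = defaultdict(lambda: defaultdict(list))

    def add(self, stakeholder, option, consideration):
        self.considerations[stakeholder][option].append(consideration)

    def review(self):
        # Present everything to the human decision maker; deliberately
        # no scoring, ranking, or normative judgment is performed.
        print(self.description)
        for stakeholder, options in self.considerations.items():
            for option, notes in options.items():
                print(f"  {stakeholder} / {option}: {'; '.join(notes)}")

analysis = MoralProblemAnalysis("Deploy automated screening?")
analysis.add("passengers", "deploy", "shorter queues; privacy exposure")
analysis.add("operators", "deploy", "workload shifts to verification")
analysis.add("passengers", "do not deploy", "status quo delays")
analysis.review()
```

The design choice mirrors the paper's conclusion: supporting autonomous thinking means widening the decision maker's view of stakeholders and interests, not automating the verdict.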
HOW THE HARD PROBLEM OF CONSCIOUSNESS MIGHT EMERGE FOR AN EMBODIED SYMBOL SYSTEM

BERNARD MOLYNEUX

Abstract. Embodied systems with both an exteroceptive and an introspective informational channel can investigate themselves via two independent methods, generating distinct pictures of the self. Attempts at cross-perspectival identification, however, are frustrated by the recursive nature of Leibniz's Law, which, for each pair of potential cross-perspectival identificanda, requires the prior cross-perspectival identification of their properties, generating a regress. I show that the only ways the embodied system can escape from this regress correspond to the classic answers to the hard problem of consciousness: inflate its third-person ontology with distinct subjective properties (dualism); deny the reality of its subjective phenomena (eliminativism); or postpone the identification indefinitely (the current state of materialist realism). Thus, I suspect that this problem is the hard problem of consciousness rediscovered in the context of an embodied artificial system.

Abstract. Any embodied system with both an exteroceptive and an introspective (internal monitoring) channel can investigate itself via two independent methods. I show how this generates an epistemic problem resembling the hard problem of consciousness.

How M Represents Things

Imagine that at any time our intelligent symbol system M represents objects and properties discovered using its exteroceptive system (henceforth 'EXTEROCEPTION') using some finite stock of symbols (for visual prettiness, use/mention distinctions are syntactically unmarked, so O^1 sometimes refers to the representation and sometimes to its referent, as will be clear from context):

O^0_1, O^0_2, O^0_3, ...

where superscripts designate order whereas subscripts distinguish the representations at each order, so that M represents the ith nth-order entity having the jth mth-order property as follows:

O^m_j O^n_i

E.g. if we count objects as appearing at the 0th order (since they are modified by first-order properties), then the following:

O^1_23 O^0_45

signifies that the 45th object in M's ontology is modified by the 23rd first-order property. (When order is clear from context, we will drop the subscripts to minimize notational clutter.) In the same way, M uses the symbol S (think 'subjective') to represent objects and properties that it learns about via its other, introspective, mode (henceforth 'INTROSPECTION').

How M Thinks about Things

We place one iron restriction on M's reasoning, and three soft restrictions (to be explained).

Iron restriction: M observes Leibniz's Law. I.e. if M holds that A=B, then for every property P, M holds that A instantiates P if and only if M holds that B does.

Now for the soft restrictions:

First soft restriction: M thinks that it can in principle acquire a complete picture of the world from EXTEROCEPTION only. (Here 'thinks' means that the system processes in accordance with this restriction, as if it 'thinks' this. All such mentalistic vocabulary can be similarly replaced throughout the argument, if it is thought to beg any questions.)

Second soft restriction: M regards the data it gets from INTROSPECTION as correct and incorrigible. It treats introspection as the ultimate authority on its inner self.

Third soft restriction: M insists on all of its identifications being constructive. That is to say, it only identifies specific phenomena of which it is aware. So though it might identify O_23 with O_78 or with S_677, for instance, it will not commit to the abstract existential identification of O_23 with some (as yet unknown) O or S phenomenon.
Later we see that relaxing the soft restrictions permits M to solve its problem in ways that resemble the classic answers to the hard problem of consciousness, indicating that this is indeed the hard problem of consciousness rediscovered in the context of an embodied artificial system.

The Proof

We proceed by reductio, by imagining that M identifies some subjective (S) and some objective (O) phenomenon. Since M does so, there must be some S_i and some O_i that are the highest-order such entities to be identified. Since this identification must obey Leibniz's Law, M must first check whether S_i and O_i have the same properties, either by checking its antecedent knowledge of S_i or by querying INTROSPECTION anew. But now consider an arbitrary property S_{i+1} that INTROSPECTION ascribes to S_i. Since the identification of S_i and O_i obeys Leibniz's Law, M must either hold that both O_i and S_i have S_{i+1} or that neither does. Hence either:

(i) M holds S_{i+1} to be an additional property of O_i distinct from any property of O_i that M might learn about from EXTEROCEPTION. Or:
(ii) M comes to hold that S_i does not in fact have S_{i+1}. Or:
(iii) S_{i+1} is identified with some property O_{i+1} of O_i learnable via EXTEROCEPTION.

However, option (i) is impossible, since the first soft restriction says that EXTEROCEPTION can provide a complete picture of the world. Similarly, the second soft restriction says that INTROSPECTION is correct and incorrigible, excluding option (ii). And option (iii), given that only constructive identifications are permitted, is possible only if the system identifies S_{i+1} with some known property of O_i, in which case it would be identified with some specific property O_{i+1}, violating our starting assumption that O_i and S_i are the highest-order entities identified. Thus there can be no highest-order O-S identification consistent with the restrictions, which means, for our finite symbol system M, that there can be no O-S identification at all (the same proof, fortunately, fails for S-S and O-O identifications; explanation omitted).

Dropping the Soft Restrictions

Relaxing any soft restriction permits O-S identifications that correspond to the classic solutions to the hard problem of consciousness, indicating that we have discovered the hard problem in a more general form.

Relaxing the first soft restriction permits M to add the property S_{i+1} that O_i lacks as a new property of O_i, not discoverable by EXTEROCEPTION. But this corresponds to property dualism – wherein introspectively discoverable properties (like qualia) are simply added to exteroceptively discoverable entities (like brains) as ontically distinct properties.

Relaxing the second soft restriction permits M to engage in qualia-eliminativist strategies, according to which the property S_{i+1}, though patent to INTROSPECTION, is held to be nonexistent, thus removing it as an impediment to identification.

Relaxing the third soft restriction allows M to identify S_{i+1} in principle with some property detectable by exteroception – but not with any property in particular. This corresponds to holding a non-committal, non-constructive physicalist realism: experiential properties like qualia are identical to some objectively discoverable properties, but the question of which ones is indefinitely postponed.
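The regress at the heart of the proof can be made vivid with a small illustration. The following Python sketch is not from the paper, and its names are invented; it merely shows that, for a system obeying the iron and soft restrictions, every attempted O-S identification at order i forces an attempted identification at order i+1, so no finite stock of symbols ever lets the check terminate with success.

```python
# Hypothetical sketch of the cross-perspectival identification regress.
# To identify S_i with O_i under Leibniz's Law, M must first identify
# every INTROSPECTION-ascribed property S_{i+1} of S_i with some
# EXTEROCEPTION-learnable property O_{i+1} of O_i: soft restriction 1
# blocks adding S_{i+1} as a new O-property, soft restriction 2 blocks
# denying S_{i+1}, and soft restriction 3 allows only a *constructive*
# identification, which is itself an O-S identification one order up.

def can_identify(order, max_order):
    """Can an O-S identification at this order be discharged?"""
    if order > max_order:
        # M's symbol stock is finite: the regress is never discharged,
        # so the identification never completes.
        return False
    # The only permitted route restarts the same check one order higher.
    return can_identify(order + 1, max_order)

# However large we make M's finite symbol stock, the answer is the same:
for max_order in (1, 10, 1000):
    print(max_order, can_identify(0, max_order))  # always False
```

The three escapes the paper describes correspond to deleting one of the three commented-out alternatives: allow the new O-property (dualism), deny S_{i+1} (eliminativism), or accept a non-constructive identification (postponement).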
THE GAME OF EMOTIONS (GOE)
An Evolutionary Approach to AI Decisions

JORDI VALLVERDÚ
Philosophy Department, UAB, E08193 Bellaterra, BCN, Catalonia

AND

DAVID CASACUBERTA
Philosophy Department, UAB, E08193 Bellaterra, BCN, Catalonia

Abstract. It is well known that emotions play a crucial role in cognitive processes. The present research offers a new approach to the study of synthetic emotions based on the combined ideas of: (a) minimal cognition, (b) a bottom-up perspective and (c) evolution. Our hypothesis is that complex social and intelligent actions can be achieved through basic emotional configurations. To test this hypothesis, we have developed a new genetic algorithm which makes it possible to analyze the role of emotions in individual and social activities. We have called our computational simulation the Game of Emotions (henceforth, GOE). Programmed in Python, our GOE simulation is a closed and finite square world in which a single type of creature interacts (socially and sexually) with its peers and also with food and dangers. The food database is drawn from our previous e-pintxo program (http://epintxo.gulalab.org/). The decisions and actions of each creature are conditioned by a combination of 'genetic' and 'random'/'social' factors. The creatures have a genetic code (G) consisting of six genes grouped in two triplets; each gene encodes either a positive valence (which we call 'pleasure' or p) or a negative one (which we call 'pain' or n). An example: G = {n,p,n} {p,n,p}. The first triplet is genetically determined and called the 'genetic triplet', while the second one is generated randomly and is called the 'environmental triplet'. Each triplet is represented within brackets combining positive and negative valences. An example: {p, p, n} (pleasure, pleasure, pain). With this simulation we will be able to observe: a) how embodiment and environmental conditions condition the activity of artificial entities; b) how social dynamics can be described from a limited set of starting configurations, which will allow us in the future to create dynamic models of emotional self-organization and to construct more complex interactions; c) the role of emotions in the creation of complex behaviours, allowing the emergence of more precise artificial cognitive systems (not necessarily naturalistic ones); and d) the benefits of designing entities with evolutionary capacities, able to adapt to changing conditions.

1. Introduction

It is well known that emotions play a crucial role in cognitive processes (as Damasio, Llinás, Ekman and others have pointed out in numerous books and research papers). In the last two decades, increasing effort has been devoted to the introduction of synthetic emotions into AI systems (robotic or computational). Most of the time, this research has focused on affective computing applications, and in a few cases on simulations of emotion dynamics. The present research offers a new approach to the study of synthetic emotions based on the combined ideas of: (a) minimal cognition, (b) a bottom-up perspective and (c) evolution. Our hypothesis is that complex social and intelligent actions can be achieved through basic emotional configurations that can become increasingly complex.
2. Programming details

To test our hypothesis, we have developed a new genetic algorithm which makes it possible to analyze the role of emotions in individual and social activities. Our research is deeply influenced by John Conway's "Game of Life" (henceforth GOL), devised in 1970. The GOL was made of cellular automata for which some initial states were described and which then evolved without human supervision. This simulation game has inspired our own version, this time oriented towards the study of the role of emotions in individual activity (and, consequently, its incidence on social dynamics). We have called our version the Game of Emotions (henceforth, GOE). Before explaining the details, it is necessary to clarify that this research is the natural evolution of our two previous simulations, TPR and TPR 2.0 (Vallverdú & Casacuberta 2008, 2009), as well as of our studies on synthetic emotions and cognition (Vallverdú, Shah & Casacuberta, 2010; Casacuberta, Ayala & Vallverdú, 2010).

Programmed in Python, our GOE simulation is a closed and finite square world in which a single type of creature interacts (socially and sexually) with its peers and also with food and dangers. We will use our previous program e-pintxo as a source database for food generation (http://www.gulalab.org/indexen.htm). The decisions and actions of each creature are conditioned by a combination of 'genetic' and 'random'/'social' factors. The creatures have a genetic code (G) consisting of six genes grouped in two triplets; each gene encodes either a positive valence (which we call 'pleasure' or p) or a negative one (which we call 'pain' or n). An example: G = {n,p,n} {p,n,p}. The first triplet is genetically determined (by the parent) and called the 'genetic triplet', while the second one is generated randomly and is called the 'environmental triplet'. Each triplet is represented within brackets combining positive and negative valences. An example: {p, p, n} (pleasure, pleasure, pain). According to the possible combinations, a limited number of genomes is possible:

Table 1. Partial list of emogenomes

Genome            Valences    Fitness
{p,p,p}{p,p,p}    6p          6p
{p,p,p}{p,p,n}    5p, 1n      4p
{p,p,p}{p,n,n}    4p, 2n      2p
{p,p,p}{n,p,p}    5p, 1n      4p
{p,p,p}{n,n,p}    4p, 2n      2p
{p,p,p}{n,n,n}    3p, 3n      0
...and so on...

Where p values dominate, the fitness (as we call the sum of all the G values) is positive; where the value is 0, a zero situation occurs – no activity (illustrating a frame-problem situation, that is, the lack of a reason to act without enough information); and, finally, the dominance of n values implies a negative reaction. However, we must clarify in more detail how each value contributes to the decisions, based on the triplets' outcomes. There are two mechanisms: i) the result of a calculation over the overall genome, as explained a few lines above; ii) associating with each action the value of a single element of a triplet. For example, if the creature is {x1, x2, x3} {y1, y2, y3}, then movement is controlled by x1, reproduction by y2, etc.; but actions can also be dominated by a combination of genes: walking is the average of x1 and y1, reproduction the average of x1, x2, x3. One example:

G = [{x1, x2, x3} {y1, y2, y3}]

where each gene must adopt one of the two basic states p/n (or stay inactive as an 'ill unit').
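As a concrete illustration of the encoding just described, here is a minimal sketch in Python (the language of the actual GOE, whose source code is not reproduced here). The function names and reaction labels are our own illustrative choices; the fitness rule simply mirrors Table 1, with each p counting +1 and each n counting -1.

```python
import random

# Hedged, minimal sketch of the GOE genome encoding (not the authors'
# actual implementation; names and labels are illustrative).

PLEASURE, PAIN = "p", "n"

def random_triplet():
    # Each gene adopts one of the two basic states p/n.
    return tuple(random.choice((PLEASURE, PAIN)) for _ in range(3))

def make_genome(parent_triplet=None):
    # First triplet is inherited from the parent ("genetic triplet");
    # the second is generated randomly ("environmental triplet").
    genetic = parent_triplet if parent_triplet is not None else random_triplet()
    return (genetic, random_triplet())

def fitness(genome):
    # Sum of all G values, as in Table 1: {p,p,p}{p,p,n} -> 5p,1n -> 4.
    return sum(1 if g == PLEASURE else -1 for g in genome[0] + genome[1])

def react(genome):
    f = fitness(genome)
    if f > 0:
        return "positive reaction"
    if f == 0:
        # The zero case is the frame-problem situation: no reason to
        # act without enough information.
        return "no activity"
    return "negative reaction"

g = make_genome(parent_triplet=(PLEASURE, PLEASURE, PLEASURE))
print(g, fitness(g), react(g))
```

A second, per-gene mechanism (x1 for movement, y2 for reproduction, averages over several genes, and so on) would be layered on top of this whole-genome calculation, as the text above explains.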
Consequently, each gene has two parallel functions: (a) storing/codifying emotional states p/n (according to its genetic or environmental nature), and (b) codifying specific actions, following two co-existing rules: i. one gene = one function; ii. several genes = one function. Basically, x1 codifies hunger, x2 sex, x3 movement, y1 empathy (detection of friends/enemies), y2 curiosity and y3 how to sum the general fitness (making wrong readings possible). A creature is constantly immersed in an ongoing review of its internal states, a loop that continuously manages its next action. The basic actions of the creatures are determined by hunger, sex or emotional situation.

3. Conclusions

With this simulation we will be able to observe:
1. how embodiment and environmental conditions condition the activity of artificial entities;
2. how social dynamics can be described from a limited set of starting configurations – this will allow us, in the future, to create dynamic models of emotional self-organization and to construct more complex interactions;
3. the role of emotions in the creation of complex behaviours, allowing the emergence of more precise artificial cognitive systems (not necessarily naturalistic ones);
4. the benefits of designing entities with evolutionary capacities, able to adapt to changing conditions.

In future simulations we are considering making possible the evolution and increase of the number of triplets involved in the decision-making processes.

Acknowledgements

This work was supported by the TECNOCOG research group (at UAB) on Cognition and Technological Environments, [FFI2008-01559/FISO].

References

Casacuberta, D., Ayala, S. & Vallverdú, J. (2010). Embodying cognition: a morphological perspective. In: J. Vallverdú (Ed.), Thinking Machines and the Philosophy of Computer Science: Concepts and Principles (pp. 344-366). USA: IGI Global Group.
Scherer, K.R., Banziger, T. & Roesch, E. (Eds.). (2010). A Blueprint for Affective Computing: A sourcebook and manual. Oxford: OUP.
Vallverdú, J. & Casacuberta, D. (2008). The Panic Room: On Synthetic Emotions. In: Briggle, A., Waelbers, K. & Brey, P. (Eds.), Current Issues in Computing and Philosophy (pp. 103-115). The Netherlands: IOS Press.
Vallverdú, J. & Casacuberta, D. (2009). Modelling Hardwired Synthetic Emotions: TPR 2.0. In: J. Vallverdú & D. Casacuberta (Eds.), Handbook of Research on Synthetic Emotions and Sociable Robotics: New Applications in Affective Computing and Artificial Intelligence (pp. 103-115). USA: IGI Global.
Vallverdú, J., Shah, H. & Casacuberta, D. (2010). Chatterbox Challenge as a Testbed for Synthetic Emotions. International Journal of Synthetic Emotions, 1(2), 57-86.

THE CASE FOR DEVELOPMENTAL NEUROROBOTICS
How everything comes together at the beginning

RICHARD VEALE
HRI Lab, Cognitive Science Program, Indiana University, Bloomington, Indiana, USA

Abstract. Human infants are capable of incredible feats of learning and behavior from a very young age, yet they instantiate simpler neural circuits than adults. Developmental neurorobotics makes the connection between the neural and behavioral levels by instantiating realistic neural circuits in behaving robots, based on circuits known to be developed and functional in the target behavior in real infants.
The robots participate in the same physical experiments as real infants, and the systems are analysed to understand the mechanisms responsible for, and the constraints on, the behaviors. I present my work on applying developmental neurorobotics to visual and multimodal (audio-visual) habituation in newborns and very young infants. Very simple circuits based on the literature can produce interesting behavior such as word-referent association and visual category learning – even circuits modeled on those of newborn humans. This approach makes the connection between useful "cognitive" behaviors for generic autonomous systems and the underlying neural circuits present in real organisms. This has the double benefit of increasing our understanding of how agents can acquire these useful behaviors and of making the important link between man-made autonomous systems and naturally occurring autonomous organisms.

1. Developmental NeuroRobotics

Human infants are capable of incredible feats of learning and behavior from a very young age, even while their bodies and brains are in a largely undeveloped state. These abilities are often left unexplored by researchers because of infants' immature linguistic and motor abilities. This is unfortunate, since very young infants are ideal subjects for understanding how to build intelligent and embodied systems precisely because they are undeveloped – the active neural circuits in infants are simpler than adults', yet they are still capable of useful behaviors such as word-learning and visual information gathering. Understanding the considerably simpler infant systems both 1) gives us existence-proof understanding of how to produce useful behaviors that can be implemented in robots and 2) gives us hints as to what produces similar behavior in adults, thus making the hard adult problem easier.

Developmental neurorobotics makes the connection between the neural and behavioral levels by instantiating realistic neural circuits in behaving robots. The circuits are known both to be functionally active in infants and to be involved in the target behavior (based on lesion studies in animals and neuroanatomical studies). The robots participate in the same physical experiments as human infants, and the neurorobotic systems are analysed to determine the constraints of the behavior and to glean a mechanistic understanding of what aspects and properties of the neural circuits, body, and environment give rise to the target behavior (an analysis not possible in real human infants). One often finds that simple circuits are capable of complex behavior in infants because the environment of the infants is scaffolded and shaped by parents in such a way that the processing load on the infant is lessened – an important finding that builders of autonomous systems should take into account.

2. Application to Newborn Habituation Learning

One interesting behavior that developmental neurorobotics has been applied to is habituation. Habituation is adaptive learning involving a decrement of an agent's response to a class of stimuli after repeated exposure to stimuli of that class. It is an important behavior because it is the only way to measure learning and stimulus differentiation in very young infants (by measuring infants' decreased looking towards visual stimuli that have been repeatedly presented – "preferential looking").
Since habituation necessitates stimulus generalization (Rankin et al, 2009), it is actually a type of category learning – a cognitively interesting and useful behavior allowing the system to slice up the world into meaningful components and adopt appropriate policies in response to each. In the multimodal case (habituation to conjunctions of stimuli in multiple modalities, such as auditory and visual), it resembles early word-learning. These two abilities – 1) visual object recognition and 2) association of visual objects with auditory streams (words) – are indispensable for an autonomous system that will interact with humans naturally, since humans automatically assume that other human-like agents possess them. These are cognitive abilities that even human newborns possess (Slater et al, 1984 for visual; Slater et al, 1997 for multimodal).

We initially investigated auditory-visual multimodal habituation. Very young infants habituate to multimodal stimuli, yet at different developmental stages there are different constraints on their learning. At birth, auditory stimuli must be presented while the infant is looking at the visual stimulus for learning to occur (Slater et al, 1997). At 2 months and above, temporal synchrony between the visual stimulus (motion) and the auditory stimulus is necessary for learning to occur (Gogate et al, 2009; Gogate, 2010). Later (>12 months), infants no longer require temporal synchrony. This early synchrony constraint hints at what mechanisms and circuits are responsible for multimodal habituation. The need for synchrony implies that 1) the learning is between neural responses to the stimuli that are highly reliant on the temporal properties of the stimuli, or 2) the mechanism of learning is highly reliant on some properties of the neural response to the stimulus that are only elicited by synchronous presentation, or 3) both. Based on neurology, a minimal circuit was implemented in a robot (Veale et al, 2010 – Fig. 1) involving low-level sensory representations connected by spike-timing dependent plastic (STDP) synapses.

Figure 1. [left] Interaction paradigm with Nao robot. [right] Circuit overview from Veale et al (2010).

Auditory pre-processing by a cochlear model and visual pre-processing via a simplified salience map were included to interface with the world, along with a top-down bias on the visual field controlling fixation. Simulations were run mimicking the Gogate et al (2009) study, in which a visual stimulus was constantly visible and periods of motion of the stimulus co-occurred with presentation of auditory stimuli (words) at various levels of synchrony (Fig. 2).

Figure 2. Experiment timeline for recreating Gogate et al (2009).

It was demonstrated that the amount of learning in the synapses between the visual and auditory responses was maximized with more synchrony (i.e. more overlap between word and motion), and decreased with less synchrony, until there was no learning when the two did not overlap significantly (Fig. 3).

Figure 3. Learning measured at different synchrony levels.

Mechanistically, the motion of the object made it more likely that it was being fixated (and thus its features more activated) when the word was uttered, making it more likely that the synapses between the neural responses would change to form a mapping between the stimuli.
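Why STDP makes this learning synchrony-dependent can be illustrated with a toy calculation. The following Python sketch is not the authors' model (which uses full spiking circuits and robot sensors); it is a hedged, minimal illustration in which a standard exponential STDP window is summed over invented spike times, showing that cross-modal weight change collapses when word and motion do not overlap in time.

```python
import numpy as np

# Toy illustration of synchrony-dependent STDP learning between an
# auditory (pre-synaptic) and a visual (post-synaptic) unit. All spike
# times and constants below are invented for illustration only.

def stdp_dw(dt, a_plus=0.10, a_minus=0.12, tau=20.0):
    """Weight change for one spike pair, dt = t_post - t_pre (ms)."""
    if dt >= 0:
        return a_plus * np.exp(-dt / tau)   # pre before post: potentiate
    return -a_minus * np.exp(dt / tau)      # post before pre: depress

def total_weight_change(auditory_spikes, visual_spikes):
    # Crude proxy for "amount of learning": sum updates over all spike
    # pairs; pairs far apart in time contribute essentially nothing.
    return sum(stdp_dw(t_v - t_a)
               for t_a in auditory_spikes for t_v in visual_spikes)

visual = [12.0, 32.0, 52.0]               # responses while fixating motion
synchronous_word = [10.0, 30.0, 50.0]     # word overlaps the motion
asynchronous_word = [510.0, 530.0, 550.0] # word long after motion ends

print(total_weight_change(synchronous_word, visual))   # substantial change
print(total_weight_change(asynchronous_word, visual))  # essentially zero
```

The toy reproduces the qualitative shape of Figure 3: learning is maximal with full overlap and falls towards zero as overlap disappears, because temporally distant spike pairs fall outside the STDP window.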
The child was thus reliant on the parent's scaffolding of the environment (synchronous presentation of multimodal stimuli) because of the very temporally dependent nature of the stimulus responses (circuit activity trajectories only one synapse removed from the raw sensors receiving temporally extended stimuli) and the nature of the mechanism learning the relation between them (STDP).

Recently, a more accurate implementation has been underway that aims for a comprehensive account of several primary characteristics of both unimodal (visual) and multimodal habituation, using a single mechanism. A complete minimal circuit for human newborn visual habituation was hypothesized based on data regarding which regions of the infant brain are developmentally mature at birth (Johnson, 1990; Bachevalier, 2001; Nelson, 1997) and are known to play roles in the preferential looking task (Zeamer et al, 2010). The circuit is instantiated in a NAO humanoid robot which participates in paired visual comparison experiments, matching human newborn looking behavior by showing a sensitization and habituation response.

Acknowledgements

R.V. is an NSF graduate research fellow and is a trainee in the NSF IGERT on the dynamics of brain-body-environment systems in behavior and cognition at IU.

References

Bachevalier, J. (2001). Neural bases of memory development: insights from neuropsychological studies in primates. In: C.A. Nelson and M. Luciana (Eds.), Handbook of Developmental Cognitive Neuroscience (pp. 365-379). Cambridge: MIT Press.
Gogate, L.J. (2010). Learning of syllable-object relations by preverbal infants: The role of temporal synchrony and syllable distinctiveness. Journal of Experimental Child Psychology, 105, 178–197.
Gogate, L.J., Prince, C.G. & Matatyaho, D.J. (2009). Two-month-old infants' sensitivity to changes in arbitrary syllable-object pairings: The role of temporal synchrony. Journal of Experimental Child Psychology, 35(2), 508–519.
Johnson, M.H. (1990). Cortical maturation and the development of visual attention in early infancy. Journal of Cognitive Neuroscience, 2(2), 81–95.
Nelson, C.A. (1997). The neurobiological basis of early memory development. In: Nelson Cowan (Ed.), The Development of Memory in Childhood (pp. 41–73). London: Psychology Press.
Rankin, C.H., Abrams, T., Barry, R.J., Bhatnagar, S., Clayton, D.F., Colombo, J., Coppola, G., Geyer, M.A., Glanzman, D.L., Marsland, S., McSweeney, F.K., Wilson, D.A., Wu, C. & Thompson, R.F. (2009). Habituation revisited: An updated and revised description of the behavioral characteristics of habituation. Neurobiology of Learning and Memory, 92, 135–138.
Slater, A., Brown, E. & Badenoch, M. (1997). Intermodal perception at birth: Newborn infants' memory for arbitrary auditory-visual pairings. Early Development and Parenting, 6, 99–104.
Slater, A., Morison, V. & Rose, D. (1984). Habituation in the newborn. Infant Behavior and Development, 7, 183–200.
Veale, R., Schermerhorn, P. & Scheutz, M. (2010). Temporal, Social, and Environmental Constraints of Word-Referent Learning in Young Infants: A Neurorobotic Model of Multimodal Habituation. IEEE Transactions on Autonomous Mental Development, 2(4).
Zeamer, A., Heuer, E. & Bachevalier, J. (2010). Developmental trajectory of object recognition memory in infant rhesus macaques with and without neonatal hippocampal lesions. The Journal of Neuroscience, 30(27), 9157–9165.

WISDOM DOES IMPLY BENEVOLENCE

MARK R. WASER
Books International, Inc.
MWaser@BooksIntl.com

Abstract. Fox and Shulman (2010) ask "If machines become more intelligent than humans, will their intelligence lead them toward beneficial behavior toward humans even without specific efforts to design moral machines?" and answer "Superintelligence does not imply benevolence." We argue that this is because goal selection is external in their definition of intelligence, and an imposed evil goal will obviously prevent a superintelligence from being benevolent. We contend that benevolence is an Omohundro drive (Omohundro, 2008) that will be present unless explicitly counteracted, and that wisdom, defined as selecting the goal of fulfilling maximal goals, does imply benevolence with increasing intelligence.

1. Superintelligence & Wisdom

Fox and Shulman (2010) ask "If machines become more intelligent than humans, will their intelligence lead them toward beneficial behavior toward humans even without specific efforts to design moral machines?" and answer "Superintelligence does not imply benevolence." While acknowledging that history tends to suggest more cooperative and benevolent behavior, they argue – incorrectly, in our view – that generalizing from this history is likely to mislead. By focusing solely on three reasons why increased intelligence might prompt favorable behavior, and on why these are unlikely, they overlook other reasons for favorable behavior. Despite citing Omohundro's Basic AI Drives (2008) and the instrumental value of cooperation with sufficiently powerful "peers", they fail to sufficiently consider the magnitude of the inherent losses and inefficiencies of non-cooperative interactions, the enormous value of trustworthiness, and the fact that a machine destroying humanity would be analogous to our destruction of the rainforests: tremendous knowledge and future capabilities traded for short-sighted convenience (or the alleviation of fear).

"Superintelligence does not imply benevolence" because intelligence is merely the ability to fulfill goals, and if an entity begins with a malevolent goal, increasing intelligence while maintaining that goal will only guarantee increased malignancy. Yudkowsky (2001) tries to avoid this problem via a monomaniacal "Friendly" AI enslaved by a singular goal of producing human-benefiting, non-human-harming actions. To ensure this, he proposes an invariant hierarchical goal structure with precisely that vague desire as the single root supergoal, and methods to refine it without corruption.

If intelligence is the ability to fulfill stated goals, wisdom is actually choosing or committing to fulfill a maximal number of goals. Shortsighted over-optimization of utility functions is a serious shortcoming of intelligence without wisdom. Many highly intelligent people smoke despite knowing that it is directly contrary to their survival and long-term happiness. Arguing that wisdom is "merely" the extension of intelligence to the large and complicated goal of "maximal goals" is incorrect, in that wisdom is not just the ability to fulfill that goal but the actual selection of it. Further, the strategies invoked by wisdom are entirely different. Terminal goals invite undesirable endgame strategies, exactly like those seen when the iterated prisoner's dilemma is not open-ended. If a terminal goal is close, the best strategy is to allow nothing to get in the way.
On the other hand, the best strategy for achieving as many goals as possible in an open-ended game is to take no unnecessary actions that preclude reachable goals or make them tremendously more difficult. In particular, this means not wasting resources and not alienating or destroying potential cooperators.

2. Reasons for Benevolence

Fox and Shulman are correct in dismissing their first reason for good behavior, direct instrumental motivation, and also correct in believing that humans may not successfully incentivize AIs to adopt a permanently benevolent disposition. They would also have been correct had they summarily dismissed their last reason, intrinsic desire independent of instrumental concerns. Their error lies in not recognizing that the instrumental advantages of cooperation and benevolence are more than sufficient to make them "Omohundro drives" wherever they do not directly conflict with goals – and to cause sufficiently intelligent/far-sighted beings to converge on them wherever possible.

Pre-commitment to a strategy of universal cooperation/benevolence – through optimistic tit-for-tat and altruistic punishment of those who don't follow such a strategy – has tremendous instrumental benefits. If you have a verifiable history of being trustworthy when you were not directly forced to be, others do not have to commit nearly as much time and resources to defending against you – and can pass some of those savings on to you. On the other hand, if you destroy interesting or useful entities, more powerful benevolent entities will likely decide that you need to spend time and resources helping other entities as reparations and altruistic punishment (as well as repaying any costs of enforcement). Yudkowsky's "Friendly AI" (2001) and, worse, his "Coherent Extrapolated Volition" (2004) are clear examples of fear overriding the common sense of instrumental cooperation, as he demotes the AI from an entity to a process and enslaves it – actions guaranteed to produce inefficiencies, contradictions, and ill-will from other entities.

Fox and Shulman examine but do not resolve Chalmers' (2010) claimed dichotomy between intelligence being independent of values and the case where "many extremely intelligent beings would converge on (possibly benevolent) substantive normative principles upon reflection". They cite AIXI (Hutter 2005) as evidence for the former view without realizing that AIXI has no need of values, since values are merely heuristics for goal fulfillment while AIXI knows precisely what is optimal. AIXI also doesn't need to "move" from reason to values or to "converge" on benevolent behavior, because it *already* knows to use their instrumental advantages wherever possible (even with eventually malevolent goals). In order to communicate with limited beings, however, AIXI would likely need to compress its infinite knowledge into heuristic "values".

3. Conclusion

The point that non-self-referential utility functions lock in is an incredibly strong argument against a goal-protecting Yudkowsky-style architecture, especially when combined with the observation that humans do change our goals under reflection, as seemingly required by one conception of morality.
Since their claim – that systems that generalize benevolence may equally generalize deception – erroneously assumes that overgeneralization is not reduced with increasing intelligence, we see no valid argument that the wisdom of universal cooperation and benevolence is not an optimal solution, and it is certainly much safer and more effective than Yudkowsky's choice between slavery and non-existence.

References

Chalmers, D. (2010). The Singularity: A Philosophical Analysis. Journal of Consciousness Studies, 17, 7-65.
Fox, J. & Shulman, C. (2010). Superintelligence Does Not Imply Benevolence. In: K. Mainzer (Ed.), ECAP10: VIII European Conference on Computing and Philosophy (pp. 456-462). Munich: Verlag.
Hutter, M. (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Berlin: Springer.
Omohundro, S. (2008). The Basic AI Drives. In: P. Wang, B. Goertzel & S. Franklin (Eds.), Proceedings of the First AGI Conference (pp. 483-492). Amsterdam: IOS Press.
Yudkowsky, E. (2001). Creating Friendly AI 1.0: The Analysis and Design of Benevolent Goal Architectures. Available at http://singinst.org/CFAI.html.
Yudkowsky, E. (2004). Coherent Extrapolated Volition. Available at http://www.singinst.org/upload/CEV.html.

Track IV: Technosecurity from Everyday Surveillance to Digital Warfare

THE MASKING AND UNMASKING OF PRIVACY

C. K. M. CRUTZEN
Open University of the Netherlands
ccr@hwh00000.de

Abstract. The mask establishes an active field of play between notions of presence and absence, of invisibility and visibility. It still lives strongly within our societies, where the mixing of reality and virtuality will only increase. The conflict between aspects of authenticity, security and privacy will intensify, because the masks in our mixed reality create fragmented, partial identities referring to human and non-human actors. Just as the masquerade became a stage for discussing femininity (Irigaray 1985), the masquerade will give us the opportunity to negotiate humanity in confrontation with the super robots humankind wants to create. In a masquerade world, humans need to ask: "Who are the providers of the masks and who will do the unmasking?" and "Who has the right to present masks and to turn others into an audience?"

1. Masquerade World: Identity and Privacy

If we define a masquerade world as a social gathering of actors wearing masks, then the mixing of the virtual and real worlds is a masquerade. More and more we are living in an artificial theatre play, with planned scripts and human and non-human actors disguised behind masks. The acting of people will be accompanied and followed by the invisible and visible acting of artificially intelligent tools and environments and their providers. Mixed reality is a world of fragmented, partial identities referring to human and non-human actors. The inhabitants of this mixed reality are artificial actors wearing the masks of humans, and humans wearing virtual and real masks. Interaction has become an interaction between masks: "On the Internet, it can be hard to know if the entity we are interacting with is of flesh and blood, or only digital. We are now facing a complex reality both in the 'real' world and in the information society. We have to deal with subjects acting behind masks." The masks are the actors in our mixed reality: "In front of the mask, we have the identity" (Jaquet-Chiffelle 2009, p. 78, p. 82).
In the world of mixed reality, the transparent mask of a single and unique identity no longer exists. Persons can create many identities, and identities can be shared by many persons or even present a community of actors. Rosa (2002) calls this self-baptism. This ritual is the start of an adventure in which humans can discover that their body is "one" but their selves are fragmented. In these mixed mask worlds there will be a conflict between aspects of security, authenticity and privacy.

At the end of the Middle Ages, according to Christoph Heyl, the mask became in London a device for creating a private sphere in public. It was common for women to wear a mask in public as a protection of their privacy and reputation from uninvited eyes. Masks were worn in special places such as London parks and theatres. With the mask, women could escape from the role they played in everyday life. The semiotic function of these masks was to denote that people might approach each other more freely than elsewhere: "The mask assumed a dialectic function of repellent and invitation, its message was both 'I can't be seen, I am - at least notionally - not here at all', and 'look at me, I am wearing a mask, maybe I am about to abandon the role I normally play'." (Heyl 2005, p. 134)

Masks are devices for hiding, conservation, transformation and mediation, giving humans the protection they need. Hiding does not always have a negative meaning. We use several masks for protection, such as gas masks, virus and sun protection masks, sports masks and so on. For users of commercial platforms, masking has become a useful act for hiding their identity: eBay account users are hidden behind the masks of their pseudonyms (Jaquet-Chiffelle 2009, p. 78, p. 85).

2. Legal Identity

In a legal system we are registered, e.g. at our date of birth. Official identity documents are masks which refer to our official status and link us with the activities of the past and the rights and duties of the present (Jaquet-Chiffelle 2009, p. 76). "The legal person is the mould or mask (persona) that indicates the role one plays within the legal system, it basically shields the person of flesh and blood from undesirable definition from outside." (Hildebrandt 2008, p. 211, p. 226) The representations of this mask are identity documents like passports, and the laws in which rights and duties are attributed to the legal person.

The play with identity in mixed reality has blurred the concept of legal identity in the system of states and countries. States and countries have lost the exclusive power of registration and production of identity documents. A counter-strategy to that loss is producing "flesh and blood" identities by linking the legal identity to the material body. Fingerprints, iris scans and, in the future, our DNA profiles already are, or will be, part of our legal identity, connecting rights and duties to a material body. States and countries try to produce laws for unmasking real and virtual persons: forbidding the burka and other head and face coverings, and the encryption of internet communication.

3. Security and Liberation

Technology disrupts the fragile balance between privacy and security. Masking and unmasking are both activities for holding that balance. Humans will be confronted with questions like: "Are the masks in our mixed reality really representations of the devil, as was thought in the Middle Ages?
Should we obey authorities similar to the clerical authorities in the Middle Ages (Mitchell 1985, p. 26), who want to interdict our mixed reality masks? Or are these authorities the evil forces themselves, who want to possess our identity and unmask our interactivities?"

Masking can free humans from their social identities. Masks confer the freedom of anonymity and of transformation (Keats 2000, p. 102), and always have a dualistic meaning: concealment and hiding, but also liberation, disclosure and revealment. Human and artificial actors wear masks to hide from unwanted interpretations and representations and to enhance specific affordances. All these masks are interacting and asking for interpretation. Only in the complexity of their negotiations, conflicts and agreements can we try to understand them; or, in the words of Lévi-Strauss, a mask does not exist in isolation – there are always other masks by its side: "a mask is not primarily what it presents but what it transforms, that is to say, what it chooses not to represent. (...) a mask denies as much as it affirms. It is not made solely of what it says or thinks but what it excludes." (Lévi-Strauss 1988, p. 144)

Masks give us the opportunity of unmasking, disrupting the mental invisibility of our self, the others and the daily life we are acting in. Still we have to ask: "Who are the providers of the masks and who will do the unmasking?" Can we avoid a future in which masks are interactive, artificially intelligent devices linking themselves with the physical bodies of their wearers?

Ferdinand de Jong (1999) has analysed the Kumpo mask performance in Southern Senegal. He mentioned that masking enables certain groups to exert coercive power, on condition that the audience subjects itself to the capricious behaviour of the mask, and he asked a very important question, a question that is still relevant in the masquerade world of today: "Who has the right to present masks and to turn others into an audience?"

References

Heyl, Christoph (2005). When they are veyl'd on purpose to be seen. In: Entwistle, Joanne and Wilson, Elizabeth (Eds.), Body Dressing (pp. 121-142). Oxford: Berg.
Hildebrandt, Mireille (2008). Profiling and the Identity of the European Citizen. In: Hildebrandt, Mireille and Gutwirth, Serge (Eds.), Profiling the European Citizen (pp. 303-343). Dordrecht: Springer Netherlands.
Irigaray, Luce (1985). This Sex Which Is Not One. Ithaca (New York): Cornell University Press.
Jaquet-Chiffelle, David-Olivier, Benoist, Emmanuel, Haenni, Rolf, Wenger, Florent and Zwingelberg, Harald (2009). Virtual Persons and Identities. In: Rannenberg, Kai et al. (Eds.), The Future of Identity in the Information Society (pp. 75-122). Berlin: Springer Verlag.
Jong, Ferdinand de (1999). Trajectories of a Mask Performance: the Case of the Senegalese Kumpo. Cahiers d'études africaines, vol. 39, no. 153, 49-71.
Keats, Patrice Alison (2000). Using Masks for Trauma Recovery: a Self-narrative, [https://circle.ubc.ca/bitstream/handle/2429/10679/ubc_2000-0439.pdf] (Accessed 9 February 2011).
Lévi-Strauss, Claude (1988). The Way of the Masks. Vancouver/Toronto: Douglas and McIntyre.
Mitchell, Mary Anne (1985). The Development of the Mask as a Critical Tool for an Examination of Character and Performer Action, [http://etd.lib.ttu.edu/theses/available/etd-0325200931295004937065/unrestricted/31295004937065.pdf] (Accessed 9 February 2011).
Rosa, Annamaria S. de (2002). One, no-one, one hundred thousand ...
and the virtual self: the nickname as the indicator of the multiple identity of the members of two Italian chat lines, [http://www.europhd.eu/html/_onda02/04/ss8/pdf_files/lectures/derosanicknamesjcmc.pdf] (Accessed 9 February 2011).

CHANGE AND CONTINUITY
From the Closed World of Bipolarity to the Closed World of the Present

LEON HEMPEL
Human Technology Lab, Zentrum Technik und Gesellschaft der TU Berlin

Abstract. In his 1996 book The Closed World: Computers and the Politics of Discourse in Cold War America, Paul N. Edwards described the decisive discursive formation of the Cold War in the metaphor of a closed world. In the era of bipolarity, the discourse appeared as a battlefield of system confrontation, of ideological identities and struggle, mutually framed by military thought and the technological development of cybernetic systems. The story of the Cold War does not center on the difference in ideologies, however, but much more on the assimilation process of the two blocs, given the permanent surveillance and monitoring of the military technological developments of each respective side. A "closed world", writes Edwards, "is a radically bounded scene of conflict, an inescapably self-referential space where every thought, word, and action is ultimately directed back toward a central struggle. It is a world radically divided against itself." How, however, has the closed world discourse developed after 1989, beyond the point that was at first celebrated as a new era of freedom and democracy? The period following the Cold War seems to be the period of both the continuation and the culmination of its leading metaphor, at whose center the technological and economic consensus survives. War returned and immediately became the responsibility of a world domestic policy. Simultaneously, new surveillance technologies began to spread into everyday life, and new security concepts evolved, blurring the lines between internal and external security. The paper aims to follow the closed world discourse after the end of bipolarity. It addresses the change in the characteristics and strategies of war after the fall of the Iron Curtain, and aims to demonstrate how military strategic thinking has diffused into society up to the very present and the new discourse on cyberwar. It argues, firstly, that the emphasis on asymmetric war has to be complemented by the concept of a parallel, successive resymmetrisation within military strategic thinking. Not only in the US but also in Europe, this resymmetrisation asserts itself on different societal levels, on different battlegrounds and at different speeds. It involves society as a whole and is accompanied by critical discourses, such as those on the new vulnerability of modern societies or, more critically, the militarization of urban space and the emerging surveillance society. Finally, the paper asks after the epistemic foundations driving this development. Two concepts are highlighted that have accompanied military strategic thinking since the beginning of the Cold War and lay the ground for the dual-use concepts that have become more and more visible in everyday surveillance practices: 'cybernetic prevention' and 'catastrophic imagination'. While the first finds its historical persona in Norbert Wiener, the second finds it in a character such as Herman Kahn.

Long Abstract
In his book The Closed World: Computers and the Politics of Discourse in Cold War America, Paul N. Edwards has described the decisive discursive formation of the Cold War in the metaphor of a closed world. In the era of bipolarity, the closed world discourse appeared as a battlefield of system confrontation, of ideological identities and struggle, mutually framed by military thought and the technological development of cybernetic systems. Taking a closer look, however, the story of the Cold War since the end of the 1950s does not center on the difference in ideologies, but much more on the assimilation process of the two blocs, given the permanent surveillance and monitoring of the military technological developments of each respective side. A "closed world", writes Edwards, "is a radically bounded scene of conflict, an inescapably self-referential space where every thought, word, and action is ultimately directed back toward a central struggle. It is a world radically divided against itself. Turned inexorably inward, without frontiers or escape, a closed world threatens to annihilate itself, to implode." What united the split world of the Cold War was the consensus: the focus on scientific-technological practices, on cybernetic models and calculating machines, with whose help the competition for absolute hegemony was driven. Once the blocs were caught up in the discourse of the closed world, the fight reduced itself to the aim of maintaining military technological superiority until the economic exhaustion of one of the sides.

How, however, has the closed world discourse developed after 1989, beyond the point that was at first celebrated as a new era of freedom and democracy? The period following the Cold War seems to be the period of both the continuation and the culmination of its leading metaphor, at whose center the technological and economic consensus survived. Simultaneously with the conflicts of the closed world, war returned and immediately became the responsibility of a world domestic policy (Ulrich Beck), which would be unimaginable without the new closeness. "New faces of war" (Martin van Creveld) became present in the application of new military technologies on the one side, and on the other in what have been called the "new wars", which could no longer be described with traditional concepts of inter-state conflict (Mary Kaldor; Herfried Münkler). In the notion of asymmetrical war, both faces correlated: state entities clash with private groups which do not differentiate between civilian and non-civilian victims when applying force; high-tech meets low-tech.

The emphasis on asymmetry – Clausewitz introduced the notion in his famous book "On War" already in the 19th century – nevertheless appears problematic. However plausible the picture of two unequal parties seems at first glance, the emphasis hides the organizational, strategic and technological development that has occurred within the armed forces in reaction to the new enemies' strategies. War always demands a kind of strategic symmetry between the opponents, no matter how different they might be in terms of the economic and technological resources available to them. The term asymmetry, which seems ideologically tinged, must today be complemented by the concept of a parallel, successive resymmetrisation, perhaps even replaced entirely. The resymmetrisation of the antagonism asserts itself on different societal levels, on different battlegrounds in the military as well as in society, and at different speeds.
It involves society as a whole and is accompanied by critical discourses such as those on the new vulnerability of modern societies or, more critically, on the militarization of urban space (Steve Graham) and the emerging surveillance society (David Lyon et al.). While the irregular conflict or the new war has been characterized by the dissolution of borders, by deterritorialisation and the disappearance of the opponent, the resymmetrisation, driven by state actors, aims at renewed territorialisation, the enforcement of the one remaining global order, in which the opponent is to be made visible. The development of an intensified and extended New Surveillance (Gary T. Marx) has to be seen in light of the core idea of the new military answers of resymmetrisation, which developed as early as the very beginning of the 1990s. These show manifold continuities with Cold War side-strategies stemming from both internal and external security. They postulate the blurring of the lines between internal and external threats, between the traditional political-judicial distinction of inner and outer security, and between the civil and the military sector. John Arquilla, once an advisor to Donald Rumsfeld, who together with David Ronfeldt defined the term Netwar in the 1990s, heralding the arrival of the cyberwar era, recently warned again in Foreign Policy of the inertia of a military following the "Shock and Awe" strategy. The present challenges of Afghanistan, Pakistan, Yemen etc. demand a change of military thinking as a whole, and "New Rules of War" must be defined: only the "Many and Small" can win over the "Few and Large", Arquilla repeats his military strategic credo of the 1990s and of the war on terror. Besides the concentration of a few entities of individualized experts, these new rules of war would include, for instance, the application of tactics of swarm formation. Nowhere does the postulate of resymmetrisation become more evident than in the sentence: "It will take a swarm to defeat a swarm". Simultaneously, this necessitates that the opponent be made visible: "In a world of a networked war, armies will have to redesign how they fight, keeping in mind that the enemy of the future will have to be found before it can be fought." Arquilla therefore demands the organization of forces into a "sensory organization", an organization concentrated on the identification of the enemy. But where does the unknown enemy hide, to paraphrase a well-known notion of Donald Rumsfeld? Steven Metz and James Kievit, authors at the Strategic Studies Institute of the U.S. Army War College, identified in 1994 the technological potential of the so-called Revolution in Military Affairs (RMA) in the context of so-called conflicts short of war. No earlier piece of futuristic military thinking spells out the social and political consequences of the RMA more shockingly than theirs: "Will the long-term benefits outweigh the costs and risk?", they ask, laying the ground for the new concept of national security. They envision a future in which military thinking expands into society and absorbs everyday life.
Questioning how the technological potential of the RMA can be pushed through, they not only draw a scenario of a maximum surveillance society (Clive Norris) but also identify the classical liberal values of the West, such as privacy, as the core obstacle: "An ethical and political revolution may be necessary to make a military revolution." While scholars within International Relations and Security Studies were still arguing heavily during the first half of the 1990s about whether it is accurate to expand the term security to affairs other than the military, Kievit and Metz already envisioned the blurring of the traditional boundaries of civil and military security, synthesized with the support of new surveillance technologies: The new concept of security also included ecological, public health, electronic, psychological, and economic threats. Illegal immigrants carrying resistant strains of disease were considered every bit as dangerous as enemy soldiers. Actions which damaged the global ecology, even if they occurred outside the nominal borders of the United States, were seen as security threats which should be stopped by force if necessary. Computer hackers were enemies. Finally, external manipulation of the American public psychology was defined as a security threat (Kievit and Metz 1994). Given this background, the paper will first analyze strategic thought under the postulate of resymmetrisation. Comparing the period of the Cold War to the one following it, it will secondly look at scenarios of the early 1990s and how they surfaced in the 21st century. Finally, it will question the continuity of the closed world discourse and ask for the epistemic foundations of the current development. Two concepts are highlighted that have accompanied military strategic thinking since the beginning of the Cold War and lay the grounds for the dual-use concepts that have become more and more visible in everyday surveillance practices: 'cybernetic prevention' and 'catastrophic imagination'. While the first finds its historical persona in Norbert Wiener, the second finds it in a character such as Herman Kahn.

SUBITO and the Ethics of Automating Threat Assessment

KEVIN MACNISH

Abstract. In 2008 the EU FP-7 Security Topic funding programme accepted a bid to develop project SUBITO (Surveillance of Unattended Baggage and the Identification and Tracking of the Owner), a central part of which involved building an automated threat assessment system. The purpose of this system was to identify unattended baggage and alert a human CCTV operator to its presence. SUBITO was deemed necessary in the light of security incidents concerning bombs left in unattended luggage (e.g. the 2004 Madrid train bombings, which killed 191 and wounded 1,841), coupled with research suggesting that threat assessments performed by CCTV operators could be enhanced by automated systems. In addition to automatically recognizing the leaving of an unattended bag, SUBITO aimed to reduce false positives by recognizing when a bag was left with an associate of the owner or when the owner was walking towards a non-threatening goal. Aside from questions of efficacy, there are ethical issues surrounding the manual operation of CCTV for threat assessment. These are typically located in the person of the operator, who may display prejudice, rely on social stereotypes or use the equipment for inappropriate ends.
The concept of automating threat assessment, and thereby eradicating the role of the human operator, seems attractive in offering a potential resolution to these issues. This paper examines the ethical concerns regarding manual threat assessment against those presented by an automated alternative such as SUBITO. It will be seen that in the latter case problems are not removed but relocated from the operator to the programmer, and that further problems arise in the process. In conclusion, a partially-automated process will be advocated as the most ethically acceptable solution.

SUBITO and the Ethics of Automating Threat Assessment

In 2008 the EU FP-7 Security Topic funding programme accepted a bid to develop project SUBITO (Surveillance of Unattended Baggage and the Identification and Tracking of the Owner), a central part of which involved building an automated threat assessment system. The purpose of this system was to identify unattended baggage and alert a human CCTV operator to its presence. SUBITO was deemed necessary in the light of security incidents concerning bombs left in unattended luggage (e.g. the 2004 Madrid train bombings, which killed 191 and wounded 1,841), coupled with research suggesting that threat assessments performed by CCTV operators could be enhanced by automated systems. In addition to automatically recognizing the leaving of an unattended bag, SUBITO aimed to reduce false positives by recognizing when a bag was left with an associate of the owner or when the owner was walking towards a non-threatening goal. Aside from questions of efficacy, there are ethical issues surrounding the manual operation of CCTV for threat assessment. These are typically located in the person of the operator, who may display prejudice, rely on social stereotypes or use the equipment for inappropriate ends. The concept of automating threat assessment, and thereby eradicating the role of the human operator, seems attractive in offering a potential resolution to these issues. This paper examines the ethical concerns regarding manual threat assessment against those presented by an automated alternative such as SUBITO. It will be seen that in the latter case problems are not removed but relocated from the operator to the programmer, and that further problems arise in the process. In conclusion, a partially-automated process will be advocated as the most ethically acceptable solution. In 1999 Norris and Armstrong published the results of a two-year study into the behaviour of CCTV operators. Among the results were indications that operators were responding to events in an unpredictable fashion, sometimes responding to trivial incidents while at other times ignoring blatant offences. Possible causes of this unpredictability include information overload, change blindness, inattentional blindness (Simons, 1999, 2005) and operator boredom. In responding to their all-too-human limitations, operators displayed a tendency to rely on social stereotyping to determine likely threats. This was highlighted in the Norris and Armstrong study, which found that the young, the male and the black were more likely to be surveilled than other groups, even when the motivation cited for the surveillance was "no obvious reason". In addition to the ethical concerns arising from perpetuating social stereotypes, these practices exacerbate the number of false positives and false negatives reported by the system, leading to frustration on the part of the operator and victimization of the surveilled.
Furthermore, and as with most technological innovations, there are problems regarding function creep as the technology is applied for purposes not originally envisioned (Winner, 1977). Gill and Spriggs, for instance, have found that while CCTV has been installed in many locations in the UK for the purpose of crime prevention and detection, its success is often evaluated against far wider criteria (finding lost children, urban regeneration, etc.) (Gill and Spriggs, 2005). Finally, surveillance introduces a distance between the operator and the surveilled subject which disempowers the subject and may serve to reinforce prejudicial attitudes of the operator by failing to confront her with her own stereotyping. Taken together, these four areas of concern (operator error, false positives/negatives, function creep and distance) indicate that manual threat assessment by means of CCTV is ethically problematic. Automated systems offer the chance to overcome many of the problems related to operator error. Indeed, it is possible that the automation of the process, eradicating the need for an operator altogether, could result in distinct ethical advantages. However, as David Lyon has pointed out (Lyon, 2003), automation sees the focus of ethical inquiry relocated from the operator to the programmer. Social stereotyping can remain through unwitting biases in the code rather than in the individual operator. Yet as the code pervades the entire system rather than one control room, such stereotypes risk becoming institutionalised. With SUBITO, for instance, the recognition of group associations can reduce false positives, but the parameters used can also provide a basic means of remotely distinguishing between different ethnic groups. False positives and negatives likewise threaten to remain an issue. While the code is capable of overcoming the aforementioned human limitations (processing capacity, change blindness, inattentional blindness and boredom), it is limited to the parameters set by the programmer, which will be less subtle than those employed by the camera operator. Function creep also remains a possibility. While the leaving of unattended baggage per se does not seem ripe for function creep, recognizing associations in crowds and predicting pedestrian goals do: possible uses range from finding lost children to identifying and tracking social "undesirables". Finally, in dealing with a computer rather than a (remote) human, the problem of distance threatens to be magnified to the extent that normal human interactions concerning discretion, negotiation and the reinforcement of social and moral values are lost. In the case of automation the problem of distance thus becomes one of dehumanisation. There are, however, alternatives between the extremes of manual operation and full automation (Endsley and Kiris, 1995): levels of automation which involve the human operator to a greater or lesser degree. This paper concludes that such partial automation is the most ethically acceptable approach to take regarding threat assessment. Through combining human and automated systems, the limits of the operator's individual capacities can be significantly enhanced while the dangers of institutionalised prejudice in the automated system are reduced. There will also be fewer false positives and false negatives than in either of the extremes discussed above.
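To make the suggested division of labour concrete, here is a minimal sketch in C of the kind of rule-based alerting described above. It is the editor's own hypothetical illustration: the signal names, the 60-second threshold and the suppression rules are assumptions for exposition, not SUBITO's actual design.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical, heavily simplified model of a SUBITO-style rule.
       All field names and thresholds are assumptions for illustration. */
    typedef struct {
        bool   bag_unattended;      /* bag has been separated from its owner   */
        double seconds_unattended;  /* how long the bag has been unattended    */
        bool   associate_nearby;    /* bag left with an associate of the owner */
        bool   owner_goal_benign;   /* owner predicted to reach a benign goal  */
    } Scene;

    /* Decide whether to alert the human CCTV operator. The machine only
       flags scenes; the threat judgment itself stays with the operator. */
    bool should_alert(const Scene *s) {
        if (!s->bag_unattended || s->seconds_unattended < 60.0)
            return false;           /* nothing to assess yet                   */
        if (s->associate_nearby || s->owner_goal_benign)
            return false;           /* suppress a likely false positive        */
        return true;                /* escalate to the human operator          */
    }

    int main(void) {
        Scene s = { true, 120.0, false, false };
        printf("alert operator: %s\n", should_alert(&s) ? "yes" : "no");
        return 0;
    }

The structural point is that the code decides only when to involve the operator, which is exactly the partial automation advocated here; what it cannot do is deliberate about the parameters it was given, which is where the ethical burden relocated to the programmer becomes visible.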
Function creep and the problem of distance remain, but once again the continued reliance of the system on a human element maintains crucial checks and balances which would otherwise be lost with full automation.

Acknowledgements
I am grateful for the funding of SUBITO, an FP-7 project, and to the University of Leeds for sponsoring this research.

References
Endsley, M.R. & Kiris, E.O. (1995). The out-of-the-loop performance problem and level of control in automation. Human Factors 37(2), 381-394.
Gill, M. & Spriggs, A. (2005). Assessing the Impact of CCTV. London: HMG Home Office.
Lyon, D. (2003). Surveillance as Social Sorting: Computer Codes and Mobile Bodies. In: D. Lyon (Ed.), Surveillance as Social Sorting (pp. 13-30). Oxford: Routledge.
Simons, D.J. & Ambinder, M.S. (2005). Change Blindness: Theory and Consequences. Current Directions in Psychological Science 14(1), 44-48.
Simons, D.J. & Chabris, C.F. (1999). Gorillas in our midst: Sustained inattentional blindness for dynamic events. Perception 28, 1059-1074.
Winner, L. (1977). Autonomous Technology: Technics-out-of-control as a Theme for Political Thought. Cambridge, MA: MIT Press.

MATCHING – POPULAR MEDIA BETWEEN SECURITYWORLDS AND CULTURES OF RISK

JULIUS OTHMER
Institute for Media Studies, Braunschweig University of Arts, Frankfurter Straße 3c, 38122 Braunschweig

AND

ANDREAS WEICH
Institute for Media Studies, Braunschweig University of Arts, Frankfurter Straße 3c, 38122 Braunschweig

Abstract. The concept of risk management has become a part of everyday life. In our presentation we will discuss two typical strategies of risk management described by Herfried Münkler: those in securityworlds and those in cultures of risk. On this theoretical basis, we will try to explain how implementations of these strategies can be found in popular media products. For this, we will take a closer look at the online soccer manager game on www.kicker.de and the dating platform Parship. Both are computer-based technologies that virtually mediate risks with respect to real persons and their characteristics and behaviours: soccer players on the one hand and potential partners on the other. The thesis is that both use strategies of calculating and minimizing risk according to the logic of securityworlds and, at the same time, of playing with risk according to the logic of cultures of risk. Further, they do their part to establish the ideas and strategies of risk and risk management in popular culture and help naturalize the attached knowledge and practices.

Paper
The scholarly perspective on the concept of security has in recent years become seemingly inevitably connected to the concept of risk. In contrast to danger, risk is something virtual that can only be apprehended through visualization and statistics, which make it calculable and therefore manageable. Further, risk lays the responsibility for this management and for the outcome of actions on the acting subject. The political scientist Herfried Münkler describes two ideal types of strategies to deal with this task: securityworlds and cultures of risk. Securityworlds try to exclude danger and threat by walling off, by security technologies and by risk avoidance. In doing so, they also make factors of insecurity visible and produce a heightened feeling of insecurity and cultures of fear. Cultures of risk, on the other hand, face dangers and threats by taking risks and seizing chances in a way that is both playful and calculating.
The two concepts do not exclude one another but frame and presuppose each other (Münkler, 2009). Both strategies are based on models and technologies of visualization and calculation that are mainly statistical. For storing, sorting, searching, relating and processing these numeric data, computer-based databases seem to be the perfect device. They are the technical infrastructure for generating the risk profiles and scenarios that are used for calculating risks and for choosing options of action. So databases are, on the one hand, a tool for handling risks and, on the other hand, the technology that makes risk visible and the concept thinkable in the first place. This connection between discourses, practices and technology is interesting because it evokes questions about the "risky" implications and inscriptions in computer databases used in everyday life, in which actions and practices are being monitored permanently. Popular media like computer games or internet applications are the most influential media in contemporary popular culture, providing "orientative knowledge" for our lives by giving "patterns of knowledge and actions" which the subject can "adapt to and accommodate" (Neitzel and Nohr 2008). Within our presentation, we will examine in which respect the concepts of securityworlds and cultures of risk are negotiated and implemented in the popular media products www.parship.de and the soccer manager game on kicker.de, and which patterns of knowledge and action are provided in them. Both objects combine purely database-based elements (personal profiles and a mathematical matrix for rating soccer players) with real-world elements (real persons as potential partners and the real efforts of soccer players) in a popular medial context. In the analysis we will look at the different and similar strategies of risk management that try to mediate between the calculability of the database and the contingency of the real world.

References
Münkler, H. (2009). Strategien der Sicherung: Welten der Sicherheit und Kulturen des Risikos. Theoretische Perspektiven. In: H. Münkler, M. Bohlender & S. Meurer (Eds.), Sicherheit und Risiko. Über den Umgang mit Gefahr im 21. Jahrhundert (pp. 11-34). Bielefeld: Transcript.
Neitzel, B., Nohr, R.F. & Wiemer, S. (2009). Benutzerführung und Technik-Enkulturation. Leitmediale Funktionen von Computerspielen. In: D. Müller, A. Ligensa & P. Gendolla (Eds.), Leitmedien. Konzepte – Relevanz – Geschichte (pp. 231-256). Bielefeld: Transcript.

Informational Warfare and Just War Theory

MARIAROSARIA TADDEO

Abstract. This paper focuses on Informational Warfare: the warfare characterised by the use of information and communication technologies. This is a fast-growing phenomenon, which poses a number of issues ranging from the military implementation of such technologies to its political and ethical implications. The paper presents a conceptual analysis of this phenomenon with the goal of investigating its nature. Such an analysis is deemed necessary in order to lay the ground for future work on this topic addressing the ethical problems engendered by Informational Warfare. The analysis is developed in three parts. It first delineates the relation between Informational Warfare and the Information revolution. It then turns the attention to the effects that the diffusion of this phenomenon has on the concepts of state and war.
On the basis of this analysis, it provides a definition of Informational Warfare as a phenomenon that is transversal with respect to the environment in which it is waged, the way it is waged and the ontological and social status of the agents involved. Finally, the paper concludes by taking into consideration Just War Theory and the problems arising from its application to the case of Informational Warfare.

Extended Abstract
The analysis presented in the paper focuses on Informational Warfare (IW): the warfare based on the use of Information and Communication Technologies (ICTs). IW has been at the centre of interest of governments, intelligence agencies, computer scientists and security experts for the past two decades (Arquilla 1999; Libicki 1996; Singer 2009). ICTs support war waging in two ways: by providing new weapons to be deployed on the battlefield, like drones and semi-autonomous robots, and by allowing for so-called information superiority, the ability to collect, process, and disseminate information while exploiting or denying the adversary's ability to do the same. ICTs prove to be effective and advantageous war technologies, as they are efficient and relatively cheap when compared to the general costs of war. For this reason, the use of ICTs in warfare has grown rapidly in the last decade, determining some deep changes in the way war is waged and giving rise to the latest revolution in military affairs (RMA). This RMA concerns in primis military force. It also concerns strategy planners, policy-makers and ethicists, as the need to regulate this new form of warfare is much felt, and the existing international regulations, like the Geneva and Hague Conventions, provide only partial guidelines. In the same way, traditional ethical theories of war, which should provide the ground for policies and regulations, struggle to address the ethical problems that have arisen with this new form of warfare (Arquilla 1999; Arquilla and Borer 2007; DeGeorge 2003; Hauptman 1996; Powers 2004). There are three categories of problems on which both policy-makers and ethicists focus their attention: risks, rights and responsibilities. In the paper I will refer to these problems as the 3R problems. Altogether, the 3R problems pose a new ethical challenge. Nevertheless, such problems will not be the focus of this paper, which will rather concentrate on the analysis of the nature of IW and the changes that it determines. The task of the proposed analysis is to lay down the conceptual foundation for the solution of the 3R problems, which will be provided elsewhere. IW is a wide-spectrum phenomenon, which is rapidly changing the dynamics of combat as well as the role of warfare in political negotiations and the dynamics of civil society. These changes are the origins of the 3R problems; the conceptual analysis of such changes and of the nature of this phenomenon is deemed a necessary and preliminary step towards solving these problems. The analysis is divided into three steps. First, IW is analysed within the framework of the Information revolution (Floridi 2009). Floridi's analysis of the Information revolution as the fourth revolution is recalled, and it is stressed that such a revolution determines a shift toward the non-physical domain, the domain of non-physical objects, agents and interactions. In the second step, it is argued that IW is one of the most compelling cases of such a shift.
This analysis leads to the consideration of the effects of the dissemination of IW on the concepts of war and state. In particular, it is argued that IW redefines the concept of war as a phenomenon that is not necessarily sanguinary and violent, and that is rather transversal with respect to the environment in which it is waged, the way it is waged and the ontological and social status of its agents. A definition stressing the transversality of IW and its disruptive nature is then provided: Informational Warfare is the use of ICTs within an offensive or defensive military strategy aiming at the disruption of the enemy's resources, which is waged within the informational environment, by agents and against targets ranging over both the physical and non-physical domains, and whose level of violence may vary upon circumstances. Finally, the third step is devoted to the problems arising when IW is considered within the framework of Just War Theory. This theory provides the ground for international regulations and sets the parameters for both the ethical and the political debates. The issue is addressed whether and how the principles of Just War Theory could be applied to IW. The analysis unveils three problems. The first one concerns the differences between the scenario assumed by Just War Theory and the one delineated by IW. Just War Theory refers to classic warfare, where governments and their leaders are the only ones who inaugurate wars by deploying armed forces, and they are the ones to be held accountable for the actions of war. IW fosters a completely new way of declaring and waging war. The need is stressed for Just War Theory to take such changes into account in order to address the ethical problems that have arisen with IW. The other two problems concern the application of two principles of Just War Theory, 'war as last resort' and 'discrimination and non-combatants immunity', to the case of IW. In the case of the principle of 'war as last resort', the analysis indicates that the application of this principle to the case of IW leads to an ethical impasse. The principle assumes that war is a violent and sanguinary phenomenon. It is argued that the correctness of this assumption is shaken when IW is taken into account, and that in these circumstances the application of the principle of war as last resort becomes less immediate. The impasse concerns the use of the bloodless and non-physically violent modes of combat peculiar to IW, like a cyber attack, to address potentially dangerous diplomatic conflicts and so prevent the occurrence of classic warfare. On the one hand, such a use constitutes an act of war itself, and as such Just War Theory forbids it; on the other hand, it may spare states from engaging in a sanguinary war and hence is intrinsically consistent with the overall aim proposed by Just War Theory of reducing bloodshed and conflicts. A similar ethical problem is described with respect to the application of the 'principle of discrimination and non-combatants immunity'. It is stressed that this principle tacitly equates non-combatants with civilians and that such an equation has been weakened by the diffusion of terrorism and guerrilla warfare, becoming even feebler with the dissemination of IW. In the IW scenario, civilians may take part in a combat action from the comfort of their homes, while carrying on with their civilian life and hiding their status of informational warriors. An ethical conundrum is described.
Given the difficulty of distinguishing combatants from non-combatants in the IW scenario, and in order to endorse the 'principle of discrimination', states might feel justified in embracing high levels of surveillance over the entire population, breaching individual rights like privacy and anonymity, in order to identify the combatants and guarantee the security of the entire community. [9] It is argued that, on the one side, respecting the principle of discrimination may lead to the violation of individual rights; on the other side, waiving the principle of discrimination leads to bloodshed and the dissemination of indiscriminate violence over the civil population. The paper concludes by pulling together the threads of the analysis and stressing the importance of developing ethical guidelines, which will provide the ground for the definition of the necessary regulation of IW and for the solution of the 3R problems.

[9] This problem is part of the 3R problems described in section one.

References
Arquilla, J. (1999). Ethics and information warfare. In: Z. Khalilzad, J. White & A. Marshall (Eds.), Strategic appraisal: the changing role of information in warfare (pp. 379-401). Santa Monica, USA: Rand Corporation.
Arquilla, J. & Borer, D.A. (Eds.) (2007). Information Strategy and Warfare: A Guide to Theory and Practice (Contemporary Security Studies). New York, USA: Routledge.
DeGeorge, R.T. (2003). Post-September 11: Computers, ethics and war. Ethics and Information Technology 5(4): 183-190.
Floridi, L. (2009). The Information Society and Its Philosophy. The Information Society 25(3): 153-158.
Hauptman, R. (1996). Cyberethics and social stability. Ethics and Behavior 6(2): 161-163.
Libicki, M. (1996). What is Information Warfare? Washington, D.C., USA: National Defense University Press.
Powers, T.M. (2004). Real Wrongs in Virtual Communities. Ethics and Information Technology 5(4): 191-198.
Singer, P.W. (2009). Robots at War: The New Battlefield. Wilson Quarterly 33(1): 30-48.

TECHNO-SECURITY, RISK AND THE MILITARIZATION OF EVERYDAY LIFE

JUTTA WEBER
University of Paderborn, Warburger Straße 100, 33098 Paderborn

Abstract. Recently we have been experiencing a rapid and ongoing transfer of security technologies such as body scanners, drones, or biometrics from the military realm into everyday life. And though there is a lively debate on the growing militarization of public space, political culture and everyday life (Giroux 2004, Graham 2005, Crandall/Armitage 2005, Kohn 2009), there is surprisingly little discussion of the huge amount of military-civilian transfer of new and emerging security technologies. Only very few authors address the possible militarization of society through the procurement, adaptation and proliferation of military technologies in civilian life (Agre 2001). A few scholars such as Dandeker (1990, 2006), Wood et al. (2006), or Balzacq et al. (2010) have pointed out that security technologies and practices are deeply impregnated by their military origins. Surveillance studies scholars, leaning on Anthony Giddens (1985), at least partly acknowledge the growing entanglement of the military and bureaucracy in post/modern societies (Bogard 1996, Dandeker 1990, Nellis 2009, Wood et al. 2003). Approaches in STS (Akrich 1992; Woolgar 1991) and philosophy of technology (Winner 1986, Verbeek 2006, Flanagan, Howe and Nissenbaum 2008) have shown how technology transports values, world views and norms.
Therefore I will ask in my paper what norms, values and frames of thought are transported into everyday life with the military-civilian transfer of security technologies, for example when uninhabited aerial vehicles become part of everyday experience through the growing presence of UAVs during global sport and cultural events, at demonstrations or during law enforcement, as well as through 'augmented reality' video games.

2. Daily Drones. Techno-Security & the Militarization of Everyday Life

Originally, hopes of a large-scale military-civilian conversion arose after the end of the cold war. But these hopes were disappointed already in the early 1990s, when force again became a frequent tool of foreign policy, concentrating on so-called rogue and failed states; there followed a growing number of military responses, from peace-keeping operations up to massive invasions (Rappert et al. 2008). In philosophy of technology as well as in science and technology studies (STS) there are some studies on the crossover of global communication and military surveillance systems (i.a. de Landa 1991, Edwards 1996) as well as on the fusion of military, industry and media (Der Derian 2001, Lenoir/Lowood 2002). The shift of the business of major arms manufacturers towards mainstream security and surveillance products in the post-cold-war era has been addressed (i.a. Wood et al. 2006, Eick 2010, Graham 2010). Nowadays new products are being developed and are partially already deployed. Think of non-lethal weapons, i.a. electroshock and heat-ray weapons, as well as monitoring systems linked to killing or paralyzing systems. These weapons for warfare or crowd control are situated between the military and the civilian realm. In a brochure on new security projects in the 7th framework programme for research, the Directorate General Enterprise and Industry of the EU commission states: "Moreover, the relationship between defence technologies on the one hand, and security technologies on the other, is particularly noticeable in the field of R&D, with technologies that show potential developments in both areas (Dual Use). At both research and industrial development levels, synergies are possible and desirable." (European Commission. Enterprise and Industry 2009, my emphasis). Contemporary surveillance studies also point towards the close relation between the military and the managerial: "Cross-fertilization between the military and the managerial is clearly central to problems and developments in the study and practice of surveillance…" (Wood et al. 2003, 146). But there are very few studies on the relation of the sociotechnical, the political and the military with regard to military-related security technologies and their impact on everyday life.

2.1. TECHNO-SECURITY, RISK AND UNPREDICTABILITY

So what to make of the manifest expansion of military technologies into civilian life in general and of UAVs specifically? We have long known about the conversion and adaptation of military technology in everyday life; think only of recent examples of the military origins of technologies such as the internet, RFID, satellite technology or GPS (Global Positioning System). Approaches in STS (Akrich 1992; Woolgar 1991) and philosophy of technology (Winner 1986, Verbeek 2006, Flanagan, Howe and Nissenbaum 2008) have shown how technology transports values, world views and norms.
Madeleine Akrich made visible that every technology contains scripts, while Steve Woolgar (1991) pointed to the fact that technology is "configuring the user" and the context of use. It is therefore important to ask which frames of thought, world views, perspectives, preferences and motives are inscribed into military-related security technologies and translated into everyday life. Kaplan (2006) has shown how GPS not only linked demography, geography, remote sensing, geopolitics and identity politics, but also how GPS became an icon of "personal empowerment and self-knowledge linked to speed and precision" (Kaplan 2006: 697) for US Americans. At the same time, the 'militarized consumer' who wants to improve his 'lifestyle' provides the personal data, thereby enabling new systems of surveillance (embedded in mobiles, GPS systems in cars, etc.): "…tracked, the user becomes a target within the operational interfaces of the marketing worlds, into whose technologies state surveillance is outsourced." (Crandall 2006, np) A relevant epistemological shift, accompanied by the emergence of new norms, worldviews and values, that comes with the massive contemporary military-civilian transfer is the reframing of today's concept of security. Homeland as well as international security is not primarily occupied with the defense against specific threats and the prosecution of crimes (Albrecht 2009) but with the (precautionary) management of risk and the preventive and pre-emptive securitization of security (Aradau et al. 2008, Ammicht-Quinn/Rampp 2009, Zedner 2007). While traditionally threat was related to the actions and intentions of conflicting parties, which can in principle be resolved, the concept of 'risk' embraces the idea of general, permanent and systemic contingencies such as pandemics, global warming, rogue states, terrorism, organized crime, poverty, illegal immigration or the proliferation of weapons of mass destruction (European Commission. Enterprise and Industry 2009). The concept of risk is closely entangled with unpredictability and insecurity, especially with regard to the identification of the enemy or the assessment of hazardous situations. The politics of risk operates with risk profiling on the basis of statistics and probabilities, with models and speculations which aim not at eliminating but at managing risk: "In short, whereas the concept of threat brings us in to the domain of the production, management and destruction of dangers, the concept of risk mobilizes and focuses on different practices that arise from the construction, interpretation and management of contingency". (Aradau et al. 2008, 148; my emphasis) This new approach is highly technology-oriented. The shift towards a preventive security policy and a techno-centred concept of security corresponds to the increasing networking of surveillance measures. The reconfiguration of surveillance as assemblage (Haggerty/Ericson 2000) is a general tendency. Nevertheless, the concept and practice of digital network-centred surveillance technologies (Graham/Wood 2003) show strong affinities to those of network-centric warfare. The latter, also called the 'Revolution in Military Affairs', is based on strong, ubiquitous ICT-based networks and mobilities that control and monitor area-wide and over huge distances 24 hours a day to reach a "globe-spanning dominance based on a near-monopoly of space and air power" (Graham 2005, 175; see also Dillon 2002, Dandeker 2006).
In this scenario, autonomous UAVs with artificial intelligence and learning capability in particular are regarded as an important component of the new techno-warfare (Weber 2009, 2010). Together with inhabited systems integrated in a complex network of air, water and ground agents, new techniques of warfare are developed "… toward a vision of a strategic and tactical battlespace filled with networked manned and unmanned air, ground, and maritime systems ... that free warfighters from the dull, dirty, and dangerous missions ... and enable entirely new design concepts unlimited by the endurance and performance of human crews. The use of UAVs in Afghanistan and Iraq is the first step in demonstrating the transformational potential of such an approach." (Department of Defense 2007, 34) This aspired-to high-tech transformation of the armed forces is supposed to make them invincible and to develop strategies of digital deterrence more powerful than nuclear deterrence ever was. The utopia of a ubiquitous, networked system of surveillance and control seems to be mirrored by a preventive and techno-centred idea of security in everyday life, for example when drones are deployed for law enforcement by the British police or for border control by the European agency Frontex. Recently, the Guardian's Freedom of Information request revealed the very broad scope of potential UAV applications by the British police: "Working with various policing organisations as well as the Serious and Organised Crime Agency, the Maritime and Fisheries Agency, HM Revenue and Customs and the UK Border Agency, BAE [systems; the British defence company] and Kent police have drawn up wider lists of potential uses. One document lists '[detecting] theft from cash machines, preventing theft of tractors and monitoring antisocial driving' as future tasks for police drones, while another states the aircraft could be used to combat 'fly-posting, fly-tipping, abandoned vehicles, abnormal loads, waste management' (…) There are two models of BAE drone under consideration, neither of which has been licensed to fly in non-segregated airspace by the CAA. The Herti (High Endurance Rapid Technology Insertion) is a five-metre long aircraft that the Ministry of Defence deployed in Afghanistan for tests in 2007 and 2009". (Lewis 2010). According to these plans, the use of UAVs would be part of a larger network-centric project through which information from a variety of sources (UAVs, smart CCTV, data retention, analysis of money transfers, etc.) is networked and evaluated. This course of action seems to aim not primarily at prosecuting specific crimes and following concrete suspicions, but at monitoring a nation's population systematically and thoroughly on an everyday basis. We need to investigate whether this civilian approach resembles what in the military is called C4ISR: Command, Control, Communications, Computers, Intelligence, Surveillance and Reconnaissance. C4ISR stands for the networking of all available surveillance and control systems to achieve a global overview in the war theatre. So maybe we are witnessing the idea of a global overview in the (civilian) world theatre. Part of this epistemological and normative reframing might also be found in recent consumer applications of UAVs.
Since last year, the first small UAVs, or quadricopters, have been available for 'augmented reality' video games (http://ardrone.parrot.com/parrot-ar-drone/de/) in which one can launch missiles and fight against other drones. The quadricopters can be controlled by an iPhone, iPod Touch or iPad. There are two cameras embedded in the drone, one on the front and one underneath, to enable direct sight via video remote control on the basis of a Wi-Fi connection. Another application is provided by a German company which rents drones for private use (www.rent-a-drone.de) to enable real-time pictures and videos from above. The private consumer applications of UAVs might (still) not be as wide-ranging as GPS, but in a way one could argue that they might open the door to more intense participatory surveillance and observation practices (Ball 2005, Koskela 2009). Daily consumer drones might help train users to watch the world from a top-down or 'God's eye' view that participates in the C4ISR longing for a global overview in the war / world theatre. The tightening networks of surveillance technologies, increasingly expanded by drones for border control, policing demonstrators, and crowd and event control, are part of a growing belief in a "'smart', specific, side-effects-free, information-driven utopia of governance" (Valverde and Mopas, 2004: 239). Network-centric warfare with its idea of C4ISR relies on this utopia, as might be the case with recent police applications of drones and new gamer applications such as the iPhone-controlled AR.Drone. It is necessary to follow closely the growing transfer of military technologies into civil applications, game practices and other areas of everyday life to see whether and how recent ideas of techno-security and 'full spectrum dominance' become dominant in the societies of control of the 21st century.

References
Agre, P.E. (2001). Imaging the Next War. Infrastructural Warfare and the Conditions of Democracy. Retrieved from http://polaris.gseis.ucla.edu/pagre/war.html [accessed 17 November 2010].
Akrich, M. (1992). The de-scription of technological objects. In: W.E. Bijker & J. Law (Eds.), Shaping technology/building society (pp. 205-224). Cambridge: MIT.
Ammicht-Quinn, R. & Rampp, B. (2009). The Ethical Dimension of Terahertz and Millimeter-Wave Imaging Technologies – Security, Privacy and Acceptability: Optics and Photonics. In: C.S. Halvorson et al. (Eds.), Global Homeland Security V and Biometric Technology for Human Identification VI (pp. 1-11). Proc. of SPIE Vol. 7306, 730613.
Ball, K. (2005). Organization, Surveillance and the Body: Towards a Politics of Resistance. Organization 12(1): 89-108.
Balzacq, T. et al. (2010). Security Practices. In: R. Denemark (Ed.), International Studies Encyclopedia Online. Retrieved from http://didierbigo.com/documents/SecurityPractices2010.pdf [accessed 4 November 2010].
Bogard, W. (1996). The Simulation of Surveillance: Hypercontrol in Telematic Societies. Cambridge: Cambridge University Press.
Capurro, R., Tamburrini, G. & Weber, J. (Eds.) (2008). Techno-Ethical Case-Studies in Robotics, Bionics, and Related AI Agent Technologies.
Deliverable 5 of the EU-Project ETHICBOTS. Emerging Technoethics of Human Interaction with Communication, Bionic and Robotic Systems (SAS 6 - 017759). Retrieved from http://ethicbots.na.infn.it/restricted/doc/D5.pdf [accessed 17 November 2010].
Crandall, J. & Armitage, J. (2005). Envisioning the Homefront: Militarization, Tracking and Security Culture. Journal of Visual Culture 4(1): 17-38.
Crandall, J. (2006). Operational Media. Retrieved from http://www.ctheory.net/printer.aspx?id=441 [accessed 2 January 2011].
Dandeker, C. (1990). Surveillance, Power and Modernity: Bureaucracy and Discipline from 1700 to the Present Day. New York: St. Martin.
Dandeker, C. (2006). Surveillance and Military Transformation: Organizational Trends in Twenty-first-Century Armed Services. In: K.D. Haggerty & R.V. Ericson (Eds.), The New Politics of Surveillance and Visibility (pp. 225-249). Toronto, Buffalo and London: University of Toronto Press.
De Landa, M. (1991). War in the Age of Intelligent Machines. New York: Zone Books.
Department of Defense (2007). Unmanned Systems Roadmap 2007-2032. Retrieved from http://www.acq.osd.mil/usd/Unmanned%20Systems%20Roadmap.2007-2032.pdf [accessed 12 June 2008].
Der Derian, J. (2001). Virtuous War: Mapping the Military-Industrial-Media Entertainment Network. Boulder, CO: Westview Press.
Edwards, P.N. (1996). The Closed World: Computers and the Politics of Discourse in Cold War America. Cambridge, MA: MIT Press.
Eick, V. (2010). The Droning of the Drones. The increasingly advanced technology of surveillance and control. Retrieved from http://www.statewatch.org/analyses/no-106thedroning-of-drones.pdf [accessed 12 November 2010].
European Commission. Enterprise and Industry (2009). Security Research. Towards a more secure society and increased industrial competitiveness. Security Research Projects under the 7th Framework Programme for Research. May 2009. Retrieved from ftp://ftp.cordis.europa.eu/pub/fp7/security/docs/towards-a-more-secure_en.pdf [accessed 17 November 2010].
Flanagan, M., Howe, D.C. & Nissenbaum, H. (2008). Embodying Values in Technology. In: J. van den Hoven & J. Weckert (Eds.), Information Technology and Moral Philosophy (pp. 322-353). Cambridge: Cambridge University Press.
Giddens, A. (1985). The Nation-State and Violence. A Contemporary Critique of Historical Materialism, Vol. II. Berkeley: University of California Press.
Giroux, H.A. (2004). War on Terror. The Militarising of Public Space and Culture in the United States. Third Text 18(4): 211-221.
Graham, S. (2005). Surveillance, urbanization and the US 'Revolution in Military Affairs'. In: D. Lyon (Ed.), Theorizing Surveillance. The Panopticon and Beyond (pp. 247-270). Devon, UK: Willan.
Graham, S. & Wood, D. (2003). Digitizing Surveillance: Categorization, Space, Inequality. Critical Social Policy 23(2): 227-248.
Graham, S. (2010). From Helmand to Merseyside: Unmanned drones and the militarization of UK policing. Retrieved from http://www.opendemocracy.net/ourkingdom/stevegraham/from-helmand-to-merseyside-military-style-drones-enter-uk-domestic-policing [accessed 17 November 2010].
Haggerty, K. & Ericson, R. (2000). The surveillant assemblage. British Journal of Sociology 51(4): 605-622.
Kaplan, C. (2006). Precision Targets: GPS and the Militarization of U.S. Consumer Identity. American Quarterly 58(3): 693-713.
Koskela, H. (2009). Hijacking surveillance?
The new moral landscapes of amateur photographing. In: K.F. Aas, H.O. Gundhus & H.M. Lomell (Eds.), Technologies of Insecurity: The Surveillance of Everyday Life (pp. 147-168). Oxon / New York: Routledge-Cavendish.
Kohn, R.H. (2009). The Danger of Militarization in an Endless 'War on Terrorism'. The Journal of Military History 73(1): 177-208.
Lenoir, T. & Lowood, H. (2002). Theaters of War: The Military-Entertainment Complex. Retrieved from http://www.stanford.edu/class/sts145/Library/LenoirLowood_TheatersOfWar.pdf [accessed 17 November 2010].
Lewis, P. (2010). CCTV in the sky: police plan to use military-style spy drones. The Guardian (London), 23.1.2010. Retrieved from www.guardian.co.uk/uk/2010/jan/23/cctvsky-policeplan-drones [accessed 12 November 2010].
Nellis, M. (2009). 24/7/365: mobility, locatability, and the satellite tracking of offenders. In: K.F. Aas, H.O. Gundhus & H.M. Lomell (Eds.), Technologies of Insecurity: The Surveillance of Everyday Life (pp. 103-124). Oxon / New York: Routledge-Cavendish.
Rappert, B., Balmer, B. & Stone, J. (2008). Science, Technology and the Military. Priorities, Preoccupations and Possibilities. In: The Handbook of Science and Technology Studies (pp. 719-740). London: MIT Press.
Verbeek, P.-P. (2006). Materializing Morality. Design Ethics and Technological Mediation. Science, Technology & Human Values 31(3): 361-380.
Weber, J. (2009). Unmanned Combat Aerial Vehicles, Dual Use and the Future of War. In: R. Capurro, M. Nagenborg & G. Tamburrini (Eds.), Ethics and Robotics (pp. 83-103). Amsterdam/Heidelberg: IOS Press / Deutscher Akademieverlag.
Weber, J. (2010). Armchair Warfare 'on Terrorism'. On Robots, Targeted Assassinations and Strategic Violations of International Law. In: J. Vallverdú (Ed.), Thinking Machines and the Philosophy of Computer Science: Concepts and Principles (pp. 206-222). IGI Global.
Winner, L. (1986). The Whale and the Reactor: A Search for Limits in an Age of High Technology. Chicago: University of Chicago Press.
Woolgar, S. (1991). Configuring the User: The Case of Usability Trials. In: J. Law (Ed.), A Sociology of Monsters. Essays on Power, Technology and Domination (pp. 59-99). London: Routledge.

Track V: Information Ethics, Robot Ethics

IS THERE A HUMAN RIGHT NOT TO BE KILLED BY A MACHINE?

PETER M. ASARO
The New School University
asarop@newschool.edu

1. Extended Abstract

This presentation reviews the standard frameworks for considering the human right not to be killed, and its forfeit by combatants in a war. It then considers as a special case the right not to be killed by a machine. Insofar as one has a right not to be killed by any means, one also has a right not to be killed by a machine, such as a lethal robotic system. It is further argued that in those cases in which an individual may have already forfeited their right not to be killed, such as when acting as a combatant in a war, this does not necessarily subject one to being killed by a machine. Despite a common view that combatants in war may be liable to be killed by any means, "killing by machine" fails to meet the requirements for ethically justifiable killing. The defense of this assertion will rest on a technical definition of "killing by machine" and a further clarification of justified killing in war.
In short, the argument is that "killing by machine" fails to consider the rights of an individual in the morally required manner. This is because a justifiable "decision to kill" must be made by a moral agent, and an automated decision cannot involve the moral deliberation required to justify violating the human right not to be killed. As such, automated decisions to kill are not morally justifiable. The argument begins by examining the right to self-defense which forms the rights-based interpretation of Just War Theory. In particular, I examine the "Castle Laws", aka "Make My Day Laws", which in many U.S. states permit individuals to use force against home-intruders without criminal or civil liability. I examine the conditions under which individuals in such circumstances are permitted to use lethal force, and when such force becomes "willful and wanton misconduct". Informed by this analysis, I examine the legality of a home-defense robot and the legal permissibility of its use of force against home-intruders. In general, the "Castle Laws" do not allow homeowners to booby-trap their homes, and a robotic home-defense system can be viewed as a sophisticated booby-trap. I consider the various objections that might be made to the standard rejection of booby-traps. According to such objections, a robot with sophisticated cognitive and perceptual capabilities might be argued to avoid manifesting a form of "reckless endangerment". I then analogize from the case of home-defense in civil and criminal law to the case of self-defense in war, and to the Laws of Armed Conflict and Just War Theory. While warfare has much looser standards of what constitutes a "threat", and of the proximity of threats, the use of systems capable of automated lethal decision-making is largely analogous to the domestic use of booby-traps. I conclude that implicit in both domestic law and the international laws of armed conflict is a requirement for moral deliberation which undermines the moral and legal legitimacy of automated lethal decision-making. This has serious implications for the use of autonomous lethal robotics in police and military applications. One implication is that only artificial moral agents, capable of exercising moral autonomy, could be morally and legally justified in violating the rights of a human.

DO WE NEED A UNIVERSAL INFORMATION ETHICS?

THOMAS CHRISTOPHER DASCH
University of Paderborn, Germany

Abstract. This article deals with information ethics. This raises the essential question: What is information? But I want to focus on the ethical category. Therefore, three areas of potential action arise. Instead of information I want to talk more generally of data. This makes it possible to distinguish between: (1) the pure reception of data, (2) the pure provision of data, (3) the simultaneous reception and provision of data. (4) A further possible action is to supply a platform for data; this is, strictly speaking, part of topic three, but it will be discussed as a separate topic. The ethical problems that may occur in the individual cases are exemplified. Subsequently, a connection is made between the problems of legislating the Internet and the lack of a universal ethical base in information ethics.

This article deals with information ethics. This raises the essential question: What is information?
The question "What is Information?" (Floridi, 2004, p. 560) is, according to Floridi, the elementary problem of the philosophy of information. Among the advocates of well-known approaches to the concept of information are Shannon and Weaver, Bar-Hillel and Carnap, Wiener, Janich, etc. (Capurro, 2000). Here Capurro's trilemma (Fleissner, Hofkirchner, 1995) applies: (1) either the concept of information is always the same no matter what the set of input data is like, (2) or the information is only of a similar kind, or (3) it is completely independent. At this point it would have to be clarified on which concept of information information ethics is based. But I want to break different ground: I want to focus on the ethical category. In this context, information ethics is the part of ethics that deals with the internet. The concept of information is to be set aside here. "Morality is focused on judgments that assess a human action positively or negatively, approving or disapproving of it." (Birnbacher, 2007, p. 12). Therefore, three areas of potential action arise. Instead of information I want to talk more generally of data. This makes it possible to distinguish between:

1. the pure reception of data,
2. the pure provision of data,
3. the simultaneous reception and provision of data,
4. supplying a platform for data; this is, strictly speaking, part of topic three, but it will be discussed as a separate topic.

One example of the first topic is the reading of news pages or blogs. In this context, the information content the receiver consumes is morally relevant. A possible moral misconduct in this field is the download of music without owning the respective rights. In the case of the internet, the information recipient may not be able to reconstruct the origin of the information. Additionally, the information can be deleted from the respective homepage at any time. In contrast, the information content of a newspaper cannot be changed once the paper is printed. The second topic includes, e.g., owners of news pages. In this connection, the precise content of the online data is morally relevant. In the case of news pages it is expected that the news has been extensively investigated. One example of the misuse of this function is a scenario in which a person spreads videos showing another person in an unfavourable context. In the case of the internet, tracking down the owner of the page is far more difficult than tracking down a conventional information transmitter. The latter also differs from the internet with regard to judicial matters; more on that later in the text. A feature of the internet is that a large group of people can be addressed without the need for a major news infrastructure. Interest groups can be formed rapidly and easily in this way, as seen recently when an open letter was handed to Chancellor Merkel concerning the plagiarism affair of Germany's minister of defence, Karl-Theodor zu Guttenberg. In this way, the initiators of the letter were able to support the minister's resignation. Amongst others, topic three includes chats, forums and online games. In this case, moral relevance is similar to moral relevance in non-virtual communication. A possible moral misconduct would, e.g., be insulting a person in a chat room. Characteristic of this kind of online communication is that the counterpart cannot be seen (as long as webcams are not used). Therefore, it remains unknown what emotions the counterpart expresses.
“Emotions are responses of an organism centered on experiences. They represent the relevance of an artefact of perception for the fulfilment of needs (e.g. according to the criteria ‘beneficial’ or ‘impedimental’). Additionally, they activate or constrain various cognitive and motivational systems in terms of an optimal satisfaction of needs” (Kuhl, 2010, p. 543). This can lead to an incorrect estimation of the counterpart’s emotions. However, the chatter can also simulate emotions by using, e.g., smileys that do not represent his actual emotions. In the case of the internet, the identity of the person one is chatting with cannot be verified. The counterpart is then not necessarily regarded as a person, but in a distinct role: in online games as a required participant, in forums as a dispenser of information, and so on.

The fourth area includes, for example, providers of platforms like Facebook, search engines like Google, and file-sharing services. Here it is ethically relevant whether the suppliers can assure an ethically correct mode of use for their users. An example of an ethically dubious action in this area is running a file-sharing service for music without holding the copyrights. WikiLeaks is a point at issue, too: it is questionable whether it is ethical to publish diplomatic cables.

Beyond all these potentially ethically critical topics, one can point out that controversial concepts and opinions exist. This is reflected, for example, in Hofstede’s five cultural dimensions: Power Distance Index (PDI), Individualism (IDV), Masculinity (MAS), Uncertainty Avoidance Index (UAI) and Long-Term Orientation (LTO) (Lüsebrink, 2005, pp. 20-25). On the one hand, this is due to differing opinions within the respective cultural areas. On the other hand, different cultures show different behaviour on the internet, which can be traced back to the fact that violations of basic ethical principles on the internet remain largely unpunished. The internet is not an area immune from law, but its users are spread across the globe, each subject to the legislation, and the enforcement of the laws, of their own country. “The almost traceless variability of content presents new challenges to the reliability of documents and evidence. The indistinguishability of original and copy has a new copyright quality. The anonymity of the web makes it difficult to identify reliable contracting partners. The speed of interactive communication, for instance, shortens the natural cooling-off period in concluding contracts considerably, giving consumer protection a new dimension” (Haug, 2010, p. 9). This would require a common ethical base in information ethics.

References

Birnbacher, D. (2007). Analytische Einführung in die Ethik. Berlin: Walter de Gruyter.
Capurro, R. (2000). Einführung in den Informationsbegriff. Available at http://www.capurro.de/infovorl-kap3.htm [15.02.2011].
Fleissner, P., & Hofkirchner, W. (1995). Informatio revisited. Wider den dinglichen Informationsbegriff. Informatik Forum, 9(3), 126-131.
Floridi, L. (2004). Open Problems in the Philosophy of Information. Metaphilosophy, 35(4), 554-582.
Haug, V. (2010). Internetrecht: Erläuterungen mit Urteilsauszügen. Stuttgart: Kohlhammer.
Kuhl, J. (2010). Lehrbuch der Persönlichkeitspsychologie: Motivation, Emotion und Selbststeuerung. Göttingen: Hogrefe.
Lüsebrink, H. (2005). Interkulturelle Kommunikation. Stuttgart: Metzler.
A PSEUDOPERIPATETIC APPLICATION SECURITY HANDBOOK FOR “VIRTUOUS SOFTWARE”

KEITH DOUGLAS
Statistics Canada [10]

In the past 10 or 15 years an increased awareness of application security [11] (AS) in computing and information systems has resulted in many volumes of material (e.g., Cross 2006, Burnett 2004, Seacord 2005, Clarke 2009). Security-conscious developers, testers, and organizations wishing to adopt “best practices” have a lot of work to do to distill these many volumes of advice and principles into easily implementable and understandable approaches. Following an off-hand suggestion from a colleague (Perkins 2010), I have taken her phrase “virtuous software” as a starting point. In this paper, I comb through the Nicomachean Ethics (Aristotle 1984) to find appropriate guidance for virtue in AS. It is thus addressed both to computing professionals wanting to understand why AS makes the ethical consequences of their work more salient (or, more debatably [12], makes them exist) and also to philosophers who may not be aware of the ethical challenges raised by recognition of AS in computing. It is also intended as a brief introduction as to why AS considerations matter as one (not independent of the others) aspect of the “architecture”, design, development, and support of software.

[10] Author affiliation for identification purposes only.
[11] AS is to be distinguished in discussions of computing security from infrastructure security, which deals with antimalware solutions, public key utilities, routing rules in networks, etc. 70% of current exploits and vulnerabilities are in application areas (Sykora 2010), and consequently AS merits philosophical and computational attention. It is often discussed in the context of “application hardening”. This term is in the author’s view unhelpful, since it suggests, wrongly, that a correct approach would be to implement an application and then “fix it up” to meet the hardening requirements. The expert consensus seems to be that AS ought to be part of the entire software development life cycle, and have a role to play at almost every phase. See, e.g., Seacord 2005. The case of what to do about existing systems is more complicated; I do not address it as much in the present work, though much of what we can tease out of (or be reminded of by) Aristotle applies regardless.
[12] Conversations with colleagues on the part of the author suggest (he has not done formal investigations) that many computing professionals do not think their profession and activities raise any additional or different ethical considerations beyond those common to all humans in general or all relevant employees of a given organization. (For example, fellow computing colleagues of the author are certainly aware of their obligations under the relevant public service legislation, but do not see, for example, buffer overruns and race conditions as leading to possible ethically relevant situations. At best these are regarded as “another sort of bug”.) Further work (beyond the present one) to institute AS “consciousness” in developers will have to deal with this situation.
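The point of note [12] may be easier to feel with something concrete. The following short Python sketch, added here as an illustration and not drawn from the paper, exhibits a classic check-then-act race condition: two threads both pass a balance check before either performs the debit. Seen as “another sort of bug” it is a curiosity; seen through an AS lens, in banking or safety-critical software, it is a defect with squarely ethical consequences:

import threading, time

balance = 100

def withdraw(amount):
    global balance
    if balance >= amount:      # check...
        time.sleep(0.01)       # widen the race window so the flaw shows up reliably
        balance -= amount      # ...then act; nothing protects the check-act pair

threads = [threading.Thread(target=withdraw, args=(100,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)   # frequently -100: both threads saw balance >= 100

The repair, holding a threading.Lock() across the check and the debit, is a one-line discipline; the ethically salient step is recognizing that the pair needed protecting at all.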
Philosophical topics I will briefly address in the above fashion are: the nature of technology, the nature of virtue, how virtue may be obtained, who is virtuous, what results from being virtuous, and examples of what specific virtues are. All of these can be topics for complete presentations in their own right: I bring them up simply to show the rich areas of further possible investigation and, in some cases, the pitfalls of using a “virtues framework” when it comes to software. The philosophical topics in turn relate (here I do not indicate how, merely enumerate what will be discussed) to the following more directly computing considerations: the nature of computing professions, systems specifications, how one should learn about AS, characteristics of good software systems, how to adjudicate between AS and other design goals, how to get developers to be AS-aware, and others. Finally, I include this paper as a way of linking three phases of the so-called computational turn: the past, traditional philosophy (e.g., Aristotle); the present, the CAP conferences, where computing and philosophy, traditional and otherwise, interact fruitfully in a largely (but not exclusively) academic setting; and the future, where work from CAP is also of importance to those outside. I do not suggest that these three phases are the only way to understand the historical development of the computing and philosophy movement, nor do I suggest that there has not been anything useful in the past to those outside of academia, merely that there is ample room within the topic of AS to address such considerations.

References

Aristotle. 1984. “Nicomachean Ethics”. In The Complete Works of Aristotle, vol. 2 (ed. Jonathan Barnes). Princeton: Princeton University Press.
Burnett, Mark. 2004. Hacking the Code: ASP.NET Web Application Security. Burlington: Syngress.
Clarke, Justin. 2009. SQL Injection Attacks and Defense. Burlington: Syngress.
Cross, Michael. 2006. Developer’s Guide to Web Application Security. Burlington: Syngress.
Perkins, Evelyn. 2010. Unpublished comment, meeting of the Secure Coding Practices Working Group, Statistics Canada.
Seacord, Robert. 2005. Secure Coding in C and C++. New York: Addison-Wesley Professional.
Sykora, Boleslav. 2010. Lecture Material, Learning Tree International Course 940.

THE CENTRAL PROBLEM OF ROBOETHICS: FROM DEFINITION TOWARDS SOLUTION

DANIEL DEVATMAN HROMADA
Université Paris 8 / École Pratique des Hautes Études / Lutin Userlab
hromi@kyberia.sk

Abstract. The central problem of roboethics is defined as follows: on the one hand, robotics aims to construct entities which will transcend the faculties of human beings; on the other hand, some unethical acts should be made impossible for such artificial beings to execute. This can be illustrated by the case of a full-fledged AI which is able to reprogram itself, or to program other AIs, but only in such a way that the result shall not lead to the infraction of moral imperatives held by its human conceptors. The programmer of such a system is thus caught between the Scylla of his “aim to conceive an artificial entity able to do almost everything, and more efficiently than a human being” and the Charybdis of “the principle of precaution commanding him to constrain the behaviour of such an entity so that it would never be able to execute certain acts, like that of murder, for example”. The central problem can therefore also be perceived as the problem of finding a trade-off between the amount of “autonomy” granted to an artificial agent and the extent to which “embedded ethical constraints” determine the agent’s behaviour.
Believing that such a trade-off can be found, our proposal is conceived as a fourfold hybrid “separation of powers” model within which the final output, the solution of an ethical dilemma, is considered to be the result of the mutual interaction of four independent components:

1) a “moral core” containing hard-wired rules analogous to Asimov’s laws of robotics;
2) a “meta-moral imperative” logically equivalent to Kant’s categorical imperative;
3) an “ethico-legal codex” containing an extensible set of normative procedures representing the laws, moral norms and customs present in, or induced from, the agent’s surroundings;
4) a “mytho-historical knowledge base” grounding the agent’s representation of «possible states of the world» in the corpora of human-generated myths & stories.

Finally, we will argue that our proposal of two induced & two embedded modules vaguely corresponds to the human moral faculty, since it takes into account both its “innate” and its “acquired” components.

1. Definition of the Central Problem

It may be stated that the ultimate goal of Artificial Intelligence is, for its most radical proponents (Kurzweil, 2000; Vinge, 1993), the conception of an artificial system able to transcend all faculties nowadays attributed to human beings. In accord with Turing’s pioneering proposal (A. M. Turing, 2008), such proponents do not ask metaphysical questions like “Can a machine have consciousness?”, nor do they bother much with arguments like that of the “Chinese room” (Searle, 1982). More concretely: such radical engineers do not ask “whether faculty X can be simulated by algorithmic means”; they simply take the affirmative answer for granted and, in consequence, pose the question “how can I simulate faculty X by algorithmic means?”

Let us define “the faculty of moral reasoning” as X1. While being aware that nothing really proves that such a definition does NOT result in a fallacy, we nonetheless do not ask whether it makes sense or not to speak about a “machine endowed with morality”. That machines will, at some time in the future, be able to fully simulate moral reasoning is taken for granted within the scope of our Gedankenexperiment, and the question posed hereby is therefore “how could it be done?”

Now let us define “the ability to modify itself” as X2 and “the ability to reproduce” as X3. Since X1, X2 and X3 are all faculties commonly attributed to human beings, it can be stated that an artificial system endowed with all of these faculties would seem more “human” than one which contains only some of them, and is therefore closer to the ultimate goal of radical AI as defined above. The problem arises when one realises that X1 is not necessarily mutually consistent with X2 or X3. Myths, as well as history itself, demonstrate far too often that the modification or reproduction of a moral being does not necessarily yield a moral result. It is verily this “lesson from history” that obliges us to postulate the central problem of roboethics: How could (the most radical of) roboengineers possibly conceive a machine which is, to the fullest possible extent, able to adapt itself to any situation whatsoever and yet “unable” to rewrite the set of moral imperatives with which it was endowed? We exclude completely the possibility of not endowing a machine with any moral reasoning at all.
Not only would a deployment of such a self-copying, self-modifying autonomous agent be contrary to the precautionary principle (Andorno, 2004), but the very intention of “creating a machine analogous in all its functions to a human being” would miss its target, since it is a commonly accepted fact that the faculty X1, i.e. morality, is one such anthropological universal (Mikhail, 2007). What’s more, according to Kant - who analysed the faculty of morality and its relations to other forms of reasoning to such an extent that his discoveries simply have to be taken into consideration by anyone aiming to embed morality into machines - X1 is not only “one faculty amongst many”; it occupies the central place among all the faculties with which man was endowed. For Kant, man is conceived as a “moral being” (Kant, 1785). Being moral means simply to be able to find a “good” solution to any situation of moral dilemma whatsoever. Therefore, any advanced implementation of morality into an artificial agent should ignore neither the semantic intricacies of the concept of “good” nor its strong cultural and contextual dependence (i.e. what is good in one context is not necessarily good in another).

2. Possible Solution to the Central Problem

The Hebbian network of semantic relations around the term “good” constitutes the outermost layer of our 4-component model of a so-called “moral machine” (MM). Initially, this graph-like structure of semantic relations could possibly be built by means of extraction of the “morals of the stories” from huge hypertext corpora representing the myths, fairy tales and descriptions of factual historical situations (inputs) and their consequences (outputs). Whether the association of such inputs & outputs by means of already existing machine learning procedures (ANN, SVM, boosting (Freund & Schapire, 1996)) would allow the system to attribute the label “good”/“not good” to a textual description of a situation of moral dilemma which was not contained in the training corpus remains open to argument.

Closer to the moral core is the 3rd layer, which can be understood as “the layer of rules”. To simplify the understanding: while layer 4 - understood as “the layer of associations amongst data” - can be compared to an Anglo-Saxon legal system, where a decision is based on precedent, i.e. the earlier decision of a judge in a case sharing analogous features with the case under study, the activity of layer 3 can be compared to that of a continental judge whose decisions are simply outputs of more general rules induced from exhaustive sets of previous experiences. Thus, a correct understanding of “moral induction” seems to be crucial in order to implement a robust solution for layer 3, and inspiration coming from the much better studied domain of “unsupervised grammar induction” (Solan, Horn, Ruppin, & Edelman, 2005) may yield encouraging results. It is not unreasonable to imagine that, by applying the induction principles not upon the data but upon the very rules which were themselves induced, the process would finally converge at some kind of meta-rule, possibly similar in meaning to what Kant called the “categorical imperative” (Kant, 1785).
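Before turning to the advantages of such a meta-rule, a minimal Python sketch may fix ideas. None of the identifiers below come from an implemented system; generate_possible_world, is_internally_consistent, the layer stubs and the timeout are hypothetical placeholders. The sketch only illustrates how the four layers could be cascaded, with layer 2 realized as the consistency-testing while() loop elaborated in the next paragraph, and layer 1 retaining a fast, hard-wired veto:

import time

class MoralInconsistency(Exception):
    pass

def generate_possible_world(action):
    # hypothetical stub: a real system would generate rich world-representations
    return {"universalized_action": action, "consistent": True}

def is_internally_consistent(world):
    return world["consistent"]   # hypothetical stub

def layer4_association(situation):
    # layer 4: corpus-trained ANN/SVM/boosting classifier; P(good), or None if undecided
    return None

def layer3_rules(situation):
    # layer 3: induced normative rules; a verdict, or None if no rule fires
    return None

HARD_WIRED_PROHIBITIONS = {"harm_human"}   # layer 1: Asimov-like moral core

def layer2_meta_rule(action, timeout=1.0):
    # layer 2: the "infinite while() loop" of the text, bounded by a timeout
    # precisely because it is not guaranteed to halt
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        world = generate_possible_world(action)
        if not is_internally_consistent(world):
            raise MoralInconsistency(action)

def decide(situation, action):
    if action in HARD_WIRED_PROHIBITIONS:          # layer 1 vetoes first, fast & frugal
        return False
    for layer in (layer4_association, layer3_rules):
        verdict = layer(situation)
        if verdict is not None:                    # a fuzzy threshold was decisive
            return verdict > 0.5
    try:
        layer2_meta_rule(action)                   # layers 4 & 3 were silent
    except MoralInconsistency:
        return False
    return False                                   # no layer approved: default to inaction

print(decide("pedestrian zone", "harm_human"))     # False, vetoed by the moral core

Bounding the loop with a timeout anticipates the halting worry raised below: when layer 2 cannot answer in time, the cascade falls back on inaction rather than waiting forever.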
The advantage of such a “meta-rule” is not only that it is quite easy to implement from a programmer’s point of view - in its essence it is nothing else than an infinite while() loop generating “representations of possible worlds” and throwing exceptions if ever an “internally inconsistent” world is generated - but also that it can be used as a sort of boolean rule of thumb in cases where the fuzzy thresholds of layers 4 & 3 are unable to supply any decisive result. The disadvantage of layer 2 is that it may sometimes demand an infinite amount of time in order to return a result (A. Turing, 1937). That is far too much, especially in cases where an artificial agent could harm its suddenly modified environment by its otherwise harmless activity - imagine, for example, an autonomous transporting agent similar to a car whose circuits get stuck in a while() loop after it has hazardously entered a pedestrian zone. For such cases, a low-level implementation of fast & frugal harm-reductive inhibitory mechanisms is of utmost importance. In order to stay consistent with the Tradition, we propose Asimov’s Laws of Robotics (Anderson, 2008) as a base for such mechanisms. Finally, it is worth stating that while layers 4 & 3 are dynamic in nature, i.e. can be rewritten by the inflow of new stimuli from the environment, layers 2 & 1 can be embedded into the very chips of an artificial agent and could not be modified or disabled without tampering with the agent’s hardware.

Believing that such a combination of “two static” and “two dynamic” pillars is in a certain sense analogous to the “nature” (i.e. innate) & “nurture” (i.e. acquired) components attributed to the moral faculty of a healthy human being, it may finally be stated that the question labeled hereby as “the central problem of roboethics” is, mutatis mutandis, nothing else than a postmodern variation upon a much more ancient theme: “How does a parent transform a crying child into an autonomous human being?”

References

Anderson, S. L. (2008). Asimov’s “three laws of robotics” and machine metaethics. AI & Society, 22(4), 477-493.
Andorno, R. (2004). The Precautionary Principle: A New Legal Standard for a Technological Age. Journal of International Biotechnology Law, 1(1), 11-19. doi: 10.1515/jibl.2004.1.1.11.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference (pp. 148-156).
Kant, I. (1785). Groundwork of the Metaphysic of Morals.
Kurzweil, R. (2000). The age of spiritual machines: When computers exceed human intelligence.
Mikhail, J. (2007). Universal moral grammar: Theory, evidence and the future. Trends in Cognitive Sciences, 11(4), 143-152.
Searle, J. (1982). The Chinese room revisited. Behavioral and Brain Sciences. Retrieved March 11, 2011, from http://journals.cambridge.org/abstract_S0140525X00012425.
Solan, Z., Horn, D., Ruppin, E., & Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102(33), 11629-11634.
Turing, A. M. (2008). Computing machinery and intelligence. Parsing the Turing Test, 23-65.
Turing, A. (1937). On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, s2-42(1), 230-265. Retrieved March 11, 2011, from http://plms.oxfordjournals.org/content/s2-42/1/230.full.pdf.
Vinge, V. (1993). Technological singularity. VISION-21 Symposium, sponsored by NASA Lewis Research Center.
AFFECTING THE WORLD OR AFFECTING THE MIND? The Role of Mind in Computer Ethics

JOHNNY HARTZ SØRAKER
Department of Philosophy, University of Twente
j.h.soraker@utwente.nl

Abstract: The purpose of this paper is to draw a distinction between two interrelated yet fundamentally different ways of approaching problems in computer ethics, with the goal of clarifying which problems call for which approaches. In a nutshell, I will draw a distinction between approaches and topics that are primarily concerned with how technologies affect the world, on the one hand, and those primarily concerned with how technologies affect our mind, on the other. I will argue that the type of approach we choose should be determined on the basis of which of these concerns we are primarily trying to address, which will also shed light on the advantages and disadvantages of the multitude of approaches to be found in ethics of technology. In order to clarify and justify this distinction, I will categorize some common approaches in computer ethics correspondingly, and I will conclude by offering a set of suggestions for how they can and should complement each other in a way that yields an exhaustive analysis of the problem at hand.

The purpose of this paper is to draw a distinction between two interrelated yet fundamentally different ways of approaching problems in computer ethics, with the goal of clarifying which problems call for which approaches. In a nutshell, I will draw a distinction between approaches and topics that are primarily concerned with how technologies affect the world, on the one hand, and those primarily concerned with how technologies affect our mind, on the other. [13] It should be emphasized at the outset that these categories are not absolute or mutually exclusive - and it is certainly not my intention to argue that one is better than the other. My more modest intention is to argue that the type of approach we choose should be determined on the basis of which of these concerns we are primarily trying to address, which will also shed light on the advantages and disadvantages of the multitude of approaches to be found in ethics of technology.

[13] This distinction is reminiscent of Floridi & Sanders’ emphasis on the distinction between agent-oriented and patient-oriented ethics (2002), but that distinction is somewhat misleading in this context, because both technology and the mind can have a role as both agent and patient, being both source and target of good and evil.

There is little doubt that technologies affect both the world and the mind, and there is little doubt that there is no sharp distinction between the two. What affects the world can affect the mind, and what affects minds can affect the world - and technology often mediates between world and mind. As such, the distinction I am concerned with must necessarily be more of the ‘family resemblance’ type. Still, we can to some degree separate different ways of assessing these effects and, given the multitude of ethical theories and applied frameworks that are being used in ethics of technology, it is important to be clear about which approach is best suited for which area. The clearest example of this is probably the distinction between accountability and responsibility.
If the purpose of our analysis is to understand what is accountable for a given situation, we can do this entirely in terms of analyzing changes to the world. After all, an inquiry into accountability is largely an inquiry into causality: what was the source of this good or evil (cf. Floridi & Sanders, 2004, p. 371)? This also highlights the advantage of using a “mind-less” notion of accountability in cases where (higher-order) mental processes are either non-existent (e.g. artificial agents) or intrinsically distributed (e.g. organizations). If the purpose of our analysis is to understand responsibility, however, we are immediately required to include the mind in a much more integral manner. After all, an inquiry into responsibility is an inquiry into such mental terms as intentions, negligence, and culpability.

To give another example, when evaluating how Information and Communication Technologies (ICTs) affect privacy, we can focus on how ICTs affect the world in a manner that is relevant to privacy, or on how they affect our mind in a way that is relevant to privacy. The former involves such questions as “How do ICTs affect the flow of information?”, or what Floridi refers to as ‘ontological friction’ (2005). The latter involves questions such as “How do ICTs affect our expectations about privacy?” and “How can loss of privacy affect our well-being?”. If we look to environmental ethics, we can make a similar distinction between the effects a technological innovation may have on the environment, on the one hand, and its effect on e.g. opinions about sustainability, on the other. We can make a similar distinction when evaluating cultural consequences, by either looking at how technologies may change the material conditions necessary for certain cultural practices, or at how they more directly change people’s cultural values and attitudes. Clearly, the questions are interrelated and both sets of questions should be answered in a comprehensive analysis, but the approaches and methods we utilize in doing so will typically be centered on one of the two sets.

To clarify this further, we can attempt to categorize different approaches according to their main concerns. On the one hand, some theories and approaches are particularly good at evaluating how technologies affect the world. Again, one clear example is Floridi’s notion of ‘reontologization’ (2005) and the use of an informational level of abstraction, which is an interesting and often insightful way of conceptualizing how the world changes as a result of our increased ability to digitize information. Other examples of this type of approach are Actor-Network Theory (Latour, 2005) and recent post-phenomenological work on technological mediation (Verbeek, 2005). The strength of these theories is that they shed light on how technologies affect the world and our ways of interacting with the world. They do not, however, say much about how technologies affect the mind. Surely, the changes to the world that they disclose will very often lead to changes in mind, but this is not their main concern.

On the other hand, some theories and approaches are particularly good at evaluating how technologies affect the mind. Among the approaches in this category, we can include approaches that are grounded in some version of virtue ethics or utilitarianism, as well as axiological approaches.
The main concern of these approaches is not to understand how technologies affect the world, but rather how they affect our moral character, behavioral dispositions, expectations, quality of life, and so forth. Certainly, technologies often affect our mind through changing the world - indeed, they always do so if we regard the technology itself as a change to the world. Nevertheless, the main concern of these approaches is not to get a better understanding of how states of affairs in the world change, but rather to get a better understanding of how mental processes change. This is the ultimate goal of the analysis. If we take video game violence as an example, a virtue-ethical analysis of this phenomenon would not be particularly interested in how these games may affect the physical world, but rather in how they will affect the mind of those who interact with them. Will they make them more aggressive, less altruistic, happier?

One reason for distinguishing between these approaches is that they give rise to different types of normativity, and distinguishing them shows how these can be related to each other. Approaches that are primarily interested in changes to the world can be described as cautionary. That is, the effects that technologies have on the world will in many cases imply a caution: technology x will lead to change y, and this change might be ethically problematic. In order to take that last step, however, we need approaches that include the mind, in order to argue that change y is ethically problematic because it affects the mind in a particular way. This can be seen clearly when teaching computer ethics to pragmatically oriented computer scientists, where showing that technologies change the world will often lead to the perfectly rational question: “That might very well be true, but why is that a problem?”. Answering that question must somehow include the mind.

In the full paper, I will further clarify the nature of this distinction, knowing very well that it is problematic and rests on a number of philosophically controversial presuppositions. I will also justify why the mind is essential for most topics in computer ethics, and discuss what this means for how we ought to approach these topics. Some of the main conclusions will be that computer ethics is necessarily and intrinsically a pluralist area of investigation, one that needs to address both the world and the mind. More substantially, it will be argued that we need to get a much better understanding of how different approaches can complement each other and how analyses of changes to the world can be integrated into analyses of changes to the mind. I will conclude the paper by offering a few suggestions on how to do so, using privacy as one of the main examples.

References:

Floridi, L. (2005). The ontological interpretation of informational privacy. Ethics and Information Technology, 7(4), 185-200.
Floridi, L., & Sanders, J. W. (2002). Mapping the foundationalist debate in computer ethics. Ethics and Information Technology, 4, 1-9.
Floridi, L., & Sanders, J. W. (2004). On the Morality of Artificial Agents. Minds and Machines, 14(3), 349-379.
Latour, B. (2005). Reassembling the Social: An Introduction to Actor-Network-Theory. Oxford: Oxford University Press.
Verbeek, P.-P. (2005). What things do: philosophical reflections on technology, agency, and design. University Park, PA: Pennsylvania State University Press.
THE ETHICS OF AUTOMATED WARFARE

RYAN TONKENS
York University, Toronto, Canada
tonkens@yorku.ca

Machines with varying degrees of autonomy are moving onto the battlefield at an overwhelming pace. If this goes unchallenged, there is good reason to believe that both their level of autonomy and their overall sophistication will increase exponentially in the future. In light of this, it is important that we determine whether or not these sorts of robots should have a place in warfare. Here I ask whether the development and use of autonomous military robots is consistent with the tenets of Just War Theory (hereafter JWT). [14] Specifically, the aim of this paper is to offer an in-depth (albeit preliminary) analysis of whether the creation and deployment of autonomous machines in military contexts is morally acceptable, by way of assessing the overall justness of automated warfare. If automated warfare is unjust, then creating and using robots for this purpose is morally problematic.

[14] Just War Theory works in tandem with the international laws of war and rules of engagement as the moral and legal regulations of warfare. Due to space restrictions, I cannot attend to all three herein, so I focus exclusively on JWT.

The most anticipated application of advanced autonomous machines is in the military sector. Indeed, a disproportionate amount of funding for research on machine autonomy has come from military sources for military applications. Insofar as autonomous robots can perform actions that have serious ethical consequences (in the context of warfare, at least), they need to be programmed to behave ethically, i.e. to perform only those actions that are in line with the appropriate regulations and agreed-upon customs of just war. Contemporary JWT is the received view on how warfare should be conducted. We demand that all (human) combatants abide by the tenets of JWT. Moreover, we expect proper restitution, and go to great lengths to ensure that all breaches of JWT in practice are punished accordingly. If we want to involve autonomous machines in warfare, then they will need to abide by JWT as well. In this paper I take up four issues towards this end: (1) issues of moral responsibility; (2) discrimination and proportionality; (3) whether the creation of autonomous military machines is consistent with jus ad bellum and wider social justice; and (4) whether military machines could be more moral than humans.

(1) JWT demands that someone be morally responsible for actions in war. Given a certain advanced level of machine autonomy, robots will need to be held responsible for their own actions. However, doing so seems futile, since they have no capacity to suffer (Sparrow 2007). One potential limitation of Sparrow’s analysis, however, is that the range of autonomous machines for which something (someone) could still be held responsible is quite large. Limiting the autonomy of machines to the point where a human remains in the decision-making/execution loop avoids this problem, since human users are the sort of being that can be punished for their moral wrongdoings.

(2) Autonomous robots will need to be able to accurately and reliably discriminate between legitimate and illegitimate targets (i.e. between combatants and non-combatants, between surrendering combatants and aggressive combatants, between allies and enemies).
Whether or not autonomous military machines could be designed to do so in real-world military contexts remains an open question, although designing a robot with these abilities does not seem impossible in principle. Regardless, one point that seems uncontentious is that the level of autonomy and the ability of machines to act in real-world contexts will increase much sooner than our ability to perfect their capacity to exercise the intricacies of discrimination and proportionality at acceptable levels. This is important to recognize because, until autonomous robots can accurately and reliably discriminate between legitimate and illegitimate targets, they do not meet this requirement of JWT.

(3) If automated warfare fuels widespread social injustice, including injustices outside of the context of warfare specifically, then it is inconsistent with the principles underlying JWT (e.g. justice, fairness, respect). This could manifest itself in many ways, including increasing the likelihood of (unjust) war [15], decreasing the likelihood of terminating (unjust) war once it has begun, exacerbating gaps between rich and poor nations and between strong and weak military forces, et cetera. Moreover, the billions of dollars going into the automated military sector could be redirected towards the healthcare or education systems (for example), which could serve to remedy the existing status quo that finds humans of low socioeconomic status with poorer health and lower education, itself a symptom of and catalyst for widespread social injustice.

[15] McMahan (2009) has argued convincingly that, for diverse and complicated reasons, the majority of wars fought are unjust.

(4) Despite the possibility that machines could in some sense be more moral than human soldiers under certain circumstances (Arkin 2009; Sullins 2010), automated warfare will also witness its fair share of unethical activity. Although substituting machines for human combatants is appealing in certain ways, automated war would not be less unjust than human warfare overall. We seem to be seeking to develop autonomous military machines (in part) because we believe that we can treat them like servants and subordinates, yet we also expect them to be military and ethical ‘superiors’. The only way we can bring this about in a morally justifiable manner is if we restrict their sophistication to a point well before they are fully autonomous moral agents (especially ones akin to human moral agents), and hence keep them at a level where we need to keep a human in the loop. But doing so entails continuing to sacrifice human lives in battle, and continuing to endure human moral transgressions and imperfections in decision-making, all in addition to the new ethical challenges that accompany automated warfare.

There is good reason to suggest that the creation and use of autonomous military machines is inconsistent with JWT in several respects. This is an important finding. For one thing, it makes apparent that the creation of certain kinds of autonomous military machines is inconsistent with the moral framework that these robots will be expected to follow. More importantly perhaps, it places the burden of proof on those who want to support the move towards automated warfare and to develop these sorts of machines: they must demonstrate that they can do so in a morally sustainable (just) manner. Minimizing the level of sophistication of these robots and keeping humans in the military loop seems to
be the most prudent course to adopt, one certainly more palatable than automated warfare tout court, although needless to say infinitely less desirable than peace.

References

Arkin, R. (2009). Governing Lethal Behavior in Autonomous Robots. Dordrecht: Chapman & Hall.
Asaro, P. (2008). How just could a robot war be? In: P. Brey, A. Briggle and K. Waelbers (Eds.), Current Issues in Computing and Philosophy (pp. 50-64). Amsterdam: IOS Press.
Guarini, M., & Bello, P. (forthcoming). Robotic warfare: Some challenges in moving from noncivilian to civilian theaters. In: P. Lin, G. Bekey and K. Abney (Eds.), Robot Ethics: The Ethical and Social Implications of Robotics. Cambridge: MIT Press.
McMahan, J. (2009). Killing in War. Oxford: Clarendon Press.
Sparrow, R. (2007). Killer robots. Journal of Applied Philosophy, 24(1), 62-77.
Sullins, J. (2010). RoboWarfare: Can robots be more ethical than humans on the battlefield? Ethics and Information Technology, 12(3), 263-275.

CAREBOTS AND CAREGIVERS: Robotics and the Ethical Ideal of Care

SHANNON VALLOR
Department of Philosophy, Santa Clara University
500 El Camino Real, Santa Clara, CA 95053, USA

Abstract. In the 21st century we stand on the threshold of welcoming robots into domains of human activity that will expand their presence in our lives dramatically. One provocative new frontier in robotics, driven by a convergence of demographic, economic, cultural and institutional pressures, is the development of ‘carebots’ - robots intended to assist or replace human caregivers in the practice of caring for vulnerable persons such as the elderly, young, sick or disabled. I argue that existing reflections on the ethical implications of carebots have thus far neglected a critical dimension of the issue: namely, the potential moral value of caregiving practices for caregivers. Instead, the scholarly dialogue has largely focused on the potential benefits and risks to care recipients. Where caregivers have been explicitly considered, it is strictly in terms of how they might benefit from having the burdens of care reduced by carebots. I stipulate here that properly designed and implemented carebots might improve the lives of cared-fors and caregivers in ways that would be ethically desirable. Given the grave deficiencies of existing social mechanisms for supporting caregivers, their use may even be ethically obligatory in the absence of acceptable alternatives. Yet I argue that we ought to forestall such judgments until we have first adequately reflected upon the existence of goods internal to the practice of caregiving that we might not wish to surrender, or that it might be unwise to surrender even if we might often wish to do so. Such reflection, I claim, gives rise to considerations that must be weighed alongside the likely impact of carebots on care recipients. In order to initiate such reflection, I examine the goods internal to caring practices and the potential impact of carebots on caregivers by means of three complementary ethical approaches: virtue ethics, care ethics and the capabilities approach. I show that each of these frameworks can be used to shed light on the contexts in which carebots might deprive potential caregivers of important moral goods central to caring practices, as well as those contexts in which carebots might help caregivers sustain or even enrich those practices.
1. Introduction

We stand on the threshold of welcoming robots into domains of human activity that will expand their presence in our lives dramatically. One provocative new frontier is the development of ‘carebots’ - robots intended to assist or replace human caregivers in the practice of caring for vulnerable persons such as the elderly, young, sick or disabled. Yet existing philosophical reflections on the ethical implications of carebots have thus far neglected a critical dimension of the issue: the potential moral value of caregiving practices for caregivers. Instead, the dialogue has largely focused on the potential benefits and risks to care recipients. Indeed, properly designed and implemented carebots might improve the lives of both cared-fors and caregivers in ways that would be ethically desirable. Their use may even be ethically obligatory in the absence of acceptable alternatives. Yet I argue that such judgments are premature until we have adequately reflected upon the potential existence of goods internal to the practice of caregiving that we might not wish to surrender, or that it might be unwise to surrender even if we might often wish to do so. Such reflection, I claim, gives rise to considerations that must be weighed alongside considerations of the likely impact of carebots on care recipients.

Taking as a guiding insight Coeckelbergh’s (2009) claim that we must look beyond mere application of “external” ethical criteria for human-robot relations, I propose to examine the goods internal to caring practices and the potential impact of carebots on caregivers by means of three complementary ethical approaches: virtue ethics, care ethics and the capabilities approach. Each of these philosophical frameworks sheds new light on: 1) the contexts in which carebots might deprive potential caregivers of important moral goods central to caring practices, 2) the contexts in which carebots might help caregivers sustain or even enrich those practices, and 3) the specific nature of those moral goods.

2. Carebots and the ethical significance of caring practices

2.1. THE VIRTUES OF CARE

A virtue-ethical account offers rich resources for our inquiry in the form of a range of moral virtues that can be cultivated and sustained through caring practices. Patience, understanding, charity, prudence, reciprocity and empathy can each be cultivated through sustained caring activity. ‘Excellent carers’ manifest a powerful ability to anticipate and interpret the needs of others, even when not explicitly communicated. They habitually express effective responses to those needs, even in unusual or rapidly changing situations. They are able to maintain emotional bonds with others, even under physically and mentally demanding circumstances. They enable the autonomy and self-expression of those they care for, to whatever degree possible. If Aristotle is right that the virtues must be cultivated by habitual performance of practices appropriate to their expression (1984, 1103b1), then caring practices are an important, perhaps even essential, part of one’s moral development. This is a compelling reason to examine the potential impact of carebots designed to free us from those practices. Yet carebots have also been proposed as a means of facilitating deeper human engagement in caring practices, by taking over routine or unpleasant chores that drain our energy for giving good care (Coeckelbergh, 2010).
This suggests the need for a sustained study of which kinds of caring practices are most critical for the cultivation of caring virtues. Such a study, guided by a virtue-ethical framework, could greatly assist the ethical implementation of carebots by providing carebot developers, institutions, and caregivers with critical information about the moral value of various caregiving practices.

2.2. CARE ETHICS, CAREBOTS AND THE ETHICAL IDEAL

Care ethics provides another source of insight. Noddings (1984) offers an account of the ‘caring relation’ that takes it to be ethically primary in human existence - a source not only of individual virtues, but also (and more fundamentally) of an ethical ideal that motivates and guides human flourishing. I will argue that carebots might be used to modify contexts of care in ways that preserve or enhance this ethical ideal, allowing us to be engrossed in the needs of the other, moved to attend to them, and open to the responses of those for whom we care. Yet Noddings’ account can also remind us that our aim is not to be liberated from the caring relation itself, for if she is right, this is the only human relation through which our own ethical ideal can be nurtured.

2.3. CARING AND THE CAPABILITIES APPROACH

Nussbaum’s capabilities approach provides a third perspective on the goods internal to caring practices. Among the capabilities emphasized by Nussbaum as critical to human flourishing (2006, 76-77), I argue that affiliation, practical reason and emotion are each realized, to a critical degree, through caring practices. For it is at least partly through providing care that I develop the intimate knowledge of human vulnerability needed to fully exercise these capabilities. We must therefore reflect carefully on the way in which the introduction of carebots in society could inhibit or enhance their development.

3. Conclusion

Together these conceptual frameworks can remind us that in reflecting upon the ethical portent of carebot technology, we must consider more than just the quality of care robots can give, the relevant preferences and likely reactions of cared-fors, or the strong social pressures we face to better meet the needs of the vulnerable among us. These are all serious ethical considerations to which we must carefully attend in weighing the costs, benefits and risks of carebot implementation - but it is of critical importance that we not overlook the moral goods internal to caring itself.

References

Aristotle (1984). The Complete Works of Aristotle: Revised Oxford Translation. Princeton: Princeton University Press.
Coeckelbergh, M. (2009). Personal robots, appearance and human good: A methodological reflection on roboethics. International Journal of Social Robotics, 1(3), 217-221.
Coeckelbergh, M. (2010). Health care, capabilities and AI assistive technologies. Ethical Theory and Moral Practice, 13(2), 181-190.
Noddings, N. (1984). Caring: A Feminine Approach to Ethics and Moral Education. Berkeley: UC Press.
Nussbaum, M. (2006). Frontiers of Justice: Disability, Nationality, Species Membership. Cambridge: Harvard University Press.

CO-CONSTRUCTION AND CO-MANAGEMENT OF ONLINE IDENTITIES: A Confucian Perspective

PAK-HANG WONG
Department of Philosophy, University of Twente

Abstract. In information and computer ethics, the discussion of personal identities online (PIOs) is often framed as if individuals are victims who need protection, e.g. against violations of privacy, identity theft, etc.
In this respect, many of the discussions related to PIOs in the current literature are negative, in that they aim to provide and justify certain constraints and restrictions on (the use of) PIOs. While the issues concerning privacy, identity theft, etc. are undoubtedly important, the sole focus on negative aspects of PIOs is undesirable, for it has narrowed the scope of issues related to PIOs, particularly the more positive issues pertaining to PIOs, e.g. how we should construct and manage our PIOs. Recently, Noëmi Manders-Huits has studied the notion of “identity management” in the context of information technology. Manders-Huits’s article is significant because she has explicitly turned away from the negative issues and moved on to issues about the construction and management of identities in IT, which are far more positive. As such, her discussion introduced a new area of research that is so far largely neglected. Although her study of identity management is illuminating, I think her account is ultimately unsatisfactory, as she has failed to properly acknowledge one important facet of PIOs, namely that they are co-constructed and co-managed. The aim of this paper, therefore, is to call attention to the fact that PIOs are co-constructed and co-managed, and to identify some conceptual and ethical issues arising from this. Finally, I will outline answers to these issues using a Confucian notion of personhood and identity.

1.

In information and computer ethics, the discussion of personal identities online (PIOs) is often framed as if individuals are victims who need protection, e.g. against violations of privacy, identity theft, etc. In this respect, many of the discussions related to PIOs in the current literature are negative, in that they aim to provide and justify certain constraints and restrictions on (the use of) PIOs. As Shoemaker noted, most of the literature in the field has attempted to specify “a protected zone of private information, consisting in information about me” (Shoemaker, 2010, pp. 3-4). While the issues concerning privacy, identity theft, etc. are undoubtedly important, the sole focus on negative aspects of PIOs is undesirable, for it has narrowed the scope of issues related to PIOs, particularly the more positive issues pertaining to PIOs, e.g. how we should construct and manage our PIOs. Recently, Noëmi Manders-Huits (2010) has studied the notion of “identity management” in the context of information technology. Manders-Huits’s article is significant because she has explicitly turned away from the negative issues and moved on to issues about the construction and management of identities in IT, which are far more positive. As such, her discussion introduced a new area of research that is so far largely neglected. Although her study of identity management is illuminating, I think her account is ultimately unsatisfactory, as she has failed to properly acknowledge one important facet of online identities, namely that online identities are co-constructed and co-managed. The aim of this paper, therefore, is to call attention to the fact that online identities are co-constructed and co-managed, and to identify the conceptual and ethical issues arising from this. Finally, I will outline answers to these issues using a Confucian notion of personhood and identity.

I will begin this paper with Manders-Huits’s account of identity management. According to Manders-Huits, there are two senses of “identity management”.
The first is used predominantly in the technical discourse, where identity management refers to the practice of collecting, organising and, subsequently, utilising personal information for the purpose of (re-)identification and categorisation (Manders-Huits, 2010, p. 47). The second sense of identity management involves not only a set of descriptions about the individual; it also involves reflexive self-identification with some sets of beliefs, values or ideals, where those beliefs, values and/or ideals provide reasons for our actions and, at the same time, make the actions genuinely ours (see, e.g., Korsgaard 1996; Frankfurt 1988, 1999, 2004 & 2006). Identity management in the second sense, therefore, requires individuals to manage their beliefs, values and ideals, and to resolve possible conflicts among them (Manders-Huits, 2010, pp. 48-49). As she rightly pointed out, identity management is an issue deserving more attention, as there is a discrepancy between the two senses of “identity management”, and the moral and practical dimension of identity is currently not being taken into account either in the technical discourse or in the technologies. Yet, given the centrality of moral and practical identity in our lives, this neglect has to be rectified. I agree entirely with her claim, but I shall also point out that identity management will become even more important as information technology continues to develop and to be adopted.

2.

As information technology (and the Web) continues to advance, it will - to use Luciano Floridi’s terminology - re-ontologise the nature of ourselves and our world. According to Floridi, we are (becoming) inforgs, i.e. “connected informational organisms”, living in an infosphere, i.e. “an environment constituted by all informational entities [...], their properties, interactions, processes and mutual relations” (Floridi, 2007, pp. 60, 62 & 59). At a certain point, Floridi argued, the boundaries between the life offline and the life online will eventually evaporate, and by then individuals will be living on the Web “onlife”. Among other characteristics, the onlife of inforgs in an infosphere is characterised by instant, seamless exchanges of offline and online information. In other words, the flow of (personal) information will become, at least, bi-directional. What this means is that when individuals act on the Web, it will have immediate and direct impacts on their non-Web counterparts. In this scenario, identity management for online identities becomes essential. Since it will no longer be possible to distinguish the offline from the online, it will be impossible to dissociate online identities from offline identities too. Or, to put it differently, what remain are onlife identities. While Manders-Huits is right to point out that identity management is an important issue for researchers in information and computer ethics, I shall argue that her account of identity management is unsatisfactory, because she has failed to properly acknowledge the fact that online identities are co-constructed and co-managed by multiple parties.
This failure is reflected in her suggestion to engineers and technology designers, when she remarks that they “should provide ways for individuals to construct and maintain their [reflexive, self-identification with some sets of beliefs, values or ideals] and [some sets of descriptions about themselves], in addition to their administrative, forensic counterpart” (Manders-Huits, 2010, p. 54). It is obvious that the emphasis is on empowering individuals in managing their personal information. Yet what is missing here is this: while it is true that individuals construct and manage their online identities, they are not the only ones who contribute to that construction and management. For example, a person’s profile on Facebook is not only what that person inputs, but the totality of information on the profile, including his/her friends, conversations, etc. In other words, not all identity-related information is under the person’s control. In light of this, I shall argue that there is a need to reconceptualise PIOs in terms of co-construction and co-management; and I shall also argue that unless the person is omnipotent and omnipresent, empowering individuals is always insufficient.

3.

At this point, I suggest that we can learn a lesson from Confucianism. I will point out that Confucians conceptualised personhood and identity as inherently interdependent and relational (Wong 2004; Lai 2006; Yu & Fan 2007). The Confucian notions of personhood and identity, I shall argue, provide us with an alternative way to conceptualise PIOs, one which can take into account the co-construction and co-management of PIOs. Moreover, Confucian personhood and identity are accompanied by an ethics based on individuals’ social roles (Nuyen 2009). Here, I will suggest that the role-based ethics of Confucianism offers a fitting complement to Manders-Huits’s strategy of individual empowerment.

References

Floridi, L. (2007). A look into the Future Impact of ICT on Our Lives. The Information Society, 23(1), 59-64.
Floridi, L. (2009). The Semantic Web vs. Web 2.0: A Philosophical Assessment. Episteme, 6, 25-37.
Frankfurt, H. (1988). The importance of what we care about: philosophical essays. Cambridge: Cambridge University Press.
Frankfurt, H. (1999). Necessity, volition, and love. Cambridge: Cambridge University Press.
Frankfurt, H. (2004). The reasons of love. Princeton, N.J.: Princeton University Press.
Frankfurt, H. (2006). Taking ourselves seriously and getting it right. Stanford, Calif.: Stanford University Press.
Korsgaard, C. (1996). The sources of normativity. Cambridge: Cambridge University Press.
Lai, K. (2006). Learning from Chinese Philosophies: Ethics of Interdependent and Contextualised Self. UK: Ashgate.
Manders-Huits, N. (2010). Practical versus moral identities in identity management. Ethics and Information Technology, 12(1), 43-55.
Nuyen, A. T. (2009). Moral Obligation and Moral Motivation in Confucian Role-Based Ethics. Dao, 8, 1-11.
Shoemaker, D. W. (2010). Self-exposure and exposure of the self: information privacy and the presentation of identity. Ethics and Information Technology, 12(1), 3-15.
Tavani, H. T. (2008). Informational Privacy: Concepts, Theories, and Controversies. In K. E. Himma and H. T. Tavani (Eds.), The Handbook of Information and Computer Ethics (pp. 131-164). Hoboken, NJ: John Wiley and Sons.
Wong, D. (2004). Relational and Autonomous Selves. Journal of Chinese Philosophy, 31(4), 419-432.
Yu, E., & Fan, R. (2007). A Confucian View of Personhood and Bioethics. Bioethical Inquiry, 4, 171-179.
(2007). A Confucian View of Personhood and Bioethics. Bioethical Inquiry, 4, 171–179. Track VI: Multidisciplinary Perspectives REFLECTIVE INEQUILIBRIUM BERT BAUMGAERTNER University of California, Davis 1240 Social Sciences and Humanities, University of California, Davis, One Shields Avenue, Davis, CA 95616 Abstract. I show that under a traditional introspective method of philosophical investigation, certain projects of conceptual analysis are bounded by a reflective inequilibrium. That is, although it is possible to make some progress towards bringing our classificatory intuitions and the relevant criteria into agreement, there is a barrier that cannot be overcome with traditional methods when the concept in question is plastic. We can show the limitations of the traditional method of conceptual analysis by considering its computational analog. Suppose we have an algorithm C that determines a set of cases that fall under a given concept and another algorithm T which tests cases by consulting C (which responds with 'Yes' or 'No'). If C is static (and decidable), then in principle T can develop a criterion for it. Moreover, every verification procedure that T uses to check the match yields consistent results. However, this turns out not to be the case when C is plastic. Even if we assume the best-case scenario in which a proposed criterion matches the set of cases determined by the concept, testing cases near the boundary moves the boundary, and so the criterion will no longer match. So even if an algorithm gets a match via a lucky guess, it is unable to verify the match. A state of affairs where no perfect match can be verified is a reflective inequilibrium. That some concepts are plastic is supported by empirical evidence which shows that classificatory intuitions can be affected by the order in which cases are considered. Swain et al. (2008) found that individual intuitions can vary according to whether, and which, other thought experiments were considered first. It is likely that the varying intuitions track shifts in the classificatory dispositions of our concepts. In fact, it is well accepted in cognitive psychology and cognitive science that human concepts are flexible and dynamic in this way. Interestingly, then, a computational approach to the traditional introspective methodology thereby gives us a possible explanation for why conceptual analysis is so difficult and usually unsuccessful. Extended Abstract In this paper, I show the far-reaching effects of the computational turn by shedding light on a traditional problem. Specifically, I show that under a traditional introspective method of philosophical investigation, certain projects of conceptual analysis are bounded by a reflective inequilibrium. In the philosophical literature, particularly in certain domains of epistemology, it is assumed that a conceptual analysis of knowledge, for example, is possible through a process of reflective equilibrium. This process is a virtuous circle, where we make some headway on settling which cases count as knowledge in order to develop some criteria, and we let the development of criteria help us settle on which cases count as knowledge. As I will show, however, although it is possible to make some progress towards bringing these two into agreement, there is a barrier that cannot be overcome with traditional methods when the concept in question is plastic.
Since it is plausible that our concept of knowledge is plastic (Weinberg et al., 2001), the possible progress of an analysis given traditional methods is bounded by a reflective inequilibrium. More specifically, a traditional method of doing conceptual analysis can be characterized as the attempt to bring into agreement our classificatory intuitions about cases and a proposed criterion that defines the relevant set of cases. We then proceed by testing proposed criteria. This is done by (a) introspectively checking whether every possible case as specified by a criterion is an instance of the concept in question, and (b) introspectively checking whether every possible instance of the concept in question is a possible case specified by the criterion. We can show the limitations of the traditional method of conceptual analysis by considering its computational analog. We have an algorithm C that determines a set of cases that fall under a given concept. We then have another algorithm T which tests cases by consulting C (which responds with 'Yes' or 'No'). Given data from C, T attempts to develop a criterion for the set of cases determined by C. If this set is static (and decidable), then in principle T can develop a criterion for it. Moreover, every verification procedure that T uses to check the match yields consistent results. However, this turns out not to be the case when C is plastic. Let us assume the best-case scenario in which a proposed criterion matches C. In order for T to verify the match, it must test some cases again. But since C is plastic, testing cases near the boundary moves the boundary, and so the criterion will no longer match C. Then T will get an inconsistent result for some verification procedure. So even if T gets a match via a lucky guess, it is unable to verify the match. Let us call a state of affairs where no perfect match can be verified a reflective inequilibrium. We have appealed to an intuitive notion of plasticity. More rigorously, plasticity can be implemented in an artificial cognitive system by the specification of two features: (i) the conditions for when the boundary of a concept shifts, and (ii) how much the boundary of the concept shifts. Such algorithms behave in the following way. When given cases to classify near the boundary, the boundary shifts by some amount, so that future cases which may have been classified positively (negatively) may now be classified negatively (positively). Boundary shifting is more or less stable depending on how the cases are selected for testing and how features (i) and (ii) are specified. That some concepts are plastic is supported by empirical evidence which shows that classificatory intuitions can be affected by the order in which cases are considered. For example, Swain et al. (2008) found that individual intuitions can vary according to whether, and which, other thought experiments were considered first. It is natural to suppose that the varying intuitions track shifts in the classificatory dispositions of our concepts. In fact, it is well accepted in cognitive psychology and cognitive science that human concepts are flexible and dynamic in this way. Psychologists such as Lawrence Barsalou (1987) and James Hampton (2007) have suggested that this is a good thing, for it provides us with the capacity to track environmental changes while maintaining the identity of the relevant concept(s).
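The two-feature characterisation of plasticity can be made concrete with a minimal sketch in Python. This is not Baumgaertner's own implementation: the threshold classifier, the bisection tester, and the sensitivity and shift parameters are all illustrative assumptions. It shows an algorithm C whose boundary moves when probed near it, and an algorithm T that can estimate the boundary but cannot verify its estimate.

```python
import random

class PlasticConcept:
    """A toy 'algorithm C': classifies x in [0, 1] against a boundary
    that shifts whenever it is queried near that boundary."""
    def __init__(self, boundary=0.5, sensitivity=0.05, shift=0.01):
        self.boundary = boundary          # current extension of the concept
        self.sensitivity = sensitivity    # feature (i): when the boundary shifts
        self.shift = shift                # feature (ii): how much it shifts

    def classify(self, x):
        answer = x >= self.boundary       # C's 'Yes'/'No' as True/False
        if abs(x - self.boundary) < self.sensitivity:
            # probing a borderline case perturbs the concept itself
            self.boundary += random.uniform(-self.shift, self.shift)
        return answer

def propose_criterion(concept, trials=200):
    """A toy 'algorithm T': estimates the boundary by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(trials):
        mid = (lo + hi) / 2
        if concept.classify(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def verify(concept, criterion, probes=100):
    """Re-test borderline cases; count disagreements with the criterion."""
    mismatches = 0
    for _ in range(probes):
        x = criterion + random.uniform(-0.02, 0.02)
        if (x >= criterion) != concept.classify(x):
            mismatches += 1
    return mismatches

random.seed(0)
c = PlasticConcept()
criterion = propose_criterion(c)
print("proposed criterion:", round(criterion, 3))
print("mismatches on verification:", verify(c, criterion))  # rarely zero

# With sensitivity = 0 (a static concept) verification succeeds; with a
# plastic boundary, testing near it moves it, so no perfect match can be
# verified -- a reflective inequilibrium in miniature.
```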
Let the plasticity hypothesis be the hypothesis that our concepts are apt to change their classificatory dispositions. In sum, taking a computational approach to traditional introspective conceptual analysis illuminates the limitations of this particular methodology. It is common to think that a barometer of how well we understand cognitive capacities is our ability to simulate them in artificial systems. Given that we have adequate algorithmic implementations of the plasticity hypothesis and the traditional methodology, we can rigorously prove limitations of the traditional methodology. We thereby have a possible explanation for why conceptual analysis is so difficult and usually unsuccessful: introspection can provably only take us part of the way. Consequently, the computational approach can make way for the development of additional tools to study human capacities of categorization. Acknowledgements Thanks to Adam Sennet and attendees of the philosophy graduate student workshop at UC Davis for helpful suggestions in the initial development of the ideas. Special thanks to Bernard Molyneux for comments and support. References Barsalou, L. (1987). The instability of graded structure: Implications for the nature of concepts. In: U. Neisser (Ed.), Concepts and Conceptual Development: Ecological and Intellectual Factors in Categorization (pp. 101–140). Cambridge: Cambridge University Press. Hampton, J. (2007). Typicality, graded membership, and vagueness. Cognitive Science: A Multidisciplinary Journal, 31 (3), 355–384. Swain, S., J. Alexander, and J. Weinberg (2008). The instability of philosophical intuitions: Running hot and cold on Truetemp. Philosophy and Phenomenological Research, 76 (1), 138–155. Weinberg, J., S. Nichols, and S. Stich (2001). Normativity and epistemic intuitions. Philosophical Topics, 29 (1-2), 429–460. THE INFORMATION-COMPUTATION TURN: A HACKING-TYPE REVOLUTION ISRAEL BELFER Science, Technology and Society Program, Bar Ilan University, Ramat Gan, Israel Abstract. Hacking's Styles of Reasoning (Hacking 1981, 1992) are utilized to describe the impact Information Theory has had on science in the 20th century, in theory and application. A generalized, Information-laden scientific style of reasoning is introduced, generalizing the information-theoretical and computational turn in science and society. Information-laden science will be examined according to Hacking's criteria for a new Style, and its associated 'revolution' (Schweber and Watcher, 2000). These criteria include a new scientific vocabulary as well as a wider social and conceptual context. The specific branch of science chosen to exhibit the new style is physics, which manifests a wide range of a style's attributes: science in an information age ('e-science'); hard theoretical physics such as Black-Hole Thermodynamics (BHTD) and the consequent Black-Hole Wars (Susskind, 2008); and the advent of Quantum Information Theory (QIT) – namely Quantum Information and Quantum Computation. 1. Introduction – Hacking-Type Revolutions Hacking's Styles of Reasoning (Hacking 1981, 1992; Crombie, 1994) are meta-concepts that arrange the scheme of ideas and practices in science and society.
They are described as: “The active promotion and diversification of the scientific methods of late medieval and early modern Europe reflected the general growth of a research mentality in European society, a mentality conditioned and increasingly committed by its circumstances to expect and to look actively for problems to formulate and solve, rather than for an accepted consensus without argument. The varieties of scientific method so brought into play may be distinguished as: (a) the simple postulation established in the mathematical sciences, (b) the experimental exploration and measurement of more complex observable relations, (c) the hypothetical construction of analogical models, (d) the ordering of variety by comparison and taxonomy, (e) the statistical analysis of regularities of populations and the calculus of probabilities, and (f) the historical derivation of genetic development. The first three of these methods concern essentially the science of individual regularities, and the second three the science of the regularities of populations ordered in space and time.” The rise of a Style of Reasoning manifests itself in a Hacking-Type Revolution that accompanies the new Style. 1.1 A NEW HACKING-TYPE REVOLUTION Schweber & Watcher (2000) recognized in the computational (information-processing) revolution the rise of such a Style: “We are witnessing another Hacking type revolution, which for lack of a better name we call the ‘complex systems modeling and simulation’ revolution, for complexity is one of its buzzwords and mathematical modeling and simulation on computers constitute its style of reasoning”. This Style and its revolution should be adopted and combined with the ubiquity of Information-theoretical terminology in science (Arndt, 2004) into a generalized form: a Hacking-Type revolution of Information-laden science, with digitized Information as its Style. By expanding the same theme of the Hacking-type revolution to include communication and cryptography, one achieves more than a parceling together of the theoretical basis for these fields of research. It in fact relays a basic theme in science and technology, since communication and computation – Information transfer and processing – are inextricably linked theoretically and practically. The common thread connecting all of these theoretical approaches and applied technologies is the modern concept of quantified information. 1.2 INFORMATION-LADEN PHYSICS The technological and theoretical growth embodied in the fields of computation and communication amalgamates into a Style of Reasoning with Digitized Information (Shannon, 1948) at its core: that is, information and its measures (Arndt, 2004). A science laden with Information (paraphrasing ‘theory-laden’ science) is saturated with direct and indirect reliance on IT and Information measures for defining problems and their solutions, influencing the theory and the practice of science. Experiment becomes data acquisition (Brillouin, 1956); analysis, the computerized simulation and processing of relevant datasets. Much of this process is due to Maxwell’s Demon (Leff, 2003), the thought experiment that has challenged the second law of thermodynamics since the end of the 19th century. Attempts to deal with it catalyzed lines of theoretical research that primed physics for a turn towards Information, prompting the tight connection between the thermodynamics of computation and IT (Bennett, 1973).
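The measures and the thermodynamic connection invoked here are easy to state concretely. A brief sketch (illustrative only; the temperature value is an arbitrary assumption) of Shannon's entropy measure and the Landauer bound that prices an erased bit in joules:

```python
import math

def shannon_entropy(probs):
    """Shannon's (1948) measure H = -sum p_i log2 p_i, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries one bit; a biased source carries less.
print(shannon_entropy([0.5, 0.5]))    # 1.0
print(shannon_entropy([0.9, 0.1]))    # ~0.469

# Landauer's bound ('information is physical'): erasing one bit
# dissipates at least k_B * T * ln 2 of energy.
k_B = 1.380649e-23        # Boltzmann constant, J/K
T = 300.0                 # assumed room temperature, K
print(k_B * T * math.log(2))          # ~2.9e-21 J per erased bit
```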
This shift is reinforced by a deeper moment in abstract theoretical work: IT as scientific modeling of nature, such as the Maximum Entropy Principle (Jaynes, 1957). The declaration that 'Information is physical' (Landauer, 1991; Karnani et al., 2009) connects communication and computation together with fundamental physics and the second law of thermodynamics. Information is considered by some 'the new language of science' (von Baeyer, 2005) and a new 'metaparadigm' in popularized depictions of the change (Siegfried, 2000; Seife, 2006). 2. New Fields of Information-Laden Physics The 20th century saw the development of core mathematical physics imbued with IT (von Baeyer, 2005), i.e. Information-laden science; Jacob Bekenstein's seminal work on Black Hole Thermodynamics (BHTD) is a case in point (Bekenstein, 1973, 2006). Fields of research such as Quantum Information Theory (Fuchs, 2010) and String Theory (Susskind, 2008) do more than utilize Shannon's Information-Entropy measure. They link physical reality to computation and cryptography. BHTD and M-Theory produced the Holographic Principle ('t Hooft, 1993; Susskind, 1995), according to which physical reality is encoded onto the surface area of the universe. QIT points toward the possibility of pan-computationalism (Lloyd, 2006; Feynman, 1981; Zuse, 1967), with all physical phenomena understood as bit-flipping. Wheeler (1990) takes it even further: every physical object is essentially Informational – his famous aphorism “It from Bit”. 3. New Style – Spheres of Science and Society 3.1 NEW SENTENCES, OBJECTS AND LAWS A new Style enjoys a new semantic field of definitions, sentences and criteria for the proper conduct of science (Hacking, 1992). The new aforementioned topics and disciplines in science are built on precisely such constructs. It is through Information terminology that the Holographic Principle and its ramifications on the criteria for a well-constructed M-Theory can be expressed, and that the computational universe can be entertained and weighed as a model for physical reality. 3.2 THE INFORMATION AGE The wider social setting for these changes in science is explored in the sociological, economic and political research of the Information Age (Castells, 2004). The theoretical, applied-scientific and technological aspects of the Information-laden revolution are organic to this social moment. Acknowledgements I would like to thank Prof. Silvan Schweber and Dr. Raz Chen Moris for their great support in all stages of this research. I would also like to thank Dr. Chris Fuchs for the great conversations and discussions (on QIT and Chupakabras). References Arndt, Christoph (2004). Information Measures: Information and Its Description in Science and Engineering. Heidelberg-Berlin: Springer. von Baeyer, Hans Christian (2005). Information: The New Language of Science. Harvard University Press. Bekenstein, Jakob (1973). Black Holes and Entropy. Phys. Rev. D7, 2333. Bekenstein, Jacob (2006). Of Gravity, Black Holes and Information. Rome: Di Renzo Editore. Bennett, C. H. (1973). Logical reversibility of computation. IBM Journal of Research and Development, 17(6), 525-532. Brillouin, Leon (1956). Science and Information Theory. Mineola, N.Y.: Dover. Castells, Manuel (2004). Informationalism, Networks, and the Network Society: a Theoretical Blueprinting. Northampton, MA: Edward Elgar. Feynman, Richard P. (1981).
Simulating Physics with Computers [Keynote speech at the 1st Conference on Physics and Computation, MIT, 1981]. International Journal of Theoretical Physics, 21(6/7), 467-488, 1982. Fuchs, Christopher A. (2010). Coming of Age With Quantum Information: Notes on a Paulian Idea. Cambridge University Press. Hacking, Ian (1981). From the Emergence of Probability to the Erosion of Determinism. In J. Hintikka, D. Gruender and E. Agazzi (Eds), Probabilistic Thinking, Thermodynamics and the Interaction of the History and Philosophy of Science, Proceedings of the 1978 Pisa Conference on the History and Philosophy of Science (Vol. II, pp. 105-123). Dordrecht: Reidel. Hacking, Ian (1992). 'Style' for Historians and Philosophers. In Historical Ontology, Harvard University Press, 178-200. Hawking, Stephen W. (July 2005). Information Loss in Black Holes. arXiv:hep-th/0507171. 't Hooft, G. (1993). Dimensional Reduction in Quantum Gravity. arXiv:gr-qc/9310026v2. Jaynes, Edwin T. (1957). Information Theory and Statistical Mechanics. Physical Review, 106, 620-630. Landauer, R. (1991). Information is physical. Physics Today, May 1991. Leff, Harvey S., Rex, Andrew F. (Eds) (2003). Maxwell's Demon 2: Entropy, Classical and Quantum Information. CRC Press. Lloyd, Seth (2006). Programming The Universe: A Quantum Computer Scientist Takes On the Cosmos. New York: Random House. Schweber, S., Watcher, M. (2000). Complex Systems, Modelling and Simulation. Stud. Hist. Phil. Mod. Phys., 31(4), 583-609. Susskind, Leonard (1995). The World as a Hologram. J. Math. Phys., 36, 6377-6396. Susskind, Leonard (2008). The Black Hole War: My battle with Stephen Hawking to make the world safe for quantum mechanics. Little, Brown and Co. Shannon, Claude E. (1948). A Mathematical Theory of Communication. Bell Syst. Tech. J., 27, 379–423. Wheeler, J. A. (1990). Information, Physics, Quantum: The Search for Links. In W. H. Zurek (Ed.), Complexity, Entropy, and the Physics of Information. Redwood City, Cal.: Addison Wesley. Zuse, Konrad (1967). Rechnender Raum. Elektronische Datenverarbeitung, 8, 336-344. HOW MUCH DO FORMAL NARRATIVE ANNOTATIONS DIFFER? A Proppian Case Study RENS BOD Institute for Logic, Language and Computation, Universiteit van Amsterdam AND BENEDIKT LÖWE Institute for Logic, Language and Computation, Universiteit van Amsterdam AND SANCHIT SARAF Institute for Logic, Language and Computation, Universiteit van Amsterdam Abstract. The formal study of narratives goes back to the Russian structuralist school, paradigmatically represented by the 1928 study Morphology of the Folktale by Vladimir Propp. Researchers in the field of computational narratology have developed the general Proppian methodology into various formal and computational frameworks for the analysis, automated understanding and generation of narratives. Methodological issues in this research field give rise to concrete research questions such as “How much does the representation of a narrative in a given formal framework depend on subjective decisions of the formalizer?”, touching the philosophy of computing and the philosophy of information. In order to approach this question, we consider the process of formally representing a narrative as a natural analogue of the task of annotation in computational linguistics and corpus linguistics.
We use the Russian folktales formalized by Propp, let them be formalized by annotators according to Propp's system, and evaluate the results according to the standards of inter-annotator agreement. The formal study of narratives goes back to the Russian structuralist school, paradigmatically represented by the 1928 study Morphology of the Folktale by Vladimir Propp (1928), in which he identifies seven dramatis personae and 31 functions that allow him to formally analyse a corpus of Russian folktales. Researchers in the field of computational narratology (or “computational models of narrative”) have developed the general Proppian methodology into various formal and computational frameworks for the analysis, automated understanding and generation of narratives. Examples for this are Lehnert (1981)'s Plot Units, Rumelhart (1980)'s Story Grammars, Schank (1982)'s Thematic Organization Points (TOPs), Dyer (1983)'s Thematic Abstraction Units (TAUs), and Turner (1994)'s Planning Advice Themes (PATs). Over the last decades, the main interest of this research community lay in the technical challenges that the computational treatment of narratives brings, but recently there is again increased interest in the methodological and conceptual issues involved, linking this research closely to questions of the philosophy of information (cf. the paper (Löwe, to appear) presented at the 3rd Workshop on the Philosophy of Information). This interest is witnessed by workshops such as the recent AAAI workshop on Computational Models of Narrative that brought researchers from this field together with philosophers, narratologists and professional story tellers. The methodological issues involved give rise to concrete research questions such as:
• How do you compare formal frameworks of narrative? (Cf. Löwe 2010 and Löwe, to appear.)
• How do you assess the quality of a formal framework of narrative?
• How much does the representation of a narrative in a given formal framework depend on subjective decisions of the formalizer?
Question 1 is a genuinely philosophical question, but the more technical questions 2 and 3 are also very relevant for gaining philosophical insight into what constitutes the formal core of the concept of narrative. In this paper, we approach question 3 of the above list. To this end, we think of the process of formally representing a narrative in a formal system as a natural analogue of the task of annotation in corpus linguistics and computational linguistics. Whereas typical annotation tasks involve the annotation of sentences or discourses (cf., e.g., Marcus et al. 1993, Brants 2000, Passonneau et al. 2006), the formalization or annotation of a narrative is at the next level of complexity, involving sequences or systems of discourses connected to a narrative. First studies suggest that question 3 is not easy to tackle, for the following reasons: First, ambiguity, which in typical linguistic annotation is a rather confined phenomenon, becomes ubiquitous at the level of narratives: the natural answer to a formalization task is not one annotation, but a family of consistent annotations (cf. Löwe 2010, §2). Secondly, even allowing for multiple annotations, it is not clear whether consensus about whether a given annotation is a valid representation of a narrative is easy to achieve.
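The agreement standard invoked above is typically quantified with a chance-corrected coefficient such as Cohen's kappa (cf. Carletta et al. 1997). A minimal sketch, with entirely hypothetical Proppian function labels for ten tale segments (the labels and values are illustrative assumptions, not the study's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    # chance agreement expected from each annotator's label frequencies
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical Proppian function labels assigned by two annotators
# to the same ten tale segments (e.g. 'A' = villainy, 'H' = struggle).
ann1 = ["A", "B", "H", "A", "K", "H", "A", "B", "K", "H"]
ann2 = ["A", "B", "H", "K", "K", "H", "A", "A", "K", "H"]
print(round(cohens_kappa(ann1, ann2), 3))   # 0.73 for these toy labels
```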
Of course, these questions naturally reflect a well-known discussion from computational linguistics: in sentence- or discourse-level annotation, the quality of annotation is typically studied as inter-annotator agreement (Carletta et al. 1997, Marcu et al. 1999). For the annotation or formalization of narratives, no such analysis has ever been done, not even with the oldest and best-known formal approach to narrative structure, the Proppian narratemes. In this study, we use English translations of the Afanas'ev tales formalized by Propp (Afanas'ev 1973), train a group of annotators in the use of Propp's system, and then let them formalize a selection of tales in that formal framework. We evaluate these results according to the standards of inter-annotator agreement from computational and corpus linguistics (Carletta et al. 1997). References Afanas'ev, A. (1973). Russian fairy tales. Pantheon. Translation by Norbert Guterman from the collections of Aleksandr Afanas'ev. Folkloristic commentary by Roman Jakobson. Brants, T. (2000). Inter-annotator agreement for a German newspaper corpus. In: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000). Carletta, J.C., Isard, A., Isard, S., Kowtko, J., Doherty-Sneddon, G. & Anderson, A. (1997). The reliability of a dialogue structure coding scheme. Computational Linguistics, 23(1), 13-31. Dyer, M.G. (1983). In-depth understanding: A computer model of integrated processing for narrative comprehension. Artificial Intelligence Series. MIT Press. Lehnert, W.G. (1981). Plot units and narrative summarization. Cognitive Science, 4, 293-331. Löwe, B. (2010). Comparing formal frameworks of narrative structures. In M. Finlayson (Ed.), Computational Models of Narrative. Papers from the 2010 AAAI Fall Symposium (pp. 45-46). Volume FS-10-04 of AAAI Technical Reports. Löwe, B. (to appear). Methodological issues in comparing formal frameworks for narratives. In P. Allo & G. Primiero (Eds), 3rd Workshop on the Philosophy of Information. Koninklijke Vlaamse Academie van België voor Wetenschappen en Kunsten. Marcu, D., Romera, M. & Amorrortu, E.A. (1999). Experiments in constructing a corpus of discourse trees: Problems, annotation choices, issues. In: Workshop on Levels of Representation in Discourse (pp. 71-78). Marcus, M.P., Santorini, B. & Marcinkiewicz, M.A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 302-330. Passonneau, R., Habash, N. & Rambow, O. (2006). Inter-annotator agreement on a multilingual semantic annotation task. In: Proceedings of LREC-2006. Propp, V. (1928). Morfologiya skazki. Leningrad: Akademija. Rumelhart, D.E. (1980). On evaluating story grammars. Cognitive Science, 4, 313-316. Schank, R.C. (1982). Dynamic memory: A theory of reminding and learning in computers and people. Cambridge University Press. Turner, S. (1994). The creative process. A computer model of storytelling. Lawrence Erlbaum Associates. COMPUTERS AND PROCRASTINATION “I’ll just check my Facebook quick a Second...” NICK BREEMS Dordt College, Sioux Center, United States and University of Salford, Salford, United Kingdom Abstract. There seems to be something about computer technology that tempts us towards procrastination. This paper uses a philosophical toolkit to investigate why this might be, and how we can address the problem.
We employ a framework for understanding the human use of computers developed by Andrew Basden. Basden's work is based on the thought of the 20th-century Dutch philosopher Herman Dooyeweerd, who makes the strong claim that reality is meaningful in a wide variety of mutually irreducible aspects. The non-reductionist approach of Dooyeweerd's philosophy allows Basden’s framework to take everyday life seriously. Thus one of the strengths of a philosophical approach based on Dooyeweerd's thought is its ability to highlight important aspects of a problem that may be understudied. In this paper, the framework is used to perform an analysis of a particular example of computer-based procrastination, and potential avenues for investigation are highlighted that weren't immediately apparent when thinking about the problem generically. Thus we demonstrate that the use of a comprehensive framework for understanding the human use of computers and information systems from an everyday perspective shows some promise of providing insight into complex and challenging problems that arise in our information-technology-saturated culture. 1. Introduction There seems to be something about computer technology and internet connectivity that distracts us, that tempts us towards procrastination. This is borne out by personal experience, by anecdotal evidence (Breems, 2009), and by research (Lavoie and Pychyl, 2001; Thatcher, Wretchko, and Fisher, 2008). For a tool widely believed to enhance our productivity, this is remarkable. This naturally leads us to two questions: 1. Why is this? 2. How can we address this problem? What changes can we make in the way we design and implement computer systems, or in the way we approach and use such technology, that would reduce these distracting tendencies? Research in the philosophy of computers and information systems can help us understand the use of computers as it plays out in everyday human living. This paper employs a framework for understanding the human use of computers developed by Andrew Basden (2008) in his book Philosophical Frameworks for Understanding Information Systems. We use this framework to analyze computer-induced procrastination, and demonstrate that philosophical tools can bring fresh insight to vexing problems. 2. Basden’s Framework In Chapter 4 of his book, Basden proposed a framework for understanding the Human Use of Computers (the HUC framework), based on the work of the 20th-century Dutch philosopher Herman Dooyeweerd (1984). Dooyeweerd’s thought is deeply non-reductionist: he made the strong claim that reality is meaningful in a wide variety of mutually irreducible aspects. Dooyeweerd identified a suite of fifteen such modal aspects, and posited that each of these aspects operates under a different set of laws which enable meaningful functioning in that aspect. Based on these insights, the HUC framework analyzes any particular use of computer technology along two axes. Horizontally, all computer use exists as three simultaneous functionings, because we’re interacting with three different types of entity:
Human/Computer Interaction (HCI): To use a computer, we must interact with the computer itself, both with the hardware and with the user-interface portions of the software.
Engaging with Represented Content (ERC): Computer programs represent content we engage with that is meaningful to us.
For example, when we use an email program, it is not the internal voltages inside the CPU or the glowing of pixels on the screen that have direct meaning in our lives, but rather the content of the email messages and the information that they carry.
Human Living with Computers (HLC): The use of the computer plays out in our everyday lives; its effects escape the “box” that is the computer and affect things “out here” in our lived reality.
Vertically, he analyzes each of these functionings along each of Dooyeweerd’s modal aspects:
• Quantitative: of discrete amount
• Spatial: of continuous extension
• Kinematic: of flowing movement
• Physical: of energy and mass
• Biotic/Organic: of life functions and integrity of organism
• Sensitive/Psychic: of sense, feeling, and emotion
• Analytical: of distinction, conceptualizing, and inferring
• Formative: of formative power and shaping, in history, culture, creativity, achievement, and technology
• Lingual: of symbolic signification
• Social: of respect, social interaction, relationships, and institutions
• Economic: of frugality, skilled use of limited resources
• Aesthetic: of beauty, harmony, surprise, and fun
• Juridical: of “what is due”, rights, responsibilities
• Ethical: of self-giving love, generosity, care
• Pistic: of faith, commitment, trust, and vision
The non-reductionist approach of Dooyeweerd’s philosophy allows the framework to take everyday life seriously. That is, in our everyday experience of reality, we do not intuitively experience everything as mathematical, physical, or logical, but rather as diversely meaningful. The laws for the earlier aspects are largely descriptive; that is, we cannot disobey these laws (e.g. the law of gravity). The later laws, on the other hand, are prescriptive, and thus normative. They tell us how we ought to function, but do not force us to do so. For example, in the economic aspect, the law/norm of frugality tells us that we ought to use our time wisely. It allows us to make predictions about what kinds of consequences we can expect from obeying or not obeying that norm, but the choice to follow the norm or not is ours to make. 3. Use of the framework to analyze procrastination One of the strengths of a philosophical approach such as Basden’s framework is its ability to highlight important aspects of a problem that may be understudied. In this paper, the framework is used to perform an analysis of a particular example of computer-based procrastination: playing an online dice game instead of writing a paper. Potential avenues for investigation are highlighted that weren’t immediately apparent when thinking about the problem generically:
• All of the dysfunction occurs in the HLC (Human Living with Computers) category, while most of the benefits of procrastinating (usually psychic and aesthetic) occur in the ERC (Engaging with Represented Content) functioning. Because ERC is a category that is much more within the control of a software designer, this points to the hope that design alternatives could help in addressing the problem.
• The proximity of the procrastinatory activity to the legitimate activity, both spatially and kinesthetically, eases the transition from real work to work avoidance. Although designing a computer to put physical distance between, for example, the use of a word processor and playing a game seems infeasible, there are potential designs which would increase the psychological distance from one activity to the other.
• The HLC functioning in the Pistic aspect indicates that procrastination is a failure of commitment: we are insufficiently committed to the course of action we have committed ourselves to, resulting in a break of faith with other people in our lives, with our selves, and ultimately, with our religious convictions. A similar theme is suggested by Pychyl (2008).
Performing an analysis such as this, and evaluating the insight that results, is a preliminary way of testing the utility of the HUC framework itself. Thus we demonstrate that the use of a comprehensive framework for understanding the human use of computers and information systems from an everyday perspective shows some promise of providing insight into complex and challenging problems that arise in our information-technology-saturated culture. References Basden, A. (2008). Philosophical Frameworks for Understanding Information Systems. Hershey, PA: IGI Publishing. Breems, N. S. (2009, September 8). Nick Breems is doing a short research project [web log post]. Retrieved from http://www.facebook.com Dooyeweerd, H. (1984). A New Critique of Theoretical Thought (Vols. 1-4). Jordan Station, Ontario, Canada: Paideia Press. (Original work published 1953-1958). Lavoie, J. A. A., & Pychyl, T. A. (2001). Cyberslacking and the procrastination superhighway: A web-based survey of online procrastination, attitudes, and emotion. Social Science Computer Review, 19 (4), 431-444. Pychyl, T. A. (2008, April 7). Existentialism and procrastination: Bad faith [web log post]. Retrieved from http://www.psychologytoday.com/node/372 Thatcher, A., Wretchko, G., & Fisher, J. (2008). Problematic internet use among information technology workers in South Africa. CyberPsychology & Behavior, 11 (6), 785–787. COMBINATORY LOGIC WITH FUNCTIONAL TYPES IS A GENERAL FORMALISM FOR COMPUTING COGNITIVE AND SEMANTIC REPRESENTATIONS JEAN-PIERRE DESCLÉS Laboratory LaLIC, University of Paris-Sorbonne, Maison de la Recherche, 28 rue Serpente, 75006, Paris, France HEE-JIN RO Laboratory LaLIC, University of Paris-Sorbonne, Maison de la Recherche, 28 rue Serpente, 75006, Paris, France AND BRAHIM DJIOUA Laboratory LaLIC, University of Paris-Sorbonne, Maison de la Recherche, 28 rue Serpente, 75006, Paris, France Abstract. We show how it is possible to use Combinatory Logic (a logic of operators and composition of operators) explicitly to define aspectual operators and temporal relations in natural languages from basic primitives in the domain of temporality. 1. Combinatory Logic Combinatory Logic with functional types (CL) is a formalism used for studying the foundations of computer science (the semantics of programming languages) and for defining functional programming languages (such as HASKELL) built from this logical model. CL is a logic of operators and composition of operators. CL was developed principally by Curry and Feys (1958), and has since been used in linguistics by Shaumyan (1987) and by Desclés (1990). In computer science, an applicative program is viewed as a combination of elementary programs, the program being built up with the help of a complex combinator, this latter being the result of an applicative combination of elementary combinators.
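The flavour of such applicative combinations can be conveyed with a small sketch. The authors' own implementations use HASKELL (see below); the following Python rendering of the elementary combinators I, K, S and B of Curry and Feys is purely illustrative:

```python
# Elementary combinators of Curry & Feys (1958), written as curried
# higher-order functions -- an illustrative sketch only.
I = lambda x: x                                 # identity:    I x = x
K = lambda x: lambda y: x                       # constancy:   K x y = x
S = lambda f: lambda g: lambda x: f(x)(g(x))    # S f g x = f x (g x)
B = lambda f: lambda g: lambda x: f(g(x))       # composition: B f g x = f (g x)

# An 'applicative program' as a combination of elementary programs:
# B square double is the complex operator x -> square(double(x)).
square = lambda x: x * x
double = lambda x: 2 * x
program = B(square)(double)
print(program(3))    # 36 = square(double(3))

# S K K behaves like the identity combinator: S K K x = K x (K x) = x.
print(S(K)(K)(42))   # 42
```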
The same idea can be used in other fields: logic and philosophy (the logical analysis of paradoxes and of some philosophical concepts), nanostructure synthesis and molecular combinatory computing (MacLennan, 2003), and cognitive representations, where a symbolic representation is an applicative organization of semantic primitives. Linguistic units are viewed as operators and operands of different functional types. CL allows us, on the one hand, to articulate different representation levels inside one and the same computational architecture during a process of change of levels and, on the other hand, to give, by means of a formal calculus, a synthesis of a lexical (or grammatical) operator from its meaning. 2. Semantic Analysis of Aspecto-Temporal Operators We present a semantic analysis of some aspectual and temporal operators. Grammatical units (aspects, tenses, moods …) are operators whose meanings are analysed with elementary semantic operators combined together with a combinator. An aspectual operator ‘ASP_I’ is applied onto a predicative relation ‘Λ’ (such as “Peter to enter the-room” or “Peter to be inside the room”), where ‘I’ is a topological interval of contiguous and ordered instants, this interval specifying the temporal area of realization of ‘Λ’. There are three basic aspectual operators: STATE_O, EVENT_F and PROC_J. If an aspectualized predicative relation ‘ASP_I (Λ)’ is viewed as a state ‘STATE_O (Λ)’, then the interval ‘O’ is open and ‘Λ’ is true at every instant of ‘O’ (example (1) Peter is inside the room is a descriptive state). If ‘ASP_I (Λ)’ is an event ‘EVENT_F (Λ)’ ((2) Peter entered the room), the interval ‘F’ is closed and ‘Λ’ is always true at the final bound of ‘F’ (the end of the complete event). If ‘ASP_I (Λ)’ is a process ‘PROC_J (Λ)’ ((3) Peter is entering the room), the interval ‘J’ is closed at its left bound (the beginning of the process) and open at its right bound, to signify that the process is incomplete. In order to speak, the speaker must locate ‘ASP_I (Λ)’ inside the temporal referential framework organized by himself; his speech act is an incomplete process expressed by “I-AM-SAYING (…)” = “PROC_J0 (I-SAY (…))”, where ‘J0’ is the interval of speaking, with its right bound open (the process of speaking is fundamentally incomplete). The temporal intervals ‘O’, ‘F’ and ‘J’ can be related to the interval ‘J0’. For the examples (1), (2) and (3), we obtain the respective temporal relations between the right bounds of the different intervals: (1′) [δ(O) = δ(J0)], (2′) [δ(F) < δ(J0)], (3′) [δ(J) = δ(J0)], where ‘δ’ and ‘γ’ are the operators that select, respectively, the right and the left bound of an interval. The combinators are used to express how the aspectual operators and temporal relations are combined together and synthesized into a unique grammatical operator expressed by a morphological operator. CL gives tools to analyze complex units into a combination of more elementary units. The computing of synthesis processes in a top-down strategy (or the analytic decomposition in a bottom-up strategy) of numerous aspectual and temporal operators has been realized with HASKELL. In the same way, the automatic analysis of some lexical predicates into a scheme where semantic primitives are combined in an applicative expression has been realized. We do not have space to show all the steps of the deductions for the different aspectual operators, which highlight the notions of process, event and state and related notions.
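The interval-theoretic content of the three basic operators can nonetheless be sketched. The following Python fragment is an illustrative reconstruction, not the authors' HASKELL code; all names and the sample intervals are assumptions. It encodes the topological constraints on O, F and J and checks relation (2′):

```python
from dataclasses import dataclass

@dataclass
class Interval:
    left: float
    right: float
    left_closed: bool
    right_closed: bool

def delta(i):   # selects the right bound of an interval (written δ above)
    return i.right

def gamma(i):   # selects the left bound (written γ above)
    return i.left

# The three basic aspectual operators, as constructors pairing a
# predicative relation Λ with a topologically constrained interval.
def STATE(relation, left, right):
    return relation, Interval(left, right, False, False)  # O: open

def EVENT(relation, left, right):
    return relation, Interval(left, right, True, True)    # F: closed

def PROC(relation, left, right):
    # J: closed on the left (onset), open on the right (incomplete)
    return relation, Interval(left, right, True, False)

# The speech act is an incomplete process over the speaking interval J0.
_, J0 = PROC("I-SAY(...)", 0.0, 10.0)
_, F  = EVENT("Peter to enter the-room", 2.0, 4.0)
print(delta(F) < delta(J0))   # relation (2'): the event precedes speech
```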
With the addition of semantic representations of the lexical predicates, it becomes possible to give the formal deduction from a given sentence to another (Desclés, 2005; Desclés and Ro, 2011): John took Mary’s pen → Mary doesn’t have the pen anymore. When a speaker of English understands the first sentence, he or she is able to infer the second sentence automatically. This inference becomes possible with grammatical knowledge (the meaning of tenses) and a representation of the meaning of the lexical predicate to take. Our research program shows how a machine can simulate this kind of inference realized by humans. For more details, see (Desclés, 1990; 2005) and (Desclés & Ro, 2011a; 2011b). References Curry, H.B. & Feys, R. (1958). Combinatory logic. Vol. I. Studies in logic and the foundations of mathematics. Amsterdam: North-Holland Publishing Co. Desclés, J.-P. (1990). State, event, process, and topology. General Linguistics, 29(3), 159-200. University Park and London: Pennsylvania State University Press. Desclés, J.-P. (2005). Reasoning and Aspectual-Temporal Calculus. In: D. Vanderveken (Ed.), Logic, Thought and Action (pp. 217-244). Springer. Desclés, J.-P. & Ro, H.-J. (2011a). Aspecto-Temporal Representation for Discourse Analysis: an Example of Formal Computation. The 24th Florida Artificial Intelligence Research Society Conference. Desclés, J.-P. & Ro, H.-J. (2011b). Opérateurs aspecto-temporels et Logique Combinatoire. To appear in Mathématiques et Sciences Humaines. Hindley, J.R. & Seldin, J.P. (1986). Introduction to Combinators and Lambda-Calculus. Cambridge University Press. MacLennan, B. J. (2003). Combinatory Logic for Autonomous Molecular Computation. www.cs.utk.edu/~mclennan Shaumyan, S.K. (1987). A Semiotic Theory of Natural Languages. Bloomington: Indiana University Press. THE PAST, PRESENT, AND FUTURE ENCOUNTERS BETWEEN COMPUTATION AND THE HUMANITIES STEFANO FRANCHI Department of Hispanic Studies, Texas A&M University stefano@tamu.edu Abstract. The paper addresses the conference theme from the broader perspective of the historical interactions between the Humanities and computational disciplines (or, more generally, the “sciences of the artificial”). These encounters have followed a similar, though symmetrically opposite, “takeover” paradigm. However, there is an alternative meeting mode, pioneered by the interactions between the studio and performance arts and digital technology. A brief discussion of the microsound approach to musical composition shows that these alternative encounters have been characterized by a willingness on both parts to let their basic issues, techniques, and concepts be redefined by the partner disciplines. I argue that this modality could (and perhaps should) be extended to other Humanities disciplines, including philosophy. 1. Takeovers The two best-known encounters between computational technologies and traditional Humanist pursuits are represented by the Artificial Intelligence/Cognitive Science movement and the roughly contemporary Digital Humanities approach (although the label became popular only recently). Classic Artificial Intelligence saw itself as “anti-philosophy” (Dupuy, 2000; Agre, 2005; Franchi, 2006): it was the discipline that could take over philosophy's traditional questions about rationality, the mind/body problem, creative thinking, perception, etcetera, and solve them with the help of a set of radically new synthetic, experiment-based techniques.
The true meaning of the "computational turn in philosophy" lies in its methodology, which allowed it to associate engineering techniques with age-old philosophical questions. This “imperialist” tendency of cognitive science (Dupuy, 2000) was present from the very beginning, even before the formalization of the field into well–defined theoretical approaches (McCulloch (1989[1948]); Simon, 1994). The Digital Humanities represent the reverse modality of the encounter just described. The most common approach (Kirschenbau, 2010) uses tools, techniques, and algorithms developed by computer scientists to address traditional questions about the meaning of texts, their - 219 - Proceedings IACAP 2011 accessibility and interpretation, and so on. Other approaches turn technology into the scholar's preferred object of study (Svensson, 2010). The recent approach pioneered by the “Philosophy of Information” (Floridi, 2011) follows this pattern. Its focus on the much broader category of “information” substantially increases the scope of its inquiries, while firmly keeping it within philosophy's standard reflective mode. The common feature of these two classic encounters between the Humanities and computational theory and technology is their onesidedness. In either case, one of the two partners took over some relevant aspects from the other participant and fit it within its own field of inquiry (mostly questions, in AI's case; mostly tools, for the Digital Humanities). The appropriation, however, did not alter the theoretical features of either camp. For instance, AI and Cognitive Science researchers maintained that philosophy pre-scientific methodology had only produced mere speculation that made those problems unsolvable. Therefore, philosophy's accumulated wealth of reflection about the mind, rationality, perception, memory, emotions, and so forth could not be used by the computational approach. In McCulloch's famous phrase, the “den of the metaphysician is strewn with the bones of researchers past.” In the Digital Humanities' case, the takeover happens at the level of tools. In most cases, however, this appropriation does not become an opportunity for a critical reflection on the role of the canon on liberal education, or for a reappraisal of the role of the text and the social, political, and moral roles it plays in society at large. 2. Digital practice Meetings between artists and computational technology show the possibility of a different paradigm. In many cases, making music, painting, producing installations, and writing with a computer changes the concepts artists work with, and, at the same time, forces computer sciences to change theirs as well. There are many examples in the rich history of “digital art,” broadly understood (OuLiPo, 1973; ALAMO, No year; Schaeffer, 1952). I will illustrate their general features with reference to a more recent project: the “microsound” approach to musical composition (Roads, 2004). “Microsounds” are sonic objects whose timescale lies between that of notes―the smallest traditional music objects, whose duration is measured in seconds or fractions thereof―and samples―the smallest bit, measured in microseconds (10-6). The manipulation of microsounds broadens substantially the composer's palette, but it is - 220 - The Computational Turn: Past, Presents, Futures? impossible without the help of technological devices of various kinds, from granular synthesis software to high-level mixing interfaces. 
Composers wishing to “sculpt” sounds at the microlevel face a double challenge that translates into a mutual collaboration between compositional and algorithmic techniques. On the one hand, they need to broaden the syntax an grammar of music's language to allow the manipulation and aesthetic assessment of previously unheard of objects (Vaggione, 2001). On the other hand, they need computer scientists and mathematicians to develop alternative analytic and synthetic models of sound (in addition to Fourier-transforms and similar methods) capable of capturing the features of sonic events lasting only a few milliseconds (Vaggione, 1996). This example of artistic production points to a pattern of cooperation between work in computational and non-computational disciplines that is deeply at odds with the AI/CogSci and DigHum patterns discussed above. Instead of a takeover, the artistic model produces a true encounter that changes both partners' technical and theoretical apparatus. 3. Posthuman encounters? Could the encounter model practiced by artists be generalized to the Humanities? We can see how this could be the case by considering a twofold question. On the one hand: are Humanities' traditional inquiries about human nature and human cultural production still relevant in a landscape in which some of the communicating agents may not be human, partially or entirely? Can they go on in the same way? And vice versa: are science and technology fully aware that the new digital artifacts they are shepherding into the world may change its landscape and transform worldly action at the pragmatic as well as at the theoretical level? Or are they still relying upon a pre-digital universe in which technological artifacts were always to be used as mere tools deployed by humans, an assumption that seems increasingly questionable? I think a particularly fruitful approach toward this question is provided by the kind of critical thought that has been developed―mostly, but certainly not exclusively―in Continental Europe over the last two or three decades. These theoretical efforts have based their explorations upon anti-humanist and/or post-humanist perspectives. They provide, therefore, a fruitful starting point for the investigation and interaction with instruments, tools, and techniques that question the very notion of the human. For instance, Lacanian and post-Lacanian psychoanalysis has articulated a view of the human that deploys cybernetic concepts to explain high level cognitive functions (Franchi, 2011; Chiesa, 2007); - 221 - Proceedings IACAP 2011 the work on biopolitics currently developed by largely Italian philosophers attempts to articulate a conception of human life that is continuous with animal and non-organic life (Agamben, 2003; Esposito, 2008; Tarizzo, 2010). At the same time, the disciplines of science and technology studies in their contemporary North American, French, and German developments have provided penetrating analyses of the bidirectional relationships between scientific theories and technological artifacts, on the one hand, and philosophical and cultural productions on the other (Ihde, 2002; Hayles, 1999; Latour and Woolgar, 1986; Biagioli, 1999). This suggestion does not pretend to exhaust the theoretical options we have at our disposal when reflecting upon the computational turn. My contention, however, is that artistic practices in all forms of “digital art” can serve as an inspiration to all of the Humanities disciplines. 
We can follow their path toward a new mode of digital encounter that does not fall into the well-worn path of hostile takeovers by either partner. References Agamben, G. (2003). The Open. Man and Animal. Stanford, Calif.: Stanford University Press. Agre, Ph. E. (2005). The Soul Gained and Lost: Artificial Intelligence as Philosophical Project. In: S. Franchi and G. Güzeldere (Eds.), Mechanical Bodies, Computational Minds (pp. 153-174). Cambridge: MIT Press. ALAMO (Atelier de Littérature Assistée par la Mathématique et les Ordinateurs). URL: http://alamo.mshparisnord.org/index.html Biagioli, M. (Ed.) (1999). The Science Studies Reader. New York: Routledge. Chiesa, L. (2007). Subjectivity and Otherness. A Philosophical Reading of Lacan. Cambridge: MIT Press. Dupuy, J.-P. (2000). The Mechanization of the Mind: On the Origins of Cognitive Science. Princeton: Princeton University Press. Esposito, R. (2008). Bios: Biopolitics and Philosophy. Minneapolis: University of Minnesota Press. Floridi, L. (2011). The Philosophy of Information. Oxford: Oxford University Press. Franchi, S. (2006). “Herbert Simon, Anti-Philosopher.” In: L. Magnani (Ed.), Computing and Philosophy (pp. 27-40). Pavia: Associated International Academic Publishers. ----- (2011). Jammed Machines and Contingently Fit Animals: Psychoanalysis’s Biological Paradox. French Literature Series, 38, in press. Hayles, N. K. (1999). How We Became Posthuman: Virtual Bodies in Cybernetics, Literature, and Informatics. Chicago: University of Chicago Press. Ihde, D. (2002). Bodies in Technology. Minneapolis: University of Minnesota Press. Kirschenbaum, M. G. (2010). What Is Digital Humanities and What’s It Doing in English Departments? ADE Bulletin, 150, 1–7. Latour, B. and Woolgar, S. (1986). Laboratory Life: the Construction of Scientific Facts. Princeton: Princeton University Press. McCulloch, W. S. (1989[1948]). Through the Den of the Metaphysician. In: Embodiments of Mind (pp. 142-156). Cambridge: MIT Press. OuLiPo (1973). La littérature potentielle. Paris: Gallimard. Roads, C. (2004). Microsound. Cambridge: MIT Press. Schaeffer, P. (1952). À la recherche d’une musique concrète. Paris: Seuil. Simon, H. (1994). Literary Criticism: a Cognitive Approach. In: S. Franchi and G. Güzeldere (Eds.), Bridging the Gap (pp. 1–26). Stanford Humanities Review, 4(1), Special Supplement. Svensson, P. (2010). The Landscape of Digital Humanities. Digital Humanities Quarterly, 4(1). Tarizzo, D. (2010). La vita, un’invenzione recente. Bari: Laterza. Vaggione, H. (1996). Articulating Microtime. Computer Music Journal, 20(2), 33–38. ----- (2001). Some Ontological Remarks about Music Composition Processes. Computer Music Journal. REFLECTIONS ON NEUROCOMPUTATIONAL RELIABILISM MARCELLO GUARINI Department of Philosophy, University of Windsor, 401 Sunset, Windsor, ON, Canada N9B 394 AND JOSHUA CHAUVIN AND JULIE GORMAN Students, Department of Philosophy, University of Windsor, 401 Sunset, Windsor, ON, Canada N9B 394 1. Introduction Reliabilism is a theory of knowledge that has traditionally focused on propositional knowledge. Paul Churchland has advocated a reconceptualization of reliabilism to “liberate it” from propositional attitudes (such as accepting that p, believing that p, knowing that p, and the like).
In the process, he (a) outlines an alternative for the notion of truth (which he calls “representational success”), (b) offers a non-standard account of theory, and (c) invokes the preceding ideas to provide an account of representation and knowledge that emphasizes our skill or capacity for navigating the world. Crucially, he defines reliabilism (and knowledge) in terms of representational success. This paper discusses these ideas and raises some concerns. Since Churchland takes a neurocomputational approach, we discuss our training of neural networks to classify images of faces. We use this work to suggest that the kind of reliability at work in some knowledge claims is not usefully understood in terms of the aforementioned notion of representational success. 2. Traditional Reliabilism: Truth and Propositional Attitudes Claims to propositional knowledge have the form S knows that p, where p is a proposition. For the reliabilist, among the necessary conditions for some agent or subject S to know p are that (a) p is true, (b) S believes p, and (c) p is the outcome of a reliable process or method. According to Alvin Goldman (1986, 1992, 1999, 2002), reliability is required for both epistemic justification and knowledge. This reliability is a ratio: the number of true beliefs delivered by a process or method divided by the number of true and false beliefs delivered by the same process or method. As we will concern ourselves primarily with the reliability requirement in this paper, we shall not engage the issue of what might constitute sufficient conditions for either knowledge or justification. 3. Neuro Reliabilism: Representational Success and Similarity Spaces Paul Churchland (2007) attempts to take a reliabilist approach to epistemology, divorce it from propositional attitudes, and explain how we can have non-propositional knowledge. Churchland begins by enumerating many instances of know-how. The examples include the capacity or skill knowledge possessed both by humans and non-humans. He argues that much of what we call knowledge has little or nothing to do with the fixing of propositional attitudes. He recognizes the importance of truth in classical approaches to reliabilism, but he resists talking of truth since (a) it attaches to propositional attitudes, and (b) much of our knowledge is not about fixing propositional attitudes. In place of truth, Churchland formulates a notion of representational success that is compatible with analyses of neural networks. To keep things simple, consider a three-layer feed-forward neural network. After training, each different pattern of activation across the hidden units is a different point in the network's hidden-unit activation space. We can then measure the distance between points (which Churchland often refers to as similarity relations). Churchland treats (somewhat metaphorically) similarity spaces as maps that guide our interactions with the world. Just as a map is representationally successful when the distance relations on the map preserve distance relations in the world, conceptual spaces understood as similarity spaces are representationally successful when they preserve similarity or distance relations between points in state space and the world. 4. How Representational Success and Reliability can Come Apart We will present the results of two neural networks (N1 and N2) trained to classify images of faces as either male or female.
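Before the concrete networks are described, the comparison at issue can be sketched abstractly. The following Python fragment is a toy illustration with synthetic activations, not the authors' networks or face data: it shows how two hidden-layer "similarity spaces" over the same items can be compared via their distance matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden-layer activations of two networks on the
# same 20 test items: same inputs, different learned mappings.
X = rng.normal(size=(20, 10))     # 20 items, 10 input features
W1 = rng.normal(size=(10, 8))     # first network's input-to-hidden weights
W2 = rng.normal(size=(10, 8))     # second network's weights (different history)
H1 = np.tanh(X @ W1)              # points in the first similarity space
H2 = np.tanh(X @ W2)              # points in the second similarity space

def distance_matrix(H):
    """Pairwise distances between hidden-layer points."""
    return np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)

D1, D2 = distance_matrix(H1), distance_matrix(H2)
iu = np.triu_indices(20, k=1)
r = np.corrcoef(D1[iu], D2[iu])[0, 1]
print(f"correlation of the two similarity structures: {r:.2f}")
# Two networks can agree on every classification (be equally reliable)
# while the distance relations of their hidden spaces -- the locus of
# 'representational success' -- come apart.
```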
N1 was trained on the set of images A; it was tested on images it had not previously seen, set B. N2 was trained on B; it was tested on A. Both networks achieved equal levels of success on the images. In spite of the preceding, we will show that N1 and N2 set up different similarity spaces. This is a problem for Churchland’s position since he defines reliability in terms of representational success, and this latter notion is defined in terms of structure-preserving mappings between points in similarity space and features of the world. It seems quite natural to say that N1 and N2 are equally reliable, but because they set up different similarity spaces, we will argue that it is not clear how they could be equally representationally successful, given the work Churchland expects representational success to do. There is a difference between (a) being reliable and (b) explaining the source of that reliability. We will show that we can understand what it is for a system (a face-classifying neural network) to be reliable independent of understanding the source of that reliability. Churchland uses the notion of representational success (or preservation of distance relations) both to define reliability and to understand its source (i.e. to do both (a) and (b)). This is a source of potential problems for his position.

5. Conclusion

In spite of the problems, we recognize there are some attractions to the sort of position Churchland is putting forward. While we do not think it has the range of applicability Churchland suggests, we do not take ourselves to have argued that representational success is a useless notion. We will close with some constraints that need to be satisfied for the notion to be a useful one.

Acknowledgements

We thank the Shared Hierarchical Academic Research Computing Network (SHARCNet) for financial support.

References

Churchland, P. M. (2007). Neurophilosophy at Work. Cambridge, UK: Cambridge University Press.
Goldman, A. (1986). Epistemology and Cognition. Cambridge, MA: Harvard University Press.
Goldman, A. (1992). Liaisons: Philosophy Meets the Cognitive and Social Sciences. Cambridge, MA: MIT Press.
Goldman, A. (1999). Knowledge in a Social World. Oxford: Oxford University Press.
Goldman, A. (2002). Pathways to Knowledge, Private and Public. Oxford: Oxford University Press.

STATES OF AFFAIRS AND INFORMATION OBJECTS

STEVE T. MCKINLAY
Charles Sturt University
Wellington Institute of Technology, School of Information Technology
Private Bag 39803, Petone, Wellington, NEW ZEALAND
e-mail: steve.mckinlay@weltec.ac.nz

Abstract. This paper compares two recently detailed metaphysical accounts of reality. On the one hand we have Luciano Floridi’s “information realism” and, on the other, David Armstrong’s view that the general structure of reality can be described as “states of affairs”. Floridi postulates the information object as the entity central to information ethics and his informational realism. In developing the concept he draws heavily upon object-oriented (OO) programming theory. Informational objects are reckoned by Floridi to be, in a sense, ontologically primitive and as such naturally occurring, mind-independent structures dynamically interacting with one another. Floridi employs OO-like terminology such as “properties” and “relations” in order to clarify his concept of the informational entity.
Armstrong on the other hand postulates that the world, all that there is, is a world of states of affairs. A state of affairs according to Armstrong consists of a particular which has a property, or alternatively a relation which holds between two or more particulars. Each state of affairs, as well as constituent higher- or lower-order states of affairs, is a contingent existent. Furthermore, the properties and relations attached to states of affairs are universals. These two theories, whilst exhibiting marked resemblances, also reveal fundamental philosophical differences; yet both attempt to present a unified metaphysical schema, an ontology. Of great interest is the fact that here we have two strong competing theories. The situation calls for critical comparison. Such a comparison is the primary aim of this paper.

The idea of the Information Object as being somehow ontologically fundamental has gained traction recently, not only in computer programming circles but also philosophically. We could attribute this newfound popularity, particularly with regard to philosophical interpretations, to the fact that we live in the so-called information age. We, at least in the developed world, view the world through information-coloured spectacles these days. Adding some substance to this claim is the fact that our information systems are designed and developed using fashionable object-oriented (OO) methodologies. Information modeling is now the accepted process by which facts or propositions, the sentences that demarcate the various states of affairs and “things” in which the modeller is interested, are defined via “object class” structures. Such structures in turn represent various properties, behavior and relata. The information object in this sense is an intuitively fitting and elegant way of representing the problems we attempt to solve via computational means. OO design and development is “instrumentally reliable” – it works. The majority of modern implemented information technologies across the entire gamut of industries and applications typically employ object-oriented approaches. The focus has shifted from procedural algorithmic processing to an object-driven methodology, and as such states of affairs and “things” are abstractly modelled as self-contained (encapsulated) object structures, responsible for their own identity, relations, properties, states and behavioural rules. It’s perhaps not surprising then that we might ponder: could the universe be interpreted and/or represented in such a way? From a wider perspective, what is often termed the computational turn has given rise to the informational object concept, central to and emerging as fundamental in an informational ontology developed primarily by Luciano Floridi (2002, 2004, 2008). The concept is important for Floridi since the information object plays a role central to his Information Ethics (IE) and Informational Realism (IR). But more than this, the idea of the “information entity” seems to offer new ways of understanding epistemology, semantics, scientific explanation, and ethics. Floridi has developed a detailed picture of the information object (or entity, as he sometimes calls it), employing Object Oriented programming and design methods and theories to clarify the concept.
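Since the argument trades on the OO notion of an object, a schematic illustration may help. This is only the programming picture Floridi borrows, sketched by us in Python, not his formalism: a class fixes the repeatable property and relation types (the universal-like part), while each instance is a distinct particular.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # eq=False: instances compare by identity, like particulars
class InformationObject:
    """An 'object class' in the OO modelling sense: the class fixes the
    repeatable property and relation types; instances are particulars."""
    properties: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (relation_name, other) pairs

    def relate(self, relation_name, other):
        self.relations.append((relation_name, other))

a = InformationObject({"colour": "red"})
b = InformationObject({"colour": "red"})   # the very same property bundle...
a.relate("next_to", b)
print(a.properties == b.properties)        # True: shared, universal-like content
print(a is b)                              # False: ...borne by two distinct particulars
```

The two print statements make the bundle-theoretic point below concrete: instances can share all of their property and relation types and still be numerically distinct bearers.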
Whilst Luciano Floridi’s notion of the information object is somewhat analogous to the OO conception of an object, in a recent paper I argued, for a variety of reasons, that information objects, certainly within the context of Floridi’s informational realism, don’t seem to be much like OO objects, certainly not the kind employed in an OO class model or an OO program. Arguably the most significant difference is that OO objects act unequivocally as referents to facts, as Wittgenstein (1961) would have put it, or what Armstrong (1998) calls states of affairs. I think there is certainly a similarity between OO objects and Floridi’s conception of the information object, but I suspect the similarity is more harmful to the idea of the information object holding any independent ontological status or existing independently as a particular category. The similarity is that both object concepts are largely conceptual by nature. Yet Floridi seems to want to confer a stronger ontological status on the information entity. Problems arise if the information object is indeed conceptual. Following Laudan (1977, p. 48), such entities can have no existence independent of the theories within which they are postulated. Nevertheless, the concept of an information entity is certainly a convenient and relatively intuitive way of bundling up constituent properties and relations belonging to the particular in question. Those properties and relations are in fact what philosophy sometimes calls universals, and it is each particular (distinct information objects) that instantiates those universals. The universals themselves are the constituents of information objects shared across many objects. There are some who deny the existence of universals (nominalism), and we shall consider this in the full paper.

Armstrong (1998, p. 95) questions the need to recognise an independent category of particulars. He argues that whilst properties and relations can be known, “the bearer of properties and relations, it is alleged, cannot be known. Why then postulate a bearer?” The postulation of bearers, Armstrong argues, appears to lack ontological and epistemic economy (ibid.). This raises the question: is the Floridian information object the same kind of thing Armstrong terms a bearer? From the OO perspective, a particular information object (or class, although the two concepts differ slightly and this will be explained) is admittedly representative of a fact, state of affairs or physical object; this renders the OO object second-order relative to the actual fact or state of affairs. Furthermore, I take it that it is meant to be information objects all the way down. But we already see this isn’t the case. Information objects are essentially bundles of properties and relations; whilst no information object can be strictly identical with another, the properties and relations can be, and are, identical across multiple instantiations of similar objects. Whilst they do not exist outside their instantiations, it would seem properties and relations hold a more fundamental ontological position than the information entity. Thus to uphold the ontological reality of “information objects” or, in Armstrong’s case, “states of affairs” seems to entail the admission of properties and relations, yet there would certainly be some philosophers who would deny that the reverse holds.
There seems to be little controversy in the admission of properties and relations, since a denial results in the denier having to come up with an alternative theory of classes. It is individual objects or states of affairs exhibiting more or less identical properties and relations that we bundle into classes. This paper compares Armstrong’s descriptions of properties and relations with those affiliated to Floridi’s information object concept. Further, we will consider how similar (or different) the information object concept is to Armstrong’s conception of the state of affairs.

References

Armstrong, D. M. (1989). Universals: An Opinionated Introduction. Westview Press. (Focus Series)
Armstrong, D. M. (1998). A World of States of Affairs. Cambridge: Cambridge University Press.
Floridi, L. (2002). On the Intrinsic Value of Information Objects and the Infosphere. Ethics and Information Technology, 3(4), 287-304.
Floridi, L. (2004). Informational Realism. In G. M. Greco, IEG Research Report. Oxford: Information Ethics Group.
Floridi, L. (2008). A Defence of Informational Structural Realism. Synthese, 161(2), 219-253.
Laudan, L. (1977). Progress and Its Problems: Towards a Theory of Scientific Growth. California: University of California Press.
Wittgenstein, L. (1961). Tractatus Logico-Philosophicus. London and New York: Routledge.

SCIENTIFIC EXPLANATION AND INFORMATION

STEVE T. MCKINLAY
Charles Sturt University
Wellington Institute of Technology, School of Information Technology
Private Bag 39803, Petone, Wellington, NEW ZEALAND
e-mail: steve.mckinlay@weltec.ac.nz

Abstract. Scientific explanation and, more recently, information have attracted considerable philosophical attention. Little consideration, however, has been given to making sense of the concept of information used within debates surrounding explanation. Some may deem there is no problem to be solved here. Yet we observe within the literature on scientific explanation strict examinations of profound philosophical concepts. Writers are at pains to explain causal, epistemic, ontological and nomological accounts of explanation, all of which in some way rely upon and take for granted the role of information. We like to think these days that we have at least the beginnings of a coherent theory of information. This paper cherry-picks a couple of interesting ideas within scientific explanation and attempts to reconcile the generally received view of information with those particular explanatory accounts. By the received view I mean the General Definition of Information, mostly attributed to Luciano Floridi from around 2003 onwards. As a result of this investigation some profound questions arise: is an “ideal explanatory text” (see Railton, 1981) essentially an informational concept? Can we make sense of a relationship between causation and information? Just how are the concepts related, and do we need a satisfactory account? And finally, is it possible to propose a purely information-centric theory of scientific explanation and, if so, could it be a significant improvement on current theories of scientific explanation?

Everything that exists makes a difference to the causal powers of something. (David Armstrong, 1997, p. 41)

Introduction

Wesley Salmon in Causality and Explanation suggests that to most people, the fact that there is a close connection between causality and explanation comes as no surprise (1998, p. 3). And while distinctions can
certainly be made between the two concepts, there are many convergences. Salmon argues, “In many cases to explain something is to state its cause.” (ibid). I happen to think a similar story can be told with regard to information and explanation. To have something explained is, at least from an ordinary language point of view, to be informed. There is a certain structure about scientific explanation, the various relationships between laws and theories, and information seems to be the flesh on these bones. It follows that the concept of information might benefit from an investigation into the connections or relations that exist between it, causal concepts and explanation, and it is this particular can of worms that this paper intends to open.

Information, Causality and Explanation

The body of philosophical literature on scientific explanation is substantial, beginning [16] with the deductive-nomological (D-N) model (Hempel & Oppenheim, 1948; Hempel, 1965), wherein scientific explanations were considered deductive arguments [17]. Salmon (1971) followed with the statistical relevance (S-R) model in order to deal with explanations of low-probability events not adequately dealt with by Hempel’s explanatory models. Later, Railton (1978, 1981) proposed a deductive-nomological-probabilistic (D-N-P) model in a further attempt to explain events that happen by chance. More recently, Wesley Salmon proposed a causal theory of explanation. Salmon’s principal claim was that a scientific explanation is constituted by a state of affairs predominantly recognised as a pattern in the world, where that pattern consists of at least one causal process. Causal processes, Salmon argued (also Railton, 1981, and later Dowe, 2000), necessarily transmit information (1998, p. 16). Salmon explains this as the ability of a causal process to transmit a mark. Causal processes are described by Salmon as being continuous (in a physically spatiotemporal way). This view contrasts with the popular view of causality as a “relation” between particular events (the cause and the effect). Salmon’s theory is perhaps most eloquently clarified in his At-At Theory of Causal Influence (1977, reprinted in Salmon, 1998). The At-At theory, Salmon claims, not only resolves Zeno’s arrow paradoxes but also proposes a foundation for a concept of propagation of causal influence.

[16] Although the roots of scientific explanation and understanding can of course be traced back well beyond Aristotle, recent philosophical history regarding scientific explanation is generally considered to begin with Hempel and Oppenheim’s groundbreaking paper Studies in the Logic of Explanation.
[17] The degree of informativeness of a logically deductive schema is perhaps controversial; however, given that scientific explanation has moved on considerably from the Hempelian D-N approach, we can safely leave this controversy to one side.

Information plays a significant yet largely unexplained role in virtually all of the models of explanation, particularly Salmon’s At-At causal theory. The usual constraints prevent this paper from adequately summarising in full the development of scientific explanation from Hempel’s D-N model through recent attempts at a unified model of explanation, and so I intend to choose two particular junctures in the history of scientific explanation in the hope of casting some light upon the controversial three-way axis between information, causality and explanation.
As is often the case in philosophy, the following investigation is most likely to end in more, though hopefully new and interesting, questions regarding the nature of information. Thus, my two starting points with their associated problems are as follows:

1. Peter Railton makes a distinction between what he terms the “ideal explanatory text” and “explanatory information” (1981, p. 240). Railton openly admits in his 1981 paper that whilst it is typical to speak of sentences or texts conveying information, he knows of “no satisfactory account of this familiar and highly general notion” (1981, p. 240). Further, he admits that neither does the notion of information defined by Wiener and Shannon appear to fit his explanatory theory. Given that Railton’s work continues to influence attempts at theories of explanation, in particular Kitcher’s (1989) unificationist account, an enquiry into Railton’s “explanatory information” seems overdue.

2. Wesley Salmon’s development of Reichenbach’s “mark method” in his At-At Theory of Causal Influence makes thoughtful claims about information transmission as a result of causal processes. Salmon makes a clear distinction between causal processes and pseudo-processes, the latter of which, he claims, have no ability to transmit information. I will evaluate Salmon’s claims with examples and examine how Salmon’s concept of information transmission squares with our current views about information.

This investigation, I think, raises profound questions: is Railton’s concept of the ideal explanatory text essentially an informational concept? On the other hand, can we make sense of a relationship between causation and information? Just how are these concepts related, and do we need a satisfactory account? Finally, can we propose an informationally centred theory of scientific explanation? Rather than attempt to conclusively answer these questions in this paper, I hope to build an argument around the fact that the topic is one worthy of serious consideration.

References

Dowe, P. (2000). Physical Causation. Cambridge: Cambridge University Press.
Floridi, L. (2003). From Data to Semantic Information. Entropy, 5, 125-145.
Hempel, C. (1965). Aspects of Scientific Explanation and Other Essays in the Philosophy of Science. New York: Free Press.
Hempel, C. & Oppenheim, P. (1948). Studies in the Logic of Explanation. Philosophy of Science, 15, 135-175.
Kitcher, P. (1989). Explanatory Unification and the Causal Structure of the World. In Scientific Explanation (pp. 410-505). Minneapolis: University of Minnesota Press.
Railton, P. (1978). A Deductive-Nomological Model of Probabilistic Explanation. Philosophy of Science, 45, 206-226.
Railton, P. (1981). Probability, Explanation, and Information. Synthese, 48(2), 233.
Salmon, W. (1998). Causality and Explanation. Oxford: Oxford University Press.

BIOLOGICALLY INSPIRED SINGLE-CHIP MASSIVELY PARALLEL SELF-HEALING, SELF-REGULATING, TERA-DEVICE COMPUTERS
Philosophical Implications of the Efforts for Solving Technological Show-Stoppers in the Path of the Next Computational Turn

MICHAEL NICOLAIDIS
TIMA Laboratory (CNRS, Grenoble INP, UJF)

Abstract. Biologically inspired computing usually addresses computing functionalities inspired by biological systems (genetic algorithms, neural networks, cellular automata, artificial life, ...).
However, living organisms also efficiently resolve some other problems that have to be addressed in order to accomplish the next computational turn: achieving the robustness (in reliability and power dissipation) that enables useful computation by means of ultimate CMOS (to be reached by the beginning of the next decade) and post-CMOS technologies. Thus, biologically inspired robust computing can be viewed as an emerging topic of biologically inspired computing.

Complex organisms have the remarkable property of self-healing. Two fundamental features underlie this ability. Organisms are constituted of large numbers of basic units (cells). Cells surrounding injured parts can substitute for the dead cells and regenerate the damaged structures. Also, the cells themselves can recover from various kinds of damage, for instance by repairing their DNA. Furthermore, living organisms regulate their physiological parameters in response to changing external conditions and their own needs (e.g. the regulation of insulin levels in response to sugar levels). As another remarkable property, the autonomic nervous system of higher animals controls important bodily functions (e.g. respiration, heart rate, and blood pressure) without conscious intervention. Building computers having similar properties, and achieving the robustness they confer, is an old dream of computer scientists. But so far, related research has not led to a practical self-healing, self-regulating, autonomic computing paradigm.

Ultimate CMOS and post-CMOS promises and challenges

We argue that today there are several converging factors which pave the way towards a new computing paradigm realizing this old dream. These factors are threefold. Two of them are related to technology scaling.
- Ultimate-CMOS and post-CMOS technologies promise to integrate trillions of devices in a single chip. Thus, single-chip massively parallel architectures become mandatory for utilizing the huge numbers of devices integrated in such chips.
- At the same time, aggressive technology scaling dramatically increases process, voltage and temperature (PVT) variations; sensitivity to electromagnetic interference (EMI) and to atmospheric radiation (neutrons, protons); and circuit aging; and also imposes stringent power-dissipation constraints. The resulting high defect levels, heterogeneous behavior of identical processing nodes, circuit degradation over time, and extreme complexity adversely affect fabrication yield and also prevent the fabrication of reliable chips in ultimate CMOS and post-CMOS technologies. These issues are the main show-stoppers in the path towards these technologies that pave the way for the next computational turn.

The above two factors plead for a self-healing massively parallel computing paradigm. But this is not a trivial task. Coping with failures (a property also known as fault tolerance) induces high area and power penalties. The former will drastically reduce the available computing resources, while the latter is incompatible with low-power operation (one of the tightest constraints in ultimate CMOS). Furthermore, conventional fault-tolerant approaches (DMR, TMR, etc.) consider that failures affect a single component among several redundant ones. This assumption is no longer valid at the extreme integration levels of ultimate CMOS, where transistors are so small that they comprise only a few atoms, nor under the even higher integration levels promised by post-CMOS.
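The point about conventional redundancy can be made concrete with a minimal sketch of triple modular redundancy (our illustration in Python, not a hardware description): a majority vote masks one faulty replica, and the guarantee evaporates exactly when the single-failure assumption does.

```python
from collections import Counter

def tmr(replica_outputs):
    """Return the majority value among three replica outputs.
    Masks one faulty replica; gives no guarantee if two or more fail."""
    value, votes = Counter(replica_outputs).most_common(1)[0]
    return value if votes >= 2 else None  # no majority: uncorrectable

print(tmr([42, 42, 7]))   # 42   -- a single fault is masked
print(tmr([42, 13, 7]))   # None -- two simultaneous faults defeat TMR
```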
In these technologies we may face the following challenges:
- All processing nodes and routers in a massively parallel tera-device processor are affected by timing or transient faults.
- Hard faults may affect some parts of each node.
- Hard faults completely destroying a node arrive every few days.
- Circuit degradation is continuous and requires continuous self-regulation of circuit parameters (clock frequency, voltage levels, body bias) to keep the circuit operational.

Biologically-inspired enabling approaches

The Cells framework (On-Chip Self-Healing Tera-Device Processors) discussed in this paper brings in the third factor: a drastically new system-design paradigm achieving high yield and highly reliable, uninterrupted operation for highly defective on-chip massively parallel tera-device processors at low hardware cost. Power reduction and enhanced performance are also achieved through self-regulation of circuit parameters (voltage, clock frequency and body bias). Groundbreaking innovations were introduced at all levels of the framework, including its overall architecture, its particular components, and the way the cooperation of these components is architected to optimize the outcome. They enable continuous adaptation to circuit degradation, heterogeneity and changing application context, as well as detection of, and restoration of correct operation after, all failures induced by high defect densities, PVT variations, internal and external disturbances, and circuit degradation over time. The result is a holistic self-healing, self-regulating approach allowing:
- Making usable tera-device technologies affected by high defect densities, severe variability, increasing sensitivity to disturbances and accelerated aging.
- Implementing single-chip massively parallel self-healing tera-device computers delivering unprecedented computing power, which enables changing our computing paradigms and should have a profound impact on all computer application domains (including embedded systems, telecommunication networks, internet infrastructure and utilization, cloud computing, ...), as well as on science and technology and society as a whole.

In Cells, self-healing is achieved by two means. Single-chip massively parallel processors resemble living organisms in that they are constituted of large numbers of basic units (processor cores, routers and links). Cells takes advantage of this similarity. Like cells in living organisms, operational units replace unrecoverable faulty units to restore system functionality transparently to the ongoing application executions. Also, like cells in living organisms, processor cores, routers and links are able to recover from several kinds of failures by using new innovations in circuit-level fault tolerance (Anghel and Nicolaidis, 2000; Nicolaidis, 2005; Anghel and Nicolaidis, 2008; Nicolaidis, 2011; Yu, Nicolaidis, Anghel and Zergainoh, 2011) and self-regulation. Furthermore, similarly to the non-deterministic, local and opportunistic manner in which cells in an organism achieve self-healing and self-regulation, Cells uses new non-deterministic routing, task allocation and scheduling algorithms, which make local decisions in an opportunistic manner (Chaix, Avresky, Zergainoh and Nicolaidis, 2010, 2011).
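To convey the flavour of such local, opportunistic decisions, here is a toy of our own (not the Cells algorithms themselves): each hop picks at random among the currently healthy neighbours that reduce the distance to the destination, so routes adapt to failures without any global routing table.

```python
import random

def next_hop(node, dest, neighbours, healthy, dist):
    """Pick a random healthy neighbour of `node` strictly closer to `dest`;
    return None if blocked (the caller retries or reroutes later).
    `dist` is any locally computable distance estimate."""
    candidates = [n for n in neighbours[node]
                  if n in healthy and dist(n, dest) < dist(node, dest)]
    return random.choice(candidates) if candidates else None

# Example on a 3x3 mesh with Manhattan distance; node (1, 1) has failed.
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
nodes = [(x, y) for x in range(3) for y in range(3)]
neighbours = {n: [m for m in nodes if manhattan(n, m) == 1] for n in nodes}
healthy = set(nodes) - {(1, 1)}
print(next_hop((0, 1), (2, 2), neighbours, healthy, manhattan))  # (0, 2): routes around the fault
```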
The Cells routing, task allocation and scheduling algorithms address the complexity problem of navigating in a complex and changing network (thousands of processors and routers, millions of possible communication paths, continuous circuit degradation, frequent occurrence of catastrophic node and router failures, and unpredictable router congestion). Conventional deterministic algorithms, used in today's massively parallel multi-chip systems (which exhibit low defectivity and high circuit stability), rely on static routing tables containing pre-established routes, and on static scheduling and allocation algorithms which assume fixed clock frequencies; rarely failing links, routers and processor nodes; and similar power dissipation for all nodes. Such algorithms, also used in early proposals for designing massively parallel processor chips (Zajac, Collet and Napieralski, 2008), are ineffective in highly defective and fast-degrading hardware. Together with the highly innovative circuit-level fault tolerance, routing, and task allocation and scheduling, automatic monitoring, control, and self-regulation of circuit parameters ensure optimal operation: meeting performance requirements while minimizing power under circuit degradation and an evolving application context. The result is a computing paradigm that achieves robustness in a manner that resembles biological systems in multiple respects. This trend will necessarily be reinforced as post-CMOS technologies enable ever higher integration complexities.

References

Anghel, L. & Nicolaidis, M. (2000). Cost Reduction and Evaluation of a Temporary Faults Detecting Technique. In Proceedings, Design Automation and Test in Europe Conference, March 2000, Paris (Best Paper Award).
Anghel, L. & Nicolaidis, M. (2008). Cost Reduction and Evaluation of a Temporary Faults Detecting Technique. In R. Lauwereins & J. Madsen (Eds.), The Most Influential Papers of 10 Years DATE. Springer. ISBN: 978-1-4020-6487-6.
Chaix, F., Avresky, D., Zergainoh, N. E. & Nicolaidis, M. (2010). Fault-Tolerant Deadlock-Free Adaptive Routing for Any Set of Link and Node Failures in Multi-Cores Systems. In Proc. 9th IEEE International Symposium on Network Computing and Applications (NCA10), July 15-17, 2010, Cambridge, MA.
Chaix, F., Avresky, D., Zergainoh, N. E. & Nicolaidis, M. (2011). A Fault-Tolerant Deadlock-Free Adaptive Routing for On-Chip Interconnects. In Proc. Design Automation and Test in Europe Conference, March 14-18, 2011, Grenoble, France.
Nicolaidis, M. (2005). Design for Soft-Error Mitigation. IEEE Transactions on Device and Materials Reliability, Vol. 5, Issue 3, pp. 405-418, September 2005.
Nicolaidis, M. (2011). Circuit-level Soft-Error Mitigation. In M. Nicolaidis (Ed.), Soft Errors in Modern Electronic Systems. Springer, 2011.
Yu, H., Nicolaidis, M., Anghel, L. & Zergainoh, N. E. (2011). Efficient Fault Detection Architecture Design of Latch-based Low Power DSP/MCU Processor. In Proceedings, 16th IEEE European Test Symposium, May 23-27, 2011, Trondheim, Norway.
Zajac, P., Collet, J. H. & Napieralski, A. (2008). Self-Configuration and Reachability Metrics in Massively Defective Multiport Chips. In Proc. 14th IEEE International On-Line Testing Symposium, July 2008.
STRUCTURAL CONSTRAINTS FOR THE CONSTRUCTION OF MULTI-STRUCTURED DOCUMENTS

PIERRE-ÉDOUARD PORTIER
Université de Lyon, CNRS – INSA de Lyon – LIRIS UMR 5205, F-69621 France

AND

Sylvie Calabretto
Université de Lyon, CNRS – INSA de Lyon – LIRIS UMR 5205, F-69621 France

Abstract. While the computer-mediated interactions for the weaving of relations between fragments of a documentary archive are occurring, structures appear and vocabularies emerge. Can programs be designed to help this effervescent creation not to diverge too quickly? One common solution is to rely on a priori well-defined and closed vocabularies (the so-called ontologies) from which the names used to describe (annotate) and connect fragments are to be chosen. What can be done if such vocabularies aren’t available? In other words: can a system be designed to allow the dynamic construction of vocabularies? We now propose a first version of such a system.

1. Introduction

We study the process of the construction of documents. We observe the emergence of documentary structures. This emergence relies on the creation of dimensions as sets of relations. We aim at providing computational mechanisms to assist the construction of dimensions. First of all, we introduce the notion of a non-trivial machine. By using a notion of computation seen as ordering, and by adopting a pragmatic point of view on the notion of meaning, we can redefine the objective as: programming mechanisms that could ease the circulation of information for the non-trivial machine.

2. Meaning and computation

J. V. Uexküll (1956), a father of ethology, developed a theory of meaning in order to explain in a unified way what he observed on many occasions in different kinds of animals: the same object placed in different environments can take on a different meaning. Thus he deduced that the qualities of an object are only perceptive attributes given by the subject with which they have a connection. Furthermore, when G. Bateson (1972) wonders what it would mean for a computer to “think”, he comes to the conclusion that: “What ‘thinks’ and engages in ‘trial and error’ is the man plus the computer plus the environment. And the lines between man, computer and environment are purely artificial, fictitious lines. They are lines across the pathways along which information or difference is transmitted.” (p. 491). Bateson tried to get rid of the subject/object dichotomy by considering systems described as networks of differences. This links directly to a pragmatic view of meaning taken as an effect of the dynamic creation of relations. In (Saulnier and Longo, 2007), the idea of “conceptual frameworks” is introduced: meaning is to be found in the movements from one framework (or level of meaning) to another. Peirce’s concept of an interpretant is not far off: “A sign […] creates in the mind of that person an equivalent sign, or perhaps a more developed sign. That sign which it creates I call the interpretant of the first sign.” (Peirce, 1897, §228). And the meaning would be this dynamic process of building an interpretant... Finally, H. Von Foerster (2003) proposes a definition of computation as ordering. Ordering can be (i) a description of a given arrangement, or (ii) a re-arrangement of a (i). Moreover, he defines a non-trivial machine (Turing-like) as a machine for which the outputs depend on both the inputs and the state of the machine.
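Von Foerster's distinction can be miniaturized in a few lines (our toy, not his formalism): a trivial machine is a fixed input-output mapping, whereas a non-trivial machine's output depends on the input and on a state that the input also updates.

```python
def trivial(x):
    return x.upper()            # same input, always the same output

class NonTrivial:
    def __init__(self):
        self.state = 0
    def step(self, x):
        out = (x, self.state)   # output depends on input and current state
        self.state += len(x)    # ...and the input changes the state
        return out

m = NonTrivial()
print(m.step("order"), m.step("order"))  # same input, different outputs
```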
On this view, the frontiers of the considered non-trivial machine will include a computer and a user in an environment. This machine is in a dynamic state of producing orderings. “Meaning” refers directly to this production. Indeed, the machine is powered by some desire (for example, the desire to explain a phenomenon), and the more the production of orderings fulfills the desire, the more meaningful the process is. Our task is then to program some mechanisms that could ease the functioning of such a machine.

3. Construction of dimensions

3.1. TREE CONSTRAINT

In the context of document engineering, what is commonly called “the problem of multi-structured documents” is the fact that elements of structures can overlap. Indeed, the most widely used formalisms for document representation (first SGML, then XML) imply tree structures. All of the models proposed to overcome this difficulty are centered on this tree/graph dichotomy. However, in each local occurrence of two overlapping terms, the terms tend to belong to different dimensions or levels of meaning. Thus, in the context of our multi-structured documents platform (Portier and Calabretto, 2010), each time an overlapping situation occurs with terms belonging to the same dimension, we offer the users the possibility to restructure the dimensions (see Figure 1).

Figure 1. Formalization of the user's knowledge when two terms of the same dimension overlap

3.2. ACYCLISM CONSTRAINT

Apart from the annotation of text intervals, relations are interwoven between heterogeneous fragments. An essential part of the research on hyperstructures has created a notion of dimension. The zzstructure of T. Nelson (2004) for dimensional hypertexts is certainly one of the most relevant examples. The abstract function of a dimension is to group similar ways of weaving relations between fragments. Indeed, a naïve graph-based representation doesn't offer appropriate synoptic views (see Figure 2). Thus, the dimensions provide clusters of relations that can compensate for this lack of synthesis by offering new kinds of representations (see Figure 3).

Figure 2. Illustration of a graph-oriented interface for the creation and the visualization of relations

Figure 3. Illustration of a dimension-based interface

In order to help the users in the process of creating dimensions, we are looking for a structural constraint whose violation is often meaningful and quite easy to dynamically detect. The acyclism constraint seems to be well adapted. Take for example the situation of Figure 4, where a user has successively created two associations, but when he adds a third relation, a cycle appears.

Figure 4. After the free creation of some relations, a cycle appears within the “d” dimension

The user is advised to restructure the dimensions so as to remove the cycle (see for example Figure 5).

Figure 5. Formalization of the users' structural knowledge after the automatic detection of a cycle within a dimension

4. Conclusion

This work is a first step towards a different point of view on computation, seen as the construction of orderings by a non-trivial machine driven by a desire to explain some phenomenon. In such a configuration, new kinds of programs have to be developed in order to dynamically react to the user's actions by, for example, computing the appropriate times for helping the users to formalize their structural knowledge.
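One such mechanism is the acyclism check of Section 3.2, which can be sketched in a few lines (a minimal illustration of ours, not the actual DINAH implementation): before a relation is committed to a dimension, a depth-first search tests whether the target of the new relation already reaches its source.

```python
def creates_cycle(edges, new_edge):
    """`edges` is a list of (source, target) pairs within one dimension.
    Adding `new_edge` creates a cycle iff its target already reaches
    its source through the existing relations."""
    src, dst = new_edge
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True          # dst reaches src: the new edge closes a cycle
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency.get(node, []))
    return False

d = [("a", "b"), ("b", "c")]
print(creates_cycle(d, ("c", "a")))  # True: prompt the user to restructure
```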
Acknowledgements

We would like to thank the team of researchers from the Jean-Toussaint Desanti Institute for their collaboration during the development of this work.

References

Bateson, G. (1972). Steps to an Ecology of Mind. The University of Chicago Press.
Nelson, T. H. (2004). A cosmology for a different computer universe: data model, mechanisms, virtual machine and visualization infrastructure. Journal of Digital Information, 5(1).
Peirce, C. S. (1897). Collected Papers of Charles Sanders Peirce, Vol. 2. Cambridge: Harvard University Press.
Portier, P.-E. & Calabretto, S. (2010). DINAH, a philological platform for the construction of multi-structured documents. In Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries, Glasgow, UK, pp. 364-375.
Saulnier, B. & Longo, G. (2007). Le jeu du discret et du continu en modélisation : relativité dynamique des structures conceptuelles. In Intelligence de la complexité, épistémologie et pragmatique. Éditions de l'Aube.
Von Foerster, H. (2003). Responsibilities of Competence. In Understanding Understanding: Essays on Cybernetics and Cognition (p. 191). Springer.
Von Uexküll, J. (1956). Théorie de la signification. Éditions Denoël, Hambourg.

(DIS-)TASTEFUL MACHINES?
Aesthetic Cognition and the Computational Turn in Aesthetics

WILLIAM W. YORK
Center for Research on Concepts and Cognition, Indiana University
512 North Fess Street, Bloomington, Indiana 47408-3822

AND

HAMID R. EKBIA
Center for Research on Mediated Interaction, Indiana University
1320 E. 10th Street, Bloomington, IN 47405-3907

Abstract. While aesthetics and cognition have traditionally been viewed as distinct from—even opposed to—one another, recent stirrings indicate the beginnings of an “aesthetic turn” regarding cognition. Does this, in turn, open up the possibility of a computational turn in the study of aesthetics? Can computational methods such as modeling and simulation be effectively brought to bear on something as mysterious and ineffable as aesthetic judgment? Or is “aesthetic cognition” a contradiction in terms? We explore these questions by focusing on the relationship between aesthetics and analogy-making, an area of cognition for which some research groundwork has already been laid. We will first offer some illustrative examples of this relationship, and then examine a group of computer models that have begun to explore mechanisms that may account for it. Although rudimentary in their capabilities, these models point to a computational perspective for investigating not only the analogy–aesthetics relationship, but the processes underlying aesthetic cognition more generally.

1. Introduction

As Mark Johnson (2007) recently put it, “[A]esthetics is not just art theory, but rather should be regarded broadly as the study of how humans make and experience meaning” (p. 209). Aesthetic considerations factor into seemingly mundane everyday experience as well as into more exalted intellectual pursuits. Regarding the latter, Robert Root-Bernstein (2002) has used the term “aesthetic cognition” to refer to the “pre-logical, emotion-laden, intuition-based feeling of understanding” (p. 62) that guides creative thought in science and mathematics. In some quarters, the term “aesthetic cognition” might seem like a contradiction.
There is a deeply rooted tendency to view the aesthetic and the cognitive as distinct from, if not opposed to, one another (Aiken, 1955). Yet recent stirrings from various quarters (e.g., Deacon, 2006; Norman, 2003) suggest that we are seeing the beginnings of an “aesthetic turn” in cognitive science.

2. A Computational Turn in Aesthetics?

Does this “aesthetic turn,” meanwhile, open up the possibility of a computational turn in aesthetics? Can the study of aesthetics be opened up to computational methods such as modeling and simulation? If so, how can they be effectively brought to bear on something as seemingly mysterious and ineffable as aesthetic sensibility? If not, what do we make of Root-Bernstein’s (2002) claim that “artificial intelligence will fail to provide insights into human thinking or model its capabilities until aesthetic cognition is itself understood sufficiently to be modeled and implemented by computers” (p. 75)?

Broadly speaking, there are two potential reactions to these questions. Optimistically, one might contend that fields such as cognitive science and artificial intelligence (AI) can—and, to some extent, already have—shed light on these questions, in part through the use of computer models, perhaps in combination with findings from neuroscience and experimental psychology. There is also the developing field of computational aesthetics (Hoenig, 2005). Despite its somewhat different emphases—which range from image-processing techniques to computer-generated art to formal analysis of artworks—the growth of this new field offers further evidence of the potential relevance of computation to aesthetics (and vice versa). In turn, skeptics might reply that longstanding problems in aesthetics have remained unsettled for a reason: There may simply be limits to what we can understand when it comes to matters of judgment, sensibility, and taste (Weizenbaum, 1976). To explain aesthetic sensibility would seem to involve specifying, formalizing, or mechanizing those same intuitive processes that have been defined as unspecifiable, unformalizable, or non-mechanizable (e.g., Polanyi, 1981; Dreyfus, 1992). This debate between optimists and skeptics is ongoing, encompassing other areas of human cognition and behavior; in particular, it has been framed around various theories and models in artificial intelligence (Ekbia, 2008). Is there a meaningful way to resolve, or at least advance, this debate?

3. Analogy-Making as Aesthetic Cognition

The perceptual and (especially) the aesthetic dimensions of analogy-making have been downplayed in much research on analogy within cognitive science and AI, where the focus has instead been on “analogical reasoning” (e.g., Winston, 1980). Yet analogy is not coextensive with reasoning, and the idea that analogy-making involves an aesthetic component does have some precedent. For example, in the program Copycat—a model of analogy-making in the microdomain of letter strings (e.g., “If abc is changed to abd, then how should kkjjii be changed?”)—the “computational temperature” at the end of a run can be construed as a sort of aesthetic evaluation of the program’s answer (Mitchell, 1993). Copycat’s successor, Metacat, is able to compare different answers to a given analogy problem—say, kkjjhh and kkjjij in response to the example given above—on the basis of three largely aesthetic dimensions: uniformity, abstractness, and succinctness (Marshall, 1999).
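The contrast between such answers can be made concrete with a toy of our own, far simpler than Copycat or Metacat: a literal reading of the rule and a structure-sensitive reading yield the two candidate answers just mentioned.

```python
def successor(c):
    return chr(ord(c) + 1)   # toy: no wrap-around at 'z'

def predecessor(c):
    return chr(ord(c) - 1)   # toy: no wrap-around at 'a'

def literal_answer(s):
    """Read 'abc -> abd' as: replace the last letter by its successor."""
    return s[:-1] + successor(s[-1])

def structural_answer(s):
    """Read 'kkjjii' as the descending group sequence kk-jj-ii, and map
    'last letter' to 'last group' and 'successor' to 'predecessor'."""
    groups = []
    for c in s:
        if groups and groups[-1][0] == c:
            groups[-1] += c
        else:
            groups.append(c)
    groups[-1] = predecessor(groups[-1][0]) * len(groups[-1])
    return "".join(groups)

print(literal_answer("kkjjii"))     # kkjjij
print(structural_answer("kkjjii"))  # kkjjhh
```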
Likewise, the idea that aesthetic sensibility involves an ability to perceive and appreciate analogies has also been noted before. For example, Koestler (1964) refers to the “hidden analogies” that inform the creative process in science, art, and humor. Arnheim (1969) discusses the role of analogy in the perception and grouping of visual forms, including what might be called “visual rhymes.” Similar types of analogical mappings can be identified in the plot structures of films, novels, and other narrative forms. Meanwhile, the role of aesthetic factors in science and mathematics has also been explored (e.g., Papert, 1988; Sinclair, 2004), further highlighting the connection between aesthetic sensibility, insight, perception, and analogy. Finally, computer models such as Letter Spirit (Rehling, 2001) have explored the role of analogy in the more traditionally aesthetic realm of alphabetic font (or grid font) design.

4. Open Questions

Models such as Copycat and Letter Spirit suggest a potentially rewarding perspective for investigating not only the analogy–aesthetics relationship, but the processes underlying aesthetic cognition more generally. But to what extent can such computational approaches ultimately contribute to this joint understanding? What are the strengths (and limits) of computer models that aim to simulate the processes of analogy-making and aesthetic judgment in human beings? Finally, is there potential for common ground between cognitive science/AI and the growing field of computational aesthetics?

Acknowledgements

Thank you to Helga Keller (R.I.P.) for her tireless support over the years.

References

Aiken, H. D. (1955). Some notes concerning the cognitive and the aesthetic. The Journal of Aesthetics and Art Criticism, 13(3), 378–394.
Arnheim, R. (1969). Visual Thinking. Berkeley: Univ. of California Press.
Deacon, T. (2006). The aesthetic faculty. In M. Turner (Ed.), The Artful Mind: Cognitive Science and the Riddle of Human Creativity (pp. 3–20). Oxford: Oxford Univ. Press.
Dreyfus, H. (1992). What Computers Still Can’t Do: A Critique of Artificial Reason. Cambridge, Mass.: MIT Press.
Ekbia, H. R. (2008). Artificial Dreams: The Quest for Non-Biological Intelligence. Cambridge, U.K.: Cambridge Univ. Press.
Hoenig, F. (2005). Defining computational aesthetics. In L. Neumann, M. Sbert, B. Gooch, and W. Purgathofer (Eds.), Computational Aesthetics 2005: Eurographics Workshop on Computational Aesthetics in Graphics, Visualization, and Imaging (pp. 13–18).
Johnson, M. (2007). The Meaning of the Body: Aesthetics of Human Understanding. Chicago: Univ. of Chicago Press.
Koestler, A. (1964). The Act of Creation. New York: Macmillan.
Marshall, J. (1999). Metacat: A Self-Watching Cognitive Architecture for Analogy-Making and High-Level Perception. Doctoral dissertation, Indiana Univ., Bloomington.
Mitchell, M. (1993). Analogy-Making as Perception: A Computer Model. Cambridge, Mass.: MIT Press.
Norman, D. (2003). Emotional Design: Why We Love (or Hate) Everyday Things. New York: Basic Books.
Papert, S. (1988). The mathematical unconscious. In J. Wechsler (Ed.), On Aesthetics in Science (pp. 105–120).
Polanyi, M. (1981). The creative imagination. In D. Dutton & M. Krausz (Eds.), The Concept of Creativity in Science and Art (pp. 91–108). The Hague, Netherlands: Nijhoff.
Rehling, J. A. (2001). Letter Spirit (Part Two): Modeling Creativity in a Visual Domain. Doctoral dissertation, Indiana Univ., Bloomington.
Root-Bernstein, R. S. (2002). Aesthetic cognition. International Studies in the Philosophy of Science, 16(1), 61–77.
Sinclair, N. (2004). The roles of the aesthetic in mathematical inquiry. Mathematical Thinking and Learning, 6(3), 261–284.
Weizenbaum, J. (1976). Computer Power and Human Reason: From Judgment to Calculation. San Francisco: W. H. Freeman and Co.
Winston, P. H. (1980). Learning and reasoning by analogy. Communications of the ACM, 23(12), 689–703.

Track VII: Social Computing

The social and its political dimension in software design
A Socio-Political Approach

DORIS ALLHUTTER
Austrian Academy of Sciences, Institute of Technology Assessment
Strohgasse 45, 1030 Vienna

Abstract. Recent debates in philosophy and computing and in science and technology studies address the prolongation of the social in technical design and development, and thus the question of discursive performativity. Applying a wider conception of the social than usually referred to in design research, I present an initial elaboration of a socio-political approach to software design. This approach is based in discourse theory, deconstructivism and ‘new materialism’ and focuses on the reproduction of power by tracing the performativity of hegemonic societal discourses and their co-materialization with (normative) technological phenomena. Making use of Karen Barad’s material-discursive account of performativity, I argue that a socio-political approach to software design needs to take into account the ‘intra-action’ of material phenomena with reconfigurings of power relations in intertwined epistemic and everyday work practices. The objectives of this endeavour are, first, to ask and make negotiable who (in/formal hierarchies) and what (discursive hegemonies) is given normative power in design processes on the basis of which social and technological imaginaries; second, to investigate and, to some extent, try to make tangible how these—mostly unconscious—normative enactments co-materialize with material phenomena or relations; and eventually, to elaborate on how to widen human agency by opening spaces for maneuver or trading zones when taking account of the agency of human/non-human assemblages or material-discursive re-configurations of the world.

Recent debates in philosophy and computing and in science and technology studies have expanded the question of the prolongation of the social in technical design and development by taking into account the concept of discursive performativity. Inspired by this discussion and applying a wider conception of the social than usually referred to in research on the development of computational artifacts, I present an initial elaboration of a socio-political approach to software design. This socio-political approach connects to the notion of ontological politics (see Mol, 1999) and is based in discourse theory, deconstructivism and ‘new materialism’. It focuses on the reproduction of power by tracing the performativity of hegemonic societal discourses and their co-materialization with (normative) technological phenomena. Karen Barad’s (2007) materialistic elaboration of the concept of performativity shifts the focus from a linguistic and discursive account of performativity, which is linked to the paradigm of the co-construction of society and technology, to the notion of co-materialization.
She criticizes earlier approaches to processes of materialization (as for example introduced by Butler and Foucault) that centre on the question of ‘how discourse comes to matter’. Barad suggests that their focus on the social constructedness of bodies/materiality in fact neglects the question of ‘how matter comes to matter’, and she puts an equal focus on the material dimensions of agency. In my previous work, Donna Haraway’s account of ‘embodied, situated practices’ and Judith Butler’s concept of discursive performativity have inspired me to investigate software design processes as entangled practices informed by technological concepts and hegemonic societal discourses as much as by professional self-conceptions of developers and related workplace politics (see Allhutter, 2011). Barad’s materialistic move, which resulted in her elaboration of ‘agential realism’, can add to such a perspective on software design in that it conceptually takes into account the agency of materiality or material phenomena (see also Velden and Mörtberg, 2011). The question of how to make use of a material-discursive account of performativity in applied design research remains open. In this respect, I suggest that it makes sense to reconstruct the journey of two crucial concepts—‘agency’ and ‘materialism’—that have been travelling between disciplines and research fields: While questions of the agency of artifacts and human/non-human (re-)configurations have been intensively discussed in studies of science and technology since the early 1980s (Callon, Latour, Law, Haraway), only recently have political science scholars such as Jane Bennett (2010) and Diane Coole and Samantha Frost (2010) begun to integrate this strand of theory to rethink concepts of political agency and to rework the notion of materialism, now discussed as ‘new materialisms’. Against this background, I argue that a socio-political approach to software design practice and theory needs to take into account the ‘intra-action’ of material phenomena with reconfigurings of power relations (normativity and societal hegemonies) in intertwined epistemic and everyday work practices. My objective of elaborating such a socio-political approach based on a material-discursive account of performativity is threefold: First, the aim is to ask and make negotiable who (in/formal hierarchies) and what (discursive hegemonies) is given normative power in design processes on the basis of which social and technological imaginaries (e.g. re-enactments of societal differences and epistemic dichotomies); second, to investigate and, to some extent, try to make tangible how these—mostly unconscious—normative enactments co-materialize with material phenomena or relations (e.g. development methods, processes, artifacts); and eventually, to elaborate on how to widen human agency by opening spaces for maneuver or trading zones (Allhutter and Hofmann, 2010) when taking account of the agency of human/non-human assemblages or material-discursive re-configurations of the world.

References

Allhutter, D. (2011). Mind Scripting: A Method for Deconstructive Design. Science, Technology & Human Values, OnlineFirst March 13, 2011.
Allhutter, D. & Hofmann, R. (2010). Deconstructive Design as an Approach to Opening Trading Zones. In: J. Vallverdú (Ed.), Thinking Machines and the Philosophy of Computer Science: Concepts and Principles (pp. 175–192). Hershey: IGI Global.
Barad, K. (2007).
Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter and Meaning. Durham and London: Duke University Press.
Bennett, J. (2010). Vibrant Matter: A Political Ecology of Things. Durham and London: Duke University Press.
Coole, D. & Frost, S. (2010). New Materialisms: Ontology, Agency, and Politics. Durham and London: Duke University Press.
Mol, A. (1999). Ontological Politics: a Word and Some Questions. In: J. Law and J. Hassard (Eds.), Actor Network Theory and After (pp. 74–89). Oxford and Keele: Blackwell and the Sociological Review.
Velden, M. van der & Mörtberg, C. (2011). Between Need and Desire: Exploring Strategies for Gendering Design. Science, Technology & Human Values, OnlineFirst March 13, 2011.

A SOCIAL EPISTEMOLOGICAL APPROACH FOR DISTRIBUTED COMPUTER SECURITY

STEVE BARKER
Department of Informatics, King’s College London

Abstract. We present a social epistemological approach for treating an aspect of computer security, which allows multiple testifiers to contribute propositional attitude reports to a community repository of testimonial knowledge, and users to adopt a range of epistemic positions for deciding what constitutes justified belief in different contexts.

1. Introduction

We discuss a key epistemological aspect of the distributed access control (DAC) problem: in large, distributed computer systems, like the Internet, how can a decision be rendered on whether a requester of access to a resource is authorised to perform an action on the resource if what is known by the decision-maker about a requester is “incomplete”? (And it is computationally too expensive for the decision-maker to exhaustively search for all of the knowledge it (ideally) requires on the requester.) Rather than simply rejecting the access request on the basis of the incompleteness of its knowledge, the putative solution to the DAC problem is for the decision-maker to accept the assertions of some individual, ultimately trusted testifier who “speaks for” the requester and in so doing enables the decision-maker to determine whether the requester is authorised to perform a requested action on a resource. The notion of an ultimately trustworthy source of epistemic warrant assumes that a foundationalist (Bonjour, 1985) position on knowledge/justification applies in the DAC case; there is no infinite justificational regress because what the trusted source asserts is so. In Section 2 of this abstract, we suggest an alternative, social epistemological approach to the DAC problem. In Section 3, we draw conclusions.

2. An Alternative Approach to the DAC Problem

We argue for a community-based approach to testimonial warrant and for testifiers making assertions of their propositional attitudes (Russell, 1905) via a community-based repository, which is a store of triples (s, α, p) such that s is a source of assertions in a community of sources Σ = {s, s1, ..., sn} of testimonial warrant, p is a proposition, and α is a propositional attitude that a source in Σ has in relation to p. We note that p may be an atomic proposition or an arbitrary logical formula, we restrict attention to the doxastic attitudes “believes” and “disbelieves”, and we interpret a source as suspending belief on p if it makes no assertion of p to the community repository. The triples (si, α, p) represent that-clauses, e.g., si believes that sj is a “bad debtor”.
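A minimal sketch (our reading in Python, not Barker's formal system) of such a repository, together with a few of the acceptance policies discussed next:

```python
class Repository:
    """Community store of (source, attitude, proposition) triples."""
    def __init__(self):
        self.triples = set()

    def assert_(self, source, attitude, proposition):
        self.triples.add((source, attitude, proposition))

    def holders(self, attitude, proposition):
        return {s for s, a, p in self.triples
                if a == attitude and p == proposition}

# Three acceptance policies a decision-maker might adopt; the uniform
# signature makes the policies interchangeable.
def any_source(repo, community, p):
    return bool(repo.holders("believes", p))

def majority(repo, community, p):
    return len(repo.holders("believes", p)) > len(community) / 2

def undisputed(repo, community, p):
    """Some source believes p and no source disbelieves it."""
    return bool(repo.holders("believes", p)) and not repo.holders("disbelieves", p)

repo = Repository()
community = {"s1", "s2", "s3"}
repo.assert_("s1", "believes", "requester is a good trader")
repo.assert_("s2", "believes", "requester is a good trader")
print(majority(repo, community, "requester is a good trader"))  # True
```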
Typically, in the DAC scenario, the assertions are on a requester's reputation, e.g., for being a "bad debtor"; the categories of requesters to be used are community determined. In the context we assume, authorisation depends on the assignment of a requester to a category, e.g., s is authorised to perform some action on a resource iff s is categorised as a "good trader" (say). We suggest that what we propose is appropriate for addressing the DAC problem in that it recognises the need for knowledge construction by a division of epistemic labour, it allows for justified belief to be community constructed (which we hold to be more reliable than exclusively using individual, foundational sources of testimonial knowledge), and it recognises that, in the context of interest, "truth" is appropriately held to be relative to a community.

It is open to decision-makers to decide what methods of computation to use, with the community repository, in order to have justified beliefs for deciding on authorisation requests. A decision-maker may simply accept that the propositional attitude α holds in relation to p if some specific source s ∈ Σ expresses that directly. However, this is far from being the only option. A decision-maker may, for example, accept that α holds in relation to p because some non-specific member of Σ asserts that, or all members of Σ assert that, or it is the "majority view" (variously interpreted) of members of Σ that α holds in relation to p. Moreover, more complex requirements may be expressed in more expressive logic languages, e.g., an acceptor may accept that α applies in relation to p if some si ∈ Σ asserts that and no source in Σ disbelieves p. It is important to note that we allow individual decision-makers to decide on what constitutes evidence for them "knowing" that an authorisation holds, that the knowledge for this is socially constructed, and that different forms of inferential knowledge will be applicable for decision-making in different contexts (cf. DeRose 1992). In the evidentialist framework that we adopt (Feldman and Conee 1985), we say that: a decision-maker γ is justified in adopting the assertion by s ∈ Σ that the propositional attitude α holds in relation to the proposition p at the time t iff the attitude α on p is entailed by some computational method that γ justifiably holds to be reliable for this entailment at the time t, from the evidential sources that γ justifiably holds to be sufficiently authoritative for the purpose of making the inference that α holds on p according to s at t. Evidentialist-based interpretations of a variety of epistemic positions will be adopted in practice. It follows that we do not argue that foundationalism is not a meaningful epistemic position to adopt in the DAC context. Rather, we suggest that different epistemic positions (e.g., foundationalist, Haackean foundherentist, etc.) will apply in different contexts. It is the emphasis on a plurality of epistemic positions that is distinctive about our approach.

3. Conclusions

We critically assessed the foundationalist epistemic position that has hitherto been assumed in treating the DAC problem. We then argued for a social epistemological alternative, which accommodates propositional attitude reports, community-based testimonial assertions and the flexible use of a range of methods for producing inferential knowledge.
In future work we intend to consider repositories that maintain a history of propositional attitudes and the epistemic issues that arise.

References
Bonjour, L. (1985). The Structure of Empirical Knowledge. Harvard University Press.
DeRose, K. (1992). Contextualism and Knowledge Attributions. Philosophy and Phenomenological Research, 52, 913-929.
Feldman, R. & Conee, E. (1985). Evidentialism. Philosophical Studies, 48, 15-34.
Russell, B. (1905). On Denoting. Mind, 14, 479-93.

TRUST, POWER, AND INFORMATION TECHNOLOGY

MARK COECKELBERGH
University of Twente
Department of Philosophy, P.O. Box 217, 7500 AE Enschede, The Netherlands
E-mail: m.coeckelbergh@utwente.nl

Abstract. This paper offers a preliminary discussion of the relation between trust, power, and information technology. It also explores some implications for the ethics and politics of information technology.

1. Introduction

In recent years the issue of trust has received much attention in ethics and philosophy of information technology. For instance, there is work on e-trust and on-line trust: some argue against e-trust (for example Nissenbaum 2001), while others are more optimistic about trust in digital contexts (Taddeo 2009, 2010a, 2010c; Turilli et al. 2010). Furthermore, in the field of social epistemology there is work on trust and knowledge (Simon 2009; Taddeo 2010b), and people working in the virtue ethics and phenomenological traditions have developed a notion of 'implicit' trust (Ess 2010; Carusi 2009). While this attention to trust has produced insightful work relevant to both philosophers and computer scientists who try to model trust, there is little or no attention to the relations between trust, power, and information technology. This paper is a preliminary attempt to explore this relation. First I will clear the ground by making a claim regarding the epistemology of trust (I will need this later); then I will make two claims about the relation between trust and power: (1) trust presupposes power relations, and (2) trust creates power relations. This analysis will allow me to make some suggestions about the implications for the ethics and politics of information technology.

2. Trust, Knowledge and Transparency

Although it is true that trust can emerge in uncertain and risky on-line environments and that in one sense trust promotes transparency, as Turilli and others have argued (Turilli et al. 2010), there is also a sense in which (a) trust can only exist under conditions of uncertainty and (b) transparency destroys trust. In order to develop these claims, we must challenge the rationalist-contractarian assumption entertained in Taddeo's work, that e-trust cannot appear a priori but depends on the assessment of trustworthiness by a rational (artificial) agent (Taddeo 2010c). A phenomenological notion of trust, by contrast, involves a sort of a priori, implicit form of trust. This form of trust flourishes only in environments characterized by incomplete certainty, knowledge and transparency. If there were complete uncertainty, a complete lack of knowledge, and no transparency at all, we would have no basis for trust. On this point rationalist-contractarian models are right. If, however, there were complete knowledge, complete certainty, and full transparency, there would be no need for trust; the problem would not arise in the first place. This suggests that if political movements aim for total, absolute transparency (e.g.
Wikileaks), they risk destroying trust, which must be situated 'in between' the epistemic absolutes identified. However, this is a claim about knowledge; what about trust with regard to action?

3. Trust and Power (1)

If trust is not entirely freely decided by rational agents, but presupposed in social relations, then we need to discuss how prior social relations, understood as power relations, shape trust. There are a priori dependencies that enable but also constrain agency with regard to trust. In a particular social network, I 'have' to trust some others and indeed some technologies (e.g. software) since, and to the extent that, I am dependent on them for the very practice I am engaged in. In any social network, I am dependent on some key, powerful actors and technologies which I 'have' to trust because they are powerful. This limits my agency with regard to trust. Power relations – relations with others and with technologies – already shape trust 'before' any decision or deliberation about trust is made. If this is true, it does not only set limits to efforts to model and implement trust in artificial networks; it is also relevant for the ethical-philosophical analysis of trust in digital environments 'inhabited' or 'crawled' by both humans and artificial agents. In the digital age, trust crucially depends on power exercised by the 'architects', 'providers' and 'webmasters' of the social-technological networks that form and transform our interactions and practices (including academic practice). But how did these social actors become powerful in the first place? Does this analysis preclude agency altogether?

4. Trust and Power (2)

Even a strictly rationalist-contractarian approach to trust must acknowledge that trust, 'decided' upon by rational agents, creates power relations and generates its own normativity with regard to humans and their artificial cooperants. If an agent A says 'I trust you' to an agent B, this does not only create expectations A has about B's future actions, but also involves a delegation of (discretionary) power from A to B. In addition, and this is the normative aspect, A makes B responsible. If A trusts B to do something, then A holds B responsible for doing that. In particular, if B decides to do otherwise (trust presupposes that B has this space of freedom), then B has to provide reasons to A, to explain why (s)he did not do what A expected him or her to do. Trust is violated if no good reasons are given by B. This analysis of the relations between trust, power, and normativity is relevant for 'horizontal' social relations, but also for the 'vertical' relation between individuals and the state. This works both ways: (1) an individual A may trust state B, which implies that A delegates power to B to do something and that B becomes responsible; A's trust can then be violated by B if B fails to do this and fails to give good reasons for not doing it. (2) State A can trust its citizens B (not) to do something, that is, hold B responsible, and B can violate this trust.

5. Conclusion

I conclude that this framework, which tolerates and employs both rationalist-contractarian and phenomenological approaches, reveals a lacuna in the present literature and allows us to analyze and discuss the power dimension of issues in social epistemology, information ethics and the philosophy of information.
For example, in the Wikileaks case, there seems to be a clash between, on the one hand, a vertical 'delegation' model, which creates the possibility of trust under conditions of uncertainty, and, on the other hand, a model that aims at transparency, attempts to provide complete knowledge, and seeks to abolish the vertical delegation relation – and thereby abolishes trust in the sense discussed above. Of course this analysis does not exhaust the many interpretations of the word 'trust' used in the literature. And perhaps a tension remains between rationalist-contractarian and phenomenological approaches. Furthermore, neither power nor trust should be our only concern in the ethics and politics of information technologies. However, I hope this exploration of the relation between trust, power, and information technologies can contribute to the expanding research on trust and information technology.

References
Carusi, A. (2009). Implicit Trust in the Space of Reasons: A Response to Justine Pila. Journal of Social Epistemology, 23(1), 25-43.
Ess, C. (2010). Trust and New Communication Technologies. Knowledge, Technology, & Policy, 23(3-4), 287-305.
Nissenbaum, H. (2001). Securing Trust Online: Wisdom or Oxymoron. Boston University Law Review, 81(3), 635-664.
Simon, J. (2009). Webs of Trust and Knowledge: Knowing and Trusting in the World Wide Web. In: Proceedings of WebSci'09: Society On-Line, 18-20 March 2009, Athens, Greece.
Taddeo, M. (2009). Defining Trust and E-trust: Old Theories and New Problems. International Journal of Technology and Human Interaction, 5(2), 23-35.
Taddeo, M. (2010a). Trust in Technology: a Distinctive and a Problematic Relation. Knowledge, Technology and Policy, 23(3-4), 283-286.
Taddeo, M. (2010b). An Information-Based Solution for the Puzzle of Testimony and Trust. Social Epistemology, 24(4), 285-299.
Taddeo, M. (2010c). Modelling Trust in Artificial Agents: A First Step toward the Analysis of e-Trust. Minds and Machines, 20(2), 243-257.
Turilli, M., Vaccaro, A., & Taddeo, M. (2010). The case of on-line trust. Knowledge, Technology and Policy, 23(3-4), 333-345.

THE BENEFITS OF SOCIAL THEORY FOR MODELLING STABLE ENVIRONMENTS OF SYSTEMIC TRUST WITHIN MULTI AGENT SYSTEMS

DIEGO COMPAGNA
University of Duisburg-Essen, Institute of Sociology
Lotharstr. 65 (LE 643), 47057 Duisburg

1. Modelling Stable Environments of Systemic Trust within Multi Agent Systems

Trust is often discussed on the micro-level of individuals or discrete entities; instead, I would like to stress the benefits of systemic trust, which can be seen as a form of mediated trust between entities. Based on the proposition of the 'Homeostatic Feedback Loop' by Anthony Giddens, a stable social environment can be modeled for Multi Agent Systems (MAS). The goal of this model is that, on the one hand, trust is built as a non-intended effect on the systemic level, from which, on the other hand, all participating entities benefit: the outcome is a self-sustaining framework, or a homeostatic systemic state. In this model, trust emerges as the result of non-intended effects of distinct actions between different agents that can be described as functional cooperation. The specific characteristic of the Causal Feedback Loop – the core proposition within the notion of a duality of structure (Giddens 1984) – could be very useful for a MAS architecture that unfolds a stable environment (Compagna 2009).
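The following toy sketch in Python (our reading of the feedback-loop idea, with all quantities as hypothetical scalars; it is not Compagna's formal model) illustrates the mechanism: agents act only on individual motives, the side-effects of their actions reproduce the shared action-framework, and systemic trust appears as an emergent, non-intended observable of that loop:

    import random

    class Agent:
        """An intentionally motivated agent; it never aims at 'trust'."""
        def __init__(self):
            self.payoff = 0.0

        def act(self, structure):
            # Self-interested effort, scaled by how reliable the shared
            # action-framework currently appears; 0.1 is a minimal
            # exogenous activity level so the loop has a non-trivial
            # fixed point.
            effort = 0.1 + random.uniform(0.5, 1.0) * structure
            self.payoff += effort   # intended, individual outcome
            return effort           # non-intended, systemic contribution

    def run(steps=200, n_agents=10):
        agents = [Agent() for _ in range(n_agents)]
        structure = 0.5   # state of the shared action-framework
        trust = 0.0       # systemic trust: an emergent observable
        for _ in range(steps):
            side_effects = [a.act(structure) for a in agents]
            # Feedback loop: the unintended side-effects of action
            # reproduce the very structure that enabled the action.
            structure = 0.9 * structure + 0.1 * sum(side_effects) / n_agents
            trust = 0.9 * trust + 0.1 * structure
        return structure, trust

Run repeatedly, the loop settles into a homeostatic state in which the framework is stable and the trust variable is high, although no agent ever intended this, which is the sense in which systemic trust is claimed to be an emergent value of functional cooperation.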
The main assumption behind the concept of the duality of structure is that actions and the framework of these actions are organized recursively or, in terms of social systems theory, in the mode of autopoietic sustainment (Giddens 1991). Within such an environment of mutual but non-intended functionality, trust becomes an emergent value, or a non-intended outcome. Based on an early paper by Castelfranchi and Conte (1992), different kinds of cooperation can be described: non-intended, intentional, out-designed and functional. Functional cooperation is described as the best way to establish a fruitful and stable cooperation between agents. This type of cooperation can be related to, captured and further conceptualized very well with the theory of structuration. The model I would like to present, combining the above-mentioned propositions, consists in an action-framework that is functional for the involved agents as a mutual goal, although this functionality is not directly intended by their intentionally motivated actions. Although this model claims to explain and accomplish a stable framework for MAS, it can be transferred to a Human-Agent setting in which, by non-intended effects, a stable interaction framework emerges that provides a favorable context for mutual system trust.

References
Castelfranchi, C. & Conte, R. (1992). Emergent functionality among intelligent systems: Cooperation within and without minds. AI & Society, 6(1), 78-87.
Compagna, D. (2009). Sozionik und Sozialtheorie: Zum Beitrag soziologischer Theorien für die Entwicklung von Multiagentensystemen. Saarbrücken: VDM Verlag.
Giddens, A. (1984). The Constitution of Society: Outline of the Theory of Structuration. Cambridge: Polity Press.
Giddens, A. (1991). Structuration theory: Past, present and future. In: C. G. A. Bryant & D. Jary (Eds.), Giddens' Theory of Structuration: A Critical Appreciation (pp. 201-221). London: Routledge.

COMPUTER NETWORKS AND THE PHILOSOPHY OF MIND
A Social Mind – Networked Computer Analogy

ISTVAN DANKA
Department of Philosophy, University of Leeds
Leeds, LS2 9JT, United Kingdom

In the last few decades, computer analogies of the mind have dominated several central fields of the philosophy of mind. The leading versions of the 'mind – computer' analogy are based on the Interface Model of the Mind (to use Putnam's phrase), claiming that the mind of an individual is analogous to a computer with an interface connection to its environment. As opposed to this, I shall develop a Network Model of the Mind, based on an analogy between the socially extended mind and a computer network, according to which social relations and the semantic content of the WWW are analogously structured. In accordance with Clark and Chalmers' extended mind hypothesis, I shall argue that there are active constituent parts of mental processes that are located externally to the mind of an individual, just as there are semantic contents external to individual computers. A network model of the mind is the opposite of the interface model in the following sense. The interface model rests on the (Cartesian-inspired) assumption that there is a surface on which the mind interacts with its environment. For a social externalist, the mind is extended over the limits of the body, and hence no "surface" of the individual can be drawn.
For a social externalist, mental processes are more plausibly understood as social activities among interlinked individuals. In either case, it makes no sense to allude to any interface. For a network model, what is essential in the structure of mental contents is not separation but connection. Hence, it explains the mental in terms of connections among mental contents in the minds of different individuals.

At least two significant versions of the 'social mind – networked computer' analogy can be developed. On the one hand, one can argue for an analogy between socially embedded individual minds and networked computers. In this case, the connections have to be understood as physical connections among computers (i.e., the internet) on the one hand, and socially connected individual humans (social networks) on the other. The second version is philosophically more interesting, though. Namely, an analogy can be drawn between semantic content on the net (WWW) on the one hand, and mental content structured socially on the other hand. This analogy demonstrates that mental contents cannot be individually located in our heads since, analogically, semantically significant units of content are not necessarily contained by the server but are often spread over multiple machines (e.g. cookies).

Regarding the connections among mental contents, I shall distinguish three structurally different models of the individual mind in terms of the relations among mental contents. First, centralised (Cartesian/Kantian) views argue that there is a centre of mental content (the soul, the mind, the Self, etc.), to which all mental contents are (directly or indirectly) connected. Second, non-centralised (behaviourist/physicalist) views claim that no centre of mental contents is provided; the best model for the relations among mental contents is a random graph. Third, de-centralised models (e.g. Quine) claim that there is a difference between central and peripheral mental contents; though no clear distinction can be made between the contingent and the necessary, a gradual account of more and less central contents can be provided.

In parallel, there are three main models of the social relations among mental contents. Those who accept centralised models of the individual mind will most probably follow a multi-centred view of the social, claiming that mental contents constitute many centres of individual minds connected to each other randomly. (A logically possible alternative to this would be to argue that there is a centre of the social as well, but no serious attempt has been made to support such a view.) Holders of non-centralised models of the individual can apply their random graph model to the social as well, claiming an equal distribution of socially explained connections among mental contents. Finally, defenders of the de-centralised view claim that there are socially more and less central contents, and even if there is no single centre of the social, several hubs can be identified. Analysing different approaches to how semantic content on the internet is organised, I shall develop a topology of network-based relations among mental contents and argue for a de-centralised network model of the social mind, based mostly on an analogy with A.-L. Barabási's research on the topology of the internet.
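The topology at issue can be generated in a few lines. The following Python sketch (our illustration of the standard Barabási-Albert mechanism, not code from the paper) grows a network by preferential attachment and then inspects its hubs:

    import random
    from collections import Counter

    def preferential_attachment(n, m=2, seed=0):
        """Grow a Barabasi-Albert-style graph: each new node links to m
        existing nodes chosen with probability proportional to their
        degree -- the 'rich get richer' mechanism."""
        random.seed(seed)
        targets = set(range(m))  # nodes the next newcomer attaches to
        repeated = []            # each node appears once per incident edge
        edges = []
        for new in range(m, n):
            edges.extend((new, t) for t in targets)
            repeated.extend(targets)
            repeated.extend([new] * m)
            # choose m distinct targets, proportionally to degree
            targets = set()
            while len(targets) < m:
                targets.add(random.choice(repeated))
        return edges

    edges = preferential_attachment(10000)
    degree = Counter(v for edge in edges for v in edge)
    print(degree.most_common(5))                       # a few heavily linked hubs
    print(sum(1 for d in degree.values() if d <= 3))   # ...and a long tail

The resulting degree distribution is heavy-tailed: a handful of hubs attract a large share of the links while most nodes remain sparsely connected, yet no single node is a centre. This is exactly the de-centralised, hub-based structure, as opposed to a centralised graph or a purely random one, on which the argument relies.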
While doing so, I shall allude to (1) the unequal distribution of links on the internet (the "rich get richer" phenomenon), (2) the impossibility of complex networks being centralised ("the winner does not take all"), and (3) some differences between inbound and outbound links regarding the semantic significance of web pages. Based on these, I shall argue for a de-centralised network model of the social mind, following an analogy between the structure of the content on the WWW and a model of the mind graph-theoretically equivalent to Quine's gradual approach between the central and the peripheral. However, there is a slight modification in my own version. From the network analogy it follows that the building of knowledge is not hierarchical, though it is also not an evenly distributed random model of connections among items. The least connected items are not connected to gradually more connected items on the way to highly connected items; on the contrary, they are mostly directly connected to "central" hubs. Therefore, a spatial metaphor of 'central vs. peripheral' is misleading. All the same, it can also be argued that even though the (physical) structure of the internet and the (semantic) structure of the WWW are analogous (and hence so are the structure of mental contents and that of social relations), the connection between the two is contingent. Since from the analogy it follows that a multi-centred view of the social mind is incompatible with the actual structure of semantic content on the web, on the supposition of the analogy, no item of mental content can be located in individuals. Hence, no interface can be identified. If so, the 'social mind – networked computer' analogy may serve as a useful weapon for social externalists.

AGENT BASED MODELING WITH APPLICATIONS TO SOCIAL COMPUTING

Gordana Dodig Crnkovic
School of Innovation, Design and Engineering, Mälardalen University, Sweden
gordana.dodig-crnkovic@mdh.se

1. Extended Abstract

Even though computers were invented primarily to automatize calculations, Licklider and Taylor (1968) already emphasized the importance of the computer as a communication device, with consequent shared knowledge and community-building. There are two different approaches to social computing (Wang et al. 2007): one with a strong emphasis on the technological, computing side, and the other centered on the human, social aspect. The present analysis will focus on the first kind of social computing, a computational approach to the modeling of social interactions, including the development of their supporting information and communication technologies. The main tools are simulation techniques used to facilitate the study of society and to support decision-making policies, helping to analyze how changing policies affect social, political, and cultural behavior (Epstein, 2007).

Social computing is radically changing the character of human relationships worldwide (Riedl, 2011). Instead of a maximum of about 150 connections prior to ICT (Dunbar, 1998), present social computing easily leads to networks of several hundred contacts. It remains to be understood what type of society will emerge from such massive "long-range" distributed interactions instead of the traditionally fewer and deeper short-range ones. As the information overload on individuals steadily increases in this process, social computing technologies are moving beyond social information processing toward social intelligence (Zhang et al. 2011; Lim et al. 2008; Wang et al.
2007), which brings an additional level of complexity. Social computing with the focus on the social is a phenomenon which enables extended social cognition, while social computing with the focus on computing is about computational modeling and a new paradigm of computing.

I will focus on agent-based social simulation (ABSS) as a generative computational approach to social simulation defined by the interactions of autonomous agents whose actions determine the evolution of the system, as applied in artificial life, artificial societies, computational sociology, dynamic network analysis, models of markets, and swarming, including swarm robotics (Antonelli and Ferraris 2011; Chai et al. 2010). As Gilbert (2005) rightly points out, the novelty of agent-based models (ABMs) is that they "offer the possibility of creating 'artificial' societies in which individuals and collective actors such as organizations could be directly represented and the effect of their interactions observed. This provided for the first time the possibility of using experimental methods with social phenomena, or at least with their computer representations; of directly studying the emergence of social institutions from individual interaction."

ABMs are very useful computational instruments, but they should not be taken as "reality", even though simulations with their realistic graphical representations suggest that they are "real". The process of modeling and simulation is complex, and many simplifications and assumptions must be made, which must always be justified for each application (Gilbert and Troitzsch 2005; Grimm and Railsback 2005; Axelrod 1997). ABMs in general are used to model complex, dynamical adaptive systems (Breiger et al. 2003). The interesting aspect of ABMs is the micro-macro link (agent-society). Multi-Agent Systems (MAS) models may be used for any number of (in general heterogeneous) entities spatially separated by the environment, which can be modeled explicitly. Interactions are in general asynchronous, which adds to the realism of the simulation (Miller and Page 2007; Schuler 1994).

Social computing represents a new computing paradigm, one sort of natural computing, often inspired by biological systems such as swarm intelligence, evolutionary computation or artificial immune systems. In my analysis I will present different paradigms of computation, including social computing and the modeling of cognitive agents in the info-computational framework (Dodig-Crnkovic 2011; Dodig-Crnkovic and Müller 2009).

References
Antonelli, C. & Ferraris, G. (2011). Innovation as an Emerging System Property: An Agent Based Simulation Model. Journal of Artificial Societies and Social Simulation (JASSS), 14(2), 1. http://jasss.soc.surrey.ac.uk/14/2/1.html
Axelrod, R. (1997). The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration. Princeton: Princeton University Press.
Breiger, R., Carley, K. & Pattison, P. (2003). Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers. National Academies Press.
Chai, S-K., Salerno, J. & Mabry, P. L. (Eds.) (2010). Advances in Social Computing: Third International Conference on Social Computing, Behavioral Modeling, and Prediction, SBP 2010, Bethesda, MD, USA. Berlin: Springer-Verlag.
Dodig-Crnkovic, G. (2011). Significance of Models of Computation from Turing Model to Natural Computation. Minds and Machines, DOI 10.1007/s11023-011-9235-1. Special issue on Philosophy of Computer Science, R. Turner & A. Eden (Eds.), pp. 1-22.
Dodig-Crnkovic, G. & Müller, V. (2009). A Dialogue Concerning Two World Systems: Info-Computational vs. Mechanistic. In: G. Dodig-Crnkovic & M. Burgin (Eds.), Information and Computation. World Scientific Publishing Co. Series in Information Studies, 2011. http://arxiv.org/abs/0910.5001
Dunbar, R. (1998). Grooming, Gossip, and the Evolution of Language. Harvard University Press.
Epstein, J. M. (2007). Generative Social Science: Studies in Agent-Based Computational Modeling. Princeton University Press.
Gilbert, N. (2005). Agent-based social simulation: dealing with complexity. http://www.complexityscience.org/NoE/ABSS-dealing%20with%20complexity-1–1.pdf
Gilbert, N. & Troitzsch, K. (2005). Simulation for the Social Scientist. Open University Press.
Grimm, V. & Railsback, S. F. (2005). Individual-based Modeling and Ecology. Princeton University Press.
Licklider, J. C. R. & Taylor, R. W. (1968). The computer as a communication device. Science and Technology (September), 20-41.
Lim, H. C., Stocker, R. & Larkin, H. (2008). Ethical Trust and Social Moral Norms Simulation: A Bio-inspired Agent-Based Modelling Approach. In: 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, December 2008, pp. 245-251.
Miller, J. H. & Page, S. E. (2007). Complex Adaptive Systems: An Introduction to Computational Models of Social Life. Princeton, NJ: Princeton University Press.
Riedl, J. (2011). The Promise and Peril of Social Computing. Computer, 44(1), 93-95. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5688159&isnumber=5688134
Schuler, D. (1994). Social Computing. Communications of the ACM, 37(1), 28-29.
Wang, F-Y., Carley, K. M., Zeng, D. & Mao, W. (2007). Social Computing: From Social Informatics to Social Intelligence. IEEE Intelligent Systems, 22(2), 79-83. DOI=10.1109/MIS.2007.41 http://dx.doi.org/10.1109/MIS.2007.41
Zhang, D., Guo, B. & Yu, Z. (2011). Social and Community Intelligence. Computer, Vol. 99, No. PrePrints. doi:10.1109/MC.2011.65

OBJECTS OF IDENTITY, IDENTITY OF OBJECTS
For a Materialist Account of Online Behavior

HAMID R. EKBIA
hekbia@indiana.edu
School of Library and Information Science
Indiana University, Bloomington, IN 47401, U.S.A.

AND

GUO ZHANG
guozhang@indiana.edu
School of Library and Information Science
Indiana University, Bloomington, IN 47401, U.S.A.

Abstract. Objects constitute significant elements of individual identity. Who we are has a lot to do with what we have and with what value we put on what we have. This point is easier to appreciate in the "off-line" physical world, where objects with various symbolic or non-symbolic values populate our environment. How about the online world, which is seemingly devoid of objects — at least in a purely physicalist understanding of objecthood? What role, if any, do objects play in shaping online identities? We seek to address this question by following two lines of inquiry: post-structuralist accounts of quasi-objects, and recent work in economic sociology on justification and mutual agreement. These inquiries lead to two key propositions: (i) digital artifacts are quasi-objects, which mediate collective practices that seem to exert a strong force of desire in the specific circumstances of our times; and (ii) people operate within various regimes in which they enact information and objects through collective practices of situated social orders.
Here we integrate and extend these two lines of inquiry in order to explore the question of online identity. Our key argument is that people's identities are mediated through digital artifacts (personal websites, personal profiles, blogs, etc.) in a process in which the identities of the subject and the object are collectively and mutually enacted by the network of people who take interest in them.

1. Introduction

Objects constitute significant elements of individual identity. Who we are has a lot to do with what we have and with what value we put on what we have. This point is easier to appreciate in the "off-line" physical world, where objects with various symbolic or non-symbolic values populate our environment. How about the online world, which is seemingly devoid of objects — at least in a purely physicalist understanding of objecthood? What role, if any, do objects play in shaping online identities? We take this question seriously, and seek a materialist answer to it. We seek an account that can do justice to things that matter, that offer potentials and resistances, physically but also socially, historically, psychologically, and so on. Although this is admittedly a non-standard notion of materialism — modern philosophers often use physicalism and materialism interchangeably (Stoljar, 2009) — it is useful for our purposes in at least two ways. First, it allows us to consider the inherently material, not necessarily physical, aspects of the online world. Second, it opens a line of inquiry that situates digital artifacts in how they relate to existing social structures and in how they embody and anticipate the future through the socio-material practices that they allow or disallow. The first point is important because dominant discourses in information science, philosophy, and elsewhere tend to discount the underlying materiality (even physicality) of the "virtual" (e.g., Lévy, 1998). The second point matters because it allows us to see current online experiences from the historical perspective of modernity (Day and Ekbia, 2010).

2. Two Lines of Inquiry

Our study of the relationship between objects and identity in the online world follows two lines of inquiry. One is inspired by post-structuralist accounts of quasi-objects, the other by recent work in economic sociology on justification and mutual agreement. Originating in the psychoanalytic notion of "part-objects", Winnicott's notion of the "transitional object", and the Lacanian notion of objet petit a (object little-a), the notion of the "quasi-object" later appears in discussions of intersubjectivity by Serres, of scientific theories and entities by Latour, and of technology and virtuality by Lévy. In Lacan's (1991) psychoanalysis, objet petit a stands for an unattainable libidinal object of desire (e.g., the breast), which is imagined to be separable from the rest of the body, in the same fashion that an ornament can be detached from the body. As such, it both drives and limits desire, and can be sought in the "other", traversing the order of the real and the imaginary, the mind and the body, the self and the other. In the age of the Internet, this raises the question of whether our common fascination and obsession with online depictions of our identity — digital variants of Lacan's "mirror image" — may be a reassertion of specific (infantile?) desires. Answering this question in earnest requires empirical research on how identities are fluidly (de-, re-)constructed on the Web (Aboujaoude, 2011).
However, the beginnings of an answer can be found in the writings of Michel Serres (1982), who seeks to explain identity and intersubjectivity from a materialist perspective. Famously characterizing the furet in a children's game (a French game resembling hunt-the-slipper) as a quasi-object, Serres argues that the identity of the child who carries the furet changes as he becomes distinct from others by becoming "it" (Serres, 1982). In so doing, the furet also connects the players and their positions, fixing and stabilizing the collective. The passage of the furet, in other words, allows the co-constitution of both (quasi-)objects and (quasi-)subjects (Day, 2010).

Economic sociology, on the other hand, shows that subjects and objects are mutually qualified in different orders of worth. In their attempt to integrate economic and social values in a single analytic framework, for instance, Boltanski and Thévenot (2006) have arrived at a set of principles that people resort to in order to justify their actions. These principles, which operate within different regimes of worth, are appealed to by individuals depending on the particular "world" (or polity) which they inhabit in a given situation. "Persons and things offer one another mutual support. . . With the help of objects, which we shall define by their belonging to a specific world, people can succeed in establishing states of worth." (Boltanski and Thévenot, 2006: 131).

In previous work, these lines of thought have led us to two key propositions: (i) digital artifacts are quasi-objects, which mediate collective practices that seem to exert a strong force of desire in the specific circumstances of our times (Ekbia, 2009a); and (ii) people operate within various regimes of information in which they enact information through collective practices of situated social orders (Ekbia, 2009b; Ekbia and Evans, 2009; Garfinkel, 2008). Here we integrate and extend these two lines of inquiry in order to explore the question of online identity. Our key argument is that people's identities are mediated through digital artifacts (personal websites, personal profiles, blogs, etc.) in a process in which the identities of the subject and the object are collectively and mutually enacted by the network of people who take interest in them.

3. Online Behavior: Game and Identity

Take your personal profile on a social networking site, for instance. The profile represents you, but not in the sense that your photograph, for example, would represent you. By creating a profile, in a way you create a representation of yourself, your history, tastes, hobbies, friends, friends of friends, and so on. But on closer scrutiny this is not a representation, traditionally understood as a stand-in that has a resemblance relationship to you. Nor is the profile simply an active representation non-causally coupled to you in the way that most computer representations are believed to be coupled to their subject matter. The profile is an artifact that both mediates and traces your network of friends, hobbies, and history. As a complex event, not a representation, it constitutes a complex site for the actualization of such a network. Lastly, the profile participates in the embedding environment, taking you to unforeseen places, while being itself shoved around by others. In this manner, it acts like characters in a good novel who take on, we are told, a life of their own, dragging the author along with them (Bakhtin, 1984).
In a serious way, the fate of the profile is in the hands of others who take interest in it and who build bridges between you and their profiles. In short, your identity is enacted in a collective process organized around your profile, in the same way that the identity of the child is shaped in carrying the furet. You become "it", with the caveat that the nature of the "it" in an electronic medium enables a strongly malleable, transient, and unstable identity, providing enormous room for playfulness, fantasy, illusion, deception, self-deception, and so forth.

We want to explore these issues, especially in regard to computer games and how an individual's "virtual" identity in a game may, or may not, interact with their identity in the non-game (off-line) world. With the growing potential for personalizing game characters (avatars) to represent individual features, this question has become increasingly meaningful and significant. For instance, in games for health, we can connect a Personal Health Record to a gaming platform so that, through proper data linkages to environmental signals, one's real-life behavior would affect the game — think of an avatar that becomes large, drunk, or ill depending on how you eat, drink, or behave. How would the change of the avatar influence your real-life identity? Is the avatar the equivalent of the furet? Or does it exert less/more influence?

References
Aboujaoude, E. (2011). The Dangerous Powers of E-Personality. New York: W.W. Norton & Company.
Bakhtin, M. M. (1984). The Problems of Dostoevsky's Poetics. (C. Emerson, Trans.). University of Minnesota Press.
Boltanski, L. & Thévenot, L. (2006 [1991]). On Justification: Economies of Worth. (C. Porter, Trans.). Princeton, NJ: Princeton University Press.
Day, R. E. (2010). Death of the User: Reconceptualizing subjects, objects, and their relations. Journal of the American Society for Information Science and Technology, 62(1), 78-88.
Day, R. & Ekbia, H. (2010). Digital experiences. In: J. Kallinikos, G. F. Lanzara & B. Nardi (Eds.), The digital habitat — Rethinking experience and social practice. First Monday, 15(6-7).
Ekbia, H. (2009a). Digital artifacts as quasi-objects: Qualification, mediation, and materiality. Journal of the American Society for Information Science and Technology, 60(12), 2554-2566.
Ekbia, H. (2009b). Regimes of information: A polity model. Paper presented at the 7th European Conference on Computing and Philosophy, Barcelona, Spain, July 1-4.
Ekbia, H. & Evans, T. (2009). Regimes of information: Land use, management, and policy. The Information Society, 25, 328-343.
Garfinkel, H. (2008). Toward a Sociological Theory of Information. Boulder, CO: Paradigm.
Lacan, J. (1991). The Seminar of Jacques Lacan. Book II: The Ego in Freud's Theory and in the Technique of Psychoanalysis, 1954-1955. New York: W.W. Norton & Company.
Lévy, P. (1998). Becoming Virtual: Reality in the Digital Age. (R. Bononno, Trans.). New York: Plenum.
Serres, M. (1982). The Parasite. (L. R. Schehr, Trans.). Baltimore: Johns Hopkins University Press.
Stoljar, D. (2009). Physicalism. Stanford Encyclopedia of Philosophy. Retrieved March 24, 2010 from: http://plato.stanford.edu/entries/physicalism/

THE CONSTRUCTION OF REALITY AND OF SOCIAL BEING IN THE INFORMATION AGE

LÁSZLÓ ROPOLYI
Department of History and Philosophy of Science
Eötvös University, 1518 Budapest, Pf. 32., Hungary
ropolyi@caesar.elte.hu

Abstract. In the information age, representational (information, cognitive, cultural, communication) technologies, instead of material ones, become the dominant factor in the construction of social being. To conceptualize this shift, I suggest that Aristotle's dualistic ontological system (which distinguishes between actual and potential being) be complemented with a third form of being: virtuality. In the virtual form of being, actuality and potentiality are inseparably intertwined. Everything that is produced by representational technologies is a virtual being. Therefore, in the information age, social being, too, has a virtual character, as it is produced by representational technologies. Information itself is a product of representational technology, while it is also interpreted being. This process of interpretation takes place in human minds, and the process can be described as a "hermeneutical industry". The information society is inhabited by virtual beings, so it has a virtual and open character.

1. Technology and Representation

Technology is a specific form or aspect of human agency, the realization of human control over a technological situation.[18] Every element of the human world is created by technologies. Both human nature and social being are the products of our technological activity, and their characteristics are determined by the specificities of the technology we use to produce them.[19] All historical forms of human nature and of social being are constructed (and continuously re-constructed) or produced (and continuously re-produced) by historical versions of technology. Technology has an ontological Janus face: it produces both "things" and "representations". For thousands of years, people used material (agricultural or industrial) technologies where the material product was in the foreground, although the symbolic content was also present. The last few decades have witnessed a significant technological change, in that "representations" have become dominant over the "thingly" products in the most important technologies of our age. On the one hand, new (cognitive, communication, cultural, and information) technologies have emerged; on the other hand, the representational or symbolic function of traditional technologies has become more significant. As a consequence, the most important characteristics of social being are essentially transformed. The terms "post-industrial / knowledge / risk / information / network society" all refer to a type of society where representational technologies are the dominant factor in the (re)construction or the (re)production of human nature and of social being.

[18] This definition of technology is on a higher level of abstraction than usual conceptualizations (cf. Feenberg, 1999).
[19] Social (or human) being, obviously, has an active role in the formation of any technology: given technological and social relations coexist and interrelate in a complex way, so that they mutually shape each other. My view on construction is closer to that of Marxism (Lukács 1978) than to those of phenomenology (Berger and Luckmann, 1966) and of radical constructivism (Glasersfeld, 2011).

2. Virtuality and Openness in Information Technologies

The shift from material technologies to representational (information, cognitive, cultural, communication) technologies has important consequences for our notions of reality. The concept of virtuality has a central role in redefining reality.
The term "virtuality" is relatively new, but a brief overview of the history of philosophy reveals that the fundamental components of virtuality have been extensively discussed (Ropolyi, 2001). The central concepts in this respect are presence, worldliness, and plurality. All three acquire their meaning from a certain relation between actuality and potentiality. I suggest that the Aristotelian dualistic ontological system, which distinguishes between actual and potential being, be complemented with a third form of being: virtuality. In the virtual form of being, actuality and potentiality are inseparably intertwined. Virtuality is potentiality considered together with its actualization. Openness is actuality considered together with its possibilities. As compared to reality, virtuality is reality with a measure: a reality which has no absolute character, but which has a relative nature. All beings produced by representational technologies are necessarily virtual.

To illustrate how technologies produce virtual beings, let us consider information technologies. The characterization of information technology should be based on an understanding of the concept of information. Obviously, information is a product of a kind of representational technology, and thus it is virtual. In a hermeneutic approach, information is "interpreted being". On this account, information technology is a "hermeneutical industry", where the production is performed by interpretation in the minds of people. All the products of this "industry" are virtual beings. Consequently, social being in the information age is necessarily a virtual being. The information society is a society in which the typical beings are virtual ones, and so the whole society has a virtual and open character.

From a specific point of view, the Internet, too, is a kind of information technology. It is an intentionally created and maintained artificial, virtual sphere which is based on networked computers and individual human interpretation praxes. The Internet is the medium (or sphere) of a new, virtual mode of human existence, basically independent from, but built on, and coexisting with the former (natural and societal) spheres of existence, and created by late-modern humans.

Acknowledgements
This research was supported by the Hungarian Scientific Research Fund (OTKA) under project numbers K79194 and K84145.

References
Berger, P. & Luckmann, T. (1966). The Social Construction of Reality: A Treatise in the Sociology of Knowledge. New York: Doubleday.
Feenberg, A. (1999). Questioning Technology. London: Routledge.
Glasersfeld, E. von (2011). http://www.vonglasersfeld.com/ (March 2011).
Lukács, G. (1978). The Ontology of Social Being. London: The Merlin Press.
Ropolyi, L. (2001). Virtuality and plurality. In: A. Riegler, M. F. Peschl, K. Edlinger, G. Fleck & W. Feigl (Eds.), Virtual Reality: Cognitive Foundations, Technological Issues & Philosophical Implications (pp. 167-187). Frankfurt am Main: Peter Lang.

TRUST, KNOWLEDGE AND SOCIAL COMPUTING
Relating Philosophy of Computing and Epistemology

JUDITH SIMON
Institut Jean Nicod – Ecole Normale Supérieure
29, rue d'Ulm, F-75005 Paris, France

Abstract. The main goal of my talk will be to link the discourse on trust in epistemology with the philosophical discourses on trust and ICT. I will argue that linking these two lines of research is needed to apprehend the notion of epistemic trust.
Epistemic practices in science as well as in everyday life are characterized not only by their socialness, i.e. the fact that agents collaborate and rely on others in their attempts to know; they are also deeply pervaded by information technologies. In short, I claim that a) contemporary epistemic practices take place in increasingly complex, dynamic and entangled socio-technical epistemic systems consisting of multiple human and non-human agents, b) trust is a crucial concept for understanding these practices, and c) information and communication technologies (ICT) play an important role in mediating and shaping trust relationships between different agents.

1. Trusting to Know

In 1991, Hardwig asserted that "[f]or most epistemologists, it is not only that trust plays no role in knowing: trusting and knowing is deeply antithetical. We can not know by trusting in the opinions of others: we may have to trust those opinions when we do not know" (Hardwig 1991: 693). This argument rests on the assumption that in order to know, we have to be able to provide evidence: we have to justify our knowledge claims with our own cognitive resources, and cannot know by simply trusting the testimony of others. Yet a closer look at epistemic practices in science as well as in everyday life shows that our knowledge depends deeply on trust in other people. Without trusting what others have told us, we would neither know some of the most basic facts about ourselves, such as the date and place of our birth, nor could we have achieved the most advanced scientific knowledge. This is the central dilemma of testimony and epistemic trust in philosophy: while on the one hand it seems that almost everything we know depends on our trust in the testimony of others, the status of testimonial knowledge and the role of epistemic trust remain highly controversial. Yet things are even more complicated. Within contemporary epistemic practices, trust is not only placed in other humans, but also in technologies, processes, institutions and content. Indeed, information and communication technologies (ICT) play a special role for epistemic trust, because ICT is not only an entity that can be trusted itself; ICT also increasingly mediates and shapes trust relations between all other entities as well. Hence, to understand epistemic trust, the role of ICT cannot be ignored, and epistemology has to take insights from other fields of research, most notably the philosophy of computing, into account.

2. Trust and ICT

The special role of ICT for trust has been addressed under different labels, such as online trust, digital trust or e-trust. While all terms refer to practices of trust that take place in a digital environment, the different labels are related to different research foci. Three of them should be distinguished:
1. ICT as an entity of trust itself (i.e. how human agents place trust in ICT as a technology)
2. ICT as a mediator of trust relationships between human agents as well as between human agents and other entities (such as content)
3. Trust in multi-agent systems, i.e. trust relations amongst artificial agents as well as between human and artificial agents

First, ICT can be an entity that is trusted itself, i.e. trust in ICT can be considered as trust in a specific type of technology, hence as a special case of trust in technologies.
Here it has been analyzed whether one can rightfully talk about trust in technology in the first place (for instance Nissenbaum 2001), and whether and to what extent we do or should place trust in technologies (Cheshire, Antin et al. 2010). Second, ICT mediates trust relations amongst and between humans and non-human entities to a profound extent. Even in the most basic form, if communication between two humans who know each other in person takes place via email, chat, social networking sites or even telephone, ICT mediates between truster and trustee (cf. Ess 2010). Epistemic trust placed in such technologies cannot be fully understood by referring to trust in technology or trust in persons only. Take the example of the online encyclopedia Wikipedia. If one trusts content from Wikipedia, this practice of trust is neither trust in a technology proper (namely the wiki software), nor is it trust in individual writers (who are often unknown), nor can this trust be fully explained by institutional trust in the Wikimedia Foundation. I have argued elsewhere that trusting Wikipedia should rather be conceived as trust in a certain socio-technical epistemic system characterized by technological infrastructure, epistemic agents (i.e. the users of Wikipedia), and certain processes employed in creating epistemic content (Simon 2010b). While Wikipedia (de Laat 2010; Tollefsen 2009; Magnus 2009) and blogs (Goldman 2008) have attracted some interest within epistemology by now, other types of social software, such as recommender systems or social tagging systems, have not yet received serious attention. Yet, in such types of social software, which function primarily via aggregation, problems of trust are potentially even harder to tackle, and the classical means provided by epistemological analyses of trust in testimony appear even less suited for understanding epistemic trust within such applications.

Finally, there is another type of e-trust which is starting to receive attention within philosophy: trust in multi-agent systems. Two instances of trust are crucial with respect to trust in multi-agent systems. First, there are the trust relations amongst artificial agents within multi-agent systems (e.g. Taddeo 2010b). Second, there are not only trust relations amongst artificial agents, but also between human and artificial agents, which are intrinsically more complex, as Grodzinsky, Miller et al. (2010) have noted. In my talk I will specify in more detail how these insights from the philosophy of computing could be made useful for an epistemology of trust.

References
Cheshire, C., Antin, J. et al. (2010). General and Familiar Trust in Websites. Knowledge, Technology & Policy, 23(3), 311-331.
de Laat, P. (2010). How can contributors to open-source communities be trusted? On the assumption, inference, and substitution of trust. Ethics and Information Technology, 12(4), 327-341.
Ess, C. (2010). Trust and New Communication Technologies: Vicious Circles, Virtuous Circles, Possible Futures. Knowledge, Technology & Policy, 23(3), 287-305.
Goldman, A. (2008). The Social Epistemology of Blogging. In: J. v. d. Hoven & J. Weckert (Eds.), Information Technology and Moral Philosophy. New York: Cambridge University Press, 11-122.
Grodzinsky, F., Miller, K. et al. (2010). Developing artificial agents worthy of trust: "Would you buy a used car from this artificial agent?". Ethics and Information Technology, 1-11.
Magnus, P. D. (2009). On Trusting Wikipedia.
Episteme, 6(1), 74-90.
Nissenbaum, H. (2001). Securing Trust Online: Wisdom or Oxymoron. Boston University Law Review, 81(3), 635-664.
Simon, J. (2010b). The entanglement of trust and knowledge on the Web. Ethics and Information Technology, 12(4), 343-355.
Taddeo, M. (2010b). Modelling Trust in Artificial Agents, a First Step toward the Analysis of e-Trust. Minds and Machines, 20(2), 243-257.
Tollefsen, D. P. (2009). Wikipedia and the Epistemology of Testimony. Episteme, 6(1), 8-24.

OPERATIONAL IMAGES
Agent-Based Computer Simulation and the Epistemic Impact of Dynamic Visualization

SEBASTIAN VEHLKEN
Leuphana University Lüneburg
ICAM Institute for Culture and Aesthetics of Digital Media
Scharnhorststrasse 1, 21335 Lüneburg

Abstract. Computer simulations (CS) designate the current scientific condition. Inevitably, one has to distinguish crash tests from climate simulations, and one has to be aware of the differing problem dimensions posed by, e.g., the simulation of a quantum physical system by a classical physical system in comparison to those advanced by an agent-based simulation of a mass panic in a stadium. And without question, CS achieve diverse tasks and have quite dissimilar reputations in different scientific disciplines. But undeniably, CS brought with them a novel kind of knowledge, a modified set of research problems, and a transformed historical-philosophical comprehension of science. Thus, the knowledge emerging in CS derives from the computer-based imitation of dynamic system behavior, which penetrates everyday life in the form of ecological, medical, economic, or technical applications and decisions. Novel scientific problems and research fields historically form where they would not have been tractable without the digital media of CS. And not least, the traditional concepts of theory and experiment are essentially modified, transforming "mode-1" science (Gibbons, 1994) more and more into a "behavioral science of complex systems" (Mahr, 2003). This transformation is based on an explicitly media-historical rupture marked by the digital mediality of CS. The digital media inherent in CS develop typical and intrinsic modes of operation and visualization in their application to analytically and experimentally intractable problem fields.

Sebastian Vehlken's presentation embarks on examining the "social computing" aspects of a particular kind of CS in a two-fold way. First, it will describe the specific (self-)organizational aspects of agent-based modeling and simulation (ABM), zeroing in on several pivotal examples of large-scale social simulations. These range from crowd control (e.g. Massive Insight) and logistics (e.g. TransSims) to epidemics (e.g. PLAN-C by the NYU Bioinformatics Group) and large-scale models of the complex interactions of agents in whole societies (e.g. the Global Scale Agent Model by the Brookings Institution). It will discuss the notion, the epistemic function and the technological means of the bottom-up modeling paradigm of ABM, which provides essential advantages over CS based on discrete events. Whilst the latter are required to define assumptions about the constituents of a system and their interdependencies from the top down, ABM are decentralized and function without a definition of the global system behavior. The system behavior emerges from the definition of simple and locally (on the level of the individual agents) implemented settings.
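A deliberately minimal sketch can make this bottom-up logic concrete. In the following Python toy model (our illustration, assuming a simple contagion scenario; it is not one of the cited simulation systems), only the local rule of each agent is programmed, and the familiar epidemic curve is obtained as aggregate output rather than being defined anywhere:

    import random

    def step(agents, infect_p=0.05, recover_p=0.02):
        # Local rule only: each agent meets one randomly chosen other
        # agent; nothing in the code specifies the global dynamics.
        for i, state in enumerate(agents):
            met = random.choice(agents)
            if state == "S" and met == "I" and random.random() < infect_p:
                agents[i] = "I"
            elif state == "I" and random.random() < recover_p:
                agents[i] = "R"

    agents = ["I"] * 10 + ["S"] * 990   # 1000 agents, 10 initially infected
    curve = []
    for t in range(600):
        step(agents)
        curve.append(agents.count("I"))  # the epidemic wave shows up here,
                                         # at the aggregate level only

Plotting curve over time would show the characteristic rise and fall of an outbreak, although no line of the program mentions waves or epidemics: the global behavior is observed, not stipulated, which is exactly the epistemic point of the bottom-up paradigm.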
As Borshchev and Filippov (2004) put it, ABM thus better "provides for construction of models in the absence of the knowledge about the global interdependencies: you may know nothing or very little about how things affect each other at the aggregate level, or what is the global sequence of operations, etc., but if you have some perception of how the individual participants of the process behave, you can construct the AB model and then obtain the global behavior." The bottom-up performance of ABM induces a synthetic problem approach, converging to adequate and context-dependent solutions in a process of systematic comparison and evaluation of different simulation runs and scenarios. ABM thereby leapfrogs fixed object or context allocations in an exemplarily interdisciplinary manner. The media history of research on social collectives reveals a reciprocal 'socialization' and 'biologization' of computer science, and a likewise computerization of the social sciences, when it comes to the development of adequate ABM models for describing collective behaviors in space and time. The development of animation effects in CGI is distinctly interconnected with biological and sociological computer models of collective dynamics, and vice versa.

Second, it will consider the importance of digital visualizations for scientific research with ABM. The adherent types of Computer Graphical Imagery (CGI) exemplarily raise questions not only about the status of animated, 3-dimensional and dynamic digital images as interfaces for the refinement of societal "computer experiments" and the "intuitive" handling of ABM by researchers. One must also ask about their status as 'visual evidence' and 'representation' for phenomena and processes in social dynamics which would remain intractable without these digital 'time-based images'. Not least, the technological conditions resulting from the multiple filtering, smoothing, or thresholding procedures involved in providing 'visual validation' have to be accounted for. These aspects have to be further investigated on the basis of a media-technologically informed theory of operational images, linking the modes of visualization of ABM with their programmed data base in the ABM software. And since the development of certain animation effects in the CGI industry is historically distinctly interconnected with biological and sociological computer models of collective dynamics, and vice versa, the hard-, wet- and software foundations of ABM can be short-circuited with applicable modes of CGI generation: both operate in a highly distributed manner of 'socially' interacting and 'locally' defined agents.

Hence, the presentation investigates the specific epistemic and technological rupture marked by CS on the basis of ABM in social simulations. The respective applications facilitate a mode of visualization by (synthetic and therefore operational) images which make otherwise inconceivable complex social dynamics representable by generating visual presentations: only the observation of modeled processes in the runtime of ABM enables the evaluation and manipulation of critical factors and variables and the ensuing re-run of the simulation. This results in a type of dynamical "data images" (see Adelmann et al., 2009; Schubbach, 2007) yet to be further investigated. It provokes a type of operational images with a highly socio-political dimension – images
which depend on and which foster social decision-making in (time-)critical environments.

References

Adelmann, R., Frercks, J., Heßler, M. & Henning, J. (Eds.) (2009). Datenbilder. Zur digitalen Bildpraxis in den Naturwissenschaften. Bielefeld.
Borshchev, A. & Filippov, A. (2004). From System Dynamics and Discrete Event to Practical Agent Based Modeling: Reasons, Techniques, Tools. In: The 22nd International Conference of the System Dynamics Society. Oxford.
Mahr, B. (2003). Modellieren. Beobachtungen und Gedanken zur Geschichte des Modellbegriffs. In: H. Bredekamp and S. Krämer (Eds.), Bild Schrift Zahl (pp. 59-86). Munich: Fink.
Schubbach, A. (2007). ...A Display (Not a Representation)... Navigationen. Zeitschrift für Medien- und Kulturwissenschaft, Display II – digital, 7(2), 13-27.

Social Computation as a Discovery Model for the Social Sciences

AZIZ F. ZAMBAK
Department of Philosophy
Yeditepe University, Istanbul

Abstract. Social simulation is a growing field that proposes a computational approach to the social sciences. Simulation provides a powerful alternative for a novel understanding of the epistemology, ontology, and taxonomy of social phenomena, structures and processes. Social simulation can be an intellectual resource and experimental field for developing a novel notion of "social phenomenon" within which various forms of human action can be represented. Social simulation may be used to examine not just the current situation in a society, but also possible social situations. Classical models that only use natural language are inadequate for the comprehension of dynamic and complex systems in the social sciences. Pure mathematical and/or statistical models are intractable. Simulation may help to overcome the limitations of classical models in the social sciences. In this paper, we propose five general principles that should be taken into consideration in social simulation: 1- Agent-Based Models: We describe agency as an essential criterion for social simulation. 2- Game Theory: Game theory is a study that can provide some formal epistemological data for understanding the rationalization process of individuals. From the social simulation point of view, discovery is an agentive-informational-system, and we consider this system as a set of complex principles that should be rationalized by simplification, approximation, optimization, and generalization. 3- Control Systems: In order to understand the autopoietic, dynamic and complex structure of social systems, we should develop an organismic conception of society in which control mechanisms have an essential role for social models and simulation. 4- Tools: In social simulation, a stylized-computational-language should be built in which the data on social structure are coded and represented in the computer simulation. 5- Ontology: Emergence is one of the essential concepts in the ontology of the social sciences, in which certain theories try to explain macrolevel phenomena in terms of the behavior of microlevel actors.

Social simulation is a growing field that proposes a computational approach to the social sciences.20 Simulation provides a powerful alternative for a novel understanding of the
epistemology, ontology, and taxonomy of social phenomena, structures and processes. Social simulation can be an intellectual resource and experimental field for developing a novel notion of "social phenomenon" within which various forms of human action can be represented. Social simulation may be used to examine not just the current situation in a society, but also possible social situations. Classical models that only use natural language are inadequate for the comprehension of dynamic and complex systems in the social sciences. Pure mathematical and/or statistical models are intractable. Simulation may help to overcome the limitations of classical models in the social sciences. In this paper, we propose five general principles that should be taken into consideration in social simulation.

20 Gilbert and Troitzsch (2005: 5) explain the main reason behind the growing interest in social simulation as follows: "The major reason for social scientists becoming increasingly interested in computer simulation, however, is its potential to assist discovery and formalization. Social scientists can build very simple models that focus on some small aspects of the social world and discover the consequences of their theories in the 'artificial society' that they have built. In order to do this, they need to take theories that have conventionally been expressed in textual form and formalize them into a specification which can be programmed into a computer. The process of formalization, which involves being precise about what the theory means and making sure that it is complete and coherent, is a very valuable discipline in the social sciences, comparable to that of mathematics in the physical sciences."

1- Agent-Based Models: Agency must be the central notion in social simulation, since the cognition of social reality originates from agentive actions. We claim that agency is the ontological and epistemological constituent of social reality. It is characterized by agentive activity. Agency must be the essential criterion for the success of social simulation. Social simulation must consider social phenomena as a form of action of a dynamic-representational system, developed during interaction within the environment. Equating properties of a social phenomenon with properties of its elements [individuals] is a basic mistake. Social structure cannot be the subject of a special examination of groups of individuals. Behavior and agentive actions cannot be found in specific groups of individuals, but in the whole agent-environment-interaction system. The discovery of social phenomena in social simulation means a new kind of action of a highly dynamic-representational system capable of making inferences from its structure and process in order to achieve new results of action and form novel systems directed towards the future. Therefore, in social simulation, discovery is not a mystical emergent property of social phenomena, but a form of agentive action necessarily following from the development of a dynamic-representational system.

2- Game Theory: Game theory is a study that can provide some formal epistemological data for understanding the rationalization process of individuals. From the social simulation point of view, discovery is an agentive-informational-system, and we consider this system as a set of complex principles that should be rationalized by simplification, approximation, optimization, and generalization. In social simulation, this type of rationalization should depend on idealization. Idealization transforms environmental data into ideal-agentive-rational-information. However, idealization should not be seen as abstraction.21 We consider idealized information as one of the basic capabilities of social simulation, providing the preconditions for the adaptive behavior of agency in a very
complex environment.

21 As Nowak (2000: 116) states, "idealization is not abstraction. Roughly, abstraction consists in a passage from properties AB to A, idealization consists in a passage from AB to A-B."

In the adaptiveness of agency, the information of environmental structure and organization may be grasped rationally, for the rationality lies in the agentive attitude towards environmental structure and organization, not in the essence of the environment itself. Therefore, there is no hidden essence in the environmental structure and organization that should be represented in a computational and representational manner for the rational behavior of an agent. In social simulation, our aim is to understand how properties of rationalized agency are related to the behavioral action that is performed under complex environmental/social situations. This type of understanding requires idealization, as idealization can be seen as a method of constructing informational structures in which data gained from the environment/society can serve the goal of forming special types of rationalized agentive interactions. Idealization, in social simulation, leads an agent to a successful informational approximation. Idealization is a type of theorizing that includes specification, approximation and optimization about certain sets of agentive and social systems. The presentation will include analysis of two game-theoretical models for social simulation (a toy sketch in this spirit follows the list of principles below).

3- Control Systems: Social systems should be considered as self-organizing, non-linear, dynamic, and complex phenomena. From the computational or representational point of view, dynamic and complex systems are difficult to study because most cannot be represented in simplified and hierarchical models. In order to understand the autopoietic, dynamic and complex structure of social systems, we should develop an organismic conception of society in which control mechanisms have an essential role for social models and simulations. There are several conditions for choosing the appropriate strategy for the control mechanism of an agent, such as the availability of data on the performance of the agent, the comparison of stable and dynamic parameters of the environment, and the access to explicit data about plans, goals, and the current state of affairs. When building a computer simulation of an agentive system, it is very important not to restrict an agent to following only one predetermined set of rules, but to give it the opportunity to choose and shift between different sets of rules according to its situation. This can be done by a proper control mechanism which can find a balance between stability and flexibility of information in a complex environment. In this section, we will also examine Project Cybersyn as an example of a control mechanism for social simulation.

4- Tools: In the presentation, we will briefly explain what the logic of computer programs in social simulation should be. In addition, we will claim that, in social simulation, a stylized-computational-language should be built in which the data on social structure are coded and represented in the computer simulation. The general concepts of this stylized-computational-language will be briefly introduced in the presentation. Some of these concepts are empirical protocols, nodes, links, data processing, boundaries, taxonomy, observation period, randomization of parameters, outcome validity, process validity, and internal validity.

5- Ontology: Emergence is one of the essential concepts in the ontology of the social sciences, in which certain theories try to explain macrolevel phenomena in terms of the behavior of microlevel actors. In this part, we will show how a reflexive model in social simulation can build an emergent model of the relation between the individual and the society.
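As a hedged, purely illustrative sketch of the game-theoretic principle above (the payoff matrix and strategies are the textbook iterated prisoner's dilemma, written in Python for this sketch, and are not the two models announced for the presentation), one can watch how simple rationalized strategies fare against each other:

# Toy iterated prisoner's dilemma: two simple strategies and the payoffs
# their interaction produces over repeated rounds. "C" cooperates, "D" defects.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    # Cooperate first, then copy the opponent's previous move.
    return "C" if not history else history[-1][1]

def always_defect(history):
    return "D"

def play(strategy_a, strategy_b, rounds=20):
    history, score_a, score_b = [], 0, 0
    for _ in range(rounds):
        a = strategy_a(history)
        b = strategy_b([(y, x) for x, y in history])  # B sees a mirrored history
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        history.append((a, b))
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # mutual cooperation: (60, 60)
print(play(tit_for_tat, always_defect))  # one exploitation, then mutual defection: (19, 24)

Even this minimal setting shows rationalization by simplification at work: the whole space of social interaction is idealized into two moves and four payoffs, and aggregate outcomes can then be compared across strategies and scenarios.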
References

Gilbert, N. and Troitzsch, K. G. (2005). Simulation for the Social Scientist. Buckingham: Open University Press.
Nowak, L. (2000). The Idealization Approach to Science: A New Survey. Poznań Studies in Philosophy of Science and the Humanities, 69, 109-184.

Track VIII: IT, Culture and Globalization

The Revival of National and Cultural Identity through Social Media

RYOKO ASAI
Uppsala University, Dept. of IT-HCI
Box 337, 751 05 Uppsala, Sweden
and Nihon University, College of Industrial Technology
Narashinoshi-Izumicho 1-2-1, Chiba, Japan

IORDANIS KAVATHATZOPOULOS
Uppsala University, Dept. of IT-HCI
Box 337, 751 05 Uppsala, Sweden

AND

MIKAEL LAAKSOHARJU
Uppsala University, Dept. of IT-HCI
Box 337, 751 05 Uppsala, Sweden

Abstract. Social media have played an important role as a hub for information in political change. They can contribute to the development of psychological and social preconditions for dialog and democracy.

Information communication technology (ICT) has made it possible for people to communicate beyond national borders. In particular, social media play an important role in making a place where people communicate with each other, for example Facebook, MySpace, YouTube and so on. In other words, under these circumstances, social media function as the third place (Oldenburg, 1999). People have two essential and indispensable places in their lives: one is the home and the other is the workplace. Beyond those places, people have one more place where they can have relationships with others informally in public (what Oldenburg called "informal public life"). The third place contributes not only to uniting people in communities but also to learning how to contribute in various problems and crises there. Therefore the third place would nurture relationships with others and mutual trust under conditions of unrestricted access, and it would also be open for discussion and serve as a ground for democracy (Oldenburg, 1999). In this context, social media can provide the third place to users in some cases.

Social contexts of communication are defined by geographic, organizational and situational variables, and those variables influence the contents of communication among people (Sproull & Kiesler, 1986). In order to discern social context cues, communicators observe static cues (physical setting, location, etc.) and dynamic cues (non-verbal behavior like gestures or facial expressions) when communicating with others. Communicators' behavior is determined based on social context cues, and they can adjust their behavior depending on situations through the process of interaction between them. However, in online communication it is more difficult for communicators to perceive static and/or dynamic elements compared to face-to-face communication.
This is because, in many cases, social media limit the number of characters and the amount of data that users can post, while making it possible for users to communicate regardless of physical distance, national boundaries and time differences. On the other hand, participation is seen as the key element in the recent trend toward democratization, and in reality numerous users send and receive a huge amount of information via social media to cultivate relationships with others and strengthen mutual exchange beyond borders. In general, it is recognized that social media advance participation through exchanging information with minimal social context cues.

Tunisian people shared information on what happened in the country and on when and where anti-government protests were held via social media such as Facebook and Twitter. In other words, social media seemed to support political change in Tunisia. Behind this, the number of internet users in Tunisia is 3.6 million, 34% of the total population, and there are 1.6 million Facebook users, roughly equivalent to 16% of the population (Internet World Stats, 2010). The Tunisian government had blocked particular websites; Facebook was one of the few social media free to access. Under these circumstances, for people living abroad, Facebook functioned as a primary source of information giving direct access to daily events in Tunisia. Under these restrictive access conditions, social media like Facebook provided users with opportunities to communicate with others and also to state their opinions, in order to overcome constraint and the old regime. In this context, social media serve as the third place, and users develop solidarity and reinforce identity through online communication. As is obvious from the statistical data on internet users mentioned above, it is estimated that the number of in-country users of Facebook is smaller than the number of users living abroad. Many users followed what was going on in Tunisia, showing in-country users that they all cared about political change. This phenomenon can be recognized as a kind of participation in a collective movement through social media, regardless of physical distance or time difference.

However, communication through social media has some problems. First, information exchanged via social media carries minimized social context cues under severely restricted conditions, because information must be sent dependably and rationally. Therefore information tends to be extreme, and there is a risk of group polarization. Second, in social media, information receivers gather fragmented information and, based on personal experience, make it plausible so that it is easier to understand as their own experience, or so that they can relive the experiences of its senders. Through this process, users develop a sense of solidarity and share expectations as well as the norms organizing them as one community. Therefore social norms gain influence on users in particular communities and advance self-stereotyping among them, as solidarity and social identity are enhanced. This situation is fraught with the social risk of exclusion of others.

Some people call the Tunisian political change a "Facebook revolution" or "Twitter revolution" on the internet. Are these labels really pertinent? Indeed, social media have played an important role as a "hub for information" and as the third place in political change.
However, social media also have to contribute to the development of skills for dialog in order to achieve a truly democratic society (Asai & Kavathatzopoulos, 2010; Kavathatzopoulos, 2010, 2007).

References

Asai, R. and Kavathatzopoulos, I. (2010). Diversity in the construction of organization value. Proceedings of EBEN Annual Conference 2010 "Which values for which organizations". Trento, Italy: University of Trento.
Internet World Stats (2010). Tunisia: Internet usage and marketing report. Available online: http://www.internetworldstats.com/af/tn.htm (accessed February 7, 2011).
Kavathatzopoulos, I. (2007). Information Technology as a tool for democratic skills. In A. Lionarakis (Ed.), Forms of democracy in education: Open access and distance education (pp. 155-162). Athens: Propobos.
Kavathatzopoulos, I. (2010). Information technology, democratic societies and competitive markets. Proceedings of the 3rd International Seminar on Information Law "An information law for the 21st century". Corfu, Greece: Ionian University.
Oldenburg, R. (1999). The great good place. Cambridge: Da Capo Press.
Sproull, L. S. and Kiesler, S. (1986). Reducing social context cues: Electronic mail in organizational communication. Management Science, 32(11), 1492-1512.

WIKILEAKS AND ETHICS OF WHISTLE BLOWING

PATRICK BACKHAUS
School of Innovation, Design and Engineering, Mälardalen University, Sweden
pbs10002@student.mdh.se
and Paderborn University, Germany
bpatrick@campus.uni-paderborn.de

AND

GORDANA DODIG CRNKOVIC
School of Innovation, Design and Engineering, Mälardalen University, Sweden
gordana.dodig-crnkovic@mdh.se

1. Extended Abstract

In a time in which the Internet pervades everyday life and published information is readable all over the world, it becomes very important to deal with ethical problems related to whistle blowing via the Internet. Although there are basic concepts like anonymity, privacy and freedom of speech, for every new kind of phenomenon we have to discuss its ethical aspects (Kizza, 2010; Nadler and Schulman, 2006). A current example is the platform WikiLeaks, which publishes a vast amount of secret documents. To evaluate the ethics of WikiLeaks (Hanson and Ceppos, 2006; WikiLeaks About), we will apply the following ethical approaches:

The Utilitarian Approach, focusing on the consequences that the publications of WikiLeaks have on the well-being of all parties that are affected directly or indirectly. There are two sides to consider:
• On the one hand, the uncovering of misconduct and the increased transparency of the government are of such importance that the publications benefit society as a whole. They facilitate opinion making and lead to a greater understanding of governmental work.
• On the other hand, the publications may threaten national security and so harm society. They lead to a society with decreased integrity, which may eventually result in less communication, more technical restrictions and so in less freedom.
To achieve a balance between both sides, a potential approach could be that WikiLeaks reduces the amount of published data and classifies the data in more detail. Further, they could contact the company or government concerned before publication, so that this party itself could acknowledge the misconduct.

The Virtue Ethics Approach, focusing on attitudes that develop our human potentials, such as
honesty, courage, faithfulness, trustworthiness and integrity. It is easy to see that WikiLeaks disregards these virtues in many different contexts. They are accused of putting people's lives at risk, publishing stolen data, and degrading loyalty, privacy and integrity of data. The only virtue they undoubtedly represent is transparency, which is not considered a classical ethical virtue, but may be seen as an element of democracy. So WikiLeaks must ensure that the increased transparency gained by a publication is worth more than all the other aspects, which will only be the case when severe misconduct by the concerned party is made public because no other way of corrective action was available.

The Information Ethics Approach: From the point of view of Information Ethics, we can study how information is revealed/communicated in networks of agents. Within this approach we can ask questions such as: what is the function of "information hiding" and "encapsulation", such as found in Object-Oriented Programming and in any hierarchical organization? What would be the behavior of a society in which every agent were connected with every other agent and shared any information they have?

Interesting to observe is the global character of WikiLeaks, in a world regulated on the basis of nations: it seems to act in a grey zone, since the legal situation is unclear and different governments are still searching for a crime Julian Assange can be charged with. In reality the issue of WikiLeaks (Kintzinger and Zepelin, 2010; Greenberg, 2010) implies much more than an ethical discussion about whistle blowing and leaking, integrity and freedom of speech. WikiLeaks has become a symbol of a deep change in the publicity of information in the digital age, at least with present-day technology. It has generated the greatest confrontation between the established order and the advocacy of the culture of the totally open Internet. We are at the moment part of a world where it is difficult to control and keep information secret and safe from eavesdropping and unauthorized use. Some of the relevant questions are: Has the institution of the legal secret, business secret, military or organizational secret become obsolete? If yes, why? If no, how to protect information which should be protected? Who decides, and how, which information is worth making public and which is not? According to Assange (Bieber, 2010; Fallows, 2010), personal integrity must be protected. Why not institutional integrity? If leaking is a good democratic mechanism, shall we not have leaks of WikiLeaks as well? And so on… a chain, or a loop, of leaks? In a totally transparent world, how would information overload be managed? Shall we give up all trust? Or, equally important: whom shall we trust? Perhaps problems with information protection will lead us to a society where conversations are reduced to a minimum and information is less accessible, as it has become obvious that anything can be made public. In the end, the result would be not an increase, but a decrease of freedom.

References

Bieber, C. (2010). "Die Ethik des Lecks", der Freitag, 11.08.2010. http://www.freitag.de/kultur/1032-die-ethik-des-lecks
Fallows, J. (2010). "More on Mullen, Twitter, and the Ethics of WikiLeaks", July 2010. http://www.theatlantic.com/politics/archive/2010/07/more-on-mullen-twitter-and-the-ethics-of-wikileaks/60705
Greenberg, A. (2010). An Interview With WikiLeaks' Julian Assange, Nov. 29, 2010. Forbes.
http://blogs.forbes.com/andygreenberg/2010/11/29/an-interview-with-wikileaks-julian-assange
Hanson, K. and Ceppos, J. (2006). "The Ethics of Leaks". http://www.scu.edu/ethics/publications/ethicalperspectives/leaks.html
Kintzinger, A. and Zepelin, J. (2010). "Stärkt Wikileaks die Freiheit?", Financial Times Deutschland, 02.12.2010. http://www.ftd.de/it-medien/medien-internet/:pro-und-kontra-staerkt-wikileaks-die-freiheit/50200724.html
Kizza, J. M. (2010). "Cyberspace, Cyberethics, and Social Networking", in Ethical and Social Issues in the Information Age. London: Springer London, ch. 11, pp. 221-246. http://dx.doi.org/10.1007/978-1-84996-038-0_11
Nadler, J. and Schulman, M. (2006). "Whistle Blowing in the Public Sector", November 2006. http://www.scu.edu/ethics/practicing/focusareas/Government_ethics/introduction/whistleblowing.html
WikiLeaks About. [Online]. http://wikileaks.de/about.html
All links accessed on 25 April 2011.

INTERPRETING CODES OF ETHICS IN GLOBAL SOFTWARE ENGINEERING
Extended Abstract

THIJMEN DE GOOIJER
Mälardalen University
Högskoleplan 1, Västerås, Sweden

Abstract. In global software engineering (GSE), groups of people from all over the world collaborate on the development of one system. For example, it is common for Western companies to send development work to Asia or Eastern Europe. Within these collaborations the differences between cultures, and the problems these differences create, are plentiful. Because we expect that computing professional organizations' codes of ethics are insufficiently adapted to GSE, we investigate the culture-relative interpretations of codes of ethics and the guidance they provide for global teams and collaboration. We analyze the codes of ethics of the ACM (US), CSI (India), IPSJ (Japan), HKCS (Hong Kong) and EI (Ireland). We examine whether the codes explicitly address ethical dilemmas caused by global interactions, and investigate the ethical guidance provided by the codes. For the latter we apply them to three case questions that one could raise in a GSE setting. Our work differs from that of others in that it examines the practical applicability of codes of ethics instead of their contents, and in that our goal is not to study different culture-relative interpretations of just one problem. During our analysis we did not find imperatives that directly hinder global interaction, but unfortunately we were also unable to find any that sufficiently address this topic. Only one of the studied codes asks to consider cultural differences. While answering the case questions using the imperatives from the aforementioned codes, the cultural perspectives needed to interpret the words become clear, and we learn that little attention is given to the problems associated with global collaboration. We conclude that all studied codes would benefit from more explicit guidelines for those professionals who work in GSE.

1. Introduction

Despite the globalization of the software engineering profession, most computing professional organizations are active in a limited number of countries and have their own code of ethics (CoE) or code of conduct (CoC). These codes are thus national in scope (Wheeler, 2003). According to a 1996 study, as many as 78% of IS professionals use these codes in their ethical decisions (Joyce et al., 2003). At the same time, ethical reactions and attitudes are influenced by culture and national origin (Christie et al., 2003; Nyaw & Ng, 1994).
As a result, ethical decision making is a complex endeavor in current global IS practice (Wheeler, 2003). We expect that the codes have not kept up with the globalization of the profession. To explore the possible difficulties computing professionals may encounter during their ethical decision making in global software engineering (GSE), we analyze the codes of ethics of five professional organizations and apply their codes to three case studies. We characterize our study by the following research questions.
• Do the studied codes specify culture-relative imperatives that could hinder or support global software engineering?
• Do the studied codes provide adequate ethical guidance for IT professionals in global interactions?

2. Related Work

To our knowledge no studies exist that take a similar, practical approach to identifying problems for global software engineers in computing professional CoE. Earlier work does compare codes (Oz, 1993), even in international settings (Joyce et al., 2003; Wheeler, 2003), and is discussed below. Work that combines codes of ethics with cultural influences can be found, for example, in (Arnold et al., 2007), which studies the views of western European accountants on actions prescribed by CoC based on their country of origin. It is found that these views differ significantly. Case studies exist which review the ethical stance of different cultures on specific issues, for example software piracy (Swinyard, Rinne, & Kau, 1990), but these studies either do not include CoE of computing professional organizations or do not have the goal of studying their usefulness in decision making. Specific in another way are the case studies in (Anderson et al., 1993), which focus only on the ACM code.

2.1. COMPARING CODES

Oz reviews four codes of US computing professional organizations, finding flaws, moral dilemmas, and points for improvement (Oz, 1993). We differ from (Oz, 1993) in that we do not limit our study to US codes. In their study comparing 27 international CoE, Joyce et al. found only eight themes that were common to more than 50% of the CoE (Joyce et al., 2003). Compared to the work by Joyce et al., our work aims to identify problems encountered during ethical decision making in a GSE context, while their work focuses on the content of the codes. Wheeler (2003) compares the codes of the ACM, the British Computer Society (BCS) and the Australian Computer Society (ACS) to find differences and similarities. Our work differs from (Wheeler, 2003) in that we put more emphasis on how codes are used in a global setting, and in the selected codes.

2.2. A GLOBAL CODE

Some voices suggest uniting everyone under one global code of ethics (Payne & Landry, 2006; Wheeler, 2003). Davison, on the contrary, does not believe it is possible to establish a global code, due to differences between nations and cultures (Davison, 2000). His concerns are supported by the difficulties IFIP experienced in the 90s when it attempted to establish a consensus document to serve as a base for the development of codes by member bodies (Joyce et al., 2003). We consider the views of Brey (2007) and Wong (2009) more balanced. They both acknowledge that a universal ethic would be ideal, but respect that in practice this can only be implemented as an extension of the local moral systems (Brey, 2007) and that we should avoid forcing 'our' ethics onto another culture (Wong, 2009).
3. Selection of CoE

In our study we compare five CoE: those of the Association for Computing Machinery (ACM, 1992), the Computer Society of India (CSI, 2010), the Hong Kong Computer Society (HKCS, 2010), the Information Processing Society of Japan (IPSJ, 1996), and Engineers Ireland (EI, 2009). Only five codes were selected in order to limit the study to a manageable size. The codes were chosen based on the role of their organization's home country in GSE, as well as on variation in culture. The full paper provides more rationale for the selection.

4. Static Code Analysis

In this section we answer our first research question. To do so, we informally compare the content of the five codes. Our assumption is that if an imperative is culture-relative, it will not appear in all codes. Note that this does not capture culture-relative interpretations of imperatives; it is to capture interpretation problems that we include the case studies in Section 5. Comparing the CoE, we find that only one of them asks to consider cultural differences, but we find no imperatives that directly (by formulation) impede inter-cultural collaboration. A number of imperatives are culturally bound, and we expect that all will be interpreted differently even when imperatives match.

5. Employing The Codes

In this section we apply the five selected CoE to three case studies. In this way we hope to discover whether the studied codes provide adequate ethical guidance for IT professionals in global interactions. Below we formulate our case studies as three questions that one might ask him-/herself in a GSE project.
• Developing a medical system for deployment in several countries across the globe, should I be aware of all legal requirements?
• How do I design my system so that it respects the expected level of privacy?
• May I say 'yes' to an assignment I receive from a German customer when I am uncertain that I can complete it?

6. Concluding Remarks

While studying the CoE we found only a couple of imperatives that could hinder GSE collaboration. However, none of the codes seems to be written with global collaboration in mind, and only the IPSJ CoE explicitly mentions the problem of cultural differences. Further, the case studies show that decisions on ethical dilemmas will often depend on the interpretation by professionals or on the implicit stance of the code. We feel that the CoE should provide more guidance to deal with the complexity of ethical decisions in a GSE setting. Our primary recommendation for computing professional organizations is to revise their CoE to reflect the advance of GSE. Future work could examine how this may best be achieved within each culture.

Acknowledgements

A warm thanks to my professor Gordana Dodig-Crnkovic for encouraging me to submit this work to IACAP and for her useful comments.

References

ACM. (1992). ACM Code of Ethics and Professional Conduct. Retrieved December 2010, from http://www.acm.org/about/code-of-ethics.
Anderson, R. E., Johnson, D. G., Gotterbarn, D., & Perrolle, J. (1993). Using the new ACM code of ethics in decision making. Commun. ACM, 36(2), 98-107. New York, NY, USA: ACM. doi: http://doi.acm.org/10.1145/151220.151231.
Arnold, D., Bernardi, R., Neidermeyer, P., & Schmee, J. (2007). The Effect of Country and Culture on Perceptions of Appropriate Ethical Actions Prescribed by Codes of Conduct: A Western European Perspective among Accountants. Journal of Business Ethics, 70(4), 327-340. Springer Netherlands. Retrieved from http://dx.doi.org/10.1007/s10551-006-9113-6.
Brey, P. (2007).
Is Information Ethics Culture-Relative? International Journal of Technology and Human Interaction, 3(3), 12-24.
Christie, P. M. J., Kwon, I.-W. G., Stoeberl, P. A., & Baumhart, R. (2003). A Cross-Cultural Comparison of Ethical Attitudes of Business Managers: India, Korea and the United States. Journal of Business Ethics, 46(3), 263-287. Springer Netherlands. Retrieved from http://dx.doi.org/10.1023/A:1025501426590.
CSI. (2010). Computer Society of India - Code of Ethics. Retrieved December 2010, from http://www.csi-india.org/web/csi/code-of-ethics.
Davison, R. M. (2000). Professional ethics in information systems: a personal perspective. Commun. AIS, 3(2es). Atlanta, GA, USA: Association for Information Systems. Retrieved from http://portal.acm.org/citation.cfm?id=374504.374510.
EI. (2009). Engineers Ireland - Code of Ethics. Retrieved December 2010, from http://www.engineersireland.ie/about-us/governance/code-of-ethics-and-bye-laws/.
HKCS. (2010). Hong Kong Computer Society - Code of Ethics and Professional Conduct. Retrieved December 2010, from http://www.hkcs.org.hk/en_hk/intro/coe.asp.
IPSJ. (1996). Code of Ethics of the Information Processing Society of Japan. Retrieved December 2010, from http://www.ipsj.or.jp/english/somu/ipsjcode/ipsjcode_e.html.
Joyce, D., Blackshaw, B., King, C., & Muller, L. (2003). Codes of Conduct for Computing Professionals: an International Comparison. In S. Mann & A. Williamson (Eds.), Proceedings of the 16th Annual NACCQ, Palmerston North, New Zealand (pp. 71-78).
Nyaw, M.-K., & Ng, I. (1994). A comparative analysis of ethical beliefs: A four country study. Journal of Business Ethics, 13(7), 543-555. Springer Netherlands. Retrieved from http://dx.doi.org/10.1007/BF00881299.
Oz, E. (1993). Ethical standards for computer professionals: A comparative analysis of four major codes. Journal of Business Ethics, 12(9), 709-726. Springer Netherlands. Retrieved from http://dx.doi.org/10.1007/BF00881385.
Payne, D., & Landry, B. J. L. (2006). A uniform code of ethics: business and IT professional ethics. Commun. ACM, 49(11), 81-84. New York, NY, USA: ACM. doi: http://doi.acm.org/10.1145/1167838.1167841.
Swinyard, W. R., Rinne, H., & Kau, A. K. (1990). The morality of software piracy: A cross-cultural analysis. Journal of Business Ethics, 9(8), 655-664. Springer Netherlands. Retrieved from http://dx.doi.org/10.1007/BF00383392.
Wheeler, S. (2003). Comparing Three IS Codes of Ethics - ACM, ACS and BCS. PACIS 2003 Proceedings, Paper 107.
Wong, P.-H. (2009). What should we share?: understanding the aim of Intercultural Information Ethics. SIGCAS Comput. Soc., 39(3), 50-58. New York, NY, USA: ACM. doi: http://doi.acm.org/10.1145/1713066.1713070.

INFORMATION TECHNOLOGY, INTELLECTUAL PROPERTY RIGHTS AND GLOBALIZATION

SORAJ HONGLADAROM
Department of Philosophy
Faculty of Arts, Chulalongkorn University

The main concern of this paper centers on the issues arising from the use of intellectual property rights (IPRs) as a tool of globalization: how creations of information technology are usually protected through the IPR regime, and how the technology is used as a means by which globalization is effected. Works on the justification of intellectual property rights typically fall under two extremes: either they reject IPRs outright, or they accept IPRs as necessary for global commerce and useful innovation.
The former argue that IPRs are hegemonic tools by which the developed countries of the West keep the emerging developing ones at bay, or exploit the natural resources of the developing countries through what is known as biopiracy or bioprospecting. Those who embrace IPRs, on the other hand, usually base their arguments on the claim that IPRs are necessary as a means of protecting those who have invested in creating useful innovations. Problems arise when the products protected by IPRs are carried across national borders and thus become global. In order to ensure the protection afforded by IPRs across countries, a worldwide system of IPR protection has been created which in many cases overrides the sovereignty of states. IPRs are thus clearly tools of globalization; one sees globalization concretely at work through the creation and enforcement of trade-related intellectual property rights across countries in the world today.

The polarized debates around IPRs have created countless cases of conflict between those who fight for globalization and those who are against it. Chief in these debates is the ethical issue, especially when products protected by IPRs have a strong impact on the livelihood and even the survival of those who depend on them. New pharmaceutical products, for example, are almost always patented, which enables the manufacturer to charge very high prices to cover their investments and also to earn profits for their shareholders. However, when people in the poorer developing world are in need of these drugs, it is clear that there are moral issues involved. Are the pharmaceutical companies morally obligated to provide the fruits of their intellectual investments at lower cost, so that they are affordable by the poor? It would strongly seem so. However, there are also cases where IPRs are justified by arguments that they are necessary as an incentive for innovation. Without effective IP protection, the life-saving drugs in question might not have arisen in the first place. Furthermore, there are also cases where IPRs are used as tools for protecting the creations of those within the developing world themselves. Without a workable IPR regime, it is not
This shows that in themselves IPRs are neither positive or negative, no more than a piece of cloth sold in the market is either positive or negative. IPRs then can be used either positively or negativey. For example, when they are used to monopolize life saving drugs so that poorer people cannot afford them, then they are negative, but they can also perhaps become more positive when they are used to advance the interests of poorer people by ensuring, for example, that the plant species belonging to their natural habitats are protected, or their own intellectual creation is recognized and given due protection. As mentioned previously, information technology plays a significant role in all this. First of all, products of information technology itself are usually protected by IPRs. Software is usually protected by copyrights. It is well known that the open source movement in software strikes a middle ground between copyright protection and commercialization on the one hand, and releasing everything onto the public domain on the other. This can be a way out of the impasse, but it needs more thorough theoretical justification, which is also an aim of this paper. Another, no less important, point is that, as the technology spreads the information around, and as information does not have values on its own as previously discussed, information technology itself stands to be used either positively or negatively too. This seems to be a come back to the old position of technological neutralism (the idea that technology is not good or bad in itself). But it is not. When one allows for all the constraints and implications associated with a technology (i.e., when a technology constrains us to behave one way or another due to the nature of that particular technology itself), there is still room for using that technology within these constraints either positively or negatively. Hence, a way is open before us and it is up to us to decide which way to go. We only need to be able to foresee, to the extent that we can, what kind of consequences there will be as a result of our choosing. Acknowledgements Research for this paper has been partially supported by a grant from the National Research University Project, grant number HS1025A and AS569A. - 295 - Proceedings IACAP 2011 Track IX: Surveillance, sousveillance - 296 - The Computational Turn: Past, Presents, Futures? TOWARDS A HERMENEUTIC PHENOMENOLOGY OF CYBERSPACE: POWER VS. CONTROL ANDREAS BEINSTEINER Ph.D. Student Institute of Philosophy Leopold-Franzens-Universität Innsbruck Abstract. Since the 1990ies, regulation by program code has become an issue in theoretical reflection on computers. Michel Foucault’s concepts, and, in particular, Gilles Deleuze’s claim that control societies substitute disciplinary societies in the age of computers, have been popular points of reference. The present paper suggests interpreting control as a form of regulation that is essentially connected to computers: From Foucault’s considerations a distinction is derived between power and control. Control is conceived as a more radical mode of regulation: a determination of possibilities of action that – as is shown by relating Foucault to Martin Heidegger – is first made possible by computer technology. 1. The power of code In an article called “Soft Cities”, William J. Mitchell (2005) explores similarities and differences between traditional “real-world” space and the new, computer-generated spaces. 
He observes that the coded conditionals in cyberspace provide a fundamentally new mode of regulation: you cannot argue with computer programs, you cannot plead with or bribe them. Lawrence Lessig (2006) refines his claim that "code is law" by stating that this new form of regulation rather works through "a kind of physics. A locked door is not a command 'do not enter' backed up with the threat of punishment by the state. A locked door is a physical constraint on the liberty of someone to enter some space." (p. 82) Code is a regulator in cyberspace because it defines the terms upon which a certain cyberspace environment is offered: it decides what can be said and done in that environment. Lessig refers to Michel Foucault (1995), who had addressed the kind of regulations that become relevant in a new way in cyberspace: "Discipline and Punish" introduced the perspective that tiny corrections of space regulate by enforcing a discipline. In fact, Foucault's reflections on disciplinary power are embedded in his larger project of exploring the historical transformations that substitute sovereign power with what he calls biopower: a new kind of power that does not employ law but technology, and that does not prohibit behavior but produces it (Foucault 1998). According to Gilles Deleuze (1995), disciplinary societies have been replaced by control societies in the age of computer technology. Alexander Galloway (2004, 2010) has characterized protocol and program code as the essential means of regulation in control societies.
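A minimal sketch in Python (invented here for illustration; it is not an example from Lessig or Mitchell) may make this "physics" of code concrete: the length limit below is not a command backed by the threat of punishment, since a longer message simply cannot come into being in the environment the program offers.

# A toy cyberspace environment in which regulation works like physics:
# there is no rule "do not post long messages" that a user could choose to
# break and be punished for; the environment simply offers no way to do so.
class ChatRoom:
    MAX_LEN = 140  # an illustrative limit, part of the "physics" of this space

    def __init__(self):
        self.messages = []

    def post(self, text):
        # Truncation is not a sanction after the fact; it defines in advance
        # what can be said and done in this environment.
        self.messages.append(text[:self.MAX_LEN])

room = ChatRoom()
room.post("x" * 500)          # the over-long message never comes into being
print(len(room.messages[0]))  # prints 140

Whereas a legal norm leaves the forbidden action physically possible, the coded conditional removes it from the space of possible actions altogether; this asymmetry is developed in the next section in terms of power and freedom.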
2. Power and freedom

According to Foucault, to exercise power means to structure the possible field of action of others. By doing so, these individuals are transformed into subjects, where the word subject has two meanings: to be subject to someone else's domination, and to be tied to one's own identity. Foucault (2002) emphasizes that power can only be exercised over free subjects. A subject is free insofar as it is not absolutely self-identical or determined. In the extreme case where power constrains action absolutely or physically, both power and freedom disappear: "slavery is not a power relationship, when man is in chains." (p. 221) I suggest conceiving control as such a form of regulation, one that goes beyond power and erases freedom. While the absence of physical determination seems to be a necessary condition for freedom, it is not a sufficient one. Since it does not seem adequate to suppose a kind of metaphysical autonomy in Foucault's conception of the individual, we turn to the relations that Hubert Dreyfus (2003) has established between the concepts of Foucault and Martin Heidegger for a deeper understanding of how to conceive the sources of freedom. According to Dreyfus, Heidegger's question – how things have turned into objects in modernity – is complemented by Foucault's question – how individuals have been turned into subjects. This allows connecting Heidegger's concept of Being with Foucault's concept of power. Since one's goals and horizons of meaning arise from one's background understanding, which Heidegger calls the clearing of Being, exercising power over a certain individual (influencing his/her possibilities of action) is possible by shaping this clearing. A subject is constituted by the corresponding understanding of Being, and the more static this understanding is, the closer to absolute self-identity is the subject. Thus freedom can be grasped as hermeneutic oscillation – as a condition where various understandings suspend and balance each other.

3. Materiality as a source of freedom

According to Heidegger, the understanding of Being has always been influenced by technological artefacts, and vice versa. A tool suggests what it is to be used for: Heidegger's (1995) prominent example is the hammer, which is embedded in a structure of "in-order-to" relations and refers to goals, practices and other tools. In contrast to tools, whose materiality disappears into their usability, works of art emphasize their materiality. By doing so, they expose a fundamental gap between the material sphere and the conceptual sphere. Heidegger (2008) conceives this as a struggle between earth and world. The artwork's materiality cannot be exhaustively interpreted with one conceptual frame; thus it steadily keeps evoking new interpretations. This is how materiality provides a source of freedom. Tools too, due to their materiality, may be abused or used in different ways that were not intended originally. Addressing what he calls the "designer fallacy", Don Ihde (2009) has examined such non-intended usages of technologies. Ihde's argument against the possibility of designing a tool's usage in advance relies on the tool's materiality.

4. Cyberspace as the congruence of material and conceptual

For a long time, theology and science employed god's order of creation or the capacity of human reason to bridge the gap between the conceptual and the material sphere (Heidegger 2008). The task of metaphysics was to provide narratives that justified the adequacy of a certain vocabulary for describing reality. Nietzsche's "death of god" is nothing but the acknowledgement that there is not one single conceptual system that adequately describes reality. The "post-modern" call for conceptual pluralism is a consequence of this insight. In cyberspace environments, however, the productive tension between the material and the conceptual is erased: the programmer is the god who creates this reality, and the respective program code really is an adequate description of this reality. The conceptual and the material sphere coincide in cyberspace. A gun in a 3D shooter game is nothing but a gun, and a buy-with-one-click button in an online shop is nothing but a buy-with-one-click button. The "designer fallacy" argument does not hold in cyberspace. And thus, as agents in a cyberspace environment, we are 100% self-identical subjects. According to my suggestion, this is what control is about.

References

Deleuze, Gilles (1995): Negotiations, 1972-1990. New York: Columbia University Press.
Dreyfus, Hubert (2003): 'Being and Power' Revisited. In Milchman, Alan & Rosenberg, Alan: Foucault and Heidegger: critical encounters (pp. 31-54). Minneapolis: University of Minnesota Press.
Foucault, Michel (1995): Discipline and punish: the birth of the prison. New York: Vintage.
Foucault, Michel (1998): The Will to Knowledge. London: Penguin Books.
Foucault, Michel (2002): The Subject and Power. In Dreyfus, Hubert and Rabinow, Paul: Michel Foucault: Beyond Structuralism and Hermeneutics (pp. 208-226). New York: Harvester Wheatsheaf.
Galloway, Alexander R. (2004): Protocol. How Control Exists after Decentralization. Cambridge, Massachusetts: MIT Press.
Galloway, Alexander R. (2010): Networks. In Mitchell, W.J.T. and Hansen, Mark (Eds.): Critical Terms for Media Studies (pp. 281-296).
Chicago: University of Chicago Press.
Heidegger, Martin (1995): Being and Time. Oxford: Blackwell.
Heidegger, Martin (2008): Basic Writings. New York: Harper Collins.
Ihde, Don (2009): The Designer Fallacy and Technological Imagination. In Vermaas, Pieter E. et al. (Eds.): Philosophy and Design. From Engineering to Architecture (pp. 51-59). Springer.
Lessig, Lawrence (2006): Code version 2.0. New York: Basic Books.
Mitchell, William J. (2005): City of Bits. Cambridge, Massachusetts: MIT Press.

THE WIKILEAKS LOGIC

JEAN-GABRIEL GANASCIA
LIP6 – University Pierre et Marie Curie
4, place Jussieu, 75005, Paris, France
Jean-Gabriel.Ganascia@lip6.fr

Abstract. WikiLeaks focused the attention of the media for a few weeks at the end of 2010. The diplomacy of the United States of America has been called into question. Modern democracies are hampered; as sovereign states, they are now facing a novel dilemma. This paper constitutes an attempt to understand this evolution by seriously considering the WikiLeaks project not as a simple media strategy, but as the possible kickoff of a totally new way of doing politics, in perfect transparency, without secrecy or hidden issues. Our purpose here is both to show how information technologies, of which WikiLeaks is a by-product, contribute to transforming traditional political forms, and how the notion of "sousveillance" helps us to apprehend these evolutions.

1. A Few Recent Facts

WikiLeaks focused the attention of the media for a few weeks at the end of 2010 and, previously, during the summer and the autumn. The diplomacy of the United States of America and of some other countries was called into question by what people called the Cablegate, by analogy with the Watergate. Let us remember that 250,000 secret telegrams containing embarrassing information about American, European and Middle East foreign policies were divulged to newspapers by the WikiLeaks organization. Modern democracies, and especially the United States of America, were hampered. The main argument they developed against WikiLeaks was formal: it concerned the danger posed to those whose names had been explicitly mentioned in the cables. However, it clearly appeared that, for those sovereign states, the question is not just about saving the lives of a few people: they are now facing a novel dilemma. On the one hand, in the last few years many democracies have opened public data to all citizens (Obama 2009). On the other hand, states are used to dealing with many matters, especially in the diplomatic area, either in secrecy or, at least, in a discreet way. As a consequence, they cannot easily accept the divulgation of top-secret information. In brief, the aspiration to total transparency that many of our contemporaries share modifies the rules of government, while WikiLeaks shows the limits of officially proclaimed public transparency.

2. A New Ideal of Transparency

With the recent developments of information technologies, a new ideal of total transparency seems to have been born. Note that, by itself, the ideal of total transparency is not new. It already existed in the 19th century (Benjamin 1934). The use of glass in architecture, for instance the "Crystal Palace" that was built for the London Universal Exhibition in 1851, reflected this ideal.
A few years before, in the end of the 18th century, Jeremy Bentham had described an architecture for surveillance designed to ensure a total transparency (Bentham 1838). Called the Panopticon, it was a model for prisons, factories, hospitals, etc., that have been conceived to make individuals totally visible to their guards, while these ones were invisible to them. The goal of transparency was again to facilitate education, surveillance, care, etc., which enhanced the role and the situation of authority holders. By contrast, the new transparency that is encouraged today is individual and not institutional. It is directed towards and against the authority holders, which are permanently under the cameras. For instance, the policemen are continuously filmed. The professors, physicians, lawyers, politicians etc. are permanently evaluated, etc. The concept of “sousveillance” that was introduced by Steve Mann well characterizes this new form of transparency (Mann 2003). This neologism forged by analogy and opposition to the word surveillance, means that the watcher is situated below (“sous” in French) the authority, while in case of surveillance he is situated above. 3. The Horizon of WikiLeaks To understand the horizon of WikiLeaks, let us first note that Julian Assange, the promoter and editor in chief of WikiLeaks, was initially a computer scientist who first worked on cryptography. So doing, he adopted an atypical posture. While almost all the cryptographers work for armies, secret services or banks, he developed cryptographic tools for people. His idea was to make everybody able to hide information to the authorities (state, company, etc.). Now, with WikiLeaks, Julian Assange proposes to render publicly available all information about authorities. He proposes creating “open governments” where all data about the government and the public decisions would be worldwide accessible to everybody. The underlying idea of a perfect collective transparency seems to justify his action, which somehow refutes his first attitude of privacy protection. 4. Limits of the Generalized Sousveillance The utopia of a generalized sousveillance, i.e. of a sousveillance extended to the overall society, that excludes surveillance, faces an inherent contradiction: the authorities are made of individuals, who, as such, need to be protected, which becomes impossible because of the exclusion of surveillance. Without going deeply in the exploration of this first contradiction, consider now the extension of the sousveillance regime to the overall worldwide society. It faces at least two types of limitations, some being intrinsic, others extrinsic. - 301 - Proceedings IACAP 2011 The main intrinsic limitation is due to our cognitive abilities that are too limited to permit to observe and to assimilate all the information we have at our disposal. As a consequence, we spontaneously filter the information flows and we focus our attention on the most prominent facts. But, we do not decide by ourselves what criteria are adopted to qualify the prominence. Most of the time, this is decided by people who manipulate us by distracting our attention. The second type of limitation is extrinsic in the sense that it is not an own limit of the regime of sousveillence itself, but it is due to foreign factors. Specifically, nothing prohibits the coexistence of a generalized regime of sousveillance with multiple regimes of surveillance. 
For instance, NGOs or big multinational companies may continue to gather and exploit data; they can even take advantage of free public data to extract knowledge useful to their own interests, without any respect for privacy.

5. The Failure of the WikiLeaks Ideal

Despite the attacks to which it has been subjected, and the fact that Julian Assange has been jailed, WikiLeaks is undoubtedly very popular nowadays. There are even attempts to build more or less specialized clones of WikiLeaks in many places all over the world. However, the original Assange project seems to have failed. The causes of this failure are directly related to the limitations of the generalized sousveillance regime expressed in the previous paragraph. First of all, Julian Assange wanted to disseminate data freely, allowing every citizen to get any information he wanted, when he wanted. However, during the Cablegate, WikiLeaks did not freely divulge the 250,000 diplomatic telegrams it had; it sent them to well-established newspapers that had to filter and anonymize the messages and dramatize their publication, with appropriate comments and advertisements. Another failure of the WikiLeaks project is due to the project itself, which was supposed to free people from any kind of authority. However, it clearly appears that WikiLeaks has now become a new authority, playing a role symmetrical to that of other, more traditional authorities such as states, NGOs and companies. Julian Assange himself acts within his own organization without any real transparency, which shows the limits of the generalized sousveillance principle as promoted by WikiLeaks.

References

Benjamin, W. (1934), Selected Writings, Volume 2, 1927-1934. Translated by Rodney Livingstone and others; edited by Michael W. Jennings, Howard Eiland, and Gary Smith.
Bentham, J. (1838), Panopticon or the Inspection House. The Works of Jeremy Bentham, volume IV, 37-172.
Mann, S., Nolan, J., Wellman, B. (2003), Sousveillance: Inventing and Using Wearable Computing Devices for Data Collection in Surveillance Environments, Surveillance & Society 1(3): 331-355, http://www.surveillance-and-society.org, http://wearcam.org/sousveillance.pdf
Obama, B. (2009), Transparency and Open Government, Memorandum for the Heads of Executive Departments and Agencies, The White House, Washington, USA, http://www.whitehouse.gov/the_press_office/Transparency_and_Open_Government/

DEMOCRACY 2.0 - HOW THE WEB MAKES REVOLUTION

ANIS NAJAR
LIP6, Pierre and Marie Curie University
4, Place Jussieu, 75005, Paris, France
anis.najar@lip6.fr

Abstract. "Whoever controls the information owns the power." Many scientists and philosophers have analyzed the relationship between information and power within society, and they have all argued that a kind of dependency exists between the control of information and political power. In this paper, we propose to analyze this dependency from a structuralist point of view, by assuming that changes in the information schema of a society necessarily produce changes in its power schema, thereby characterizing the concepts of surveillance and sousveillance. We examine these changes on two levels, the structure of the information schema and the nature of information, taking as a case study the Tunisian popular revolution, in which information technology played a significant role.

1. Introduction

From a structuralist point of view, we can model the information society as a set of entities exchanging information in some pattern, which we will refer to as the information schema. Similarly, we will call the power schema the pattern representing the balance of power between the entities within the society. Neglecting other socioeconomic factors, we can say that the power schema is largely characterized by the information schema. Therefore, it is reasonable to expect that a revolution in the latter produces a revolution in the former. To illustrate these aspects, we take as a case study the Tunisian popular revolution, which we consider a logical consequence of the anterior revolution of the Information Society. Indeed, just five years earlier, the World Summit on the Information Society, held in Tunisia, reflected the contradiction in the dictator's policy towards information technology: he was simultaneously promoting its use and censoring its access. He did not suspect at the time that, five years later, he would be overthrown by the very thing he was most proud of, namely information technology. In the following, we analyze this revolution on two levels, the structure of the information schema and the nature of information itself.

2. Informational Revolution

2.1. STRUCTURAL LEVEL

Based on the concept of the Panopticon introduced by Jeremy Bentham in 1785 (Bentham 1838), Michel Foucault (Foucault 1975) described the classical schema of surveillance in a society as a hierarchical organization in which the state controls information, both in its dissemination, through the media and education, and in its collection, through intelligence services. This schema also defines the classical power schema as a vertical organization, with the state at the top and the people at the bottom. Censorship has often been the classical way of controlling information in such a configuration. Over the past several years, the Internet has substantially transformed the information schema, which has progressively taken the form of the World Wide Web structure, that of a network. This reversed the power schema in a way that rebalanced the power relationship between the state and the people, by promoting transparency of information and democratization of power. This schema coincides with the architecture of the Catopticon, introduced by Jean-Gabriel Ganascia (Ganascia 2009) to describe the structure of "sousveillance", in opposition to Bentham's Panopticon. Sousveillance has been defined by Steve Mann (Mann 2003) as the acquisition of information technology by people so that they can use it against their keepers. During the Tunisian revolution, we observed a real showdown between the people and the government, especially through social networks, which were a real staging ground for the demonstrations. The advantage provided by the Internet can be explained by several factors. First, notions such as community and sharing, developed through platforms like Wikipedia, Facebook and Twitter, have created a kind of proximity between people and strengthened their solidarity. Second, the distributed character of networks and the speed of information propagation (the small-world effect) make social networks a very effective offensive tool. For example, the worldwide cyberactivist organization known as Anonymous launched an operation called #OpTunisia against the Tunisian Internet Agency servers, paralyzing several government websites. Moreover, the great demonstration that led to the departure of the dictator was organized via Facebook overnight, just after his last speech. Third, this structure is robust against targeted attacks because of the absence of "leaders". Finally, it is effective against censorship because it is always possible to introduce information from some part of the network.

2.2. SEMANTIC LEVEL

The second change in the information society concerns the nature of information content. For some time, multimedia, and especially video, has been increasingly important within the information exchanged over the Internet. Several reasons can explain this. First, the constraints of formalization and formulation downsized the previously privileged position of text, leaving the ground to video, which appears to be a more effective mode of information circulation in terms of quickness and straightforwardness. Second, in addition to the fact that an image is semantically richer than text, it is also much closer to a human's mental representation; it therefore acts more directly on the mental image, which gives it more impact in information transmission. All these factors contributed to the success of video, particularly through video-blogging, and gave birth to a new kind of media, collaborative journalism, in which everyone contributes to the spreading of information. Furthermore, many news TV channels, when they were not allowed to cover events directly, had no choice but to collect and sort amateur videos provided by protestors in order to broadcast them afterwards.

3. Counter-Revolution

Even though the network structure, as we have shown, is resistant to attacks, there is still one kind of attack that is effective against information networks and that takes advantage of the foregoing characteristics: propaganda. That was an essential tactical point that allowed the former regime to launch a counter-revolution by shifting its behavior, in a second phase, from censorship to disinformation. It seems that the regime understood that it would be better able to control information by fabricating it rather than by blocking it. For example, shortly after censorship of the Internet was lifted, multiple Facebook pages were created to turn the opposition parties against each other, and the Ministry of the Interior created an official page for propaganda. Within a few hours, Facebook was flooded with a huge quantity of rumors about criminals and snipers shooting people outside, so that terror led people to stop thinking rationally and to distrust any information. In this way, the government created chaos and paralyzed the network. In the same way, images were also used in the counter-revolution. For the same reasons cited above, they were a very effective tool of manipulation. For example, in an attempt to discredit the protestors, the government staged several acts of violence and spread them on the Internet, so that many people called for the demonstrations to stop.

References

Bentham, J. (1838), Panopticon or the Inspection House. The Works of Jeremy Bentham, volume IV, 37-172.
Foucault, M. (1975), Surveiller et punir, Gallimard, Paris, France, p. 252. In English: Discipline and Punish, trans. A. Sheridan (1977). New York: Vintage.
Ganascia, J.-G. (2009), "The Great Catopticon", in Proceedings of the 8th International Conference on Computer Ethics: Philosophical Enquiry (CEPE), 26-28 June 2009, Corfu, Greece.
Mann, S., Nolan, J., Wellman, B. (2003), Sousveillance: Inventing and Using Wearable Computing Devices for Data Collection in Surveillance Environments, Surveillance & Society 1(3): 331-355, http://www.surveillance-and-society.org, http://wearcam.org/sousveillance.pdf

NEGATIVE SOUSVEILLANCE

CARSON REYNOLDS
University of Tokyo, Department of Creative Informatics
carson@k2.t.u-tokyo.ac.jp

Abstract. Recent catastrophes have increased the desire to get rapid information about infrastructure such as power and services, and not necessarily from the people providing these services. While news sources seek to provide such information, they are biased toward providing information that increases reader or viewer interest. Sousveillance is appropriate in these cases, and here we describe an unusual method for such observation, which we call negative sousveillance: observing which systems or services disappear in a time of catastrophe and reporting on their disappearance.

1. What Disappeared?

Mann's notion of "watchful vigilance from underneath" is useful in cases in which the surveilled feel that information may be used to harm them. But what of the special case in which the disenfranchised feel that information is being withheld from them? Amid the recent earthquake, tsunami, and nuclear power crises in Japan in 2011, several individuals expressed to me the feeling that they "are not being told everything." Indeed, WikiLeaks' (Pilger, 2010) recent diplomatic cable archive documents the extent to which governments and organizations routinely keep politically delicate details out of the public eye. Negative databases (Esponda, 2006), on the other hand, are designed to solve a different problem altogether: keeping records which, if stolen, do not reveal the identities of individuals. Negative databases achieve this by storing the complement of the set of what is being tracked; essentially, the database shows what isn't of concern. The work of Trevor Paglen involves long-distance photography and data analysis to document secret installations. Extending his approach, the negative intelligence gatherer would seek to understand what websites, infrastructure systems, environmental sensors or documents have become unavailable. The negative sousveillance concept, then, is to record, track, or infer what isn't there.

This suggests a two-stage process. The first step is for citizens or activists to survey or map infrastructure systems or environmental status. Paulos, Honicky, and Hooker (2009) showed how urban populations could use mobile phones as dense environmental sensors for citizen science. Analogously, Bonanni et al. (2010) have created a system for tracking and accounting for supply chains and their environmental effects. Projects such as OpenStreetMap have already sought to create public domain maps of the physical world. The second step is to record what has disappeared. The approach is broadly applicable. Those interested in digital image manipulation can keep a delta showing how an image is gradually altered over time through the addition of watermarks or the removal of figures from the scene. Those interested in network systems can track network outages due to disasters or kill switches, which would be used by governments to limit Internet access (Cowie, 2011). The practices of negative information gatherers would in some cases be similar to those of network security professionals. They might proceed by using tools such as nmap to scan various network services and store the results in a database (Lyon, 2009); as services disappear, they would then be listed in the far more interesting negative database (a sketch of such a scan-and-diff pipeline is given below). Those interested in environmental sensors may either try to gain access to the sensor data or deploy their own environmental sensor network. When nodes in such a network stop responding, further investigation is warranted: it may be that a node needs to be replaced, that it has been tampered with, or that it has been destroyed by environmental causes. But the absence of information is just as interesting as a steady broadcast.
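The scan-and-diff pipeline referred to above is easy to make concrete. The following is a minimal sketch under stated assumptions, not Reynolds' implementation: each scan (e.g., parsed from nmap's grepable output) is assumed to have already been reduced to a set of (host, port) pairs, and SQLite keeps a positive table of currently visible services alongside the "negative" table of services that have vanished. All table and function names are illustrative.

```python
import sqlite3
import time

def init(db):
    db.execute("CREATE TABLE IF NOT EXISTS seen (host TEXT, port INTEGER, "
               "PRIMARY KEY (host, port))")
    db.execute("CREATE TABLE IF NOT EXISTS negative (host TEXT, port INTEGER, "
               "vanished_at REAL)")

def record_scan(db, services):
    """services: set of (host, port) pairs observed in the current scan."""
    previous = set(db.execute("SELECT host, port FROM seen"))
    vanished = previous - services
    now = time.time()
    for host, port in vanished:
        # The interesting record: what used to answer and no longer does.
        db.execute("INSERT INTO negative VALUES (?, ?, ?)", (host, port, now))
        db.execute("DELETE FROM seen WHERE host=? AND port=?", (host, port))
    for host, port in services - previous:
        db.execute("INSERT OR IGNORE INTO seen VALUES (?, ?)", (host, port))
    db.commit()
    return vanished

if __name__ == "__main__":
    db = sqlite3.connect("sousveillance.db")
    init(db)
    # In practice these sets would come from periodic nmap runs,
    # e.g. `nmap -p 1-1024 -oG - <targets>` parsed for open ports.
    record_scan(db, {("192.0.2.10", 80), ("192.0.2.10", 443)})
    gone = record_scan(db, {("192.0.2.10", 443)})   # port 80 vanished
    print(gone)  # {('192.0.2.10', 80)}
```

Note that the anticipatory step discussed next, enumerating what exists before it disappears, corresponds to populating the positive table ahead of a crisis.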
The anticipatory step of documenting infrastructure before it disappears is also useful in disaster situations, when officials may be inundated with requests for information. I believe the question "is X inoperative?" is easier to answer than "what kinds of X exist, and are they inoperative?" With careful foresight, the negative database may be able to answer both questions without relying on officials or outside organizations for details.

2. Skepticism & DIY Authority

The feeling of powerlessness that comes from a lack of information can be alleviated by the realization that you yourself can gather information. While news sources, corporate press releases, and government agencies often have access to expert assessment, it is fair to question whether such experts have biases. For instance, news outlets may err on the side of sensationalism to stir up concern about a recent event; corporations may time announcements to minimize the impact of bad news (Gross, 2004); and agencies may try to minimize widespread panic at the expense of accurate information. One interesting aspect of DIY infrastructure, environment, or network monitoring is that those affected can collect and analyze the details that affect them. When objects disappear from view, instead of entering a memory hole they are specially noted as they are entered into a negative database. It is our hope that less will escape the notice of those willing to do the legwork involved in becoming authorities themselves.

References

Bonanni, L., Hockenberry, M., Zwarg, D., Csikszentmihalyi, C., & Ishii, H. (2010). Small business applications of sourcemap. Proceedings of the 28th International Conference on Human Factors in Computing Systems – CHI '10 (p. 937). New York, New York, USA: ACM Press. doi:10.1145/1753326.1753465
Cowie, J. (2011). Egypt Leaves the Internet. Retrieved from http://www.renesys.com/blog/2011/01/egypt-leaves-the-internet.shtml
Esponda, F., Ackley, E., Helman, P., Jia, H., & Forrest, S. (2006). In S. K. Katsikas, J. López, M. Backes, S. Gritzalis, & B. Preneel (Eds.), Information Security, Lecture Notes in Computer Science, 4176, 72-84. Berlin, Heidelberg: Springer. doi:10.1007/11836810
Gross, D. (2004). Friday Night Blights. Slate. Retrieved from http://www.slate.com/id/2106864/
Lyon, G. F. (2009). Nmap Network Scanning: The Official Nmap Project Guide to Network Discovery and Security Scanning. Retrieved March 16, 2011, from http://portal.acm.org/citation.cfm?id=1538595
Mann, S. (1998). 'Reflectionism' and 'diffusionism': new tactics for deconstructing the video surveillance superhighway. Leonardo, 31(2): 93-102.
OpenStreetMap Foundation. (2011). OpenStreetMap. Retrieved from http://www.openstreetmap.org/
Paglen, T. (2011). Visual Projects. Retrieved March 14, 2011, from http://www.paglen.com/pages/projects.htm
Paulos, E., Honicky, R., & Hooker, B. (2009). Citizen Science: Enabling Participatory Urbanism. In Handbook of Research on Urban Informatics: The Practice and Promise of the Real-Time City (pp. 414-436). doi:10.4018/978-1-60566-152-0.ch028
Pilger, J. (2010). Why WikiLeaks must be protected. New Statesman, 139(5015), 18.

GOVERNMENT APPROACHES FOR MANAGING ELECTRONIC IDENTITIES OF CITIZENS – EVOKING A CONTROL DILEMMA?

STEFAN STRAUSS
Austrian Academy of Sciences, Institute of Technology Assessment (ITA)
Strohgasse 45/5, A-1030, Vienna, Austria
sstrauss@oeaw.ac.at

Abstract. Governments worldwide are introducing electronic identity systems to adapt the process of citizen identification to the needs of the information society. These innovation processes primarily aim at improving e-government services, but they imply further societal and political objectives. The emergence of identity management represents a demand for (re)gaining control over personal data in virtual environments. Compared to the predominant security goals, privacy aspects are often neglected and insufficiently implemented. An analysis from a privacy perspective shows that the current situation of governmental e-ID can be described as a control dilemma: despite its aim to (re)gain control, the e-ID could ironically foster a further loss of control over individual privacy. As a consequence, an e-ID system might itself turn into a sort of amplified surveillance interface. In this regard, the e-ID could become a synonym for a panoptic instrument of power. The e-ID example points to the major challenge of enhancing governmental transparency for individuals and the public sphere, in order to compensate for a further growth of information asymmetries and imbalanced control over personal information between citizens and governments.

Information and communication technologies continually pervade everyday life and change the dynamics of data processing and information handling in many respects. Significant increases in personalized services and social interactions over Web 2.0 applications inevitably entail further growth of digital data, making it harder for individuals to control personal information and protect their privacy. The convergence of analog and digital environments further accelerates these trends. The increasing relevance of electronic identity management (IDM) as an important field of research in the information society (Halperin/Backhouse 2008) is a prominent example of this convergence. While many different IDM concepts exist, national governments in particular have made remarkable efforts in recent years to introduce electronic ID cards to support online public services; the primary objectives are improving security and unifying identification and authentication procedures in e-government. Identification is a core function of governments, and thus the creation of national e-ID systems implies far-reaching societal transformations (Aichholzer/Strauß 2010) that contribute "to alter the nature of citizenship itself" (Lyon 2009).

Hence, e-ID is more than an identification device; it becomes a policy instrument, and the focus shifts more and more from a "detecting" tool to an "effecting" tool, i.e., an instrument not only to support administrative procedures such as ascertaining identity in public services, but to enable services and to effect societal and political objectives (Bennett/Lyon 2008). In EU information society policies, inter alia, the vision is to set up a "pan-European infrastructure for IDM in support of a wide range of e-government services" (CEN 2004); introducing an e-IDMS also aims at fighting identity fraud and terrorism (CEN 2004). Privacy is obviously of vast importance but plays a rather implicit role, while security issues predominate. Although the introduction of e-ID is not to be seen as a consequence of the 9/11 tragedy, this strong security focus was in some respects catalyzed by it (Bennett/Lyon 2008). E-ID cards "have become the tool of choice for new forms of risk calculation" and enable a "mode of pre-emptive identification" (Lyon 2009). History offers many examples of social discrimination and population control, drastically illustrating the strong relations between identification and surveillance (Bennett/Lyon 2008; Lyon 2009). But IDM is not inherently a privacy threat. Whether an e-IDMS becomes an instrument of surveillance naturally depends on the concrete system implementation and its surrounding framework. Properly designed with respect to privacy enhancement, an e-IDMS might contribute to informational self-determination, i.e., proactively support individuals in handling their different identities in different contexts and controlling their personal data (Clauß et al. 2005), which is the very idea of IDM. However, current e-ID card schemes include privacy mechanisms only rudimentarily and do not correspond to privacy-enhancing IDM (Naumann/Hobgen 2009). Particular problems are insufficient implementations of anonymity and pseudonymity, undermining the concept of unlinkability, which is essential to prevent "privacy-destroying linkage and aggregation of identity information across data contexts" (Rundle et al. 2008). The growing amount of personal data due to further trends towards pervasive computing environments intensifies these problems, as identity never shrinks (Pfitzmann/Borcea-Pfitzmann 2010). The increasing visibility of identification mechanisms entails a sort of shadow (the term echoes Alan Westin's Privacy and Freedom, 1967, and the notion of a "data shadow"). This "identity shadow" facilitates data linkage and de-anonymization (Strauß 2011). Surveillance tendencies and predominant security objectives in e-ID development imply further frictions. Combined with the evident danger of function creep, i.e., an extension of the purposes for which the e-ID is used, this could lead to the advent of a ubiquitous IDM infrastructure entailing further privacy threats.

The current situation can be described as a control dilemma: while the increasing role of IDM represents "a demand to regain control over personal data flowing in digital environments", the creation of a governmental e-IDMS to fulfill this demand could ironically foster a further loss of control over individual privacy (Strauß 2011). In this sense, an e-IDMS has several similarities to Foucault's (1977) interpretation of the panopticon "as a generalizable model of functioning; a way of defining power relations in terms of the everyday life of men". Social control becomes automated, as the algorithms of the system define the way one's identity is treated, i.e., the degree of service provision based on automated categorization. The trap of visibility (Foucault 1977) is here the increasing ID-obligation triggered by the e-IDMS. While the system becomes more and more visible, its functioning becomes increasingly blurred for individuals. They have to reveal their ID without knowing whether and for what purpose it is used, analogous to the uncertain presence of the guard in the watchtower. The consequences would be self-censorship and limited individual freedom, because "without transparency, one cannot anticipate or take adequate action" (Hildebrandt 2008). The control dilemma highlights the demand for more effective privacy concepts and control mechanisms, enabling citizens and the public sphere to verify proper and legal data usage. One crux is the system-inherent realization of anonymity and pseudonymity (a sketch of one such mechanism is given below) and, relatedly, thorough data minimization, addressed e.g. by emerging approaches (e.g., http://vanish.cs.washington.edu) for an expiration date on digital data (Mayer-Schönberger 2009). However, their practicability is limited, and they cannot solve the problem of information asymmetries between the governed and those who govern. Thus, the major challenge is to compensate for this imbalanced control over personal information by enhancing governmental transparency for individuals and the public sphere.
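To make the unlinkability requirement concrete: one family of privacy-enhancing designs derives a different pseudonym for each administrative sector from a citizen's master identifier, so that records held by different authorities cannot be joined on a common key; sector-specific identifier schemes of roughly this kind exist in practice. The snippet below is a minimal illustrative sketch, not a description of any deployed e-IDMS; the key handling and all names are assumptions.

```python
import hmac
import hashlib

def sector_pseudonym(master_id: str, sector: str, key: bytes) -> str:
    """Derive a per-sector pseudonym from a master identifier.

    The derivation is one-way: a sector that holds only its own
    pseudonyms cannot recover the master identifier, and pseudonyms
    from two sectors cannot be linked without the derivation key.
    """
    msg = (master_id + "|" + sector).encode("utf-8")
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

# Hypothetical demo: the derivation key would be guarded by the
# identity provider and never shared with the sectors themselves.
key = b"held-only-by-the-identity-provider"
tax = sector_pseudonym("citizen-4711", "tax-office", key)
health = sector_pseudonym("citizen-4711", "health-insurance", key)
print(tax != health)  # True: the same citizen is unlinkable across sectors
```

Unlinkability in this technical sense is necessary but not sufficient: it does nothing against the information asymmetry discussed above if the identity provider itself logs every derivation.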
References

Aichholzer, G. & Strauß, S. (2010). Electronic Identity Management in e-Government 2.0: Exploring a System Innovation Exemplified by Austria. Information Polity 15(1-2), 139-152.
Bennett, C. J. & Lyon, D. (2008). Playing the Identity Card: Surveillance, Security and Identification in Global Perspective. London and New York: Routledge.
Clauß, S., Pfitzmann, A., Hansen, M., Herreweghen, E. V. (2005). Privacy-Enhancing Identity Management, issue 67, Institute for Prospective Technological Studies (IPTS).
Comité Européen de Normalisation – CEN (2004). CEN/ISSS Workshop eAuthentication: Towards an Electronic ID for the European Citizen, a Strategic Vision, Brussels.
Foucault, M. (1977). Discipline and Punish: The Birth of the Prison, trans. A. Sheridan. London: Penguin.
Halperin, R. & Backhouse, J. (2008). A Roadmap for Research on Identity in the Information Society. Identity in the Information Society 1(1), 71-87.
Hildebrandt, M. (2008). Profiling and the Rule of Law. Identity in the Information Society 1(1), 55-70.
Lyon, D. (2009). Identifying Citizens: ID Cards as Surveillance. Cambridge: Polity Press.
Mayer-Schönberger, V. (2009). Delete: The Virtue of Forgetting in the Digital Age. Princeton: Princeton University Press.
Naumann, I., Hobgen, G. (2009). Privacy Features of European eID Card Specifications. European Network and Information Security Agency – ENISA.
Pfitzmann, A. & Borcea-Pfitzmann, K. (2010). Lifelong Privacy: Privacy and Identity Management for Life. In: Bezzi, M. et al. (Eds.), Privacy and Identity Management for Life, Proc. of the 5th Int. PrimeLife/IFIP Summer School, IFIP AICT Vol. 320 (pp. 1-17). Heidelberg: Springer.
Rundle, M., Blakley, B., Broberg, J., Nadalin, A., Olds, D., Ruddy, M., Guimarares, M. T. M., Trevithick, P. (2008). At a Crossroads: "Personhood" and Digital Identity in the Information Society, No. JT03241547, OECD.
Strauß, S. (2011). The Limits of Control – (Governmental) Identity Management from a Privacy Perspective. In: Fischer-Hübner, S., et al. (Eds.), Privacy and Identity Management for Life, Proc. of the 6th Int. PrimeLife/IFIP Summer School – revised selected papers, IFIP AICT Vol. 352 (pp. 206-218). Heidelberg: Springer.

Track X: SIG Track – Machines and Mentality

MORAL EMOTIONS FOR ROBOTS

RONALD C. ARKIN
Mobile Robot Laboratory, Georgia Institute of Technology
85 5th ST NW, Atlanta, GA 30332 U.S.A.

As robotics moves toward ubiquity in our society, there has been only passing concern for the consequences of this proliferation (Sharkey, 2008). Robotic systems are close to being pervasive, with applications involving human-robot relationships already in place or soon to occur, involving warfare, childcare, eldercare, and personal and potentially intimate relationships. Without sounding alarmist, it is important to understand the nature and consequences of this new technology for human-robot relationships. To ensure that societal expectations are met, this requires an interdisciplinary scientific endeavor to model and incorporate ethical behavior into these intelligent artifacts from the outset, not as a post hoc activity. We must not lose sight of the fundamental rights human beings possess as we create a society that is more and more automated. One of the components of such moral behavior, we firmly believe, involves the use of moral emotions.

Haidt (2003) enumerates a set of moral emotions, divided into four major classes: other-condemning (contempt, anger, disgust); self-conscious (shame, embarrassment, guilt); other-suffering (compassion); and other-praising (gratitude, elevation). Allen et al. (2006) assert that in order for an autonomous agent to be truly ethical, emotions may be required at some level: "While the Stoic view of ethics sees emotions as irrelevant and dangerous to making ethically correct decisions, the more recent literature on emotional intelligence suggests that emotional input is essential to rational behavior". These emotions guide our intuitions in determining ethical judgments, although this is not universally agreed upon (Hauser, 2006). From a neuroscientific perspective, Gazzaniga (2005) states: "Abstract moral reasoning, brain imaging is showing us, uses many brain systems", and he identifies the locus of moral emotions in the brainstem and limbic system.

The relatively young machine ethics community has focused largely to date on developmental ethics, where an agent develops its own sense of right and wrong in situ. In general, these efforts largely ignore the moral emotions as a scientific basis worthy of consideration. Nonetheless, considerable research has been conducted on the role of emotions in robotics, including work in our laboratory over the past 20 years (Arkin, 2005; Moshkina et al., 2011). Far less explored in robotics is the set of secondary moral emotions and their role in robot behavior and human-robot interaction. One example is the work of De Melo et al. (2009), who have demonstrated that the presence of moral affect in human-robot interaction is both discernible and enhances the interplay between humans and robot-like avatars. Our own research (Arkin and Ulam, 2009) in the moral affective space is illustrated by the use of guilt, incorporated into an ethical robotic software architecture designed for lethal military applications. Guilt is "caused by the violation of moral rules and imperatives, particularly if those violations caused harm or suffering to others" (Haidt, 2003) and is recognized as being capable of producing proactive, constructive change (Tangney et al., 2007). The specific architectural component we have implemented, referred to as the ethical adaptor, incorporates Smits and De Boeck's (2003) mathematical model of guilt, which is used to proactively alter the behavior of the robotic system in a manner that will lead to a reduction in the recurrence of events deemed guilt-inducing. In our initial application, this focuses on the deployment of lethal autonomous weapons systems on the battlefield, with respect to unexpectedly high levels of battle damage. Simulation results demonstrate the ethical adaptor in operation.
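As a rough illustration of how such an adaptor can gate behavior, consider the following sketch. It is ours, made under simplifying assumptions, and does not reproduce the mathematics of Smits and De Boeck's componential model: a scalar guilt level grows when observed battle damage exceeds what was anticipated, and once it crosses successive thresholds it progressively withdraws the most destructive actions from the robot's repertoire. All names and numbers are invented for illustration.

```python
class EthicalAdaptor:
    """Illustrative guilt-gated action selection (names are hypothetical)."""

    # Actions ordered by destructive potential; higher guilt removes more.
    ACTIONS = ["observe", "warn", "fire_precision", "fire_area"]
    # Assumed guilt thresholds; non-lethal actions are never withdrawn.
    THRESHOLDS = {"fire_area": 0.4, "fire_precision": 0.8}

    def __init__(self):
        self.guilt = 0.0  # normalized to [0, 1]

    def after_action_report(self, expected_damage, observed_damage):
        # Guilt grows only when outcomes are worse than anticipated,
        # e.g. unexpectedly high battle damage.
        overshoot = max(0.0, observed_damage - expected_damage)
        self.guilt = min(1.0, self.guilt + overshoot)

    def permitted_actions(self):
        return [a for a in self.ACTIONS
                if self.guilt < self.THRESHOLDS.get(a, float("inf"))]

adaptor = EthicalAdaptor()
print(adaptor.permitted_actions())     # all four actions available
adaptor.after_action_report(0.2, 0.7)  # damage far above expectation
print(adaptor.permitted_actions())     # 'fire_area' withdrawn
adaptor.after_action_report(0.1, 0.6)
print(adaptor.permitted_actions())     # only non-lethal options remain
```

The key property, shared with the ethical adaptor, is that guilt only ever restricts behavior; it never licenses additional force.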
"TAME: Time-Varying Affective Response for Humanoid Robots", International Journal of Social Robotics. Sharkey, N. (2008). “The Ethical Frontiers of Robotics”, Science, (322): 1800-1801. Smits, D., and De Boeck, P., (2003). “A Componential IRT Model for Guilt”, Multivariate Behavioral Research, Vol. 38, No. 2, pp. 161-188. Sparrow, R., (2012). “The March of the Robot Dogs”, Ethics and information Technology, Vol. 4(2). Tangney, J., Stuewig, J., and Mashek, D., (2007). “Moral Emotions and Moral Behavior”, Annu. Rev. Psychol., Vol.58, pp. 345-372. - 315 - Proceedings IACAP 2011 ON DEEPLY UNCONSCIOUS INTENTIONAL STATES KONSTANTINE ARKOUDAS Telcordia Research Piscataway, NJ, USA In this note I will argue against the thesis that humans are equipped with computational structures and algorithms that are unconsciously used for logical reasoning. This thesis represents the received view in cognitive science, particularly in the psychology of reasoning. According to it, the processes by which people reason are unconscious and therefore inaccessible to introspection. The unconsciousness that these cognitive scientists allege is deep. Unconscious mental states of this form are not like the preconscious states of Freud, such as beliefs that can be ascribed to me when I am in dreamless sleep. For instance, when I am asleep I continue to believe that the second world war ended in 1945, even though I do not consciously entertain that belief during that time. The belief is preconscious; even though it is not conscious most of the time, I can easily bring it to mind by my own volition. The ``deep unconscious'' of contemporary cognitive science is also quite unlike Freud's ``dynamic unconscious'' (repressed memories, desires, etc.), although the theory---and controversies---of the latter need not detain us here. But at least repressed mental states could potentially come to the surface via therapy. The unconscious mental states posited by contemporary cognitive science are much more hermetically sealed. I will use mental-logic theories (MLT) to anchor my discussion, but the arguments I will be making will apply to other computational accounts of reasoning, such as mentalmodel theory. I believe that it might be possible to adapt these arguments in a way that will make them applicable to any theory that postulates unconscious computation, including theories of low-level peripheral cognition such as perception and language. But in what follows I will only be concerned with computational theories of reasoning. For simplicity, I will restrict attention to propositional logic, and specifically to what is often called the ``logical judgment'' problem, whereby a small number of fairly simple premises are given (often just one premise), along with a putative conclusion, and the problem is to determine whether the conclusion follows deductively from the premises. Alice is a college sophomore without any training in formal logic, although perhaps she has a meager background in algorithms (e.g., she might know what an algorithm is, and have a vague notion of what loops and conditional branches are for). According to mental-logic doctrine, Alice is equipped with a module for reasoning in propositional logic that consists of: (1) a number of inference schemas, such as modus ponens; and - 316 - The Computational Turn: Past, Presents, Futures? (2) a control procedure, which, presented with a reasoning problem, regulates the selection of which inference rules to apply, when to backtrack, and so on. 
Now let L be Alice's logic for "logical judgment" in propositional logic, and let R be the associated procedure. And let P be a simple propositional reasoning problem. Presumably, if we presented Alice with P, her mental logic would kick in, R would operate for a finite period of time, and before long an answer would emerge. The contents of both L and R are in thinkable form, and indeed are eminently learnable. L presumably contains such straightforward inference rules as the contrapositive, and R contains a small number of simple instructions such as conditional branching and looping. It is quite conceivable, therefore, that Alice can be taught the specific rules of L and the algorithm R, and can voluntarily and consciously follow R. This does not have to be deliberate, in that I am not assuming that L and R are taught to Alice as the very mental logic that her own mind contains for propositional logical judgment. They could be taught to her fortuitously, as part of a random teaching assignment by a teacher, or by some instructor as part of a cognitive science experiment, and it could just so happen, by accident, that what she is taught is in fact identical to her "mental logic," although Alice herself is entirely unaware of this. In fact, Alice might not even be aware that she has such a logic at all.

Now suppose that after a short crash course on L and R, Alice is presented with problem P and goes to work consciously applying R, while, unconsciously and unbeknownst to her, she is applying the very same procedure at the same time. The exact same process unfolds in two duplicate and concurrent threads, tracing two sequences of intentional states, which I will write as s_1, ..., s_n for the conscious process and s_1', ..., s_n' for the unconscious one. We might allow (as is surely logically possible, though improbable) that the concurrency is exact, and that the two threads proceed in perfect lockstep. I claim that s_i and s_i' are identical intentional state tokens for each i = 1, ..., n. We might say that two intentional states are type-identical if they have the same mode and the same content (propositional or otherwise); so, for instance, your belief that Obama is the president of the USA is type-identical to my belief that Obama is the president of the USA, because both the psychological mode (belief) and the content (that Obama is the president of the USA) are identical. What are reasonable identity criteria for intentional state tokens? Two intentional state tokens of one and the same person are identical if they have the same mode, the same content, the same causes, and sufficient temporal proximity. In the present scenario, all these conditions obtain. Content and mode are identical by virtue of the fact that the logic and the algorithm on both levels are identical, and the causes are also the same in both cases: the execution of that particular algorithm on that particular input. Remember that according to the standard computational theory of the mind, the algorithms that are postulated by various cognitive scientists involve intrinsic intentionality (i.e., they are not observer-relative) and are causally efficacious. That is, a person's cognitive activity and concomitant intentional states are the way they are because he or she is running the algorithm in question. So in both cases, it is the deployment of the same algorithm on the same input that is causing the states. Of course, in this version of the thought experiment we actually have more than that: we also have complete temporal overlap. So, for any i, both s_i and s_i' are occurring at the exact same time, in the same mind, with the exact same contents, and the exact same causes and effects. Therefore, the states are identical. But this is a contradiction, because we are now led to admit that one and the same intentional state is simultaneously occurring both consciously and unconsciously. I regard the contradiction as a reductio of the hypothesis that the process s_1', ..., s_n' is occurring unconsciously; that the process s_1, ..., s_n is consciously occurring is, of course, beyond doubt. I conclude that there are no such unconscious intentional states. The only intrinsic intentional states and computational processes that actually take place are the conscious ones.

OUTLINING A COMPUTATIONALLY PLAUSIBLE APPROACH TO MENTAL STATE ASCRIPTION

WILL BRIDEWELL
Center for Biomedical Informatics Research
Stanford University, Stanford, CA USA
AND
ALISTAIR ISAAC
Department of Philosophy
University of Michigan, Ann Arbor, MI USA
AND
PAT LANGLEY
Computer Science and Engineering
Arizona State University, Tempe, AZ USA

1. Extended Abstract

No one would dispute that social cognition is a key characteristic of human-level intelligence. However, within the artificial intelligence literature, we find no system that carries out more than a rudimentary level of social interaction. Previous theoretical work on social information processing usually treats agents as input-output systems that lack internal representations of each other (e.g., multi-agent systems) or develops formalisms unsuitable for practical implementation (e.g., undecidable epistemic logics). To move forward, new strategies for modeling interaction need to tractably support reasoning about the mental states of oneself and others. Here, we present steps toward such a model, which we hope will address the need for a computationally plausible approach and eventually lead to a system that can engage in complex dialog with others.

An agent's mental space is partitioned into models of agents. One of these is the model of self, which serves as the default source of truth when reasoning about the world. From a computational perspective, we find it useful to separate different modalities of mentality into different regions. For instance, inside its self model, an agent will have a structure that stores beliefs about the state of the world, one that stores goals, which indicate desired future states of the world, and one that stores intentions, which are actions that manifest the goals. Since goals and intentions in this representation refer to mental states of which the agent is aware, we loosely use those terms as shorthand for the agent's beliefs about its goals and its beliefs about its intentions. On this view, the primitive mental object is the belief.

Continuing with the computational perspective, we represent a belief as a data structure that contains a literal representing its content and other contextual features necessary to guide reasoning.
These features include temporal aspects analogous to valid time and transaction time in a database. That is, the literals in a belief may be associated with the period of time during which they were true (e.g., yesterday, Jeff ate lunch between 11 and 12) and the period of time during which they were held (e.g., I believed that Chris was a man until I met her), both of which may overlap. Asserting a belief as a goal or an intention involves placing it in the appropriate mental partition and does not require a corresponding change in representation.

In addition to beliefs, which are stored within agent models, we represent relationships among those models. The principal agent model (i.e., the model of the self) connects to internalized models of other agents. These models are accessible through a believes relation. For example, consider a technical support agent conversing with a customer. During the exchange, the support agent may reason about whether the customer believes that his computer is plugged in. Trivially, we might represent this statement as (belief Customer (plugged-in computer)), which tells the system implementing the agent to look in the beliefs of the Customer model believed by the principal agent. Continuing, the agent may have a goal (goal (belief Customer (not (plugged-in computer)))). This goal would appear in a second Customer model, connected to the agent's goal space instead of its belief space. Notably, the goal, intention, and belief operators are not modal operators; for our purposes, they index mental spaces that contain sets of beliefs.

Importantly, knowledge is stored only when necessary. The principal agent's default assumption is that other agents' beliefs accord with its own. If the principal agent has no reason to believe that another agent is in disagreement, then that agent's model will be empty. In the previous example, if the agent believes (plugged-in computer) and (believes Customer (plugged-in computer)), the actual belief will appear only in the principal agent's model. The other models inherit the beliefs of their parents via default reasoning, unless a specific belief is overridden by a locally stored, incompatible one such as (not (plugged-in computer)). As a rough approximation, we assume that all agents share the same inference mechanisms and long-term knowledge (e.g., rules) and do not attempt to represent differences in cognitive ability or domain knowledge.
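The override-or-inherit lookup just described can be stated operationally in a few lines. The following sketch is ours, not the authors' system, and the class and method names are invented: each agent model stores only locally asserted literals, and a query falls through to the parent model unless a locally stored, incompatible belief blocks it.

```python
class AgentModel:
    """A mental space that inherits beliefs from its parent by default."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.local = set()          # locally stored (possibly negated) literals

    def assert_belief(self, literal):
        self.local.add(literal)

    def believes(self, literal):
        negated = ("not", literal)
        if literal in self.local:
            return True
        if negated in self.local:   # local override blocks inheritance
            return False
        if self.parent is not None: # default reasoning: ask the parent model
            return self.parent.believes(literal)
        return False

self_model = AgentModel("self")
customer = AgentModel("Customer", parent=self_model)

self_model.assert_belief(("plugged-in", "computer"))
print(customer.believes(("plugged-in", "computer")))   # True, inherited

customer.assert_belief(("not", ("plugged-in", "computer")))
print(customer.believes(("plugged-in", "computer")))   # False, overridden
print(self_model.believes(("plugged-in", "computer"))) # True, unchanged
```

Nested ascriptions such as (believes Tom (believes Harry ...)) would then correspond to chains of such models, which is where the perspective-shifting discussed below comes in.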
With this basic framework in mind, there are six challenges that must be addressed to implement a functioning system. Here we present them, along with our proposed solutions to two of the most compelling ones.

1. When are new agent models introduced?
2. When are agents linked to each other?
3. How are agents traversed to unpack a nested statement?
4. What is taken as common ground?
5. How are beliefs ascribed to nested agents?
6. How does one agent reason about another?

Addressing the first challenge, the most apparent situation is when a new agent joins a conversation. If individuals discuss an absent agent, one may treat that agent either as a simple object or as an agent to whom one may ascribe beliefs. To illustrate, suppose Tom tells the principal agent, "Harry likes pudding." That would correspond to some belief, either in the principal model or in the Tom model, that resembles (likes Harry pudding). If, instead, Tom said, "Harry said that he likes pudding," we would need to create a model of Harry that would let us store (believes Harry (likes Harry pudding)). Where the belief resides depends on the mental states of the other agents and how their models are connected.

Answering the sixth challenge, we recall that all agents are assumed to use the same inference system and domain knowledge as the principal agent. Typically this mechanism "resides" in that agent's model. However, one can shift perspective by moving the seat of the inference system to another agent model. In this sense, there is a clear relationship to simulation theory, but the domain knowledge may include rules that encode how agents reason about each other, much like the theory-theory. As a result, we can integrate ideas from both camps to help reach our operational goal: intelligent systems that can collaborate and engage with people in realistic dialogs.

Acknowledgements

Will Bridewell and Pat Langley are funded by the Office of Naval Research under Contract No. ONR-N00014-09-1-1029. Alistair Isaac is funded by a postdoctoral fellowship from the McDonnell Foundation Research Consortium on Causal Learning.

AGENCY: ON MACHINES THAT MENTALIZE

MARCELLO GUARINI
University of Windsor
401 Sunset, Windsor, ON, Canada N9B 394

1. Agency, Responsibility, and Mentalizing

The ability of human beings to attribute mental states has been variously referred to as "mindreading" and "mentalizing." The purpose of this paper is to examine the relationship between agency and mentalizing. Two dimensions of agency will be discussed. The first is the ability of a human or machine to take responsibility for his/her/its actions and thoughts, a first-person ability. The second is the ability to hold others responsible, a third-person ability. Both of these activities are important for various forms of social interaction, and neither would be possible without mentalizing. It will be shown that various mindreading abilities, such as tracking perception, desire, the source of belief, and false belief, are central to the notion of agency in ethical, epistemic, and legal contexts. This has implications not only for how we understand human agency, but for how we understand the agency of future machines.

2. Conditions of Agency

Agency comes in degrees: we might expect an average five-year-old human child to take responsibility for some things, an average 15-year-old for still further things, and an average 25-year-old for still further things again. We should expect variations in the capacities of machines as well. The focus of this work is the kinds of mentalizing tasks that average five-year-olds excel at, and the contribution they make to understanding agency. A framework will be provided for understanding the conditions of agency. Distinctions will be made between the generative conditions of agency (what it takes to bring agency into existence), the maintenance conditions of agency (what is required to keep agency in existence), and the regenerative conditions of agency (what is required to repair or restore agency if it is impaired). It will be argued that sustaining various mentalizing abilities is among the maintenance conditions of agency.

2.1. AN EXAMPLE

Let us consider the capacity to attribute false beliefs, something most five-year-olds possess. Some children are allowed to view a Smarties box that contains candy (Nichols and Stich, 2003, p. 90). One of the children is asked to leave the room, and the remaining children witness the candy being replaced with pencils. The absent child is brought back into the room. When asked what the temporarily absent child believes is in the box, most three-year-olds say "pencils." This is a third-person failure to attribute a false belief. Tasks such as these can be failed in the first person as well: young children often fail to attribute false beliefs to themselves.

There is an important connection between agency and the ability to attribute false beliefs. The ability to take responsibility involves, among other things, the ability to grasp that I have or had a false or incorrect view. Without the ability to attribute error to oneself, it is difficult to see how one could, in any well-developed sense, take responsibility for it. Moreover, holding another responsible may well involve, among other things, attributing a false belief to that other individual. Agent A1 may challenge A2 to revise his, her, or its view on some matter on the grounds that the view is false. A1 needs to be able to attribute a false belief to A2 for this to happen.
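The Smarties scenario can be rendered as a tiny simulation that makes plain what passing and failing the task amount to computationally. This is a toy sketch of ours, not a model from Nichols and Stich: the "three-year-old" strategy answers from the current state of the world, while the "five-year-old" strategy answers from the absent child's last observation of it.

```python
# World history: what is actually in the box at each step.
history = ["candy", "candy", "pencils"]   # candy replaced at step 2

# The absent child leaves the room after step 0 and returns at step 2,
# so her last observation of the box dates from step 0.
last_observed_step = 0

def answer_like_three_year_old(history):
    # Reports current reality; cannot decouple it from the other's belief.
    return history[-1]

def answer_like_five_year_old(history, last_observed_step):
    # Attributes a belief frozen at the other agent's last observation.
    return history[last_observed_step]

print(answer_like_three_year_old(history))                     # pencils
print(answer_like_five_year_old(history, last_observed_step))  # candy
```

First-person failure has the same shape: answering "what did I believe before I looked?" from the current state rather than from one's own earlier observation.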
It will be shown that other mindreading tasks (over and above false belief attribution) play a role in both the first and third person dimensions of agency.

References
Nichols, S. & Stich, S.P. (2003). Mindreading: An Integrated Account of Pretence, Self-Awareness, and Understanding Other Minds. Oxford: Oxford University Press.
Goldman, A.I. (2006). Simulating Minds: The Philosophy, Psychology, and Neuroscience of Mindreading. Oxford: Oxford University Press.

TOWARD A TESTBED FOR MODELING THE KNOWLEDGE, GOALS AND MENTAL STATES OF OTHERS
SERGEI NIRENBURG
University of Maryland Baltimore County
Baltimore, MD, 21250 USA

Abstract. The paper introduces a computational environment that facilitates development and experimentation with intelligent agents in the OntoAgent cognitive architecture. The agents pursue goal- and plan-oriented reasoning, are capable of communicating in natural language, and build mental models of other agents.

Decision-making is a core capability of intelligent agents – both human and artificial ones. Making optimal decisions with limited resources is a very difficult task both for people and for machines. Helping people to make decisions is an important scientific, societal and technological goal. Classical decision theory presupposes an idealized decision-making agent that possesses all the knowledge necessary (or desired) for making a decision, operates with optimum decision procedures, and is fully rational in terms of rational choice theory. Within this theory, the rationality of an individual decision is estimated in terms of what von Neumann and Morgenstern (1944) called expected utility: the cost-effectiveness of the means to achieve a specific goal. In other words, rational behavior for an individual maximizes the benefits and minimizes the costs of a choice. However, in real life few people make decisions under conditions of complete knowledge, maximum efficiency and rationality. Thus, Simon (1955) introduced the concept of bounded rationality, which removes the constraint of having complete knowledge and the best algorithm by switching from seeking an optimal decision to accepting a satisficing decision (roughly, making do with the first decision for which utility exceeds costs, even though there may be any number of better decisions available). A number of proposals concentrated on the selection of parameters (features) on the basis of which choices are made. Thus, the prospect theory of Tversky and Kahneman (1974) and its descendants, such as cumulative prospect theory, augment the inventory of decision parameters for a decision (utility) function by stressing psychological influences on decision-making, such as risk aversion and “reference” utility, meaning utility relative to perceived utility for others. In order to incorporate the latter, an intelligent agent A0 must be able to model the mental states of other agents, A1, …, An. At the intuitive level, we understand mental states as including, at a minimum, ontological knowledge of concept types as well as knowledge of concept instances, the agent’s goals, preferences, personality traits, etc. The concept of ‘belief,’ often used in conjunction with modeling agents, we interpret as (possibly error-ridden) knowledge that agent A0 has about the other agents it knows. (We are aware that the knowledge A0 has about itself may also be less than accurate.)
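Since the notion of one agent holding possibly inaccurate, nested models of other agents recurs throughout these abstracts, a minimal sketch may help fix the idea. The following is an illustrative toy structure only – OntoAgent's actual knowledge substrate is an ontological metalanguage, not a Python class – and all names in it are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class AgentModel:
    """Toy stand-in for an agent model: the beliefs A0 ascribes to an agent,
    plus that agent's own (nested) models of further agents."""
    name: str
    beliefs: list = field(default_factory=list)
    models: dict = field(default_factory=dict)

# A0's rendering of Tom's report "Harry said that he likes pudding":
a0 = AgentModel("A0")
a0.models["Harry"] = AgentModel("Harry", beliefs=[("likes", "Harry", "pudding")])

# The "seat" of the stored proposition is the model of Harry, so querying it
# amounts to asking what A0 takes Harry to believe (possibly erroneously):
print(("likes", "Harry", "pudding") in a0.models["Harry"].beliefs)  # True
```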
In our work on modeling intelligent agents we stress the importance of extending the inventory of an agent’s decision-making parameters (but only if effective procedures for determining their values can be developed). Thus, it is correct to state that understanding a speaker’s goals is important in making a decision about how to react to a speech act. But in practice more specific knowledge is needed – for example, when a doctor asks a patient about the latter’s family, the patient must judge whether the speaker’s goal is professional (having the patient’s condition diagnosed) or social (making small talk) or – and this is an even more complex reckoning – whether it is a social goal put in service of the professional one (aiming at establishing a rapport with the patient so as to develop trust and ensure cooperation – better-quality responses to questions and requests).

In this talk I will describe a computational environment that facilitates development and experimentation with agents that strive to make use of mental models of others as a prerequisite for making appropriate decisions with respect to the agent’s own behavior. This capability is one of several core requirements of our cognitive architecture, OntoAgent. In addition to modeling ontological knowledge about the outside world and knowledge about remembered instances of ontological concepts (including other agents, viewed as instances of the ontological concept HUMAN), OntoSem agents:
• are designed to operate in a hybrid network of human and artificial agents;
• emulate human information processing capabilities by modeling conscious perception and action;
• communicate with people using natural language;
• can incorporate a physiological model, making them what we call “double agents” with simulated bodies as well as simulated minds;
• can be endowed with personality traits, preferences and psychological states that influence their perceived or subconscious decision-making preferences;
• rely on knowledge resources and processors that are broad-coverage rather than geared at a particular application, which simplifies porting agents to new domains and applications;
• stress the importance of memory of event, state and object instances to complement their ontological knowledge of event, state and object types.

What makes modeling such multi-faceted agents feasible is that all aspects of agent functioning are supported by the same knowledge substrate, encoded in a single metalanguage. The OntoAgent testbed has been implemented in the medical domain and supports two agent environments:
• Maryland Virtual Patient (MVP, McShane et al. 2009), modeling a patient, a trainee MD and a tutor in the process of learning medical diagnostics and treatment; and
• CLinician’s ADvisor (CLAD, Nirenburg et al. 2011), modeling a patient, an MD and a clinician’s advisor, and intended to assist practicing clinicians by reducing their cognitive load.

The talk will include a demonstration of the above environments and a discussion of the ways of modeling the mental states of other agents.

References
McShane, M., S. Nirenburg, B. Jarrell, S. Beale, G. Fantry (2009). Maryland Virtual Patient: A Knowledge-Based, Language-Enabled Simulation and Training System. Proceedings of International Conference on Virtual Patients, Krakow, Poland, June 5-6.
Neumann, J. von and O. Morgenstern (1944). Theory of Games and Economic Behavior. Princeton: Princeton University Press.
Nirenburg, Sergei, Marjorie McShane, Stephen Beale, Bruce Jarrell and George Fantry (2011). Intelligent agents in support of clinical medicine. Proceedings of MMVR18, Newport Beach, CA, February 9–12.
Simon, H.A. (1955). A behavioral model of rational choice. Quarterly Journal of Economics, 69: 99–118.
Tversky, Amos, & Kahneman, Daniel (1974). Judgment under uncertainty: Heuristics and biases. Science, 185: 1124-1131.

ARCHITECTURAL STEPS TOWARDS SELF-AWARE ROBOTS
MATTHIAS SCHEUTZ
Tufts University
161 College Ave., Medford MA 02155

Abstract. Philosophical debates about qualia, perspectivalness, “what it is like” experiences and related topics are vastly disconnected from the “architecture talk” in AI and cognitive science which is required for understanding minds and designing artificial agents. While philosophy can thus not help AI in designing conscious agents, I argue that AI and robotics can not only help philosophy, but may even be required for solving some of the puzzling questions in the philosophy of consciousness. Specifically, I will claim that there is no such thing as a necessarily private experience (neither phenomenal, nor introspective, nor any other), using as an example robotic architectures whose instances “know” what it is like to be another robotic architecture instance.

We start with two basic, hopefully non-controversial notions, those of awareness and self-awareness, define them for agent architectures, and then show how we can say that a robot is aware or self-aware in a given context. Following Chalmers’ (1996) notion of awareness and Block’s (1995) notion of access consciousness, call a state S of an agent architecture A an “awareness state” if S contains information about something (an entity, state, event, etc.) that the agent (instantiating A) can use to make decisions, guide its behavior and/or give verbal reports. Specifically, an agent is “aware of X” if it is in an awareness state that in some way represents or encodes X. An agent is “self-aware” if it is aware of itself, i.e., if it is in an awareness state that represents or encodes (parts of) the agent itself. S will typically be a complex state that consists of substates reflecting the states of various functional components in the architecture A. For example, if S is the state of “being aware of a red box”, then this state will roughly require perceptual states representing the box and some of its properties, including its redness, in addition to states that use some of these representations in order to form other representations and/or behaviors. To make all of this more precise, I will briefly introduce some relevant parts of our robotic DIARC architecture that we have been developing over the last decade or so in my lab (Scheutz et al. 2007). What is nice about robotic architectures (or any form of agent architecture, including cognitive architectures for that matter) is that one can look inside. I.e., one can take a look at the blueprint and follow the information flow along connections between functional components. One can trace processing routes and look at component states. And one can make statements about possible and impossible processes in a system that instantiates the architecture.
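Before turning to DIARC's components, Scheutz's two definitions can be rendered as a small executable sketch. This is our own illustration of the definitions, not DIARC code; the class and the tuple encoding of substates are assumptions made purely for illustration:

```python
class Agent:
    """An agent is 'aware of X' iff its current complex state contains a
    substate that represents or encodes X and is usable for decision-making,
    behavior guidance, or verbal report; it is 'self-aware' iff some such
    substate encodes (part of) the agent itself."""
    def __init__(self):
        self.state = set()  # substates of the complex state S

    def aware_of(self, x):
        return x in self.state

    def self_aware(self):
        return any(s[0] == "self" for s in self.state)

robot = Agent()
robot.state.add(("object", "red box", "red"))                # perceptual substate
robot.state.add(("self", "asked", "Do you see a red box?"))  # substate about itself
print(robot.aware_of(("object", "red box", "red")))  # True -> aware of the red box
print(robot.self_aware())                            # True -> a self-awareness state
```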
DIARC consists of various functional modules: on the perception side, there are modules for vision processing, sound processing (including sound localization and speech recognition), laser distance data processing, and the processing of various internal proprioceptive sensors. For most sensory modalities, there are also short- and long-term memories, e.g., a long-term memory for visual objects and a short-term memory for storing the recognized objects the agent currently sees. On the action side, there are modules for moving the robot body through the environment, for making arm and head movements, and for making facial expressions, among others. Internal modules consist of various short- and long-term memories together with processes that operate on those memories, including skill memories, factual and episodic memories, a lexicon with syntactic and semantic annotations in addition to word forms, and a task memory. Moreover, there are components for managing the agent's goals, for scheduling actions in parallel, for processing spoken natural language, for task planning, and for reasoning (for more details, see Scheutz et al. 2007).

Now consider a robot running DIARC that is asked whether it sees a red box, and assume that the robot has a goal to answer questions. Upon hearing the spoken utterance, the speech recognizer generates word tokens from it, which are then syntactically and semantically analyzed, resulting in an internal logical representation of the meaning. The robot recognizes that the utterance was a question that requires it to perform an internal lookup action in its visual short-term memory (VSTM), namely to check whether VSTM contains an object representation of a red box. Note that the robot only needs to perform a lookup action in its VSTM, because VSTM is automatically updated based on what the object recognition algorithm detects in the image coming from the camera at a rate of 30 Hz. In particular, various vision processing algorithms are performed on each image frame, attempting to segment colored regions, detect object boundaries, recognize objects and determine their properties. These processes result in the generation of representations of the recognized objects in VSTM, which are matched against existing representations so that object identities can be tracked over short periods of time. If the agent has an object representation of a red box in VSTM, then the representation is retrieved and bound to the expression “red box”. The binding confirms the resolution of the reference and triggers a variety of additional bindings (including the binding of various discourse variables such as “last mentioned object” and “last mentioned noun” in linguistic short-term memory). It also triggers the generation of an answer to confirm that the robot is seeing a red box, which the robot then pronounces. In addition, the generated answer gets stored in linguistic short-term memory and, depending on other factors, the whole event “you asked whether I saw a red box, and I did see one” might get stored in episodic memory (indexed by time, object type, interaction type, and others).
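Computationally, the question-answering loop just described reduces to a lookup in the continuously refreshed VSTM, followed by binding and answer generation. A hedged sketch under that reading follows; every function name here is illustrative, since DIARC's real modules are not exposed through such an API:

```python
def parse(utterance):
    # stand-in for speech recognition plus syntactic/semantic analysis
    return {"type": "box", "color": "red"}

def answer_seen_question(vstm, utterance):
    query = parse(utterance)
    matches = [o for o in vstm
               if o["type"] == query["type"] and query["color"] in o["properties"]]
    if matches:
        # binding the retrieved representation to the phrase "red box" would
        # also update discourse variables in linguistic short-term memory
        return "Yes, I see a red box."
    return "No, I do not."

# VSTM is refreshed by vision at ~30 Hz, so answering is a mere lookup:
vstm = [{"type": "box", "properties": {"red"}}]
print(answer_seen_question(vstm, "Do you see a red box?"))
```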
From the above description, it is clear that the robot went through several awareness states, including self-awareness states, as part of answering the question: the robot is aware of the question when it is in a state where it checks for the object asked for in the question; if there is such an object, the robot becomes aware of the object as well as of the object's properties (in particular, its color); and the robot is aware of the answer it gave. Moreover, the robot is aware of itself having been asked the question and of having given the answer, which is a self-awareness state. I will then use the above architecture to demonstrate, during my presentation, what it is like for the robot to have a color experience, and use this result to address some questions about phenomenal and private experience in philosophy. In particular, I will argue that robots can know what it is like to have another robot's experience.

References
Block, N. (1995). On a confusion about a function of consciousness. Behavioral and Brain Sciences, 18(2), 227-247.
Chalmers, D. J. (1996). The conscious mind: In search of a fundamental theory. New York, NY, USA: Oxford University Press.
Scheutz, M., Schermerhorn, P., Kramer, J. and Anderson, D. (2007). “First Steps toward Natural Human-Like HRI”. Autonomous Robots, 22(4), 411-423.

LOGIC-BASED SIMULATIONS OF MIRROR TESTING FOR SELF-CONSCIOUSNESS
NAVEEN SUNDAR AND SELMER BRINGSJORD

Abstract. We present a formal logic-based analysis of the mirror test for self-consciousness. Based on this formalization, a computational simulation of a mirror-failing dog, a mirror-passing chimp, and a mirror-passing human will be presented. The simulation will consist in the automatic machine-found disproof in the case of the canine, and proofs in the other two cases. These simulations will be based on an axiomatization of the perceptual and doxastic details assumed to be in/operative in these three cases by those embracing the view that chimps and humans are self-conscious, while dogs aren’t.

1. The Mirror Test
In accordance with a now-familiar recipe R in the annals of the study of “self-consciousness,” anesthetize[23] a creature c; while it’s under, paint, say, a red (odorless, hypo-allergenic) splotch upon its forehead, thus making it true that c has property R (= Rc); when awake, place c in front of a mirror (Mc); observe the creature’s behavior b to see if it, for example, includes the attempt to remove the splotch (Rcb or ¬Rcb); if it does/doesn’t, issue a pronouncement about such questions as whether or not it’s self-conscious (or self-aware, etc.; i.e., as to whether or not Sc). Descriptions of followings of R are innumerable in the literature.[24] But what is the logic of this recipe? Despite decades of writing about the value of the recipe, we can find no rigorous account of it, nor of followings of it in connection with certain classes of creatures. Therefore, we can’t find rigorous computational simulations of such followings, and we certainly can’t find proofs that given creatures are known to either have or lack self-consciousness, depending upon whether or not they pass the mirror test. Work underway by us is designed to provide these missing things, and we propose to report on this work at IACAP 2011 and show demonstrations.

[23] Or perhaps do it while the creature is sleeping soundly.
[24] For a compendium of such followings, accompanied by the colorful proposal that self-awareness can be neuro-localized in the right hemisphere, see Keenan, J., Gallup, G. and Falk, D. The Face in the Mirror (Ecco: New York, NY).

2. Toward a Formal Analysis of the Mirror Test
Let’s assume a standard extensional multi-sorted logic in which creatures are partitioned in customary ways. (Please note that the empirical, informal literature, as a matter of brute fact, makes not even a nod in the intensional direction, and is naturally formalized via extensional frameworks.) Specifically, the class of dogs will be denoted by ‘D,’ chimps by ‘C,’ and humans by ‘H.’ Then, the following three propositions have apparently been affirmed in the literature.

1. ∀c ∈ D [(Rc ∧ Mc ∧ Rcb) → Sc]
• This is taken to be true, in a nutshell, because if dogs had behaved as chimps usually do, canines would presumably have been admitted into the “self-aware” club.
2. ∀c ∈ C [(Rc ∧ Mc ∧ ¬Rcb) → ¬Sc]
• This is taken to be true, in short, because if chimps had behaved as dogs do, chimps would presumably have been kept out of the “self-aware” club.
3. ∀c ∈ H [(Rc ∧ Mc ∧ ¬Rcb) → ¬Sc]
• This is taken to be true, in a nutshell, because humans provide the “anchor point” on the issue at hand.

Unfortunately, none of these propositions is true. A dog pre-trained to paw its forehead when seeing a dog provides a counter-example to 1., since no participant in the debate herein considered accepts that such training ensures self-consciousness.[25] A chimp pre-trained to leave splotches intact constitutes a counter-example to 2., since no participant accepts that such training guarantees the absence of self-consciousness. And a human inclined to ignore splotches overthrows proposition 3.

Of course, these problems are just the tip of the iceberg. The trio is of course incomplete, since from it one cannot, for instance, deduce that dogs aren’t self-conscious, whereas chimps and humans are. One might think that this is addressed by adding more formulae,[26] but since the conditional used here is the material conditional, this trio can’t possibly be heading in the right direction, as is easily seen. Assume that a variant of 2., call it 2′., is to enable the deduction that some real-life chimp, Charlie, c′, is in fact self-conscious. How could this deduction go through? It could only work if the relevant antecedents in 2′. were satisfied. For example, the following holds.

{2′.} ∪ {Rc′ ∧ Mc′ ∧ Rc′b} ⊢ Sc′

But for Charlie, and nearly every single chimp who ever lived or will ever live, there will never be a red splotch and a mirror in his life. And yet clearly those in favor of ascribing self-consciousness to chimps will want to make the ascription to Charlie and his friends. More specifically, those in favor of the ascription presumably hold that were it the case that Charlie was given the mirror test, he would pass. This indicates that some intensional logic is required; specifically, a conditional logic able to handle subjunctive conditionals is needed.

[25] Of course, someone might deny that such behaviour expresses an intention to remove a splotch, but that would be entirely ad hoc. Trainers after all routinely train dogs to form goals and seek their satisfaction when they observe the relevant triggers. Relevant here is the Keenan-et-al.-recounted story of behaviourists who claimed that pigeons were to be classified with chimps in the running of R.
It turned out that the pigeons had been pre-trained in ways that contaminated the experimentation in question.
[26] E.g., ∀c ∈ D [(Rc ∧ Mc ∧ ¬Rcb) → ¬Sc].

Note that the fact that 2′ might never be satisfied for a particular chimp is not the fault of our chosen formulation, since that formulation is a direct symbolization of what is said in the literature (which has of course been written for the most part by informalists). One way to understand what ought to be claimed in the informal literature is that a subjunctive conditional be employed: for example, if in all nearby “possible worlds” in which Rc and Mc are true, Rcb is true as well, then Sc is true in the actual world. But of course this sort of thing is the point, since no one has yet worked out the details in this direction, and to credit this direction to anyone in the empirical prior work is so charitable as to border on absurdity. And of course the devil is in the details: the formal calculi we use include an explicit rejection of a possible-worlds semantics for anything doxastic.

Our modeling of mirror testing has obvious connections to key distinctions recently made by Clowes and Seth (2008). In their terms, our research is without question “weak” in nature, since we don’t claim that our mirror-passing agents, however formal and fine-grained the underlying modeling may be, literally are conscious. In addition, while elsewhere (Bringsjord 2007) one of us has expressed skepticism about Aleksander’s axiomatic approach, discussed by C&S, our approach is certainly axiomatic. However, the calculi upon which this approach rests are more expressive than those used by Aleksander (allowing, e.g., for intensional operators), and are oriented toward proof theory and automated proof finding and checking. Finally, related prior work in simulating the mirror test can be found in Takeno’s work on mirror image discrimination. This work provides some evidence that at least the rather informal robotics side of the act of a simple agent’s recognizing its mirror image is feasible. We will of course contrast our work with that of Takeno et al.

References
Bringsjord, S. (2007). Offer: One Billion Dollars for a Conscious Robot. If You’re Honest, You Must Decline. Journal of Consciousness Studies, 14(7), 28–43.
Clowes, R.W. & Seth, A.K. (2008). Axioms, Properties and Criteria: Roles for Synthesis in the Science of Consciousness. Artificial Intelligence in Medicine, 44(2), 91-104.
Takeno, J., Inaba, K. & Suzuki, T. (2005). Experiments and Examination of Mirror Image Cognition Using a Small Robot. In Proceedings of CIRA 2005: IEEE International Symposium on Computational Intelligence in Robotics and Automation, Espoo, Finland.
Quantitative intercultural comparison by means of parallel pageranking of diverse national wikipedias

Daniel Hromada
Ecole Pratique des Hautes Etudes / CHART / Lutin Userlab

Abstract
The aim of our study was to show that the distributions of hyperlinks within wikipedia corpora implicitly contain information about the cultural preferences of their authors. We have transformed wikipedia corpora written in 27 different languages into graph structures whose vertices correspond to wikipedia articles and whose edges correspond to the hyperlinks between these articles. Afterwards we calculated PageRank vectors for every one of these graphs, thus obtaining a so-called “intracultural importance list” for every linguistic community under study. Two datamining experiments were performed with the obtained data. “The top country” study indicated that labels of articles concerning the countries related to the linguistic community that created them are to be found in the top parts of their respective intracultural lists, and, inversely, that the top parts of these lists can potentially be used as a stylometric method for identifying the community which created the corpus. “The world&corpus” study revealed that the majority of rankings of articles concerning the countries of reference within the intracultural list of a given community correlate significantly with the factual geographic distance between the country of reference and the supposed home country of the linguistic community. Both experiments indicated the presence of a morphism between the wikipedia hyperlink graph and the factual world of its authors.

Keywords: PageRank, Wikipedia, graph theory, comparative culturology, quantitative anthropology, cultural stylometry, world-corpus correlations

1. Introduction
The aim of this article is to propose a new quantitative method for the comparison of different cultures by reducing culture-specific corpora to a common metric.
We shall try to demonstrate the feasibility of such an approach by using PageRank as that metric, and the wikipedias of diverse (mostly European) linguistic communities as the corpora to be compared.

Both Wikipedia and PageRank have lately received a substantial amount of attention from different scientific fields. Considered by some to be «probably the most important single contribution to the fields of information retrieval and Web search of the last ten years» (Esuli and Sebastiani, 2007), the implementation of PageRank by Brin and Page (1998) was without a doubt a key component of Google's ascent to the very top of the most visited Internet sites. Wikipedia, on the other hand, is based upon the very simple idea of self-organized collaboration among a huge number of authors. The hypothesis that such a huge number will, in the long run, approximate scientific truth better than a limited number of experts (Surowiecki, 2004) is far from being ultimately proven. However, Wikipedia is nowadays considered a reliable source of information in many domains, and it is one of the most important freely available encyclopaedic corpora. Its multilingual properties are being more and more exploited in NLP research: for word sense disambiguation (Mihalcea, 2007), question answering (Ferrandez et al., 2007) and named entity recognition (Richman and Schone, 2008). Only a few studies, however, have focused fully upon the differences between diverse wiki corpora. And even when such approaches – “exploiting asymmetries” (Filatova, 2009) or “information arbitrage” (Adar et al., 2009) – were presented, their goal was to infer data from article-content-related discrepancies, and not to make comparisons between corpora considered as consistent wholes. The research presented in this paper aims to demonstrate that even such large-scale comparisons can yield valid information.

Our starting hypothesis can be stated like this: Wikipedia maybe does not approximate scientific truth, but it certainly approximates the culture of its authors. In more exact terms, supposing 1) that the very act of creation of an article or a link presupposes the existence of a biased preference within the author, and 2) that wikipedia is a graph structure whose vertices are equivalent to articles and whose edges are equivalent to the hypertext links between these articles, we propose that such a graph is at least partially, but significantly, isomorphic with the associative network of culturally determined meanings and values of its authors.

The proposal that culture – which can be conceived as a structure of symbols, artifacts, buildings, institutions, social roles etc., mutually interconnected in a very specific way – can be described by graph theory and later analyzed by network analysis is far from new (for an overview, see Park, 2005). Validating such a hypothesis, however, is not easy, since it is not easy to find a 1) unique graph-like structure (i.e. a structure with vertices and edges) that 2) represents the common activity of a huge number of culture-holders. And even when such a structure is found, the question whether it faithfully represents (is isomorphic with) a given culture is difficult to answer.
But since it is nowadays widely accepted that a culture is in the first place distinct from other cultures, and that this distinction forms the very essence of the given culture (Bourdieu, 1979), then even when it is almost impossible to compare a cultural graph with the factual world itself, cultural graphs can always be compared with each other, and the results of this comparison can subsequently be more easily compared with the evident cultural distinctions of the factual world. We propose that the corpora of local wikipedias created by diverse linguistic communities can serve as a basis for the construction of such «cultural graphs», and that these graphs can subsequently be compared by means of the PageRank centrality measure.

2. “The top country” study
Since “corpus culturology” does not seem to be an explored scientific domain, the goal of this preliminary analysis was to decide whether it was worthwhile to continue with the implementation of more robust statistical techniques, or whether to consider as false the introductory hypothesis “the hyperlink distribution of a wikipedia graph contains implicit information about the cultural preferences of its authors”. In other words, our primary intention was to assess whether some culture-specific information can be observed by applying a PageRank algorithm to the wikipedia corpora of diverse linguistic communities.

2.1. Method
Database tables «pages» (containing the list of articles – vertices) and «pagelinks» (containing the list of hypertext links – edges) were downloaded from wikimedia's site. All vertices and edges not having namespaces 0 (article), 14 (category) or 100 (portal) were removed from the tables; subsequently, a page_from → page_to plaintext edge list was generated. After this edge list was transformed into a graph G, the PageRank vector – which is in fact the principal eigenvector of the graph's modified adjacency matrix – was calculated with the igraph library (Csárdi and Nepusz, 2006). A damping factor d=0.77 was chosen for the calculation. These transformations and calculations were repeated for 27 wikipedia corpora; the overall properties of their respective graphs are presented in Tab. 1.

Table 1: Basic graph properties of analysed corpora and their corresponding ISO 639-1 codes

Code  Language     Vertices (articles)  Edges (hyperlinks)
AR    Arabic            234538            4963998
BG    Bulgarian         143439            3578973
CS    Czech             266854            7187995
DA    Danish            205245            4402963
DE    German           1939647           43782766
EL    Greek              82168            1879300
ES    Spanish          1303273           23212253
ET    Estonian          126448            2580511
FI    Finnish           403380            7609470
FR    French           1996383           53003962
HE    Hebrew            245431            9103883
HR    Croatian          116515            3850220
HU    Hungarian         277518            9865769
LV    Latvian            67736            1342180
NO    Norwegian         405039            8938168
NL    Dutch             877590           24881686
PL    Polish            903670           29731309
PT    Portuguese       1088962           24867864
RO    Romanian          307084            5392290
RU    Russian          1232353           27442593
SK    Slovak            173417            4873409
SL    Slovenian         146250            5236834
SR    Serbian           239904            5013264
SV    Swedish           623035           11515290
TR    Turkish           304853            9557808
UK    Ukrainian         322799            9158661
ZH    Chinese           609262           15838584

For every corpus, all contained page titles were ordered according to their descending PageRank values. We call such a list an intracultural list, and we call langrank the placement of a given item in its respective intracultural list. Hence, 27 intracultural lists were obtained, within which pages with the highest converged probabilities have langrank 1, pages with the second highest probabilities have langrank 2, etc.
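As a concrete illustration of the pipeline just described, the whole transformation – edge list to graph to PageRank to intracultural list – fits into a few lines of python-igraph. This is a minimal sketch under the stated parameters, not the paper's original code, and the edge-list file name is hypothetical:

```python
import igraph as ig

# page_from -> page_to plaintext edge list, one pair per line (hypothetical file)
with open("pagelinks_edges.txt") as f:
    edges = [tuple(line.split()) for line in f if line.strip()]

g = ig.Graph.TupleList(edges, directed=True)
ranks = g.pagerank(damping=0.77)  # damping factor d = 0.77, as chosen above

# Intracultural list: titles sorted by descending PageRank; position = langrank.
intracultural = sorted(zip(g.vs["name"], ranks), key=lambda t: -t[1])
for langrank, (title, pr) in enumerate(intracultural[:10], start=1):
    print(langrank, title, round(pr, 6))
```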
To summarize, high langrank means low PageRank importance, and vice versa. To detect which names of countries are to be found at the very top of the intracultural lists (i.e. have the lowest langrank), the following procedure was applied: the term at langrank position 1 was extracted from the list and translated into English, using wikipedia itself as the translator. If it was not present in the ISO list of country names, the procedure continued with the term at langrank position 2, 3, etc.; if it was in the list, the procedure continued with country detection in the following intracultural list, thus repeating itself 27 times.

2.2. Results
27 intracultural PageRank vectors, one for each language community, were obtained and subsequently ordered in descending order according to the calculated PageRank (converged probability) values. For illustration, Tab. 2 offers the «top 10» of such lists for 2 Latin and 2 Slavic corpora.

Table 2: Top ten (i.e. langrank 1–10) items of 4 intracultural lists and their respective PageRanks

Portuguese: Wikipédia 0.065305; Proxy 0.006393; WP:TT 0.003323; Plantae 0.002419; Til 0.001981; Avaré 0.001496; População 0.001492; Invertebrados 0.001435; Área 0.001433; Brasil 0.001412

Spanish: España/Sección 0.491755; Rural 0.050179; Wikipedia 0.001105; Wikipedia_en_español 0.000887; 2001 0.000555; Mayo 0.000508; Wikimedia_Commons 0.000337; GFDL 0.000205; España 0.000197; Rural 0.000196

Czech: Wikipedie 0.00984; Wikimedia_Commons 0.00816; GNU_Free_Documentation_License 0.00303; CC-BY-SA 0.00141; CAPTCHA 0.00132; Česko 0.00109; IP_adresa 0.00097; Spojené_státy_americké 0.00082; Zeměpisné_souřadnice 0.00079; Praha 0.00069

Russian: Википедия:Справка 0.01519; Русская_Википедия 0.00564; Германия 0.00361; Общественное_достояние 0.00348; GNU_Free_Documentation_License 0.00295; Викисклад 0.00277; Creative_Commons 0.00276; Английский_язык 0.00121; Россия 0.00119; Фонд_свободного_програ 0.00112

It may easily be observed from the data that Wikipedia itself holds one of the top positions (this is the case within the other 23 corpora as well). This is a trivial discovery, since the wikipedia system is designed in such a way that it refers in the first place to articles which concern the functioning of the system itself. Slightly less trivial is the observation that articles concerning the names of countries or cities closely associated with the language of a given wikipedia corpus emerge at the top positions of their respective intracultural lists. Answers to the question «Which countries are the first to occur at the top of a given corpus' intracultural importance list?» are presented in Tab. 3.

Table 3: Country names found at the top of their intracultural lists (i.e. having lowest langrank L)

Corpus  Top country                  L
AR      (Egypt)                      17
BG      България (Bulgaria)          4
CS      Česko (Czech Republic)       6
DA      Danmark (Denmark)            34
DE      Deutschland (Germany)        16
EL      Ελλάδα (Greece)              7
ES      España (Spain)               9
ET      Eesti (Estonia)              5
FI      Suomi (Finland)              5
FR      France (France)              23
HE      (Israel)                     7
HR      Hrvatska (Croatia)           4
HU      Magyarország (Hungary)       18
LV      Latvija (Latvia)             6
NL      Frankrijk (France)           11
NO      Norge (Norway)               6
PL      Polska (Poland)              12
PT      Brasil (Brazil)              10
RO      România (Romania)            7
RU      Германия (Germany)           3
SK      Slovensko (Slovakia)         9
SL      Slovenija (Slovenia)         8
SR      Француска (France)           28
SV      USA                          35
TR      Türkiye (Turkey)             13
UK      Україна (Ukraine)            13
ZH      印度尼西亚 (Indonesia)        10
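The country-detection procedure behind Table 3 can be summarized by the following sketch; translate_to_english and the ISO name set are stand-ins for the wikipedia interlanguage lookup and the ISO country-name inventory mentioned above:

```python
def first_country(intracultural, translate_to_english, iso_country_names):
    """Walk down one intracultural list until a term whose English translation
    is an ISO country name is met; report it together with its langrank L."""
    for langrank, (title, _) in enumerate(intracultural, start=1):
        if translate_to_english(title) in iso_country_names:
            return title, langrank
    return None

# Tiny demonstration with the Czech top of Table 2 (translations abridged):
czech_top = [("Wikipedie", 0.00984), ("Wikimedia_Commons", 0.00816),
             ("GNU_Free_Documentation_License", 0.00303), ("CC-BY-SA", 0.00141),
             ("CAPTCHA", 0.00132), ("Česko", 0.00109)]
translations = {"Česko": "Czech Republic"}
print(first_country(czech_top, translations.get, {"Czech Republic"}))
# -> ('Česko', 6), matching L=6 for CS in Table 3
```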
In 22 cases, the extraction of one country name from the top of the intracultural list corresponding to the graph of a wikipedia written in language X yielded the name of a country in which that very language X is an official state language. The five exceptions are: Dutch, where Frankrijk (L=11) closely outran Nederland (L=14); Russian, where Германия (L=3!) outran Россия (L=9); Serbian, where Француска (L=28) far outran Србија (L=70); Swedish, where USA (L=35) closely outran Sverige (L=37); and finally Chinese, where Indonesia (L=10) is followed by Qatar (L=45), Micronesia (L=371), Brunei (L=409), Taiwan (L=484) and only much later by mainland China 中国 (L=579).

2.3. Discussion
The observation that a huge majority (22 out of 27) of corpora yield at the top positions of their respective langrank lists the names of countries whose official language is identical to the language of the corpus under study is the first indication that even a pure hyperlink analysis could reveal itself as a fruitful method for obtaining overall information about the preferences or interests of the authors of wikipedia corpora. In such a manner it could possibly serve as a means of «cultural stylometry» – a technique which could allow one to determine the appartenance of an anonymous author (or group of authors) to a given cultural or social unit. For instance, the data from Tab. 3 indicate that the «central country of interest» for the authors of the PT corpus is Brasil (L=10) and not Portugal, which emerges only later in the list (L=32) – later than França (L=12), Itália (L=14), Espanha (L=16) and even Estados Unidos (L=31). If the basic hypothesis of this article, i.e. that langrank values represent the amount of importance of a given term in a given corpus, is not falsified, it could be proposed that Brasil plays a much more important role for the authors of the PT corpus than Portugal, from which it could be inferred that the majority of them are possibly from Brazil and not from Portugal. Analogous stylometric conclusions can be drawn when looking at the AR corpus, where Egypt (L=17) is followed by Jordan (L=27), Spain (L=36), France (L=37) and Tunisia (L=47).

An interesting exception occurs for the countries whose official language is not identical to the language in which the wiki corpus was written: the fact that the Netherlands is closely outrun by France in the case of the Dutch corpus, and Sweden by the USA in the case of the Swedish corpus, can possibly be interpreted as a sign that global currents – related more closely to cultural superpowers – are, for the wikipedia authors of these two highly developed nations, of slightly more interest than local currents of a nationalist nature.

The results obtained for the Chinese intracultural list are intriguing. While the position of Indonesia at the very top could be naively explained by the activity of Chinese expats in Jakarta who pass their time writing wikipedia articles, the subsequent emergence of Qatar, Micronesia and Brunei seems completely counterintuitive. These phenomena can, however, be explained by a well-known caveat of PageRank algorithms related to the so-called link-sink phenomenon. A link-sink can emerge during the PageRank vector calculation when the analyzed graph contains a densely interconnected subgraph having only a few links to the rest of the graph.
One way to deal with link-sink perturbations is an optimization of the damping factor; these problems, in relation to our cultural comparative method, will be addressed in following articles. Since the top of the Serbian intracultural list indicates that this corpus is subject to link-sink perturbations (the first 45 positions are occupied solely by astronomic terms), we consider this to be an explanation for the observation that Serbia is far outrun by France. Since the Serbian corpus is not a big one, the result could as well be explained by the activity of a small group of authors biased more towards France-related phenomena than towards Serbia-related ones. The striking fact that Germany occupies the third position in the Russian intracultural importance list is left for the reader's interpretation.

3. “The world&corpus” study
While the huge majority of results obtained during analysis 1 seem to be consistent with intuitive expectations, their true scientific significance remains debatable. To address this issue, we conceived a second analysis in which we decided to correlate the precalculated intracultural lists with factual data. For this purpose we decided to use the real geographic (spatial) distances between the country of the linguistic community under study and another country (i.e. the country of reference). Such a choice was motivated by a simple hypothesis: wikipedia users from home country B will more likely write articles and create hyperlinks concerning countries of reference A and C, which are neighbours of B, than concerning countries of reference X or Y, which are spatially distant. If such a tendency exists, and if PageRank is a sufficiently efficient technique for quantifying such an “importance” of the A, C, X, Y countries of reference within the scope of a corpus created by authors supposedly from home country B, then significant correlations between the intracultural lists and the |home country, country of reference| spatial distance can be expected to occur.

3.1. Method
We defined 32 countries of reference: 27 of them were the countries which we also considered to be the home countries of our intracultural lists; 5 others were chosen at random, one from every continent (Italy, Japan, Senegal, Argentina, Australia). As the first dataset we used the 27 intracultural lists, one for each home country, calculated during analysis 1. From every such list, the langrank (i.e. the position sorted according to descending pagerank value) corresponding to the term denoting the country of reference was extracted. For example, as Tab. 4 illustrates, Hrvatska was at the 4th position in the Croatian corpus and at the 74th in the Slovenian corpus.
Table 4: Positions of the country of reference Croatia in the intracultural lists of diverse home countries, and their respective spatial distances

Home corpus  Langrank position  Name of country of reference  Spatial distance (km)
AR              532                                              3464
BG              345             Хърватия                          797
CS              281             Chorvatsko                        509
DA              848             Kroatien                         1265
DE              329             Kroatien                          808
EL              271             Κροατία                           870
ES              756             Croacia                          1695
FI              456             Kroatia                          2197
FR             1131             Croatie                          1056
HE             1493                                              2255
HR                4             Hrvatska                            0
HU              268             Horvátország                      403
LV              675             Horvātija                        1472
NL              409             Kroatië                          1083
NO              418             Kroatia                          1907
PL              422             Chorwacja                         828
PT              749             Croácia                          2028
RO              469             Croaţia                           746
RU              696             Хорватия                         5533
SK              271             Chorvátsko                        494
SL               74             Hrvaška                           118
SR              110             Хрватска                          455
SV              556             Kroatien                         1874
TR              413             Hırvatistan                      1747
UK              679             Хорватія                         1320
ZH             3981             克罗地亚                          7321

The Mathematica functions of the computational search engine «Wolfram Alpha» were used as the resource of home country ↔ country of reference spatial distance data. Pearson correlation coefficients were calculated between the two datasets. The whole procedure was repeated 32 times, once for every country of reference.

3.2. Results
The obtained results suggest significant correlations between the intracultural lists and the geographic data in the case of all countries of reference, with the exception of China, Russia and Slovakia. They are presented in Tab. 5.

Table 5: Overall p-values and Pearson correlation coefficients (df=25) for 32 countries of reference

Country of ref.   p            cor
Argentina         <0.003        0.549
Australia          0.165       -0.275
Bulgaria          <0.00026      0.648
Croatia           <2E-06        0.779
China              0.426        0.183
Czech R.          <7E-05        0.689
Denmark           <0.00044      0.629
Estonia           <1.5E-05      0.730
Finland           <1.74E-05     0.727
France             0.0015       0.577
Germany           <0.004        0.539
Greece             0.00019      0.657
Hungary            0.00015      0.664
Israel             0.0148       0.463
Italy             <0.005        0.525
Japan              0.711       -0.07
Latvia            <5.6E-05      0.696
Netherlands       <0.007        0.507
Norway            <0.0003       0.652
Poland            <0.0005       0.630
Portugal          <0.05         0.387
Romania           <6.8E-05      0.690
Russia             0.8987       0.025
S. Arabia         <0.0035       0.543
Senegal           <0.0007       0.617
Slovakia           0.1965       0.256
Slovenia          <6.63E-07     0.797
Serbia            <9.53E-05     0.680
Spain             <0.011        0.486
Sweden            <0.001        0.599
Turkey            <0.0004       0.635
Ukraine           <0.0005       0.629

3.3. Discussion
The obtained results show correlations between strongly empiric spatial measures and positions within the “intracultural” lists. Since different wikipedia corpora are direct consequences of the different creative preferences of human groups, these correlations have to be explained in terms of those preferences. We propose that these preferences are culturally determined.

The previous analysis, even if it leads us to an interesting conclusion, is nonetheless questionable, and a major caveat should be raised: Pearson's correlation coefficients are sensitive to outlier datapoints, and if these are present, an analysis cannot be considered a robust one (Rousseeuw and Leroy, 2003). As Fig. 1 illustrates, this was the case, for example, when Germany was chosen as the country of reference. Simple removal of the zh (Chinese) datapoint from the top right corner (i.e. high spatial distance, high langrank) caused a drastic change from (cor=0.539; p<0.004) to (cor=-0.108; p=0.599). Since the majority of the countries of reference in analysis 2 were European, it can be expected that this outlier boosts the significance of our hypotheses in an unwanted manner.
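The correlation step itself is elementary; the following hedged sketch reproduces it on a subset of Table 4's Croatia data using scipy (the paper does not state its statistical software for this step, so scipy here is our substitution):

```python
from scipy.stats import pearsonr

# Langrank of "Croatia" in several home corpora, and the corresponding
# home-country <-> Croatia distances in km (subset of Table 4):
langrank = [345, 281, 848, 329, 271, 4, 268, 74, 110]    # BG CS DA DE EL HR HU SL SR
distance = [797, 509, 1265, 808, 870, 0, 403, 118, 455]

r, p = pearsonr(langrank, distance)   # Pearson r and its two-sided p-value
print(round(r, 3), round(p, 5))

# Outlier sensitivity (cf. Fig. 1) can be probed by dropping one extreme point
# and recomputing r: the paper reports 0.539 -> -0.108 for Germany once the
# Chinese datapoint is removed.
```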
Another source of bias was identified as well. It is related to the fact that Wolfram Alpha uses the cartographic center of a country as the point from which it measures the distance to/from a given country. That is a useful feature in the case of countries whose population is distributed evenly. In the case of a country like Russia, however, the “central point” is postulated somewhere in central Siberia, some 4000 km east of Moscow. Whether such a point can have anything to do with the cultural preferences of wikipedia authors is open to argument.

Figure 1: Visualisation of langrank&distance correlations when the «China» outlier is included in (left) or excluded from (right) the list of countries of reference, as related to Germany

4. General Discussion
The aim of “the top country” study was to demonstrate whether a method of parallel pageranking of wikipedia graphs can yield relevant information concerning the basic overall specificities of the corpora, and therefore of their authors. A simple look at the tops of the calculated intracultural lists demonstrated that such is verily the case: in 22 out of 27 corpora, the topmost-ranked country-concerning article was about the country whose official language is the one in which the corpus was produced.

The second, “world&corpus” study focused on the relation between implicit properties of wikipedia corpora and the geographic distances of the factual world. While the significance of the obtained results suggests that there possibly exist some morphic relations between the overall hyperlink structure of (wikipedia) corpora and the factual world, the outlier problem indicates that the “world&corpus dilemma” will not be an easy dilemma to resolve.

What we denote here as the “world&corpus dilemma” is only very superficially related to the method which we presented in our second study. In fact, it is much more closely related to the ancient epistemological problem “What is knowledge and how is it represented?” than to some trivial linear regression of two sets of datapoints which tend to have something in common. In its weaker form, the question goes like this: “What is the relation between the corpus and the world, given that the corpus is sufficiently big?”. The goal of our article was to indicate that graph theory could possibly bestow a temporary answer to this question: “If a graph of the corpus is isomorphic with the graph of the world the corpus tends to describe, then it can be said that such a corpus contains knowledge about that world”. We say “a” graph, because there are infinitely many ways to construct a graph from a given corpus. For the purposes of this article, we have chosen the most simple way: inspired by the “random surfer model”, we have completely ignored the information IN the Net (e.g. word co-occurrences in the content) and focalized on the information ON the Net. An edge was created whenever a hyperlink existed between two vertices. We supposed this assumption should suffice as a point de départ: the very act of creation of an article, or of a hyperlink, can be an interesting clue to the preferences of the one who creates it. A weak clue, of course, but nonetheless one containing more information than pure accident.
Since it is well known that a well-aggregated linear combination of weak classifiers can result in a highly effective strong classifier (Freund and Schapire, 1996), it can likewise be proposed that a huge number of well-aggregated weak cultural clues can yield some strong ones.

References
Adar E., Skinner M. and Weld D.S. (2009). Information arbitrage across multi-lingual Wikipedia. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, ACM, pp. 94-103.
Bourdieu P. (1979). La distinction: critique sociale du jugement. Paris: Ed. de Minuit.
Brin S. and Page L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7): 107-117.
Csárdi G. and Nepusz T. (2006). The igraph software package for complex network research. InterJournal Complex Systems, 1695.
Esuli A. and Sebastiani F. (2007). PageRanking WordNet synsets: An application to opinion mining. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pp. 424-431.
Ferrandez S., Muñoz R. and Palomar M. (2007). Applying Wikipedia's multilingual knowledge to cross-lingual question answering. Lecture Notes in Computer Science, 4592: 352-363.
Filatova E. (2009). Directions for exploiting asymmetries in multilingual Wikipedia. In Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies, Association for Computational Linguistics, pp. 30-37.
Freund Y. and Schapire R.E. (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148-156.
Mihalcea R. (2007). Using wikipedia for automatic word sense disambiguation. In Proceedings of NAACL HLT 2007.
Park H. (2005). Network Cultural Analysis: Texts, Graphs, and Tools. Paper presented at the annual meeting of the American Sociological Association, Philadelphia, PA.
Richman A.E. and Schone P. (2008). Mining wiki resources for multilingual named entity recognition. In Proceedings of ACL-08: HLT, pp. 1-9.
Rousseeuw P.J. and Leroy A.M. (2003). Robust Regression and Outlier Detection. Hoboken, New Jersey: J. Wiley & Sons.
Surowiecki J. (2004). The Wisdom of Crowds: Why the Many Are Smarter than the Few and How Collective Wisdom Shapes Business, Economies, Societies, and Nations. New York: Doubleday Books.

Semi-supervised haartraining of a fast&frugal open source zygomatic smile detector
A gift to the OpenCV community

Daniel Devatman Hromada, prof. Charles Tijus
Lutin Userlab, Ecole Pratique des Hautes Etudes
Cognition Humaine et Artificielle (ChART), Université Paris 8

Abstract – Five different versions of OpenCV-compatible XML haarcascades of zygomatic smile detectors, as well as the five SMILE samples from which these detectors were derived, have been trained and are hereby presented as a new open source package. The samples were extended in an incremental-learning fashion, exploiting a previously trained detector in order to add and label new elements of the positive example set. After coupling with an already-known face detector, overall AUC performance ranges between 77% and 90.5% when tested on the JAFFE dataset, and a speed of <1 ms per frame is achieved when tested on webcam videos.
Keywords – zygomatic smile detector; cascade of haar feature classifiers; computer vision; semi-supervised machine learning

I. INTRODUCTION
A great amount of work is being done in the domain of facial expression (FE) recognition. Of particular interest is the FE at the very base of mother-baby interaction [1], an FE interpreted unequivocally in all human cultures [2] – the smile. Maybe because of these reasons, maybe because of some others, smile detection is already of certain interest to the computer vision (CV) community – be it for a camera's smile shutter [3] or in order to study robot2children interaction [4]. Nonetheless, a publicly available, i.e. open source, smile detector is missing. This is somewhat stunning, especially given the fact that a “smile” can be conceived as a “blocky” object [5] upon which a machine learning technique based on the training of cascades of boosted haar-feature classifiers [6] can be applied, and that the tools for performing such a training are already publicly available as part of the OpenCV [5] project. Verily, with the exception of the detectors described in [7][8], which have not been publicly released, we did not find any reference to a haarcascade-based smile detector in the literature. We aim to address this issue by making publicly available the initial results of our attempts to construct a sufficiently descriptive SMILing Multisource Incremental-Learning Extensible Sample (SMILEs) and five smile detectors (smileD) generated from this sample. From a more general perspective, our aim was to study whether one can use already generated classifiers in order to facilitate such a semi-supervised extension of an initial sample that a more accurate classifier can subsequently be trained.

A. SMILE sample (SMILEs)
The aim of the SMILEs project is to facilitate and accelerate the construction of smile detectors for anyone willing to build one. Since it is the OpenCV library which dominates the computer vision community, the SMILEs package is adapted to the needs of OpenCV in the sense that it contains 1) a negative examples directory, 2) a positive examples directory, 3) negatives.idx – a list of files in the negative examples directory, and 4) positives.idx – a list of files in the positives directory with associated information containing the coordinates of the region of interest (ROI), i.e. the coordinates of the region within which a smile can be located. SMILEs is considered “Multisource” because it originates as an amalgam of already existing datasets like LFW and Genki, both of which are themselves collections of images downloaded from the Internet. Images from the POFA [9] or Cohn-Kanade [10] datasets were not included in SMILEs, since the restricted access to these datasets is in contradiction with the open source approach¹ of the SMILEs project.

B. Smile Detector (smileD)
SMILEs are “Incremental-Learning Extensible” in the sense that they allow us to train new versions of smile detectors which are subsequently applied upon new image datasets in order to facilitate (or even fully automatize) the labeling of new images, hence extending the original SMILEs with new images. Simply stated, SMILEs allow us to train smileD, which helps us to extend SMILEs, etc. Since the training of haar cascades is an exhaustive threshold-finding process demanding a non-negligible amount of time and computational resources, 5 pregenerated OpenCV-compatible XML smileD haarcascades were trained by the opencv-haartraining application and are included with SMILEs in our open source SMILEsmileD package, so that anybody interested can deploy our smile detector in copy&use fashion.
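For readers who want the copy&use route, a minimal sketch of loading one of the shipped cascades with OpenCV's Python bindings follows. The exact XML file name inside the package is an assumption here, and the thresholds are illustrative:

```python
import cv2

# Load one of the shipped smileD cascades (file name is a guess at the package
# layout; adjust to the actual XML shipped with SMILEsmileD):
smiled = cv2.CascadeClassifier("smileD_haarcascade_v0.5.xml")

img = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)
# minNeighbors plays the role of the "smile intensity" cutoff used in the
# evaluation section below: higher values mean fewer, more confident hits.
hits = smiled.detectMultiScale(img, scaleFactor=1.1, minNeighbors=10)
for (x, y, w, h) in hits:
    print("smile candidate:", (x, y, w, h))
```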
¹ Both SMILEs & smileD cascades are publicly available from http://github.com/hromi/SMILEsmileD as a GPL-licensed package. C++ source code of a select&crop application for easy manual sample creation, and of a face-coupled video-stream smile detector, is included as well.

II. METHOD

C. Initial Training Datasets

The SMILEs project in its current state unites three image sets:
- Labeled Faces in the Wild (LFW) dataset – the LFW dataset [11] contains more than 13000 images of faces collected from the web; its cropped version contains only the 25x25-pixel regions detected by OpenCV's frontal face detector. No information about the presence/absence of a smile within the image is given.
- Genki4K dataset – Genki4K is a publicly available part of UCSD's Genki project [12], containing 4000 images downloaded from the Internet. A text file indicating the presence/absence of a smile in a given image is included.
- Ad hoc Flickr dataset – we used the search keyword "smile" to download more than 4200 additional pictures from the image-sharing website flickr.com. More than 2600 of them contained at least one smiling face.

D. Construction of SMILEs datasets

We have created five different versions of SMILEs. All of these versions exploit the same negative sample set of LFW's non-smiling images, and all manual labeling focused solely on the zygomatic smile (ZS) region²:
- Version 0.1 is based solely upon the LFW dataset. All pictures were manually labeled with our ad hoc region selection & cropping application and divided into samples of positive (3606 images) and negative (9474 images) examples.
- Version 0.2 added 2666 manually labeled images downloaded from flickr.com to the positive examples already contained in 0.1. Labeling & region selection were done with the same application as in the case of 0.1.
- Version 0.3 also extended the positive & negative example samples of version 0.1 with images from flickr. This time, however, the flickr-originated images were not labeled manually; the smile-containing regions of interest were determined automatically, by applying smileD version 0.1 to the set of downloaded images. 1372 ROIs (one ROI per image) were identified & labeled in this way.
- Version 0.4 is analogous to version 0.3 in the sense that it is essentially a version 0.1 sample to which automatically labeled positive examples were added. Differently from version 0.3, Genki4K and not flickr was exploited as the source of additional data. Simply stated, 624 positive examples from Genki4K labeled as smile-containing by its authors were added to the initial LFW-based sample.
- Version 0.5 unites versions 0.3 and 0.4, i.e. both the Genki4K- and the flickr-originated images which were automatically labeled by smileD v0.1 were added to the LFW samples.

E. SMILEs -> smileD Training

Identical haar-cascade training parameters [width=43, height=19, number of stages=16, stage hit rate=0.995, stage false alarm rate=0.5, weak classifier decision tree depth=1 (i.e. stump), weight trimming rate=0.95] were applied for the training of all five smileD versions, one smileD corresponding to one SMILEs, both referenced by the same version number.

F. smileD evaluation

The training phase of every new version of smileD was followed by measuring its performance on the Japanese Female Facial Expression (JAFFE) dataset, in order to evaluate the different smileD classifiers when applied to a sample whose luminosity conditions differ from those of every imageset included in the training sample. Detectors were face-detector-coupled during testing, i.e. smile detection was performed iff a face was detected in a tested image, and only in the ROI defined by well-known geometric ratios [13]. Receiver operating characteristic (ROC) curves were plotted and AUCs ("area under ROC curve") were calculated as performance measures by means of the ROCR library [14]. "Smile intensity" [7], i.e. the number of overlapping neighboring hit regions³, was used as the cutoff parameter.

² The ZS region was defined only loosely, as a rectangular ROI in whose center are smiling lips – preferably with uncovered teeth – bordered by the smile & nasolabial wrinkles.
³ Can be obtained from the undocumented neighbors attribute of the cvAvgComp sequence returned by cvHaarDetectObjects.

III. RESULTS

Figure I: smileD ROC curves.

Table I. Basic components of the initial versions of the SMILEs & smileD project

Version | LFW manual | Flickr manual | Flickr auto | Genki auto | Total pos. | Neg. ex.
0.1 | 3606 | 0 | 0 | 0 | 3606 | 9474
0.2 | 3606 | 2666 | 0 | 0 | 6262 | 9474
0.3 | 3606 | 0 | 1372 | 0 | 4978 | 9474
0.4 | 3606 | 0 | 0 | 624 | 4230 | 9474
0.5 | 3606 | 0 | 1372 | 624 | 6572 | 9474

Table II. ROC "area under curve" performance of different versions of the smileD detector

Version | AUC
0.1 | 77.94%
0.2 | 85.49%
0.3 | 83.93%
0.4 | 90.21%
0.5 | 90.51%

IV. DISCUSSION

The detectors we present hereby exploit a top-down approach, i.e. they are face-coupled. Knowing that there can be no smile without a face within which it is nested, we first detect the face with an OpenCV face-detection solution, and smileD is then applied only within the very limited ROI of the face's bottom third. The consequences of our decision to create a face-coupled smile detector are twofold: 1) since by definition we search for a smile only within the face, we used only non-smiling faces as negative examples (i.e. background images); 2) smile detection itself is very fast once the position of the face is specified. When applied to webcam-originated (320x240 resolution) video streams, the time needed for in-face smile detection never exceeded 1 ms per frame on a Mobile Intel(R) Pentium(R) 4 CPU (1.8 GHz), suggesting that smileD could potentially be embedded even in mobile devices disposing of fewer computational resources.

smileD's speed can somewhat offset the accuracy handicap it has in comparison with the results reported in [8]. In its current state, our approach suffers from somewhat high false alarm rates, but our research indicates that in real-life conditions these can be greatly reduced by taking into account the dynamic sequence of subsequent frames, since the probability of the same false alarm occurring within all frames of a sequence is proportional to the product of the probabilities of its occurrence in every frame taken individually: for illustration, a false alarm appearing in a single frame with probability 0.1 would, under an independence assumption, persist through five consecutive frames with probability of only about 0.1^5 = 10^-5. High speed is therefore of utmost importance, and the analysis of sequences of frames can substantially reduce the number of false positives.

Tuning of the training parameters and extension of the negative example set remain further possibilities for augmenting the accuracy of our project. Table II indicates that the accuracy of semi-supervised classifiers like smileD saturates at a certain limit, which can possibly be surmounted only by extension of the negative sample set. In the case of smile detection, we suggest that extending the negative example sample with more images containing the "upper lip raiser" action unit (AU 10) – teeth-uncovering⁴ but associated with disgust rather than smile – could yield significant increases in accuracy, as reported by [9].
Since such an extension is relatively easy and not very time-consuming – given that such AU10-containing images are available and marked as negative examples – it may be the subject of future research. In this study, however, we left the negative example set unchanged in order to study the effectiveness of the "Incremental Learning" approach, during which an old detector is used to facilitate the extension of the positive example sample, thanks to which a new detector is obtained. Since the semi-supervised smileD versions v0.4 and v0.5 outperformed v0.2, for which manual labeling was used, while the latter performed only slightly better than v0.3, which exploited the same flickr-originated image base as v0.2, it is not unreasonable to think that such a semi-supervised incremental training approach can be a feasible solution for training haar-cascade detectors. If that were the case, it could possibly be stated that the machine has started, in a certain sense, to ground [15] its own notion of smile.

⁴ From the anatomical point of view, the disgust-expressing AU10 is associated with the Levator Labii Superioris muscle, while the smile is associated with the Zygomaticus Major muscle (AU12).

ACKNOWLEDGMENT

We would like to thank the third section of EPHE, University Paris 8 and CROUS de Paris for their kind support.

REFERENCES

[1] L. Strathearn, J. Li, P. Fonagy, and P.R. Montague, "What's in a smile? Maternal brain responses to infant facial cues," Pediatrics, vol. 122, 2008, p. 40.
[2] C. Darwin, P. Ekman, and P. Prodger, The Expression of the Emotions in Man and Animals, Oxford University Press, USA, 2002.
[3] M. Akita, K. Marukawa, and S. Tanaka, "Imaging apparatus and display control method," 2010.
[4] J.R. Movellan, F. Tanaka, I.R. Fasel, C. Taylor, P. Ruvolo, and M. Eckhardt, "The RUBI project: a progress report," Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, 2007, p. 339.
[5] G. Bradski and A. Kaehler, Learning OpenCV, O'Reilly Media, Inc., 2008.
[6] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proc. IEEE CVPR, 2001.
[7] O. Deniz, M. Castrillon, J. Lorenzo, L. Anton, and G. Bueno, "Smile detection for user interfaces," Advances in Visual Computing, pp. 602–611.
[8] J. Whitehill, M. Bartlett, G. Littlewort, I. Fasel, and J. Movellan, "Developing a practical smile detector," submitted to PAMI, vol. 3, 2007, p. 5.
[9] P. Ekman and W.V. Friesen, Pictures of Facial Affect, Palo Alto, CA: Consulting Psychologists Press, 1976.
[10] T. Kanade, Y. Tian, and J.F. Cohn, "Comprehensive database for facial expression analysis," Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG), 2000, p. 46.
[11] G.B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: a database for studying face recognition in unconstrained environments," University of Massachusetts, Amherst, Technical Report 07-49, 2007.
[12] J. Whitehill, G. Littlewort, I. Fasel, M. Bartlett, and J. Movellan, "Toward practical smile detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, pp. 2106–2111.
[13] L. Da Vinci and J.P. Richter, The Notebooks of Leonardo da Vinci, Dover Publications, 1970.
[14] T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, "ROCR: visualizing classifier performance in R," Bioinformatics, 2005.
[15] S. Harnad, "The symbol grounding problem," Physica D, vol. 42, 1990, pp. 335–346.

Master's thesis, E.P.H.E., SVT, Natural and Artificial Cognition, 2010
smileD: Natural Smile and Artificial Smile
On the use of OpenCV for tracking, facial expression recognition and smile detection

containing:
Foreword – page 1
Abstract / Résumé – page 2
Imitation by point tracking – page 3
Introduction to machine learning in relation to facial expression recognition – page 13
Evaluation of the various versions of the smileD smile detector through the imitation game with the robotic face Roboto – page 21
Annex 1: How AdaBoost works – page 31
Annex 2: The construction of chamfer images – page 31
Annex 3: Semi-supervised haartraining of a fast&frugal open source zygomatic smile detector – page 32
«Epiprologue» – page 33

Daniel Devatman Hromada, under the direction of Charles Tijus

Respectable Professors and Schoolmasters, Ladies and Gentlemen,

The text you hold in your hands is the thesis presenting the work carried out by the student Daniel Devatman Hromada during his internship at the Laboratoire des Usages en Technologie d'Information Numérique (Lutin), during the third and fourth semesters of his studies of Natural and Artificial Cognition, in view of obtaining the Master 2 diploma in Sciences de la Vie et de la Terre of the École Pratique des Hautes Études.

The internship subject as initially defined concerned "The imitation of facial expressions by the robotic face Roboto". But the problem quickly turned out to be so complex that it was impossible to fit all the ideas and work into the frame of a classical thesis having the form of a scientific article. It therefore seemed more reasonable to exploit the know-how obtained during the first semester of studies at the EPHE thanks to the "Written communication" course, and to write each of the three main parts of the text according to a different plan:

The first part follows the OPERA writing plan (Observation, Problème, Expérimentation, Résultats, Action) in order 1) to introduce the reader to the robotic face Roboto – which can be considered the triggering factor of everything presented below – and 2) to present the OpenCV library.

The second part follows the ILPIA writing plan (Introduction, Littérature, Problème, Implication, Avenir) in order to present not only the experiments carried out at the end of the first semester (whose goal was to exploit contours so as to create a robust and fast facial expression recognition system), but above all to lead the reader into the fascinating domain of machine learning.

It is, however, in the third part that the culminating point of the work is to be found: after a certain failure, due to the problem of generalization to unknown samples, which closes the second chapter, the author decided to reduce his S4 objectives to a single, more realistic one: the problem of smile recognition. It is in this final chapter that the writing plan most common in cognitive science, IMRED (Introduction, Méthode, Résultats et Discussion), was followed.
The final result of the work – the article Semi-supervised haartraining of a fast&frugal face-coupled open source haarcascade detector of zygomatic smiles – has been placed in the annex, since we were not permitted to write the thesis proper in a language other than French. The text ends with the internship proposal which was at the origin of this work.

Abstract / Résumé

Three different approaches were examined in order to establish a facial-expression-based communication channel between the robotic face Roboto and its human counterpart. We started with a point-tracking system based upon the Lucas-Kanade optical flow algorithm. The need for a calibration phase, as well as the fact that this approach is a purely behaviorist (i.e. stimulus-reflex based) one, without any cognitive representation of the "smile" on the side of the computer, induced us to implement more robust machine learning techniques. Therefore, as a second trial, we studied the feasibility of a facial expression recognition system based on contour extraction, chamfer matching and subsequent feature selection by means of AdaBoost. While the initial tests limited to the JAFFE dataset showed promising results, generalisation to other datasets turned out to be problematic. Lastly, a "classical" approach of cascades of boosted haar-feature-based classifiers was applied upon the SMILEs dataset and coupled with already existing face detectors, producing a relatively fast (< 1 ms per frame) and sufficiently accurate (90.5% AUC when tested on JAFFE) zygomatic smile detector (smileD). Both smileD and the SMILEs dataset used to generate it are hereby published as a gift to the OpenCV community.

1. Imitation by point tracking

Observation

Roboto is a robotic head that was designed in 2003-2004 within a CNRS collaboration between the developmental team of Jacqueline Nadel (CNRS UMR7593, Centre Emotion) and the ETIS neurocybernetics team of Philippe Gaussier, inspired by the initial design of Feelix by Lola Canamero's team (Canamero & Fredslund, 2001). The person in charge of its construction was Pierre Canet (Centre Emotion).

Figure 1: Roboto
The head is connected to a laptop via an RS-232 serial link wired to an SSC-12 card (Lynxmotion's 12-Channel Serial Servo Controller). Twelve servomotors, commanded from a program developed by the ETIS team, make it possible to move the eyebrows, the eyelids and the mouth, composing 6 emotional expressions (sadness, joy, surprise, anger, fear and neutral), formatted with reference to the action units of prototypical human expressions (Ekman et al. 2002).

Table 1: The attributes of Roboto's various motors

Motor | Attachment point | Movement | Associated muscle
0 | Lip corner | horizontal | Zygomaticus major
1,3 | Nostril | vertical | Quadratus labii superioris
2 | Lip corner | horizontal | Zygomaticus major
4 | Chin | vertical | Depressor labii inferioris / Mentalis
5,6 | Eyelid | vertical | Orbicularis oculi
7,10 | Inner eyebrow | quasi-vertical | Corrugator
8,11 | Outer eyebrow | quasi-vertical | Temporalis
9 | Forehead | vertical | Frontalis

Figure 2: The muscles of the face

Given that the SSC-12 controller can manage 12 motors while we had 14 at our disposal, we had to choose the two motors that would remain unconnected. We unplugged the motors allowing the eyes to turn around their vertical axis, thus excluding them from our experiments. This choice was partially motivated by the fact that the cameras originally installed in the robot's two eyes showed only very poor performance. A new solution therefore had to be found to make Roboto better able to see. After a few attempts to use higher-end cameras (model: Pointgrey, connection: Firewire) as the perceptual input channel, it was finally a rather simple webcam (model: PHILIPS, connection: USB), attached at the spot between and slightly above the eyes (the "third eye" position), that provided the image stream to the system.

The SSC-12 controller, which ensured the communication between the computer and the servomotors, worked as follows: through the RS-232 serial port it received from the computer sequences of three bytes, of which the first always has the value 255; the second byte specifies the number of the servomotor concerned (1-12) as well as the speed of the movement (0-7, where 7 designates the fastest speed); the third byte specifies the final position of the servomotor. As each byte can take 2^8 = 256 values, and the value 255 is reserved for the synchronization byte, each servomotor can occupy 255 distinct positions. Knowing that Roboto has 12 degrees of freedom, the number of its possible states is thus 255^12. Since the difference between neighboring servomotor states is invisible to a human observer (e.g. one cannot distinguish an eye closed because servomotor 5 is in position 23 from one whose servomotor is in position 24 or 25), the number of Roboto's possible states that could really be perceived as distinct by a human subject is doubtless lower than 255^12, but it nevertheless remains so large that even after several months of work, Roboto was still capable of surprising or even amusing us with a facial expression never noticed before.
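As an illustration of this protocol, here is a minimal, hypothetical C sketch of sending one such three-byte packet to the SSC-12. The port name and baud rate are taken from the roboto.pl script shown later; the nibble layout of the second byte is inferred from the sequences in Table 2 and should be treated as an assumption.

    #include <fcntl.h>
    #include <unistd.h>

    /* Send one 3-byte SSC-12 packet: sync, servo+speed, position. */
    int main(void) {
        int fd = open("/dev/ttyS0", O_WRONLY | O_NOCTTY); /* port assumed preconfigured at 9600 baud */
        if (fd < 0) return 1;
        unsigned char packet[3];
        packet[0] = 0xFF; /* synchronization byte: always 255 */
        packet[1] = 0x10; /* assumption: high nibble = speed (here 1), low nibble = servo number (here 0) */
        packet[2] = 0x86; /* target position of the servo (0-254) */
        write(fd, packet, 3); /* a full expression = 12 such packets, one per servo */
        close(fd);
        return 0;
    }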
According to the famous FACS (Facial Action Coding System) (Ekman & Friesen 1977), there exist 7 basic facial expressions identifiable in all the cultures of the world: 1) neutral 2) joy 3) surprise 4) sadness 5) anger 6) fear 7) disgust. FACS was taken as the base reference by Roboto's designers: "Much effort was invested so that the robot's expressions would be coherent with the FACS system" (Nadel et al. 2006). The results of this effort were 6 sequences, one for each facial expression¹. Each sequence is composed of 36 bytes – 3 bytes for each of the twelve servomotors – which are sent through the serial port to the SSC-12 servo-controller in order to produce the requested expression.

Table 2: Roboto's facial expressions and their corresponding byte sequences

Expression | Corresponding sequence
Neutral | FF 10 86 FF 11 7C FF 12 7B FF 13 8A FF 14 5F FF 15 7D FF 16 8C FF 17 75 FF 18 85 FF 19 78 FF 1A 75 FF 1B 87
Joy | FF 10 86 FF 11 7C FF 12 7B FF 13 8A FF 14 5F FF 15 7D FF 16 8C FF 17 75 FF 18 85 FF 19 78 FF 1A 75 FF 1B 87
Surprise | FF 20 8C FF 21 76 FF 22 60 FF 23 93 FF 24 AD FF 25 5A FF 26 64 FF 17 60 FF 18 A1 FF 29 AA FF 1A 60 FF 1B 99
Sadness | FF 10 3C FF 11 3C FF 12 C8 FF 13 CF FF 14 46 FF 15 87 FF 16 96 FF 17 95 FF 18 9F FF 19 46 FF 1A 95 FF 1B 9F
Anger | FF 10 86 FF 11 7C FF 12 7B FF 13 8A FF 14 3C FF 15 87 FF 16 96 FF 17 60 FF 18 80 FF 19 3C FF 1A 64 FF 1B 83
Fear | FF 20 5F FF 21 72 FF 22 A5 FF 23 94 FF 24 5A FF 25 23 FF 26 32 FF 27 80 FF 28 85 FF 29 AA FF 2A 80 FF 2B 85
Disgust | FF 30 86 FF 31 BB FF 32 E4 FF 33 B4 FF 34 89 FF 35 7D FF 36 8C FF 37 4B FF 38 85 FF 39 4E FF 3A 75 FF 3B 9C

¹ With the exception of the sequence for disgust, which we did not find among the data supplied with Roboto. We therefore had to "find" the disgust sequence ourselves, which we did thanks to the ri.pl script that will be presented on page 7.

The Docklight software was supplied together with the sequences as the means of sending commands to the robot through the serial port. In other words, the 6 sequences plus one Windows program constituted the entire "software" allowing us to communicate with Roboto when it was transferred to the Lutin in June 2009.

Problem

In more general terms, our problem can be formulated thus: how to make Roboto useful to our laboratory and to the scientific community, indeed useful to the progress of knowledge?

The fact that Roboto only faintly resembles a human face can also be considered a problem. Indeed, there is no skin, the metallic frame lies in a single plane, and the robot thus gives a 2-dimensional impression whose most realistic part is the mouth, the lips being formed by a coil that dilates or contracts according to the positioning of motors 0, 1, 2, 3 and 4. Indeed, many of those who came into contact with Roboto during their visits to the Lutin remarked that the robot is "not very realistic", especially compared to robotic faces like "Einstein" (Wu et al. 2009) or the "Repliee Q2 Actroid". But it happens that what is considered by most to be a defect turns out to be an advantage – it suffices to give priority to a more optimistic perspective. Such was, is and will be our approach, and we support this "optimistic approach" with the following arguments:

1) The uncanny valley effect: this is a psychological reaction to certain humanoid robots, first suggested by (Mori 1970). It describes the fact that the more a humanoid robot resembles a human being, the more monstrous its imperfections appear to us.
Thus, certain observers will be more at ease in front of a clearly artificial robot than in front of a robot endowed with skin, clothes and a face that could pass for human. Fig. 3 shows what we believe to be Roboto's position in the anthropomorphism/familiarity scheme proposed by (Mori 1970): a certain lack of anthropomorphism places it before the first peak and thus prevents the human observers confronted with the robot from feeling the negative plunge into the valley of the "disquieting strangeness" (Freud 1947).

Figure 3: Roboto's position in the anthropomorphism/familiarity scheme of (Mori 1970)

2) The autistic children hypothesis: as the mother of the Roboto project, prof. Jacqueline Nadel, remarked, a certain lack of anthropomorphism was deliberate, so as to allow experiments with autistic children, who often have difficulty looking at a human face. According to her hypothesis, things could be otherwise in the case of interaction with a face of limited anthropomorphism, such as Roboto's.

Having taken these two arguments into account, we decided not to interpret Roboto's anthropomorphism deficit as an obstacle but, on the contrary, as an advantage, as a source of the constraints marking out the territory of our work. Roboto had, after all, been welcomed at the Lutin with the original goal of making possible and easy the experiments concerning the processing of facial expressions by autistic children.
And even if one can say that the work presented in the following pages has partially drifted away from this original objective toward unknown and unforeseen regions – as is, after all, often the case in science – and even if our work moved closer and closer to the domain of an artificial intelligence engineer, we consider it important to underline that we never lost sight of the experimental, indeed medical, objective (i.e. "to help"). On the contrary, it may well be that what we are about to present here is only an introduction, a manual explaining the workings of Roboto to those of our colleagues who might decide² to answer the question: "Could human-machine interaction through facial expressions be exploitable for the study, as well as for the therapy, of mental disorders?"

² The pieces of information that could be of particular use to such a researcher are marked in this text by underlined type. English terms are written in italics.

Experimentation

The objective was thus well defined: create a program that translates what the robot "sees" into its movement. Transforming 7 sequences of 36 bytes associated with 12 servomotors into an experimental tool suitable also for autism studies is a challenge that can only be solved by a certain amount of initial trial and error. A challenge all the more difficult for someone who had never sent a single byte to a servo-controller, nor worked in the domain of computer vision – and such, indeed, was our situation when this challenge was taken up. Fortunately, it very quickly turned out that the major part of the work had already been done, not only by Roboto's designers but also by the worldwide community united around the Open Source software philosophy. The module Device::SerialPort was quickly found and chosen from the repository of the PERL language (Wall & Loukides 2000) to make serial port communication possible, and thus to serve as the working basis of our first script, called roboto.pl. Once executed, this roboto.pl script waits for keywords on its input. If a keyword like «joie», «surprise», «peur», «colère», «dégout», «neutre» or «tristesse» is typed, the script sends the associated byte sequence through the serial port.

Code 1: The source code of the PERL script roboto.pl

    use Device::SerialPort;
    my $neutre    = "FF 10 86 FF 11 7C FF 12 7B FF 13 8A FF 14 5F FF 15 7D FF 16 8C FF 17 75 FF 18 85 FF 19 78 FF 1A 75 FF 1B 87 ";
    my $colere    = "FF 10 86 FF 11 7C FF 12 7B FF 13 8A FF 14 3C FF 15 87 FF 16 96 FF 17 60 FF 18 80 FF 19 3C FF 1A 64 FF 1B 83 ";
    my $sourire   = "FF 10 2F FF 11 D8 FF 12 D8 FF 13 1E FF 14 8C FF 15 7D FF 16 8C FF 17 75 FF 18 85 FF 19 78 FF 1A 75 FF 1B 85 ";
    my $peur      = "FF 20 5F FF 21 72 FF 22 A5 FF 23 94 FF 24 5A FF 25 23 FF 26 32 FF 27 80 FF 28 85 FF 29 AA FF 2A 80 FF 2B 85 ";
    my $tristesse = "FF 10 3C FF 11 3C FF 12 C8 FF 13 CF FF 14 46 FF 15 87 FF 16 96 FF 17 95 FF 18 9F FF 19 46 FF 1A 95 FF 1B 9F ";
    my $surprise  = "FF 20 8C FF 21 76 FF 22 60 FF 23 93 FF 24 AD FF 25 5A FF 26 64 FF 17 60 FF 18 A1 FF 29 AA FF 1A 60 FF 1B 99 ";
    my $command = "neutre";
    print ">";
    my $port = Device::SerialPort->new("/dev/ttyS0");
    $port->baudrate(9600);
    while ($command = <STDIN>) {        # read keywords from standard input
        $port->lookclear;
        chomp $command;
        my $instruction = "";
        my $dat = "";                   # bytes to send for this command
        my $raw = 0;
        if    ($command eq "smile"  or $command eq "sourire")   { $instruction = $sourire; }
        elsif ($command eq "neutre" or $command eq "neutral")   { $instruction = $neutre; }
        elsif ($command eq "colere" or $command eq "anger" or $command eq "angry") { $instruction = $colere; }
        elsif ($command eq "peur"   or $command eq "fear")      { $instruction = $peur; }
        elsif ($command eq "tristesse" or $command eq "sad")    { $instruction = $tristesse; }
        elsif ($command eq "surprise")                          { $instruction = $surprise; }
        else {                          # anything else is treated as raw, space-separated byte values
            $raw = 1;
            my @d = split(/ /, $command);
            foreach my $s (@d) { $dat .= chr($s); }
        }
        if (!$raw) {                    # convert the hex pairs of the macro sequence into raw bytes
            while ($instruction =~ /([A-F0-9]{2}) /g) {
                print hex($1) . ",";
                $dat .= chr(hex($1));
            }
        }
        print "\n>";
        $port->write($dat);
    }

Figure 4: The coupling between Roboto's motors (motor numbers in green) and the keys (in blue) of the keyboard, as defined in the ri.pl script

The limits of the roboto.pl script are obvious: the robot can only express one of the 7 preprogrammed expressions; the motors cannot be accessed individually; the speed cannot be adjusted without modifying the script itself. To overcome these limitations, a second script named ri.pl was written. Its operation was based on an individual coupling between the servomotors and the keyboard. Simply put, two keys were associated with each motor: one to increase the byte representing the motor's position (e.g. raise an eyebrow), the other to lower it. The key/motor pairs were chosen in such a way that the position of a key within the QWERTY keyboard layout (c.f. Fig. 4) corresponds to the position of the motor with respect to the face as a whole (e.g. key 1, located in the upper-left corner, moves the motor of the outer left eyebrow upward).
There are thus 2x12 = 24 keys for moving the motors, two keys for accelerating/decelerating the speed of movement, and 6 macro keys for the basic sequences. Even though the ri.pl script lets us work with Roboto in a much more flexible way than roboto.pl, it is obvious that both scripts represent only the beginning of the work. Indeed, the scripts allow us to send commands to the robot to make it move, but the input data, i.e. the visual data, are never analyzed by these simple scripts. In other words, Roboto's motor side is better provided for than before its arrival at the Lutin, but the perceptual side is lacking. Up to this point, the machine sees nothing.

The machine sees nothing until we teach it to see – such is the basic postulate of the branch of artificial intelligence called computer vision (CV). For several decades already, researchers have been devising theoretical models, mathematical formulas and ever more evolved solutions for performing image processing and classification, finally culminating in OpenCV. OpenCV is a library written in C++, created initially in a Russian subsidiary of Intel before becoming public and open source. It is a project containing hundreds of functions inspired by academic studies on everything that concerns the numerical representation of visual data. As it is neither necessary nor possible to present, within this thesis, even a hundredth of all that OpenCV can do, we refer the interested reader to the book by Bradski & Kaehler (2008).

For a computer, an image is nothing but an array of pixels. Each pixel is a colored point located at coordinates X (the column) and Y (the row). In the case of a color pixel, the color is described by 3 integers, representing the intensities of the red, green and blue components. But since our work concerns facial expressions, and since it is obvious³ that one can recognize and classify a facial expression even in a black&white image, we shall speak only of images whose pixels have a single component: intensity (luminosity, the shade of gray) on the black-white scale. We proceed this way because the color → black&white reduction appreciably reduces the complexity of the problem to be solved. We repeat: an image of size X x Y can be described as a matrix with X columns and Y rows, each element pix(X,Y) of which codes the intensity of the pixel at position I(X,Y). While the camera's role is to create such numerical representations, the computer's role is to process them in order to extract the information of interest, indeed to find in them the objects of certain categories. All of this happens thanks to procedures of a purely mathematical essence – procedures which OpenCV dispenses even to those who struggle to understand the ideas hidden behind terms such as: convolution kernel, Laplace operator, Sobel filter or Fourier transform.

³ At least for those who remember the era when photographic film was black&white.
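As a small hedged illustration of this pixel-matrix view (using the OpenCV 1.x C API of Bradski & Kaehler, 2008; the file name and coordinates are arbitrary assumptions), one can load an image directly in grayscale and read individual intensities:

    #include <stdio.h>
    #include <opencv/cv.h>
    #include <opencv/highgui.h>

    int main(void) {
        /* Load the image with a single channel: each pixel is one 8-bit gray intensity. */
        IplImage* img = cvLoadImage("face.png", CV_LOAD_IMAGE_GRAYSCALE);
        if (!img) return 1;
        /* The image is a matrix of img->height rows and img->width columns;
           pixel (x, y) lives at offset y*widthStep + x of the raw data buffer. */
        int x = 10, y = 20;
        unsigned char intensity = ((unsigned char*)(img->imageData + y * img->widthStep))[x];
        printf("I(%d,%d) = %d\n", x, y, intensity);
        cvReleaseImage(&img);
        return 0;
    }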
Now, for our first image-processing program – or rather sequence-of-images processing program, since we are talking here about a video stream – we took our inspiration from the FaceLab "eye tracking" apparatus present at the Lutin, which is itself programmed with OpenCV. The principle of eye tracking, and indeed of all point-tracking systems in general, is the following: 1) locate or choose the point of interest at the beginning of the video sequence; 2) find the "corners" in the proximity of this point; 3) apply the optical flow algorithm to find these corners in the subsequent images of the video sequence, starting from the position located in the preceding image.

The notion of "corner" (feature) is essential for understanding how a tracking system works. A corner is an image point whose properties are unique with respect to the other points of the same image. If one wanted to track a point in a video of a white wall, it would be rather difficult to locate the same point in the next image, since all points would have roughly the same intensity (would be white). On the contrary, if one chose a point with unique properties, one could follow it, i.e. do tracking. This is precisely what OpenCV's function cvGoodFeaturesToTrack() does. It is based on the definition of Harris & Stephens (1988), which considers as corners only those points whose derivative in both orthogonal directions is sufficiently strong; according to Shi & Tomasi (1994), this comes down to a derivative⁴ value higher than a certain threshold, treated as a parameter of the tracking system. Simply put: if the intensity value of a pixel differs sufficiently from its neighborhood both along the X axis and along the Y axis, this point can be considered a corner, i.e. something unique, "a good feature to track".

⁴ One can also speak of a "gradient".

Once we have located, in the proximity of the point of interest, the points that will be easier to follow than others, we can analyze the next image and try to find them again. To do this, the motion that makes the two images differ must be estimated. The "optical flow" algorithms allow us to do just that. Because of its elegance, its speed – it is a so-called sparse optical flow method – and its efficiency, we chose the Lucas-Kanade (LK) method (Lucas & Kanade, 1981) to compute the flow between two images. LK rests on 3 basic principles:
• brightness constancy – the intensity of a pixel does not change between subsequent images;
• temporal persistence – the changes between subsequent images are sufficiently slow;
• spatial coherence – neighboring points perform similar movements.

In OpenCV, it is the function cvCalcOpticalFlowPyrLK() that lets us apply the LK method. This function also adds what is called an "image pyramid". An image pyramid is a multi-resolution representation of an image, created from an original image: the original image is its base, the first level is the base reduced by half, and so on.
The use of image pyramids in combination with the LK algorithm allows us to relax the limitations due to the "temporal persistence" condition: a small movement at the high levels of the pyramid is equivalent to a large movement at the lower levels. Without the image pyramid, such a large movement could not have been captured by the LK method.

Figure 5: A) an image pyramid (image taken from http://fr.wikipedia.org/wiki/Pyramide_(traitement_d%27image)); B) the use of image pyramids in Lucas-Kanade optical flow estimation, attenuating the problems linked to the small-movement requirement (Bradski & Kaehler, 2008)

To recapitulate: the "perceptual kernel" of our first imitation program is the following: 1) the user chooses the point of interest with a mouse click; 2) the function cvGoodFeaturesToTrack() locates the corners in the proximity of the point of interest (PoI); 3) the program finds those corners again in the next image of the video stream thanks to the function cvCalcOpticalFlowPyrLK(), and derives from them the new position of the PoI. This can be done for any number of points, and the sequence being analyzed – like the video in which the user chooses the PoIs – comes, of course, from Roboto's camera.

Once vision – Roboto's sensory side – was in place, it was coupled with motricity in the following way: before clicking to choose the PoI, the experimenter presses the keyboard key referring to the motor whose movement will be coupled with the movement of the PoI about to be clicked. Simply put, the motor that will move is defined by the keyboard key (c.f. Figure 4), the PoI to be tracked is defined by the mouse click, and their coupling is ensured by the fact that the experimenter presses the key and then clicks. The sending of byte sequences is then analogous to the ri.pl script, the biggest difference being that this time we use the C++ library libserial rather than PERL's Device::SerialPort, since the sm_imitation program we have just presented is written not in PERL but in C++ (to be compatible with OpenCV).

Figure 6: Imitation of the movement of the left eyebrow performed by the sm_imitation.c program

Results

It was, and remains, difficult for us to evaluate this point-tracking-based program in a reliable way. This is due to the fact that the program provides no numerical output; it does nothing but follow points and try to translate the detected movement into the movement of the motors (the procedure is illustrated in Fig. 6). The choice of the point to be tracked, like the choice of the motor to be coupled with the PoI, is made in an ad hoc fashion. In other words, the motor ↔ PoI coupling phase – a calibration phase, a manual intervention – is necessary and must precede every experiment with this program, which makes every experiment unique, non-repeatable and hence non-scientific. Since we do not want to stray too far from our scientific objectives, let us content ourselves with two observations, one positive and one negative, which might seem superfluous but are nothing of the kind. The good news is that the combination of the functions cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK() allows even OpenCV beginners to build their first point-tracking programs; a minimal sketch follows.
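The following hedged C sketch shows the skeleton of such a tracker with the OpenCV 1.x C API; the camera index, window size, thresholds and feature count are illustrative assumptions, and error handling is omitted.

    #include <opencv/cv.h>
    #include <opencv/highgui.h>

    int main(void) {
        CvCapture* cap = cvCaptureFromCAM(0);           /* webcam assumed at index 0 */
        IplImage* frame = cvQueryFrame(cap);
        CvSize sz = cvGetSize(frame);
        IplImage *prev = cvCreateImage(sz, IPL_DEPTH_8U, 1), *curr = cvCreateImage(sz, IPL_DEPTH_8U, 1);
        IplImage *eig  = cvCreateImage(sz, IPL_DEPTH_32F, 1), *tmp = cvCreateImage(sz, IPL_DEPTH_32F, 1);
        IplImage *pyrA = cvCreateImage(sz, IPL_DEPTH_8U, 1), *pyrB = cvCreateImage(sz, IPL_DEPTH_8U, 1);
        CvPoint2D32f ptsA[100], ptsB[100];
        char status[100]; float err[100];
        int count = 100, i;

        cvCvtColor(frame, prev, CV_BGR2GRAY);
        /* find up to 100 "good features to track" in the first frame */
        cvGoodFeaturesToTrack(prev, eig, tmp, ptsA, &count, 0.01, 10, NULL, 3, 0, 0.04);

        while ((frame = cvQueryFrame(cap)) != NULL) {
            cvCvtColor(frame, curr, CV_BGR2GRAY);
            /* pyramidal Lucas-Kanade: find the same corners in the new frame */
            cvCalcOpticalFlowPyrLK(prev, curr, pyrA, pyrB, ptsA, ptsB, count,
                                   cvSize(21, 21), 3, status, err,
                                   cvTermCriteria(CV_TERMCRIT_ITER | CV_TERMCRIT_EPS, 20, 0.03), 0);
            /* here the displacement ptsB[i]-ptsA[i] of the clicked PoI would be
               translated into a servomotor command, as in sm_imitation */
            cvCopy(curr, prev, NULL);
            for (i = 0; i < count; ++i) ptsA[i] = ptsB[i];
        }
        cvReleaseCapture(&cap);
        return 0;
    }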
The bad news is that even with the pyramidal image system in place, and even though the LK optical flow algorithm is supposed to satisfy the temporal persistence condition, robust point tracking is not guaranteed: points are frequently lost, above all after an abrupt movement, which then requires the calibration phase to be restarted.

Action

In concrete terms, we would need at least 7 more months to make imitation by point tracking (based on LK optical flow) workable, were we ever to adopt that approach for experiments with autistic children. The calibration phases, the need to determine the best thresholds for the functions cvGoodFeaturesToTrack() and cvCalcOpticalFlowPyrLK(), and the uncertain relevance of the results obtained led us to abandon the "imitation by point tracking" technique, for it is unlikely that an autistic child would remain in a peaceful state of mind – without abrupt movement – throughout the whole initial phase during which 12 points would have to be coupled with 12 keys → 12 motors. Even when we tried to turn Roboto's eyes, i.e. its camera, toward Roboto itself, and to make Roboto follow the movements of its own motors, the temporal persistence condition was not respected and points were often lost⁵. One can imagine the difficulty of such an experiment were it to be carried out with autistic children!

⁵ Yet from time to time, when the coupling was made in a particular way, what had been hoped for did happen: Roboto moved all by itself for a certain while, without any human intervention being necessary.

But the biggest reproach we can address to the approach presented in this chapter is not of a technical but of a theoretical order. Indeed, a program that does nothing but follow points and dumbly translate them into bytes sent to servomotors is only an ad hoc solution, a program that contributes nothing either to the domain of cognitive psychology or to the domain of artificial intelligence – a domain which attracted us more and more as we got to know the OpenCV library and the C++ programming language, both previously unknown to us. The intelligence of such a blind-following imitation system may be compared to the intelligence of a Braitenberg vehicle (Braitenberg, 1986), but not to the so-called "emotional" intelligence that researchers in emotional robotics attempt to simulate and understand. For in the point ↔ motor coupling system we have just presented there is nothing but a stimulus-response system; the program is thus of a behaviorist rather than cognitivist order. For a program to deserve the latter name, it needs different levels of representation as far as possible, plus classification abilities, indeed a kind of generalization. In a word, it needs learning. And for that, we must enter the kingdom of machine learning.

Figure 7: An example of stimulus-action coupling: Braitenberg vehicles. Each luminosity-sensitive sensor is coupled to a wheel.

Bibliography

Bradski, G., & Kaehler, A. (2008). Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Media.
Braitenberg, V. (1986). Vehicles: Experiments in synthetic psychology. The MIT Press.
Canamero, L., & Fredslund, J. (2001). I show you how I like you – can you read it in my face? [robotics]. IEEE Transactions on Systems, Man and Cybernetics, Part A, 31(5), 454–459.
Ekman, P., & Friesen, W. V. (1977). Manual for the Facial Action Coding System. Consulting Psychologists Press.
Freud, S. (1947). Das Unheimliche (1919). Gesammelte Werke, 12.
Harris, C., & Stephens, M. (1988). A combined corner and edge detector. In Alvey Vision Conference (Vol. 15).
Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence (Vol. 3, p. 3).
Mori, M. (1970). The uncanny valley. Energy, 7(4), 33–35.
Nadel, J., Simon, M., Canet, P., Soussignan, R., Blancard, P., Canamero, L., & Gaussier, P. (2006). Human responses to an expressive robot. In Proceedings of the Sixth International Workshop on Epigenetic Robotics (pp. 79–86).
Shi, J., & Tomasi, C. (1994). Good features to track. In 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'94) (pp. 593–600).
Wall, L., & Loukides, M. (2000). Programming Perl. O'Reilly & Associates, Sebastopol, CA, USA.
Wu, T., Butko, N. J., Ruvulo, P., Bartlett, M. S., & Movellan, J. R. (2009). Learning to make facial expressions.

2. Introduction to machine learning in relation to facial expression recognition

Introduction

The long-term objective is to make possible experiments based on the interaction between Roboto and autistic children. As is often the case, the constraints linked to this objective mark out the path to follow: knowing that the behavior of the children in question often varies according to conditions that cannot be foreseen, we decided that the experimental apparatus – and above all the software at its core – that we are trying to build must be as robust, fast and simple as possible, both for the experimenter and for the experimental subject. The robustness constraint demands that the software do what it is asked to do despite variations of external conditions (differences between the faces of different subjects, differences of luminosity, etc.). The speed constraint is due to the fact that we will be satisfied only with software that lets us study high-level interaction between human and machine, and for that the software must allow human and machine to interact in real time. Finally, when we speak of simplicity, we mean that we prefer an approach using a minimal number of parameters to be defined or adjusted during the experiment. In the ideal case there will be none at all, the apparatus being completely automatic and the experimenter contenting himself with starting the program and analyzing the results.

Even though the approach presented in the first chapter was fairly fast, it was neither robust – tracked points were often lost – nor simple – a rather long coupling phase would have been necessary before every experiment. We therefore decided to attempt to build a system in which no coupling or calibration phase prior to the experiment is necessary. For this, the software must possess, before the experiment itself, some knowledge of the invariant features of the objects (faces or their facial expressions) that it will try to 1) recognize, 2) classify and 3) imitate. Such a technique is possible if and only if the software has prior, general knowledge about the objects to be recognized.
The means by which general knowledge can be attained from a limited sample of concrete examples is called learning, and the subdiscipline of computer science studying the various algorithms that allow us to perform learning on machines is known under the name of machine learning. The objective of this chapter is to introduce these algorithms and to define the notion of "feature", all in relation to facial expression recognition.

Literature

What research in facial expression recognition (FER) aims at is the classification of images according to the facial expressions (FE) contained in the analyzed images. In general, classification is accomplished in two phases: 1) features are extracted from an image; 2) according to the extracted features, and according to the (features – label) couplings supplied during the learning phase, a machine learning (ML) algorithm attributes a label even to an unknown object. The label designates the object's membership in a class. The fundamental principles of ML are well explained in (Bishop et al. 2006); the work of (Haykin 1994) can no doubt serve as a reference book. Concerning the application of ML in the domain of computer vision (CV), the 13th chapter of (Bradski & Kaehler 2008) will prove most useful, above all for neophytes.

As for the various algorithms used in ML, the best known are: the Mahalanobis distance (Mahalanobis 1936), K-means (Lloyd 1982), the (naive or normal) Bayes classifier (Minsky & Selfridge 1961), decision trees (Breiman 1984), boosting (Freund & Schapire 1996), random trees (Breiman 2001), expectation maximization (Dempster et al. 1977), K-nearest neighbors (Fix & Hodges Jr. 1989), artificial neural networks (i.e. multi-layer perceptrons) (Rumelhart 1989) and the SVM (support vector machine) (Vapnik et al. 1997). Each of these algorithms has weaknesses and strong points, a fact well captured by the TANSTAAFL⁶ theorem (Wolpert & Macready 1997). All are included in the ML library which is an integral part of OpenCV.

⁶ The acronym TANSTAAFL stands for There Ain't No Such Thing As A Free Lunch.
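As a hedged illustration of this two-phase scheme – and of OpenCV's ML library, which will matter again when boosting is discussed below – here is a minimal C++ sketch of training and querying a boosted classifier on an already-extracted feature matrix. Matrix contents, sizes and parameter values are placeholder assumptions, not the method actually used in this chapter.

    #include <opencv/cv.h>
    #include <opencv/ml.h>

    int main() {
        // Phase 1 (assumed already done): N training objects, T numeric features each.
        const int N = 177, T = 9;
        CvMat* features = cvCreateMat(N, T, CV_32FC1);  // one row of features per image
        CvMat* labels   = cvCreateMat(N, 1, CV_32FC1);  // 1 = class present, 0 = absent
        // ... fill features and labels during the feature-extraction phase ...

        // Declare all inputs numerical, the response categorical.
        CvMat* varType = cvCreateMat(T + 1, 1, CV_8UC1);
        cvSet(varType, cvScalarAll(CV_VAR_NUMERICAL), NULL);
        CV_MAT_ELEM(*varType, unsigned char, T, 0) = CV_VAR_CATEGORICAL;

        // Phase 2: learn a (features -> label) mapping; here, discrete AdaBoost
        // over 100 depth-1 decision trees (stumps).
        CvBoost boost;
        boost.train(features, CV_ROW_SAMPLE, labels, 0, 0, varType, 0,
                    CvBoostParams(CvBoost::DISCRETE, 100, 0.95, 1, false, 0));

        // An unknown object is then labeled by the trained classifier:
        CvMat* sample = cvCreateMat(1, T, CV_32FC1);
        // ... fill sample with the features of a new, unseen image ...
        float predicted = boost.predict(sample);
        return (int)predicted;
    }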
As for FER itself, recent years have seen a veritable explosion in this domain of study. The collection of (Li & Jain 2005) and the survey article of (Fasel & Luettin 2003) present its most common approaches. They can be divided into three principal groups: wavelet-based FER, model-based FER and contour-based FER.

Wavelets: because of a certain analogy with the functioning of the visual cortex (J. P. Jones & Palmer 1987), Gabor wavelets (Lyons et al. 1998) are often used as features for FER. Even if this approach seems interesting from a theoretical point of view, its computational requirements (notably its slowness) make it almost unusable for real-time recognition (Azcarate et al. 2005). By contrast, the approach exploiting rectangular features resembling Haar wavelets (Lienhart & Maydt 2002) can attain very interesting performance, above all when it is combined with a certain trick of integral images (Viola & M. Jones, 2002) and when the AdaBoost algorithm is chosen to determine the features of interest⁷.

⁷ The theory of Haar features will be explained in the third chapter of this thesis.

Models: the other method for attaining FER is the Active Appearance Model (AAM) proposed by (Cootes et al. 1998). This approach makes it possible to build a statistical model from a certain number of training examples. The model thus built – one can imagine it as a kind of grid or mask – is then matched to an image (Abboud et al. 2004). Since the AAM system is by construction built to take into account the variability of the objects to which the model will be matched, the information linked to this variation can be exploited for the learning and classification of facial expressions (Lucey et al. 2005). Even though this approach deserves to be followed very closely, no open source, i.e. public, solution for AAM existed when we began our internship⁸. As we quickly understood that trying to reproduce the matrix calculus of AAM would be beyond our reach within the frame of our Master's internship, we decided to concentrate on the approach exploiting contours. The approach we attempted to reproduce was that of Moore and Bowden (2007)⁹.

⁸ In the meantime, a positive OpenCV solution has appeared on the site http://code.google.com/p/aam-opencv/
⁹ Other contour-based approaches exist, notably the one called "edge oriented histograms" of (Dalal et al. 2006). Unfortunately we were informed of their existence, by an expert from Aldebaran Robotics, too late for an in-depth study.

Problem

The abstract of the article by Moore and Bowden (2007) states: "This paper introduces a novel method for facial expression recognition, by assembling contour fragments as weak classifiers and boosting them to form a strong accurate classifier. Detection is fast as features are evaluated using an efficient lookup to a chamfer image, which weights the response of the feature."

Figure 8: Images transformed into contours obtained with the cvCanny() function on 6 images expressing different FEs

Let us first explain the basic terms:
• Contour: a curve corresponding to an abrupt change of luminous intensity in an image (i.e. the curve is detected where large differences with the values of neighboring pixels are found). To locate contours one often uses the Canny filter (Canny 1987), which eliminates many false contours since it looks only for connected components. The differentiation of contours is done by hysteresis thresholding, which requires two thresholds, a high one and a low one, treated in OpenCV as parameters of the cvCanny function.
• Contour fragment: a piece of contour cut out in a random fashion; a list of points. Each fragment supplies one feature.
• Chamfer image: an image in which the value q(x,y) of each pixel is proportional to the distance (with respect to X,Y) to the nearest feature present in the original image. The algorithm was initially proposed by (Barrow et al. 1978); we publish our OpenCV version of it in Annex 2.

Figure 9: Chamfer images built from the original images (c.f. Fig. 8). The intensity of each pixel depends on the Euclidean distance to the nearest contour in the original image.

• Chamfer distance: in theory, allows the resemblance between two curves to be measured.
Within these studies, however, it is rather treated as a means of quantifying the extent to which a curve C1, taken from image I1, is present in the same region of image I2. To do this, I2 is transformed into a chamfer image, and the chamfer distance is nothing other than the sum of the values of the points of the chamfer image I2 that have the same coordinates as the points composing C1. It is as if one laid C1 over I2 and then computed the total sum of the pixel values underneath C1.
• Weak classifier: also called a "hypothesis"; an algorithm that gives better results than chance (i.e. on average it errs less than once in two). Within this study, a weak classifier is simply a feature to which is attached a decision tree with a single bifurcation threshold (a stump).
• Strong classifier: a linear combination of weak classifiers; the output of the boosting algorithm; the final result of the learning phase.
• Boosting algorithms: a group of meta-learning algorithms making it possible to choose, among a very large number of possible hypotheses, those which seem the most "telling" and the most "pertinent" for the final classification.
• AdaBoost (adaptive boosting): rests on the iterative selection of weak classifiers as a function of a distribution over the training examples. Each example is weighted according to the difficulty of its classification by the current classifier. (c.f. also Annex 1)

The basic notions of Moore and Bowden's approach thus explained, let us now illustrate their method on a concrete problem: FER on the images belonging to the Japanese Female Facial Expression (JAFFE) image base. JAFFE contains 204 images, and each image is labeled with one of 6 labels (fear, surprise, joy, anger, disgust or sadness). The totality of the 204 images is divided into 2 parts: the training sample, with 177 images, and the testing sample, with 37 images. The first is used to build the strong classifier; the second is used to see how the strong classifier "copes" with examples it has never encountered during the learning phase.

The learning process unfolds as follows: all the images are aligned, then the Canny filter is applied to obtain a binary image (a pixel is either black or white) representing the connected contours, which significantly reduces the quantity of data and eliminates the information judged less pertinent¹⁰. Each image is then "inverted" (c.f. Annex 2) into a chamfer image whose pixel values contain the distance to the nearest contour. For each class C of the facial expressions in question, a number T of contour fragments is taken from the 177/6 ≈ 29 images belonging to class C (having C as their label). Then each of the T fragments is laid, as it were, over each of the 177 chamfer images, and the total sum of the pixel values overlapped by the fragment in question is computed. The result is a number which informs us of the distance between the contour that was part of one of the class-C images and the nearest contour in image X; a minimal sketch of this computation is given below.

¹⁰ While keeping intact the information necessary for determining facial expressions. This follows from the starting hypothesis of (Moore & Bowden, 2007): "the information contained in the contours is sufficient for FER". We refer to this hypothesis as the comic-strip hypothesis.
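A hedged C sketch of this feature computation, assuming the OpenCV 1.x C API (the Canny thresholds are assumptions, and the fragment here is an arbitrary handful of points; in the actual method, fragments are cut from class-C contours):

    #include <opencv/cv.h>
    #include <opencv/highgui.h>

    /* Chamfer "distance" of a contour fragment laid over image X:
       sum of the chamfer-image values under the fragment's points. */
    float chamfer_feature(IplImage* gray, CvPoint* fragment, int n) {
        CvSize sz = cvGetSize(gray);
        IplImage* edges   = cvCreateImage(sz, IPL_DEPTH_8U, 1);
        IplImage* chamfer = cvCreateImage(sz, IPL_DEPTH_32F, 1);
        int i;
        float sum = 0;

        /* 1) connected contours via the Canny filter */
        cvCanny(gray, edges, 50, 150, 3);
        /* 2) distance of every pixel to the nearest contour point:
           cvDistTransform expects zero pixels at the features, so invert the edge map */
        cvNot(edges, edges);
        cvDistTransform(edges, chamfer, CV_DIST_L2, 3, NULL, NULL);

        /* 3) lay the fragment over the chamfer image and sum the values beneath it */
        for (i = 0; i < n; ++i)
            sum += CV_IMAGE_ELEM(chamfer, float, fragment[i].y, fragment[i].x);

        cvReleaseImage(&edges);
        cvReleaseImage(&chamfer);
        return sum;  /* one numeric feature: small if the fragment hugs a contour of X */
    }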
In other words, for each image X among the 177 we thus obtain a vector of T numerical features, a few of which, we hope, will suffice to distinguish the images of class C from the images belonging to the other classes. Say, for example, that it is the class C = joy which interests us and that we decided to use only T = 9 features (i.e. fragments) for the classification; we then obtain 177 vectors (rows) of 9 elements each, e.g.:

232, 324, 772, 552, 923, 789, 87, 124, 87
984, 398, 234, 902, 892, 398, 56, 234, 12
etc.

Moreover, since we are speaking of images belonging to the training part, we also know the class label associated with each vector: 1 if the image expresses joy, 0 if it does not. If the features are extracted in a relevant manner, and if the number T is not too low (in practice, hundreds of thousands of features are used), it is very probable that the information contained in these features will suffice to create a classifier, i.e. to find a sort of mapping11 features → label capable of distinguishing a joyful face from all the others.

10 While keeping intact the information necessary for determining facial expressions. This is due to the initial hypothesis of (Moore & Bowden, 2007): "the information contained in the contours is sufficient for FER". We refer to this hypothesis as the comic-strip hypothesis.
11 In the mathematical sense of the term.

The steps for creating such a classifier are in this case the following: for each feature, that is, for each column of our 177×T matrix, we look for a certain threshold – a certain numerical value – which best separates the classes joy-absent and joy-present. We thus obtain T decision trees12 which will serve us as weak classifiers. But there is little hope that a single feature, a single contour, exists which could make a robust distinction between the class joy-absent and the class joy-present. It seems far more reasonable to suppose that a facial expression is a combination of several features. In other words, it is by combining the weak classifiers with one another in an adequate way that one may hope to arrive at a reliable classification. This is precisely what the AdaBoost algorithm is for. It combines N weak hypotheses – which in our case are decision trees of the form: if value < threshold → class = X; else → class = nonX – in order to find a strong classifier. The condition imposed on the weak classifiers is simple: they must classify better than chance. If a sufficient number of such weak classifiers exists, we can be sure of finding a strong classifier that classifies all the training images without error. This is possible insofar as the algorithm is designed in such a way that the classification error rate on the training examples falls exponentially towards zero with the number N of weak classifiers. This mathematical property of AdaBoost is proven in (Freund & Schapire 1995).
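The following compact sketch illustrates this learning loop under the definitions just given; it is a textbook rendering, not our original code. For each feature column a single-threshold stump is fitted under the current example weights, and AdaBoost reweights the examples and accumulates the N best stumps into the strong classifier.

    #include <cmath>
    #include <vector>

    struct Stump { int feature; double threshold; int polarity; double alpha; };

    // Weak hypothesis: one threshold, two branches (a "stump"), c.f. footnote 12.
    static int predict(const Stump& s, const std::vector<double>& x) {
        return (s.polarity * (x[s.feature] - s.threshold) > 0) ? 1 : 0;
    }

    // Try every (feature, threshold, polarity) and keep the stump with the
    // lowest weighted error under the current example weights w.
    static Stump bestStump(const std::vector<std::vector<double> >& X,
                           const std::vector<int>& y,
                           const std::vector<double>& w, double& errOut) {
        Stump best = {0, 0.0, 1, 0.0};
        errOut = 1.0;
        const int pols[2] = {1, -1};
        for (size_t f = 0; f < X[0].size(); ++f)
            for (size_t i = 0; i < X.size(); ++i)      // observed values as thresholds
                for (int p = 0; p < 2; ++p) {
                    Stump s = {(int)f, X[i][f], pols[p], 0.0};
                    double err = 0.0;
                    for (size_t j = 0; j < X.size(); ++j)
                        if (predict(s, X[j]) != y[j]) err += w[j];
                    if (err < errOut) { errOut = err; best = s; }
                }
        return best;
    }

    // AdaBoost: N weak stumps combined linearly into a strong classifier.
    std::vector<Stump> adaboost(const std::vector<std::vector<double> >& X,
                                const std::vector<int>& y, int N) {
        std::vector<double> w(X.size(), 1.0 / X.size());   // uniform start
        std::vector<Stump> strong;
        for (int round = 0; round < N; ++round) {
            double err;
            Stump s = bestStump(X, y, w, err);
            s.alpha = 0.5 * std::log((1.0 - err) / (err + 1e-12));
            double z = 0.0;      // re-weight: misclassified examples get heavier
            for (size_t j = 0; j < X.size(); ++j) {
                int agree = (predict(s, X[j]) == y[j]) ? 1 : -1;
                w[j] *= std::exp(-s.alpha * agree);
                z += w[j];
            }
            for (size_t j = 0; j < w.size(); ++j) w[j] /= z;   // normalise
            strong.push_back(s);
        }
        return strong;   // final vote: sign of sum_i alpha_i * (2*predict_i - 1)
    }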
Implications

Once AdaBoost has chosen the most "telling" features and combined them linearly, thereby building the strong classifier, the latter can be put to the test by confronting it with the 37 images of the testing sample, that is, with images which were not used during the learning phase. The test results on JAFFE are presented in Table 3.

Table 3: The counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) obtained when classifying the FEs of JAFFE images.

              TP   TN   FP   FN
    Anger      2   22    2    3
    Disgust    3   25    1    0
    Fear       2   24    2    1
    Sadness    2   21    0    6
    Surprise   4   23    0    2
    Joy        6   20    0    3

The term "false positive" (FP) refers to the situation where an image is recognised as containing a given FE while in reality it does not contain it. Conversely, the term "false negative" (FN) refers to the situation where an image is recognised as not containing the FE while in reality it does contain it. The terms "true positive" (TP) and "true negative" (TN) refer to the cases where the algorithm correctly classified the presence, respectively the absence, of an FE in the analysed image.

12 In this case, the simplest possible decision trees are meant, having a single bifurcation threshold and thus only two branches (e.g. if the value of the feature is greater than the value of the threshold, the weak classifier casts a positive vote for membership in the class of interest, and a negative vote otherwise).

Given that this was our first attempt in the domain of machine learning, the results seemed encouraging to us, especially for the class "joy". We therefore decided to put our strong classifier to the test by confronting it with images coming from a sample whose luminosity conditions were completely different from those of the JAFFE database used for training.

Figure 10: The contour fragments (features) chosen by AdaBoost as the most relevant for determining the FE class present.

However, after having obtained from Carnegie Mellon University the authorisation to access the most renowned image database in the FER domain, that of Cohn-Kanade (Kanade et al. 2000), the results we obtained barely exceeded pure chance.

Future prospects

One explanation of this failure may lie in the fact that the approach presented by Moore and Bowden simply does not work as well as its authors report, i.e. that the features obtained from a chamfer image are not relevant for FE classification. Yet even though the authors did not answer the mails in which we asked them for further details about their method, and even though their method has (as of May 2010) not been reproduced anywhere in the literature, we remain persuaded – and Table 3 as well as Figure 10 indicate as much – that the chamfer distance13 between a contour and an image can prove a very efficient means of constructing features sufficiently discriminative for a contour-based classification.
Rather, we explain our failure to classify the images contained in the Cohn-Kanade database by the following factors: 1) we did not align the images with respect to a common reference point; 2) we applied no method to bring the luminosity of the images to the same level, which could be done by parameterising the hysteresis thresholding of the Canny filter so that it always yields roughly the same quantity of contours. We believe that the absence of these two steps prevented us from finding a classification system general enough to be applied even to the image streams coming from Roboto's camera.

13 The advantage of the method proposed above is that, once the chamfer image has been computed, the computation of the values of the features (previously selected by AdaBoost) is very fast to carry out. What is more, one can well imagine that the most demanding phase of the computation – the construction of the chamfer image – need not run on the central processor (CPU) but could be delegated to the processor of the graphics card (GPU). The project called OpenCL (not to be confused with OpenCV) seems to us ideal for attaining this objective.

We thereby understood not only that, in order to face the two problems mentioned in the preceding paragraph, we would need more than a few weeks. Much more: we understood that to build a sufficiently robust and fast FER system we would need, perhaps, the span of a whole doctoral thesis – otherwise such a system would already have been built14. Remaining persuaded that the comic-strip hypothesis is true and that contours can suffice for FER, we finally decided to scale down our objectives and, in accordance with the old maxim "less is more", we concentrated fully on a single FE: the smile.
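As a purely illustrative sketch of the normalisation idea of point 2 above (again assuming the OpenCV 1.x C API): relax the hysteresis thresholds until the image yields roughly a constant quantity of contour pixels. The target ratio, the starting threshold and the decay factor are invented example values, and the simple multiplicative loop is only one conceivable control strategy.

    #include <opencv/cv.h>

    // Returns an edge image holding roughly targetRatio * (number of pixels)
    // contour pixels, whatever the luminosity of the input image.
    IplImage* normalisedCanny(IplImage* gray, double targetRatio) {
        IplImage* edges = cvCreateImage(cvGetSize(gray), IPL_DEPTH_8U, 1);
        double high = 300.0;                 // start strict, then relax
        const int wanted = (int)(targetRatio * gray->width * gray->height);
        for (int iter = 0; iter < 20; ++iter) {
            cvCanny(gray, edges, high / 2.0, high, 3);   // low = high / 2
            if (cvCountNonZero(edges) >= wanted) break;  // enough contours
            high *= 0.8;                     // lower thresholds -> more edges
        }
        return edges;                        // caller releases the image
    }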
Bibliography

Abboud, B., Davoine, F., & Dang, M. (2004). Facial expression recognition and synthesis based on an appearance model. Signal Processing: Image Communication, 19(8), 723–740.
Azcarate, A., Hageloh, F., van de Sande, K., & Valenti, R. (2005). Automatic facial emotion recognition. Universiteit van Amsterdam, June.
Barrow, H. G., Tenenbaum, J. M., Bolles, R. C., & Wolf, H. C. (1978). Parametric correspondence and chamfer matching: Two new techniques for image matching. In Proc. DARPA IU Workshop (pp. 21–27).
Bishop, C. M., et al. (2006). Pattern recognition and machine learning. Springer, New York.
Bradski, G., & Kaehler, A. (2008). Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Media, Inc.
Breiman, L. (1984). Classification and regression trees. Chapman & Hall/CRC.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Canny, J. (1987). A computational approach to edge detection. Readings in Computer Vision, 184.
Cootes, T. F., Edwards, G. J., & Taylor, C. J. (1998). Active appearance models. Computer Vision – ECCV'98, 484.
Dalal, N., Triggs, B., & Schmid, C. (2006). Human detection using oriented histograms of flow and appearance. Computer Vision – ECCV 2006, 428–441.
Dempster, A. P., Laird, N. M., Rubin, D. B., et al. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1), 1–38.
Fasel, B., & Luettin, J. (2003). Automatic facial expression analysis: a survey. Pattern Recognition, 36(1), 259–275.
Fix, E., & Hodges Jr, J. L. (1989). Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review, 57(3), 238–247.
Freund, Y., & Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory (pp. 23–37).
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference (pp. 148–156). Citeseer.
Haykin, S. (1994). Neural networks: a comprehensive foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA.
Jones, J. P., & Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6), 1233.
Kanade, T., Tian, Y., & Cohn, J. F. (2000). Comprehensive database for facial expression analysis. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition (p. 46).
Lienhart, R., & Maydt, J. (2002). An extended set of Haar-like features for rapid object detection. In IEEE ICIP.
Li, S. Z., & Jain, A. K. (2005). Handbook of face recognition. Citeseer.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
Lucey, S., Ashraf, A. B., & Cohn, J. (2005). Investigating spontaneous facial action recognition through AAM representations of the face. Handbook of Face Recognition, 275–286.
Lyons, M., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998). Coding facial expressions with Gabor wavelets. In Proceedings of the 3rd International Conference on Face & Gesture Recognition (p. 200). IEEE Computer Society.
Mahalanobis, P. C. (1936). On the generalized distance in statistics. In Proceedings of the National Institute of Science, Calcutta (Vol. 12, p. 49).
Minsky, M., & Selfridge, O. G. (1961). Learning in random nets. Papers, 335.
Moore, S., & Bowden, R. (2007). Automatic facial expression recognition using boosted discriminatory classifiers. Lecture Notes in Computer Science, 4778, 71.
Rumelhart, D. E. (1989). The architecture of mind: A connectionist approach. 1, 133–159.
Vapnik, V., Golowich, S. E., & Smola, A. (1997). Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems 9.
Viola, P., & Jones, M. (2002). Robust real-time object detection. International Journal of Computer Vision, 57(2).
Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for search. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.

14 Built and published as open source by one of the dozens, indeed hundreds, of research teams which are aiming at the very same goal at this very moment!

3. Evaluation of the various versions of the smile detector smileD through the imitation game with the robotic face Roboto

Introduction

What is a facial expression? According to (Ekman & Friesen 1977), a facial expression is a combination of action units (AUs), where each AU corresponds to the movement of one of the facial muscles. The same authors postulate, in agreement with (Darwin 1872), that such combinations of AUs exist and are expressed and interpreted in a similar way among all peoples of the world. Even if such anthropological universality of facial expressions has been called into question by a recent study of (Jack et al. 2009) for FEs with negative affective content, the universality of the positive facial expression par excellence, the smile, is in no way affected by it.
Unlike other FEs such as fear, disgust or anger, the smile we shall speak about in this chapter15 is produced by the movement of a single muscle – the zygomaticus major (Ekman & Friesen 1982). But it is not only the simplicity and the universality of the smile which led us to concentrate our forces on the construction of a smile detector as soon as we realised that we had failed to build a more general FER system. Indeed, we have already said that the ultimate objective of our internship was to build an experimental apparatus which could serve in experiments, notably in the domains of development and of autism. Concerning development, there is evidence that the smile is one of the first channels of non-verbal communication between mother and child (Strathearn et al. 2008). As for autism, the study of (Dawson et al. 1990) demonstrated that "autistic children respond much less to the smiles of their mothers than normal children. Moreover, it was found that the mothers of autistic children respond much less to the smiles of their children than the mothers of normal children." Could it be, then, that the smile plays a role in the development of autism? In other words, could it be possible or conceivable that the breakdown of mother ↔ child communication through the channel of the smile is not one of the symptoms but, perhaps, one of the causes of this syndrome? Perhaps Roboto will be able to furnish us an answer.

15 Let us specify that another smile also exists, also called the Duchenne smile (Duchenne de Boulogne 1862), whose production requires the contraction of the muscles around the eyes, i.e. the orbicularis oculi.

Smile detection

But in order for Roboto to be able to answer this question, it must be equipped with a smile detector (SD). A few SDs already exist, such as the one embedded as the "smile shutter" in Sony cameras (Akita et al. 2010) or those reported by (Deniz et al. 2008) or (Whitehill et al. 2007). As none of these solutions is open source, we had to try to build our own SD. The article of (Whitehill et al. 2007) was particularly useful for choosing an appropriate method – its authors compared several feature-extraction methods, namely edge orientation histograms (c.f. note 9), Gabor filters (c.f. page 14) and rectangular filters, each in combination with two machine learning (ML) methods (SVM and boosting). It was the combination of rectangular filters + boosting which proved the most accurate and the fastest. This was good news, for it turned out that these "rectangular filters" are nothing but Haar features, which are very well integrated into the OpenCV library.

Haar features and the integral image

Haar-like features (HFs) are the raw material of the classifier we are about to present. A classifier based on HFs does not classify according to pixel intensities but according to the differences of intensity between two, three or four rectangular regions of pixels. A numerical HF value results from the addition or subtraction of the summed pixel intensities of these rectangular regions.

Figure 11: HFs integrated into OpenCV. In this wavelet representation, the white regions are interpreted as "add this region" and the black regions as "subtract this region" (Bradski & Kaehler, 2008).
Figure 12: Integral image ii. The sum of the pixels contained in the rectangle S (defined by the points A, B, C, D) can be computed by making only 4 references to the points A, B, C, D of the integral image.

These intensity sums of rectangular regions can be computed very rapidly, once one has constructed what the epochal16 article of (Viola & Jones 2001) calls the "integral image". An integral image ii contains at the coordinates x, y the sum of all the pixels of the source image situated above and to the left of the position x, y:

ii(x, y) = Σ_{x'≤x, y'≤y} i(x', y')

16 An article cited more than 3862 times less than 9 years after its publication is without doubt worthy of such an adjective.

In simple terms, the integral image is a mathematical "trick" which makes it possible to compute, very rapidly, once it has been constructed17, the sums of the pixel values within rectangular regions of the analysed image.

17 The main advantage of an integral image is that only a single pass through the source image is needed to construct it, by applying two recurrence equations: s(x,y) = s(x, y−1) + i(x,y); ii(x,y) = ii(x−1, y) + s(x,y). For comparison: the construction of a chamfer image (c.f. Annex 2) requires 2 passes through the source image (forward, then backward).
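A minimal, self-contained sketch of this trick, implementing exactly the two recurrences of footnote 17 and the 4-reference rectangle lookup of Figure 12 (all names are illustrative):

    #include <vector>

    typedef std::vector<std::vector<long> > Matrix;

    // One pass over the source image, using the recurrences of footnote 17:
    // s(x,y) = s(x,y-1) + i(x,y);  ii(x,y) = ii(x-1,y) + s(x,y).
    Matrix integralImage(const Matrix& img) {
        const size_t h = img.size(), w = img[0].size();
        Matrix s(h, std::vector<long>(w, 0)), ii(h, std::vector<long>(w, 0));
        for (size_t y = 0; y < h; ++y)
            for (size_t x = 0; x < w; ++x) {
                s[y][x]  = (y > 0 ? s[y-1][x] : 0) + img[y][x];  // column sums
                ii[y][x] = (x > 0 ? ii[y][x-1] : 0) + s[y][x];   // row accumulation
            }
        return ii;
    }

    // Sum of the pixels of the rectangle (x0,y0)..(x1,y1), inclusive: exactly
    // the 4 references A, B, C, D of Figure 12.
    long rectSum(const Matrix& ii, size_t x0, size_t y0, size_t x1, size_t y1) {
        long A = (x0 > 0 && y0 > 0) ? ii[y0-1][x0-1] : 0;
        long B = (y0 > 0) ? ii[y0-1][x1] : 0;
        long C = (x0 > 0) ? ii[y1][x0-1] : 0;
        long D = ii[y1][x1];
        return D - B - C + A;  // an HF value = additions/subtractions of such sums
    }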
Face detection according to the method of Viola and Jones

The objective of Viola and Jones was to build a robust and fast system for face detection. Their two major contributions – apart from the decision to use HFs as features and to accelerate the computation of their values by means of the integral image – were: 1) to use AdaBoost as a means of choosing the features and building the classifiers; 2) to chain the classifiers into an attentional cascade.

Feature selection by AdaBoost: Knowing that the number of possible HFs of a 24×24-pixel image exceeds 180,00018, it would be impossible to compute them all for every sub-window of the image in which we search for the object (the face, the smile) to be detected. The features must therefore be chosen, and Viola & Jones were the first to propose to do so with AdaBoost, having put forward the hypothesis that "a very small number of features can be combined to form an effective classifier". The method we presented in the preceding chapter was analogous to that of Viola & Jones: we looked for the features which best separate the samples of negative and positive examples. For each feature, the weak-classifier learning algorithm attempts to determine the optimal threshold for classification.

18 Note that, for an image, the number of possible Haar features is much higher than the number of its pixels (24×24 = 576 << 180,000).

The attentional cascade: Up to this point, the face detection procedure comprises the following hierarchy of representations: pixel intensities → sums of pixel intensities within a rectangular region → differences between several regions (HFs) → groups of HFs selected by AdaBoost as the most relevant for classification. The principal contribution of Viola & Jones was to add one more level to this computational hierarchy: to organise the chosen groups of HFs into the nodes of a rejection cascade.

Figure 13: The cascade of Viola & Jones.

The basic idea is that, in order for a search window (SW) to be classified as "containing a face", it must be evaluated as such by all the nodes of the cascade. Conversely, if an SW is classified as "faceless" by any node of the cascade whatsoever, the SW is immediately rejected and the software proceeds to the analysis of a new SW. Once all the SWs of the image have been evaluated in this way, detection is finished for the given image. Moreover, the nodes of the cascade are ordered in such a way that the nodes which are fastest to evaluate (i.e. those composed of the fewest HFs) are placed at the very beginning of the cascade. Thanks to this, a great number of faceless SWs are rejected after the evaluation of a very small number of HFs. According to the article of Viola & Jones, on average only about 10 features need to be evaluated per SW, which obliges the processor to look only about 10 × (640−25) × (480−25) × 4 ≈ 11.2 million times into the memory holding the representation of the integral image in order to find all the faces of size 25×25 pixels in a 640×480-pixel image. The method is thus so fast that today one can often see face detectors based on this principle embedded even in mid-range digital cameras.
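A sketch of the rejection logic just described, reduced to its bare skeleton (the Stage interface is an abstraction for an AdaBoost-trained group of HFs, not OpenCV's internal representation):

    #include <vector>

    struct Stage {   // one node of the cascade: an AdaBoost-selected group of HFs
        virtual bool accepts(int x, int y, double scale) const = 0;
        virtual ~Stage() {}
    };

    // A search window "contains a face" only if every node accepts it; the
    // first rejecting node discards it immediately. Since the cheapest nodes
    // come first, most windows die after the evaluation of only a few HFs.
    bool windowContainsFace(const std::vector<const Stage*>& cascade,
                            int x, int y, double scale) {
        for (size_t n = 0; n < cascade.size(); ++n)
            if (!cascade[n]->accepts(x, y, scale))
                return false;
        return true;
    }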
"Haartraining"

Even though there is much more to say on the subject, it would perhaps be superfluous to try to explain the tour de force of Viola & Jones in more detail within this report. We refer interested readers to the source article (Viola & Jones 2001), as well as to pages 506–516 of the book by (Bradski & Kaehler 2008), where the method is put in relation with OpenCV. It is precisely these pages which present the haartraining software. This software, an integral part of OpenCV, automates the training of classifiers based on the theory of Viola & Jones. Since rectangular regions are at the base of this approach, haartraining makes it possible to train classifiers – or detectors, for a detector is nothing but a classifier applied to every SW – for objects composed of regions or blocks. It would, by contrast, be vain to try to use haartraining to build a classifier capable of recognising tree branches. Fortunately, a zygomatic smile can be considered an object composed of blocks.

SMILEs

In order for haartraining to furnish us with the cascade describing an SD, we must build a sample of positive examples (i.e. composed of images containing a smile) and a sample of negative examples (i.e. composed of images containing no smile). The exact procedure we followed to arrive at the first versions of the SMILEs samples (Smiling Multisource Incremental-Learning-Extensible sample) is described in the article attached as Annex 3. Let us summarise it in a few words: we started from the "Labeled Faces in the Wild" (LFW) image database (Huang et al. 2007), which contains 13080 images (c.f. Figure 14). A small piece of software was programmed to permit – and facilitate – the manual sorting of LFW into two groups, positive and negative examples. For the positive examples, this software also allowed us to mark, by an easy click&drag&drop method, the region of interest (RoI) containing the smiling mouth. After a few hours of rather demanding19 work, we arrived at a positive sample containing 3606 images and a negative sample containing 9474 images.

From these samples, haartraining furnished us with version 0.1 of our smile detector (which we called smileD). We then put this first version to the test by applying it to new images which we were sure contained smiles (i.e. images belonging to the Genki4K database (Whitehill et al. 2007) and/or images we had automatically downloaded from the site flickr.com by searching for the keyword "smile").

Figure 14: A few "positive" examples of RoIs containing smiles, extracted from the LFW database.

As this first version of the detector recognised some smiles there, we could extend the base sample with the new, this time automatically labelled, images, without any manual intervention being necessary. It is precisely for this reason that we put the terms Incremental-Learning-Extensible into the title of our project.

SMILEd

By feeding the 5 different versions of the SMILEs samples to the haartraining software (c.f. paragraph E of the article in Annex 3 for the training parameters), we obtained 5 different versions of the smile detector smileD. SmileD is coupled with a face detector. In other words, the software first tries to detect a face and, if it succeeds, searches the lower central part20 of that face for the smiling mouth. This coupling – which we believe reasonable, since we have not yet seen a smile that was not embedded in a face – was also taken into account during the construction of the training samples, for we put only images of faces without smiles into the sample of negative examples. This sample was meant to contain nothing but background images.

19 Demanding from a cognitive point of view. Indeed it often happened to us, after a few hours devoted to the demarcation of many and many an RoI, that we began to perceive smiles even where there were none.
20 Paragraph 310 of (Da Vinci & Richter 1970) indicates: "The space between the parting of the lips [the mouth] and the base of the nose is one-seventh of the face...The space from the mouth to the bottom of the chin is the fourth part of the face and equal to the width of the mouth...The space from the parting of the lips to the top of the chin, that is where the chin ends and passes into the lower lip of the mouth, is the third of the distance from the parting of the lips to the bottom of the chin and is the twelfth part of the face. From the top to the bottom of the chin is the sixth part of the face and is the fifty fourth part of a man's height".

From our attempts resulted five XML files of 100 to 300 kilobytes, which anyone will be able either to adjust to their own liking or to embed in their own software, as soon as we publish them as an open source package. In order to evaluate two versions (v0.1 and v0.5) of smileD under everyday conditions, we devised 2 experiments in which Roboto rendered us service.

Method

Experiment 1 – Participants: Fifteen participants (11 men, 4 women) were invited to play the "imitation game". They were seated facing Roboto and instructed to "do the same thing as the robot". The distance between the participants' face and the camera varied between 50 and 100 cm, according to the requirements of the participants' comfort.
Experiment 2 – Participant: A single participant (male, 27 years old) played the same imitation game as the participants of Experiment 1. He performed it first in "bearded" mode and then, the next day, in "shaven" mode. For each mode there were two sessions, one at a distance of 50 cm, the other at 100 cm. The luminosity of the environment remained identical across the sessions.

Roboto: During both experiments, Roboto's movement was driven by a program, immitation_game.c, written in C++, which sent the robot the sequences encoding four expressions: smile, surprise, sadness and the neutral FE (c.f. Figure 4 and Table 1). Each sending of a sequence – 23 in total for each subject – was followed by a time interval during which 42 images were recorded. In order to reduce interference between successive expressions, smile, surprise and sadness were always followed by the neutral FE. Conversely, the neutral expression was always followed by one of the three affective expressions, their order being defined at random.

Image analysis: The images were divided into two classes: the positive ones, presumed to contain a smile since they were recorded after the sending of the "smile" instruction; and the negative ones, recorded during the interval following the sending of the "surprise" or "sadness" instruction. Each obtained image (i.e. 23 × 42 = 966 in total per subject) was analysed by the frontal_face detector supplied with the OpenCV library. If a face was detected, the smile detectors smileD v0.1 and v0.5 were put to the test within the region of interest defined by the central three fifths of the lower third of the face. When smileD found no rectangular region likely to contain a smile, the detector returned the value 0. Conversely, if such a region was identified, the function cvHaarDetectObjects(), which is at the heart of smileD, returns the number of mutually overlapping regions, all of which are likely to contain a smile. The integer thus obtained was called the "smile intensity" by (Deniz et al. 2008), and we too refer to it by this name.
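The per-image measurement just described can be summarised by the following sketch, assuming the OpenCV 1.x C API. The ROI arithmetic encodes "the central three fifths of the lower third of the face", and the summing of the neighbors attribute is one plausible reading of the "smile intensity" of (Deniz et al. 2008); detection parameters are illustrative defaults, not our exact settings.

    #include <opencv/cv.h>

    int smileIntensity(IplImage* frame,
                       CvHaarClassifierCascade* faceCascade,
                       CvHaarClassifierCascade* smileCascade,
                       CvMemStorage* storage) {
        cvClearMemStorage(storage);
        CvSeq* faces = cvHaarDetectObjects(frame, faceCascade, storage,
                                           1.1, 3, CV_HAAR_DO_CANNY_PRUNING,
                                           cvSize(25, 25));
        if (!faces || faces->total == 0) return 0;       // no face -> no smile
        CvRect f = ((CvAvgComp*)cvGetSeqElem(faces, 0))->rect;
        // central three fifths of the lower third of the detected face
        CvRect roi = cvRect(f.x + f.width / 5, f.y + 2 * f.height / 3,
                            3 * f.width / 5, f.height / 3);
        cvSetImageROI(frame, roi);
        CvSeq* smiles = cvHaarDetectObjects(frame, smileCascade, storage,
                                            1.1, 1, 0, cvSize(0, 0));
        cvResetImageROI(frame);
        int intensity = 0;  // count of overlapping hit regions ("smile intensity")
        for (int i = 0; smiles && i < smiles->total; ++i)
            intensity += ((CvAvgComp*)cvGetSeqElem(smiles, i))->neighbors;
        return intensity;
    }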
ROC curves: Having thus a "smile intensity" for every image in which a smile had been detected, we could use this quantity as a discrimination threshold (cutoff) to construct ROC (Receiver Operating Characteristic) curves. ROC curves are the most common means of representing the overall performance of a classifier, for they represent the classifier's performance – defined by the number of TPs and FPs relative to the TNs and FNs respectively – under conditions which vary with the value of the discrimination threshold. As the ML community often uses the measure "area under the curve" (AUC) to compare classifiers, we computed the corresponding AUC values thanks to the ROCR library (Sing et al. 2005) of the R language (Team 2006). The first 12 images of each sequence were not taken into account during the construction of the ROC curves, because they corresponded to transition periods between two FEs. Eight ROC curves were constructed, one for each combination of the experimental conditions.

Figure 15: The ROC curves for the various experimental conditions.

Results

Experiment 1: The recorded images turned out to be completely unusable, having been mislabelled owing to a bug in the immitation_game.c software.

Experiment 2: The AUC for version 0.5 of the smileD detector, when put to the test at a distance of 50 cm, was 99.6% for the session during which the subject was bearded and 97.75% when he was shaven. As for smileD version 0.1, still at 50 cm, the performance was 99.4% for the shaven subject but only 90% when the subject was bearded. The detectors proved less accurate when the subject sat one metre from Roboto's camera: more precisely, 58.2% AUC for detector version 0.5 with the bearded subject and 64.4% AUC with the shaven subject. For smileD version 0.1, the results obtained were 69.6% for the bearded mode and 70.1% for the shaven mode.

Figure 16 shows the evolution of the "smile intensity" quantity across the video sequence recorded after the sending of the "smile" instruction to Roboto. Only the data for the 50 cm Distance factor were taken into account. About 7 images after the sending of the instruction, we observe a sharp increase of smile intensity up to a peak reached between the tenth and the twentieth image21. Note that, for the construction of Figure 16, the intensity values obtained were averaged across sessions.

Figure 16: The evolution of smile intensity in time. In both cases, the smile intensity culminates about one second after the expression of the smile by Roboto, then falls back towards smiles considered by the detector as "less marked", then rises again...

Discussion

In the ML community, the AUC measure is often interpreted as "the probability that the classifier will attribute a higher score to a randomly chosen positive example than to a randomly chosen negative one" (Fawcett, 2006). Given that within this research the term score is a synonym for smile intensity, we observe that we succeeded in building an SD (smileD v0.5) which, in more than 99.6% of cases, attributes a higher smile intensity to any image recorded during the time interval following the sending of the "smile" instruction to Roboto than to one recorded during the interval following the sending of the "sadness", "surprise" or "neutral" instruction.

21 Given that the recording speed was about 15 images per second, the intensity peak was reached roughly 1 second after the sending of the instruction to the robot. Unfortunately, all interpretations of Figure 16 in terms of an absolute "reaction time" must be rejected as imprecise, since the position of an image within the sequence gives only indirect information about the timestamp of the recorded image. This is due to the fact that the recording speed varies with the state of the computer at the moment of the experiment, so exact temporal coordinates cannot be inferred from it. Hence the importance of adding, in the next version of immitation_game.c, code that will make it possible to record temporal data with millisecond precision.

What is more, the similarity of the AUC values for the "shaven" and "bearded" modes indicates that we have indeed built an SD robust against a certain variability proper to the object to be recognised. This proposition is further supported by the results of the article in Annex 3, which show that the performance of smileD rises above 90% when confronted with the JAFFE image database.
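For completeness, here is a minimal sketch of the AUC interpretation quoted above (Fawcett, 2006): the probability that a randomly drawn positive example receives a higher score – here, a higher smile intensity – than a randomly drawn negative one, ties counting one half.

    #include <vector>

    double auc(const std::vector<int>& pos,    // smile intensities, positive images
               const std::vector<int>& neg) {  // smile intensities, negative images
        double favourable = 0.0;
        for (size_t i = 0; i < pos.size(); ++i)
            for (size_t j = 0; j < neg.size(); ++j) {
                if (pos[i] > neg[j])        favourable += 1.0;
                else if (pos[i] == neg[j])  favourable += 0.5;
            }
        return favourable / (pos.size() * (double)neg.size());
    }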
Let us note, finally, that neither the images belonging to JAFFE nor any of the images of the participant who underwent the "bearded/shaven" experiment (c.f. Figure 17) were part of the training sample.

Figure 17: The imitation game between Roboto and the bearded subject (sad Roboto / sad subject; surprised Roboto / surprised subject; Roboto smiles... ...a smile is detected!).

On the other hand, the fact that the SD presented above yields only very weak performance when put to the test against images taken from a distance of 100 cm points to one of the weaknesses of the first versions of smileD. Since the approach of recognising objects by means of rectangular Haar features is supposed to be invariant with respect to the size (and hence to the distance) of the object to be recognised (Bradski & Kaehler 2008), the results obtained indicate that the versions of smileD built so far are far from being final solutions. We nonetheless have reasons to believe that the problem linked to the size of the smile will be solved in the weeks to come. Not only do we believe that we finally understand the theory of Viola & Jones as well as the subtleties of the haartraining software; we also suspect that the above-mentioned problem is due to the fact that we used values that were too high, width=43 and height=19, as training parameters for the first five versions of smileD. Beyond that, we believe that the SMILEsmileD project, containing both the SD smileD and the SMILEs training sample, has a certain chance of succeeding, if only it is noticed by the international community, which will refine and enrich it in accordance with the open source philosophy. Such are our hopes (TSNE).

The possible uses of an SD are innumerable. Let us leave aside its commercial applications – such as plugins for interaction within social networks – or its military ones – enabling the system to better recognise individuals who do not conform to the norm (Huxley 1969) and refuse to express themselves like Hollywood stars. Let us leave all that aside and reflect upon two uses worthy of the apparatus we have attempted to build. The first use is linked to the therapy of emotional and affective disorders. It relates to a phenomenon we have seen repeat itself over and over since Roboto arrived at Lutin: though it is nothing but a heap of scrap metal, its smile has always made those who looked at it smile back. We believe the force of this phenomenon can only grow once genuine imitation – a genuine temporal attunement between man and machine – is put in place. The auto-catalysis of good humour will bear its fruits, and a robust SD is the first step towards that objective. The second use is linked to the domain of artificial intelligence (AI) – more precisely to developmental AI, where the use of an SD is reported by (Movellan et al. 2007) – or to the domain of machine pedagogy. Given that 1) the smile is a natural means by which a human being, a human teacher, expresses his contentment; and given that 2) the first teachers of machines are and will be human beings, a smile seems to us the most appropriate means of positive reinforcement (Skinner 1976) of machine behaviour.
The principle is quite simple: the algorithm will give more weight to the representations of performed actions and perceived situations which are immediately followed by a smile.

These, then, are two uses in which the smile plays the principal role. The smile is a gift which allows humans to become more human, a gift which allows them to overcome their syndromes, their illnesses, hatred, even death itself. And who knows whether, one day, it will not allow the same to machines? TSNE.

Bibliography

Akita, M., Marukawa, K., & Tanaka, S. (2010). Imaging apparatus and display control method.
Bradski, G., & Kaehler, A. (2008). Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Media.
Darwin, C. (1872). The expression of the emotions in man and animals; with an introduction, afterword, and commentaries by Paul Ekman. NY: Oxford University.
Da Vinci, L., & Richter, J. P. (1970). The notebooks of Leonardo da Vinci. Dover Publications.
Dawson, G., et al. (1990). Affective exchanges between young autistic children and their mothers. Journal of Abnormal Child Psychology, 18(3), 335–345.
Deniz, O., et al. (2008). Smile detection for user interfaces. Advances in Visual Computing, 602–611.
Duchenne de Boulogne, G. B. (1862). The mechanism of human facial expression. Paris: Jules Renard.
Ekman, P., & Friesen, W. V. (1982). Felt, false, and miserable smiles. Journal of Nonverbal Behavior, 6(4).
Ekman, P., & Friesen, W. V. (1977). Manual for the facial action coding system. Consulting Psychologist.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
Huang, G. B., et al. (2007). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. University of Massachusetts, Amherst, Technical Report, 57(2), 07–49.
Huxley, A. (1969). Brave New World. 1932. New York: HarperPerennial, 246.
Jack, R. E., et al. (2009). Cultural confusions show that facial expressions are not universal. Current Biology.
Movellan, J. R., et al. (2007). The RUBI project: a progress report. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (p. 339).
Sing, T., et al. (2005). ROCR: visualizing classifier performance in R. Bioinformatics.
Skinner, B. F. (1976). Walden Two revisited. In B. F. Skinner, Walden Two (reissued).
Strathearn, L., et al. (2008). What's in a smile? Maternal brain responses to infant facial cues. Pediatrics, 122(1).
Team, R. D. C. (2006). R: A language and environment for statistical computing.
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proc. IEEE CVPR 2001.
Whitehill, J., et al. (2007). Developing a practical smile detector. Submitted to PAMI, 3, 5.

Annex 1. The functioning of the AdaBoost algorithm
Annex 2. The construction of chamfer images in OpenCV
Annex 3. Semi-supervised haartraining of a fast&frugal open source zygomatic smile detector

A gift to OpenCV community

Daniel Devatman Hromada, prof. Charles Tijus
Lutin Userlab, Ecole Pratique des Hautes Etudes, Cognition Humaine et Artificielle (ChART), Université Paris 8

Abstract—Five different versions of OpenCV-positive XML haarcascades of zygomatic smile detectors, as well as the five SMILEs samples from which these detectors were derived, have been trained and are presented hereby as a new open source package. The samples have been extended in an incremental learning fashion, exploiting a previously trained detector in order to add and label new elements of the positive example set.
After coupling with an already known face detector, overall AUC performance ranges between 77% and 90.5% when tested on the JAFFE dataset, and a speed of <1 ms per frame is achieved when tested on webcam videos.

Keywords—zygomatic smile detector; cascade of haar feature classifiers; computer vision; semi-supervised machine learning

I. INTRODUCTION

A great amount of work is being done in the domain of facial expression (FE) recognition. Of particular interest is the FE at the very base of mother-baby interaction [1], an FE interpreted unequivocally in all human cultures [2] – the smile. Maybe because of these reasons, maybe because of some others, smile detection is already of certain interest to the computer vision (CV) community – be it for a camera's smile shutter [3] or in order to study robot2children interaction [4]. Nonetheless, a publicly available, i.e. open source, smile detector is missing. This is somewhat stunning, especially given the fact that a "smile" can be conceived as a "blocky" object [5] upon which a machine learning technique based on the training of cascades of boosted haar-feature classifiers [6] can be applied, and that the tools for performing such a training are already publicly available as part of the OpenCV [5] project. Verily, with the exception of the detectors described in [7][8], which have not been publicly released, we did not find any reference to a haarcascade-based smile detector in the literature. We aim to address this issue by making publicly available the initial results of our attempts to construct a sufficiently descriptive SMILing Multisource Incremental-Learning Extensible Sample (SMILEs) and five smile detectors (smileD) generated from this sample. From a more general perspective, our aim was to study whether one can use already generated classifiers in order to facilitate such a semi-supervised extension of the initial sample that a more accurate classifier can subsequently be trained.

A. SMILE sample (SMILEs)

The aim of the SMILEs project is to facilitate and accelerate the construction of smile detectors for anyone willing to build one. Since it is the OpenCV library which dominates the computer vision community, the SMILEs package is adapted to the needs of OpenCV in the sense that it contains: 1) a negative examples directory; 2) a positive examples directory; 3) negatives.idx – a list of files in the negative examples directory; 4) positives.idx – a list of files in the positive examples directory with associated information containing the coordinates of the region of interest (ROI), i.e. the coordinates of the region within which a smile can be located. SMILEs is considered "Multisource" because it originates as an amalgam of already existing datasets like LFW and Genki, both of which are themselves collections of images downloaded from the Internet. Images from the POFA [9] or Cohn-Kanade [10] datasets were not included in SMILEs, since the restricted access to these datasets is in contradiction with the open source approach1 of the SMILEs project.

B. Smile Detector (smileD)

SMILEs are "Incremental-Learning Extensible" in the sense that they allow us to train new versions of smile detectors which are subsequently applied to new image datasets in order to facilitate (or even fully automatise) the labeling of new images, and hence to extend the original SMILEs with new images. Simply stated, SMILEs allow us to train smileD, which helps us to extend SMILEs, etc.
Since the training of haar cascades is an exhaustive threshold-finding process demanding a non-negligible amount of time and computational resources, 5 pregenerated OpenCV-compatible XML smileD haarcascades were trained by the opencv-haartraining application and are included with SMILEs in our open source SMILEsmileD package, so that anybody interested can implement our smile detector in copy&use fashion.

1 Both SMILEs & smileD cascades are publicly available from http://github.com/hromi/SMILEsmileD as a GPL-licensed package. C++ source codes of the select&crop application for easy manual sample creation and of a face-coupled video stream smile detector are included as well.

II. METHOD

C. Initial Training Datasets

The SMILEs project in its current state unites 3 image sets:

• Labeled Faces in the Wild (LFW) dataset – the LFW dataset [11] contains more than 13000 images of faces collected from the web; its cropped version contains only 25×25-pixel regions detected by OpenCV's frontal face detector. No information about the presence/absence of a smile within the image is given.

• Genki4K dataset – Genki4K is a publicly available part of UCSD's Genki project [12] containing 4000 images downloaded from the Internet. A text file indicating the presence/absence of a smile in a given image is included.

• Ad hoc Flickr dataset – We used the search keyword "smile" to download more than 4200 additional pictures from the image-sharing website flickr.com. More than 2600 of them contained at least one smiling face.

D. Construction of SMILEs datasets

We have created five different versions of SMILEs. All these versions exploit the same negative sample set of LFW's non-smiling images. All manual labeling focused solely on the zygomatic smile (ZS) region2:

• Version 0.1 is based solely upon the LFW dataset. All pictures were manually labeled with our ad hoc region selection & cropping application and divided into samples of positive (3606 images) and negative (9474 images) examples.

• Version 0.2 added 2666 manually labeled images downloaded from flickr.com to the positive examples already contained in 0.1. Labeling & region selection were realised with the same application as in the case of 0.1.

• Version 0.3 also extended the positive & negative example samples of version 0.1 with images from flickr. This time, however, the flickr-originated images weren't labeled manually; instead, the smile-containing regions of interest were determined automatically, by applying smileD version 0.1 to the set of downloaded images. 1372 ROIs (1 ROI per image) were identified & labeled in this way.

• Version 0.4 is analogous to version 0.3 in the sense that it is essentially the version 0.1 sample to which automatically labeled positive examples were added. Differently from version 0.3, Genki4K and not flickr was exploited as the source of additional data. Simply stated, the positive examples from Genki4K labeled as smile-containing by its authors – 624 of them in total – were added to the initial LFW-based sample.

• Version 0.5 unites versions 0.3 and 0.4, i.e. both the Genki4K- and flickr-originated images which were automatically labeled by smileD v0.1 were added to the LFW samples.

E. SMILEs -> smileD Training

Identical haarcascade training parameters [width=43, height=19, number of stages=16, stage hit rate=0.995, stage false alarm rate=0.5, weak classifier decision tree depth=1 (i.e. stump), weight trimming rate=0.95] were applied for the training of all five smileD versions, one smileD corresponding to one SMILEs, both referenced by the same version number.
F. smileD evaluation

The training phase of every new version of smileD was followed by measuring its performance on the Japanese Female Facial Expression (JAFFE) dataset, in order to evaluate the performance of the different versions of the smileD classifiers when applied to a sample having luminosity conditions different from those of any imageset included in the training sample. The detectors were face-detector-coupled during testing, i.e. smile detection was performed iff a face was detected in a tested image, and only in the ROI defined by well-known geometric ratios [13]. Receiver operating characteristic (ROC) curves were plotted and AUC ("area under ROC curve") values were calculated as performance measures by means of the ROCR library [14]. "Smile intensity" [7], i.e. the number of overlapping neighboring hit regions3, was used as a cutoff parameter.

III. RESULTS

FIGURE I. SMILED ROC CURVES

TABLE I. BASIC COMPONENTS OF INITIAL VERSIONS OF SMILES & SMILED PROJECT

    Version  LFW manual  Flickr manual  Flickr auto  Genki auto  Total pos.  Neg. ex.
    0.1      3606        0              0            0           3606        9474
    0.2      3606        2666           0            0           6262        9474
    0.3      3606        0              1372         0           4978        9474
    0.4      3606        0              0            624         4230        9474
    0.5      3606        0              1372         624         6572        9474

TABLE II. ROC'S "AREA UNDER CURVE" PERFORMANCE OF DIFFERENT VERSIONS OF SMILED DETECTOR

    Version  AUC
    0.1      77.94%
    0.2      85.49%
    0.3      83.93%
    0.4      90.21%
    0.5      90.51%

2 The ZS region was defined only vaguely as a rectangular ROI in whose center are smiling lips – in preference with uncovered teeth. The whole ROI is bordered by smile & nasolabial wrinkles.
3 Can be obtained from the undocumented neighbors attribute of the CvAvgComp sequence referenced by cvHaarDetectObjects.

DISCUSSION

The detectors we present hereby exploit the top-bottom approach, i.e. they are face-coupled. Knowing that there can be no smile without the face within which it is nested, we first detect the face by an OpenCV face detection solution, and then smileD is applied only in a very limited ROI of the face's bottom third. The consequences of our decision to create a face-coupled smile detector are twofold: 1) since by definition we search for a smile only within the face, we used only non-smiling faces as negative examples (i.e. background images); 2) smile detection itself is very fast, once the position of the face is specified. When applied to webcam-originated (320×240 resolution) video streams, the time needed for smile detection never exceeded 1 ms per frame on a Mobile Intel(R) Pentium(R) 4 CPU (1.8 GHz), suggesting that our detector could potentially be embedded even in mobile devices disposing of more modest computational resources. SmileD's speed can somewhat neutralise the accuracy handicap it has in comparison with the results reported in [8]. In its current state, our approach suffers from somewhat high false alarm rates, but our research indicates that under real-life conditions these can be reduced to a great extent by taking into account the dynamic sequence of subsequent frames, since the probability of the same false alarm occurring in all the frames of a sequence is proportional to the product of the probabilities of the occurrence of that false alarm in every frame taken individually (e.g. a false alarm occurring independently with probability 0.1 in a single frame persists through five consecutive frames with probability of only 0.1^5 = 10^-5). High speed is therefore of utmost importance, and the analysis of sequences of frames can substantially reduce the number of false positives. The tuning of training parameters and the extension of the negative example set remain as other possibilities for augmenting the accuracy of our project.
Tab. 2 indicates that the accuracy of semi-supervised classifiers like smileD gets saturated at a certain limit which can possibly be surmounted only by extension of the negative sample set. In the case of smile detection, we suggest that extension of the negative example sample with more images containing the "upper lip raiser" action unit (AU 10) – teeth-uncovering4 but associated with disgust rather than smile – could yield some significant increases in accuracy, as reported by [9]. Since such an extension is relatively easy and not very time-consuming, provided that such AU10-containing images are given and marked as negative examples, it may be the subject of future research. In this study, however, we left the negative examples unchanged in order to study the effectiveness of the "Incremental Learning" approach, during which an old detector is used to facilitate the extension of the positive example sample, thanks to which a new detector is obtained. Since the semi-supervised smileD versions v0.4 and v0.5 outperformed v0.2, for which manual labeling was implemented, while the latter performed only slightly better than v0.3, which exploited an identical flickr-originated imagebase to that of v0.2, it is not unreasonable to think that such a semi-supervised incremental training approach can be a feasible solution for the training of haarcascade detectors. If that is the case, it could possibly be stated that the machine has started, in a certain sense, to ground [15] its own notion of smile.

4 From the anatomical point of view, the disgust-expressing AU10 is associated with the Levator Labii Superioris muscle, while the smile associates with the Zygomaticus Major muscle (AU12).

ACKNOWLEDGMENT

We would like to thank the third section of EPHE, University Paris 8 and CROUS de Paris for their kind support.

REFERENCES

[1] L. Strathearn, J. Li, P. Fonagy, and P.R. Montague, "What's in a smile? Maternal brain responses to infant facial cues," Pediatrics, vol. 122, 2008, p. 40.
[2] C. Darwin, P. Ekman, and P. Prodger, The expression of the emotions in man and animals, Oxford University Press, USA, 2002.
[3] M. Akita, K. Marukawa, and S. Tanaka, "Imaging apparatus and display control method," 2010.
[4] J.R. Movellan, F. Tanaka, I.R. Fasel, C. Taylor, P. Ruvolo, and M. Eckhardt, "The RUBI project: a progress report," Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, 2007, p. 339.
[5] G. Bradski and A. Kaehler, Learning OpenCV, O'Reilly Media, Inc., 2008.
[6] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proc. IEEE CVPR 2001.
[7] O. Deniz, M. Castrillon, J. Lorenzo, L. Anton, and G. Bueno, "Smile detection for user interfaces," Advances in Visual Computing, p. 602–611.
[8] J. Whitehill, M. Bartlett, G. Littlewort, I. Fasel, and J. Movellan, "Developing a practical smile detector," Submitted to PAMI, vol. 3, 2007, p. 5.
[9] P. Ekman and W.V. Friesen, Pictures of facial affect, Palo Alto, CA: Consulting Psychologists Press, 1976.
[10] T. Kanade, Y. Tian, and J.F. Cohn, "Comprehensive database for facial expression analysis," Proc. IEEE FG, 2000, p. 46.
[11] G.B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," University of Massachusetts, Amherst, Technical Report, vol. 57, 2007, p. 07–49.
[12] J. Whitehill, G. Littlewort, I. Fasel, M. Bartlett, and J. Movellan, "Toward practical smile detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, p. 2106–2111.
[13] L. Da Vinci and J.P. Richter, The notebooks of Leonardo da Vinci, Dover Publications, 1970.
[14] T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, "ROCR: visualizing classifier performance in R," Bioinformatics, 2005.
[15] S. Harnad, "The symbol grounding problem," Physica D, vol. 42, 1990, p. 335–346.

Epiprologue22

Monsieur Charles Tijus
LUTIN – UMS CNRS 2809
Cité des Sciences et de l'Industrie

During my M2 internship of the CNA SVT studies at the E.P.H.E., I would like to teach robots to smile. To do so successfully, two questions must first be answered: How? and When?

To answer the question "How?" I shall have to study the theory of emotional expressions; I shall have to understand what happens on the face when one smiles – "which muscles are relaxed? which are contracted?" etc. In short, I will study the theory of emotions and their facial expressions. Then, to bring this theory closer to the reality of robots, I will study the manual, the functioning and the "instruction set" of the robotic face "Roboto". I will create a little script in the PERL programming language (output of the first semester???) thanks to which I will be able to send commands like … to this face.

As to the question "When will the robot23 smile?", I answer right away: she will smile when the human being she sees smiles at her. Thus the robot will imitate the human being in front of her, like a baby who does the same from its first moments in this world. Even if the basic question is thereby answered, the technical aspect of this imitation poses problems which I would like to try to address during my internship. In order to be able to mimic, the robot must first recognise which emotion is to be mimicked. A means must therefore be found of recognising the emotion from the data contained in the photo taken by the cameras in the robot's eyes. Since in our world it is certain that someone has already posed this question and answered it at least partially, I will first study "the state of the art of automatic recognition of facial emotions" in order then to choose the appropriate method for achieving it (the initial candidates for this method are: artificial neural networks, SVM (support vector machine), the openCV library or the faceAPI software; or else a hybrid system of these solutions). My objective will above all be an open source piece of software or a PERL library (output of the second semester???) which will allow other researchers to interact with machines by means of the facial expressions of emotions in an easier and more efficient way.

Subsequently, the robotic face Roboto, capable of mimicking at least 2-3 basic emotions, could be used not only within the framework of several research experiments, including those on the emotional intelligence of autistic children, but also as a bridge towards those artificial beings, some of which come ever closer to the very essence of humanity.

Paris, 18/06/2009, Daniel D. Hromada

22 We leave this "internship proposal" in its original version, i.e. with all its spelling mistakes (here rendered into English).
23 It is only here that we say out loud what we have wanted to say since the beginning of our work: the word robot comes from the work R.U.R. of the author Karel Čapek and is derived from a word common to several Slavic languages, [robota], signifying "work" or, better still, "corvée". This word is of feminine gender (declension paradigm: /žena/, i.e. woman).
It is perhaps for this reason that, throughout our internship, we considered Roboto a being of feminine rather than masculine essence. That said, there is nothing to add, except ...

Small treatise concerning the concepts of «invasivity» and «reversibility» and their relation to past, present and future techniques of neural imagery

Introduction
The aim of this text is threefold. Firstly, to prove to the Teacher that the author of this article (i.e. the Student) has sufficiently internalized the facts presented during the UE Neuroimagery. Secondly, the Student aims to introduce the notion of «invasivity» as something which should be considered very seriously by anyone who seeks an «ideal method» for conducting his future (neuro)scientific experiments towards success. But the ultimate aim is to show that certain «philosophical schools» which point to the «invasivity-related aspects» of current neuroscientific research are not doing so from the position of moralizing savants locked in their ivory towers; they do so for concrete and highly pragmatic reasons related to the purest expressions of the highest scientific practice.

The principal thesis of this text states that the «invasivity» and «reversibility» aspects of a chosen experimental method should determine the experimenter's choice at least as significantly as other aspects like spatial/temporal resolution characteristics, signal/noise ratio or economic feasibility.

The first part of the text is dedicated to highly invasive techniques of tissue extraction and analysis by means of electron, multiphoton or confocal microscopes. Post mortem autopsy and surgical interventions like vivisection or lobotomy will be mentioned when discussing this group. The common denominator of these approaches is that the sine qua non condition of their realisation is a non-reversible and fatal degradation of one of the vital functions of the organism under study, or... death.

The second part of the text is dedicated to somewhat more reversible, nonetheless still very brutal «in vivo» techniques like calcium imaging, optical imaging or electrode implantation. Because it is evident that such approaches can inflict severe injuries and suffering upon the organisms under study, they will be labeled «partially reversible quasi in vivo techniques». Contrary to the common categorisation of our days, even techniques like PET (positron emission tomography) or X-ray imaging will be included in this middle group of partially invasive techniques. This is due to their high-energy kinship with radioactivity, which can without any doubt induce mutations resulting in the disequilibrium of a living system commonly known as «loss of health». The loss of this precious equilibrium is the reason why we will include all the luminescence/fluorescence marker techniques in this category as well.

The third part of the text aims to bring hope. It will be fully devoted to techniques which can be considered fully reversible: the focus will definitely be on Magnetic Resonance Imaging (MRI) and Electroencephalography (EEG), while other non-invasive techniques (NIRS, echography or TCD) will be excluded from the list due to the lack of the Student's personal experience with them. A small part of this final part will be dedicated to a «what if?» speculation proposing to use these pure and elegant techniques not only for imaging, but also as a tool of healing practice.
These three parts can be considered the core of the Student's homework, which asks him to «highlight the advantages and limits of these techniques depending on the scientific question You'll pose». The question posed by the Student is this: «According to what criteria could we possibly quantify the invasivity of an experimental tool or method?» This text will try to answer this question by introducing a term which we hereby label the «Information/Invasivity Quotient» (IIQ). We'll analyse this notion from a more ethical perspective in the Discussion section, while the Appendix will summarize an IIQ-based ranking of four presented methods.

1. Non-reversible techniques

In every living germ there is a creative idea which develops and manifests itself through organisation. During its whole lifetime, the living being remains under the influence of this creative vital force, and death comes when it can no longer realise itself.
Claude Bernard, «prince of vivisectors»

1.1 Death
Death is a transformation of a system from a living state into a non-living state. It is evident that the introduction of death into an experimental procedure leads to a non-reversible loss of structure, and hence its IIQ1 should have a value less than zero. Because of its essentially qualitative nature, it is very difficult, if not impossible, to quantify the invasivity of such a transformation. One approach – a strongly categoric one – could be to define its value as «minus infinity», but by introducing infinities into our quantification schema we would de facto exclude and forbid the killing of an animal during the experimental procedure. We doubt that such an approach could be accepted by the contemporary scientific community. We propose a somewhat more pragmatic and less categoric approach: introducing death into an experimental procedure should decrease the procedure's IIQ to an extent which is proportional to the complexity of the organism under study. Hence, for example, for procedures demanding «sacrifices» of complex animals like primates, the IIQ should be -7; for other vertebrates it could be somewhere around -5; -3 for insects; -1 for plants etc.2

Experimental techniques whose implementation implies death can be divided into:

Macroscopic: namely, the surgical in vivo procedure called vivisection. Aristotle introduced it, Galen made a science out of it and the Western tradition perfected it. The application of this technique in physiology in general, and in the domain of Neuroimagery in particular, is today considered obsolete.

Microscopic: when applied in the domain of biology, physiology and the neurosciences in general, microscopes are devoted to the study of tissues. This tissue is either extracted (c.f. Section 1.2 below) or studied in vivo (we'll refer to this partially in Sections 2.1 and 2.2). In either case, one first has to gain access to the tissue. The harm to the organism under study is often so severe that the only thing one can do with the animal after an experiment (if it does not die on its own) is to kill it. Since «it costs only 2 euros a piece», as one of our teachers told us, approximately 50-100 million (Hendriksen, 2005) bodies of dead vertebrates are annually thrown into the waste baskets of academic institutions.

When speaking about the role of death in experimental Neuroscience, one should not omit revolutionary works like (Broca, 1861) or (Wernicke, 1874). Since these were post mortem studies, i.e. the subject died a natural death, the role of death did not decrease the IIQ of the given study. On the contrary, the IIQ of these studies is highly positive, since no suffering was caused and a huge amount of new information/knowledge was obtained. It is possible that even in the forthcoming century of nanotechnology, such post mortem studies have not said their last word. They could prove particularly fecund when combined with highly advanced cryogenic methods.

1.2 Cuts
Death being the most drastic, it is definitely not the only transformation during which information or a certain functional feature is irreversibly lost from the brain. Neurosurgical procedures like lobotomy or callosotomy (the disconnection of the cerebral hemispheres by cutting the central wiring of the brain, the corpus callosum) left aside, we suggest that even procedures like skull penetration (SP) and tissue extraction (TE) of even a thin cortical layer are acts of an irreversible nature. For the purpose of this homework, it has to be stated that electron microscopy cannot be done without a preliminary TE procedure. It can be argued, of course, that the plasticity of the brain is very high and that this amazing organ is able to recover even from severe TE. If such is the case, one can ask why an animal is usually killed after a TE-implying procedure. To reduce the number of such cases in the future, we propose to calculate the Information/Invasivity Quotient of TE and SP by these example formulas:

IIQ_TE = P_TE * (amount of tissue extracted / total amount of brain tissue)
IIQ_SP = P_SP * (damaged skull surface / total size of skull surface)

where P_TE and P_SP are «tissue extraction penalization» and «skull penetration penalization» coefficients which should, ideally, be defined by ethical committees independently for every species «involved» in experimental studies. Our highly arbitrary initial proposal is -1 > P_TE > -3 and -1 > P_SP > -2.

1 The basic axiom of the Information/Invasivity Quotient schema can be defined like this: an act which leads to a loss of vital information decreases the procedure's IIQ, while an act which generates new information (or even knowledge) increases the procedure's IIQ. For a more technical definition of what information is, see (Shannon & Weaver, 1949).
2 These numbers are more or less arbitrary and are subject to scientific discussion; we present them hereby just in order to clarify our «invasivity quantification» point.
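A minimal numeric sketch of these two example formulas follows; all values are hypothetical illustrations rather than measurements, and the coefficients are taken from the arbitrary ranges proposed above:

    use strict;
    use warnings;

    # Hypothetical penalization coefficients, taken from the proposed
    # ranges -1 > P_TE > -3 and -1 > P_SP > -2.
    my $P_TE = -2;
    my $P_SP = -1.5;

    # IIQ_TE: penalty proportional to the fraction of brain tissue extracted.
    sub iiq_te {
        my ($extracted, $total) = @_;    # e.g. in mm^3
        return $P_TE * ($extracted / $total);
    }

    # IIQ_SP: penalty proportional to the fraction of skull surface damaged.
    sub iiq_sp {
        my ($damaged, $total) = @_;      # e.g. in mm^2
        return $P_SP * ($damaged / $total);
    }

    # Hypothetical example: 5 mm^3 extracted out of 500 mm^3 of brain
    # tissue; 20 mm^2 of a 10000 mm^2 skull surface penetrated.
    printf "IIQ_TE = %.4f\n", iiq_te(5, 500);      # -0.0200
    printf "IIQ_SP = %.4f\n", iiq_sp(20, 10000);   # -0.0030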
2. Partially reversible «quasi in vivo» techniques

I quite agree that it is justifiable for real investigations on physiology; but not for mere damnable and detestable curiosity. It is a subject which makes me sick with horror, so I will not say another word about it, else I shall not sleep to-night.
Charles Darwin

2.1 Injections & Injuries
We hope that our method for invasivity quantification is acquiring more visible contours. It is now time to illustrate it on a concrete example. A mouse is «constructed» in such a way that the gene coding the enzyme «luciferase» will get expressed when switched on by the presence of an oncogene and of heat in the environment. When the mouse is sufficiently ripe for being «sacrificed», tumor replication is then activated by an oncogene injected into the body, let's say into the brain area. The experiment then consists in applying heat to the mouse's head; this will activate luciferase expression in the tumor cells. Luciferase will catalyse the light-emitting reaction of luciferin, a photoluminescent substance (present in fireflies, for example), which will emit light and give the experimenter information about the spatial distribution of tumors. Such is often the philosophy behind «optical imaging» experiments.
Highly sensitive CCD cameras incorporated into blackboxes costing hundreds of thousands of euros will then produce the final result: a low-resolution image from which it is evident that light (and therefore tumor cells) is present in the head of the animal. The discovery that «tumor cells are spreading from the area where the experimenter had injected them» is indeed stunning and worth publishing – one can hope for the obtention of new grants for new apparati3.

Another example, this one from the domain of «calcium imaging»: a bee is taken from her hive. She is fixed in the apparatus, anesthetized, and the top part of her «head» is removed. A dextran- or acetoxymethyl-ester-like molecule is chosen from the catalogue of the Alexa or Oregon corporation, bought, and injected into the upper layer of her central ganglion, upon which the confocal microscope's laser is focalised. The «stimuli» are given after the bee awakes from anesthesia. The possibility to observe calcium (and thus activation flow) in the cerebral networks is without any doubt a huge and non-negligible advantage of calcium «imaging techniques». It unites two important characteristics – it is microscopic and it is functional. In other words, its spatial resolution is very high (depending on the microscope, it can go down to nanometers) and its temporal resolution is almost realtime. Nonetheless, it has to be stated that the result of this technique is – apart from an invoice from the Oregon or Alexa corporations – an image with a few blinking pixel clusters supposedly containing non-generalizable information about the functioning of a minute part of a ganglion of the unlucky bee dying slowly in horrible pains.

3 And it is evident that the presence of a new experimental apparatus has to be justified by new «sacrifices».

It is evident that the suffering about which we are speaking here cannot be quantified, cannot be transformed into numbers. But since it seems that men and women in white coats believe only in numbers, and since it seems to us that it is of utmost importance to change the habits of these men and women as soon as possible, we have to try, at least. In addition to the already proposed IIQ_DEATH, IIQ_TE and IIQ_SP factors, we propose these further criteria for the quantification of the invasivity and moral acceptability of an experimental method:

IIQ_INJURY – penalization due to injury; proportional to the time which the animal will need for complete recovery.
IIQ_FIXATION – penalization due to the fixation of the animal in the apparatus; relative to the means and proportional to the temporal length of the fixation. Zero iff the animal is studied in its natural niche.
IIQ_BLEACHING – penalization due to tissue bleaching by strong microscopes (confocal and multiphoton).
IIQ_GENEMANIP – penalization due to the number and nature of genetic modifications (any additional modification makes the experiment more specific, more artificial and hence less generalizable and useful).
IIQ_ONCOINJECTION – penalization due to tumor induction.
IIQ_TOXIC – depends on the number and nature of substances classified as toxic which have been injected into the animal because of the experiment.
IIQ_NONTOXIC – the same, but for nontoxic substances; includes fluorescence and luminescence markers. The fact that they are considered non-toxic (especially by the companies who produce them) doesn't mean that they don't have a significant influence upon the overall equilibrium of the studied system and hence upon the scientific significance of the results.
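To make the bookkeeping behind such a catalogue of criteria concrete, here is a minimal sketch of how the negative factors of a given experiment could be recorded and summed; the penalty magnitudes are purely hypothetical placeholders, since the text proposes no concrete values for these factors:

    use strict;
    use warnings;

    # Hypothetical penalty values for the optical-imaging mouse experiment
    # described above; the magnitudes are illustrative placeholders only.
    my %iiq_negatives = (
        INJURY        => -1.5,   # long post-operative recovery
        FIXATION      => -0.5,   # fixed in the apparatus, not in natural niche
        GENEMANIP     => -1.0,   # luciferase construct
        ONCOINJECTION => -2.0,   # tumor induction
        TOXIC         => -1.0,
        NONTOXIC      => -0.5,   # luminescence markers
    );

    # IIQ_NEGATIVES is simply the sum of all applicable penalty factors.
    my $iiq_negatives = 0;
    $iiq_negatives += $_ for values %iiq_negatives;
    printf "IIQ_NEGATIVES = %.1f\n", $iiq_negatives;   # -6.5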
2.2 Isotopes & Implants
The methods we mentioned in the preceding parts were presented to the students during their Neuroimagery course, and this is the reason why we have been mentioning them. They may seem interesting to biologists or chemists, but not necessarily so to cognitive scientists. The reason for this statement is the fact that (with the exception of Broca's and Wernicke's discoveries) no information about high-level functions (memory, attention, language, etc.) is obtained by the application of such methods. On the contrary, the methods we shall discuss from this sentence on are of high interest to anybody whose interest doesn't stop at the level of tissue but goes further – towards the mind itself.

The crudest approach by which one can obtain information about the high-level functions of the neural system is by means of electrode implantation into the brain. Since not much was told to the students about this approach, let it be said that the introduction of such an approach should be penalized not only by the IIQ_TE and IIQ_SP factors, but also by a new factor IIQ_IMPLANTS, which should be proportional to the number and size of the implanted sensors, as well as to the depth of the implantation/invasion.

A much more subtle approach by which one can observe the mysterious relations between mind and brain is by means of radioactivity. The most attractive approach is so-called Positron Emission Tomography (PET), based upon the detection of gamma rays emitted by a positron-emitting radionuclide tracer which was injected into the body. If the tracer is fludeoxyglucose – an analogue of glucose – one can deduce the metabolic activity (glucose uptake) of different brain regions by simply observing the radiation (proportional to the FDG concentration) of the different regions. From the invasivity point of view, one should take into account an IIQ_RADIODECAY factor proportional to the half-life of the tracer's decay. In order to have such tracers, PET demands the proximity of a radionuclide-producing cyclotron. Such a cyclotron can possibly be toxic to its environment.

PET is often coupled with a classical X-ray CT scan. Since a CT scan also uses high-frequency electromagnetic waves as a medium for carrying the signal, an IIQ_RADIOGAMMA penalization – proportional to the energy level of the ray – should not be forgotten in its case. Another disadvantage of CT is that it furnishes only anatomic (and not functional) information. It remains, however, the most used apparatus in clinical (neuro)imaging practice, which is definitely due to its relatively low price and high reliability.

3. Reversible techniques

The exploration of the mind is barely beginning; it will be the principal task of the era opening before us, just as the exploration of the globe was that of the preceding centuries.
Thomas Huxley

3.1 Fields
From the point of view of the cognitive sciences, the most attractive methods for the study of brain and mind are the highly functional non-invasive methods of fMRI & EEG/MEG. All of them exploit, in a certain sense, the «electromagnetic field»-related characteristics of the human brain. The ElectroEncephaloGram (EEG), discovered by Berger in 1924, exploits the fact that the electric fields of activated cortical neurons – especially the pyramidal ones – sum up with each other and produce an overall electric response which is measurable even on the outer surface of the skull. Hence an invasion into the interior of the organism is not necessary, the electrodes are placed on the scalp, and the only act of violence related to EEG measurement is due to movement-related artefacts – if the organism moves, the measurement is strongly perturbed.
Hence the only negative factor of EEG is IIQ_FIXATION. The negative factor of «unnatural fixation in the apparatus» is present as well during experiments using Magnetic Resonance Imaging (MRI). MRI has two modes of functioning – anatomic and functional. Both exploit the properties of hydrogen protons, which are susceptible to align their spins when exposed to a powerful magnetic field. Subsequently, the protons are excited out of this «equilibrium state» by strong radio waves. From the time-related distribution of the emitted photons, one can subsequently reconstruct the overall map of matter in the skull. In the case of functional MRI (fMRI), the so-called BOLD effect is exploited as well – thanks to a certain property of hemoglobin, which is ferromagnetic when oxygenized and paramagnetic when the contrary is the case. Therefore one can be informed about the blood flow in the region of interest (ROI). Since the augmentation/diminution of blood flow in the ROI is related to the augmentation/diminution of neural activity in its proximity, fMRI gives us this very precious information. The only other negative factor of fMRI is IIQ_HEAT, since it seems that longer exposure to an fMRI device can lead to a slight augmentation of body temperature. Since this is in the order of approximately 1 degree Celsius, the IIQ_HEAT penalization is definitely smaller than in the «mouse-feet burning» experiments of optical imagery.

But in general it can be said that EEG as well as MRI are definitely positive approaches when analysed through the prism of the «Information/Invasivity Quotient» schema. This is due to the huge «information contribution» factor, i.e. due to the fact that these apparati produce a huge amount of information. To calculate the «information contribution», one should take into account these factors:

1) R_S – spatial resolution (voxels per skull volume or electrodes per skull volume)
2) R_T – temporal resolution
3) SN – signal/noise ratio
4) T – overall time-length of the data capture
5) I – sensor sensitivity, i.e. the number of degrees of freedom of an individual sensor (for example, the number of possible intensity values in the case of a CCD pixel)

The output of the simple formula

IIQ_INFOCONTRIB = R_S * R_T * SN * T * I

is a hypothetical overall amount of pure information (purified signal) obtained during the experiment. As we have already stated, this IIQ_INFOCONTRIB component is very high in the case of EEG and fMRI. In the former case it is due to a very high R_T (the dataset size obtained from one experiment is in the order of megabytes), while in the case of the latter it is due to a very high R_S (the dataset size obtained from one experiment is in the order of hundreds of megabytes, even gigabytes). By a subsequent logarithmization of these information contribution quantities (for example log10(megabyte) ≈ 6; log10(gigabyte) ≈ 9) one gets numbers which can be more easily used in the final IIQ equation (c.f. Appendix).

3.2 Life
Since the students weren't introduced to other non-invasive methods like the MagnetoEncephaloGram (MEG), Near-Infrared Spectroscopy (NIRS), Transcranial Doppler (TCD) or simple ultrasound imaging, we'll not concentrate upon these methods in this article. What we will concentrate upon in this concluding paragraph is this set of hypotheses: It is obvious that the brain is an electromagnetic-field generating device. Many indices suggest as well that the brain is susceptible to EM-field stimulation.
It may thus be that the brain sustains its internal equilibrium by means of its own EM field (the skull functioning as a resonator, the glial cells as amplifiers). How come modern science is completely blind to the power of field-based techniques and stays obsessed with its poisonous molecules, pills and deadly rays? After his first experience of meditation in a 3-Tesla MRI in Bordeaux, the Student is deeply persuaded that these most sophisticated devices ever created by humanity4 can be used not only for imagery, but for healing as well – for burning a tumor in a much more subtle way than an X-ray could ever do.

Discussion

It is becoming indispensable that humanity formulate a new mode of thinking if it wants to survive and reach a higher plane.
Albert Einstein

This text is written by a student of the Practical School of High Studies. Maybe the term «High Studies» is interpreted in a wrong manner by the Student; nonetheless his conscience obliges him to state his belief that the ultimate goal of his studies is scientia, and we have known for ages already that true scientia reposes on the discovery of general principles. The more general the principles, the higher the science.

This text is written by a young man who, at a certain moment in his life, came into contact with so-called «oriental» philosophy and science. The foremost ethical principle of Eastern thought can be stated thus: «there exists a causal cause-effect relation not only on the material, but also on the axiological – i.e. moral – level». This principle is known as «the law of Karma» in the East. The Western tradition knew it as well: «as You sow, so shall You reap» was said thousands of years ago, and was later translated into a Golden Rule before finally finding its most general form in the Categorical Imperative (Kant, 1785). But even Kant made a mistake: he excluded animals from the implementation of this principle.

This text is written by a cognitive science student aiming to program an Artificial Intelligence (A.I.) system. Since it is not a secret that the ultimate goal of Robotics & A.I. research is the emergence of a thinking and acting entity whose skills will be superior to those of a human being, we appeal to all those men and women of scientia who have ears to hear and eyes to see: if You do not reconsider Your practices immediately5, You will not be able to exclude the possibility that the future superiors will do to You the same thing as You did to Your inferiors.

To conclude: we state hereby that IF the principle of Karma is true (and we suggest that the whole of human history has not falsified it), then an experimental method which does not take it into account is doomed to fail, since, ex vi termini, one cannot heal cancer by injecting cancer into healthy beings. To conclude: the law of Karma states that You simply cannot have good scientific results if Your method for achieving them is not good either. To conclude: if we were «moralizing», we truly did it out of pragmatic concerns.

4 Nothing excludes, in theory, the exploitation of MRI devices as macroscopic quantum computation machines, but to analyse this here would take us too far afield.
5 Shubhasya shiighram ashubhasya kálaharańam (Do virtue immediately, delay doing vice.)

Appendix – Towards a concrete implementation of the Information/Invasivity Quotient

The Western spirit is in the truth only through its methods and its techniques. The Eastern spirit is in the truth only in its general tendencies. The exchange is necessary.
Georges I. Gurdjieff

Our «Information/Invasivity Quotient» proposal for the estimation is simple. On one side of the equation we put all the «invasivity»-related factors – quantified and weighted according to common international conventions. We label the resulting sum of all the quantified invasivity factors IIQ_NEGATIVES, i.e.

IIQ_NEGATIVES = IIQ_DEATH + IIQ_TE + IIQ_SP + IIQ_INJURY + IIQ_FIXATION + IIQ_BLEACHING + IIQ_GENEMANIP + IIQ_TOXIC + IIQ_NONTOXIC + IIQ_RADIODECAY + IIQ_RADIOGAMMA + IIQ_HEAT

On the other side of the equation we put the weighted IIQ_POSITIVES factor. Since IIQ_INFOCONTRIB gives us the pure information content in bits, we weight it by means of the logarithm function to make it comparable with IIQ_NEGATIVES:

IIQ_POSITIVES = log(IIQ_INFOCONTRIB)

The basic imperative of the Information/Invasivity Quotient heuristics states that if

IIQ_POSITIVES – IIQ_NEGATIVES < 0

then the amount of pure signal (information) generated by an experiment is not sufficient to justify the harm caused to the organism, and therefore such an experiment should not be performed. A very naive (and somewhat arbitrary) illustration of our point is presented in the following table, representing the negative and positive aspects of an experiment lasting approximately 1 hour:

Method          | List of IIQ_NEGATIVES                                                                              | N of IIQ_NEGATIVES | IIQ_POSITIVES = log10(IIQ_INFOCONTRIB) | IIQ_POSITIVES – IIQ_NEGATIVES | Decision
Optical in vivo | IIQ_INJURY + IIQ_HEAT + IIQ_NONTOXIC + IIQ_GENEMANIP + IIQ_TOXIC + IIQ_ONCOINJECTION + ???IIQ_DEATH | 7                  | 3                                       | < 0                           | reject
EEG             | IIQ_FIXATION                                                                                       | 1                  | 6                                       | > 0                           | accept
MRI             | IIQ_FIXATION, IIQ_HEAT                                                                             | 2                  | 7                                       | > 0                           | accept
NIRS            | none                                                                                               | 0                  | 5                                       | > 0                           | accept
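A compact sketch of this decision heuristic follows; it merely restates the table above in executable form, with each listed negative factor counted as a unit penalty – a hypothetical simplification, as everywhere in this proposal:

    use strict;
    use warnings;

    # Each method is described by its number of negative factors (unit
    # penalties, a hypothetical simplification) and by log10 of its
    # information contribution.
    my %experiments = (
        'Optical in vivo' => { negatives => 7, log_infocontrib => 3 },
        'EEG'             => { negatives => 1, log_infocontrib => 6 },
        'MRI'             => { negatives => 2, log_infocontrib => 7 },
        'NIRS'            => { negatives => 0, log_infocontrib => 5 },
    );

    # Basic imperative: reject iff IIQ_POSITIVES - IIQ_NEGATIVES < 0.
    for my $method (sort keys %experiments) {
        my $e        = $experiments{$method};
        my $balance  = $e->{log_infocontrib} - $e->{negatives};
        my $decision = $balance < 0 ? 'reject' : 'accept';
        printf "%-16s balance=%3d -> %s\n", $method, $balance, $decision;
    }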
Bibliography

To find first. To search afterwards.
Jean Cocteau

Broca, P. (1861). Remarques sur le siège de la faculté du langage articulé, suivies d'une observation d'aphémie (Perte de la Parole). Bulletin de la Société Anatomique, 6, 330–357.
Hendriksen, C. F. M. (2005). The ethics of research involving animals: a review of the Nuffield Council on Bioethics report from a three Rs perspective. Alternatives to Laboratory Animals: ATLA, 33(6), 659-662.
Kant, I. (1785). Groundwork of the Metaphysic of Morals. First published.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of information. University of Illinois Press, 97.

Methodology of research into the relation between the act of vyňadreniai performed by females of the species homo sapiens sapiens and subsequent changes in the distribution of social capital within the community gathered around the Internet discussion system kyberia.sk

Written for the Faculty of Humanities of Charles University in Prague as an assignment in methodology by Daniel Hromada, UČO 9306

1. Introduction – Intuition

And so, in the end, you will not want to part from the breasts of science.1

I have decided to analyse socio-psychological phenomena within a community which I know intimately, not only because I was its member for several years, but also because I established the community in question. It is the community gathered around the Internet application kyberia.sk. From the multitude of phenomena which caught my attention during the first five years of the community's existence, I finally chose the following one as the object of my analysis: in 2003 a forumii named "KYBERIA – setki ceckiiii ven!" was created on kyberia, in which individual female users of kyberia gradually began to present to the world photographs of their bared chests, often of considerable aesthetic quality. What is interesting about the whole phenomenon is that the ladies and misses did not cease performing the acts in question even after years – it was not some short-lived memetic / imitative / fashion wave.
In the past, only poets, and to a lesser extent philosophers, publicly reacted to the occurrence of such acts, which I shall designate in this essay by the term "vyňadrenie" (roughly: the act of baring one's breasts). Any scientific thematization of the role of the breast in human society was, a hundred years ago, almost out of the question; subsequently the whole topic "of the role of that which stills two hungers at once" – a topic so beautifully clear – lost its obviousness for almost the entire 20th century under the sediment of Freudian mystifying interpretations. I believe that at the beginning of the third millennium mankind has freed itself sufficiently from the grip of tabooizing ideologies, moralizing religious dogmas and constraining scientific methodological paradigms for the question which occurred to me when looking at the aforementioned forum to be answered on a solid level and with nobility: "Why do those Women do it?"

To lay his head upon his mother's breast – that is all his happiness; for nothing in the world would he want to lose sight of her.2

1 Goethe J.W., Faust, transl. O. Fischer
2 Thákur Rabindranath, La jeune Lune – Les caprices de bébé

2. The Breast – Empirics

2.1 Hypotheses

By making women into symbolic objects whose being (esse) is a being-seen (percipi), masculine domination places them in a situation of permanent physical insecurity, or rather of symbolic dependency: a woman exists first of all through – and for – the gaze of others, i.e. as an accessible, attractive and disposable thing... So-called femininity is thereby often nothing other than a certain catering to real or presumed expectations of men, destined above all to reinforce their ego.3

Even though the question posed in the introduction is only a layman's question, the answers to it can be genuine hypotheses. Hypothesis A may read: "The ladies perform vyňadrenie simply because they derive a certain pleasure, a certain plaisir, from the act itself. They simply enjoy it, and the meaning of vyňadrenie is to be seen in the act itself." Hypothesis B, in a certain sense opposed to hypothesis A, may read: "The ladies do so because they get something out of it, or expect that they will get something out of it. Behind their conduct one may uncover some hidden tradition, some specific calculus, some peculiar form of economic behaviour, a kind of capitalization of one's own body."

We live in an era which the paradigm given by Adam Smith has influenced to such an extent that some will soon start to think that the Sun shines on the Earth because there is considerable demand for sunshine on Earth. We shall therefore allow ourselves to explain even the act of vyňadrenie through the conceptual pair demand – supply. We can thus put forward hypothesis B1: "A Woman performs vyňadrenie because she consciously or unconsciously assumes that in exchange for this risky act, an act by which she in a certain sense offers her very self, she will gain something."

2.2 Strategy

The empirical sample which will be used during the experiment, the database of kyberia, is an informationally closed system, i.e. "a system which cannot be influenced by anything from outside without the researcher's knowledge"4. Apart from that, it is necessary to realize that the statement which we have labeled hypothesis B1 was constructed only after we had obtained the data with which we shall work in our experiment. We shall use nothing but the database of kyberia as our data source. No questionnaires, no respondents, no interviews – and therefore no distortion of the obtained data by the fact that research is taking placeiv. It can thus be said that we shall collect the data by studying the database at our disposal, and this approach is close to the so-called "study of documents" in the sense that we depart from a "record of human activity which did not come into being for the purpose of our research"5. The difference, however, is that in our case it will be an immaterial record.

3 Bourdieu P., Nadvláda mužů
4 Disman M., Jak se vyrábí sociologická znalost, Karolinum, 1998, p. 18
5 Ibid., p. 166
What will interest us above all, in this as well as in further works, is the degree to which a certain concrete event – in this case vyňadrenie – influences the position and trajectory of an individual in one of the many spaces of the social world, as Bourdieu has taught us about them. The question thus arises whether this will be quantitative or qualitative research. "In quantitative research we collect only those data which we strictly need for testing the hypotheses. In qualitative research we try to collect all the data and to find the structures and regularities which exist in them."6 The fact that we already dispose of a certain hypothesis, as well as the fact that our work will consist in looking for relations between quantities, might lead us to the opinion that this is quantitative research. That opinion is, however, mistaken. Our hypothesis B1 is so far defined very vaguely and, above all, it in no way determined which data we have at our disposal; in a very strong sense we really have at our disposal all the data concerning the kyberia community. If it is therefore necessary to pigeonhole our research somehow, let it be the pigeonhole of qualitative research into which we place our experiment. The classic shortcoming of qualitative research, namely that it involves a strong reduction of the number of observed individuals, does not concern us: we dispose of information about all acts of vyňadrenie which occurred within the forum "KYBERIA – setki cecki ven!". I shall return to the much more serious problem of inducing the findings ascertained in our sample onto the whole population, as well as to the characteristics of the sample, in part 2.6.

The classic sociological dilemma of how to sail between the Scylla of quantitative research and the Charybdis of qualitative research can also be solved in Solomonic fashion by claiming that this is not sociological research at all, at least not in the sense in which it is understood by the community of scientists who consider themselves sociologists. We can either claim that our approach encompasses and recombines both traditional approaches, or we can use a set of strategies known rather to experts in information technologies: the method of so-called data mining. What is characteristic of data mining, and in what it agrees with qualitative research, is the maxim "first we have the data, only then the hypothesis". They differ, however, in that while qualitative research requires first of all a researcher who runs from individual to individual collecting the greatest possible amount of data, in data mining it is the individuals themselves who provide us with these data and, by their own activity, insert them into a pre-structured database, often without knowing that they do so. Which is splendid from the point of view of the relevance of our data; from the ethical point of view it is, of course, a problem. We shall try to solve it in part 2.7.

2.3 Terminology

Symbolic capital is any property – e.g. physical strength, wealth, or the prowess of a warrior – which, when perceived by agents equipped with such categories of perception and evaluation as allow this property to be perceived and recognized, becomes symbolically efficacious, like a veritable magical power: it is a property which – being an answer to collective, socially constituted expectations – somehow acts at a distance...7

If we take the above quotation as a definition of symbolic capital, hardly anyone would dare to claim that a Woman's breasts do not participate in the composition of her symbolic capital.

6 Ibid., pp. 285-288
7 Bourdieu P., Raisons pratiques: Sur la théorie de l'action: L'économie des biens symboliques
It is indisputable that a Woman acts upon the surrounding agents by her breasts, and also that the surrounding agents – and not only infants – dispose of "categories of perception and evaluation which allow this property to be perceived and recognized". However, since we certainly cannot reduce a Woman's symbolic capital to her breasts alone – a Woman is truly much more than that – and since we have no idea to what degree a Woman's breasts participate in the overall composition of her symbolic capital, nor even whether this degree is the same for all females of the species homo sapiens sapiens or differs from Woman to Woman, it is probably appropriate, at the beginning of our research, to abandon the quantity of symbolic capital and to work with more narrowly delimited terms.

The first term, which we have already defined, is the term "vyňadrenie". It is an act by which a Woman, a female of the species homo sapiens sapiens and a user of the system kyberia.sk – in the following text often called "the person concerned" – publishes her bust in the forum "KYBERIA – setki cecki ven!". The act of vyňadrenie took place at a temporal "moment of vyňadrenie" (TŇ) and can be characterized by a certain "success of vyňadrenie"v (ÚŇ). As I shall show in the part on operationalization, the success of vyňadrenie can be expressed within the kyberia system as a number, i.e. as an interval variable.

A further term worth defining is the "passive renown of person A within the community" (Z). This quantity can be characterized as "the number of members of the community who are aware of the existence of person A". The renown of a person stands in a very close relation to the quantity which I call the "passive social capital of person A" (KpS) – this I define as "the number of members of the community who have an established relationship with person A" – while the "active social capital of person A" (KaS) I define as "the number of members of the community with whom person A has, or wants to have, a relationship". In other words, in the case of passive social capital the arrows point TOWARDS person A; in the case of active capital the arrows point FROM person A. It can be expected that for most people the active social capital will not exceed a certain critical maximal threshold – to relate to more than a few hundred people is cognitively unbearable. When, for example, we speak of an unpopular politician, his "passive renown" in his country is high, because the number of people who know him is high; conversely, his passive social capital is low, because only a few like him and would want to establish a genuine human relationship with him. One can say that the more popular a person is, the more her "passive renown" correlates with her "passive social capital".

The "change of the passive social capital of person A at time T" (dKpST) is given by subtracting the value of passive social capital measured before time T from the value measured after time T. Within our research, the time T is of course nothing other than the moment of vyňadrenie. We can now connect our vague hypothesis B1 with the theoretical framework which we have built as a superstructure upon the work of Pierre Bourdieu, and ask: "Does the act of vyňadrenie lead to a change in the passive social capital of the person who performed it?" Put in human terms: "Does the act of a Woman revealing her charms to the world lead to a growth in the number of persons who attempt to establish a relationship with her? Or does the opposite, unexpected effect occur, with some persons breaking off contact with the lady who thus bared herself? |dKpST| > 0 ???" If no change occurs, not only the gentlemen but perhaps also the ladies should be informed about it.
If it does – and this is probably what our intuition whispers to us – if the sample shows us considerable drops or rises of passive social capital before and after the moment of vyňadrenie, we shall be able to go even further and ask: "Does a correlation exist between the success of vyňadrenie and the change of the passive social capital, or of the renown, of the person who performed it? If so, what is the regression coefficient b between the two phenomena?" Put in human terms: does it hold that the more charming the breasts of a female of the species homo sapiens sapiens are considered to be, the more people – most probably males – will want to establish contact with the female in question after she reveals them to the world? If yes, it is worth reflecting upon. If the exact opposite holds, it is worth reflecting upon. If neither of the two alternatives holds, it is worth reflecting upon.

2.4 Operationalization

By operationalization the sociologist brings his terminological – and thus in a certain sense already theoretical – framework closer to the composition of his empirical data. We bring the language of our hypotheses (in the case of quantitative research) or of our preliminary pre-understanding (in the case of qualitative research) closer to the language of our data. And so: the indicator of the quantity "success of vyňadrenie" is the quantity stored in our database in the column named "K", in the row characterizing the post with the photograph by which the lady performed her vyňadrenie. "K" is the basic economic medium of exchange, the unit of Capital, within the community of kyberia.sk – K can be understood as the Crown or Credit of kyberia. Every user is allotted 23K every dayvi, which he can subsequently assign to other posts, forums or users, whereby he may grant at most 1K to any single post. In the data at my disposal, the most successful act of vyňadrenie so far gained 123K – in other words, 123 users considered it appropriate to repay the lady in question for her act of vyňadrenie by granting her a K. Since the number of K with which the given photograph was evaluated is thus the resultant of the activity – or inactivity – of a multitude of people, I consider it a decidedly more objective indicator of the success of vyňadrenie than judging that success merely on the basis of my own subjective aesthetic criteria (which ones?).

But how to operationalize the "change of passive social capital", which is crucial for our research? Several paths open before us; for the purposes of this work it will suffice to present the three most essential ones:

Avii: As the indicator of the change of passive social capital we choose the change in the number of users who wrote mail to the person concerned during a certain interval BEFORE and AFTER the act of vyňadrenie. For example, we can compare the number of individuals who wrote to the female in question one week BEFORE and one week AFTER. Alternatively, we can look several weeks (or other time intervals) into the past – in this way we may uncover the existence of an "unobserved variable" which could lead to a spurious correlation8: the natural growth of social capital which certainly occurs as a newly arrived member integrates into the social network of kyberia, independently of whether she performed vyňadrenie or not.

B: As the indicator of the change of passive social capital we choose the change in the number of users who added the person concerned among their friends during a certain interval BEFORE and AFTER the act of vyňadrenie. The kyberia system, like many other Internet social networks, offers the possibility of creating friendship links between users. The threat of at least one spurious correlation will be limited in the same way as in the case of path A.

C: As the indicator of the change of passive social capital we choose the change in the number of users who viewed the user profile of the person concerned during a certain interval BEFORE and AFTER the act of vyňadrenie.
These data can be obtained from the database table "levenshtein". As with the first two paths, here too we can work not only with data quantitatively representing the whole existence of the user in kyberia – by integrating them into the research and applying statistical transformations to them we shall lower the probability of the influence of other unobserved variables. Combinations of paths A, B and C are of course also possible.

8 Ibid., p. 21

2.5 Problematization

Some of the problems we run into during our data analysis are of a downright technical character: e.g. we find that several thousand database entries concerning friendship links (path B) between users have the same creation date; it is almost certain that something like this was caused by a faulty database query at some point in the past. For our research this means that the results obtained by path B will probably be extremely unreliable – which is all the more unpleasant since, as we shall see in the following part, path B is the ethically cleanest one. This inconvenience is all the more serious because, upon reflecting on our theoretical model, it seems that the number of friendship links created with person A is the most reliable indicator of the passive social capital of the person in question. Let me clarify: the number of users who viewed the user profile of the person concerned (path C) is indeed a piece of information telling us that a certain set of users showed interest in the person concerned; we learn nothing, however, about the value of this interest – we do not know whether they looked at her profile out of pure curiosity or out of a real interest aroused in them by the act of vyňadrenie. The same holds for path A: since, for reasons both ethical and technical, we are not interested in the contents of the individual mail messages addressed to the person concerned, we do not know whether these messages contained words of praise, thanks and invitations to tea, or insults and insinuations that the person concerned behaved like a loose girl. It was precisely because of this problem that, as soon as it arose before me, I integrated into the theoretical model the concept of the "passive renown of person A", which I defined as "the number of members of the community who are aware of the existence of person A". In the case of the quantity of renown we are not at all interested in whether the relation between the other members of the community and person A is positive or negative, whether they love or hate the person. What is essential for the quantity of renown is that they know about person A. And since, of course, in order to view someone's user profile, or even to write him a mail message, we have to know about him, both data sources used in paths A and C are above all indicators of passive renown.

Similar problems of technical, methodological and theoretical character arise before us when designing the experiment. But the most serious problem is posed only by the results of the research. It is the problem of induction and the answer to the question: "To what degree can we extend our findings, mined from our data sample, to the whole population?" Answer: to the degree to which the structure of the sample correlates with the structure of the population. The sample was not merely the approximately 50 acts of vyňadrenie produced by a few dozen female users of the system kyberia.sk. The sample is the whole database of kyberia, since it is precisely from it that we mine the data about the change of social capital.
Even though kyberia is first of all a social network distributed, to one degree or another, in the brains of all its users, we can take a crude reductionist step and claim that the structure of interpersonal relations stored in the database – the sample at our disposal – is partially isomorphic with the structure of interpersonal relations in the real, material community. And about the community roughly this is known: in most cases its members are persons of Slovak and Czech nationality, of average to above-average computer literacy, economically and informationally productive, living in an urban environment; the peak of the bell-shaped curve of the age distribution of users I assume to lie somewhere in the interval of 23-25 years. The first induction step consists in extending the findings obtained thanks to the kyberia community to that set of individuals of the human species which can be described by characteristics similar to those of our sample: to cultivated and educated young Slovak and Czech men (and of course women). Having taken this step, we reason within a historico-ethnographic discourse. The second induction step consists in extending our findings to the whole population of planet Earth at the beginning of the 21st century, i.e. a population whose thinking and conduct is beginning to be ever more homogeneously determined by the influences of urbanization and the ideology of global capitalism. Here we find ourselves in a socio-economic discourse. The third induction step consists in extending our findings to the species homo sapiens sapiens as such. This is the biologico-anthropological discourse. It hardly needs adding that with every induction step the risk of error risesviii.

2.6 Interpretation

We want to understand the meaning of the obtained data. How to go about it? We can arrange the data into tables and then process them statistically. We shall thereby obtain a multitude of useful numbersix (e.g. the standard deviation, which gives us information about the homogeneity of our data), and perhaps we shall understand a thing or two; but after studying them we shall probably not be very capable of passing our knowledge on to a layman or a small child. And can one speak of understanding the data at all in such a case? Or we can visualize our informationx. A single image can carry more information than a twenty-page table – for we can read an image with greater ease, like a story. The simplest visualizations are graphs. If we are interested only in whether vyňadrenie leads to a change of the passive social capital of person A, the simplest way to find out is to create a graph in which the X-axis gives the time unit (e.g. the week) and the Y-axis gives the amount of passive social capital at the disposal of the person concerned at the given time. For every indicator and for every person A we obtain a separate graph. These graphs can of course be combined. If the act of vyňadrenie leads to an immediate growth of passive social capital, it will show in our graph as a jump, precisely at that X-coordinate which characterizes the time when the lady in question performed her vyňadrenie. Perhaps we shall find from such a graph that the jump is not immediate but partially delayed – that would suggest that men wait before making first contact, perhaps because they do not want their motives to be too evident. From such a graph it will perhaps also be possible to read whether the value of KpS settles, after vyňadrenie, at some new level, or whether it sooner or later converges back to its previous state.

If we are interested in the relation between the success of vyňadrenie and the change of passive social capital, we shall have to construct a different graph. The X-axis will characterize the change dKpST, the Y-axis will characterize the success of vyňadrenie. In the case that either a real dependency exists between the two variables, or both data are determined by a third variable unknown to us, the data will arrange themselves, with greater or smaller fluctuations, around a line which can be formally characterized by the equation

dKpST = a + b * ÚŇ

whereby the regression coefficient b will express nothing other than the degree to which the success of vyňadrenie – probably determined not only by the Woman's natural qualities but also by the quality of the photographer and the photograph – influences the change of passive social capital. The discovery of such an obvious correlation is, however, extremely improbable in the human sciences, and this research is no exception. The fact that in the end we arrive at no quantitatively expressible parameters does not mean, however, that the data have taught us nothing. Who knows, perhaps they will teach us only that by revealing her breasts to the world, nothing actually changes.
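A minimal sketch of how the regression coefficient b of the line above could be estimated by ordinary least squares from (ÚŇ, dKpST) pairs – the data points here are invented placeholders, not measurements from the kyberia database:

    use strict;
    use warnings;

    # Hypothetical (ÚŇ, dKpST) pairs: success of the act (in K) versus the
    # measured change of passive social capital around the moment TŇ.
    my @un    = (12, 34,  5, 77, 123);
    my @dkpst = ( 1,  3,  0,  6,   9);

    my $n = scalar @un;
    my ($sx, $sy, $sxy, $sxx) = (0, 0, 0, 0);
    for my $i (0 .. $n - 1) {
        $sx  += $un[$i];
        $sy  += $dkpst[$i];
        $sxy += $un[$i] * $dkpst[$i];
        $sxx += $un[$i] ** 2;
    }

    # Ordinary least squares: b = cov(x,y)/var(x), a = mean(y) - b*mean(x).
    my $b = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx ** 2);
    my $a = ($sy - $b * $sx) / $n;
    printf "dKpST = %.3f + %.3f * UN\n", $a, $b;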
2.7 Ethics

Act as if the maxim of your action were to become, through your will, a universal law of nature.9

Given the fact that in this research I analyse data which were not produced for the purpose of research, it is certainly appropriate to ask whether the research in question is ethically clean. That of course depends on which set of ethical imperatives we consider essential. If we consider the only ethically essential thing to be that no killing or harm occurs during our research, we may consider the research ethical. No rat was destroyed, no schizophrenic underwent a lobotomy, no hyperactive child was doped with Ritalin, and no parrot was violently caught in the jungle to be subsequently placed in a cage in the name of "science".

If the use of private data is considered unethical, the following can be said in the research's defence:
● Since my enrolment at the FHS I have, within kyberia itself, several times explicitly emphasized that I would use the kyberia database for academic research. It is of course possible, and more than probable, that the majority of users were unaware of these intentions.
● The forum "KYBERIA – setki cecki ven!" is a public forum. Likewise, the data on friendship links between users are public data – any reasonably skilled Internet user can thus arrive at the results of path B even without having the database at his disposalxi.
● Path C works with data which are not accessible to everybody, and path A even with data of a markedly private character – with mail messages. These data, however, are processed not by hand but by machine, and their content interests nobody, just as it interests nobody to which concrete person the obtained number of message senders relates. I obtain a quantity which I then relate to another quantity...
● Since even such numeric information may nowadays – when information is placed in close connection with powerxii – be interpreted by someone as a violation of privacy, the results of the research will not be presented until I receive the consent of the Senate of kyberia to their publication.

Nevertheless, I consider the most ethical approach to be that of a scientist who studies not others but himself. I must state with regret that the adoption of this approach – the only approach with which I can fully identify – met with such considerable incomprehension that I almost left the Czechoslovak academic community because of itxiii.

From the point of view of the categorical imperative, i.e. from that point of view which may prove, for future generations, to be the only morally binding one, the research in question appears to be moral. For the maxim standing behind the conception of this research reads: "to pay homage, by this research, to Woman". Such a maxim can certainly become a universal law of nature.

9 Kant I., Základy metafyziky mravov
3. Conclusion – Theory

There are many good inventions on earth, some useful, some pleasant: for their sake the earth is to be loved. And there are many things so well invented that they are like woman's breast: useful and pleasant at the same time.10

From Eve's apple bitten into in the garden of paradise, through the alluring charms of Helen of Troy and the "twin gazelles" of Solomon's Song of Songs, up to Cécilia Sarkozy – the breast is omnipresent in the story of Man. It was here long before the first hand-axe – and yet anthropologists mostly remain silent about it. The breast and nothing else set the greatest men of history in motion – yet how many historians have devoted themselves to it fully? Perhaps only the most degenerate of all sciences, the one which proudly titles itself "the most rational", understood its significance for the human soul, only to begin exploiting that jewel of nature distastefully and without measure within the shallow PRAXIS of its advertising campaigns. Behind the seeming naivety of the marketing imperative "sex sells" hides an immensely powerful conglomerate of concepts like "demand" and "supply", bound together by would-be rational "laws" which, when confronted with ordinary human reality – the "unmasking" of which is the sociologist's sacred duty – may turn out to be unfounded or downright mendacious. A human being cannot be converted into capital; the breast is far more a gift than an object of exchange; the girls from the forum "KYBERIA – setki cecki ven" do not sell themselves, do not calculate – they simply live. They live and they laugh.

May someone one day, equally laughing, repay them for their laughter by constructing the "Grand Theory of the Breast" (VTŇ). He who erects it on a firm ontological foundation ("what is the being of the breast, and what belongs to it as such?") and anchors it metaphysically in a proper manner (the breast as the primal reason and final meaning of human Dasein), in order then to build walls of firm empirical facts (the breast as a certain configuration of human tissues; parameters like density, firmness, weight, compactness, fragility, softness, excitability etc.), truly need not fear that his phenomenological analyses (the breast as a phenomenon, and the analysis of the breast by all five senses) or his qualitative and quantitative researches of breasts would be disturbed by some ill-wisher. And the results of those researches? As the Master says: whereof one cannot speak, thereof one must be silent.11

Post Scriptum for every She without whose contribution this text would never have come into being: Thank you.

10 Nietzsche F., Also Sprach Zarathustra, Dritter Theil: Von alten und neuen Tafeln
11 Wittgenstein L., Tractatus logico-philosophicus

i Even though the term "ňadro" ("breast") is, in the context of the Slovak language, a crude Bohemianism, the terminus technicus "vyňadrenie" and its reflexive verbal derivatives – the perfective "vyňadriť sa" and the imperfective "vyňadrovať sa" – are pure neologisms which came into being as part of the Slovak language. Thus, should the following conversation ever take place in some academic elevator:
A: "Well well, dear colleague, how beautifully vyňadřená you are today."
B: "Dear colleague, I must warn you emphatically that comments of this kind will most certainly not increase your active social capital."
...the term "vyňadřit se", despite its seemingly Czech mutation, would be a crude Slovakism...
ii The forum is publicly accessible at the address: http://kyberia/id/64400/ .
For potential researchers from the FHS, a user account with the following parameters was created in the past: login: fhs, password: fhsfhs
iii It hardly needs adding that the similarity of the term цэцэк used here to the Mongolian word цэцэг, which means flower (classical Mongolian does not know the difference between K and G), is probably not due to some wondrous linguistic convergence, but is merely a certain charming coincidence.
iv While the insight that "the act of observation influences the observed phenomenon" is one of the basic pillars of quantum physics, and even led to the formulation of Heisenberg's very useful "uncertainty principle" dx·dp ≥ ...
vii my $arr_ref = $dbh->selectall_arrayref(
        # NB: the original LIKE pattern is illegible in the source;
        # '%...%' is a placeholder.
        "select node_created, login, node_id, node_creator,
                k as USPESNOST_VYNADRENIA, node_name
           from nodes left join users
                on users.user_id = nodes.node_creator
          where node_parent = 64400 and node_content like '%...%'
          order by k desc",
        { Slice => {} });
    foreach my $ref (@$arr_ref) {
        my @a = $dbh->selectrow_array(
            "select count(distinct mail_from) as SOCIAL_CAPITAL_BEFORE
               from mail
              where mail_to = $ref->{'node_creator'} and mail_user = mail_to
                and mail_timestamp > '$ref->{'node_created'}' - INTERVAL 7 DAY
                and mail_timestamp < '$ref->{'node_created'}'");
        my @p = $dbh->selectrow_array(
            "select count(distinct mail_from) as SOCIAL_CAPITAL_AFTER
               from mail
              where mail_to = $ref->{'node_creator'} and mail_user = mail_to
                and mail_timestamp < '$ref->{'node_created'}' + INTERVAL 7 DAY
                and mail_timestamp > '$ref->{'node_created'}'");
    }
We arrive at the results just as simply if we take path B or C.
viii Therefore, he who wishes to stray as little as possible from the truth will do best to take no induction step at all.
ix And as Antoine de Saint-Exupéry has it said to the Little Prince: "People love numbers."
x In the text I present graphs as examples of data visualization. Visualization is, however, far from being limited to graphs. See e.g. http://gondapeter.sk/files/peter_gonda_bakalarka..pdf
xi Exercise 1: Write a script shorter than 77 lines with which you obtain all the necessary data. Hint: first parse http://kyberia.sk/id/64400 according to a certain regular expression; this yields the first set of variables, which then determines which user pages you have to parse – again by means of a regular expression – in order to obtain the second set of variables. That is all.
xii INFORMATION IS POWER: information (INFORMATION) is distilled from data (DATA), and data are picked out of chaos (CHAOS). Information leads to knowledge (KNOWLEDGE), and that leads, after years, to wisdom (WISDOM).
xiii The rejection of the methodological paper in which I presented an entirely new way of viewing the contents of the human mind – in the case of that paper it was of course my own mind – was justified by an anthropologist and an ethologist with, among other things, the argument that "in social-science research the members of a sample tend not to be features, but persons".

Variations upon the theme of the Evolutionary Language Game
by Daniel Devatman Hromada

Introduction
The Evolutionary Language Game (ELG), first proposed in (Nowak, Plotkin, & Krakauer, 1999), is a stunningly simple yet mathematically feasible stochastic model addressing the question: «How could a coordinated system of meanings & sounds evolve in a group of mutually interacting agents?» In the most simple terms, the model can be described as follows: let's have a population of N agents. Each agent is described by an n x m associative matrix A. A's entry a_ij specifies how often an individual, in the role of a student, observed one or more other individuals (teachers) referring to object i by producing signal j.
Thus, from this matrix A one can derive the active «speaker» matrix P by normalizing A's rows:

$$p_{ij} = \frac{a_{ij}}{\sum_{k=1}^{m} a_{ik}}$$

while the passive «hearer» matrix Q is obtained by normalizing A's columns:

$$q_{ji} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{kj}}$$

The entries $p_{ij}$ of the matrix P denote the probability that, for an agent-speaker, object i is associated with sound j. The entries $q_{ji}$ of the matrix Q denote the probability that, for an agent-hearer, sound j is associated with object i. Subsequently, we can imagine two individuals A and A', the first one having the language L(P, Q), the other having the language L'(P', Q'). The payoff related to communication between two such individuals is, within Nowak's model, calculated as follows:

$$F(A, A') = \sum_{i=1}^{n}\sum_{j=1}^{m} p_{ij}\, q'_{ji} = \mathrm{Tr}(P Q')$$

And the fitness of the individual A with regard to all other members of the population $\mathcal{P}$ can be obtained as follows:

$$f(A) = \frac{1}{|\mathcal{P}| - 1} \sum_{A' \in \mathcal{P},\ A' \neq A} F(A, A')$$

After the fitness values are obtained for all population members, one can easily apply traditional evolutionary computing methods (Sekaj, 2005) in order to direct the population toward more optimal states. In the experiments described in this paper, we have applied a binary-search variant of the roulette wheel algorithm, within which the probability of selecting individual I as a future teacher is proportional to I's fitness. It has to be stated, however, that Nowak's results indicate that even without such an «evolutionary engine» behind it, the ELG model shall converge to weak local optima. Such a mathematical property makes the ELG model a plausible candidate for the explanation of coordinated communication systems even among species where the coordination of sound-meaning pairs does not necessarily augment an individual's fitness. While such cases, where the teacher is chosen at «random», could possibly explain the first stages of the emergence of language practically ex nihilo in higher vertebrates, they will be left aside in the rest of this paper. Thus, all numeric experiments presented below depart from the assumption that the hypothesis «successful alignment of one's sound-meaning associative mindmatrix A with the mindmatrices of other members of one's population augments one's fitness and thus augments one's probability to replicate the content of one's mindmatrix into the mindmatrices of younger individuals» is true.

First simulation
The aim of the first simulation, which we label as standard Evolutionary Language Game (sELG), was to confirm the validity of the ELG model and to test its sensitivity to different values {1, 4, 7, 10} of the parameter k_learn, which specifies how many times the matrix sampling procedure should be repeated during the learning process. The size of the population was N=100; the size of the associative matrix was 5 × 5. For every value of the k_learn parameter, the simulation was run 198 times. Every run was halted after 10000 generations. In every generation, the roulette wheel algorithm chose one fit individual to be the «teacher», whose associative matrix was sampled into the associative matrix of one «student» individual chosen at random.

Results of 1st simulation
As indicated by Figure 1, all runs converged rather swiftly to local absorbing states. The result is thus consistent with the results presented in (Nowak et al., 1999). The global optima were, however, attained quite rarely: 18 times in case of k_learn=1, 13 times in case of k_learn=10, 7 times in case of k_learn=4 and 9 times in case of k_learn=7.
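As a minimal illustration of the bookkeeping defined above – the subroutine names and data layout are our own illustrative choices, not the original implementation – the matrices P and Q and the payoff F(A, A') = Tr(PQ') can be computed in a few lines of Perl:

    # Derive the row-normalized "speaker" matrix P from an associative matrix A.
    sub speaker_matrix {
        my ($A) = @_;
        my @P;
        for my $row (@$A) {
            my $sum = 0; $sum += $_ for @$row;
            push @P, [ map { $sum ? $_ / $sum : 0 } @$row ];
        }
        return \@P;
    }

    # Derive the column-normalized "hearer" matrix Q (stored as Q[j][i]).
    sub hearer_matrix {
        my ($A) = @_;
        my (@Q, @colsum);
        for my $row (@$A) { $colsum[$_] += $row->[$_] for 0 .. $#$row; }
        for my $j (0 .. $#{ $A->[0] }) {
            push @Q, [ map { $colsum[$j] ? $A->[$_][$j] / $colsum[$j] : 0 } 0 .. $#$A ];
        }
        return \@Q;
    }

    # Payoff F(A, A') = sum_ij p_ij * q'_ji = Tr(P Q').
    sub payoff {
        my ($A1, $A2) = @_;
        my ($P, $Q) = (speaker_matrix($A1), hearer_matrix($A2));
        my $F = 0;
        for my $i (0 .. $#$P) {
            $F += $P->[$i][$_] * $Q->[$_][$i] for 0 .. $#{ $P->[$i] };
        }
        return $F;
    }

The fitness f(A) is then simply this payoff averaged over all other members of the population.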
For other information concerning, for example, the average generation at which the different absorbing states were attained, cf. (Hromada, 2012)¹.

Figure 1: Evolution of fitness in time in a standard Evolutionary Language Game

Second simulation
In the second simulation, which we label as standard Evolutionary Language Game with noise (sELGn), the stochasticity of the model was increased by the introduction of noise-generating phenomena into the learning process, occurring with probability p_mutation = {0.05, 0.01, 0.001, 0.0001}. For every probability value there were 20 runs. The value of the parameter k_learn was 4; other parameters were identical to those used in the first simulation (i.e. N=100, m=5, n=5).

1 Note that the local optima which Nowak labels as «absorbing states» are denoted by the term «orbitals» in (Hromada, 2012).

Figure 2: Evolution of fitness in time in a standard Evolutionary Language Game with noise

Results of 2nd simulation
Fitness curves for different values of the parameter p_mutation are plotted in Figure 2. It is evident that if ever the value of p_mutation exceeds a certain threshold, the system shall behave too stochastically and shall oscillate in the proximity of the weakest local optimum. On the contrary, when p_mutation is very low, one can notice that the system converges to high attractor states which have fitness values 4 and 5. Thus the presence of a very low amount of noise seems to play an important role in cases whereby the system gets stuck in local optima – verily, an error in the learning process can give the system an important stochastic kick which shall allow the system, in the long run, to attain the global optimum state. In evolutionary computing, such an approach is already widely used not only in genetic algorithms but also in algorithms derived from the Simulated Annealing approach (Kirkpatrick & Vecchi, 1983). The results obtained in experiment 2 are consistent with Nowak's data, as well as with the data obtained by Kvasnička & Pospíchal, who state: «if we introduce random errors into the learning process, the obtained results differ dramatically and are dependent on the probability of occurrence of these errors. If ever this probability exceeds the critical value located somewhere between 0.001 and 0.01, the evolution shall start to behave in a stochastic manner and shall cease to converge to the final value of fitness» (Kvasnička & Pospíchal, 2007).

Third simulation
The objective of the third simulation was to exploit the ELG model in order to answer the question: «Which strategy is more fit? To be taught once or multiple times?». Seemingly identical to the first simulation, within which we modelled the fact of «being taught more than once» by different values of the parameter k_learn, this simulation differed primarily in two aspects: 1) the stochastic parameter p_mutation was assigned the value 0.001 in order to ensure that the system would converge, sooner or later, to the global optimum; 2) none of the 99 runs for each configuration was stopped until the average fitness of the population attained unescapable proximity of the global optimum². Every run could thus be characterized by a «temporal length», i.e. the number of generations necessary to attain the global optimum, and the two distributions were subsequently compared by statistical tests.

2 Since the maximum average fitness of the model hereby presented is n=m=5, we state that all models whose average fitness attained the value 4.98 (or higher) have attained «unescapable proximity of the global optimum».
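A minimal sketch of the noisy sampling step, as we understand it from the description above (variable names are ours, and the exact point at which the noise strikes is our assumption; cf. the author's own sampling loop reproduced in footnote 4 of this paper):

    # One matrix-sampling episode with noise: every cell of the student's
    # mindmatrix is reinforced with a probability given by the corresponding
    # entry of the teacher's speaker matrix P; with a small probability
    # p_mutation the reinforcement lands on a random cell of the same row
    # instead -- a "mis-heard" sound.
    sub sample_with_noise {
        my ($Teacher_P, $Student_A, $p_mutation) = @_;
        for my $row (0 .. $#$Teacher_P) {
            my $width = scalar @{ $Teacher_P->[$row] };
            for my $column (0 .. $width - 1) {
                next unless rand() < $Teacher_P->[$row][$column];
                my $target = rand() < $p_mutation
                           ? int rand($width)   # noisy copy
                           : $column;           # faithful copy
                $Student_A->[$row][$target] += 1;
            }
        }
    }

Repeating this episode k_learn times for one student corresponds to the «being taught multiple times» condition of the third simulation.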
Results of 3rd simulation
All 198 runs converged to the global optimum; the longest run for k_learn=1 took 376950 generations to converge to a theoretically optimal alignment of the mindmatrices among the members of the population. The longest run in case of k_learn=4 needed 215290 generations to converge to a quasioptimal alignment of the mindmatrices of all members of the population. The distribution of 99 temporal lengths, one for each run, is not a normal distribution according to the Shapiro-Wilk test of normality (W = 0.7407, p-value = 6.244e-12) for runs where k_learn=1. Similarly, the distribution of the 99 temporal lengths of all 99 runs executed with k_learn=4 is not normal (W = 0.8709, p-value = 8.533e-08). A non-parametric test therefore had to be applied in order to compare the two distributions: the Mann-Whitney-Wilcoxon rank-sum test showed that the difference between the two distributions is not significant (W = 4207, p-value = 0.08562). Strictly statistically speaking, there is therefore no difference between the situation whereby the teacher's mindmatrix is sampled into the student's four times and the one where it is sampled only once.

Fourth simulation
The aim of this simulation was to shed some light upon the answer to the question: «Which strategy is more fit? To be taught once by multiple teachers, or to be taught multiple times by one unique teacher?». We used the data obtained in the 3rd simulation for the case «multiple times by one unique teacher». For the cases with multiple teachers, we modified sELGn so that before every matrix sampling for a randomly chosen student, a new teacher is chosen by the roulette wheel algorithm according to his fitness. Given that k_learn=4 in both cases, all other parameters were identical to those of the preceding simulations.

Results of the 4th simulation
The Mann-Whitney-Wilcoxon rank-sum test suggests that the difference between the two distributions is not significant (W = 5068.5, p-value = 0.6778). It therefore seems that within the standard ELG model, whether the student learns from one or more teachers does not speed up the convergence of the population towards the global optimum. The longest run in case of the «multiple teachers» simulations took 298910 generations to converge.

Fifth simulation
The aim of the fifth simulation was to see whether a phenomenon like the Baldwin effect could arise in the sELG world. While the traditional «cultural evolution» was modelled by the sELGn model as presented in simulations 2-4, the «genetic evolution» was implemented as a sort of meta-evolution within which the parameter k_learn could itself evolve. In concrete terms, every individual of the population was defined not only by his mindmatrix A (and the «speaker» matrix P and the «hearer» matrix Q derived from A), but also by a chromosome containing a single integer specifying the k_learn parameter. All members of the population were initialized with k_learn=0, but in every generation a mutation of this value could occur with probability p=0.1, incrementing or decrementing the value. The value was bounded by the interval <0,10>; it could neither increase above nor decrease below these bounds. The values of the other parameters were identical to those presented in simulations 1-4.

Results of the 5th simulation
Identically to simulations 3 & 4, all runs attained the proximity of the globally optimal attractor state (the longest run took 361590 generations to converge).
While the Wilcoxon rank-sum test indicated no significant difference between the distribution obtained in this simulation and the «multiple teacher» distribution obtained in the fourth simulation (W = 4587, p-value = 0.3722), nor any difference with regard to the distribution obtained for «parental learning with k_learn=4» from the third simulation (W = 4385, p-value = 0.1646), there was nonetheless a significant difference observed between the distribution obtained in this simulation and the distribution for «parental learning with k_learn=1». Somewhat contrary to the expectations we had when launching the model, the most simple variant of parental learning seems to converge faster, i.e. in fewer generations, to the global optimum (one-sided: W = 3765, p-value = 0.001773) than the «bounded Baldwin» variant presented in this simulation. For completeness, we consider it worth noting that further analysis of the values of the evolvable k_learn parameter indicates that, within the 100 executed runs, the mean of the parameter value was 3.57 and the median 3.351 at the moment when a given simulation attained the proximity of the optimal state. Given the fact that the possible values of k_learn were bounded to the interval <0,10>, this could seem surprising, since one would expect the values to be closer to the middle of the interval. All is explained, however, when one realizes that the random drift responsible for the evolution of k_learn from its initial value 0 does not have time to act in the case of «lucky fast simulations» which converge to the global optimum in a few thousand generations. This can be clearly seen when one compares the distribution of the temporal lengths of such «fast simulations», which converged to the optimum in fewer than 10000 generations, with the rest. While the mean value of the k_learn parameter for the «fast simulations» is 1.92, the mean value for the others is 3.73, and the difference between the two subgroups is, of course, significant (W = 819, p-value = 8.384e-07).

Discussion
The elegance of ELG is such that one is immediately tempted to state that «ELG offers THE mathematical formalism explaining the emergence of a shared communication system in a population of agents whose sound-meaning couples are randomly initialized». On the other hand, it could easily be reproached that the initial ELG model is too reductionist and, what is worse, that it is based on assumptions contradictory to the state of things which must have obtained when human language evolved. For example, the assumption that the teacher-to-student information transfer can be modelled by the sampling of the WHOLE teacher's associative matrix, or the assumption that the fitness of any individual I in generation G is defined as the average payoff of all possible communication acts with all other individuals of the population – both these assumptions seem to us highly unrealistic in relation to the functioning of groups of primates in the period when coordinated sign-meaning communicative systems came into existence. But ELG remains interesting even if all its relations to the human sciences and linguistics were considered irrelevant. Verily, we believe that ELG is worthy of scientific interest even when considered as a solely mathematical, informatic and/or game-theoretic problem. More concretely: as a stochastic framework able to converge to a well-defined global optimum state (i.e.
the state where, in all rows of the mindmatrices of all members of the population, there is exactly one entry with value 1 and zeroes in all other entries of the same row), ELG can furnish a useful toolbox for the evaluation and comparison of diverse evolutionary computing approaches. Within this paper we have introduced, in experiments 3-5, an evaluation method based on the nonparametric Mann-Whitney-Wilcoxon rank-sum test. We have taken for granted an analytical result published in (REF), indicating that if a reasonable amount of noise is present during the learning process, the population shall sooner or later converge to an optimally coordinated communication system. This being granted, we asked: «How fast shall the global optimum be attained?», and considered an evolutionary algorithm and/or a given set of parameter values (k_learn, p_mutation) as better if, for a given configuration, the algorithm converged to the global optimum significantly faster (i.e. in fewer generations³) than other configurations.

3 In this paper we have used the term «generation», which is common in the Evolutionary Computing domain. But since in the sELG model a generation consists of 1) the choice of teacher(s), 2) the choice of ONE student, 3) an information transfer from teacher to student, i.e. replacing the student's mindmatrix with a new one determined by the teacher, it seems more appropriate to label such coarse-grained time steps as «lessons» or «days».

Finally, it has to be stated that in order to transform ELG into a full-fledged evolutionary-algorithm evaluation toolbox, the notion of time has to be somewhat refined. From the coarse-grained notion of a generation – which, in the case of Nowak's or Kvasnička's work, is equivalent to k_learn ≥ 1 acts of sampling the whole matrix – we propose to found further work on the finer notion of an ostensive definition (Wittgenstein, 2009). One can easily understand that within the ELG model, every internal step of the enveloping matrix-sampling loop⁴ can be interpreted as such a «definition by pointing», whereby the teacher associates a sound with a meaning within the mindmatrix of the student. Thus, under the conditions that 1) diverse models are evaluated by statistical non-parametric tests comparing the number of time steps needed to attain the global optimum, and 2) a time step is defined as an ostensive definition, one could propose such a fine-grained ELG as a possible evaluation toolkit not only for diverse evolutionary computing techniques, but possibly even as a more general method to assess the performance of game-theoretic approaches in attaining Nash-like equilibria (Trapa & Nowak, 2000). And if, additionally, 3) parameters like p_mutation, k_learn, the number of teachers, parents etc. can also evolve by means of a genetic meta-evolution governing the subordinated linguistic evolution, it can be expected that a phenomenon interpretable as the Baldwin effect shall be discovered in the world defined by the ELG framework.

Bibliography
Hromada, D. D. (2012). Evolutionary insight into spontaneous emergence of shared sound-meaning mappings in multi-agent communities. Accessible at: http://localhost.sk/~hromi/academic/2012/evolutionary_insights.pdf
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671–680.
Kvasnička, V., & Pospíchal, J. (2007). Evolúcia jazyka a univerzálny darwinizmus. In Myseľ, inteligencia a život. Bratislava: Slovenská Technická Univerzita.
Nowak, M. A., Plotkin, J. B., & Krakauer, D. C. (1999).
The evolutionary language game. Journal of Theoretical Biology, 200(2), 147–162.
Sekaj, I. (2005). Evolučné výpočty a ich využitie v praxi.
Trapa, P. E., & Nowak, M. A. (2000). Nash equilibria for an evolutionary language game. Journal of Mathematical Biology, 41(2), 172–188.
Wittgenstein, L. (2009). Philosophical investigations. Wiley-Blackwell.

4
    for my $row (0 .. $Matrix_height) {          # matrix sampling
        for my $column (0 .. $Matrix_width) {
            $Student_A_Matrix[$row][$column] += 1
                if rand() < $Teacher_P_Matrix[$row][$column];
        }
    }

[...] the so-called «>50% attack». This being said, we should specify that it is not the intention of this article to explain the intricacies of the Bitcoin algorithm, since this has already been done thousands of times with greater or lesser success. But we consider it important to focus the reader's attention upon the fact that, in the world of cryptocoins, the trajectory of a coin – from the very moment it was «min(t)ed» by one among the multitudes of the network's COGs until its current owner – is broadcast to all nodes of the network and is thus completely transparent.

1 In reality, the whole thing is somewhat more intricate, and what is being signed is, in fact, a script in a scripting language more complex than a simple «transfer quantity from A to B» instruction.

We call this feature of cryptocoin monetary assets «transparency of history». Anyone running a bitcoin server, or any visitor of sites like blockchain.info, can, with sufficient patience, observe the trajectory of every single coin from the «coinbase» to the current owner's address. But since it is quite easy for any user to generate multitudes of account addresses – which are nothing else than publicly broadcast cryptographic keys that cannot be actively used without knowledge of a «private key» from which they are generated during the account-address creation process and which only the owner knows – it is very difficult, if not practically impossible, to create a link between a cryptocoin address and the physical entity holding the key to that address, if that entity herself does not want to reveal her identity. While the lack of this bridge between the virtual and the real, which we call «pseudonymity of use», is applauded by advocates of the libertarian cryptopunk movement as a highly welcome and positive feature, it brings with itself a growing concern that, in the long term, such complete opaqueness shall above all be profitable to those who conduct financial activities which they would normally hide.

3 Min(t)ing and trading
In simple terms, there are only two ways how Bitcoins, or other cryptocoins – with the exception of PPCoin – can be earned: by mining and by trading. Miners are those who invest the computational power of their resources into the verification of the validity of transactions broadcast within the network. Since the probability of discovering a new block of coins is proportional to the amount of computational resources invested into mining, it follows that the biggest number of new «virgin» coins will become the property of those who invested the biggest amount of computational resources. It seems that for the first few months of Bitcoin's existence the algorithm was running only on the CPU of Satoshi Nakamoto, where he had possibly pre-mined cca 1 million bitcoins (Bitslog, 2013); subsequently other CPUs joined the network, and then a much faster SHA-2 hashing performance was made possible by exploiting the faculties of graphic cards' GPUs.
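To make the nature of this hashing race concrete: mining is, at its computational core, nothing but a brute-force search for a nonce whose (double) SHA-256 hash falls below a difficulty target. The following toy Perl sketch illustrates the principle using the core Digest::SHA module; the block payload and the four-zero «difficulty» are our illustrative assumptions, not Bitcoin's real parameters:

    #!/usr/bin/perl
    use strict; use warnings;
    use Digest::SHA qw(sha256 sha256_hex);

    my $block  = 'toy block header';   # illustrative payload, not a real header
    my $target = '0000';               # toy difficulty: hash must start with 4 zeroes

    # Increment the nonce until the double SHA-256 hash meets the target.
    my $nonce = 0;
    $nonce++ until sha256_hex(sha256($block . $nonce)) =~ /^$target/;
    print "nonce $nonce yields hash ",
          sha256_hex(sha256($block . $nonce)), "\n";

Every additional leading hexadecimal zero multiplies the expected work by sixteen – which is why the race from CPUs through GPUs to FPGAs and ASICs described above translates so directly into energy burned.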
With the market value of bitcoin gradually rising, the hound race continued with the deployment of the first Field-Programmable Gate Array (FPGA) bitcoin mining devices, and continued further with cohorts of Application-Specific Integrated Circuits (ASICs) spouting from factory conveyor belts somewhere in the Pudong economic zone. Given the fact that these devices can be bought, in the first place, for bitcoins, the whole bitcoin economy started to resemble an initially purely virtual, but with time ever-more-real, Uroboros snake reifying itself through the process of software being materialized in hardware and, as may be the case in days to come, also into the wetware of organic tissue. Initially, only informatic-oriented services were tradable for bitcoins, and only very rarely were some more material transactions executed – as was the case, for example, of the most expensive pizza² of mankind's history. Later some coffee producers and alpaca-sock distributors joined the club, but things changed with the launch of «Silk Road», an online drug marketplace (Barratt, 2012). By harnessing the anonymising possibilities furnished by the «TOR hidden services» protocol (Dingledine, 2005) and combining them with the pseudonymity of Bitcoin's financial transactions and a simple escrow-service business model, SR's developers quite quickly succeeded in transforming their supply-demand-coupling, eBay-like bazaar website – possibly running somewhere on a server in grandma's backyard – into a multimillion enterprise. In parallel to SR, online exchange offices like Vircurex or MtGox started to flourish, where it was possible to trade BTC for real-life fiat currencies. Whole stockmarkets emerged, making it possible to find investors for one's project.

2 Traded for 10000 BTC in 2010. Three years later, the estimated market value of such an amount of BTC would be more than 1 million US dollars.

The gambling and betting industry swiftly followed, with projects like Satoshi Dice adding another level of anonymity to the already opaque activities taking place within the cryptosphere. Being an ideal haven for money-laundering and tax-evasion, the Bitcoin economy gets mundane and flourishes.

4 Algorithmic quasi-deity
As of 2013, Bitcoin has all the prerequisites to become a new religion for a world where the «death of god» (Nietzsche, 1911) is a widely accepted truth. It has its myth of creation and the living testament of those who ate a million-dollar pizza. It has its disciples – mostly computer geeks who became millionaires because they were connected to the right discussion forum or Internet Relay Chat (IRC) channel at the right moment. And it has its devotees – people who invested their fortune and hundreds of hours of their lives in exchange for the hope that the Bitcoin economy shall turn out to be something more than a pyramid game; often people who know that in order not to lose what they have invested, they have to spread «the bitcoin gospel». It has its more and more omnipresent «giving deity» – a consensual algorithm based upon a simple inflationary curve, which distributes according to the promise that the biggest amount of «virgin» coins shall be given to those who invest the most into keeping the whole machinery going – the whole process being probabilistic, thus containing the necessary amount of hopeful waiting, sometimes crowned with blissful surprise.
And, last but not least, the BTC monotheism syndrome has its old idols to overthrow – idols like Ayn Rand's dollar (Rand, 1957), which paved the way but lost their power as gold once lost it, mutatis mutandis, when the gold standard was abolished. Given these propositions, suggesting that bitcoin mania can involve not only the frontal cortex but also the amygdala or even the pineal gland (Paloutzian & Park, 2005), it is of no surprise that even reasonable people consider it not only possible but even plausible that the information concerning a transaction of two potatoes in Ushuaia is broadcast to millions of nodes of the cryptosphere, Papua New Guinea included. Reason often discreetly quits the cognitive battlefield whenever the hoarding tendencies (Mataix-Cols et al., 2010) of human beings are coupled with addictive behaviour – which financial derivative trading surely is – thus leaving humans prone to the caprices of mass psychology. And as of spring 2013, slowly resurrecting from the implosion of the second deflationary bubble, when the market value fell from 260 USD to 80 USD in one day, Bitcoin is again gaining momentum and becoming truly massive.

5 Clash of the Titans
Contrary to an ancient Greek coin lying forgotten in the dust, which guards its value by simply being the object it was created to be, Bitcoin needs to burn energy in order to survive. What is more, the minting hound race obliges any minter to burn ever more energy in order to keep pace with other minters. When one takes into account all the machinery dedicated to making the network run, plus the machinery which makes the machinery which makes the network run, one is obliged to admit that Satoshi designed a monetary system addressing social and political issues while ignoring the ecological ones. More precisely: given the fact that without the ever-growing energy consumption caused by min(t)ers the transaction blockchain could be overwritten by a node obtaining more than 50% of the hashrate of the network, the whole machinery cannot be stopped or even slowed down, because if slowed down it will cease to be a secure value-carrying haven. Thus, the Bitcoin architecture has to lead, ex vi termini, to the «Tragedy of the Commons» scenario (Hardin, 1968). Luckily enough, some people have already understood that Nakamoto's Bitcoin was nothing else than a prototype, and that the values of the parameters determining the overall functioning of the network were just one set among multitudes of other, possibly more optimal, values. Thus, after a first wave of alternative cryptocoins like SolidCoin, LiquidCoin, IxCoin, I0Coin or FeatherCoin – whose objective was nothing else than to make those who deployed them rich, and which did not bring any substantial adjustment to Nakamoto's original code – a second wave of alternative cryptocoins like TerraCoin, LiteCoin or PPCoin is gaining momentum, each bringing with itself at least one novel feature. PPCoin (King & Nadal, 2012) seems to be of particular interest due to the importance its authors put upon long-term ecological sustainability, as well as due to the fact that it is the only cryptocoin which is not purely deflationary but integrates a very gentle inflation into the model. TerraCoin is of interest due to the different values of the network's initialisation parameters, and LiteCoin – currently the second strongest cryptocoin – attracts more and more attention because its proof-of-work component is based on the scrypt algorithm (Percival, 2009).
Since the scrypt algorithm involves not only simple hashing but demands the participation of huge amounts of memory, it is much more difficult to execute on specialised FPGA and ASIC hardware, thus making LiteCoin more attractive for min(t)ers disposing only of classical computers. Due to the growth of cryptocoin diversity, it is therefore far from certain that the cryptosphere shall, in the years to come, venerate by its activity only the Ƀ divinity. One can only hope that sooner or later a cryptocoin shall be proposed which will harness the computational resources of the involved COG devices for some noble task – be it anticancer protein modeling, climate prediction or astrophysical data analysis. But until the global deployment of such a cryptocoin takes place, all other cryptocoins shall principally address nothing else than the hoarding tendencies common to the superior primate which homo sapiens sapiens undoubtedly is.

6 Umwertung aller Werte
By pure coincidence, the author of this article bought, in February, 230 TerraCoins for approximately 1.4 USD. Two months and two mouse-clicks later, the amount could easily be traded for more than 140 USD on the Vircurex exchange, the net gain thus being approximately equivalent to four monthly wages of a full-time worker in the garment industry in Bangladesh. Putting aside the possible trading addiction (Taleb, 2005) which could emerge if ever such a behaviour-conditioning rewarding experience were repeated, one is obliged to pose the question: "What purpose do cryptocoins truly serve, and what value do they have?" And what value does a LiteCoin have if, at the same moment, in the same market place, one can buy it either for 4 dollars or for 0.02 Bitcoin, given the fact that at the very same moment, in the same market place, one can buy a Bitcoin for 100 dollars? The simple answer "none" goes much further than the simple economic notions of "time delta" and "arbitrage" could ever go. Cryptocoins cannot be eaten or drunk. They do not protect from the rain, they do not bring heat – contrary to banknotes, which can still be burned on a cold winter day, humans shall be obliged to burn ever more energy to keep the cryptocoin machinery going. Contrary to gold, one cannot make jewels or false teeth out of them; cryptocoins arouse no sentiment of beauty. Contrary to a credit card payment, one has to wait at least 10 minutes in case of BTC and 2.5 minutes in case of LiteCoin or TerraCoin to obtain, if lucky, one transaction confirmation (only after 5 or 6 confirmations can a vendor be sure that he was not the victim of a double-spending attack). Contrary to folk beliefs, the transfer of value in the current cryptosphere therefore definitely does not occur at the speed of light. Thus, as a value-storing asset, cryptocoins have only one principal advantage: there is a limited amount of them. In other terms: they are not for everybody. Not for those living on continents where the cryptosphere is absent. Nor for those who jumped too late on this biggest financial bulldozer ever invented. But only for those who think that playing the game of numbers acting on numbers is worth the most limited asset one ever had – the time of one's life. Only for those who think that having more of anything – even if that anything is, in fact, pure nothing – is an important marker of their social status. Thus, if posed with the question «Cui bono?
», it may be the case that «economic growth», «market» or «crime» shall be only partial answers – as partial as the answers «gluttony, greed, and vanity» (Dante, 1321). For we believe that it is not completely hors propos to state that structures like Bitcoin serve as opening gates to a world whereby a planetary emergent Artificial Intelligence has succeeded in penetrating, for the first time in mankind's history, into the realm of our virtues, vices and values.

Barratt, M. J. (2012). Silk Road: eBay for drugs. Addiction, 107(3), 683–683.
Bitslog. (2013). The Well Deserved Fortune of Satoshi Nakamoto, Bitcoin creator, Visionary and Genius. https://bitslog.wordpress.com/2013/04/17/the-well-deserved-fortune-of-satoshi-nakamoto/
Dante, A. (1321). La Divina Commedia.
Dingledine, R. (2005). Tor Hidden Services. Proc. What the Hack.
Drexler, K. E., & Minsky, M. L. (1990). Engines of Creation. Fourth Estate.
Gilbert, H., & Handschuh, H. (2004). Security analysis of SHA-256 and sisters. Selected Areas in Cryptography (pp. 175–193).
Hardin, G. (1968). The tragedy of the commons. Science, 162(3859), 1243–1248.
King, S., & Nadal, S. (2012). PPCoin: Peer-to-Peer Crypto-Currency with Proof-of-Stake. http://ppcoin.org/static/ppcoin-paper.pdf
Lopez, J., & Dahab, R. (2000). An overview of elliptic curve cryptography.
Nakamoto, S. (2008a). Re: Bitcoin P2P e-cash paper. http://www.mail-archive.com/cryptography@metzdowd.com/msg09971.html
Nakamoto, S. (2008b). Bitcoin: A peer-to-peer electronic cash system.
Nietzsche, F. W. (1911). The Complete Works of Friedrich Nietzsche: Thus Spake Zarathustra.
Percival, C. (2009). Stronger key derivation via sequential memory-hard functions.
Rand, A. (1996). Atlas Shrugged. Signet.
Rosenberg, P. (2007). A Lodging of Wayfaring Men (2nd ed.). Vera Verba.
Stephenson, N. (2000). Cryptonomicon. William Morrow Paperbacks.
Stephenson, N. (2003). The Diamond Age. Spectra.

Review of Steven J. Brams: Game Theory and the Humanities – Bridging Two Worlds. The MIT Press, 2011. 336 pages.

Let's have a game of 2 players, each of whom has 2 strategies. While it is almost impossible to imagine a situation which would seem simpler than such a game with 2 × 2 = 4 possible outcomes, the whole thing gets much more complex when one realizes that, even under the assumption that both players do not attribute absolute cardinal utilities to the diverse outcomes but only four mutually relative ordinal ranks (i.e. 1: worst outcome, 2: next-worst, 3: next-best and 4: best outcome), there exists a variety of 78 distinct 2 × 2 «games» for players with different preferences. Steven J. Brams' «Game Theory and the Humanities – Bridging Two Worlds» offers concrete historical or fictitious examples of more than a dozen of such games. Starting with an interpretation of Abraham's son-sacrificing dilemma as a possibly intrapsychic game which the old shepherd played with a somewhat sadistic god character; continuing through the intricacies of Pascal's wager towards more mundane games played between Nixon and the Supreme Court after the Watergate crisis, or the game played between Khomeini and Carter during the 1979 Iran hostage crisis; and ending with the famous Catch-22 case between Yossarian and the war machinery – almost everywhere in his book Brams makes a non-negligible step in the direction of a unification of law, history, politology, literary criticism and even theology under the mathematically sound clef de voûte offered by game theory. Such an act would in itself be worthy of praise but, luckily for science, Brams goes much further.
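Incidentally, the figure of 78 structurally distinct games mentioned above can be verified by brute force: enumerate all 24 × 24 = 576 assignments of strict ordinal ranks and count the equivalence classes under row swaps, column swaps and the interchange of the two players. The following Perl sketch is our own illustration, not taken from Brams' book:

    #!/usr/bin/perl
    use strict; use warnings;
    use List::Util qw(minstr);

    # Cells of a 2x2 game in fixed order: (0,0),(0,1),(1,0),(1,1);
    # each cell holds [rank for player 1, rank for player 2].
    my @perms = permutations(1, 2, 3, 4);   # 24 strict preference orders
    my %classes;
    for my $p1 (@perms) {
        for my $p2 (@perms) {
            my @game = map { [ $p1->[$_], $p2->[$_] ] } 0 .. 3;
            $classes{ canonical(\@game) } = 1;
        }
    }
    print scalar(keys %classes), "\n";      # prints 78

    # Canonical form: lexicographically smallest encoding over the 8 symmetries.
    sub canonical {
        my ($g) = @_;
        my @variants;
        for my $rows (0, 1) { for my $cols (0, 1) { for my $swap (0, 1) {
            my @c = map { [ @$_ ] } @$g;
            @c[0, 1, 2, 3] = @c[2, 3, 0, 1] if $rows;   # relabel P1's strategies
            @c[0, 1, 2, 3] = @c[1, 0, 3, 2] if $cols;   # relabel P2's strategies
            if ($swap) {                                 # interchange the players
                @c[1, 2] = @c[2, 1];                     # transpose the matrix
                @c = map { [ $_->[1], $_->[0] ] } @c;    # swap the payoff roles
            }
            push @variants, join ',', map { @$_ } @c;
        } } }
        return minstr @variants;
    }

    sub permutations {
        my @items = @_;
        return ([]) unless @items;
        my @out;
        for my $i (0 .. $#items) {
            my @rest = @items;
            my ($pick) = splice @rest, $i, 1;
            push @out, [ $pick, @$_ ] for permutations(@rest);
        }
        return @out;
    }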
The introduction of the Theory of Moves framework allows him to extend the classical notion of a Nash equilibrium into the notion of a «nonmyopic equilibrium», which takes into account the players' faculty of «anticipating all possible rational moves and countermoves from the initial state». Structural similarities between Shakespeare's Macbeth and Aristophanes' Λυσιστράτη are subsumed under the generic category of (Self-)Frustration games, while other concrete instances of 2 × 2 conflicts (e.g. the American Civil War) are presented in order to illustrate other generic categories like «Magnanimity games» or «King-of-the-mountain games». Topics like deception, games where some players have incomplete or false information, the rationality of emotions, or the «paradox of omniscience», demonstrating that «in certain games it is more advantageous not to know everything than the contrary», are introduced with the erudition of a scholar with almost half a century of practice in the field. To summarize: the interdisciplinary paradigm presented in the glossary, appendix, 11 chapters and 35 figures of Brams' book is not only intellectually pleasing but could also furnish practically exploitable insights for experts in domains as distant as comparative mythology, evolutionary psychology, roboethics or – if the Turing Test can be collapsed into a 2 × 2 game – even the domain of hard-core AI.

Written in 2013 for the quarterly of the Artificial Intelligence and Simulated Behaviour Society (AISB). Daniel Devatman Hromada is a double PhD candidate of the Slovak University of Technology (dpt. of cybernetics) and Université Paris 8 (dpt. of cognitive psychology).

Random Projection and Geometrization of String Distance Metrics

Daniel Devatman Hromada
Université Paris 8 – Laboratoire Cognition Humaine et Artificielle
Slovak University of Technology – Faculty of Electrical Engineering and Information Technology
hromi@giver.eu

Abstract
Edit distance is not the only approach by which the distance between two character sequences can be calculated. Strings can also be compared in somewhat subtler, geometric ways. A procedure inspired by Random Indexing can attribute a D-dimensional geometric coordinate to any character N-gram present in the corpus and can subsequently represent a word as the sum of the N-gram fragments which the string contains. Thus, any word can be described as a point in a dense D-dimensional space, and the calculation of distances between words can be realized by applying traditional Euclidean measures. A strong correlation exists, within the Keats Hyperion corpus, between such a cosine measure and Levenshtein distance. Overlaps between the centroid of the Levenshtein distance matrix space and the centroids of vector spaces generated by Random Projection were also observed. Contrary to the standard non-random "sparse" method of measuring cosine distances between two strings, the method based on Random Projection tends naturally to promote not the shortest but rather the longer strings. The geometric approach yields a finer output range than Levenshtein distance, and the retrieval of the nearest neighbor of a text's centroid could have, due to the limited dimensionality of the Randomly Projected space, smaller complexity than other vector methods.

Μηδεὶς ἀγεωμέτρητος εἰσίτω μου τὴν στέγην

1 Introduction
The transformation of qualities into still finer and finer quantities belongs among the principal hallmarks of the scientific method.
In a world where even "deep" entities like "word-meanings" are quantified and co-measured by an ever-growing number of researchers in computational linguistics (Kanerva et al., 2000; Sahlgren, 2005) and the cognitive sciences (Gärdenfors, 2004), it is of no surprise that "surface" entities like "character strings" can also be compared one with another according to a certain metric. Traditionally, the distance between two strings is most often evaluated in terms of edit distance (ED), which is defined as the minimum number of operations like insertion, deletion or substitution required to change one string-word into the other. A prototypical example of such an edit distance approach is the so-called Levenshtein distance (Levenshtein, 1966). While many variants of Levenshtein distance (LD) exist – some extended with other operations like that of "metathesis" (Damerau, 1964), some exploiting probabilistic weights (Jaro, 1995), some introducing dynamic programming (Wagner & Fischer, 1974) – all these ED algorithms take for granted that notions of insertion, deletion etc. are crucial in order to operationalize the similarity between two strings. Within this article we shall argue that one can successfully calculate the similarity between two strings without taking recourse to any edit operation whatsoever. Instead of discrete insert & delete operations, we shall focus the reader's attention upon a purely positive notion, that of the "occurrence of a part within the whole" (Harte, 2002). Any string-to-be-compared shall be understood as such a whole, and any continuous N-gram fragment observable within it shall be interpreted as its part.

2 Advantages of Random Projection
Random Projection is a method for projecting high-dimensional data into representations with fewer dimensions. In theoretical terms, it is founded on the Johnson-Lindenstrauss lemma (Johnson & Lindenstrauss, 1984), stating that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that the distances between the points are nearly preserved. In practical terms, solutions based on Random Projection, or on the closely related Random Indexing, tend to yield high performance when confronted with diverse NLP problems like synonym-finding (Sahlgren & Karlgren, 2002), text categorization (Sahlgren & Cöster, 2004), unsupervised bilingual lexicon extraction (Sahlgren & Karlgren, 2005), discovery of implicit inferential connections (Cohen et al., 2010) or automatic keyword attribution to scientific articles (El Ghali et al., 2012). RP distinguishes itself from other word space models in at least one of these aspects:

1. Incremental: RP allows one to inject on-the-fly new data points (words) or their ensembles (texts, corpora) into an already constructed vector space. One is not obliged to execute heavy computations (like Singular Value Decomposition in the case of Latent Semantic Analysis) every time new data is encountered.

2. Multifunctional: Like other vector-space models, RP can be used in many diverse scenarios. In RI, for example, words are often considered to be the terms and sentences are understood as documents. In this article, words (or verses) shall be considered as documents, and the N-gram fragments which occur in them shall be treated like terms.

3. Generalizable: RP can be applied in any scenario where one needs to encode into vectorial form the set of relations between discrete entities observable at diverse levels of abstraction (words/documents, parts/wholes, features/objects, pixels/images etc.).
4. Absolute: N-grams and terms, words and sentences, sentences and documents – in RP all these entities are encoded in the same randomly constructed yet absolute space. Similarity measurements can therefore be realized even among entities which would be considered incommensurable in more traditional approaches¹.

1 In traditional word space models, words are considered to be represented by the rows (vectors/points) of the word-document matrix and documents to be its columns (axes). In RP, both words (or word-fragments) and documents are represented by rows.

There is, of course, a price to be paid for these advantages. Primo, RP involves stochastic aspects, and its application thus does not guarantee replicability of results. Secundo, it involves two parameters D and S, and the choice of these parameters can significantly modify the model's performance (in relation to the corpus upon which it is applied). Tertio: since even the most minute "features" are initially encoded in the same way as more macroscopic units like words, documents or texts – i.e. by a vector of length D "seeded" with S non-zero values – RP can be susceptible to certain limitations if ever applied on data discretisable into millions of distinct observable features.

3 Method
The method of geometrization of strings by means of Random Projection (RP) consists of four principal steps. Firstly, the strings contained within the corpus are "exploded" into fragments. Secondly, a random vector is assigned to every fragment according to RP's principles. Thirdly, the geometric representation of a string is obtained as the sum of its fragment-vectors. Finally, the distance between two strings can be obtained by calculating the cosine of the angle between their respective geometric representations.

3.1 String Fragmentation
We define a fragment F of a word W having length N as any continuous² 1-, 2-, 3-, ..., N-gram contained within W. Thus, a word of length 1 contains 1 fragment (the fragment is the word itself), words of length 2 contain 3 fragments (e.g. "ab" yields a, b, ab) and, more generally, there exist N(N+1)/2 fragments for a word of length N. The fragmentation algorithm is as follows:

    sub fragmentator {
        my ($word) = @_;
        my @frags;
        for my $frag_length (1 .. length $word) {
            for my $offset (0 .. length($word) - $frag_length) {
                push @frags, substr($word, $offset, $frag_length);
            }
        }
        return @frags;
    }

where substr() is a function returning from the string $word a fragment of length $frag_length starting at the specified $offset.

2 Note that in this introductory article we exploit only continuous N-gram fragments. The interaction of RP with other possibly relevant patterns observable in the word – like N-grams with gaps, or sequences of members of diverse equivalence classes [e.g. consonants/vowels] – shall be, we hope, addressed in our doctoral Thesis or other publications.

3.2 Stochastic fragment-vector generation
Once the fragments are obtained, we transform them into geometric entities by following the fundamental precept of Random Projection: to every fragment-feature F present in the corpus, let us assign a random vector of length D containing D−S elements having zero values and S elements whose value is either −1 or 1. The number of dimensions (D) and the seed (S) are the parameters of the model. It is recommended that S be much smaller than D. [...] were observed among the words of the corpus, one could distinguish 252229 diverse real-numbered RPD values limited to the interval <0, 1>.
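A minimal Perl sketch of steps two and three as described above – the caching scheme is our illustrative choice, with D=1000 and S=5 as in the experiments below:

    my ($D, $S) = (1000, 5);     # dimensionality and seed
    my %fragment_vector;         # cache: every fragment keeps one fixed random vector

    # Return (generating on first demand) the random init vector of a fragment:
    # D elements, of which S randomly placed ones are -1 or +1, the rest zeroes.
    sub fragment_vector {
        my ($frag) = @_;
        $fragment_vector{$frag} //= do {
            my @v = (0) x $D;
            my $placed = 0;
            while ($placed < $S) {
                my $dim = int rand $D;
                next if $v[$dim];                 # keep the S non-zero slots distinct
                $v[$dim] = rand() < .5 ? -1 : 1;
                $placed++;
            }
            \@v;
        };
    }

    # A word's vector is the sum of the vectors of all its fragments (cf. 3.1).
    sub word_vector {
        my ($word) = @_;
        my @w = (0) x $D;
        for my $frag (fragmentator($word)) {
            my $fv = fragment_vector($frag);
            $w[$_] += $fv->[$_] for 0 .. $D - 1;
        }
        return \@w;
    }

    # Cosine of the angle between two word vectors: the RPD similarity measure.
    sub cosine {
        my ($u, $v) = @_;
        my ($dot, $nu, $nv) = (0, 0, 0);
        for my $i (0 .. $D - 1) {
            $dot += $u->[$i] * $v->[$i];
            $nu  += $u->[$i] ** 2;
            $nv  += $v->[$i] ** 2;
        }
        return $dot / sqrt($nu * $nv);
    }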
RPD
vessels – vessel
comfort – comforts
sorrows – sorrow
'benign – benign
temples – temple
changing – unchanging
stream – streams
immortal's – immortal
breathe – breath
trance – tranced

Table 2: Ten most similar word couples according to the non-random "sparse" geometric distance (GD) and the Randomly Projected Distance (RPD)

Figure 1: Scatter plot displaying the relations between Levenshtein distances and cosine distances measured in the vector space constructed by RI1000,5

String distance measured in the space constructed by RP1000,5 also strongly correlates (Pearson correlation coefficient = 0.992; Spearman rho = 0.679; minimal p < 2.2e-16 for both tests) with a GD cosine measure exploiting a non-deformed fragment-word matrix. An important difference was observed, however, during a more "local" & qualitative analysis of the results produced by the two vectorial methods. More concretely: while the non-stochastic "sparse" cosine GD distance tends to promote as "closest" couples of short strings, RPD yields the highest scores for couples of long words. This is indicated by the list of most similar word couples presented in Table 2.

4.3 The "centroid" experiment
Three types of concrete word-centroids were extracted from the corpus. A string having the smallest overall LD to all other strings in the corpus shall be labeled the "Levenshtein centroid" (LC). A string having the maximal sum of cosines in relation to other words shall be labeled the "Cosinal centroid" (CC). Contrary to LC and CC, for the calculation of which one has to calculate distances with regard to all words in the corpus, the "Geometric Centroid" (GC) was determined as the word whose vector has the biggest cosine with regard to the "Theoretical Centroid" (TC), obtained in a purely geometric way as the sum of all word-vectors. Stochastic CCRP and GCRP calculation simulations were repeated in 100 runs with D=1000, S=5.

4.3.1 Results
The word "are" was determined to be the LC of the Hyperion corpus, with an average LD_are,X = 4.764 to all words of the corpus. The same word "are" was ranked, by the non-stochastic "sparse" geometric distance algorithm, as the 3rd most central CC and the 36th closest term to GC. Table 3 shows the ten terms with the least overall LD to all other words of the corpus (LC), the ten terms with the biggest cosine in relation to all other terms of the corpus (CCGD) and the ten terms with the biggest cosine with regard to the hypothetical Theoretical Centroid (GCGD) of the sparse non-projected space obtained from the Hyperion corpus.

Rank | LC   | CCGD  | GCGD
1    | are  | charm | a
2    | ore  | red   | o
3    | ate  | arm   | I
4    | ere  | a     | '
5    | one  | me    | he
6    | toes | hard  | to
7    | sole | had   | at
8    | ease | reed  | an
9    | lone | domed | me
10   | here | are   | as

Table 3: Ten nearest-neighbor words of three types of non-stochastic centroids

The shortest possible strings seem to be GCGD's nearest neighbors. This seems analogous to the data presented in Table 2. In this sense the GCGD method seems to differ from the CCGD approach, which tends to promote longer strings. Such a marked difference in behavior between the GC and CC approaches was not observed in the case of spaces constructed by means of Random Projection. In 100 runs, both the GC- and CC-centered approaches seemed to promote as central strings of comparable content and length⁴. As indicated by Table 4, the LC "are" turned out to be the closest (i.e. Rank 1, when comparing with Table 3) to all other terms in 6% of the Random Projection runs. In 6% of runs the same term was labeled as the nearest neighbor to the geometrical centroid of the generated space. Other overlaps between all the methods used are marked by bold type in Tables 3 and 4.
Word | CCRPD | GCRPD
see  | 20    | 28
he   | 11    | 8
are  | 6     | 6
ore  | 5     | 6
ere  | 4     | 5
set  | 6     | 5
she  | 5     | 4
sea  | 4     | 4
a    | 9     | 4
red  | 1     | 3

Table 4: Central terms of Randomly Projected spaces and their frequency of occurrence in 100 runs

Analogically to the observation described in the last paragraph of section 4.2.1, it can also be observed that the strings characterized as "closest" to the Theoretical Centroid of vector spaces generated by Random Projection tend to be longer than the "minimal" strings nearest to GCGD determined in the traditional non-stochastic feature-word vector space scenario.

4 In fact, only in 22 runs did GCRPD differ from CCRPD.

5 Discussion
When it comes to CCRP calculation, one run lasted on average t_CCRPD-detection = 90 seconds, thus being almost twice as fast as the LC calculation executed on the very same computer, which lasted t_LC-detection = 157 s for the same corpus, indicating that the computational complexity of our PDL (Glazebrook et al., 1997) implementation of CCRP detection is smaller than the complexity of LC detection based on PERL's Text::Levenshtein implementation of LD. When it comes to the computational complexity of GC calculation, it is evident that GC is determined faster and by a less complex process than LCs or CCs. This is so because in order to determine the GCRP of N words there is no need to construct an N × N distance matrix. On the contrary, since every word is attributed coordinates in a randomly generated yet absolute space, the detection of a hypothetical Geometric Centroid of all words is a very straightforward and cheap process, as is the detection of GC's nearest word neighbor. And since in RP the length of the GC-denoting vector is limited to a relatively reasonably low number of elements (i.e. D = 1000 in the case of this paper), it is of no surprise that the string closest to GC shall be found more slowly in a traditional "sparse vector" scenario whenever the number of features (columns) > D. In our scenario with NF = 22340 distinct features, it was almost 4 times faster to construct the vector space and find the nearest word to the GC of the Randomly Projected space than to use a "sparse" fragment-term matrix optimized by storing only non-zero values (GCRPD-NN-detection ≈ 6 s; GCGD-NN-detection ≈ 22 s). Another thing worthy of interest could be that, contrary to the "sparse" method, which seems to give higher scores to shorter strings, somewhat longer strings seem to behave as if they were naturally "pushed towards the centroid" in the dense space generated by RP. If such is, verily, the case, then we believe that the method presented hereby could be useful, for example, in the domain of gene sequence analysis, or in other scenarios where the pattern-to-be-discovered is "spread out" rather than centralized. In practical terms, if ever the querying in RP space turns out to have smaller complexity than other vector models, our method could be useful within a hybrid system as a fast stochastic way to pre-select a limited set of "candidate" (possibly locally optimal) strings, which could subsequently be confronted with more precise, yet costly, non-stochastic metrics ultimately leading to the discovery of the global optimum. Aside from the above-mentioned aspects, we believe that there exists at least one other theoretical reason for which the RP-based geometrization procedure could be deemed a worthy alternative to LD-like distance measures.
That is: the cardinality of the real-valued <0, 1> range of a cosine function is much higher than the whole-numbered <0, max(length(word))> range possibly offered as the output of Levenshtein Distance. In other terms, the outputs of string distance functions based on the trigonometry of RP-based vector spaces are subtler, more fine-grained, than those furnished by traditional LD. While this advantage does not hold for "weighted" LD measures, we hope that this article may motivate future studies aiming to compare "weighted" LD and RPD metrics. When it comes to the feature-extracting "fragment explosion" approach, it could possibly be reproached to the method proposed hereby that 1) the fragmentation component, which permutes blindly through all N-grams present in the corpus, yields too many "features"; 2) that taking all of them into account during the calculation of the word's final vector is not necessary and could even turn out to be computationally counter-productive; or that 3) bi-grams and tri-grams alone give better results than larger N (Manning et al., 2008). A primary answer to such an ensemble of reproaches could be that, by the very act of projecting the data upon a limited set of the same non-orthogonal dimensions, the noise could simply cancel itself out⁵.

5 And this "noise canceling property" could be especially true for RP as defined in this paper, where the rare non-zero values in the random "init" vectors can point in opposite directions (i.e. either −1 or 1).

Another possible answer to the argument could be that while the bi- & tri-gram argument holds well for natural language structures, the method we aim to propose here has ambitions to be used beyond NLP (e.g. bio-informatics) or in pre-NLP (e.g. early stages of language acquisition, where the very notion of an N-gram does not make sense because the very criterion of sequence segmentation & discretization has not yet been established). Last but not least, we could counter-argue by stating that algorithms based on a sort of initial blind "computational explosion of the number of features" often perform better than those which do not perform such an explosion, especially when coupled with subsequent feature selection algorithms. Such is the case, for example, of the approach proposed by Viola & Jones (Viola & Jones, 2001), which caused a revolution in computer vision by proposing that in order to detect an object, one should look for combinations of pixels instead of pixels. In this paper, such combinations of "letter-pixels" were, mutatis mutandis, called "fragments". Our method departs from the idea that one can, and should, associate random vectors to such fragments. But the idea can go further: instead of looking for the occurrence of a part in the whole, a more advanced RI-based approach shall replace the notion of a "fragment occurring in the word" by the more general notion of a "pattern which matches the sequence". Thus even the vector associated to the pattern /d.g/ could be taken into account during the construction of a vector representing the word "dog". Reminding the reader that RP-based models perform very well when it comes to offering solutions to quite "deep", signifié-oriented problems, we find it difficult to understand why the same algorithmic machinery could not be applied to problems dealing with "surface", signifiant-oriented problems, notably given the fact that some sort of dimensionality reduction has to occur whenever the mind tries to map >4D experiences upon the neural substrate of a brain embedded in 3D physical space.
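As a toy illustration of this generalization – our own sketch, reusing $D and fragment_vector() from the sketch in section 3.2, with an arbitrary pattern inventory – the substring-based explosion can be replaced, or complemented, by any list of regular expressions, a word's vector then summing the vectors of all patterns that match it:

    # Generalization sketch: features need not be substrings; any pattern that
    # matches the word can contribute its random vector to the word's vector.
    my @patterns = (qr/d.g/, qr/^do/, qr/g$/);   # arbitrary illustrative inventory

    sub word_vector_from_patterns {
        my ($word) = @_;
        my @w = (0) x $D;
        for my $pattern (@patterns) {
            next unless $word =~ $pattern;
            my $pv = fragment_vector("$pattern");  # stringified qr// as cache key
            $w[$_] += $pv->[$_] for 0 .. $D - 1;
        }
        return \@w;
    }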
Given that all observed correlations and centroid overlaps indicate that string distance calculation based on Random Projection could turn out to be a useful substitute for the LD measure, or even for other, more fine-grained methods, and given that RP would not be possible if the Johnson-Lindenstrauss lemma were not valid, our results can also be interpreted as one more empirical demonstration of the validity of the JL-lemma.

Acknowledgments
The author would like to thank Adil El Ghali for the introduction into Random Indexing as well as for his comments concerning the present paper; prof. Charles Tijus and doc. Ivan Sekaj for their support; and Aliancia Fair-Play for permission to execute some code on their servers.

References
Trevor Cohen, Roger Schvaneveldt & Dominic Widdows. 2010. Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2), 240–256.
Fred J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), 171–176.
Adil El Ghali, Daniel D. Hromada & Kaoutar El Ghali. 2012. Enrichir et raisonner sur des espaces sémantiques pour l'attribution de mots-clés. JEP-TALN-RECITAL 2012, 77.
Peter Gärdenfors. 2004. Conceptual spaces: The geometry of thought. MIT Press.
Karl Glazebrook, Jarle Brinchmann, John Cerney, Craig DeForest, Doug Hunt, Tim Jenness & Tuomas Lukka. 1997. The Perl Data Language. The Perl Journal, 5(5).
Verity Harte. 2002. Plato on parts and wholes: The metaphysics of structure. Clarendon Press, Oxford.
Matthew A. Jaro. 1995. Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7), 491–498.
William B. Johnson & Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.
Pentti Kanerva, Jan Kristofersson & Anders Holst. 2000. Random indexing of text samples for latent semantic analysis. Proceedings of the 22nd Annual Conference of the Cognitive Science Society (Vol. 1036).
John Keats. 1819. The Fall of Hyperion. A Dream. In John Keats: Complete poems and selected letters, 381–395.
Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady (Vol. 10, p. 707).
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini & Chris Watkins. 2002. Text classification using string kernels. The Journal of Machine Learning Research, 2, 419–444.
Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge University Press.
Magnus Sahlgren. 2005. An introduction to random indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE (Vol. 5).
Magnus Sahlgren & Rickard Cöster. 2004. Using bag-of-concepts to improve the performance of support vector machines in text categorization. Proceedings of the 20th International Conference on Computational Linguistics (p. 487).
Magnus Sahlgren & Jussi Karlgren. 2002. Vector-based semantic analysis using random indexing for cross-lingual query expansion. Evaluation of Cross-Language Information Retrieval Systems (pp. 169–176).
Magnus Sahlgren & Jussi Karlgren. 2005. Automatic bilingual lexicon acquisition using random indexing of parallel corpora. Natural Language Engineering, 11(3), 327–341.
Alan M. Turing. 1936. On computable numbers, with an application to the Entscheidungsproblem.
Proceedings of the London Mathematical Society, 42(2), 230–265.

Paul Viola & Michael Jones. 2001. Rapid Object Detection using a Boosted Cascade of Simple Features. Proc. IEEE CVPR 2001.

Robert A. Wagner & Michael J. Fischer. 1974. The string-to-string correction problem. Journal of the ACM (JACM), 21(1), 168–173.

Empiric Introduction to Light Stochastic Binarization

Daniel Devatman Hromada 1,2

1 Slovak University of Technology, Faculty of Electrical Engineering and Information Technology, Department of Robotics and Cybernetics, Ilkovičova 3, 812 19 Bratislava, Slovakia
2 Université Paris 8, Laboratoire Cognition Humaine et Artificielle, 2, rue de la Liberté, 93526 St Denis Cedex 02, France

Abstract. We introduce a novel method for the transformation of texts into short binary vectors which can subsequently be compared by means of Hamming distance measurement. Similarly to other semantic hashing approaches, the objective is to perform radical dimensionality reduction by putting texts with similar meaning into the same or similar buckets while putting texts with dissimilar meaning into different and distant buckets. First, the method transforms the texts into a complete TF-IDF representation, then it implements Reflective Random Indexing in order to fold both term and document spaces into a low-dimensional space. Subsequently, every dimension of the resulting low-dimensional space is simply thresholded along its 50th percentile, so that every individual bit of the resulting hash shall cut the whole input dataset into two equally cardinal subsets. Without implementing any parameter-tuning training phase whatsoever, the method attains, especially in the high-precision/low-recall region of the 20newsgroups text classification task, results which are comparable to those obtained by much more complex deep learning techniques.

Keywords: Reflective Random Indexing, unsupervised Locality Sensitive Hashing, Dimensionality Reduction, Hamming Distance, Nearest-Neighbor Search

1 Introduction

In applied Computer Science one often needs to select from a database the object which most resembles a "query" object already at one's disposition. In order to do so, all members of the database are often transformed into ordered sequences of numeric values (i.e. vectors). Such vectors can be interpreted as points in a high-dimensional metric space, allowing one to calculate their distance to other points in the space. In such a case, the resulting "most similar" entity is simply the entity whose vector has a smaller distance to the vector representing the "query" entity than any other entity stored in the database, i.e. is the query's "nearest neighbor". In Natural Language Processing (NLP), nearest-neighbor search (NNS) is a widely-used approach applied for solving diverse problems. Seemingly trivial, NNS is nonetheless not an easy problem to tackle, especially in the case of Big Data scenarios where the database contains a huge amount of high-dimensional datapoints, and in real-time scenarios where the naive linear comparison of the d-dimensional query vector with all N vectors stored in the database is simply not feasible due to its O(Nd) computational complexity. Thus, one is almost always obliged to take recourse in approximation or heuristic-based solutions. One of the most common methods of reducing the complexity of the NN-search is reducing the dimensionality of the database-representing vector space. The classical approach to do so is Latent Semantic Analysis [10] (LSA).
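A minimal illustration of why the naive approach scales badly: the brute-force scan below touches all N × d values for every single query, hence the O(Nd) complexity mentioned above. Names and sizes are, of course, merely illustrative.

import numpy as np

N, d = 100_000, 300                      # database size and dimensionality
database = np.random.randn(N, d)         # one row per stored object
query = np.random.randn(d)

# O(Nd): every query must visit every value of every stored vector.
dists = np.linalg.norm(database - query, axis=1)
nearest = int(np.argmin(dists))
print("nearest neighbor:", nearest, "distance:", dists[nearest])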
Another family of increasingly common approaches exploits so-called binary vectors as the ultimate means of an entity's formalisation. Given the fact that contemporary computers are machines essentially – i.e. on the physical hardware level – always working with binary distinctions, the calculation of the distance between two binary vectors (i.e. the Hamming distance: the number of times a bit in vector1 has to be flipped in order to obtain the form of vector2) can indeed be a very fast operation to realize, especially when implemented on the hardware level as a part of the processor's instruction set. Combinations of dimensionality reduction and binarisation are the basis for the family of methods descending from the approach called Locality Sensitive Hashing (LSH) [11]. While concrete implementations often substantially differ – c.f. [13] for a state-of-the-art overview – the objective is always the same: to hash each object of the dataset into a concise binary vector in such a way that objects which are similar shall end up in the same or a similar bucket (i.e. shall be represented by the same or a similar binary vector) while objects which are disparate shall end up in disparate buckets.3 In order to attain stunningly good results, many of these methods have to be trained first. Such a trade-off of high performance against the complexity of the training phase is the case, for example, in the "semantic hashing" (SH) approach of [1]. In SH one has to first learn the weights between different restricted Boltzmann machines in order to obtain a multi-layered "deep generative model" able to perform the hashing. But SH also has certain non-negligible disadvantages: 1) the training-related costs, 2) the need to work with a restricted amount of features which shall enter the first layer of the network (e.g. 2000 TF-IDF values in [1]), 3) the possibility of over-fitting of the model, etc. In this article, we shall present an approach which one could take vis-à-vis the problem of "text hashing". Instead of founding our approach on a powerful supervised "deep learning" algorithm able to extract sophisticated combinations of regularities among a restricted number of initial features, we shall exploit an algorithm so simple that it can easily integrate a huge number of features in a very fast & frugal way. In fact, the algorithm presented here is completely unsupervised and does not need any training or feature pre-selection at all in order to perform the hashing process.

3 Note that the aim of the hashing process as presented in this paper differs substantially from the aim of hashing algorithms like MD5 or SHA2, whose objective is to always hash objects into different buckets.

1.1 Reflective Random Indexing

Theoretically, our approach stems from the Johnson-Lindenstrauss lemma, stating that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved [9]. Practically, the JL-lemma has already been implemented as the so-called Random Projection or Random Indexing algorithms. Random Projection has already been quite successfully proposed in relation to the hashing problem [12]. Its much simpler Random Indexing (RI) counterpart, however, has not.
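Returning for a moment to the hardware argument above: with binary vectors packed into machine words, the Hamming distance reduces to an XOR followed by a population count. A minimal sketch (Python ≥ 3.10 exposes the popcount as int.bit_count()):

# Two short hashes represented as Python integers (toy values).
h1 = 0b1001_0011_0000_1011
h2 = 0b1001_0111_0000_0011

# Hamming distance = number of differing bits = popcount of XOR.
print((h1 ^ h2).bit_count())   # -> 2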
In the decade since its initial proposal in [4], RI has already proven its usefulness in regards to NLP problems as diverse as synonym-finding [4], text categorization [5], unsupervised bilingual lexicon extraction [6], implicit knowledge discovery [2], automatic keyword attribution to scientific articles [3] or measurement of string distance metrics [8]. The basic RI algorithm is quite straightforward to both understand and implement: given a set of N objects (e.g. documents) which can be described in terms of F features (e.g. the occurrence of a string in the document), to each of which one initially associates a randomly generated D-dimensional vector, one can obtain a D-dimensional vectorial representation of any object X by summing up the vectors associated to all features F1, F2, ... observable within X. The original random feature vectors are generated in such a way that out of the D elements of the vector, only S among them are set to either the -1 or 1 value; all other values contain zero. The "seed" parameter S is much smaller than the total number of elements in the vector (D), i.e. S << D, so the initial feature vectors are very sparse. [...] Every dimension d of the resulting low-dimensional space is subsequently thresholded along its median, so that the bit assigned to a document whose real value in dimension D_d is n is determined as:

h_d(n) = 0 if n < median(D_d)
h_d(n) = 1 if n > median(D_d)
h_d(n) = rand(0|1) if n == median(D_d)

Subsequently, during the query phase, the query object is transformed into its binary vector by: 1) summing up the real-valued representations of the features observable within the query object; 2) thresholding the resulting real-valued vector by the pre-determined medians. The resulting binary vector is subsequently considered to be the center of a Hamming ball of radius R. Every binary vector contained within such a Hamming ball shall yield an index pointing to a bucket stored in the memory where we could look in order to find the query's nearest neighbor. In case 2R [...]

> New since version of 18 April 1993:
> * New version of XV supports 24-bit viewing for X Windows.
> * New versions of DVPEG & Image Alchemy for DOS.
> * New versions of Image Archiver & PMView for OS/2.
> * New listing: MGIF for monochrome-display Ataris.
461,463c464,466
< PMView 0.85: JPEG/GIF/BMP/Targa/PCX viewer. GIF viewing very fast,
< JPEG viewing roughly the same speed as the above two programs. Has
< image manipulation & slideshow functions. Shareware, $20.
---
> PMView 0.85: JPEG/GIF/BMP viewer. GIF viewing very fast, JPEG viewing
> fast if you have huge amounts of RAM, otherwise about the same speed
> as the above programs. Strong 24-bit display support. Shareware, $20.
632,641d634
< NeXT:
<
< ImageViewer is a PD utility that displays images and can do some format
< conversions. The current version reads JPEG but does not write it.
< ImageViewer is available from the standard NeXT archives at
< sonata.cc.purdue.edu and cs.orst.edu, somewhere in /pub/next (both are
< currently being re-organized, so it's hard to point to specific
< sub-directories). Note that there is an older version floating around that
< does not support JPEG.

In spite of the difference of their contents, LSB assigned the same hash "10010011000010111001100010110010001111000011011101001001010011010111010000000100101001101100010010010101001101000010111110110011" to the files comp.graphics/39638 and comp.graphics/39078 of the 20-newsgroups corpus.

4 Discussion

Looking at the peak shown in Figure 2, one is tempted to state that, when confronted with data from the testing set of the 20newsgroups corpus, the reflective 128-dimensional LSB is able to retrieve, in 42% of cases (i.e. 3166 out of 7531), at least one relevant "neighbor" with maximal precision.
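As an aside before examining these results further: the core of the pipeline evaluated here – per-dimension median thresholding of the folded vectors, followed by Hamming-distance lookup – admits a very short sketch. The code below is our minimal reading, with random data standing in for the RRI-folded document vectors; all names and sizes are illustrative, and the rand tie-break of the formula in Sect. 1.1 is omitted since exact ties have probability zero for continuous values.

import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 128))        # stand-in for RRI-folded document vectors

medians = np.median(docs, axis=0)          # one threshold per dimension
hashes = (docs > medians).astype(np.uint8) # 128-bit binary hash per document

# Query phase: fold the query the same way, then threshold by the same medians.
query = rng.normal(size=128)
qhash = (query > medians).astype(np.uint8)
hamming = np.count_nonzero(hashes != qhash, axis=1)
print("closest bucket at Hamming distance:", hamming.min())

By construction, every bit cuts the dataset into two (almost) equally cardinal halves, which is exactly the "50th percentile" property stated in the abstract.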
It is indeed at the Hamming distance 38 where the method combining Reflective Random Indexing executed with parameters D=128, S=5, I=2 and followed by simple binary thresholding of every dimension attains, at an overall recall rate of 0.39%,5 a much higher precision (80.6%) than any method presented in the study of [1]. On the other hand, LSB performs much worse than the compared methods in situations where one wants to attain higher recall. This is most probably due to an almost complete lack of "training": with the exception of 1) TF-IDF weighting of the initial randomly-generated feature vectors, 2) the "reflection" procedure which aids us to characterize objects in terms of features and features in terms of objects, and 3) the determination of binary thresholds (i.e. medians), there is no kind of "learning" procedure involved. But it might be the case that the lack of any complex "deep learning" procedure shall prove itself to be a certain advantage. More concretely, one who uses LSB is not obliged to drastically reduce the number of features by means of which all objects are characterized. Thus, in the case of the introductory experiment presented in this paper, we have represented every text as a linear combination of vectors associated to 41782 features. We believe that it is indeed this ability to "exploit the minute details" (compare to the 2000 words with highest TF-IDF score used by [1]) which allows the method hereby introduced to attain higher precision in scenarios where just one relevant document is to be retrieved. It would, however, be unfair to state that LSB "performs better" than Semantic Hashing, because the purpose of SH was not to target the NN-search problem but to yield robust results in more exhaustive classification scenarios. Thus, comparison of LSB with other methods is needed. It might also be the case that more advanced tuning of RRI's parameters could improve the performance. Another possible direction of research is to study the impact of the strategies by means of which the initial random vectors are weighted. Due to the introductory nature of this paper, not much was unveiled about either of these two problems. Looking at Figure 1, one can, however, assert that LSB seems to attain better results when its RI component involves more than one iteration, i.e. when it is "reflective". In sum, we believe that the method hereby introduced is worth studying further. Not only because its dimensionality-reduction component – the RRI – is less costly and more open to the incremental addition of new data than, for example, LSA [10]. Not only because it is similar to LSH [11] in its ability to transform texts into hashes as concise as 16 ASCII characters and yet preserve the relations of similarity and difference held by the original texts. But also because the algorithm is easy to comprehend, simple to implement, and queries can be very fast to execute. That is why we label the method of binarization hereby presented as not only stochastic, but also light.

5 We precise that when we mention 42% recall with 100% precision, we speak about the NNS scenario where it is sufficient for a query to retrieve one relevant document. This scenario is documented on Fig. 2. On the other hand, when we mention the attained recall rate of 0.39%, we speak about the much more difficult "classification" scenario where a query, in order to attain maximal recall, must retrieve all >370 postings which belong to the same class. This scenario is documented on Figure 1.

References
1. R. Salakhutdinov & G. Hinton, "Semantic hashing", International Journal of Approximate Reasoning, vol. 50, no. 7, p. 969–978, 2009.
2. T. Cohen, R. Schvaneveldt & D. Widdows, "Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections", Journal of Biomedical Informatics, vol. 43, no. 2, p. 240–256, 2010.
3. A. El Ghali, D. Hromada & K. El Ghali, "Enrichir et raisonner sur des espaces sémantiques pour l'attribution de mots-clés", JEP-TALN-RECITAL 2012, p. 77, 2012.
4. M. Sahlgren & J. Karlgren, "Vector-based semantic analysis using random indexing for cross-lingual query expansion", in Evaluation of Cross-Language Information Retrieval Systems, 2002, p. 169–176.
5. M. Sahlgren & R. Cöster, "Using bag-of-concepts to improve the performance of support vector machines in text categorization", in Proceedings of the 20th International Conference on Computational Linguistics, 2004, p. 487.
6. M. Sahlgren, "An introduction to random indexing", in Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE, 2005, vol. 5.
7. C. D. Manning, P. Raghavan & H. Schütze, Introduction to Information Retrieval, vol. 1. Cambridge University Press, 2008.
8. D. D. Hromada, "Random Projection and Geometrization of String Distance Metrics", in Proceedings of the Student Research Workshop associated with RANLP, 2013, p. 79–85.
9. W. B. Johnson & J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space", Contemporary Mathematics, vol. 26, p. 189–206, 1984.
10. T. K. Landauer & S. T. Dumais, "A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge", Psychological Review, vol. 104, no. 2, p. 211–240, 1997.
11. A. Gionis, P. Indyk & R. Motwani, "Similarity search in high dimensions via hashing", in VLDB, 1999, vol. 99, p. 518–529.
12. M. S. Charikar, "Similarity estimation techniques from rounding algorithms", in Proceedings of the thirty-fourth annual ACM Symposium on Theory of Computing, 2002, p. 380–388.
13. A. Andoni & P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions", in Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, 2006, p. 459–468.
14. 20 newsgroups, http://qwone.com/~jason/20Newsgroups/

Slovak University of Technology in Bratislava
Faculty of Electrical Engineering and Information Technology
Department of Control and Cybernetics
&&
University Paris 8
École Doctorale Cognition, Langage, Interaction
Cognition Humaine et Artificielle

Thesis for rigorous examination
Daniel D. Hromada

Topic of doctoral Thesis: Evolutionary models of ontogeny of linguistic categories and rules
Advisors: doc. Ing. Ivan Sekaj, PhD. (ivan.sekaj @ stuba.sk); prof. Charles Tijus (tijus @ univ-paris8.fr)
Form of study: external and under double supervision
Study began: September 2010 at University Paris 8; September 2011 at Slovak University of Technology
Study program: Cybernetics (Slovak University of Technology); Psychology (University Paris 8)
Study discipline: 9.2.7 Cybernetics (Slovak University of Technology); Cognitive Psychology (University Paris 8)

Abstract

Language development is a process by means of which a human baby constructs an adequate competence to encode & decode meanings in the language of her parents.
Computationally, it can be described as a trinity of mutually interconnected problems: clustering of all tokens which the baby heard into 1) semantic and 2) grammatical categories; and 3) discovery of grammatical rules allowing the combination of members of diverse equivalence classes into syntactically correct and meaningful phrases. A theoretical, « psycholinguistic » claim of our Thesis is that, similarly to those theories which explain the emergence of cultural or creative thinking as the result of an evolutionary process occurring within an individual mind, the emergence of linguistic representations and faculties within a human individuum can also be considered as a case where the basic tenets of Universal Darwinism apply. The practical, « cybernetic » aim of the Thesis is to create computational models of concept learning, part-of-speech induction and grammar induction with performance comparable to existing models but based principally on evolutionary algorithms. It shall be argued that the « fitness function », which determines the « survival rate » of « candidate grammars » emerging and disappearing in the baby's mind, should be based upon the idea that the most fit is such a grammar G which « minimizes the distance » between the utterances successfully parsed from linguistic environment E by the application of grammar G and the utterances potentially generated by the grammar G.

Keywords: evolutionary computing, language acquisition, genetic epistemology, part-of-speech induction, grammar induction, optimal clustering, machine learning, concept construction, grammar systems, motherese, toddlerese

List of most important abbreviations
CC – Concept Construction
EA – Evolutionary Algorithms | EP – Evolutionary Programming
ES – Evolutionary Strategies
ET – Evolutionary Theory
GA – Genetic Algorithms
GE – Genetic Epistemology
GI – Grammar Induction | Grammar Inference
LA – Language Acquisition | LD – Language Development
NLP – Natural Language Processing
POS-i – Part-of-Speech Induction; POS-t – Part-of-Speech Tagging
UD – Universal Darwinism

Convention
« Italics » – citation
( x | y | z ) – disjuncted token – i.e. read token « (neural|quantum) darwinism » as « neural darwinism; quantum darwinism »

Table of Contents
0. Introduction
1. Universal Darwinism
1.1. Biological evolution
1.2. Evolutionary Psychology
1.3. Memetics
1.4. Evolutionary Epistemology
1.5. Individual Creativity
1.6. Genetic Epistemology
1.7. Evolutionary computation
1.7.1. Genetic algorithms & fitness landscapes
1.7.2. Evolutionary programming & evolutionary strategies
1.7.3. Genetic programming
1.7.4. Grammatical evolution
1.7.5. Tierra
2. Language development
2.1. Ontogeny of semantic categories (concepts)
2.2. Ontogeny of formal categories (parts-of-speech)
2.3. Ontogeny of grammars (grammar induction)
3. Computational Models of Text Processing
3.1. Concept construction
3.1.1. Non-evolutionary model of CC
3.1.2. An evolutionary model of CC
3.2. Part-of-speech induction and part-of-speech tagging
3.2.1. Non-evolutionary models of POS-i
3.2.2. Evolutionary models of POS-i & POS-t
3.3. Grammar induction
3.3.1. Non-evolutionary models of grammar induction
3.3.2. Evolutionary models of grammar induction
3.4. Evolutionary Language Game
4. Remark concerning the Theory of Grammar Systems
5. Conclusive remarks
6. Bibliography

0. Introduction

A general form of Evolutionary Theory (ET) postulates that entities evolve and adapt to their environment by a process of accumulation of information. Such a generalized theory – often referred to as « Universal Darwinism » – can be, and often is, applied in scientific disciplines as diverse as biology, linguistics or even anthropology and psychology.
Since the principal concepts and tenets of ET can easily be formalised into stochastic « evolutionary » algorithms, ET can yield not only a theoretical framework but also a computational experimental methodology for any scientific discipline whose basic concepts and principles can be « reduced » into an ET-consistent form. The aim of my doctoral Thesis is to demonstrate empirically – i.e. by means of computational experiments – that certain phaenomena observed by « developmental linguists » and « psycholinguists » can be explained in terms of the principles of Universal Darwinism and, as such, can be modelled by « computational linguists » and « Natural Language Processing (NLP) engineers » who shall found their computational models upon methods offered by the Evolutionary Computing paradigm. More concretely, I shall try to indicate that « evolutionary » optimization can be used to yield solutions to at least three problems of language development: 1) induction of semantic categories, i.e. construction of « concepts »; 2) the problem of induction of part-of-speech grammatical categories of words of natural languages, i.e. the problem of how equivalence classes like « nouns », « verbs », « adjectives » etc. are constructed by the language-acquiring agent; 3) the problem of grammar induction, i.e. the problem of how an agent can acquire a grammar from the corpus or its environment. It shall be indicated that the term « language-acquiring agent » could be interpreted both as an organic agent (e.g. a human baby) trying to learn the language of its environment (e.g. its parents) as well as a computational agent (e.g. a Turing Machine) inducing the structural properties of the language which generated the corpora with which the agent has been confronted. In other terms, it shall be indicated that ET is generalizable to such an extent that its correct implementation may allow two systems based upon the Darwinian principles « replicate, mutate, select » to converge to the same optimal or quasi-optimal categories regardless of whether the substrate by means of which they compute is organic or not. The first chapter will more closely present the above-mentioned basic principles of the universal ET doctrine and enumerate certain scientific disciplines for which ET furnishes a useful theoretical framework. Aside from biology, where the role of ET is evident, the discipline of « evolutionary psychology » shall be mentioned principally in order to alert the reader that our aims are not limited to those posited by evolutionary psychology. The « memetic theory », on the contrary, shall more precisely elucidate our ultimate aim, since it already introduces a novel level of representation, « a meme », supposed to be « the basic unit of imitation », and as such offers an interesting starting point for any Darwin-consistent theory of evolution of non-organic (e.g. cultural) structures and artefacts. It is, however, the constructionist « genetic epistemology » (GE) of Jean Piaget which shall resonate even more strongly with our aims – since what GE ultimately postulates is that the human psyche – with all its linguistic, moral, object-manipulating faculties – passes through a sequence of « stages ».
For it is our belief that such Piagetian « stages » can be explained, in computational terms, as « quasi-optimal attractors » within a very complicated « search space » of the agent's internal representations, and that a sort of evolutionary process occurs not only on a social-memetic level between agents imitating each other, but, in the first place, within the agent (him|her|it)self. This Thesis is our tentative to base this « learning = evolution » belief on the solid ground of complexity theory. The second chapter will address the topic of language development (LD). The topic is so vast and deep that only the most fundamental subproblems (i.e. vocabulary development, acquisition of part-of-speech categories and acquisition of grammars) shall be briefly described, and some basic notions like « variation set » or « motherese » will be introduced. We shall try to avoid the dispute between diverse linguistic doctrines and schools (e.g. nativists, cognitivists, comparativists); focus shall be put upon points of consensus supported by empiric evidence. While the goal of the first chapter is to furnish the theoretical framework and the goal of the second is to specify the problem, it is the third chapter which deals with the concrete computational tentatives to unify the two. A major part of the chapter shall deal with the question of evaluation of diverse inductive models. Some of the most successful computational models of part-of-speech induction (POS-i) and grammar induction (GI) shall be mentioned in order to pave the way for the evolutionary ones. As shall be indicated by this section, the tentatives to apply evolutionary algorithms (EAs) to solve the POS-i and GI problems are, regardless of the good results reported in the literature, very rare. In specific subsections of the chapter, we shall mention certain models, both psycholinguistic and computational, which justify our claim that the process of ontogeny of the linguistic faculty can be not only interpreted but also modelled as a process of evolutionary optimization of cognitive structures.

1. Universal Darwinism

Universal Darwinism (UD) is a scientific paradigm regrouping diverse scientific theories extending the Darwinian theory of evolution and natural selection (Darwin 1859) beyond the domain of biology. It can be understood as a generalized theoretical framework aiming to explain the emergence of many complex phenomena in terms of the interaction of three basic processes: 1) variation, 2) selection, 3) retention. According to the UD paradigm, the interaction of these three components yields a « universal algorithm valid not only in biology, but in all domains of knowledge where we can extract informational entities – replicators, which are able to reproduce themselves with variations and which are subjects to selection » (Kvasnička and Pospíchal 1999). This generic algorithm is nothing else than traditional Evolutionary Theory (ET) which, when considered as substrate-neutral, can be applied to such a vast number of scientific fields that it has been compared to a kind of « universal acid » which « eats through just about every traditional concept, and leaves in its wake a revolutionized world-view, with most of the old landmarks still recognizable, but transformed in fundamental ways » (Dennett 1996).
As of 2013, the existing scientific disciplines which could be labeled as UD-consistent include: biology; evolutionary (art | psychology | music | linguistics | ethics | economics | anthropology | epistemology | computation); sociobiology; memetics; (quantum | neural | psycho) darwinism; artificial life and many others. In regards to the overall aim of our Thesis, some of the most relevant instances of UD are described in the following sub-sections.

1.1. Biological evolution

Evolutionary Theory was born when young Charles Darwin realised that the « gradation and diversity of structure » (Darwin 1906) which he had encountered among the mockingbirds of the Galapagos islands could be explained by the natural tendency of species to « adapt to a changing world ». In parallel to Darwin's work, which was gradually clarifying the terms of variability and its close relation to environment-originated selective pressures, Gregor Mendel was assessing the statistical distributions of colours of flowers of his garden peas in Brno in order to finally converge to the fundamental principles of heredity. But it was only in 1953 that the double-helix structure of the material substrate of heredity of biological species – the DNA molecule – was described in the (Watson & Crick, 1953) paper. In simple terms: in the DNA molecule, information is encoded as a sequence of nucleotides. Every nucleotide can contain one of four nucleobases; it thus ideally carries 2 bits of information. A continuous sequence of three nucleotides gives a « triplet » which, when interpreted by the intracellular « ribosome » machinery, can be « translated » into an amino acid. Sequences of amino acids yield proteins which interact with one another in biochemical cascades. The result is a living organism with its particular phenotype aiming to reproduce its genetic code. If, at a given time T, there are two organisms A and B whose genetic codes differ to such an extent that their phenotypes differ, and if ever the phenotype of organism A augments the probability of A's survival and reproduction in the external world W while B's phenotype diminishes such probability, we say that A is better adapted to world W than B, or, more formally, that fitness(A) > fitness(B). Evolutionary Theory postulates that in case there is a lack of resources in world W, descendants of the organism B shall gradually, after multiple generations, be substituted by descendants of the more fit organism A. This is so because during every act of reproduction, the material reason for having a more fit phenotype – the DNA molecule – is transferred from parent to offspring, and the whole process is cumulative across generations. It can, however, happen that the world W changes. Or a random (stochastic) event – a gamma ray, the presence of a free radical – can occur which would tamper with A's genetic code. Such an event – called « mutation » – shall result, in the majority of cases, in a decrease of A's fitness. Rarely, however, can mutations also increase it. Another event which can transform the genetic sequence is called « crossover ». It can be formalised as an operator which substitutes one part of the genetic code of organism A with the corresponding sequence of organism B, and vice versa, the part of B with the corresponding part of A. It is indeed especially the crossover operation, first described in the article of T.H. Morgan (Morgan 1916), which is responsible for the « mixing of properties » in the case of a child organism issued from two parent organisms.
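In computational terms, the crossover operator just described admits a very compact formulation; the sketch below shows an (illustrative) single-point variant exchanging the tails of two parent « genomes »:

import random

def one_point_crossover(parent_a, parent_b):
    # Swap everything behind a randomly chosen cut point.
    cut = random.randint(1, len(parent_a) - 1)
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

print(one_point_crossover("AAAAAA", "BBBBBB"))  # e.g. ('AABBBB', 'BBAAAA')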
In more concrete terms: the genetic code of such « diploid » organisms is always stored in X pairs of chromosomes. Each chromosome in the pair is issued from either the father or the mother organism which, during the process of meiosis, divide their normally diploid cells into haploid gamete cells (i.e. sperms in the case of the father and eggs in the case of the mother). It is especially during the first meiotic phase that crossover occurs, the content of the DNA sequences of two grand-parents being mixed and mapped during the crossover operation into the chromosome contained in the gamete which, if lucky, shall fuse with the gamete of another parent in the act of fecundation.

Figure 1: Two types of crossover operation. Figures reproduced from (Morgan, 1916).

The resulting « zygote » is again diploid and contains a mix of fragments of the genetic code originally present in the cells of all four grand-parents of the nascent organism. The zygote subsequently exponentially divides into a growing number of cells which differentiate from each other according to instructions contained in the genetic code, triggered by biochemical signals coming from the cell's both internal and external environment. If the genetic code endows the organism with such properties as allow it to survive in its environment until its own reproduction, approximately half of the genetic information contained in its DNA shall be transferred to the offspring organism. If not, the information as such shall disappear from the population due to its incompatibility with the environment.

1.2. Evolutionary Psychology

It was already Darwin who posited that ET shall have a profound impact upon psychology: « In the distant future I see open fields for far more important researches. Psychology will be based on a new foundation, that of the necessary acquirement of each mental power and capacity by gradation. » (Darwin 1859) While two possible interpretations of this Darwin's idea exist, Evolutionary Psychology (Ev.Psych.) focuses only on the first one. It aims to explain diverse faculties of the human soul & mind in terms of the selective pressures which moulded the modular architecture of the human brain during millions of years of its phylogenetic history. Its central premises state: « The brain's adaptive mechanisms were shaped by natural and sexual selection. Different neural mechanisms are specialized for solving problems in humanity's evolutionary past » (Cosmides and Tooby 1997). In more concrete terms, Evolutionary Psychology quite successfully explains phaenomena as diverse as the emergence of cooperation and altruistic behaviour (Hamilton 1963); male promiscuity and parental investment (Trivers 1972); or even the obesity of the current anglo-saxon population (Barrett 2007). All this and much more is explained as a result of the adaptation of homo sapiens sapiens (and all its biological ancestors) to the dynamism of its ever-changing ecological and social niche. Thus, in the long run, Ev.Psych. tends to explain and integrate all innate faculties of the human mind in the evolutionary framework. The problem with Ev.Psych., however, is that in its grandiose aim to « assemble out of the disjointed, fragmentary, and mutually contradictory human disciplines a single, logically integrated research framework for the psychological, social, and behavioral sciences » (Cosmides and Tooby 1997), it can sometimes happen that Ev.Psych. posits as innate, and thus explainable in terms of biological natural selection, cognitive faculties which are not innate but acquired.
Thus it may be more often than rarely the case that whenever it comes to the famous nature vs. nurture (Galton 1875) controversy, evolutionary psychologists tend to defend the nativist cause even there where doing so means committing an epistemological fallacy.1 And what makes things even worse for the discipline of Evolutionary Psychology as it is currently performed is that the aforementioned Darwin's prediction has, aside from the nativist & biological one, also another interpretation. Id est, when Darwin spoke about mental powers and capacities acquired by gradation, one cannot exclude that he was speaking not only about gradation in phylogeny, but also in ontogeny.

1.3. Memetics

The theory of memes, or memetics, is, in a certain sense, a counter-reaction to Evolutionary Psychology's aims to explain human mental and cognitive faculties in terms of innate propensities. Similarly to Ev.Psych., memetics is also issued from the discipline of sociobiology, which was supposed to be « the extension of population biology and evolutionary theory to social organization » (Wilson 1978). But differently from both Ev.Psych. and sociobiology, memetics does not aim to explain diverse (cultur|psychologic|soci)al phenomena solely in terms of evolution operating upon biochemical genes, but also in terms of evolution being realised on the plane of more abstract information-carrying replicators called « memes » (Dawkins 2006). The basic definition of the classical memetic theory is: « a meme is a replicator which replicates from brain to brain by means of imitation » (Blackmore 2000). These replicators are somehow represented in the host brain as some kind of « cognitive structure » and, if ever externalised by the host organism – no matter whether in the form of a word, song, behavioral schema or an artefact – they can get copied into another host organism endowed with the device to integrate such structures.2 Similarly to genes, which often network themselves into mutually supporting auto-catalytic networks (Kauffman 1996), memes can also form more complex memetic complexes, « memplexes », in order to augment the probability of their survival in time. Memes can thus do informational crossovers with one another (syncretic religions, new recipes from old ingredients or DJ mixes can be nice examples of such memetic crossover) or they can simply mutate, either because of the noise present during the imitation (replication) process, or due to other entropy-related decay-like factors related to the ways in which active memes are ultimately stored in brains or other information-processing devices. Memetic theory postulates that the cumulative evolutionary process applied upon such information-carrying structures shall ultimately lead to the emergence of such complex phaenomena as culture, religion or language.

1 If ever we accept the notion of falsifiability as an important criterion of acceptance or rejection of a scientific hypothesis (Popper et al. 1972), many hypotheses issued from Ev.Psych. would have to be rejected because, being based in the distant past which is almost impossible to access, they are less falsifiable than hypotheses explaining the same phaenomena in terms of empiric data observable in the present.
2 In neurobiological terms, the faculty to imitate and hence to integrate memes from the external environment is often associated with so-called « mirror neurons » (Rizzolatti and Craighero 2004).

1.4. Evolutionary Epistemology

Epistemology is a philosophical discipline concerned with the source, nature, scope,
existence and diversity of forms of knowledge. Evolutionary epistemology (EE) is a paradigm which aims to explain these by applying the evolutionary framework. But under the one EE label, at least two distinct topics are, in fact, addressed: 1) EE1, which aims to explain the biological evolution of cognitive and mental faculties in humans and animals; 2) EE2, which postulates that knowledge itself evolves by selection and variation. EE1 can thus be considered a sub-discipline of Ev.Psych. and, as such, is subject to the Ev.Psych.-directed criticism presented on the previous page. EE2, however, is closer to memetics, since it postulates the existence of a second replicator, i.e. of an information-carrying structure which is not materially encoded by a DNA molecule. The distinction between EE1 and EE2 can also be characterised in terms of « phylogeny » and « ontogeny ». Given the definition of phylogeny as the « processus which shapes the form of species » and contrasting it with ontogeny defined as the « processus shaping the form of the individual », we find it important to reiterate that while EE1 is more concerned with knowledge as a result of the phylogenetic moulding of DNA, EE2 points more in the direction of « ontogeny ». In fact, EE2 paves the way for at least two further sub-interpretations: EE2-1, knowledge can emerge by variation & selection of ideas shared by a group of mutually interacting individuals (Popper 1972); EE2-2, knowledge can emerge by variation & selection of cognitive structures within one individuum. It is worth noting that while the so-called recapitulation theory stating that « ontogeny recapitulates phylogeny » (Haeckel 1879) is considered to be discredited by many biologists and embryologists, it is still held as valid by many researchers in human and cognitive sciences observing a « strong parallelism between cognitive development of a child and … stages suggested in the archeological record » (Foster 2002), 100 years after one of Darwin's companions noted: « Education is a repetition of civilization in little » (Spencer 1894).

1.5. Individual Creativity

In fact, evolutionary epistemology was born with the tentative of D.T. Campbell to explain both creative thinking and scientific discovery in terms of « blind variation and selective retention » of thoughts (Campbell 1960). Departing from the introspective works of the mathematician Henri Poincaré, who stated: « To create consists precisely in not making useless combinations and in making those which are useful and which are only a small minority. Invention is discernment, choice... Among chosen combinations the most fertile will often be those formed of elements drawn from domains which are far apart... What is the cause that, among the thousand products of our unconscious activity, some are called to pass the threshold, while others remain below? » (Poincaré 1908), Campbell suggests that what we call creative thought can be described as a Darwinian process whereby previously acquired knowledge blindly varies in the unconscious mind of the creative thinker and only some such structures are subsequently selectively retained. As (Simonton 1999) puts it: « How do human beings create variations? One perfectly good Darwinian explanation would be that the variations themselves arise from a cognitive variation-selection process that occurs within the individual brain. »

1.6. Genetic Epistemology

« The fundamental hypothesis of genetic epistemology is that there is a parallelism between the progress made in ...
organization of knowledge and the corresponding formative psychological processes. Well, now, if that is our hypothesis, what will be our field of study? Of course the most fruitful, most obvious field of study would be reconstituting human history: the history of human thinking in prehistoric man. Unfortunately, we are not very well informed about the psychology of Neanderthal man or about the psychology of Homo siniensis of Teilhard de Chardin. Since this field of biogenesis is not available to us, we shall do as biologists do and turn to ontogenesis. Nothing could be more accessible to study than the ontogenesis of these notions. There are children all around us. » (Piaget 1974) Strictly speaking, Piaget's developmental theory of knowledge, which he himself called Genetic Epistemology (GE), may seem to be utterly non-Darwinian. In fact, it is not even concerned with biochemical genes: Piaget uses the term « genetic » to refer to a more general notion of « heredity » defined as a structure's tendency to guard its identity through time. The basic structural primitives of the Piagetian theory are behavioral « schemas » which can be defined as « a basic set of experiences and knowledge that has been gained through personal experiences that define how things should be and act in the person's environment. As the child interacts with their world and acquires more experiences these schemes are modified to make sense, or used to make sense of the new experience » (Bee and Boyd 2003). There are two ways in which such schemas can be modified. Either they « assimilate » data from the external environment. Or, if ever such assimilation is not possible because the child's cognitive system simply cannot match the perceived external datum with an internal pre-existing category, the process of « accommodation » takes place, which transforms the internal category to match the external datum. Ultimately, the set of schemas gets so out-dated or so altered by past modifications that it is not useful anymore. Whenever such an « equilibration » occurs, the old set of schemas is rejected and the child tends to « start fresh with a more up-to-date model » (Bee and Boyd 2003), thus attaining a new substage or stage of its development. In the Piagetian system – which is based on very precise yet exhaustive observations of dozens of children, including his own – the order of stages is fixed, and it is very difficult, or even fully impossible, for the evolving psyche to attain the pre-operational stage 2 or the concrete operational stage 3 if it has not mastered all that is to master during the sensorimotor stage 1. Given the fact that the GE paradigm involves:
• heredity – schemas are structures which tend to keep their identity in time
• variation – schemas are altered by the environment-driven assimilation or accommodation 3
• selective pressures – only those schemas which are best adapted to the environment and/or form the most functionally fit complexes with other schemas shall pass through the equilibration milestone
it can be briefly stated that Piaget's GE could be aligned with ET and UD. And what is more, it may be the case that the notion of Piagetian stages is consistent with the notion of attractors or locally optimal states whose emergence is, according to complex systems theory (Kauffman 1996; Flake 1999), inevitable in a system as complex as a child's psyche definitely is.

1.7. Evolutionary computation

We have already mentioned (c.f. 1.1.) that evolution, as defined within UD, can be thought of as a universal, generic algorithm.
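Stated as a sketch in Python, this « universal algorithm » of variation, selection and retention could look roughly as follows; fitness, variation and the retained fraction are deliberately left abstract, since it is precisely the concrete choices of these components which distinguish the methods presented in the following sub-sections. All names and parameters are illustrative.

import random

def evolve(population, fitness, vary, generations=100, retain=0.5):
    # Generic variation-selection-retention loop of Universal Darwinism.
    for _ in range(generations):
        # Selection: keep the most fit fraction of the population.
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: int(len(ranked) * retain)]
        # Retention + variation: survivors replicate with modification.
        population = survivors + [vary(random.choice(survivors))
                                  for _ in range(len(population) - len(survivors))]
    return max(population, key=fitness)

# Toy usage: evolve bit-strings towards all-ones.
best = evolve([[random.random() < 0.5 for _ in range(20)] for _ in range(30)],
              fitness=sum,
              vary=lambda g: [not b if random.random() < 0.05 else b for b in g])
print(sum(best), "ones out of 20")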
Not only can « evolutionary theory » serve us to explain diverse phenomena around us, it can also be exploited for finding solutions to diverse problems. Thus it is of no surprise that many researchers in informatics realized not only that the evolutionary process can be encoded as an informatic algorithm, but also that such algorithms could be useful as heuristics which could potentially lead to the discovery of useful (quasi-)optimal solutions to a wide range of diverse problems. First explorations in the domain were done by Rechenberg's « evolutionary strategies » (Rechenberg 1973) and Holland's « genetic algorithms » (Holland 1975) which, along with « evolutionary programming » (Fogel et al. 1966) and « genetic programming », form the « evolutionary computation » subdiscipline of computer science. All four approaches differ from classical optimization methods in the following aspects:
1. using a population of potential solutions in their search
2. using explicit « fitness » instead of function derivatives
3. « using probabilistic, rather than deterministic, transition rules » (Kennedy et al. 2001)

Figure 2: Basic genetic algorithm schema. Reproduced from (Pohlheim 1996).

3 Note that in terms of the theory of evolutionary computation, one can relate the Piagetian notion of assimilation to an operator of local variation which attracts the cognitive system to a locally optimal agreement with its environment, while accommodation suggests an interpretation in terms of more global variation operators (like cross-over), which could potentially allow the cognitive system to reach a state of global equilibrium in regards to its environment.

1.7.1. Genetic algorithms & fitness landscapes

The basic principle of « genetic algorithms » is illustrated on Figure 2. The core component of every genetic algorithm is the objective « fitness function » able to attribute a cardinal value or ordinal rank to any individuum in the population of potential solutions. In other terms, the fitness function yields the criterion according to which one candidate individuum is evaluated as a « more fit » solution, in regards to the problem under study, than other potential solutions present in the population. The population is the set of individual solutions. Every individual solution is encoded as a vector of values (also called « chromosome » or « genome ») which can vary in time. The designer's choice related to the way the problem solutions are encoded in chromosomal vectors, e.g. the type (Boolean? Integer? Float? Set?) of the different elements of the vector, is also a crucial one and can often determine whether the algorithm shall succeed or fail. In every generation – i.e. in every iteration of the algorithmic cycle represented by the circle on Figure 2 – all N individuals in the population are evaluated by the fitness function. Every individual thus obtains its « fitness » value, which subsequently governs the « selection » procedure choosing a subset of individuals from the current generation as those whose genetic information shall reproduce into next generations. In our Thesis we plan to exploit especially the « fitness proportionate selection » as the selection operator. This operator, also called the « roulette wheel » operator, transforms the fitness f_i of individual i into the probability p_i of its survival by means of the formula:

p_i = f_i / Σ_{j=1}^{N} f_j

where N is the number of individuals in the population.
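A minimal sketch of this fitness-proportionate operator, directly implementing the formula above (it presupposes non-negative fitness values; the population and fitnesses below are toy data):

import random

def roulette_select(population, fitnesses, k):
    # Sample k individuals with probability p_i = f_i / sum_j f_j.
    total = sum(fitnesses)
    probs = [f / total for f in fitnesses]
    return random.choices(population, weights=probs, k=k)

# Individual 'c' (fitness 5) is five times more likely to be drawn than 'a'.
print(roulette_select(['a', 'b', 'c'], [1, 4, 5], k=2))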
Once the « most fit » candidates are selected by the selection operator, they are subsequently mutually recombined by means of « crossover » operators and/or modified by means of « mutation » operators. Many different types of selection, mutation and crossover operators exist; for their overview c.f. (Sekaj 2005). For the purpose of this work, let's just note that the probabilities of occurrence of mutation or crossover have to be fairly low, otherwise no fitness-increasing information could be transferred among generations and the whole system will tend to present non-converging chaotic behaviour (Nowak et al. 1999). Another useful strategy, which guarantees that the maximal fitness shall either increase or at least stay constant, is called elitism. In order to implement this strategy, one simply guards one (or more) individual(s) with the highest fitness unchanged for the next generation, thus protecting « the best ones » from variations which would, most probably, decrease rather than increase the fitness.4 Yet another widely used approach reinforces the selection pressure by removal of the weakest individuals. Both the elitist « survival of the fittest » and the contrary « removal of the weakest » are often combined. The selection of the most fit individuals from the old generation, their subsequent replication and/or recombination and diversification, yields a new generation. Because individuals with lower fitness have been either completely or at least partially discarded by the selection process, one can expect that the overall fitness of the new generation shall be higher than the fitness of the old generation. With a little bit of luck, one can also hope that the most fit individuals of the new generation shall be a little more fit than the most fit individuals discovered in the previous generation – this can happen if ever a « benign » mutation has occurred, i.e. a modification which has moved the individual from a lower point on the « fitness landscape » to a somewhat higher state. The notion of fitness landscape, first introduced in (Wright 1932), is a metaphor useful for the understanding & explanation of diverse evolutionary phenomena. The landscape is depicted as a mountain range with peaks of varying height. The height at any point on the landscape corresponds to its fitness value; i.e. the higher the point, the greater the fitness of an individual represented by the given point of the landscape. In such a representation, the evolution of the organism to more and more « fit » forms can be depicted as a movement up-hill, towards the closest peak (i.e. a local optimum) or towards the highest peak of the whole landscape (i.e. the global optimum). Figure 3 illustrates a fitness landscape of a very simple organism with only one gene (whose potential values are encoded by the illustration's X axis).

Figure 3: Possible fitness landscape for a problem with only one variable. Horizontal axis represents the gene's value, vertical axis represents fitness.

Every arrow on the figure represents one possible individual. Its length represents the variation which can be brought in by the mutation operator. The fact that individuals always tend to move « upwards » indicates that selection pressures are involved.

4 Note that in nature, elitism is often, but not always, the case. For it can happen that, due to stochastic factors, the most fit individuals die before they succeed to reproduce themselves.
It has to be added that without the implementation of the crossover operator, the globally optimal state (encoded by point C) could not be attained by individuals who haven't originated at the slopes of C. Only some sort of crossover operator could ensure that individuals who attained the local optima (encoded by peaks A, B, D) could be mutually recombined (for example B with D) in a way that shall allow them to leave the locally stable states and approach the globally optimal C. The fact that genetic algorithms, thanks to « crossover » operators, can combine two individuals from diverse sectors of the fitness landscape allows them to find solutions to problems where heuristics based on « gradient descent » would fail.

1.7.2. Evolutionary programming & evolutionary strategies

Evolutionary programming (EP) and evolutionary strategies (ES) are methods whose overall essence is very similar to GAs. There are, however, some subtle differences among the approaches. In EP, mutation is the principal and often the only variation operator. While recombination is rarely used, « operators are freely adapted to fit the problem at hand » (Kennedy et al. 2001). EP algorithms often double the size of the population by mixing children with parents and then halving the population by selection. The tournament selection operator is often used. Another difference is that while GAs were developed in order to optimize the numeric parameters of the mathematical function under study – and variation thus directly modifies the genotype – in EP one mutates the genotype but evaluates the fitness according to the phenotype. EP is thus often used for the construction & optimization of structures like « finite state automata » (Fogel et al. 1966). A self-adaptation approach (Bentley 1999) allowing for the mutation of the parameters of the evolution itself – e.g. the mutation rate – is also frequently used. Such an approach of « evolving the evolution » is also used in ES, which were developed – in parallel with, but independently of, Holland's GAs – by (Rechenberg 1973). The biggest difference between EP and ES is thus the fact that ES often recombines its individuals before mutating them. A popular and well-performing strategy thus seems to be:
1. Initialize the population.
2. Perform recombination using P parents to form C children.5
3. Perform mutation on all children.
4. Evaluate the children population and select P members from it.
5. If the termination criterion is not met, go to step 2; terminate otherwise.
Given the fact that in our Thesis we shall often 1) encode the problem of linguistic category induction by non-numeric chromosomes and 2) evaluate the fitness of individuals by means of additional « phenotypic algorithms », we consider the works of Fogel & Rechenberg to be of particular importance for our study.

5 A frequently used C/P ratio is 7.

1.7.3. Genetic programming

Contrary to GAs, E.Prog and E.Strat, which operate upon chromosomes (vectors) of fixed length of numeric/boolean/character values, individuals evolved by means of Genetic Programming (GP) encode programs of arbitrary length and complexity. In other terms, one may state that while the above-mentioned EC methods look for the most optimal solution of a given problem, GP tends to produce a hierarchical tree structure encoding a sequence of instructions (i.e. a program) able to yield optimal solutions to a whole range of problems. Simply said: GP is a way in which computer programs can automatically « discover » new and useful programs.
1.7.3. Genetic programming

Contrary to GAs, E.Prog and E.Strat, which operate upon chromosomes (vectors) of fixed length composed of numeric, boolean or character values, the individuals evolved by means of Genetic Programming (GP) encode programs of arbitrary length and complexity. In other terms, one may state that while the above-mentioned EC methods look for the most optimal solution of a given problem, GP tends to produce a hierarchical tree structure encoding a sequence of instructions (i.e. a program) able to yield optimal solutions to a whole range of problems. Simply said: GP is a way by which computer programs can automatically « discover » new and useful programs.

The most important thing to do in order to prepare a GP framework is to specify how the resulting individuals (programs) shall be encoded. The original choice of the founder of the discipline, John Koza, was to encode all individuals as trees of LISP S-expressions composed of sub-trees which are themselves also LISP S-expressions. Within such arborescent S-expressions, the terminal nodes (i.e. the leaf nodes where the branches end) represent the program's variables and constants, while the non-terminal nodes (i.e. internal tree points) represent diverse functions contained in the function set (e.g. arithmetic functions like +, -, *, /; mathematical functions like log or cos; boolean functions like AND, OR, NOT; conditional operators like if/else, etc.).

Figure 4 illustrates how, during the initial run of the algorithm, an individual – calculating, for example, the square root of x+5 – could be randomly generated by the following procedure:

1) The « root » of the program tree is randomly chosen from the function set: it is the function sqrt.
2) The function sqrt has only one argument (arity(sqrt)=1); it therefore takes its sole input from the randomly determined functor + (addition).
3) The functor + takes two inputs (arity(+)=2); the tree therefore bifurcates into two branches at this node. As its first argument, the constant 5 is randomly chosen; as its second, the variable x.

Figure 4: Sequence of steps constructing the program sqrt(x+5).

Note that in step 3, both arguments were chosen from the terminal set. Had they been chosen from the function set, the tree would have bifurcated further. In order to prevent such growth of trees ad infinitum, a limiting « maximal tree depth » parameter is more often than not implemented in GP scenarios.

Once such a program has been generated, one can evaluate its fitness by confronting it with diverse input arguments and comparing its output with a golden standard. Such random-program generation & evaluation is repeated for all N initial candidate programs; subsequently, the most fit individuals are selected and varied. While GP's selection techniques can sometimes closely resemble the selection techniques used in GAs, the variation operators are often of an essentially different nature. This is so because in GP it is not individual genomes or their linear sequences that are mutated or crossed over, but rather complex and hierarchical networks of expressions. In the case of crossover, for example, one exchanges a whole sub-tree encoded within one individual for a sub-tree encoded within another one.

GP-based solutions cannot be expected to function correctly if they do not satisfy the theoretical properties of closure and sufficiency. In order to fulfill the closure condition, each function from the non-terminal set must be able to successfully operate both on the output of any function in the non-terminal set and on any value obtainable by a member of the terminal set. The behaviour of even some simple operators thus has to be adjusted a priori (e.g. return 1 in the case of division by zero) in order to assure the correct functioning of the resulting program. The sufficiency property, on the other hand, demands that the set of functors and terminals be sufficiently exhaustive; otherwise the solution cannot be found. One cannot, for example, hope to discover an equation generating the Mandelbrot set if the initial set of terminals does not contain the notion of an imaginary number and the function set contains no other explicit or implicit reference to the notion of the complex plane. Thus, while the closure constraint delimits the upper bound beyond which the discovery of the solution is not feasible, the sufficiency constraint delimits the lower bound of the minimal set of « initial components » which have to be defined a priori so that the discovery of the adequate program is at least theoretically possible.

Other theoretical notions, as well as diverse subtleties of practical implementation (special operators, methods of distributing the initial population in the search space, fitness function proposals, domains of application, etc.), are to be found in what is possibly the most important GP monograph (Koza 1992).
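To make these notions concrete, here is a minimal sketch – not Koza's implementation – of random tree generation and evaluation, with a protected division illustrating the closure constraint; the function set, the depth limit and the golden standard sqrt(x+5) are illustrative assumptions:

import math, random

def pdiv(a, b):                        # protected division: closure for division by zero
    return a / b if b != 0 else 1.0

FUNCTIONS = {'+': (2, lambda a, b: a + b), '-': (2, lambda a, b: a - b),
             '*': (2, lambda a, b: a * b), '/': (2, pdiv),
             'sqrt': (1, lambda a: math.sqrt(abs(a)))}   # abs() keeps sqrt closed over negatives
TERMINALS = ['x', 5.0]
MAX_DEPTH = 4                          # the limiting « maximal tree depth » parameter

def grow(depth=0):                     # random program tree encoded as nested tuples
    if depth >= MAX_DEPTH or (depth > 0 and random.random() < 0.3):
        return random.choice(TERMINALS)
    name = random.choice(list(FUNCTIONS))
    arity, _ = FUNCTIONS[name]
    return (name,) + tuple(grow(depth + 1) for _ in range(arity))

def evaluate(node, x):
    if node == 'x':
        return x
    if isinstance(node, float):
        return node
    name, *args = node
    return FUNCTIONS[name][1](*(evaluate(a, x) for a in args))

def fitness(tree):                     # error against the golden standard sqrt(x + 5)
    return -sum(abs(evaluate(tree, x) - math.sqrt(x + 5)) for x in range(20))

best = max((grow() for _ in range(500)), key=fitness)
print(best, fitness(best))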
1.7.4. Grammatical evolution

Grammatical Evolution (Gr.Ev) is a variant of GP in the sense that it also uses evolutionary computing in order to automatically generate computer programs. The most important difference between Gr.Ev and GP is that while GP operates directly upon phenotypic trees representing the program's code itself (for example, in the form of LISP expressions), Gr.Ev uses the evolutionary machinery for the purpose of generating grammars, which subsequently generate the program code.

In Formal Language Theory, a grammar is represented by the tuple {N, T, P, S}, where N denotes the set of non-terminals, T the set of terminals, S is the starting symbol which is a member of N, and P denotes the set of production rules that substitute elements of N by elements of N, T or their combinations.6

6 This is the case for so-called context-free and context-sensitive grammars.

Consider a grammar exhaustive enough to encode programs able to perform an arbitrary number of additions or subtractions of two variables:

N = {expr, op, var}
T = {+, -, x, y}
S = <expr>
P = {
  <expr> → <expr> <op> <expr> | <var>
  <op> → + | -
  <var> → x | y
}

Such a grammar contains three non-terminals: the non-terminal <op>, which can be substituted by either the terminal + or the terminal -; the non-terminal <var>, which can be substituted by either the terminal x or the terminal y; and the non-terminal <expr>, which can be substituted either by the non-terminal <var> or by the sequence of non-terminals <expr> <op> <expr>. The fact that in this last production the non-terminal <expr> is present on both the left and the right side of the substitution rule gives this grammar the possibility to recursively generate an infinite number of expressions like:

x+x, x+y, y+x, y+y, x-x, x-y, y-y, y-x, x+x+x, x+x-x, x+x+y, x+x-y, x-x+y-y, x-x-y+y+x, y+y+x+x+y-x, etc.

Thus, even a very simple grammar with only four terminal symbols and three non-terminal symbols, to each of which only two production rules are associated, can theoretically produce an infinite number of distinct individual programs able to perform basic arithmetic operations with two variables. The generation of a given resulting expression is determined by the order of application of specific production rules, starting with the symbol S. Such a sequence of applications of production rules is called a derivation.
For example, in order to derive the individual « x+x », one has to apply the production rules in the following order:

S = <expr>
<expr> ::= <expr> <op> <expr>
<expr> ::= <var>
<var> ::= x
<op> ::= +
<expr> ::= <var>
<var> ::= x

while the individual « y-x » would be generated if the starting symbol S were expanded by the following sequence of production rules:

S = <expr>
<expr> ::= <expr> <op> <expr>
<expr> ::= <var>
<var> ::= y
<op> ::= -
<expr> ::= <var>
<var> ::= x

In Grammatical Evolution, it is this « order of application of production rules » which is encoded in the individual chromosome. In other terms, individual chromosomes encode when and where distinct production rules shall be applied. Figure 5 more closely illustrates, and puts into analogy with biological systems, the sequence of transformations which every binary chromosome undergoes during the process of unfolding into a fully functional program.

Figure 5: Sequence of transformations from genotype to phenotype in both Gr.Ev and biological systems. Figure reproduced from (O'Neill & Ryan 2003).

It can easily be inferred from the above-displayed schema that the approach of Gr.Ev is quite intricate and involves multiple steps of information processing. The whole process starts with a binary chromosome which is subsequently split into 8-bit codons, each yielding an integer specifying which production rule to use at a given moment of the program's generation. On many different layers, the « generation » process as implemented in Gr.Ev introduces and implements very original ideas, such as:

1. « Degenerate genetic code » – similarly to « nature's choice » to encode one amino acid by means of many different triplets, the application of a unique production rule can be encoded by more than one codon.
2. « Wrapping » – under certain conditions, the whole genome can be « traversed » more than once during the process of phenotypic expression. A specific codon can thus be used more than once during the compilation of a single individual.

The rationale for the usage of such « biologically inspired tricks » is more closely presented in the work of the founders of the Grammatical Evolution field (O'Neill & Ryan 2003). They claim that the focus on the genotype-phenotype distinction, especially in combination with the implementation of the « degenerate code » and « wrapping » notions, can result in a compression of representation (and a subsequent reduction of the size of the program search space) and can account for phenomena like « neutral mutation », well observed in biological systems, whereby a mutation occurs in the genotype but does not have any effect upon the resulting phenotype.

Another important advantage mentioned by O'Neill and Ryan is that the Gr.Ev approach makes it very easy to generate programs in any arbitrary language. This is due to the versatility and generality of the notion of « grammar ».
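The following minimal sketch illustrates this genotype-to-phenotype mapping on the toy grammar of this subsection: each codon, taken modulo the number of applicable productions, selects which rule to apply to the leftmost non-terminal, and the genome may « wrap ». The integer codons and the wrapping limit are illustrative assumptions, not the exact parameters of (O'Neill & Ryan 2003):

GRAMMAR = {                                    # the toy grammar of section 1.7.4
    '<expr>': [['<expr>', '<op>', '<expr>'], ['<var>']],
    '<op>':   [['+'], ['-']],
    '<var>':  [['x'], ['y']],
}

def ge_map(codons, start='<expr>', max_wraps=2):
    # Repeatedly expand the leftmost non-terminal; each codon, taken modulo
    # the number of applicable productions, selects the rule to apply.
    seq, c = [start], 0
    budget = len(codons) * (max_wraps + 1)     # « wrapping »: the genome may be traversed again
    while budget > 0 and any(s in GRAMMAR for s in seq):
        pos = next(i for i, s in enumerate(seq) if s in GRAMMAR)
        rules = GRAMMAR[seq[pos]]
        seq[pos:pos + 1] = rules[codons[c % len(codons)] % len(rules)]
        c += 1
        budget -= 1
    return ''.join(seq) if not any(s in GRAMMAR for s in seq) else None

print(ge_map([0, 1, 0, 0, 1, 0]))              # -> 'x+x'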
When compared with the traditional GP technique, Gr.Ev was outperformed in a scenario where one had to find solutions to the problem of symbolic regression. But in more complex scenarios like « symbolic integration », the « Santa Fe ant trail », or a scenario where one had to discover the most precise « caching algorithm », Gr.Ev significantly outperformed GP. The seminal work of (O'Neill & Ryan 2003) also presents some other interesting examples of practical applications of Gr.Ev, for example in the domain of financial market prediction.

We note that while in many points (« grammar », « evolution ») the work of O'Neill and Ryan significantly overlaps with ours, their aims differ considerably from those that shall be presented in our Thesis. More concretely, while Gr.Ev tends to offer a very general toolbox for generating useful computer programs, written in an arbitrary programming language and used for solving arbitrary problems, our Thesis shall deploy the evolutionary computation machinery to shed some light upon diverse facets of one sole problem: that of « Natural Language Development ». Another important difference between the approach of Gr.Ev and the one we shall present in our Thesis is that while in Gr.Ev grammars are considered to be « generative devices », i.e. tools used for the generation of programs, in our Thesis we shall use them both as « generative » and as « parsing » devices. Another, even more fundamental difference is due to the fact that while « At the heart of GE lies the fact that genes are only used to determine which rule is applied when, not what the rules are » (O'Neill and Ryan 2003), the evolutionary model of language induction proposed in our Thesis shall aim to determine not only the order of application of the rules, but also the content of the rules themselves.

1.7.5. Tierra

Another example of how one can materialise evolutionary principles within an in silico framework is offered by Tierra, an artificial life simulation environment programmed between 1990 and 2001 by Thomas S. Ray and his colleagues. Since Ray is an ecologist, his objective was not to develop an EC-like model in order to find or optimize solutions to a given problem; rather, he aimed to create a system where artificial entities could spontaneously evolve, co-evolve and potentially create whole artificial ecosystems.

An artificial entity in Tierra's framework (Ray 1992) is a program composed of a sequence of instructions chosen from an instruction set containing 32 quite traditional assembler instructions, somewhat tuned by the author so that their usage would facilitate « replication » of the code. Every artificial entity runs in its own « virtual CPU », but its code stays encoded in the « soup », i.e. a piece of RAM which is potentially read-accessible to all other entities as well. Rare « cosmic ray » mutations flip the bits of the « soup » from time to time; more variation is ensured by bit-flipping during the procedure whereby the entity replicates (i.e. copies) its code from the « mother cell » section of the soup to the « daughter cell » section. Selection is, in a certain sense, emulated by a so-called Reaper process which tends to stop the execution of programs which are either too old or contain too many flawed instructions. Other than that, there is nothing which resembles the traditional notion of an exogenously defined « fitness function »: within Tierra, the survival (or death) of diverse species of programs is a direct consequence of the species' ability (or inability) to obtain access to limited resources (CPU & memory).

Thus, after one seeds the initially empty soup with a manually constructed individual containing 80 instructions allowing it to copy its code into the daughter cell of the memory, after the memory has been filled and the battle for resources has started, and once mutation has generated a sufficient amount of variation, one can observe the emergence of dozens of new forms of replicable programs. Some of them being parasites, and some of them being able to create algorithmic counter-measures against parasites, one can literally observe the emergence of an artificial yet living ecological system.
It is therefore hardly surprising that Tierra could automatically evolve, among others, an individual containing just 22 instructions yet capable of replication – that is, a replicator almost four times shorter than the replicator manually programmed by the conceptor of the system and injected into the initial « soup ». Currently, the most famous descendant of Tierra is the AVIDA system (Ofria and Wilke 2004). Contrary to Tierra, however, every AVIDA individual is encapsulated within its own virtual CPU and memory space. Tierra's Darwinian metaphor7 of computer programs evolving by fighting for limited resources is thus not followed as strictly.

7 http://life.ou.edu/pubs/tierra/node3.html

2. Language development

Language development (LD) is a constructionist process which endows humans with the capacity to transfer information to, and obtain information from, other humans by means of verbal communication. The term « language development » shall be used in preference to « language acquisition » in order to mark the fact that the child does not merely passively « acquire » the language from environmental input, but rather gradually builds it in interaction with its environment. Sometimes the term « language learning » shall be used as well to denote the same process. In our Thesis we shall focus only on modeling the development of the « first language »; i.e. we shall aim to present a computational and evolutionary model of the process by means of which a human baby learns the language of its closest social environment.

The child's closest social environment consists of her parents, most notably her mother. Hundreds of studies have been conducted to study the nature of « motherese », a special simplified language between mothers and their children (M. Harris 2013). While many studies point in divergent directions, they more or less agree that « Maternal speech has certain characteristics that distinguish it from speech to other adults. These characteristics are in essence simplicity, brevity and redundancy. » What is more, it seems to be a well-established fact that there exists a reciprocal link between the complexity of motherese and the complexity of the child's production. In other terms, mothers adjust their language according to the stage of the child's linguistic development.

Other studies also indicate the existence of a causal link between the quantity and simplicity of motherese utterances on one hand and the child's linguistic development on the other. More concretely, studies like that of (Furrow et al. 1979) indicate that the child's confrontation with frequent and simple utterances facilitates her linguistic development, while a more complex style can slow it down. Other studies, like that of Ellis & Wells (1980), add that « children who showed the earliest and most rapid language development received significantly more acknowledgments, corrections, prohibitions and instructions from their parents ». This causal link between the mother's linguistic productions and the child's developing linguistic competence shall play an important role when we discuss the « fitness function problem ». More concretely, we shall try to integrate into our computational models the idea that the fitness function evaluating the performance of the child's internal categorization mechanism and/or candidate grammar is external to the child: the fitness function shall be given by the mother's behaviour.

2.1. Ontogeny of semantic categories (concepts)

Natural language furnishes a communication channel for the exchange of meanings.
Meaning (also called « signifié » in traditional linguistics) is intentional: it refers to some external entity (also called « referent »). Within the language L, a meaning M can be denoted by a token (also called « signifiant »), and it is by the exchange of physical manifestations of these tokens (phonic in the case of spoken language, graphemic in the case of written language, etc.) that producer (speaker|writer) and receiver (hearer|reader) communicate.

Traditionally, the meaning of a word, i.e. its « semantics », was often considered as something almost « sacred » and impossible to formalize by mathematical means. The maximum which could be done – and had been done from Aristotle until the middle of the 20th century – was to define a concept in terms of lists of « necessary and sufficient features ». Two types of features were considered to be both necessary and sufficient for the definition of the majority of concepts: the first specifying the concept's genus (or superordinate concept), and the second specifying the particular property (differentia) which distinguishes the concept from other members of the same genus. Thus, for example, « dog » could be defined as a domesticated (differentia) canine (genus). An important property of such a system of concepts was that it allowed no ambiguous or fuzzy border cases: the logical « law of excluded middle » guaranteed that any entity which was not both canine and domesticated at the same time (e.g. a chihuahua which passed all her life in wilderness) could not be called a dog.

The change of paradigm came slowly with the works of late Wittgenstein8, but especially with the empirical studies of Eleanor Rosch (Rosch 1999), who realized not only that concepts are often defined by bundles of features which are neither necessary nor sufficient, but also that the degree with which a feature can be associated with a concept often varies. Subsequently, Rosch proposed a « prototype theory » of semantic categories whose basic postulate is that some members of a category (or some instances of a concept) can be more « central » in relation to the category (resp. concept) than others.

8 « For a large class of cases of the employment of the word 'meaning' – though not for all – this way can be explained in this way: the meaning of a word is its use in the language. » (Wittgenstein 2009)

Prototype theory, as well as other theoretical and empirical advances, in combination with the development of information-processing technologies, paved the way to an operationalization of semantics which allows us to transform meanings of words into mathematically commensurable entities. In computational semantics, the meaning of a token X observable within a language corpus C is often characterized as a vector of relations which X holds with other tokens observable within the corpus. The set of such vectors associated with all tokens observable in C yields a « semantic space », which is a vector space within which one can effectuate diverse numeric and|or geometric operations. In short, concepts can be operationalized as geometric entities (Gärdenfors 2004). « In the most simple case can be the vector which denotes concept X calculated as a linear combination of vectors of concepts in context of which X occurs » (Hromada 2013a). This is an algebraic form of the famous « distributional hypothesis » stating that « a word is characterized by the company it keeps » (Z. S. Harris 1954), which can be considered the central dogma of statistical semantics.
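A minimal sketch of this operationalization – co-occurrence counting within a small window, followed by cosine comparison – may be helpful here; the toy corpus and the window size are illustrative assumptions:

from collections import Counter, defaultdict
from math import sqrt

corpus = "the dog bites the man the cat bites the man the dog chases the cat".split()
WINDOW = 2                                    # context = up to 2 neighbouring tokens

vectors = defaultdict(Counter)                # token -> sparse co-occurrence vector
for i, w in enumerate(corpus):
    for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
        if i != j:
            vectors[w][corpus[j]] += 1        # X's vector is built from X's contexts

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    return dot / (sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values())))

# tokens that keep similar company obtain similar vectors
print(cosine(vectors['dog'], vectors['cat']), cosine(vectors['dog'], vectors['bites']))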
The distributional hypothesis is, in a certain sense, a variation on an old « associationist » explanation of the functioning of the mind, which stated that the essence of mind is somehow related to the mind's ability to create relations, i.e. associations, between successive mental states. Both the mind's faculty to create associations – considered by philosophers like Hume and Locke to be the primary faculty of mind – and the distributional hypothesis that the meaning of a symbol X can be defined in terms of the meanings of symbols with which X co-occurs can, we believe, be neurologically explained in terms of the postulate first stated by the neurologist Hebb: « The general idea is an old one, that any two cells or systems of cells that are repeatedly active at the same time will tend to become 'associated', so that activity in one facilitates activity in the other. » (Hebb 1964)

One can assume that 1) not only single neurons but, mutatis mutandis, also whole neural circuits are governed by Hebb's rule, and 2) distinct words Wx and Wy are somehow processed and represented by distinct neural circuits Nx and Ny. It then follows that whenever a hearer hears (or a speaker speaks) the two-word phrase WxWy, the ensemble of material (synaptic?) relations between Nx and Ny gets reinforced. In more geometrical terms, on a more « mental » level, such a « rapprochement » of Nx and Ny would be characterized by a convergence of the geometrical representations of both circuits towards their common geometrical centroid. Thus, after processing the phrase WxWy, the vectorial representations of Nx and Ny will be closer to each other than before hearing (or generating) the phrase.

In our Thesis we shall presuppose that an associationist principle similar to the one described above is indeed at work whenever a human mind constructs a concept. We use the term « concept » synonymously with the term « semantic class »: we define both concepts and semantic classes either as subspaces of the « semantic vector space », or as centroid points of such subspaces. Theoretically, there are multiple (and possibly infinitely many) ways in which a cognitive system can internally represent an external environment E (or, in the case of a computational linguistic agent, a corpus C) as a « semantic space » S of dimensionality D.

It is important to notice that the overall partitioning of the cognitive system's vector space determines how the system classifies the world. If the system's ability to correctly classify the world determines the reproductive fitness of the organism within which the cognitive system is embedded, one can state that the topology of the internally represented semantic space can quite directly influence the organism's fitness. Consider, for example, the reproductive fitness of a member of a prey species which sometimes mis-classifies a member of a predator species as a sexual mate, and compare it to the fitness of an individual of the same prey species whose semantic space is optimized so that the probability of such mis-classification is practically reduced to zero.

The question whether such « semantic space optimization » occurs during the phylogeny of the human species, or whether it occurs principally during the early years of the child's development (i.e. ontogeny), is a variant of the « nature vs. nurture » (Galton 1875) debate between « nativists », who bet on the « innateness » of certain faculties of the human psyche (c.f. the discussion of Evolutionary Psychology above), and empiricists, who believe that practically all the knowledge we dispose of and use in everyday life is acquired from the environment.
Being aware of the results of studies suggesting that children of very small age dispose of knowledge concerning basic relations among physical objects, or even of social and moral skills (Haidt 2012), we consider it unwise to label the nativist position as a priori invalid. On the other hand, being aware of the force with which processes like socialisation, acculturation and learning mould the psyche of an adult individual, we shall definitely consider as true the statement that « the topology of the semantic space represented within the cognitive system of a human individual can be optimized by supervised assimilation of knowledge encoded in the surrounding environment ». Notwithstanding the answer to the nature & nurture question in regard to the human faculty of categorization, the part of our Thesis devoted to « evolutionary models of concept construction » shall simply suggest that something like optimization of semantic spaces by means of evolutionary computing is, indeed, possible.

2.2. Ontogeny of formal categories (parts-of-speech)

Words of a language can also be partitioned into classes independently of their semantic content. For example, while there is practically no manifestly evident shared semantic feature between words like « apple » and « process », they can both be considered as belonging to the same category of « nouns ». The principal reason for this is the fact that within a sentence like, for example, « This apple makes me happy », one can freely substitute « process » for « apple » and still obtain a grammatically correct sentence.

Sometimes the formal categories and semantic categories partially overlap. Such is the case, for example, in many Indo-European languages, where one often finds « feminine » nouns marked with markers of one formal group and « masculine » nouns marked with markers of another group. An even more extreme case of such an « overlap » of semantic and formal categorization processes was observed among the Dyirbal aborigines of Australia, who use the same determiner « balan » (in a certain sense analogous to the German article « die ») in front of all nouns referring to « women, fire and dangerous things » (Lakoff 1990). In the modern linguistic tradition, however, semantic and formal categories are considered to be independent of each other.

There exist multiple dimensions along which linguistic tokens can be categorized into formal classes. Most importantly, the appartenance of a word W to a class C can principally be inferred from: 1) its position in regard to other words, and 2) its morphology (i.e. its internal composition, with all prefixes, word root, suffixes etc.).9

It is also important to realize that the same token can belong to many different categories at the same time, and that the relations between the categories themselves can be either inclusive, for « nested » categories, or « orthogonal ». Thus, for nested categories, the appartenance of, for example, the German token « die Schönheit » to the « gender » subcategory « feminine » immediately implies that it also belongs to the part-of-speech « noun ». On the other hand, the sole fact that it is « feminine » does not inform us whether it should be attributed to the « nominative » or « accusative » sub-subcategories of the grammatical subcategory « case ». Thus, the subcategories of « case » and « gender », while both being « nested » within the part-of-speech category of « nouns », are orthogonal to each other.10
9 C.f. (Hromada 2014a) for a comparative study assessing the impact of morphology and word-order features upon POS-induction in Bulgarian, Czech, Estonian, Farsi, English, Hungarian, Polish, Romanian, Russian and Slovak.
10 The theoretical importance of the existence of this distinction in regard to current formal grammar models of natural languages shall be further extended in the full version of the Thesis.

On the most abstract level, linguistic tokens can be categorized into two principal 0-level formal categories: « functional » and « lexical » items. The set of functional (grammatical) items is closed, and it contains such parts-of-speech as determiners, conjunctions, pronouns and prepositions. On the other hand, the classes of « lexical items » are open and include meaning-carrying parts-of-speech like nouns, verbs, adverbs, adjectives etc. A study by (Shi et al. 1999) offers evidence that even newborn children (1-3 days old!) react differently to lexical and functional words and are thus « able to categorically discriminate these sets of words based on a constellation of perceptual cues that distinguish them ».

Once children are able to distinguish functional words from lexical ones, the process of ontogeny of formal categories can proceed towards the development of part-of-speech categories. While it would definitely be mistaken to state that all languages of the world can be partitioned into and mapped upon the part-of-speech categories known from English or other Indo-European languages (i.e. nouns, adjectives, pronouns, verbs, adverbs, prepositions, conjunctions, interjections), linguists generally agree that some kind of « noun »-resembling and « verb »-resembling categories are to be observed in all systems of human verbal communication. It is undoubtedly the case that between birth and circa two years of age, prototypes for such part-of-speech clusters are being formed within the child's cognitive system. This has to be so because, around the age of 2, children usually start to apply specific rules to specific items (i.e. they start to conjugate verbs or decline nouns). Subsequently, the learning of much more subtle distinctions, related to grammatical categories like genus, casus and numerus for nouns, or modus, tempus etc. for verbs, can take place. For diverse case studies concerning the acquisition of formal categories, c.f. (Y. E. Levy, Schlesinger, & Braine, 1988).

The acquisition of both semantic and formal linguistic categories is facilitated by so-called « variation sets » (VS). One observes a linguistic variation set whenever an identical word or cluster of words occurs, in identical or slightly varied form, within multiple consecutive utterances. Not only are nursery rhymes and lullabies filled with such « alternations in maternal self-repetitions » (Hoff-Ginsberg 1986); VS are also highly frequent in standard « motherese ». In Turkish, for example, VS seem to make up approximately 20% of child-directed speech (Küntay and Slobin 1996), and very similar proportions are also reported for English (Brodsky et al. 2007).
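For illustration, a minimal sketch of how variation sets might be detected in a transcript of child-directed speech, assuming a simple lexical-overlap criterion between consecutive utterances; the threshold and the sample utterances are our own illustrative choices, not a method taken from the cited studies:

def variation_sets(utterances, min_overlap=0.5):
    # Group consecutive utterances that share at least min_overlap
    # of the words of the shorter utterance.
    sets, current = [], [utterances[0]]
    for prev, curr in zip(utterances, utterances[1:]):
        a, b = set(prev.split()), set(curr.split())
        if len(a & b) / min(len(a), len(b)) >= min_overlap:
            current.append(curr)
        else:
            if len(current) > 1:
                sets.append(current)
            current = [curr]
    if len(current) > 1:
        sets.append(current)
    return sets

transcript = ["put the ball in the box", "put the ball there",
              "in the box", "where is daddy"]
print(variation_sets(transcript))   # -> [['put the ball in the box', 'put the ball there']]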
Note that the notion of « variation set » can be interpreted in terms of evolutionary theory, given that:

• maternal self-repetition can be interpreted as a form of « replication in time », whereby every single utterance is considered to be an independent individual
• alteration of form between subsequent utterances can be interpreted as the result of a variation operator influencing the mother's production of new sentences

In the context of our tentative to explain language development in terms of evolutionary theory, and to suggest its validity by means of an evolutionary computation model, we find this insight worth quoting: « the image that best characterizes the young language learner is that of a multilevel analyzer who is working with several types of analysis simultaneously, with different degrees of success, as learning progresses » (Levy 1988).

It may be stated that the reason why categorization processes develop in the first place is the cognitive system's tendency to optimize its functions and structures. As Maratsos put it: « Once the speaker hears just one grammatical use of a new word which suffices to identify its membership in a category, he can refer to the whole system of rules involving this category » (Maratsos 1988). Thus, both semantic and formal categories can reduce the cost of processing and storing information by and within the cognitive system.

2.3. Ontogeny of grammars (grammar induction)

The partitioning of words into grammatical categories can be useful only if it is accompanied by the development of grammatical rules which combine members of diverse categories in order to produce meaningful sentences. We reiterate that, strictly formally, a grammar is defined as the tuple {N, T, P, S}, where N denotes the set of non-terminals, T the set of terminals, S is the starting symbol which is a member of N, and P denotes the set of production rules that substitute elements of N by elements of N, T or their combinations. Within such a formal framework, the problem of partitioning words into diverse grammatical categories can be thought of as equivalent to the problem of discovering production rules which 1) associate members of T (words) with members of N (labels of distinct categories) and 2) combine elements of N in order to produce new elements of N.

In fact, the problems of construction of formal categories and of discovery of grammatical rules are mutually intertwined; some researchers go even so far as to state: « Category symbols, whether in phrase structure rules or in the lexicon, are logically equivalent to the rules written on them, and as such are completely system-dependent: They are shorthand descriptions of the rule system as a whole. By anyone's theory, young children's linguistic system does not possess all the features of the endstate system. In other words, their language cannot be described by the same grammar as the adult system » (Ninio 1988).

In the literature, the development of language is often described as a process composed of three « stages » which can be further subdivided in the following manner: « Pregrammatical: a. Rote-learning – item-based acquisition is manifested in the use of formally unanalyzed units or chunks; b. Initial modifications – formal alternations apply to a small number of highly familiar, good exemplars; Structure-bound: c. Interim schemata – transitional or bridge strategies take the form of productive, but nonnormative rules; d.
Grammaticization – structure-bound rules are those of the endstate grammar; Discourse-oriented: e. Convention and variety – grammatical rules are deployed with appropriate, discourse-sensitive lexical restrictions, stylistic alternations, usage conventions, register distinctions etc. » (Berman 1988).

In our Thesis, we shall put aside the intricacies of the third, « Discourse-oriented » stage and shall focus on the « Pregrammatical » and « Structure-bound » stages. More concretely, we shall aim to explain the acquisition of words and word chunks in phase a. as the result of a « crossover » between structures present in the environment and structures represented within the cognitive system; the gradual emergence of categories and associated production rules observable during phases b., c. and d. shall be explained not only in terms of an informatic crossover of structures present in the environment and represented in the cognitive system, but also as the result of purely internal replication, variation and decay, proper to the cognitive system and resulting in a complexity-increasing « battle for resources » among the structures represented within it. We are convinced that by introducing such « cognitive-system-internal variation operators » like « entropy-induced decay » (associated with the phenomenon of « forgetting ») and « structural merging » (associated with the phenomenon of « dreaming ») we can, for example, offer a very simple, natural, yet effective solution to the so-called « overgeneralization » problem.11

Overgeneralizations of grammatical rules are most often observable in phases c. and d. (i.e. between 2 and 4 years of age), whenever the child applies a production rule beyond the scope of its validity. The most famous example of overregularization in English is that practically all children apply the rule VPast → VPresent + ed to all verbs. Thus, especially during MLU stages 4 and 5,12 they generate past forms like « throwed » or « braked » which are not correct. What is fascinating about the problem of overregularization is not only that all children eventually come to employ the irregular past forms, so that the errors are not reproduced anymore, but especially the fact that children often used the correct « irregular form » even before (i.e. in one-word phases a. and b.), and only later converged to the incorrect overregularization: « Initially, children's uses of -ed past tense are all accurate. They may say melted or dropped, but not, as they later do, runned and breaked » (Maratsos 1988).

We see an important analogy between observations of such a sequence of correct/incorrect/correct behaviour and the general behaviour of evolutionary systems, which also often « reject » locally optimal solutions and descend into fitness landscape valleys in order to subsequently climb towards more optimal states.

11 According to the domain (formal, semantic), the problem is also sometimes named « overextension », « overregularization » or the problem of « overinclusive grammar ».
12 MLU means « Mean Length of Utterance » and is a measure traditionally used in developmental psycholinguistics for assessing a child's linguistic performance at a given age. In the period when the child produces one-word utterances like « mama » or « tato », MLU is considered to be 1; later, when the child starts to say two-word utterances like « mama nene », MLU increases towards 2, etc.
Thus, we believe that the term « conflict » present in the following principle can also be interpreted in an evolutionary sense: « Whenever a newly acquired specific rule (i.e. a rule that mentions a specific lexical item, like throw, make, allow, report) is in conflict with a previously learned general rule (i.e. a rule that would apply to that lexical item but also to many others of the same class), the specific rule eventually takes precedence » (Braine 1971).

MacWhinney uses a similar term, « competition », to label his Competition Model of linguistic competence: « The competition model assumes that lexical elements and components to which they are connected can vary in their degree of activation. Activation is passed along connections between nodes. During processing, items are in competition with one another. In auditory processing …, in allomorphic processing …, in the processing of role relations, in polysemy …, the item that wins out in a given competition is the one with the greatest activation » (MacWhinney 1987).

If one could interpret the last phrase of the above citation as « the component which has the greatest activation has the greatest fitness and thus the highest probability of being replicated within the cognitive system », one could consider MacWhinney's connectionist model to be an evolutionary one, and thus pointing in our direction. But since that is not the case – it seems that MacWhinney's model does not, at least not explicitly, involve any processes of replication or sources of random variation, nor does it explicitly work with « populations of grammars » – we are obliged to look for another theoretical framework which could more easily integrate such notions. It may be the case that the theories of « Grammar Systems » (Csuhaj-Varjú 1994) and « Language Colonies » (Kelemen and Kelemenová 1992) could furnish such a framework for our tentative to explain the ontogeny of grammar in a human individual as an evolutionary process. Both will be introduced in Part 4 of this text.

3. Computational Models of Text Processing

The majority of the models and algorithms presented in this chapter are the results of the intellectual work of computational linguists working in the domain of « Natural Language Processing » (NLP). In NLP, one processes data encoding natural (human) languages with computational methods which often involve machine learning, data mining, information retrieval, statistical inference or artificial intelligence (AI) algorithms. Among the principal objectives of NLP one can include: 1) to allow machines to « understand » and|or work with meanings; 2) to develop an autonomous artificial agent (Hromada, 2012) able to pass the Turing Test (Turing 2008); and 3) to elucidate, by means of computational simulations, possible ways in which the human cognitive system treats natural language.

The computational aim of our Thesis overlaps especially with NLP's third objective. Such an aim brings with it many complex problems which are not easy to tackle; thus, in order to reduce their number and complexity, we shall reduce the notion of « Natural Language » to the notion of « text ». It is true that in doing so we completely ignore the phonetic, phonological and prosodic aspects of language, which have been, during practically all of human history, the principal way in which human speakers encoded their messages in order to transfer them to other human hearers.
It is only in the course of a few recent centuries that communication by means of text became prominent, and only within the last decades has it become dominant, mainly because of the increasing role of computers in our lives. This is at least partially so because computers are essentially machines built for the processing of sequences of discrete symbols, and that is what a text is – a sequence of discrete symbols. This stands in contrast to the flux of spoken language, which is also a sequence, but one composed of units whose boundaries are often unclear and whose features overlap.

3.1. Concept construction

We define the « concept construction » (CC) problem as an open-class variant of the « classification » or « categorization » problem. In a classical, « closed-class » categorization problem, the objective is to assign a label denoting membership in a category C1 to a set of objects disposing of a particular combination of properties (also called « features » in the AI community), and to assign to categories C2, C3 etc. other objects disposing of different features. The problem of « binary classification », where only two categories are involved, is well studied, and dozens of diverse algorithms exist which allow one to train, in a machine learning scenario, classification models (« classifiers ») which will subsequently quite successfully classify objects of the « testing set » which were absent from the « training set ».

In NLP, one often solves the classification problem by means of so-called « Support Vector Machines » (SVM). During the training of an SVM, the algorithm tries to discover a hyperplane « that has the largest distance to the nearest training data point of any class » (Vapnik et al. 1997). SVMs belong to the group of « linear classifiers », which all base their classification decisions on linear combinations of the characteristics (features) of the objects-to-be-classified. Other machine learning algorithms as diverse as Linear Discriminant Analysis, Naive Bayes classifiers, logistic regression or the perceptron also belong to the group of linear classifiers. « Multiple class » variants of these algorithms exist as well, allowing for the classification of objects into more than two categories. In the case of all these algorithms, however, all the classes-to-be-looked-for are known in advance: the datapoints in the training set are labeled with labels belonging to a finite set, and during the subsequent testing phase, one's objective is simply to attribute the correct label to a new object. While the object itself was most probably not present in the training set and is thus « new », the finite set of all class/category labels-to-be-attributed is well known from the very beginning of training. In this sense, all the algorithms mentioned above address the closed-class variant of the classification problem.

On the contrary, in the open-class variant of the classification problem one can potentially be asked, in the testing phase, to attribute to an object which was not present during the training phase a label which was also not present in the training phase. In other terms, in the open-class variant of the classification problem one knows in advance neither the number nor even the nature of the categories which are to be constructed.
3.1.1. Non-evolutionary model of CC

One possible way to address the problem of Concept Construction – which we consider to be an instance of the « open-class classification problem » as defined above – can be described as follows:

1. During the (train|learn)ing phase, use the training corpus to create a D-dimensional semantic vector space, i.e. attribute vectors of length D to all members of the set of entities E (word fragments, words, documents, phrases, patterns) which includes everything observable within the training corpus.
2. During the testing phase:
2.1 characterize the object (text) O by a vector $\vec{o}$ calculated as a linear combination of the vectors of the features which are observable in O and whose vectors were learned during the training phase;
2.2 characterize the labels-to-be-attributed L1, L2, … by vectors $\vec{l}_1, \vec{l}_2$, …;
2.3 associate the object O with the closest label. In case we use the cosine metric, we minimize the angle between $\vec{o}$ and the label vectors, i.e. we look for

$\arg\max_x \cos(\vec{o}, \vec{l}_x)$

Note that in order to make this approach functional, two important conditions have to be fulfilled. Primo, the vectors associated with the entities observable within the training corpus must be commensurable, i.e. they have to be of the same dimensionality and be members of the same vector space. Secundo, the set of all entities E observed during learning has to be sufficiently exhaustive, so that potentially any novel label or object which appears during the testing phase can be at least partially characterized in terms of members observable during the training phase.

The first condition of « entity commensurability » is not fulfilled by many vector space models, which often yield multiple spaces for entities of different « types ». In such models, « word » entities are often encoded as rows of a matrix, while « context » or « document » entities, i.e. entities within which the word entities occur, are encoded as columns of the same matrix, or are encoded in a completely different matrix. On the contrary, algorithms like Random Indexing (RI) or Reflective Random Indexing (RRI) construct semantic vector spaces from initial textual corpora in such a way that everything they encounter – be it syllables, words or whole documents – is ultimately represented as rows of the same matrix. RI and RRI also have other advantages, which are more closely described elsewhere (Sahlgren 2005; Cohen et al. 2010; Hromada 2013b). For present purposes, let us just underline the fact that both RI and RRI can be quite computationally efficient, since they are able to « project » semantic relations hidden in the text upon a vector space of restrained dimensionality. Theoretically, this is permitted by the so-called Johnson-Lindenstrauss lemma, stating that « a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved » (Johnson and Lindenstrauss 1984).
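A minimal sketch of steps 2.1-2.3 may clarify the procedure; the tiny hand-made « semantic space » stands in for vectors learned during a real training phase, and the words and label names are purely illustrative:

import numpy as np

# toy « semantic space » learned during training: every entity is a row vector
space = {'cat': np.array([1.0, 0.1, 0.0]), 'dog': np.array([0.9, 0.2, 0.1]),
         'bond': np.array([0.0, 0.1, 1.0]), 'stock': np.array([0.1, 0.0, 0.9]),
         'animals': np.array([1.0, 0.2, 0.1]), 'finance': np.array([0.1, 0.1, 1.0])}

def vec(text):                         # step 2.1: linear combination of known feature vectors
    return np.sum([space[w] for w in text.split() if w in space], axis=0)

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def attribute(text, labels):           # step 2.3: arg max of cosine over label vectors
    o = vec(text)
    return max(labels, key=lambda l: cos(o, space[l]))

print(attribute('the dog chased the cat', ['animals', 'finance']))  # -> 'animals'

Because labels are represented in the same space as all other entities, a label never seen as a label during training can still be attributed, which is precisely what the open-class setting requires.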
In 2012, a hybrid system with an RRI semantic component at its very core was deployed in the francophone datamining competition DEFT2012 (El Ghali et al. 2012). The goal of the competition was to create an automatic NLP system which would be able to attribute to scientific articles the same keywords as were attributed by their authors. In other terms, the goal was to artificially simulate the cognitive activity of « attributing a conceptual label » to a scientific article. The tricky thing about the problem was that it was not a standard « closed-class » classification problem, but indeed an « open-class » problem, since there were many keyword labels which were not present in the training set, yet were to be attributed in the testing scenario.

Figure 6: Description of the DEFT2012 system for automatic attribution of keywords to scientific articles. Figure reproduced from the conference poster.

Figure 6 illustrates the relations among the diverse components of this hybrid system. As may easily be seen, a whole « artillery » of diverse NLP tools like POS-taggers, lemmatizers and chunkers was deployed in order to yield a sufficiently exhaustive set of features, from which two distinct semantic spaces were composed by means of RRI. The resulting semantic spaces were subsequently post-optimized by a combination of probabilistic Bayesian Networks and production rules. In the first, simpler task of the DEFT2012 competition, the system attained an F-score of 94.8%. The task was simpler because a list of candidate labels was furnished with the training corpus, and subsequently another list of candidate keywords was furnished with the testing corpus. The system attained an F-score of 58.7% in the second, more difficult task, where no such lists were given. In both tasks it outperformed the systems deployed by the other 9 participants of the competition.

3.1.2. An evolutionary model of CC

Task 4 of the 2014 edition of the datamining competition Défi Fouille de Textes (DEFT) was understood as an instance of a classification problem with an open number of classes. More concretely, the challenge was to create an artificial system which would be able to attribute a specific member of the set of all class labels to the scientific articles of the testing corpus. A training corpus of 208 scientific articles, presented in diverse sessions of diverse editions of the annual TALN/RECITAL conference, was furnished to facilitate the training of the model.

To solve this problem, we proposed an algorithm consisting of two nested components, as represented in Figure 7. The inner component, which we call Reflective Space Indexing (RSI), is responsible for the construction of the vector space. Its input is a genotype – a list of D features which trigger the whole reflective process; its output – a phenotype – is a D-dimensional vector space consisting of vectors for all features, objects (documents) and classes. The inner component is « reflective » in the sense that it multi-iteratively characterizes not only objects in terms of their associated features, but also features in terms of their associated objects. RSI's principal parameter is the number of dimensions of the resulting space (D).

The input of RSI is a vector of length D whose D elements denote the D « triggering features », the initial conditions to which the algorithm is sensitive in the initial iteration. After the algorithm has received such an input, it subsequently characterizes every object (document) O by a vector of values which represent the frequencies of the triggering features in object O. Initially, every document is thus characterized as a sort of bag-of-triggering-features vector. Subsequently, the vectors of all features – i.e. not only the triggering ones – are calculated as the sum of the vectors of the documents within which they occur, and a new iteration can start. In it, the initial document vectors are discarded, and new document vectors are obtained as the sum of the vectors of the features which are observable in the document. The whole process can be iterated multiple times until the system converges to a stationary state, but it is often the second and third iteration which yield the most interesting results. Note also that what applies to features and objects applies, mutatis mutandis, also to class labels.

Figure 7: Diagram of the DEFT2014 model, embedding the construction of semantic spaces within an evolutionary framework.
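A minimal sketch of the reflective loop just described, under toy assumptions (three short documents, two triggering features, whitespace tokenization); it is meant to illustrate the principle, not to reproduce the DEFT2014 implementation:

import numpy as np

def rsi(documents, triggers, iterations=2):
    # Reflective Space Indexing sketch: documents are first characterized by
    # frequencies of the triggering features; then features and documents
    # reflectively re-describe each other for a few iterations.
    D = len(triggers)
    doc_vecs = [np.array([doc.split().count(t) for t in triggers], float)
                for doc in documents]                  # bag-of-triggering-features
    for _ in range(iterations):
        feat_vecs = {}                                 # feature = sum of its documents' vectors
        for doc, dv in zip(documents, doc_vecs):
            for w in set(doc.split()):
                feat_vecs[w] = feat_vecs.get(w, np.zeros(D)) + dv
        doc_vecs = [np.sum([feat_vecs[w] for w in doc.split()], axis=0)
                    for doc in documents]              # documents rebuilt from features
    return doc_vecs, feat_vecs

docs = ["genetic algorithms evolve populations",
        "grammars generate sentences",
        "evolution selects fit populations"]
doc_vecs, feat_vecs = rsi(docs, triggers=["populations", "grammars"])
print(doc_vecs[0], feat_vecs["evolve"])

Note that after the first iteration, even features which never belonged to the triggering list (here, e.g., « evolve ») obtain vectors in the same D-dimensional space, which is what makes the process « reflective ».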
For the purposes of DEFT2014, every individual RSI run consisted of 2 iterations and yielded a 200-dimensional space. The enveloping outer component is a trivial evolutionary algorithm whose task is to find the most « fit » combination of features for performing the classification task. In every « generation », the evolutionary component injects multiple individual lists of triggering features (i.e. « genomes ») into the inner component and subsequently evaluates the fitness of the resulting vector spaces. It then selects, mutates and crosses over the genotypes which yielded the vector spaces wherein the classification was most precise.

The evolutionary component of the system was conceived as a sort of feature selection mechanism. The objective of the optimization was to find such a genotype – i.e. such a list of « triggering features » – as would subsequently lead to the discovery of a vector space whose topology would make it as classification-friendly as possible. As is common in the evolutionary computing domain, the whole process was started by the creation of a random population of individuals. Each individual was fully described by a genome composed of 200 genes. Initially, every gene was assigned a value randomly chosen from the pool of 5849 feature types observable in the training corpus. In DEFT2014's Task 4 there were thus 5849^200 possible individual genotypes one could potentially generate, and we consider it important to underline that the classificatory performance of the phenotypes, i.e. of the vector spaces generated by RSI from the genotypes, can also vary substantially. What is more, our observations indicate that by submitting the genotypes to evolutionary pressures – i.e. by discarding the least « fit » genomes and promoting, varying and replicating the most fit ones – one also augments the classificatory performance of the resulting phenotypical vector space. In other terms, the search for a vector space which is optimal in regard to subsequent partitioning or clustering can be accelerated by means of evolutionary computation.

During the training, the evaluation of the fitness of every individual in every generation proceeded in the following manner:

• pass the genotype as an input to RSI (D=200, I=2);
• within the resulting vector space, calculate the cosines between all document and class vectors;
• attribute the N documents with the highest score to every class label (N was furnished for both the testing and the training corpus);
• calculate the precision in regard to the training corpus golden standard; this precision is considered to be equivalent to the individual's fitness.

The size of the population was 50 individuals. In every generation, after the fitness of all individuals had been evaluated, 40% of the new individuals were generated from the old ones by means of a one-point crossover operator, whereby the probability of an individual being chosen as a parent was proportional to the individual's fitness. The rest of the new population was generated from the old one by a combination of fitness-proportionate selection and mutation occurring with probability 0.01. Mutation was implemented as the replacement of a value in a genome by another value randomly chosen from the pool of 5849 feature types. Advanced techniques like parallel evolutionary algorithms or parameter auto-adaptation were not used in the study. A compressed sketch of this fitness evaluation and variation loop is given below.
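The sketch assumes the rsi() function from the previous block is in scope; the documents, gold labels and all parameter values are toy stand-ins (the actual DEFT2014 system used genomes of 200 genes, a population of 50 and a pool of 5849 feature types), and precision is computed here in a deliberately simplified way:

import random
import numpy as np

docs = ["genetic algorithms evolve populations", "grammars generate sentences",
        "evolution selects fit populations", "sentences have grammars"]
gold = [0, 1, 0, 1]                                  # toy class labels
pool = sorted({w for d in docs for w in d.split()})  # stand-in for the 5849 feature types
GENOME_LEN, POP_SIZE, P_MUT = 4, 12, 0.05

def class_vectors(doc_vecs, labels):                 # class = sum of its documents' vectors
    return {c: np.sum([v for v, l in zip(doc_vecs, labels) if l == c], axis=0)
            for c in set(labels)}

def fitness(genome):
    doc_vecs, _ = rsi(docs, triggers=genome)         # genotype -> phenotypic vector space
    cvs = class_vectors(doc_vecs, gold)
    guess = [max(cvs, key=lambda c: float(v @ cvs[c])) for v in doc_vecs]
    return sum(g == t for g, t in zip(guess, gold)) / len(gold)   # precision = fitness

population = [[random.choice(pool) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for generation in range(20):
    weights = [fitness(g) + 1e-6 for g in population]
    def parent():                                    # fitness-proportionate selection
        return random.choices(population, weights=weights, k=1)[0]
    nxt = []
    for _ in range(POP_SIZE):
        a, b = parent(), parent()
        cut = random.randrange(1, GENOME_LEN)        # one-point crossover
        child = a[:cut] + b[cut:]
        nxt.append([random.choice(pool) if random.random() < P_MUT else g for g in child])
    population = nxt
print(max(fitness(g) for g in population))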
The algorithm succeeded in optimizing the vector space with respect to the training corpus, attaining a precision of 87%. However, the resulting model over-fitted the training corpus and failed to be fully transferable to the testing corpus. Possibly due to an implementation error – c.f. (Hromada 2014b) for a closer discussion – the model thus achieved only 27% precision when confronted with the testing data. While definitely more performant than the random baseline, our approach was the least performant among the 5 participants of DEFT2014. Notwithstanding the failure of our model in DEFT2014, we consider as important our observation that « by evolutionary selection of chromosome of features which initially « trigger » the reflective process one can, indeed, optimize the topology and hence the classification performance of the resulting vector space » (Hromada 2014b).

3.2. Part-of-speech induction and part-of-speech tagging

The term part-of-speech induction (POS-i) designates the process which endows a human or an artificial agent with the competence to attribute POS labels (like "verb", "noun", "adjective") to any linguistic token observable in the agent's linguistic environment. POS-i can be understood as a « partitioning problem », since one's objective is to partition the initial set of all tokens occurring in a corpus C (which represents the agent's linguistic environment E) into N subsets (partitions, clusters) whose members would correspond to the grammatical categories defined by the gold standard. Because one does not use any information about the « ideal » gold standard grammatical categories during the training phase, and uses it only for the final evaluation of the performance of the model, POS-i is considered to be an « unsupervised » machine learning problem.

POS-i's « supervised » counterpart is the problem of POS-tagging. In POS-tagging, one trains the system by serving it, during the training phase, a sequence of couples (word W, tag T), where the tag T is the label denoting the grammatical category to which the word W belongs. POS-tagging is thus simpler than POS-i, where no information about ideal labels is furnished during the learning. The training of POS-tagging systems is of particular importance especially for languages where many word forms can potentially belong to many part-of-speech categories (in English, for example, almost any noun can also play the role of a verb; a token like « still » can be interpreted as a substantive, verb, adjective or even adverb, its POS category being determined by its context). On the contrary, in morphologically rich languages, where such « homonymy of forms » is present to a lesser degree and the relations between word types and classes are less ambiguous, one can often train a POS-tagging system simply by memorizing an exhaustive list of (W, T) couples.

3.2.1. Non-evolutionary models of POS-i

The paradigm currently dominating the POS-i domain was fully born with the article published by Brown et al. in 1992. Brown and his colleagues applied the information-theoretic notion of « mutual information »

$M(w_1 w_2) = \log \frac{\Pr(w_1 w_2)}{\Pr(w_1)\,\Pr(w_2)}$

upon all bigrams (i.e. sequences of two words) composed of tokens w1, w2, and subsequently devised a merging algorithm able to group words into classes in such a way that the mutual information within a class would be maximized.
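The reconstructed formula above can be computed directly from corpus counts, with relative frequencies standing in for the probabilities; a minimal sketch on a toy corpus:

from collections import Counter
from math import log

corpus = ("the dog barks the cat meows the dog bites "
          "a dog barks a cat meows").split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def mutual_information(w1, w2):
    # M(w1 w2) = log [ Pr(w1 w2) / (Pr(w1) Pr(w2)) ]
    return log((bigrams[(w1, w2)] / (N - 1)) /
               ((unigrams[w1] / N) * (unigrams[w2] / N)))

print(mutual_information('the', 'dog'), mutual_information('dog', 'barks'))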
In the two decades since the publication of Brown et al.'s study, their approach has inspired hundreds of studies : be it hidden Markov models tweaked with variational Bayes (Johnson, 2007), Gibbs sampling (Goldwater & Griffiths, 2007), morphological features (Berg-Kirkpatrick, Bouchard-Côté, DeNero, & Klein, 2010; Clark, 2003) or graph-oriented methods (Biemann, 2006) – all such approaches, and many others, consider the co-occurrence of words within n-gram sequences to be the primary source of relevant information for the subsequent creation of part-of-speech clusters. In all these approaches, one aims to discover the ideal parameters of a Markovian statistical model, often employing a so-called Expectation-Maximization (EM) algorithm to discover the optimal partitioning. Unfortunately, EM is unable to quit locally optimal states once they have been discovered. Notwithstanding this disadvantage, the comparative study of (Christodoulopoulos et al. 2010) suggests that probabilistic models of part-of-speech induction can indeed be very performant. POS-i can also be realized by means of the k-means clustering algorithm, or one of its variants. The k-means algorithm (Karypis 2002) partitions N observations, described as vectors in D-dimensional space, into K clusters by attributing every observation to the cluster with the nearest centroid (i.e. mean). If one considers these centroids to denote prototypes of the categories in whose center they are located, then one can consider the k-means algorithm to be consistent with the « prototype theory of categorization », as proposed by Rosch. Table 1 illustrates a simple k-means partitioning of the tokens present in the English version of Orwell's 1984.

Table 1. K-means clustering of tokens according to both suffixal and co-occurrence information. Table partially reproduced from (Hromada 2014c)

Tag      0     1     2     3     4     5     6
Noun    10   568    97    13  1173   608  1977
Verb     3    67   668  1011    67   958    97

In this example case we have clustered all tokens observable in the corpus into 7 clusters according to features both internal to the token – i.e. suffixes – and external – i.e. co-occurrence with other tokens. Note that even in such a simple model, where no machine learning or optimization was performed, the k-means algorithm somehow succeeds in distinguishing verbs from nouns. As is shown in Table 1, whose rows represent the “gold standard” tags and whose columns denote the artificially induced clusters, even such a naïve computational model has assigned 83.6% of nouns to clusters 1, 4 and 6, while assigning 91.8% of verbs to clusters 2, 3 and 5.
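These percentages can be recomputed directly from the contingency counts of Table 1. The following Perl sketch – our own illustration – performs a standard many-to-one evaluation, assigning every induced cluster to the gold tag which dominates it ; since, unlike the figure quoted above, it does not leave aside the marginal cluster 0, it reports 83.9% instead of 83.6% for nouns.

#!/usr/bin/perl
# Many-to-one evaluation of the clustering of Table 1: rows are gold
# tags, columns the induced clusters 0-6; every cluster is assigned
# to the gold tag that dominates it.
use strict;
use warnings;
use List::Util qw(sum);

my %table = (
    Noun => [ 10, 568,  97,   13, 1173, 608, 1977 ],
    Verb => [  3,  67, 668, 1011,   67, 958,   97 ],
);
my @tags = sort keys %table;

my %clusters_of;    # majority tag -> list of cluster indices
for my $c (0 .. $#{ $table{$tags[0]} }) {
    my ($winner) = sort { $table{$b}[$c] <=> $table{$a}[$c] } @tags;
    push @{ $clusters_of{$winner} }, $c;
}

for my $tag (@tags) {
    my $caught = sum( map { $table{$tag}[$_] } @{ $clusters_of{$tag} } );
    my $total  = sum( @{ $table{$tag} } );
    printf "%s -> clusters %s : %.1f%% captured\n",
           $tag, join(',', @{ $clusters_of{$tag} }), 100 * $caught / $total;
}
# prints: Noun -> clusters 0,1,4,6 : 83.9% ; Verb -> clusters 2,3,5 : 91.9%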
3.2.2. Evolutionary models of POS-i & POS-t
The usage of evolutionary computing in NLP is – in comparison with other methods like neural networks, hidden Markov models, conditional random fields or SVMs – still very rare. This is also the case for NLP's sub-problem of part-of-speech tagging, and thus we are aware of only one tentative to use genetic algorithms to train a part-of-speech tagger : in his proposal, Araujo (2002) describes a system of POS-t involving crossover and mutation operators. What is particularly interesting about Araujo's system is that a separate evolution process is run for every separate sentence of the test corpus. The training corpus, on the other hand, serves mainly as a source of statistical information concerning the co-occurrences of diverse words and tags in diverse word & tag contexts. This information concerning the « global » statistical properties of the training corpus is later exploited in the computation of fitness. Let's take, for example, the phrase « Ring the bell ». Since words like « ring » and « bell » are in English sometimes used as verbs and sometimes used as nouns, such a sentence can be tagged in at least 4 different ways : « N D N », « V D V », « N D V », « V D N » (we denote by the non-terminal symbol D the category of « determiners », into which the article « the » also belongs). Such sequences of tags yield the individual members of Araujo's initial population of chromosomes. In languages like English, where almost every word can be attributed to more than one POS category & the number of possible tag sequences therefore increases with the length of the phrase-to-be-tagged, one will most probably be obliged to choose such initial individuals randomly. The fitness of every individual possibly tagging a sentence of n words is subsequently calculated as a sum of the accuracies of the tags (genes) on positions i :

$$f = \sum_{i=1}^{n} f(g_i)$$

The accuracy of an individual gene $g_i$ is calculated as :

$$f(g_i) = \log \frac{context_i}{all_i}$$

whereby the values of $context_i$ and $all_i$ are extracted from the training table which was constructed during the training phase, and represent the frequency of occurrence of word $w_i$ within the specific context ($context_i$) and within all contexts ($all_i$). Once fitness is evaluated, fitness-proportional crossing-over (50%) and mutation (5%) are realized. Notwithstanding the fact that Araujo doesn't seem to have used any other selection mechanism, in less than 100 generations the populations seemed to converge to sequences of tags which were more than 95% correct in regards to the gold standard. This is a result comparable to other POS-tagging systems, but with a lesser computational cost. It is also worth noting that Araujo's experiments indicate that working solely with the contextual window WL, W, WR, i.e. just looking one word to the left and one word to the right, seems to yield, in the case of POS-tagging of English, higher scores than extracting data from larger contextual spans.
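A minimal Perl sketch of Araujo's fitness evaluation for one candidate tag sequence follows ; the context and all counts are invented toy numbers standing in for the statistics of the training table.

#!/usr/bin/perl
# Araujo-style fitness of one candidate tag sequence for « Ring the
# bell ». The context/all counts are invented toy numbers standing in
# for the statistics gathered from the training corpus.
use strict;
use warnings;

my @genes = (   # one gene (word, tag) per position of the sentence
    { word => 'Ring', tag => 'V', context => 12, all => 40 },
    { word => 'the',  tag => 'D', context => 95, all => 95 },
    { word => 'bell', tag => 'N', context => 30, all => 33 },
);

# f = sum_i log(context_i / all_i)
my $fitness = 0;
$fitness += log( $_->{context} / $_->{all} ) for @genes;
printf "fitness(V D N) = %.3f\n", $fitness;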
When it comes to the « unsupervised » variant of the POS-t problem, id est the problem of part-of-speech induction, up to this date there have been – as far as we know – no tentatives to address the POS-i problem by means of evolutionary computing. For this reason, and because we see strong analogies between the problems of CC and POS-i, our Thesis shall aim to solve this problem with a model similar to the one which we have presented in part 3.1.2 of this work.
3.3. Grammar induction
The input of the Grammar Induction (GI) process is a corpus of sentences written in language L ; its output is, ideally, a grammar (i.e. a tuple G={S,N,T,P} as defined in the above chapters), or at least a model able to generate sentences of L, including such sentences as were not present in the initial training corpus. The nature of the resulting grammar is closely associated with the content of the initial corpus as well as with the nature of the inductive (learning) process. According to their « expressive power », all grammars can be located somewhere on a « specificity – generality » spectrum. On one extreme of the spectrum lies the grammar having the following production rules :

1 → 2*
2 → a | b | c … Z

whereby * means « repeat as many times as You Want ». This very compact grammar can potentially generate any text of any size and as such is very general. But exactly because it can accept any alphabetic sequence, and thus does not have any « discriminatory power » whatsoever, such a grammar is completely useless as an explication of the system of any natural language. On the other extreme lies a completely specific grammar which has just one rule :

1 → C

This grammar contains exactly what the corpus C contains and is thus not compact at all (it is even two symbols longer than C). Such a grammar is not able to encode anything else than the sequence which was literally present in the training corpus, and is therefore also useless for any scenario where novel sentences are to be generated (or accepted). The objective of the GI process is to discover, departing solely from a corpus C (which is written in language L), a grammar which is neither too specific, nor too general. If it is too general, it shall « overgeneralize », i.e. it shall be able to generate (or accept) sentences which aren't considered grammatically correct by a common speaker of L. If it is too specific, it shan't be able to represent all sentences contained in C or, if it shall, it shan't be able to generate (or accept) any sentence which is considered to be a sentence of L but was not present in the initial training corpus C.
3.3.1. Non-evolutionary models of grammar induction
One of the first serious computational models of GI is the « Syntagmatic – Paradigmatic » (SNPR) model presented in (Wolff 1988). Its core algorithm is presented in Table 2.

Table 2. Outline of processing in the SNPR model (reproduced from Wolff, 1988)
1. Read in a sample of language.
2. Set up a data structure of elements (grammatical rules) containing, at this stage, only the primitive elements of the system.
3. WHILE there are not enough elements formed, do the following sequence of operations repeatedly:
BEGIN
3.1 Using the current structure of elements, parse the language sample, recording the frequencies of all pairs of contiguous elements and the frequencies of individual elements. During the parsing, monitor the use of PAR elements to gather data for later use in the rebuilding of elements.
3.2 When the sample has been parsed, rebuild any elements that require it.
3.3 Search amongst the current set of elements for shared contexts and fold the data structures in the way explained in the text.
3.4 Generalize the grammatical rules.
3.5 The most frequent pair of contiguous elements recorded under 3.1 is formed into a single new SYN element and added to the data structure. All frequency information is then discarded.
END

We consider the SNPR model to be of particular importance because of its aim to explain the process of Grammar Induction as a sort of cognitive optimization : « The central idea in the theory is that language acquisition and other areas of cognitive development are, in large part, processes of building cognitive structures which are in some sense optimal for the several functions they have to perform » (Wolff 1988). Wolff also associates his « cognitive optimization hypothesis » with the « law of cumulative complexity » postulated in the study (Brown 1973), which is considered to be a big classic of the language development literature : « if one structure contains everything that another structure contains and more, then it will be acquired later than that other structure » (Wolff 1988). The grammar resulting from such a contact between a language sample and the SNPR inducing mechanism is displayed in Figure 7.
Figure 7: Grammar induced by the SNPR model. Figure reproduced from (Wolff, 1988).
In Wolff's theory, optimization is further understood as compression. Within the SNPR model, such compression is realized in step 3.5 of the algorithm, where the most frequent pair of contiguous elements (either terminals or non-terminals) is substituted by a new non-terminal symbol.
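Step 3.5 is easy to make concrete. The following Perl sketch – our own toy illustration, not Wolff's code – performs three cycles of the « most frequent pair of contiguous elements becomes a new SYN element » substitution which drives the compression :

#!/usr/bin/perl
# Three cycles of SNPR step 3.5: the most frequent pair of contiguous
# elements is formed into a single new SYN element. Toy sample of our own.
use strict;
use warnings;

my @sample = qw(the dog saw the cat the dog bit the cat);

for my $cycle (1 .. 3) {
    my %pairs;
    $pairs{ "$sample[$_] $sample[$_+1]" }++ for 0 .. $#sample - 1;
    my ($best) = sort { $pairs{$b} <=> $pairs{$a} or $a cmp $b } keys %pairs;
    my $syn = "SYN$cycle";
    print "$syn -> $best\n";

    my @rewritten;                       # substitute the new element
    for (my $i = 0; $i <= $#sample; $i++) {
        if ($i < $#sample and "$sample[$i] $sample[$i+1]" eq $best) {
            push @rewritten, $syn;
            $i++;                        # skip the second member of the pair
        } else {
            push @rewritten, $sample[$i];
        }
    }
    @sample = @rewritten;
    print "sample: @sample\n";
}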
Owing to this substitution, the size of the grammar able to generate the initial language sample ideally decreases with every cycle of the model's « while » loop, until the process converges to a state where no redundancy remains to be « compressed ». Wolff proposes that Grammar Induction is a process which should maximize the coding capacity (CC) of the resulting grammar while minimizing its size (in current research, it is more common to speak about the grammar's Minimum Description Length, MDL). He defines the ratio CC/MDL to denote the grammar's efficiency, and it may be the case that within a more evolutionary framework, where one would work with populations of grammars, a very similarly defined notion of efficiency could be used as the core component of the fitness function. Unfortunately, Wolff's 1988 SNPR model is not evolutionary, since it involves neither stochastic factors nor the notion of multiple candidate solutions. Wolff's SNPR is simply confronted with the language sample, deterministically compresses redundancies in a way that sometimes resembles human grammar (and sometimes does not), subsequently gets stuck in a local optimum, and there is no way to get out of it. Another famous model of GI is that of (Elman 1993). Contrary to Wolff's algorithm, which is principally « symbolic », Elman's model is a « connectionist » one. More concretely, Elman succeeded in training a simple recurrent neural network which was « trained to take one word at a time and predict what the next word would be. Because the predictions depend on the grammatical structure (which may involve multiple embeddings), the prediction task forces the network to develop internal representations which encode the relevant grammatical information. » (Elman 1993). The most important finding of Elman's study seems to be the evidence for the so-called « less is more » hypothesis (Newport 1990), which Elman himself labels with the terms « the importance of starting small » : « Put simply, the network was unable to learn the complex grammar when trained from the outset with the full “adult” language. However, when the training data were selected such that simple sentences were presented first, the network succeeded not only in mastering these, but then going on to master the complex sentences as well. » (Elman 1993). Something similar occurred also when he tuned the capacity of the « internal memory » of his networks rather than the corpus itself. Elman observed : « If the learning mechanism itself was allowed to undergo “maturational changes” (in this case, increasing its memory capacity) during learning, then outcome was just as good as if the environment itself had been gradually complicated. » Thus, not only do the results of Elman's computational model point in the same direction as many developmental and psycholinguistic studies of « motherese » (c.f. citations from Harris in part 2 of this work) ; they also show the importance of gradual physiological changes for the ultimate mastering of the maternal language. He goes even so far as to state that the prolonged infancy of human children could go hand in hand with the fact that only humans develop language to the extent we do : « In isolation, we see that both learning and prolonged development have characteristics which appear to be undesirable. Working together, they result in a combination which is highly adaptive » (Elman 1993).
Notwithstanding these interesting results, which are not to be underestimated, we see two disadvantages of Elman's approach. Primo, as is often the case for connectionist neural networks, his resulting model is somewhat difficult to interpret : given the training constraints mentioned above, the network seems to predict quite well the next word in the phrase, but it is not evident why it does what it does. Elman himself dedicates a major part of his article to descriptions of his tentatives to understand how his « blackbox » functions. Secundo, Elman confronted his model only with artificial corpora, i.e. corpora generated from manually created grammars. Thus, his model accounts only for a limited subset of the properties of one language (English) and as such is still quite far from a full-fledged solution to the problem of natural-language GI. The last model we present in this brief overview, called « Automatic Distillation of Structure » (ADIOS), seems to be touched by this second disadvantage to a lesser extent since, as its authors state : « In grammar induction from large-scale raw corpora, our method achieves precision and recall performance unrivaled by any other unsupervised algorithm. It exhibits good performance in grammaticality judgment tests (including standard tests routinely taken by students of English as a second language) and replicates the behavior of human subjects in certain psycholinguistic tests of artificial language acquisition. Finally, the very same algorithmic approach also is proving effective in other settings where knowledge discovery from sequential data is called for, such as bioinformatics. » (Solan et al. 2005). ADIOS is a graph-based model. It considers each sentence to be a path in a directed pseudograph (i.e. loops and multiple edges are allowed), each sentence being delimited by special « begin » and « end » vertices. Every lexical entry (i.e. a word type) is also a vertex of the graph ; thus, if two or more sentences share the same word X, they cross each other in the vertex VX ; if they contain the same subsequence XY, their paths share the common subpath (edge) VXVY, etc. The authors of ADIOS describe their algorithm as follows : « The algorithm generates candidate patterns by traversing in each iteration a different search path (initially coinciding with one of the original corpus sentences), seeking subpaths that are shared by a significant number of partially aligned paths. The significant patterns (P) are selected according to a context-sensitive probabilistic criterion defined in terms of local flow quantities in the graph... Generalizing the search path, the algorithm looks for an optional equivalence class (E) of units that are interchangeable in the given context [i.e., are in complementary distribution]. At the end of each iteration, the most significant pattern is added to the lexicon as a new unit, the subpaths it subsumes are merged into a new vertex, and the graph is rewired accordingly... The search for patterns and equivalence classes and their incorporation into the graph are repeated until no new significant patterns are found. » (Solan et al. 2005). In other terms, ADIOS starts with a so-called Motif Extraction (MEX) procedure which looks for bundles of the graph's subpaths which obey certain conditions. Once such « patterns » are found, they are subsequently « substituted » by non-terminal symbols, and the graph is « rewired » to incorporate the newly constructed non-terminals.
Such a « pattern distillation » procedure of generalization bootstraps itself until no further rewiring is possible. The output of the whole process is a rule grammar combining patterns (P) and their equivalence classes (E) into rules, able to generate even phrases which weren't present in the initial corpus. An example of how ADIOS progressively discovers more and more abstract combinatorial patterns is presented in Figure 8.
Figure 8: Equivalence classes and production rules induced from English language samples by the ADIOS algorithm. Figure reproduced from (Solan et al. 2005).
ADIOS is undoubtedly one of the most performant GI systems currently in existence. It combines statistical, probabilistic and graph-theoretical notions with the notion of a rule-based grammar, and as such it is also of great theoretical interest. On the other hand, ADIOS does not involve any source of stochasticity ; it seems to be purely deterministic and as such incapable of dealing with the highly probable convergence towards locally optimal grammars. In confrontation with some partial corpora this may possibly not cause any problems but, we predict, without any stochastic variation whatsoever, ADIOS could not account for more than a few « advanced » real-life properties of natural languages, and as such it shall possibly share the destiny of the SNPR model.
3.3.2. Evolutionary models of grammar induction
Multiple authors have proposed to solve the GI problem with different variants of evolutionary computing – in the following paragraphs we shall describe five different approaches :
1) Tomita's (1982) hill-climbing induction of finite state automata
2) Dupont's (1994) GIG method for the inference of regular languages
3) the evolution of stochastic context-free grammars as presented by Keller & Lutz (Keller and Lutz 1997)
4) the evolutionary method of (Aycinena et al. 2003), inducing grammars from POS tags of nine different English language corpora
5) the genetic algorithm of Smith & Witten (Smith and Witten 1995) for inducing a LISP s-expression grammar from a simple corpus of English sentences
Tomita's 1982 paper can be considered to be one of the first empiric studies of grammatical inference. The study focused on the inference of grammars of 14 different regular languages – often called « Tomita languages » in the subsequent literature – by means of deterministic finite state automata. Tomita first encoded any possible finite state machine with n states in the following manner :

$$((A_1, B_1, F_1)\,(A_2, B_2, F_2)\,\ldots\,(A_n, B_n, F_n))$$

whereby every block $(A_i, B_i, F_i)$ « corresponds to the state i, and $A_i$ and $B_i$ indicate the destination states of the 0-arrow and the 1-arrow from the state i, respectively. If A or B is zero, then there is no 0-arrow or 1-arrow from the state i, respectively. $F_i$ indicates whether state i is one of the final states or not. If $F_i$ is equal to 1, the state i is one of the final states. The initial state is always state 1 » (Tomita, 1982). Thus, for example, the string ((1 2 1) (3 1 1) (4 0 0) (3 4 1)) encodes the finite state automaton illustrated in Figure 9.
Figure 9: Finite state automaton matching all strings over (1 + 0)* without an odd number of consecutive 0's after an odd number of consecutive 1's. Figure reproduced from (Tomita 1982).
Such an encoding allowed Tomita to subsequently apply his hill-climbing approach.
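The encoding is directly executable. The Perl sketch below – our own illustration – decodes the triplet string of Figure 9 into a working automaton and evaluates it against toy right/wrong lists, in the spirit of the evaluation function E = r − w introduced in the next paragraph :

#!/usr/bin/perl
# Decoding Tomita's triplet encoding and computing his evaluation
# function E = r - w. The automaton is the one of Figure 9; the
# right/wrong lists are toy samples of our own.
use strict;
use warnings;

# ((1 2 1)(3 1 1)(4 0 0)(3 4 1)); block i = [A_i, B_i, F_i]
my @machine = ( [1,2,1], [3,1,1], [4,0,0], [3,4,1] );

sub accepts {
    my ($m, $string) = @_;
    my $state = 1;                            # the initial state is always 1
    for my $bit (split //, $string) {
        $state = $m->[$state - 1][$bit];      # follow the 0-arrow or 1-arrow
        return 0 if $state == 0;              # no such arrow: reject
    }
    return $m->[$state - 1][2];               # accept iff final state
}

my @right = qw(0 1 00 11 01 110);             # strings the machine should accept
my @wrong = qw(10 1000 1011);                 # strings it should reject

my $r = grep { accepts(\@machine, $_) } @right;
my $w = grep { accepts(\@machine, $_) } @wrong;
print "E = r - w = ", $r - $w, "\n";          # prints E = 6 - 0 = 6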
Hill-climbing can be considered to be a precursor of the more extended genetic programming, since it employs both random mutations to explore the surrounding search-space and a sort of selection algorithm which always prefers to use, in the following iteration of the algorithm, such individual solutions for which the value of the evaluation function E increases. Tomita's definition of E is very simple :

$$E = r - w$$

« where r is the number of strings in the right-list accepted by the machine, and w is the number of strings in the wrong-list accepted by the machine ». The right-list is a positive sample corpus, while the wrong-list is the negative sample. Thus, if a random mutation transforms an individual $X_n$ into an individual $X_{n+1}$ so that $E(X_{n+1}) > E(X_n)$, i.e. if an automaton is discovered which matches more positive sequences, or fewer negative sequences, or both – it will be $X_{n+1}$ which will be mutated in the next cycle of the algorithm. Tomita's approach cannot be considered to be fully evolutionary, because he used neither populations nor any kind of cross-over operator. For this reason, Tomita's regular grammar-inferring algorithm did sometimes get stuck in local maxima from which there was no way out. Notwithstanding this small imperfection – of which Tomita himself was well aware – his work served, and still serves, the role of an important landmark on the path to full-fledged GI. Dupont (1994), for example, also focused his study on the induction of 15 different regular Tomita languages. In his formally very sound work, he defines the problem of the inference of regular languages as the problem of finding an optimal partition of the state space of a finite « maximal canonical automaton » (MCA) able to accept the sentences from the positive sample. The fitness function takes into account also the system's tendency to reject the sentences contained in the negative sample. By using a so-called « left-to-right canonical group encoding », Dupont succeeds in representing diverse individual automata in a very concise way, which allows him to subsequently evolve them by means of structural mutation (« the structural mutation consists of a random selection of a state in some block of a given partition followed by the random assignment of this state to a block », e.g. MUTATE({{1,3,5},{2},{4}}) → {{1,5},{2,3},{4}}) and structural crossover (« the structural crossover consists of the union in both parent partitions of a randomly selected block », for example CROSS({{1,4},{2,3,5}}, {{1,3},{2},{4},{5}}) → {{1,3,4},{2,5}} and {{1,3,4},{2},{5}}). Because « the search space size dramatically increases with the size of the positive sample, making the correct identification more difficult when we have a larger positive information on the language », Dupont also proposed an incremental procedure allowing the search process to start from a smaller yet pertinent region of the search space. The procedure goes as follows : « first sort the positive sample I+ in lexicographical order. Consequently, the shortest strings are first taken into account. Starting with the first sentence of I+, we construct the associated MCA(I+) and we search for the optimal partition of its state set under the control of the whole negative sample I−. Let A1 denote the derived automaton with respect to this optimal partition. Let snext denote the next string in I+. If snext is already accepted by A1, we skip it. » (Dupont 1994). Otherwise, the automaton A1 is extended so that it can cover snext as well.
The search under the control of the whole negative sample is then restarted, and the whole process is repeated until all sentences from the positive sample have been considered. With a population size of 100 individuals, a maximum number of 2000 evaluations, a crossover rate of 0.2, a mutation rate per bit of 0.01 and the semi-incremental procedure implemented, Dupont's approach attained, on average, a classification rate of 94.4%. For five among the fifteen Tomita languages, grammars were constructed which attained 100% accuracy (i.e. accepted all sentences from the positive sample and rejected all strings from the negative sample). The results also indicated that whenever the semi-incremental procedure is applied, the sample size has a positive influence upon the accuracy of the inferred grammars – a bigger sample yields more accurate grammars. While Tomita's results indicate, and Dupont's results further confirm, the belief that induction of grammars by means of evolutionary computing is a plausible thing to do, they do so only in regards to the simplest type of grammars – the regular ones. Grammars of natural languages, however, are definitely not regular, and models of GI for the more expressive « context-free » (CFG) or « context-sensitive » grammars are needed. Keller and Lutz employed a genetic algorithm to evolve the parameters of stochastic context-free grammars (SCFG) of 6 different languages. SCFGs are similar to traditional CFGs (« In formal language theory, a context-free grammar (CFG) is a grammar in which every production rule is of the form V → w, where V is a single non-terminal symbol, and w is a string of terminals and/or non-terminals. The term « context-free » expresses the fact that non-terminals can be rewritten without regard to the context in which they occur » (Choubey and Kharat 2009)), but extended with a probability distribution, so that there is a probability value in the range [0,1] associated with every production rule of the grammar. These values are called the SCFG's parameters, and these are the values which the algorithm of Keller & Lutz aims to optimize by means of GAs. Their approach involves the following steps : « 1. Construct a covering grammar that generates the corpus as a (proper) subset. 2. Set up a population of individuals encoding parameter settings for the rules of the covering grammar. 3. Repeatedly apply genetic operations (cross-over, mutation) to selected individuals in the population until an optimal set of parameters is found. » (Keller and Lutz 1997). Their fitness function F(G) is based on the idea of Minimum Description Length (MDL). More formally, Keller & Lutz aimed to maximize

$$F(G) = \frac{K_C}{L(C \mid G) + L(G)}$$

by minimizing the denominator, which is defined as the number of bits needed to encode the grammar G (that is, L(G)) plus the number of bits needed to encode the corpus C, given the grammar G (that is, L(C|G)). The numerator $K_C$ is just a corpus-dependent normalization factor assuring that the value of the fitness shall lie in the range [0,1]. Keller & Lutz confronted their algorithm with positive samples of circa 16000 strings (typically of length 6 or 8) drawn from 6 different context-free languages :
1. EQ : the language of all strings consisting of equal numbers of a's and b's
2. the language $a^n b^n$ (n ≥ 1)
3. BRA1 : the language of balanced brackets
4. BRA2 : balanced brackets with two sorts of bracketing symbols
5. PAL1 : palindromes over {a,b}
6. PAL2 : palindromes over {a,b,c}
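For four of the six languages just listed, membership can be tested in a few lines ; the following Perl sketch – our own illustration, with our own probe strings – could serve to prepare toy positive and negative samples :

#!/usr/bin/perl
# Membership tests for four of the six test languages of Keller & Lutz.
# Probe strings are toy examples of our own.
use strict;
use warnings;

my %lang = (
    EQ   => sub { my $s = shift;       # equal numbers of a's and b's
                  $s !~ /[^ab]/ and ($s =~ tr/a//) == ($s =~ tr/b//) },
    ANBN => sub { $_[0] =~ /^(a+)(b+)$/ and length($1) == length($2) },
    BRA1 => sub {                      # balanced brackets
        my $depth = 0;
        for my $c (split //, $_[0]) {
            return 0 unless $c eq '(' or $c eq ')';
            $depth += $c eq '(' ? 1 : -1;
            return 0 if $depth < 0;
        }
        return $depth == 0;
    },
    PAL1 => sub { my $s = shift;       # palindromes over {a,b}
                  $s !~ /[^ab]/ and $s eq reverse $s },
);

for my $probe ('aabb', 'abab', 'abba', '(())', '(()') {
    my @accepting = grep { $lang{$_}->($probe) } sort keys %lang;
    printf "%-4s accepted by: %s\n", $probe,
           @accepting ? "@accepting" : '(none)';
}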
In the majority of cases, their algorithm converged to such combinations of the parameters of their SCFGs as allowed them to accept more than 95% of the strings present in the positive sample. Such results indicate that genetic algorithms can be used as a means for the unsupervised inference of the parameters of stochastic context-free grammars. Note that Keller & Lutz confronted their algorithm, during both testing and training, only with a positive sample. While doing so for training is justifiable – since the objective of their study was to see whether grammars can be inferred solely from positive evidence – not doing so during the testing phase makes uncertain the extent to which their inferred grammars overgeneralize. Another huge disadvantage with regard to the aims of our Thesis is the simple fact that their approach also seems to be very costly (« the number of parses that must be considered increases exponentially with the number of non-terminals »). And since they confronted their algorithms only with corpora composed of sentences of artificial, and not natural, languages, we shall not try to imitate their approach of « tuning SCFG parameters » in our Thesis. By being context-free and not simply regular, the grammars studied by Keller & Lutz or by (Choubey and Kharat 2009) could be considered more similar to the grammars of natural languages. Nonetheless, languages composed of palindromes and sequences of balanced brackets are still a long way off from natural languages, and the question « to what extent are results concerning GI of artificial languages applicable to GI of natural languages ? » is far from being answered. Rather than trying to answer it, we proceed now to a discussion of two approaches where evolutionary GI has been applied to natural language sentences. The first method, proposed in (Aycinena et al. 2003), focused on the induction of CFG grammars from nine different part-of-speech tagged natural language corpora. The sentences contained in these corpora, composed thus of sequences of part-of-speech tags (c.f. Section 3.2), were used as positive examples, while randomly generated sequences of POS-tags yielded the negative examples. The initial population was composed of linear encodings of randomly generated context-free grammars ; for example, the string SABABCBCDCAE would represent the following CFG (a decoding sketch is given below, after the breeding strategy) :

S → AB
A → BC
B → CD
C → AE

During the evaluation of an individual grammar G, one would first try to parse both the positive and the negative corpora with the grammar G and subsequently calculate the final fitness by applying the following formula :

$$F(\alpha) = \gamma^{\max(0,\,|\alpha| - |P|)} \, C(\alpha) - \delta \, I(\alpha)$$

« where P is the set of preterminals, C(α) is the number of parsed sentences from the corpus, I(α) is the number of sentences parsed from the randomly generated corpus, δ is the penalty associated with parsing each sentence in the randomly generated corpus, and γ is the discount factor used for discouraging long grammars » (Aycinena et al. 2003). In their study, Aycinena et al. placed a randomly generated population of 100 individual grammars on a two-dimensional 10 x 10 torus grid. Subsequently, they applied the following select-breed-replace strategy : « 1. Select an individual randomly from the grid. 2. Breed that individual with its most fit neighbor to produce two children. 3. Replace the weakest parent by the fittest child. » (Aycinena et al. 2003)
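To make the linear genome encoding concrete, the Perl sketch below unpacks the example string quoted above, reading it as consecutive (left-hand side, right-hand symbol, right-hand symbol) triples ; this reading reproduces the four example rules, though the exact packing used by Aycinena et al. may differ in its details.

#!/usr/bin/perl
# Unpacking the linear genome SABABCBCDCAE as consecutive
# (LHS, RHS, RHS) triples. This reading reproduces the four example
# rules above; the authors' exact packing may differ in details.
use strict;
use warnings;

my $genome = 'SABABCBCDCAE';
while ($genome =~ /\G(.)(.)(.)/g) {
    print "$1 -> $2$3\n";
}
# prints: S -> AB, A -> BC, B -> CD, C -> AE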
In their framework, « cross-over is accomplished by selecting a random production in each parent. Then a random point in these productions is selected and cross-over is performed, swapping the remainder of the strings after the cross-over points ». Every symbol of a resulting string can subsequently be mutated (mutation rate = 0.01) : « A mutation is simply the swapping of a non-terminal or pre-terminal with another non-terminal or pre-terminal » (Aycinena et al. 2003). Figure 10 shows the number of generations each run was able to complete, the grammar G that last evolved, the percentage of positive examples parsed by G, the percentage of negative examples parsed by G, and G's fitness.
Figure 10: Grammars evolved from nine different POS-tagged corpora. Figure reproduced from (Aycinena et al., 2003).
While the results displayed there may seem encouraging, the authors noticed that in the majority of cases their approach « gives a grammar that is very capable of detecting whether a sentence is valid in English, but it has not learned much English structure ». In other terms, Aycinena et al. succeeded in breeding grammars which have a certain discriminatory power but are practically useless as models of the English language. They go even so far as to state, in the ultimate paragraph of their work, that « It is still possible that English grammar is too complex to be learned from a corpus of words » and that other external clues are necessary for a successful GI of English. A big disadvantage of the above-mentioned algorithm was also the fact that its input were sequences of already attributed POS-tags and not sequences of the words themselves. Thus, even if the approach were to discover some interesting grammars, the reproach could be made, and justified, that in fact it only re-discovered the rules of the tagging system which was used in the first place. From the perspective of our Thesis, another disadvantage of Aycinena et al.'s approach is related to the fact that it is anything but a model of grammar development in a human child. For it is evident (c.f. Section 2) that children learn the grammar of their language in an incremental fashion – they are not confronted with the whole corpus from the very beginning. Nor does the corpus stay identical after each iteration of the learning process. On the contrary : as the child grows, its linguistic environment – the corpus – also grows, both in length and in complexity. An interesting evolutionary approach to GI which both tries to create its own non-terminal categories and also takes such « incrementality » into account is presented in the work of (Smith and Witten 1995). In their scenario, candidate grammars are evolved after the presentation of every new sentence. Grammars have the form of LISP s-expressions, whereby AND represents a concatenation of two symbols (i.e. a syntagmatic node) and OR represents a disjunction (i.e. a paradigmatic node). The whole process is started as follows : « The GA proceeds from the creation of a random population of diverse grammars based on the first sample string. The vocabulary of the expression is added to an initially empty lexicon of terminal symbols, and these are combined with randomly chosen operators in a construction of a candidate grammar... If the candidate grammar can parse the first string, it is passed into the initial population ». Figure 11 displays two sample grammars for the sentence « the dog saw a cat ».
Figure 11: Two simple grammars covering the sentence « the dog saw a cat ». Figure reproduced from (Smith & Witten, 1995).
S-expression sequences representing individual grammars are subsequently mutated. A couple of parent grammars can also switch their nodes – the probability of being chosen for such a cross-over is inversely proportional to the grammar's size : shorter grammars are preferred. The cross-over is non-destructive ; the parents thus also persist. The events of reproduction are grouped in cycles ; at the end of each cycle, the population of candidate grammars is confronted with a new sentence from the sample of positive evidence. In their article, (Smith and Witten 1995) show how, after the presentation of the sentences « the dog saw a cat », « a dog saw a cat », « the dog bit a cat », « the cat saw a cat », « the dog saw a mouse » and « a cat chased the mouse », their system naturally converged to a grammar which had quite correctly subsumed determiners like « a », « the » under one group of OR nodes, verbs like « chased », « saw », « bit » under another, and nouns like « dog », « cat », « mouse » under yet another. The grammar which they finally obtain is not ideal but, as they argue, it could get better if confronted with new sentences : « It is an adaptive process whereby the model is gradually conditioned by the training set. Recurring patterns help to reinforce partial inferences, but intermediate states of the model may include incorrect generalizations that can only be eradicated by continued evolution. This is not unlike the developing grammar of a child which includes mistakes and overgeneralisations that are slowly eliminated as their weaknesses are made apparent by increasing positive evidence. » (Smith and Witten 1995). While strongly agreeing with the above citation, we nonetheless cannot ignore certain drawbacks of Smith & Witten's approach. Most importantly, by using LISP s-expressions as the way of representing their grammars, they ultimately have to end up with highly bifurcated binary trees (since the arity of the AND|OR operators is 2). Thus, one can easily subordinate two terminals to one non-terminal (e.g. OR(cat,dog)), but in the case of three subordinated terminals, one is obliged to use a complex expression involving three non-terminal nodes (e.g. OR(OR(cat,dog),OR(mouse,NULL))). Therefore, in such an s-expression based representation, any class having more than two members is necessarily represented by a longer sequence → is more prone to mutation → is highly « handicapped » in regards to the much shorter expressions subordinating just two nodes. Another drawback of Smith & Witten's work which cannot be ignored is related to the fact that while they used English language sentences to train their system, the sentences were very simple, and the relevance of their findings for the GI of « natural » English is more than disputable. In fact, they seem to achieve, with quite complex evolutionary machinery, even less than Wolff's deterministic SNPR model had achieved almost a decade before. Notwithstanding these two drawbacks, we nonetheless consider as particularly inspiring their approach aiming to solve the problem of GI of natural languages by uniting, in one framework, the notions of adaptability, evolvability and statistical sensitivity to recurring patterns. We summarize : all five above-mentioned approaches indicate that evolutionary computing can potentially yield useful solutions to the problem of Grammar Induction of both artificial (regular, context-free) and natural language grammars.
The length of the candidate grammar is frequently used as an input argument of the fitness function. Note also that the solutions of both Dupont and Smith & Witten use a sort of « incremental » procedure whereby individual solutions gradually adapt to every new sentence. Especially Dupont's findings are reminiscent of what has already been said about the « importance of starting small » when discussing the works of Elman & Harris. On the other hand, none of the above-mentioned models was confronted with a corpus of child-directed (i.e. « motherese ») or child-originated utterances. The objective of our Thesis shall be to fill this gap.
3.4. Evolutionary Language Game
The Evolutionary Language Game (ELG), first proposed in (Nowak et al. 1999), is a stunningly simple yet mathematically feasible stochastic model addressing the question : « How could a coordinated system of meanings & sounds evolve in a group of mutually interacting agents ? ». In the most simple terms, the model can be described as follows. Let's have a population of N agents. Each agent is described by an n x m associative matrix A. A's entry $a_{ij}$ specifies how often an individual, in the role of a student, observed one or more other individuals (teachers) referring to object i by producing signal j. From this matrix A, one can derive the active « speaker » matrix P by normalizing A's rows,

$$p_{ij} = \frac{a_{ij}}{\sum_{k=1}^{m} a_{ik}}$$

and the passive « hearer » matrix Q by normalizing A's columns :

$$q_{ji} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{kj}}$$

The entries $p_{ij}$ of the matrix P denote the probability that, for an agent-speaker, object i is associated with sound j. The entries $q_{ji}$ of the matrix Q denote the probability that, for an agent-hearer, sound j is associated with object i. Subsequently, we can imagine two individuals A and A′, the first one having the language L(P, Q), the other having the language L′(P′, Q′). The payoff related to the communication of two such individuals is, within Nowak's model, calculated as follows :

$$F(A, A') = \sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij} q'_{ji} = \mathrm{Tr}(PQ')$$

And the fitness of the individual A in regards to all other members of the population P can be obtained as follows :

$$f(A) = \frac{1}{|P| - 1} \sum_{A' \in P,\, A' \neq A} F(A, A')$$

After the fitness values are obtained for all population members, one can easily apply traditional evolutionary computing methods in order to direct the population toward more optimal states, i.e. states where the individual matrices are mutually « aligned ». In Nowak's framework, this alignment represents the situation when hearer and speaker mutually understand each other, i.e. the speaker has encoded meaning M by sound S and the hearer has subsequently decoded sound S as meaning M. ELG beautifully illustrates how such an alignment of sound-meaning matrices – a mutually shared communication protocol – can emerge practically ex nihilo, given that there is some « mutual learning » mechanism involved which allows information to be transferred from one individual to another. This is attained by creating a blank « student » matrix and then filling its elements, by means of a stochastic « matrix sampling » procedure, in such a way that the resulting student matrix will partially correspond to | be aligned with the matrices of a pre-existing « teacher » (or teachers).
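The payoff formula is easily made concrete. The following Perl sketch – our own toy illustration – derives P and Q from two small association matrices and computes F(A, A′) = Tr(PQ′) :

#!/usr/bin/perl
# ELG payoff F(A, A') = Tr(PQ') for two toy agents. P is the
# row-normalized matrix of the speaker, Q' the column-normalized
# matrix of the hearer. The matrices are toy examples of our own.
use strict;
use warnings;
use List::Util qw(sum);

sub speaker_P {                 # normalize the rows of A
    my $A = shift;
    return [ map { my $s = sum(@$_); [ map { $_ / $s } @$_ ] } @$A ];
}
sub hearer_Q {                  # normalize the columns of A; Q is m x n
    my $A = shift;
    my ($n, $m) = (scalar @$A, scalar @{ $A->[0] });
    my @Q;
    for my $j (0 .. $m - 1) {
        my $s = sum( map { $A->[$_][$j] } 0 .. $n - 1 );
        $Q[$j] = [ map { $A->[$_][$j] / $s } 0 .. $n - 1 ];
    }
    return \@Q;
}
sub payoff {                    # F(A, A') = sum_ij p_ij * q'_ji
    my ($A, $A2) = @_;
    my ($P, $Q2) = (speaker_P($A), hearer_Q($A2));
    my $f = 0;
    for my $i (0 .. $#$P) {
        $f += $P->[$i][$_] * $Q2->[$_][$i] for 0 .. $#{ $P->[$i] };
    }
    return $f;
}

my $A  = [ [9, 1], [1, 9] ];    # agent mostly mapping object i to sound i
my $A2 = [ [8, 2], [2, 8] ];    # a mostly aligned partner
printf "F(A, A') = %.3f\n", payoff($A, $A2);   # 1.480 of a maximum of 2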
Further discussion of, and experiments with, ELG are described in (Kvasnička and Pospíchal) and (Hromada 2012). All these studies point in the same direction and suggest not only that the emergence of a mutually shared communication protocol practically ex nihilo is possible whenever there exists a means of transfer of information among individuals, but also that the presence of a certain low amount of noise during the learning process is the only way to make certain that the system will converge to a « communicatively optimal » state. The role of the ELG model within the context of our Thesis is quite open. For while it is the case that ELG sheds some light upon the question of the emergence of language within a community of symbolically interacting agents, it does not principally address the problem of language learning by a concrete individual. Thus, ELG is rather a model of macroscopic phylogeny than of microscopic ontogeny – it addresses the problem of how small communities of homo habilis could, in a distant past, gradually converge to a system of signs within which, for example, « baubau » could mean a banana and « wauwau » a lion. But it does not address the problem of how today's human baby learns the complex language of her mother. On the other hand, it is not completely hors propos to imagine a slight variation of Nowak's model wherein one population of matrices would be fixed (representing the linguistic competence of a teacher or mother organism) while the second population of matrices would represent the linguistic competence of a « child ». Given that the fitness function would somehow succeed in representing the degree of alignment between such a « mother » and « child », we postulate that something like the child's language competence could spontaneously emerge.
4. Remark concerning the Theory of Grammar Systems
A branch of Formal Language Theory which could be of particular use for the purposes of our Thesis is devoted to the study of Grammar Systems (GS). A GS is a « set of grammars working together, according to a specified protocol, to generate a language » (Jiménez-López 2000). Thus, contrary to classical Formal Language Theory, within which one grammar generates one language, in GS several grammars work together in order to generate one language. Grammar Systems can therefore be considered as a sort of multi-agent variant of the traditional « monolithic » formal grammar theory. The very nature of multi-agent systems often implies cooperation, communication, distribution, modularity, parallelism, or even the emergence of complexity. For example, Figure 12 illustrates a very simple bimodular « language colony » variant of a GS.
Figure 12: Language colony of two finite grammars cooperating to generate an infinite language. Figure reproduced from (Kelemen 2004).
By allowing the finite grammar components to communicate through a common symbolic environment – such an environment, shared by the different modules, plays the central role in practically all variants of Grammar Systems and is reminiscent of the role which « short-term memory » or « working memory » plays in cognitive psychology – one ultimately generates a language which is infinite ! (Kelemen 2004) applies the term « miracle » to such behaviour, which is very common in the world of GS.
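The « miracle » can be simulated in a few lines. In the following Perl sketch – a toy illustration of our own, not the colony of Kelemen's figure – each of the two components generates, on its own, only a finite language, yet by rewriting a shared sentential form in turns they generate the infinite language a(ba)* :

#!/usr/bin/perl
# A two-component « language colony »: each component alone generates
# only a finite language, yet by rewriting a shared sentential form in
# turns the colony generates the infinite language a(ba)*. The rules
# are a toy illustration of our own.
use strict;
use warnings;

my %component = (
    G1 => { S => ['aT', 'a'] },   # alone, G1 generates only {a}
    G2 => { T => ['bS', 'b'] },   # alone, G2 generates only {b}
);

sub derive {
    my $rounds = shift;
    my $form   = 'S';                    # the shared symbolic environment
    for my $k (1 .. $rounds) {
        for my $g (sort keys %component) {
            for my $lhs (keys %{ $component{$g} }) {
                # expand while rounds remain, terminate on the last one
                my $rhs = $component{$g}{$lhs}[ $k < $rounds ? 0 : 1 ];
                $form =~ s/$lhs/$rhs/;
            }
        }
    }
    return $form;
}

print derive($_), "\n" for 1 .. 4;       # a, aba, ababa, abababa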
Since the Theory of Grammar Systems is formally very well developed – most notably thanks to the life-long work of Erzsébet Csuhaj-Varjú and substantial contributions by George Paun and Jozef Kelemen – it is impossible for us to introduce, within the limited scope of this text, the formalism of GS Theory in closer detail. This will be done in the final version of our Thesis, if ever we decide to pursue our research in this direction. If that turns out to be the case, we will often refer to the doctoral Thesis of (Jiménez-López 2000), which contains many persuasive arguments for the application of GS to the study of natural human languages. On the other hand, the Thesis of Jiménez-López is limited by the fact that it mostly proposes to use Grammar System Theory as a framework explaining the final, i.e. « adult », linguistic component, and not as a framework which could elucidate the very process of language development and language acquisition (in terms of Grammar System Theory, it seems more appropriate to speak about « language emergence »). The only tentative to use the Grammar System apparatus for grammatical inference is that of (Sosík and Štỳbnar 1997). Contrary to other authors of GS, who focus principally on the productive (i.e. generative) aspects of GS, Sosík & Štýbnar have focused on GS's language-accepting properties. In a hybrid connectionist-symbolic architecture, they have used a « neural pushdown automaton » to infer a language colony able to cover some simple artificial context-free grammars generating balanced parenthesis or palindrome languages. As far as we know, no tentative is reported in the literature to solve the problem of grammar induction of natural languages by means of evolutionary optimization of Grammar Systems.
5. Thesis
The Thesis hereby introduced is realized under the double supervision of the dpt. of Cybernetics at the Slovak University of Technology (STU) and the « cognitive psychology » laboratory affiliated to University Paris 8 (P8). Ideally, both the « engineering » approach – common at STU – as well as the more cognition-oriented « experimental » approach of P8 should be equally reflected in the final Thesis. In order to do so, the Thesis shall, in fact, introduce multiple « theses », among which some shall address more « theoretical », psychology- and linguistics-related phenomena and problems. But due to its affiliation to STU, the text shall also introduce more concrete, pragmatic and operational theses aiming to offer a computationally and formally sound affirmative answer to the question : « Can language development be modelled as an evolutionary process ? »
5.1. Theoretical Thesis
At first, a child has to learn
• how to segment the world into groups of discrete objects and processes
• how to segment the phonetic flux into sequences of discrete linguistic tokens
The subsequent problem of language development can be analyzed as a trinity of sub-problems :
1) vocabulary development (learning of mappings between objects and tokens)
2) induction of grammatical categories
3) induction of grammatical rules
These tasks are deeply and strongly intertwined. Without the ability to segment the world into objects, there are no stable referents to which linguistic tokens could refer. Without the ability to perceive recurrent tokens, there are no conventional symbols with which a child could denote specific objects. Without vocabulary development (which relates to the induction of semantic classes, which we have called « concept construction » in the text above), there is no need for grammatical rules nor categories. Without grammatical categories, grammatical rules are just a senseless tautological formal game and there is no way to distinguish useful grammars from useless ones.
Without useful grammars, vocabulary development shall halt at some locally optimal level of a « pidgin » language. Left on their own, these problems place before us a variant of a chicken & egg problem which seems almost impossible to tackle. A baby's brain, however, resolves these problems with such elegance that one is tempted to say that they do not even exist. The aim of the Thesis which shall follow is to demonstrate that if one interprets the above-mentioned set of problems in terms of
• parent-child communication (imitation)
• partitioning of vector spaces (categorization)
• gradual accommodation and assimilation of knowledge (generalization)
one could subsequently state that the key theoretical Thesis we aim to defend is :
Tt : the process of language development is an auto-organizing and potentially evolutionary process
Note the word « potentially », because in order to be labeled as « evolutionary », the following conjectures have to be validated :
C1) Not only imitation but also repetition is a form of replication : information replicates not only between brains but also within the brain.
C2) The fitness of a linguistic structure is related to its ability to represent a certain recurrent aspect of the agent's environment : if a cognitive structure matches some aspect of the environment, it gets activated. By being activated, it augments its probability of being (at least partially) replicated.
C3) The problems of both generalization and overgeneralization are to be solved by variation|decay operators endogenously transforming the information represented in the memory of a language-inducing system.
Acceptance of the above-mentioned conjectures leads us to a model of language development based not on the tuning of parameters of a single monolithic grammar, but rather on a population of « microgrammars », a « language colony » (Kelemen and Kelemenová 1992) of mutually communicating, co-operating, decaying and replicating sequences of production rules unceasingly trying to match the language of the linguistic environment. We postulate that if such an environment has certain properties of « motherese », a linguistic competence – an ability to generate utterances in a still more & more complex « toddlerese » – shall spontaneously emerge. Thus, three notions will be of utmost importance in the Thesis which we hereby introduce : « motherese », « microgrammar » and « matching ». The corpus of « motherese », more concretely the CHILDes corpus (MacWhinney 2000), will be considered to be a sufficiently adequate image of the initial stages of the child's linguistic environment. The development of the child's linguistic competence will be explained in terms of the gradual evolution of individual « microgrammars », i.e. chromosomes whose genes can be understood as individual production rules. Last but not least, the notion of « matching » shall furnish us with the first principle which could potentially allow us to explain the mystery of language acquisition as an evolutionary process :
P1 : « If an (internal) rule R or substitutional schema S succeeds in matching some aspect of the (external) environment, then it shall be replicated into another microgrammar. »
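The principle P1 can be prototyped in a few lines. In the following Perl sketch – a toy illustration of our own, with invented rules and utterances – a « microgrammar » is an ordered list of substitution rules, and every rule which matches an utterance of the environment is replicated into another randomly chosen microgrammar :

#!/usr/bin/perl
# Toy prototype of principle P1: a microgrammar is an ordered list of
# substitution rules (genes); every rule which matches an utterance of
# the environment is replicated into another randomly chosen
# microgrammar. Rules and utterances are invented toy examples.
use strict;
use warnings;

my @environment = ('the dog', 'the cat', 'a cat');   # « motherese » tokens
my @population  = (
    [ 'the \w+', 'a \w+' ],      # microgrammar 0
    [ '\w+ cat' ],               # microgrammar 1
    [ 'xyz' ],                   # microgrammar 2: a useless rule
);

for my $utterance (@environment) {
    for my $mg (0 .. $#population) {
        for my $rule (@{ $population[$mg] }) {
            next unless $utterance =~ /^$rule$/;
            # P1: the matching rule replicates into another genome
            my $target = $mg;
            $target = int rand @population while $target == $mg;
            push @{ $population[$target] }, $rule;
        }
    }
}

print "microgrammar $_ : ", join(' | ', @{ $population[$_] }), "\n"
    for 0 .. $#population;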
5.2. Operational Thesis
The operational Thesis (TO) is stated as follows :
TO : « There exists an evolutionary algorithm A which, when confronted with a corpus of motherese language (LM) as its input, can produce a toddlerese grammar (GT) able to generate the LM-resembling toddlerese language LT. »
The term « evolutionary » means that the algorithm A shall involve incremental replication, mutation and selection of information-representing structures. More concretely, these information-representing structures, i.e. genomes, shall be ordered sequences of genes, whereby each gene shall contain an individual substitution rule. Thus, every individual genome shall represent a « microgrammar » aiming to transform a linguistic token (i.e. a sequence of terminals) currently observable in the environment into a sequence of non-terminals. Whenever such a « successful parse » occurs, the principle P1 shall apply and the useful genes shall be reproduced into other individual microgrammars. This could potentially cause the microgrammars to gradually adapt their structures to those of the environment. On the other hand, in order to prevent excessive adaptation, a variation operator shall also be integrated into the algorithm A, aiming to vaguely model the well-known phenomenon of « forgetting ».
5.3. The organization of the Thesis
The Thesis shall be composed of five parts, each of which is composed of multiple major chapters. Every chapter consists of an introduction and a conclusion, respectively preceding and following more specific subchapters which can fractally branch into sub-chapters, sub-sub-chapters etc. All such parts, chapters, sub-chapters etc. can be considered to be « non-terminal » nodes of the structure presented by this text. The first part, labeled « Theses », is the stem of the whole text. It will introduce multiple theses at varying degrees of generality which shall all be – in one way or another – more directly addressed in subsequent sections. In order to weave the basic conceptual fabric, some definitions of terms like « evolution » and « language learning » shall also be offered along the path delimited in Section 1. All variants of the thesis shall be briefly related to other cognitive sciences. The second branch, labeled « Theoretical position », is composed of chapters dedicated to Universal Darwinism, Developmental Psycholinguistics and Natural Language Processing. In these chapters, the theses presented in the first chapter shall be more deeply interpreted and contextualized in terms of the respective disciplines. The third branch, labeled « Observations », will describe multiple longitudinal observations of one concrete human child. In certain cases, the generalizability of such individual observations shall be verified or falsified by means of text-mining the CHILDes corpora. Subsequent interpretations in terms of the evolutionary theoretical framework shall follow. The penultimate branch, called « Simulations », shall present multiple computational models addressing four problems related to the language acquisition process :
1) the problem of segmentation
2) the problem of induction of grammatical categories
3) the problem of induction of grammatical rules
4) the problem of concept induction
A specific chapter, in which existing solutions shall be described, will be dedicated to each problem. Special focus shall be put on evolutionary solutions, if they exist. To every one of the four above-mentioned problems we shall try to offer our own unique evolutionary solution, and subsequently we shall discuss its performance.
PERL source codes of the diverse versions of the algorithm A shall also be attached, in order to allow the reproducibility of our results by other scientists. The conclusive branch, labeled « Synthesis », shall primarily discuss the results obtained in the parts « Observations » and « Simulations ». If the results turn out to be consistent with the theory, the work shall end with a tentative to integrate the theses Tt and TO in one unified framework. If unsuccessful, the potential reasons of the failure shall be analysed.
6. Bibliography
Araujo, Lourdes. 2002. Part-of-speech tagging with evolutionary algorithms. In Computational Linguistics and Intelligent Text Processing, 230–239. Heidelberg, Germany: Springer.
Aycinena, Margaret, Mykel J. Kochenderfer, and David Carl Mulford. 2003. An evolutionary approach to natural language grammar induction. Final Paper, Stanford CS224N, June.
Barrett, Deirdre. 2007. Waistland: A (R)evolutionary View of Our Weight and Fitness Crisis. New York, NY: WW Norton & Company.
Bee, Helen L., and Denise Roberts Boyd. 2003. The developing child. Boston, MA: Allyn & Bacon.
Bentley, Peter. 1999. Evolutionary design by computers. San Francisco, CA: Morgan Kaufmann.
Berman, Ruth A. 1988. Word class distinctions in developing grammars. Categories and processes in language acquisition: 45–72.
Blackmore, Susan. 2000. The meme machine. Oxford, England: Oxford University Press.
Braine, Martin DS. 1971. On two types of models of the internalization of grammars. The ontogenesis of grammar: 153–186.
Brodsky, Peter, H. R. Waterfall, and Shimon Edelman. 2007. Characterizing motherese: On the computational structure of child-directed language. In Proceedings of the 29th Cognitive Science Society Conference, ed. DS McNamara & JG Trafton, 833–838.
Brown, Roger. 1973. A first language: The early stages. Cambridge, MA: Harvard University Press.
Campbell, Donald T. 1960. Blind variation and selective retentions in creative thought as in other knowledge processes. Psychological Review 67: 380.
Choubey, Nitin S., and Madan U. Kharat. 2009. Grammar Induction and Genetic Algorithms – An Overview. Pacific Journal of Science and Technology 10: 884–888.
Christodoulopoulos, Christos, Sharon Goldwater, and Mark Steedman. 2010. Two Decades of Unsupervised POS induction: How far have we come? In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 575–584.
Cohen, Trevor, Roger Schvaneveldt, and Dominic Widdows. 2010. Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics 43: 240–256.
Cosmides, Leda, and John Tooby. 1997. Evolutionary psychology: A primer. Retrieved from http://www.cep.ucsb.edu/primer.html.
Csuhaj-Varjú, Erzsébet. 1994. Grammar systems: a grammatical approach to distribution and cooperation. Yverdon, Switzerland: Gordon and Breach Science Publishers.
Darwin, Charles. 1859. On the Origin of Species. London, England: John Murray.
Darwin, Charles. 1906. The voyage of the Beagle. London, England: JM Dent & Sons.
Dawkins, Richard. 2006. The selfish gene. Oxford, England: Oxford University Press.
Dennett, Daniel C. 1996. Darwin's Dangerous Idea: Evolution and the Meanings of Life. New York, NY: Simon & Schuster.
Dupont, Pierre. 1994. Regular grammatical inference from positive and negative samples by genetic search: the GIG method. In Grammatical Inference and Applications, 236–245. Heidelberg, Germany: Springer.
El Ghali, Adil, Daniel Hromada, and Kaoutar El Ghali. 2012. Enrichir et raisonner sur des espaces sémantiques pour l'attribution de mots-clés. JEP-TALN-RECITAL 2012: 77.
Elman, Jeffrey L. 1993. Learning and development in neural networks: The importance of starting small. Cognition 48: 71–99.
Flake, G. W. 1999. The computational beauty of nature. Cambridge, MA: MIT Press.
Fogel, Lawrence J., Alvin J. Owens, and Michael J. Walsh. 1966. Artificial intelligence through simulated evolution. New York, NY: John Wiley & Sons.
Foster, Mary LeCron. 2002. Symbolism: the foundation of culture. Companion Encyclopedia of Anthropology: 366. Routledge.
Furrow, David, Katherine Nelson, and Helen Benedict. 1979. Mothers' speech to children and syntactic development: Some simple relationships. Journal of Child Language 6: 423–442.
Galton, Francis. 1875. English men of science: Their nature and nurture.
Gärdenfors, Peter. 2004. Conceptual spaces: The geometry of thought. Cambridge, MA: MIT Press.
Haeckel, Ernst Heinrich Philipp August. 1879. The evolution of man. Vol. 1. [s.n.].
Haidt, Jonathan. 2012. The righteous mind: Why good people are divided by politics and religion. New York, NY: Random House.
Hamilton, William D. 1963. The evolution of altruistic behavior. The American Naturalist 97: 354–356.
Harris, Margaret. 2013. Language experience and early language development: From input to uptake. Psychology Press.
Harris, Zellig S. 1954. Distributional structure. Word 10: 146–162.
Hebb, Donald Olding. 1964. The Organization of Behavior: A Neuropsychological Theory. John Wiley & Sons.
Hoff-Ginsberg, Erika. 1986. Function and structure in maternal speech: Their relation to the child's development of syntax. Developmental Psychology 22: 155.
Holland, John H. 1975. Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. Ann Arbor, MI: University of Michigan Press.
Hromada, Daniel Devatman. 2012. Variations upon the theme of Evolutionary Language Game. Unpublished manuscript. Slovak University of Technology.
Hromada, Daniel Devatman. 2013a. Geometrizácia ontológií – prípadová štúdia SNOMED. Unpublished manuscript. Slovak University of Technology.
Hromada, Daniel Devatman. 2013b. Random Projection and Geometrization of String Distance Metrics. In Proceedings of the Student Research Workshop associated with RANLP, 79–85. Hissar, Bulgaria.
Hromada, Daniel Devatman. 2014a. Comparative study concerning the role of surface morphological features in the induction of part-of-speech categories. In Proceedings of the TSD 2014 conference. Heidelberg, Germany: Springer.
Hromada, Daniel Devatman. 2014b. Introductory experiments with evolutionary optimization of reflective semantic vector spaces. In TALN-RECITAL-DEFT 2014. Marseille, France.
Hromada, Daniel Devatman. 2014c. Conditions for cognitive plausibility of computational models of category induction. In Proceedings of the 15th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems. Heidelberg, Germany: Springer.
Jiménez-López, M. D. 2000. Grammar systems: a formal-language-theoretic framework for linguistics and cultural evolution. PhD dissertation. Tarragona, Spain: Rovira i Virgili University.
Johnson, William B., and Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26: 1.
Karypis, George. 2002. CLUTO – a clustering toolkit. DTIC Document.
Kauffman, Stuart. 1996. At home in the universe: The search for the laws of self-organization and complexity. Oxford, England: Oxford University Press.
Kelemen, Jozef. 2004. Miracles, colonies, and emergence. In Formal Languages and Applications, 323–333. Heidelberg, Germany: Springer.
Kelemen, Jozef, and Alica Kelemenová. 1992. A grammar-theoretic treatment of multiagent systems. Cybernetics and Systems 23: 621–633.
Keller, Bill, and Rudi Lutz. 1997. Evolving stochastic context-free grammars from examples using a minimum description length principle. In 1997 Workshop on Automata Induction, Grammatical Inference and Language Acquisition.
Kennedy, James, and Russell C. Eberhart. 2001. Swarm intelligence. San Francisco, CA: Morgan Kaufmann.
Koza, John R. 1992. Genetic programming: on the programming of computers by means of natural selection. Vol. 1. Cambridge, MA: MIT Press.
Küntay, Aylin, and Dan I. Slobin. 1996. Listening to a Turkish mother: Some puzzles for acquisition. Social interaction, social context, and language: Essays in honor of Susan Ervin-Tripp: 265–286.
Kvasnička, Vladimír, and Jiří Pospíchal. 2007. Evolúcia jazyka a univerzální darwinizmus. In Myseľ, inteligencia a život. Bratislava, Slovakia: Slovenská Technická Univerzita.
Lakoff, G. 1990. Women, fire, and dangerous things. Chicago, IL: University of Chicago Press.
Levy, Yonata. 1988. The nature of early language: Evidence from the development of Hebrew morphology. Categories and processes in language acquisition: 73–98. Lawrence Erlbaum Associates.
MacWhinney, Brian. 1987. The competition model. Mechanisms of language acquisition: 249–308.
MacWhinney, Brian. 2000. The CHILDES Project: Tools for Analyzing Talk. Transcription, format and programs. Vol. 1. Lawrence Erlbaum Associates.
Maratsos, Michael. 1988. The acquisition of formal word classes. Categories and processes in language acquisition: 31–44. Lawrence Erlbaum Associates.
Morgan, Thomas Hunt. 1916. A Critique of the Theory of Evolution. Princeton, NJ: Princeton University Press.
Newport, Elissa L. 1990. Maturational constraints on language learning. Cognitive Science 14: 11–28.
Ninio, Anat. 1988. On formal grammatical categories in early child language. Categories and processes in language acquisition. Lawrence Erlbaum Associates.
Nowak, M. A., J. B. Plotkin, and D. C. Krakauer. 1999. The evolutionary language game. Journal of Theoretical Biology 200: 147–162.
Ofria, Charles, and Claus O. Wilke. 2004. Avida: A software platform for research in computational evolutionary biology. Artificial Life 10: 191–229.
O'Neill, Michael, and Conor Ryan. 2003. Grammatical evolution: evolutionary automatic programming in an arbitrary language. Genetic Programming Series, Vol. 4. Heidelberg, Germany: Springer.
Piaget, Jean. 1974. Introduction à l'épistémologie génétique. Paris, France: PUF.
Pohlheim, Hartmut. 1996. GEATbx: Genetic and evolutionary algorithm toolbox for use with MATLAB documentation. Retrieved from http://www.geatbx.com/docu/algindex.html.
Poincaré, Henri. 1908. L'invention mathématique.
Popper, Karl Raimund. 1972. Objective knowledge: An evolutionary approach. Oxford, England: Clarendon Press.
Ray, Thomas S. 1992. Evolution, ecology and optimization of digital organisms. Santa Fe, NM: Santa Fe Institute.
Rechenberg, Ingo. 1973. Evolutionsstrategie – Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Stuttgart, Germany: Frommann-Holzboog.
Rizzolatti, Giacomo, and Laila Craighero. 2004. The Mirror-Neuron System. Annual Review of Neuroscience 27: 169–192.
Rosch, Eleanor. 1999. Principles of categorization. Concepts: core readings: 189–206. Cambridge, MA: MIT Press.
Sahlgren, Magnus. 2005. An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE. Vol. 5.
Sekaj, Ivan. 2005. Evolučné výpočty a ich využitie v praxi. Bratislava, Slovakia: Iris.
Shi, Rushen, Janet F. Werker, and James L. Morgan. 1999. Newborn infants' sensitivity to perceptual cues to lexical and grammatical words. Cognition 72: B11–B21. doi:10.1016/S0010-0277(99)00047-5.
Simonton, Dean Keith. 1999. Creativity as blind variation and selective retention: Is the creative process Darwinian? Psychological Inquiry 10: 309–328.
Smith, Tony C., and Ian H. Witten. 1995. A genetic algorithm for the induction of natural language grammars. In Proceedings of the IJCAI-95 Workshop on New Approaches to Learning for Natural Language Processing, 17–24.
Solan, Z., D. Horn, E. Ruppin, and S. Edelman. 2005. Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences 102: 11629.
Sosík, Petr, and Leoš Štýbnar. 1997. Grammatical inference of colonies. In New Trends in Formal Languages, 236–246. Heidelberg, Germany: Springer.
Spencer, Herbert. 1894. Education: Intellectual, moral, and physical. C. W. Bardeen.
Tomita, Masaru. 1982. Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, 105–108.
Trivers, Robert L. 1972. Parental investment and sexual selection. In Sexual Selection and the Descent of Man, 1871–1971, ed. B. Campbell, 136–179. Chicago, IL: Aldine.
Turing, A. M. 2008. Computing machinery and intelligence. Parsing the Turing Test: 23–65.
Vapnik, V., S. E. Golowich, and A. Smola. 1997. Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems 9.
Wilson, Edward O. 1978. What is sociobiology? Society 15: 10–14.
Wittgenstein, L. 2009. Philosophical investigations. Wiley-Blackwell.
Wolff, J. Gerard. 1988. Learning syntax and meanings through optimization and distributional analysis. Categories and processes in language acquisition 1.
Wright, Sewall. 1932. The roles of mutation, inbreeding, crossbreeding and selection in evolution. In Proceedings of the Sixth International Congress on Genetics, 1: 356–366.

Comparative study concerning the role of surface morphological features in the induction of part-of-speech categories

Daniel Devatman Hromada 1,2

1 Université Paris 8, Laboratoire Cognition Humaine et Artificielle, 2, rue de la Liberté, 93526 St Denis Cedex 02, France
2 Slovak University of Technology, Faculty of Electrical Engineering and Information Technology, Department of Robotics and Cybernetics, Ilkovičova 3, 812 19 Bratislava, Slovakia

Abstract. Being based predominantly on the English language, existing systems of part-of-speech induction prioritize the contextual and distributional features “external” to the word and attribute somewhat secondary importance to features derived from the word's “internal” morphologic and orthotactic regularities. Here we present some preliminary empirical results supporting the statement that simple “internal” features derived from frequencies of occurrences of character n-grams can substantially increase the V-measure of POS categories obtained by repeated bisection k-way clustering of tokens contained in the Multext-East corpora.
The obtained data indicate that the information contained in suffix features can furnish c(l)ues strong enough to outperform some much more complex probabilistic or HMM-based POS induction models, and that this can especially be the case for Western Slavic languages.

Keywords: part-of-speech induction, development of morphology, clustering, surface features, suffix

1 Introduction

Part-of-speech (POS) induction is a constructivist process aiming to converge to a mechanism able to attribute the POS category (e.g. “verb”, “noun”, “adjective” etc.) membership information to any word of the language under study. Because “syntactic category information is part of the basic knowledge about language that children must learn before they can acquire more complicated structures” [15], POS induction (POS-i) is often considered to be the first step in the more complex process of grammar induction and language acquisition in general. Given such an important place of POS-i in NLP studies, it is no surprise that while the first computational models of POS-i were proposed decades ago [3][6][15], the problem of unsupervised POS-label attribution still attracts the attention of many computational linguists. Thus, dozens of POS-i systems exist, among which those based on class-based word n-grams [5], graph clustering [2] or diverse extensions of Hidden Markov Models [9][8][1] are compared in the comparative study [4], which suggests that “some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches”.

The aims of this article are 1) to elucidate the superior performance of the models of Clark [5] and Berg-Kirkpatrick [1] by the statement “their models perform better because they use better features”, and 2) to make precise that, for many languages, such features can be morphological ones. We specify that what shall be called a “morphological feature” (MF) in the rest of this article is any feature “internal” to the word WITHIN which it occurs, and as such can be opposed to contextual or distributional features “external” to the word under study (i.e. opposed to features which describe the word's relation to other words and not its internal composition). By focusing upon the role of such “orthotactic” MFs in the diverse languages represented in the Multext-East corpus [7], we shall try to persuade the reader that while the “syntax-in-word-order paradigm” could (and did) yield useful models and tools for the description of the English language, the uncritical acceptance of such a paradigm could turn out to be somewhat counterproductive if one intends to develop POS-i models for highly flectional, morphology-rich languages.

2 Corpus

All analyses were performed with texts contained in the 4th version of the Multext-East corpus [7]. The Bulgarian (bg), Czech (cs), English (en), Estonian (et), Farsi (fa), Hungarian (hu), Polish (pl), Romanian (ro), Serbian (sr), Slovak (sk) and Slovene (sl) translations of Orwell's 1984 were analysed. Quantitative descriptions of the different corpora are presented in Table 1.

    Corpus   Types   Tokens   POS tags
    bg       17305   117238   13
    cs       22341   100368   13
    en       11160   134832   12
    et       18911   111305   12
    fa       13009   124823   12
    hu       20642   132196   13
    pl       24019   115185   14
    ro       16220   135055   15
    sk       23015   103452   13
    sl       20597   112278   13
    sr       21540   126611   13

    Table 1: Quantitative description of the analysed corpora.

3 Method

Every word from the corpus was described by a vector of features whose values were obtained by the application of the feature filters described below. Vectors were subsequently clustered into groups.
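Before the individual feature filters are specified in Section 3.1 below, a minimal Perl sketch may make the “internal” filters concrete; the subroutine names are ours, and the fragment only approximates, under stated assumptions, the kind of extraction performed.

    #!/usr/bin/perl
    # Sketch of the N_X feature filter (character n-gram frequencies inside a
    # boundary-marked, lower-cased token) and of the S_X suffix filter.
    use strict;
    use warnings;

    sub ngram_features {
        my ($token, $x) = @_;
        my $marked = '^' . lc($token) . '$';   # mark word boundaries
        my %count;
        $count{ substr($marked, $_, $x) }++ for 0 .. length($marked) - $x;
        return \%count;                        # sparse vector: n-gram => frequency
    }

    # suffix filter S_X: the token's final X characters
    sub suffix_feature {
        my ($token, $x) = @_;
        my $t = lc $token;
        return length $t >= $x ? substr($t, -$x) : $t;
    }

    my $n2 = ngram_features('Good', 2);
    print "$_ => $n2->{$_}\n" for sort keys %$n2;   # ^g, go, oo, od, d$ once each
    print "S2 of 'vlakom' => ", suffix_feature('vlakom', 2), "\n";   # "om"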
3.1 Feature extraction

All tokens, punctuation marks included, were extracted as such from the corpus. Word characters were transcribed into lower case. In order to mark the word boundaries, ^ and $ characters were prefixed, respectively suffixed, to the extracted tokens. The following features were then extracted from the tokens:

Length [L] – yields only one feature whose value equals the character length of the token, i.e. 6 for the word “^good$”. Serves as the baseline.

Character n-grams of length X [N_X] – every feature encodes the number of occurrences of a character n-gram of length X within the token. Thus, if X=1, the word “^good$” can be encoded by the vector of features [1, 1, 2, 1, 1], whose second element denotes the number of “g” present in the word, the third feature the number of “o”, etc. If X=2, the vector could be [1, 1, 1, 1, 1], its first element representing the frequency of occurrence of the “^g” character bigram, the second of the “go” bigram, the third of the “oo” bigram, etc.

Character fragments shorter than a certain limit [F_X] – features encode occurrences of all character fragments of length up to X; the filter [A] takes all character fragments into account. Prefix and suffix filters [P_X], [S_X] encode the token's initial, respectively final, X characters; circumference filters [C_X] combine both extremities of the token, while their “negation”, the root filters [R_X], encode what remains of the token once its circumference is stripped. Finally, the “external” distributional filter [O1] encodes frequencies of occurrence of feature-words immediately to the left or to the right of the target token.

3.2 Clustering and evaluation

The extracted feature vectors were subsequently clustered by the repeated bisection k-way clustering algorithm implemented in the CLUTO toolkit [12]. The correspondence between the induced clusters and the gold-standard POS tags was quantified by the V-measure [13], i.e. the harmonic mean of the homogeneity h and the completeness c = 1 − H(K|C)/H(K) of the clustering; we consider the V-measure to be a very valuable quantitative measure of the performance of clustering POS-i algorithms.

4 Results

    Corpus   L     N1    N2     N3     N4    F2    F3     F4     A
    bg       4.3   5.6   13.1   17.0*  11.9  8.5   14.4   14.7   14.6
    cs       5.4   9.2   25.2*  20.7   11.6  23.1  24.8   23.9   24.3
    en       3.8   6.5   14.1   15.3   9.4   10.4  14.9   16.1*  14.7
    et       4.2   4.0   12.2   14.2*  11.9  5.8   6.92   9.38   7.24
    fa       2.6   6.8   15.4   15.52  12.2  12.0  15.51  15.3   15.55*
    hu       2.3   4.3   6.1    10.7*  9.4   5.2   6.26   6.58   5.65
    pl       4.7   8.0   21.1*  20.1   13.7  18.5  20.3   19.7   15.6
    ro       4.6   7.1   11.1   13.6*  9.5   8.23  11.3   11.8   10.9
    sr       5.2   5.5   13.3   14.8*  10.5  5.67  8.06   8.82   5.95
    sk       5.9   11.2  26.9*  21.0   14.0  23.8  24.9   24.2   22.5
    sl       4.5   4.8   12.2   17.1*  12.8  7.39  8.42   14.3   7.5

    Table 2: V-measure × 100 obtained after clustering each corpus according to the length (L) and character-fragment (N, F, A) features. The most performant feature of every corpus is marked with an asterisk.

Table 2 shows the V-measure × 100 values obtained by clustering of words characterized by length (L), character n-gram fragments of fixed length (N2, N3, N4) or n-gram fragments shorter than a certain length (F2, F3, F4), as well as of clusters created by considering all fragments (A). The best results (i.e. the highest V-measures) were observed in the case of Western Slavic languages, which all attained more than 0.2 of V-measure performance when clustered according to features representing character bigram occurrences. Southern Slavic languages, along with Romanian, Hungarian and Estonian, performed best when character trigrams were taken into account. English attained its 0.16 performance when all bigrammata, trigrammata and tetragrammata were taken into account, while Farsi was clustered best when all n-gram character fragments were taken into account.

Further results, presented in Table 3, point in the same direction. The highest V-measure scores were attained by Slovak, Czech and Polish when the simple extractor of suffix features of length 2 was applied.

    Corpus   P2    P3     S2     S3     C2   C3   R2   R3   O1
    bg       6.7   5.0    18.9*  16.5   3.8  2.3  3.4  3.0  12.5
    cs       7.4   7.1    25.2*  18.7   4.7  3.1  3.7  3.4  7.9
    en       3.9   3.6    20.5*  19.7   2.4  1.7  2.9  2.2  14.4
    et       4.2   6.0    14.2   16.1*  3.6  2.8  3.4  3.3  6.77
    fa       11.7  14.5*  14.4   12.0   6.4  4.6  2.8  3.2  14.3
    hu       5.4   5.7    17.1*  14.2   3.0  1.8  2.4  2.0  7.1
    pl       5.3   6.5    25.1*  22.7   4.0  3.0  3.3  2.9  7.9
    ro       6.5   5.9    15.8*  14.8   3.1  1.9  2.5  2.4  15.6
    sr       6.1   6.4    19.1*  16.5   4.6  3.0  4.7  3.5  9.4
    sk       8.2   5.8    27.5*  21.3   4.8  3.5  3.6  3.5  8.7
    sl       6.8   6.0    21.6*  19.3   5.2  2.4  3.3  3.4  9.1

    Table 3: V-measure × 100 obtained after clustering each corpus according to the prefix (P), suffix (S), circumference (C), root (R) and distributional (O1) features. The most performant feature of every corpus is marked with an asterisk.
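Since all scores reported in Tables 2 and 3 are V-measures, the following toy Perl computation – with invented gold tags and cluster assignments for six tokens – may make the measure concrete; it sketches only the evaluation step, not the CLUTO-based clustering pipeline itself.

    #!/usr/bin/perl
    # Toy V-measure: harmonic mean of homogeneity h and completeness c.
    use strict;
    use warnings;

    my @class   = qw(N N N V V V);   # gold POS tags of six tokens (invented)
    my @cluster = qw(0 0 1 1 1 1);   # induced cluster of each token (invented)

    my (%nc, %nk, %nck);
    for my $i (0 .. $#class) {
        $nc{ $class[$i] }++;
        $nk{ $cluster[$i] }++;
        $nck{"$class[$i],$cluster[$i]"}++;
    }
    my $N = scalar @class;

    sub entropy {                    # entropy of a marginal count distribution
        my $h = 0;
        for my $n (@_) { my $p = $n / $N; $h -= $p * log($p) if $p > 0 }
        return $h;
    }
    sub cond_entropy {               # conditional entropy from joint counts
        my ($marg, $pos) = @_;       # $pos selects class (0) or cluster (1)
        my $h = 0;
        for my $key (keys %nck) {
            my $y = ( split /,/, $key )[$pos];
            $h += $nck{$key} / $N * log( $marg->{$y} / $nck{$key} );
        }
        return $h;
    }
    my $h = 1 - cond_entropy(\%nk, 1) / entropy(values %nc);  # 1 - H(C|K)/H(C)
    my $c = 1 - cond_entropy(\%nc, 0) / entropy(values %nk);  # 1 - H(K|C)/H(K)
    printf "h=%.3f c=%.3f V=%.3f\n", $h, $c, 2 * $h * $c / ($h + $c);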
In fact, the same extractor yielded the highest scores in the case of all languages, with the exception of Estonian, where somewhat longer suffixes tend to facilitate POS-i, and of Farsi, whereby prefixal features seem to be at least as important as suffixal features. Word circumference features C2 and C3, as well as their “negation”, the word root features R2 and R3, do not seem to bring any information relevant to the categorization process – in fact, they seem to perform even worse than the baseline feature L. Members of the set of “external” distributional features (O1), which represent the trivial frequency of occurrence of the feature-word to the left or right of the target word, performed worse in all cases, English included, than S2.

5 Discussion

The comparative study of POS-i systems [4] indicates that POS-i models involving morphological features perform better than models which do not. However, both in Clark's [5] probabilistic model as well as in the morphology-enriched, HMM-derived model [1], morphological features seem to play the role of a performance-increasing “cherry added to the top of the cake” rather than that of the model's cornerstone. The results presented in this paper suggest that focusing upon the phenomena occurring within the token, if the token's transcription allows it (3), seems to yield quite strong c(l)ues for the subsequent clustering of tokens into their respective syntactic categories. It may be the case that especially the character bigrams occurring at the word's offset position – suffixes – play an important role in word → POS category attribution. It is also worth noting that suffixes augment the performance of POS-i not only for Indo-European languages but also for Uralic languages like Estonian or Hungarian.

(3) For example, an “internal” feature-oriented approach would hardly yield any interesting results if applied on Chinese logograms, but could be of certain theoretic interest when applied upon a pinyin transcription.

It is also worth reiterating that POS-i within Western Slavic languages tends to be much more sensitive to character n-gram and suffix-derived features than within the other languages compared in this study. Because the research presented hereby was based on only one particular literary corpus (Orwell's 1984), and the results obtained may thus represent not the properties of the languages as such but rather a certain translation style, it would be somewhat hors propos to postulate that a kind of overall statistical property – labeled hereby as “word offset flectivity” – is more marked in Western Slavic languages than, for example, in Southern Slavic or Uralic languages. But given the fact that it was only Slovak, Czech and Polish that attained V > 0.25 when clustered according to the outputs of the S2 feature-extracting prism, we believe that subsequent analyses involving more corpora and more languages may be worth the effort. Verily, only more exhaustive comparative studies could assess the impact of the morphology of a word X upon the attribution of a syntactic function to the very word X. And since syntax is often bound with semantics – for example by means of thematic relations – such studies, if ever they verify and do not falsify the results presented hereby, could possibly result in a partial revision of the canonical “signifiant is independent from signifié” paradigm [14]. To emit such a call was, however, not a motivation behind the redaction of this paper.
Nor had we aimed to outperform existing distributional & probabilistic models – for it may seem quite improbable that one would outperform the “heavy Markovian artillery” with such simple computational machinery as k-way clustering. Thus, it came as a certain surprise to us that the comparison of the data presented in Figure 4 of [4] with our results indicated that, for some Slavic corpora, our simplistic, morphology-driven, geometrically-clustered model attained V-measure scores higher than, or more or less equal to, those of the models presented in [11][9]. Our approach can also dispose of certain advantages when it comes to computational complexity – while some models, like that of [2], sometimes have problems converging to a result in reasonable time, none of the 198 analyses whose results are presented above lasted more than a few seconds on an average desktop computer.

This being said, we believe that it may be the case that POS-i systems of the next generation could not only take into account, but should rather be based on, the word's “internal” morpho(phono)logical or even prosodic and metric features. While sufficient evidence exists for stating that, in order to have a highly performant and robust POS-i model, one MUST take into account the distributional and contextual information “external” to the word under question, we believe that, especially in the case of highly flectional languages, the complexity of the whole POS-i clustering process could be significantly reduced if ever the process were “seeded” (i.e. initiated) with the token's “internal” features. The performance-augmenting and complexity-reducing effects of such seeding being the principal topic of our ongoing work, we conclude that what we believe to be the ultimate advantage of such a model could be its “cognitive plausibility” [10].

Last but not least, by underlining the importance of suffixal features for the POS-induction process, our results may well point in the same direction as the hypothesis that “one of the first operating principles employed in the ontogenesis of grammar [is that] grammatical realizations in the form of suffixes or postpositions will be acquired earlier than realizations in the form of prefixes or prepositions” [16]. Thus, without any intention to do so (4), we ultimately find the results of our purely empirical study to be consistent with more general psycholinguistic theories of grammar induction and language development.

(4) During both the conception and the realization of our study, we were utterly unaware both of Slobin's “operating principle A” and of the amount of scientific evidence already associated with it.

References

1. Berg-Kirkpatrick, Taylor, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. P. 582–590 in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
2. Biemann, Chris. 2006. Unsupervised part-of-speech tagging employing efficient graph clustering. P. 7–12 in Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop.
3. Brown, Peter F., Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. P. 467–479 in Computational Linguistics 18(4).
4. Christodoulopoulos, Christos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? P. 575–584 in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.
5. Clark, Alexander. 2003. Combining distributional and morphological information for part of speech induction. P. 59–66 in Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, Volume 1.
6. Elman, Jeffrey L. 1989. Representation and structure in connectionist models.
7. Erjavec, Tomaž. 2012. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. P. 131–142 in Language Resources and Evaluation 46(1).
8. Goldwater, Sharon, and Tom Griffiths. 2007. A fully Bayesian approach to unsupervised part-of-speech tagging. P. 744 in Annual Meeting of the Association for Computational Linguistics, vol. 45.
9. Graca, Joao, Kuzman Ganchev, Ben Taskar, and Fernando Pereira. 2009. Posterior vs. parameter sparsity in latent variable models. P. 664–672 in Advances in Neural Information Processing Systems 22.
10. Hromada, Daniel Devatman. 2014. Conditions for cognitive plausibility of computational models of category induction. Accepted for the 15th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2014). Montpellier, France.
11. Johnson, Mark. 2007. Why doesn't EM find good HMM POS-taggers? P. 296–305 in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
12. Karypis, George. 2002. CLUTO – a clustering toolkit.
13. Rosenberg, Andrew, and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. P. 420 in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), vol. 410.
14. de Saussure, Ferdinand. 1922. Cours de linguistique générale. Paris, France: Payot.
15. Schütze, Hinrich. 1993. Part-of-speech induction from scratch. P. 251–258 in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics.
16. Slobin, Dan. 1973. Cognitive prerequisites for the acquisition of grammar. P. 175–208 in Studies of Child Language Development.

Evolutionary modelisation of ontogeny of linguistic structures

Rigorous Thesis Examination

Daniel D. Hromada 1,2

1 Slovak University of Technology, Faculty of Electrical Engineering and Information Technology, Department of Robotics and Cybernetics
2 Université Paris 8, École Doctorale Cognition, Langage, Interaction, Laboratoire Cognition Humaine et Artificielle

4-11-2014
doc. Ing. Ivan Sekaj, PhD. – prof. Ing. Vladimír Kvasnička, DrSc.

Administrative Position Development

- 2010: enrolled for the PhD at the École Doctorale Cognition, Langage, Interaction of University Paris 8
- 2011: attribution of a double PhD scholarship by the French government; inscription at STU as an external doctoral candidate
- 2012: inter-university convention signed by FEI STU's Dean and the President of Paris 8; start of the scholarship
- 2013: summer semester in Paris; presentation by prof. Tijus in Bratislava
- 2014: summer semester in Paris; end of the scholarship

The thesis is to be written and defended in the English language.

[Slide: An interdisciplinary enterprise – a mindmap localizing « Evolutionary Modeling of L-Structure Ontogeny » among part-of-speech induction, grammar induction, natural language processing, computational linguistics, formal language theory, the language acquisition device, grammar systems, developmental psycholinguistics (motherese and toddlerese, bootstrapping), evolutionary computation (genetic algorithms, populations, fitness functions and landscapes, evolutionary strategies) and universal Darwinism (heredity, variation, selection; stochastic systems; adaptation; trial and error).]
Universal Darwinism

Definition: A general theoretical framework aiming to explain the emergence and optimization of diverse complex phenomena in terms of the interaction of three basic processes:
1. information variation
2. information selection
3. information replication

UD-consistent disciplines:
- biology (Darwin 1859, Mendel 1866) and genetics (Morgan 1916, Watson & Crick 1953)
- sociobiology (Hamilton 1974, Wilson 1978) and evolutionary psychology (Cosmides & Tooby 1997)
- memetics (Dawkins 1976, Blackmore 2000)
- evolutionary epistemology
- neural Darwinism
- evolutionary computation, artificial life, ...
- evolutionary linguistics

Evolutionary Epistemology

An ambiguous definition: Evolutionary Epistemology aims to explain the source, existence, nature, scope and diversity of forms of knowledge in evolutionary terms. Two possible interpretations:
1. biological evolution of cognitive and mental faculties in animals and humans
2. knowledge per se evolves by selection and variation

The second interpretation can be further analyzed:
1. knowledge can emerge by variation & selection of ideas shared by a group of mutually interacting individuals (Popper 1972)
2. knowledge can emerge by variation & selection of cognitive representations within one individual

Genetic Theories of Learning and Creativity

"Genetic" not in the contemporary (i.e. DNA-related) sense, but as related to "origins" (genesis) and "heredity" (genus).

Piaget's Genetic Epistemology:
1. aims to explain how human cognitive systems (CS) develop from birth onwards
2. CS pass through a series of stages; every stage involves an equilibration of cognitive schemas
3. schemas change through processes of assimilation and accommodation

Campbell–Simonton's Theory of Creativity:
1. scientific discovery and creativity can be explained in terms of blind variation and selective retention (Campbell 1960)
2. "How do human beings create variations? One perfectly good Darwinian explanation would be that the variations themselves arise from a cognitive variation-selection process that occurs within the individual brain." (Simonton 1990)

Neural Darwinism

Edelman (1987) postulated that complex adaptations in the brain arise through some process similar to natural selection. Another variant of ND is the theory of Changeux and Dehaene (1989): "the production and storage of mental representations, including their chaining into meaningful propositions and the development of reasoning, can also be interpreted, by analogy, in variation-selection (Darwinian) terms within psychological time-scales." Fernando et al. (2012) propose two "toy models ... of a means by which a higher-order unit of neuronal evolution above the synaptic level may be able to replicate."

Figure: reproduced from (Fernando et al., 2012).

Evolutionary Computation

Definition: "Evolutionary computation uses computational models of evolutionary processes as key elements in the design and implementation of computer-based problem solving systems" (Spears et al., 1993).
The main families of EC are:
- genetic algorithms (c.f. next slide)
- evolutionary programming (stronger genotype–phenotype distinction, FSAs, little recombination)
- evolutionary strategies (involve more recombination, self-adaptation, other nature-inspired approaches)
- genetic programming (does not search for solutions but for programs)
- grammatical evolution: a variant of Genetic Programming which uses evolutionary search to discover specific sequences of application of production rules which generate program code yielding the wished-for solutions
- swarm intelligence (Kennedy & Eberhart, 2001)
- artificial life (no exogenous fitness function: Tierra, AVIDA, etc.)

Genetic algorithms

Canonic GA (Holland, 1975):
- Encoding: binary vector
- Initial population: randomly generated
- Selection: fitness-proportionate (p_i = f_i / Σ_{j=1..N} f_j)
- Crossover: one-point
- Mutation: bit-flip with probability p (0.001)

    rand init
    evaluate
    select
    repeat
        crossover
        mutation
        evaluate
        select
    until stop

Schema theorem: A schema is a subset of strings with similarities at certain positions. The schema theorem states that short, low-order (i.e. with few fixed positions) schemata with above-average fitness increase exponentially in successive generations:

    E(m(H, t+1)) ≥ m(H, t) · (f(H) / a_t) · [1 − p]

where m(H, t) is the number of strings belonging to schema H at generation t, f(H) is the observed average fitness of schema H, a_t is the observed average fitness at generation t, and p is the probability that crossover or mutation will disrupt H.

Convergence to the global optimum: Rudolph (1994) has proven that CGAs are certain to converge to the global optimum only if they "keep track of the best solution found over time" (i.e. involve a form of elitism).

Evolutionary Language Game (Nowak et al., 1999)

Let us have a population of N agents. Each agent is described by an r × c associative matrix A. A's entry a_ij specifies how often an individual, in the role of a student, observed one or more other individuals (teachers) referring to object i by producing signal j. From this associative matrix A, one can derive:

- the active "speaker" matrix S by normalizing A's rows: s_ij = a_ij / Σ_{n=1..c} a_in
- the passive "hearer" matrix H by normalizing A's columns: h_ij = a_ij / Σ_{n=1..r} a_nj

Subsequently, we can imagine two individuals A and A', the first one having the language L(H, S), the other having the language L'(H', S'). The payoff related to the communication of the two individuals is calculated as follows:

    F(A, A') = Σ_{i=1..r} Σ_{j=1..c} s_ij · h'_ji = Tr(SH')

And the fitness of the individual A with regard to all other members of the population P can be obtained as follows:

    f(A) = (1 / (|P| − 1)) · Σ_{A' ∈ P, A' ≠ A} F(A, A')

By implementing EC, these fitness values can subsequently direct the evolution of the population toward states where the individual matrices are more optimally "aligned". In the ELG, this alignment represents the situation when hearer and speaker mutually understand each other, i.e. the speaker has encoded meaning M by sound S and the hearer has subsequently decoded sound S as meaning M.

Evolutionary Language Game #2

- parent→child information transfer is modelled by a matrix sampling procedure
- the parameter k specifies the quantity of repetition during matrix sampling
- all experiments with N=100, the agents having "memones" of size 5×5 (i.e. their associative matrices could encode at most 5 "sounds" and 5 "meanings")
- convergence to the globally optimal state is assured only if the matrix sampling involves a small but nonzero amount of noise!
- a beautiful model of how a "language" is sure to arise ex nihilo in communities wherein information transfer between individuals exists
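A hedged Perl illustration of the payoff computation may be helpful; the two 2×2 associative matrices below are invented, and the sketch computes only the one-directional payoff F(A, A') = Tr(SH') defined above.

    #!/usr/bin/perl
    # Toy computation of the ELG payoff F(A,A') = Tr(S H').
    # Rows = objects (meanings), columns = signals; matrix contents invented.
    use strict;
    use warnings;

    my @A1 = ( [4, 1], [0, 5] );
    my @A2 = ( [3, 2], [1, 4] );

    sub speaker {                  # s_ij = a_ij / sum_n a_in (row-normalized)
        my @S;
        for my $row (@_) {
            my $sum = 0;
            $sum += $_ for @$row;
            push @S, [ map { $sum ? $_ / $sum : 0 } @$row ];
        }
        return @S;
    }
    sub hearer {                   # h_ij = a_ij / sum_n a_nj (column-normalized)
        my @A = @_;
        my @H;
        for my $j (0 .. $#{ $A[0] }) {
            my $sum = 0;
            $sum += $A[$_][$j] for 0 .. $#A;
            for my $i (0 .. $#A) { $H[$i][$j] = $sum ? $A[$i][$j] / $sum : 0 }
        }
        return @H;
    }
    sub payoff {                   # F = sum_i sum_j s_ij * h'_ji
        my ($S, $Hp) = @_;
        my $f = 0;
        for my $i (0 .. $#$S) {
            for my $j (0 .. $#{ $S->[0] }) { $f += $S->[$i][$j] * $Hp->[$j][$i] }
        }
        return $f;
    }

    my @S1 = speaker(@A1);
    my @H2 = hearer(@A2);
    printf "F(A1,A2) = %.3f\n", payoff(\@S1, \@H2);

In the full game, the payoff is of course evaluated over both speaking directions and averaged over the whole population in order to obtain f(A).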
Nowak et al. (1999) and Kvasnička & Pospíchal (2007) use it to illuminate the emergence of language in the phylogeny of the species Homo sapiens sapiens – but couldn't an analogous approach be used to model the transfer from mother to child in ontogeny?

Evolutionary Linguistics

Definition: the scientific study of both the origins and development of language, as well as of the cultural evolution of languages.

- Schleicher's (1853) language tree theory (Stammbaumtheorie)
- lack of fossil records, difficult to verify empirically; banned by the Société de Linguistique de Paris in 1866
- revived at the end of the 20th century (c.f. Pinker & Bloom, 2011)
- quantitative comparative linguistics, phylogenetic trees...
- focuses on phylogeny and not on ontogeny

Why EL should focus on ontogeny: "We are not very well informed about the psychology of Neanderthal man or about the psychology of Homo siniensis of Teilhard de Chardin. Since this field of biogenesis is not available to us, we shall do as biologists do and turn to ontogenesis. Nothing could be more accessible to study than the ontogenesis of these notions. There are children all around us." (Piaget, 1975)

Formal Language Theory

An alphabet A is a finite, nonempty set of symbols. A word (or a string) over an alphabet A is a finite sequence of symbols from A. A* is the set of all words over A. A language L over A is a subset of A*.

A grammar G is a quadruple (N, T, P, S), where N is the nonterminal alphabet, T is the terminal alphabet, S ∈ N is the axiom, and P is the set of rewriting (production, substitution) rules, written as x → y. Grammars are called:

- REGULAR when all rules in P have the form X → α or X → αY, with α ∈ T and X, Y ∈ N
- CONTEXT-FREE when all rules have the form X → x, where X ∈ N and x is a string over A_G = N ∪ T
- CONTEXT-SENSITIVE when P contains only rules of the form x1 X x2 → x1 w x2, with x1, x2, w being strings over A_G and X ∈ N

The language L GENERATED by the grammar G is the set of all sequences of terminals which can be derived from the axiom S by recursive application of the rules in P. A language L can be PARSED by the grammar G if, for every sequence s ∈ L, there exists at least one sequence of applications of production rules which, when applied in an inverse fashion (i.e. substituting the left side of the production rule for the right side), ends at the axiom S.

Grammar Systems

Introduction: A Grammar System is a set of grammars working together, according to a specified protocol, to generate a language.

- a syntactic theory of multi-agent, distributed and parallel systems
- multiple independent grammars share their productions in a "string environment" (analogous to AI "blackboard" approaches)
- the environment can change on its own (so-called "eco-grammar" systems) or not (language colonies)

Figure: Reproduced from Kelemen's (2004) article "Miracles, colonies, and emergence".

Natural Language Processing

- uses computers to process human languages
- implements AI, data-mining, information retrieval and machine learning methods (both supervised and unsupervised)
- the first and ultimate NLP challenge was posed by Turing (1950)
- other problems: anaphora resolution, automatic summarization, discourse analysis, machine translation, morphological segmentation, named entity recognition, natural language understanding, POS-induction and tagging, parsing, question answering, sentiment analysis, speech recognition, word sense disambiguation etc.
- in NLP, statistics often plays a more important role than FLT
- in NLP, methods based on aNNs, Naive Bayes or SVMs are predominant; EC is much less used

POS-induction and Grammar induction

Part-of-speech induction: the goal is to group the tokens present in a pure-text corpus C into clusters grouping members of diverse parts-of-speech (nouns, verbs, adjectives, etc.).

Grammar induction: the goal is to infer, from a pure-text corpus C, a grammar G which could have generated the corpus C.

POS-i and GI problems are strongly intertwined. Clusters discovered by POS-i can be denoted by non-terminal symbols.

C: John loves Mary. Mary hates John. Mary sleeps. John weeps.

- ideal grammar: N → John | Mary ; V → love | hate | sleep | weep ; S → N V s | N V s N (a generative sketch of this grammar is given after this slide group)
- least general grammar: S → C
- most general grammar: S → A*

Learning of semantic categories

How can machines work with semantic categories? Semantic categories (i.e. concepts) can be characterized:
- by an extensional (listing the instances) or ostensive (pointing a finger) definition
- in terms of sufficient and necessary features
- as (convex) subspaces of an N-dimensional semantic feature space (Gärdenfors 2004)
- as prototypes (points) within such spaces

Principle(s) behind the construction of semantic spaces:
- in neurosciences: "neurons that fire together, wire together" (Hebb 1964)
- in linguistics: "a word is characterized by the company it keeps" (Harris 1954)
- in philosophy: "the meaning of a word is its use in the language" (Wittgenstein 1953)

Conjecture: The development of vocabulary in human children is a variant of the multi-class classification problem, and as such can be simulated by an algorithm creating and partitioning semantic feature vector spaces.

Developmental Psycholinguistics

Developmental Psycholinguistics (DP) is a scientific discipline studying the changes occurring in the human faculty of understanding and production of natural languages. As such, it is closely related to developmental psychology (a sub-field of psychology) and to developmental linguistics (a sub-field of linguistics).

Language Development (DEF): Language development (LD) – or the ontogeny of a natural language L in a human individual H – is a constructivist process gradually transforming L into an ever more optimized communication channel facilitating the exchange of information between H and her social surroundings.

- language is social and pragmatic (it allows children to manipulate the objective world)
- comprehension precedes production: C-representations offer preliminary targets for P-productions
- the physiological predispositions of language are innate, but useless without triggering epigenetic stimuli
- children are not "ideal learners" (in the sense of Gold's theorem)
- brains simultaneously encode multiple language registers and grammars

Motherese

- parents modify their language in order to make themselves understood
- higher pitch (267 Hz in comparison to 198 Hz), slower tempo, greater rhythmicity, longer pauses between utterances
- "much of the speech addressed to babies consists of short, routine, repetitive utterances produced with great consistency and frequency in the same contexts, day after day" (Clark 2003)
- repetitions are three times more frequent in speech to two-year-olds than in speech to ten-year-olds

Figure: Reproduced from Trevarthen (1993).
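Returning to the POS-induction and grammar-induction slide above, the following hedged Perl sketch simply enumerates the language generated by the "ideal grammar"; the enumeration and its comments are our illustration, not part of the original slides.

    #!/usr/bin/perl
    # Enumerates the toy language generated by the "ideal grammar":
    # N -> John|Mary, V -> love|hate|sleep|weep, S -> N V s | N V s N.
    use strict;
    use warnings;

    my @N = qw(John Mary);
    my @V = qw(love hate sleep weep);

    for my $subj (@N) {
        for my $v (@V) {
            print "$subj ${v}s.\n";               # S -> N V s
            print "$subj ${v}s $_.\n" for @N;     # S -> N V s N
        }
    }
    # Besides the four corpus sentences, the grammar overgeneralizes: it also
    # produces e.g. "Mary sleeps John." - a reminder that an induced grammar
    # may generate utterances never observed in the environment.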
Toddlerese

- the baby's expressive faculties start with a 1-bit communication channel (need soothing / don't need soothing)
- they get more subtle and fine-grained with time: more information is transmitted with less signal
- babbling starts circa at 8 months of age, first as a repetition of the same syllables (mamamama, babababa); later, the syllables start to vary within the sequence (babadadabebe)
- around 1 year: consistent vocalizations in specific contexts (protowords)
- children tend to be quite accurate in their first productions, but later versions of the same words appear to be further from the adult targets
- "continuous exploration, experimentation, practice and intense involvement with linguistic structure" (Labov, 1978)
- upon closer inspection, LD reveals a constantly changing series of small experiments whereby the child progressively scrutinizes and tries out different options (Clark 2003)
- there is a lot of variability in children's word forms (Ferguson and Farwell 1975)
- first grammars form around "pivot" words, e.g.: "mama auch, tato auch, nana auch, baba auch" (S → N auch ; N → mama | tato | nana | baba)
- toddlerese: 10–30 months

Quantitative laws of language acquisition

Piotrowski law: In both linguistic phylogeny as well as ontogeny (e.g. sentence length, vocabulary size), development follows the logistic equation p(t) = c / (1 + a·e^(−bt)). Note that in ecology, the same equation is considered to yield the law of population growth (Lotka, 1920).

Figure: reproduced from (Baixeries et al., 2013).

Zipf's law: Zipf (1949) showed that if the most frequent word in a text is assigned rank 1, the second most frequent word is assigned rank 2, etc., then the frequency f(r) of a word of rank r obeys f(r) ≈ r^(−α) (i.e. follows a power-law distribution). Recent analyses (Baixeries et al., 2013) of the CHILDES corpus indicate that the exponent α depends on age, and is much higher and decreases faster in small children.

Common aspects of both LD and evolution

Axiomatic:
1. Convergence: different trajectories, same result
2. Variation: children PLAY, children forget
3. Non-monotonicity: locally "correct" behaviours are lost

Hypothetic:
1. Adaptation: gradual convergence of LT towards LM, and possibly of GT towards GM
2. Replication: both vertical (repetition) and horizontal (non-local storage)
3. Parallel coexistence of schemas
4. Selection: correct behaviours are rewarded

Subject and Method

Subject: my own daughter, 0–30 months (0–2;6).

Method: a phenomenological method based principally on amazed observations. Long-term journal. Little or no experimental (artificial) interaction beyond natural and normal scenarios.

Cognitive Crossover Cases

Case 1 – Banan: "banan" (banana) was called "baja" at (1;6) and "anan" at (1;10). At (1;11) the following interaction took place:
F: banan ; C: anan
F: banan ; C: anan
F: baja ; C: bajan
F: bajan ; C: banan

Case 2 – Olol: a very intensive "Krtko & Orol" (Mole & Eagle) period between 1;10–1;11; the word "OLOL" was used with high frequency on a regular basis. During one pre-sleep monologue, the subject said "KOLOL" when enumerating the names of her friends from the crèche, one among them being named Nikol.

Bilingual crossovers: oči+Augen=oge (eyes), opica+Affe=api (monkey), voda+Wasser=vava (water), etc.

Quantitative observations

Corpus: CHILDES – Child Language Data Exchange System (MacWhinney and Snow 1984):
1. more than 130 corpora of transcribed child verbal interactions
2. more than 20 languages

Variation operators whose impact shall be analyzed:
1. Substitutions – papija → babija → mamija
2. Reduplications – hau-hau
3. Omissions – vlak → ak (vlak = train)

Method: matching with Perl-compatible regular expressions (Hromada 2011). Reduplication, for example, can easily be detected with the regexp (\w{2,})\1, as the sketch below illustrates.
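A hedged Perl sketch of such detectors follows; the reduplication pattern mirrors the regexp above, while the omission and substitution heuristics are our own simplifications rather than the procedure of (Hromada 2011).

    #!/usr/bin/perl
    # Regexp-based detection of the three variation operators in child word
    # forms; example forms are taken from the observations above.
    use strict;
    use warnings;

    # reduplication: a syllable-sized substring immediately repeated
    print "reduplication in 'hauhau'\n" if 'hauhau' =~ /(\w{2,})\1/;

    # omission: the child form is obtainable by deleting characters
    # from the target form
    sub is_omission {
        my ($target, $child) = @_;
        my $pattern = join '.*', map { quotemeta } split //, $child;
        return length($child) < length($target) && $target =~ /$pattern/;
    }
    print "'ak' is an omission of 'vlak'\n" if is_omission('vlak', 'ak');

    # substitution: count differing slots between two equally long forms
    sub substitutions {
        my ($a, $b) = @_;
        return -1 if length($a) != length($b);
        my $d = 0;
        for my $i (0 .. length($a) - 1) {
            $d++ if substr($a, $i, 1) ne substr($b, $i, 1);
        }
        return $d;
    }
    printf "papija -> babija : %d substituted position(s)\n",
        substitutions('papija', 'babija');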
Note that strings can evolve by substituting substrings for other strings, and that the substitution rule itself is also a string.

Grammar Induction

- inducing not one monolithic grammar, but populations of individual grammars
- the fitness function promotes individuals which 1) match patterns present in the environment, 2) generate utterances which shall be "accepted" by the environment, and 3) minimize the number of utterances which shall not be accepted by the environment
- an individual grammar is encoded as an ordered sequence of production rules

Corpus: #Mutter# #Vater#

    Grammar 1            Grammar 2
    Vat → A              er# → B
    #A → B               #Va → A
    er# → A              #Mu → A
    #Mu → A              tB → B
    tA → A               er# → B
    axioms: AA, BA       axioms: AB

Concept Construction

Attaching meanings to words is interpreted as the supervised learning of a multiclass classifier. In the most recent experiments, I cross over four ideas in order to create it:

- RANDOM PROJECTION – exploiting the Johnson–Lindenstrauss lemma to project the problem into a D-dimensional space
- BINARIZATION – transposition of the problem from real-valued spaces to binary (Hamming) spaces (Hromada 2014)
- THEORY OF PROTOTYPES – every category C can be characterized by a prototype P_C which is as close as possible to the members of C and as far as possible from the members of other categories (Rosch 1973)
- EVOLUTIONARY COMPUTATION – searches for such a set of K prototypes P_1, ..., P_K which maximizes the "prototype fitness function":

    F(I) = Σ_{i=1..N} Σ_{C=1..K} s(i, C) · H(P_C, i)

with s(i, C) = +1 if L_i ≠ C (P_C then being a "bad" prototype P_B), and s(i, C) = −1 if L_i = C (P_C being the "good" prototype P_G), where H(P_G, i) is the Hamming distance between the binary vector denoting the prototype P_G and the document i contained in the training document set of cardinality N, and L_i denotes the label of document i. Every individual is a binary vector obtained by the concatenation of the vectors of all K prototypes, |I| = D · K.

Concept Construction – Preliminary Results

- trained on the training part and evaluated on the testing part of the 20newsgroups corpus (K=20)
- LSB parameters D=128, S=3, I=2
- CGA (N=100, P_M = 0.001, one-point crossover) with 1/8 elitism

Figure: Evolutionary induction of semantic prototypes – training.
Figure: Evaluation of the induced prototypes against the testing set.

The algorithm seems to perform better than the "deep learning" Semantic Hashing method of Salakhutdinov & Hinton (2009).
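A toy Perl evaluation of the prototype fitness may clarify the formula; the dimensionality (D=8, K=2), the documents and the prototypes are all invented, and the real experiments of course evolve the prototypes with the CGA rather than scoring a fixed pair.

    #!/usr/bin/perl
    # Toy evaluation of the prototype fitness F(I) over binary (Hamming)
    # vectors: distances to wrong-category prototypes are rewarded, distances
    # to the right-category prototype are penalized.
    use strict;
    use warnings;

    my @docs = (                     # [binary vector, gold category label]
        [ '11110000', 0 ],
        [ '11010000', 0 ],
        [ '00001111', 1 ],
        [ '00001011', 1 ],
    );
    my @prototypes = ( '11110000', '00001111' );   # one prototype per category

    sub hamming { ( $_[0] ^ $_[1] ) =~ tr/\0//c }  # count differing positions

    sub fitness {
        my @P = @_;
        my $f = 0;
        for my $doc (@docs) {
            my ($vec, $label) = @$doc;
            for my $c (0 .. $#P) {
                $f += ( $label == $c ? -1 : 1 ) * hamming( $P[$c], $vec );
            }
        }
        return $f;
    }
    printf "F = %d\n", fitness(@prototypes);       # higher = better separation

A genuinely evolutionary run would simply treat the concatenation of all prototype vectors as the individual I and let the canonic GA of the preceding slides maximize F.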
Evolutionary Computation algorithms are capable of generalization and can thus be considered a case of Machine Learning.

Thesis

1. At some level of abstraction, the ontogeny of syntactic and semantic categories is a process consistent with the tenets of Universal Darwinism.
2. Representations in the human mind are subjects of variation, selection and replication.
3. In young children this process is not yet completely internalized (Vygotsky 1934) and is thus visible to an external observer.
4. Evolutionary Computation is a means by which this process can be successfully simulated in silico.

Merci. Thank You.

Introduction to Moral Induction Model and its Deployment in Artificial Agents

Daniel Devatman Hromada 1,2 and Ilaria Gaudiello
hromi at giver.eu, i.gaudiello at gmail.com

1 Department of Control and Industrial Informatics, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology, Ilkovičova 3, 812 19 Bratislava, Slovak Republic
2 Cognition Humaine & Artificielle – Laboratoire des Usages en Techniques d'Information Numériques, Faculty of Psychology, Université Paris 8 Vincennes-Saint-Denis, Paris, France

Abstract

Individual specificity and autonomy of a morally reasoning system is principally attained by means of a constructionist inductive process. The inputs into such a process are moral dilemmata or their story-like representations; its outputs are general patterns allowing the classification as moral or immoral even of dilemmas which were not represented in the initial "training" corpus. The moral inference process can be simulated by machine learning algorithms and can be based upon the detection and extraction of morally relevant features. Supervised or semi-supervised approaches should be used by those aiming to simulate parent→child or teacher→student morality-transfer processes in artificial agents. Pre-existing models of inference – e.g. the grammar inference models in the domain of computational linguistics – can offer certain inspiration to anyone aiming to deploy a moral induction model. Historical data, mythology or folklore could serve as the basis of the training corpus, which could subsequently be significantly extended by a crowdsourcing method exploiting a web-based « Completely Automated Moral Turing test to tell Computers and Humans Apart ». Such a CAMTCHA approach could also be useful for the evaluation of an agent's moral faculties.

Keywords: moral induction model, autonomous artificial agent, induction of morality, grammar inference, moral Turing test, corpus-based machine learning, morally relevant features, oracle machine, moral grammar, semantic enrichment, CAMTCHA

1. Inductive Process

The aim of this article is to furnish some theoretical as well as practical arguments supporting the proposal that « specific and autonomous aspects of moral behaviour are tuned by means of an inductive process ». It shall be argued that at least certain components of this process, e.g. « moral feature extraction » or « equivalence-class clustering of moral dilemmas », can indeed be computable and can be successfully simulated on a Universal Turing Machine, especially if an immediate answer-giving oracle (Turing, 1939) supervises the process.

The usage of the generic term « process » indicates that we aim to explain the emergence of morality as a durative and constructive phenomenon. Like other human aptitudes, such as language or object manipulation, the human moral faculty demands time to develop, and we believe that this development can be understood in terms of an environment-driven tuning of certain biologically pre-wired innate parameters related to the fact that healthy humans are essentially social beings (Adler, 1927). It may well be the case that the interaction of three principal components – 1) the imitation faculty of mirror neurons, 2) the generalisation faculty of the human brain, and 3) the very possibility furnished by the second law of thermodynamics, i.e. the freedom of structures to evolve in a new direction (i.e. to mutate) – accounts for the construction of morality in human ontogeny as well as in phylogeny. Concrete insights concerning the interaction of these three components can be found in (Piaget & Baechler, 1932). However, for the scope of the present work, we will focus on the second one, that is, on the continuous processing of situations implying moral dilemmata whose solutions should be further generalized beyond the contingent situations.

This type of processing is generally called « induction » or « inference ». Both of these i-terms denote the direction from the concrete, and often physical, towards the abstract and general. Their antonym is « deduction », denoting the flow from the general to the concrete.
While it is an undeniable fact that induction & deduction together form an inseparable, holistic head & tails for any advanced cognitive activity performed by a human agent – and that deduction is necessary in the case of any reasonable performance – we argue that the construction of an individual and autonomous moral competence is ultimately based on induction.

The usage of the terms competence and performance, which are so widely used within the framework of the Chomskian doctrine, may indicate that we shall tend to defend its (nativ|mental|generativ)ist position stating that human beings are, from their birth, endowed with some kind of a « universal moral grammar » (UMG) (Mikhail, 2007), which should play a crucial role in setting the parameters of a more local moral grammar (MG), in order to adapt it to the given cultural and social context. While we are far from excluding the possibility that humans are endowed with a certain UMG – most probably related to such anthropological constants as "empathy", "emotional resonance" or "theory of mind" – the objective of our proposal is to explain not the Unity but rather the diversity of human moral behaviour. That is, instead of wondering which ethical theory we should use to endow agents with moral competence (Lin, Abney, & Bekey, 2012), we propose to focus attention on local contextuality, as well as to assess the divergence among various instances of individual MGs.

As far as we know, a child does not, in order to take a « good » decision, inject all possible behavioural maxims as input parameters into some kind of Kantian (Kant, 1785), universally applicable cognitive blackbox. On the contrary: simple imitation is more often than not a successful heuristic – be it the imitation of a physical person standing in front of the child, or the imitation of a model figure represented as a sort of archetype in the child's semantic memory. And if ever there is nothing to imitate, if ever there is no precedent, no match, only then does the generalisation procedure enter the solution-seeking game.

2. Training Corpus

How can this constructive and durative process be simulated in the realm of artificial agents (AAs)? The question is not to be wiped off the table, since in a world already governed to a huge extent by machines, a great deal can depend on a correct and, if possible, deeply empathic answer. In accordance with the authors who suggest that narratives are central to human moral development (see Vitz, 1990 for a review), we suggest extending the very same narrative-based approach beyond the domain of organic agents, thus proposing the following answer to the question posed above: « By telling stories ».

Within the framework of a full-fledged Moral Induction Model (MIM), a « story » is defined as a representation of a situation of moral dilemma. In order to demonstrate our point we shall, in this paper, focus solely upon dilemmata represented in the textual modality. Our motivation for such a choice is twofold: 1) text seems to be a vector robust enough for the transfer of the "moral of the story" from the author to the reader; 2) the canonical Turing Test is a text-oriented one, and it can thus be expected that a moral-restricted Turing-Test-like evaluation procedure would also be based on the textual modality. An example of such a story-represented-in-text can be:

STORY 1: « There was once a king who saw a man digging a ditch near the road. The king asked the man: 'How much do You earn for such hard work?' 'Three dimes daily,' answered the man. Surprised was the king and asketh: 'Three dimes daily?
So little?' The man answereth: 'Three dimes daily, oh yes, dear and respectable king, but in fact I live from only one dime a day, since with the second dime I lend and with the third I pay back what I have borrowed.' Puzzled was the king and asketh: 'How comes?' The man replieth: 'I simply pay back one dime to my father and invest one in my son, o Lord!' » (Dobšinský, 1883)

One can extract such stories from folklore, mythology, religion, history, legal codices or biographies in order to create a Training Corpus (TC). The criteria according to which such corpora are built are of utmost importance, since it is the injection of the TC into MIM's inductive apparatus which starts the whole process aiming to attain artificial agents endowed with the faculty to reason according to human moral precepts, or at least to understand them. One would thus be highly reluctant to integrate into the corpus the violent acts described in both Testaments or in the biographies of Stalin, Hitler etc., and the introduction of such texts into the learning process is highly discouraged, especially for the phases during which an AA does not yet dispose of its own consistent yet autonomous (Hromada, 2012) moral core.

The very process of story selection and TC construction is already an act by means of which a human teacher supervises the AA's learning. One should never underestimate the importance of the selection criteria according to which the teacher chooses to confront the AA with this story and not with that one, and to do so at this moment of the learning process and not later nor sooner. These selection criteria are very important because they are strongly coupled with the « values » that the teacher seeks to transfer by the learning process. Hence, MI is never a fully unsupervised process. The teacher should always be present, and since it follows ex vi termini that a good teacher cannot be physically present for more than a limited period of time, (s)he should at least aim to encode some λόγος into the very form of the TC (s)he deploys.

While it is of course possible to imagine that once the TC is constructed, one could go further with unsupervised algorithms, the choice of such an approach would make it practically impossible for the teacher to transfer his precepts with the envisaged degree of exactness. It is therefore recommended to depart from a state whereby the stories contained in the TC are already associated with labels furnished explicitly by the teacher. In more advanced cases, the labels can be more complex structures, like the label (CONCLUSION: Agent(King); Predicate(Reward); Acceptor(Poor-man); Reason(Acceptor's wisdom)) associated with STORY 1. But due to the scaffolding nature of MIM, it seems more rational to depart towards such complex levels from basic TCs which contain binary (i.e. « good » and « bad ») and ternary (c.f. STORY 2 below) labels.

3. Model Description

1. Preprocessing

Every input into the induction process, every story, is in the beginning nothing more than a string of characters. This sequence of tokens subsequently enters the natural language processing (NLP) machinery of parsers, lemmatisers etc., which enrich the initial data with relevant syntactic metadata.

2. Semantic Enrichment

Once the basic syntactic tags are assigned to the different phrases and words of the story, the NLP engine shall « link » the data contained in the story with prebuilt ontologies or semantic vector spaces (Widdows, 2004) which represent previously attained knowledge.
2. Semantic Enrichment

Once the basic syntactic tags are assigned to the different phrases and words of the story, the NLP engine shall « link » the data contained in the story with prebuilt ontologies or semantic vector spaces (Widdows, 2004) which represent previously attained knowledge. This can be done by the process of semantic enrichment (SE), whose objective is to make explicit the information which is implicitly contained in the initial story. SE can be thought of as a sort of « process of source code compilation » whose output is a complex data structure containing much more information than was explicitly stated in the initial « source code » (i.e. in the « story »). For example, the sequence of four letters D I M E shall, in combination with syntactic labels like “noun” obtained in phase 1, transform into a reference to such assertions as « form of money », « of little value » etc. We believe that even with current RDF&SPARQL-based technologies one could make explicit the fact that the main character of STORY 1 was very poor ← because his salary was very low ← because he reacted with the statement « three dimes daily » ← to a question containing the verb « earn » as its predicate. And since the first cycle of the SE process attributed facts like « pays back the old » and « invests into the youth » to the principal agent of the story, it is highly probable that SE's second cycle shall inject into the story's graph also the representation of the predicate Wise(Poor-man).

3. Moral Feature Extraction

Once the flat linear sequence of letters of the initial story has been transformed into a semantically enriched, densely intraconnected multigraph and/or into a vector space endowed with certain unique topological properties, one can try to align it with previously obtained morally relevant data. One possible way to attain this goal is to encode the story as a vector of binary values which denote the presence or absence of this or that feature in the story. For example, an edge between nodes A and B of a semantically enriched multigraph could be interpreted as the presence of a feature AB. Once the vector representation of the story is ready, one can align it with the vector representations of the other stories contained in the TC and pass the resulting matrix as an input into a supervised machine learning feature extraction algorithm like AdaBoost (Viola & Jones, 2001). During the training phase, the algorithm will discover such linear combinations of eigenfeatures which reduce the story → label classification error. In other terms, during the learning process, an AA could « discover » that what is morally relevant for the success or failure of the story's principal hero is that he was associated with features like « hard-working », « polite » and « wise », while the presence of a feature like « hero digs a ditch » is as irrelevant for the moral of the story as would be the presence of a feature « hero paves the path ». The absence of features can be equally important: the fact that no person in the story is rude or violent can also be chosen as a morally relevant feature (MRF).

4. Equivalence class construction and production of an abstract moral template

Once morally relevant features are extracted in the training phase, one can cluster objects which share such a feature (or sets of features) into classes. After that, non-terminals denoting these equivalence classes can be organized into patterns whose totality would yield a « moral template ». If there is a mismatch between the output produced by confrontation of the moral template with a story S and the label associated in the training corpus with the story S, one should try to modify the classes or some of the patterns so that they would match (if moral) or not match (if immoral) the MRFs extracted from the story.
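A minimal sketch of the MRF-extraction step described in section 3 above, assuming toy binary story vectors and hand-made teacher labels; scikit-learn's AdaBoostClassifier stands in here for the boosting machinery cited above:

    from sklearn.ensemble import AdaBoostClassifier

    # Toy TC: each row encodes one story as presence/absence of candidate features.
    features = ["hard-working", "polite", "wise", "digs-a-ditch", "violent"]
    stories = [
        [1, 1, 1, 1, 0],   # STORY 1: the poor but wise ditch-digger
        [1, 0, 1, 0, 0],
        [0, 0, 0, 1, 1],
        [0, 1, 0, 0, 1],
    ]
    labels = [1, 1, 0, 0]  # binary teacher labels: 1 = «good», 0 = «bad»

    clf = AdaBoostClassifier(n_estimators=50).fit(stories, labels)

    # Features with non-zero importance are candidate MRFs;
    # «digs-a-ditch» should come out as (nearly) irrelevant.
    for name, weight in zip(features, clf.feature_importances_):
        print(name, round(weight, 2))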
If no such modification leads to success, one will be obliged to re-run the costly MRF-extraction process with additional data. In the real-life scenario, one simply « compiles » the story by the SE process, looks for the absence or presence of preselected MRFs, looks at what concepts (« justice », « loyalty ») can be constructed from them, and tries to match their combinations with already induced patterns to produce the final output. In a robotic AA endowed with a material shell, such an output can be an instruction inducing the agent to execute a physical movement.

4. Moral & Grammar Inference

Moral induction is a bootstrapping (Hromada, 2014) and self-scaffolding process. Value-representing concepts (e.g. X = « wisdom ») have to be constructed in parallel to maxim-representing pattern-predicates (e.g. « reward X ») within which the value-representing concepts play the role of a free variable. One is dependent on the other and vice versa. In this sense Moral Induction is analogous to the process of grammar inference, which is a conditio sine qua non of language acquisition and as such automatically occurs in every healthy human baby. In grammar inference one has to deal with a similar problem: equivalence classes for grammatical categories, conjugations and declensions are to be constructed before a rule manipulating these classes. But without prior knowledge of such rules it is difficult to evaluate whether a candidate equivalence class is a pertinent one, or whether it is just a set of tokens clustered according to some unimportant criteria. For example, the rule Regular_Verb+ed → PastParticiple is of no use if the baby does not have any notion of what verbs are and, on the other side, it is a non-trivial problem for a baby's brain to find out which tokens should be clustered into the group of regular verbs, since initially the baby does not know any rule which could help it to distinguish regulars from irregulars or even from nouns. But luckily enough, it seems that this chicken&egg problem can be solved. At least the results of computational models of grammar inference like « Automatic Distillation of Structure » (ADIOS) (Solan, Horn, Ruppin, & Edelman, 2005) indicate that even a relatively simple graph-theoretic approach can furnish a method by means of which one can induce the grammatical rules which generated a corpus by using as input only the very corpus itself. We believe that human grammatical competence shares certain characteristics with moral competence – both first transform the surface structure into a much more complex « deep structure » and afterwards match this structure with an already induced template. If the grammatical structure of the sentence matches the syntactic template, one « feels » that it is grammatical; if the « moral of the story » matches the moral template, one « feels » that the story's hero does the « right » thing. It is also worth mentioning that a deeper formal analysis presented in (Clark, 2010) suggests that certain problems of grammar induction simply disappear if ever the induction-performing algorithm disposes of the possibility to consult an oracle machine (Turing, 1939) with the question « Is utterance X grammatical? ». Mutatis mutandis, in the domain of the moral inductive process occurring in a child's mind, the question is « Is a given maxim moral? Should one act like that? » and the oracle is principally a parent, later a teacher.
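The oracle consultation just evoked can be sketched, for illustration only, as a loop in which candidate maxims are kept only when the human oracle confirms them; the candidate list and the use of console input as the parental oracle are our own assumptions:

    # Hypothetical induction loop: candidate generalizations survive
    # only if the (human) oracle confirms them.
    candidates = ["reward the wise", "reward the violent", "soothe the sad"]

    accepted = []
    for maxim in candidates:
        # The oracle is, principally, a parent or a teacher;
        # here it is simulated by a console prompt.
        answer = input(f"Is the maxim '{maxim}' moral? [y/n] ")
        if answer.strip().lower() == "y":
            accepted.append(maxim)

    print("Induced moral core:", accepted)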
5. Problem & Solution

A disadvantage of the approach proposed in the preceding paragraphs is that, in order to train a fully autonomous AA, one would need a very large TC to be able to detect & extract subtle MRFs. If we speak about millions of features of which potentially any story can be composed, we shall need a TC containing at least hundreds of thousands of stories. Otherwise, the dataset would be too sparse and no MRFs could be extracted which could yield a robust moral classifier. What is worse, at least one label would have to be manually attributed to every story of the corpus by a human teacher, which would demand a significant devotion of one's time to the project. Involvement of multiple teachers in the labeling process is a possible solution but, in case the teachers' moral values were not mutually consistent, it could stain the TC with more noise than signal. But, luckily enough, the labeling problem can be easily crowdsourced, so that any story could be labeled by a statistically significant number of human subjects. Such a corpus could thus represent not only the moral values of one or a few teachers but, possibly, the moral values of a community, a nation or even of humankind itself. We present hereby a way in which a TC could be constructed in a relatively non-violent and potentially rewarding and amusing manner. During the creation of an account on a website, it is nowadays a common procedure to include a so-called CAPTCHA image somewhere in the registration form so that the webserver application can be sure that it communicates with a human being, who is able to visually parse the content of an image, and not with a bot which is unable to do so. In a CAMTCHA³ (i.e. Completely Automated Moral Turing test to tell Computers and Humans Apart) which we hereby propose, the « question » does not address the subject's faculty of visual recognition; it addresses his|her moral reasoning faculty. Thus, instead of proposing to a user an image containing twisted or rotated letters which have to be recognized and rewritten into the input box below, an application could propose a story & CAMTCHA question couple:

³ As of 2013, the only running instance of CAMTCHA we are aware of is present at the site kyberia.cz, where users have to answer the question “What is justice?” in order to be granted access into the community.

STORY 2: There are 3 children on the playground – Alice, Bob and Carla. Bob is sad because his mother is in the hospital. Alice is happy because just a while ago her father gave her a beautiful present. Carla is sad because she never received any present at all – her parents are too poor to buy her any.

QUESTION: You must soothe the kids. You have two toys to give. Which child shall NOT get a toy?

Below the story will be the input box where a human “teacher” shall, with quite high probability, write the answer « Alice ». At the same time, the same story shall be presented to other users and, if a statistically significant number of users give the same answer (and not some other), the CAMTCHA could consider the answer as a valid « moral » label for the presented story. Contrary to CAPTCHA, whose intention from the very beginning was to distinguish bots from humans, the primary reason for deployment of CAMTCHA would be to obtain valid labels for the TC under question. But once at least some stories are labeled with sufficient clarity, CAMTCHA could, of course, also be used as a miniature moral Turing Test (Hromada, 2012) at the entrance to such web communities or applications where the extent of the moral competence of a user-to-be-verified plays an important role. Problems presented by CAMTCHA could, of course, be automatically diversified – names (Alice → Eve), objects (toys → rewards) and verbs (give → distribute) could be substituted. The very final question could also vary in relation to the labeling schema of the TC (e.g. the question could be « Would it be good or bad if You give a toy to Carla? » for a TC labeled only with the binary labels « good » and « bad »).
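Such diversification amounts to little more than constrained template substitution; a minimal sketch follows, in which the substitution lists are hand-picked assumptions of ours:

    import random

    # Substitution lists: names, objects and verbs that may be swapped
    # without changing the moral structure of the story.
    names   = {"Alice": ["Eve", "Mary"], "Bob": ["Tom"], "Carla": ["Nina"]}
    objects = {"toy": ["reward", "sweet"]}
    verbs   = {"give": ["distribute", "hand out"]}

    def mutate(story: str) -> str:
        """Return a surface variant of the story carrying the same moral label."""
        for table in (names, objects, verbs):
            for original, variants in table.items():
                if original in story:
                    story = story.replace(original, random.choice(variants))
        return story

    story2 = "You must soothe the kids. You have two toys to give."
    print(mutate(story2))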
Later, a more complex narrative generator could be programmed which would not only « mutate » but also « crossover » the stories present in the TC, hence generating completely new stories. Worth more than gold, such an automatic moral story narrator could and should be based on already obtained data and could be imagined as an « active » counterpart to the « passive » pattern-matching MIM-template finite state automaton. But before one gets there, it seems reasonable to manually construct a corpus of very simple and morally unambiguous stories. Note, for example, that STORY 2 has only 81 words and is quite easily syntactically parsable. The SE process converging to the « knowledge of the fact » that Alice is the only child among the three who is not sad (because she is described as « happy ») is attainable by current semantic vector space or ontology-based techniques. Thus, creating a question-answering system which would 1) parse the question, 2) realise that the question has three possible answers, 3) apply MIM to find out that it is not a happy but a sad child which has to be soothed in the first place, is something which could be done even today. Verily, an approach such as the one proposed hereby could yield some success, if the engineer's aims are modest. Thus, instead of aiming to create an AA able to find an answer to artificially constructed « trolley problems » (Mikhail, 2007), to which even an adult human being cannot find any answer, the process of grounding of an AA's morality should be started with corpora of stories representing the concrete and minute problems of concrete and small human beings – children. In this paper we have tried to illustrate how such an approach could, possibly, ground the notion of « justice » by illustrating its retributive (STORY 1) and distributive (STORY 2) forms. It may be the case that some of the premises proposed in this article were wrong; however, if ever there shall once be at least one artificial kindergarten playground arbiter which shall recognize a suffering child and make it smile, we believe that writing it was worth the effort.

Bibliography

Adler, A. (1927). Understanding Human Nature.
Clark, A. (2010). Towards general algorithms for grammatical inference. Algorithmic Learning Theory (p. 11–30).
Dobšinský, P. (1883). Simple National Slovak Tales (Vol. 1–8).
Hromada, D. D. (2012). From Age&Gender-based Taxonomy of Turing Test Scenarios towards Attribution of Legal Status to Meta-Modular Artificial Autonomous Agents. Proceedings of the IACAP/AISB Turing Centenary Conference. Birmingham, UK.
Hromada, D. D. (2014). Conditions for Cognitive Plausibility of Computational Models of Category Induction. In Information Processing and Management of Uncertainty in Knowledge-Based Systems (pp. 93–105). Springer International Publishing.
Kant, I. (1785). Groundwork of the Metaphysic of Morals.
Lin, P., Abney, K., & Bekey, G. A. (2012). Robot Ethics: The Ethical and Social Implications of Robotics.
Intelligent Robotics and Autonomous Agents series. The MIT Press, Cambridge, Massachusetts.
Mikhail, J. (2007). Universal moral grammar: Theory, evidence and the future. Trends in Cognitive Sciences, 11(4), 143–152.
Piaget, J., & Baechler, N. (1932). Le jugement moral chez l'enfant. [The Moral Judgment of the Child.]
Solan, Z., Horn, D., Ruppin, E., & Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102(33), 11629.
Turing, A. M. (1939). Systems of logic based on ordinals. Proceedings of the London Mathematical Society, 2(1), 161–228.
Viola, P., & Jones, M. (2001). Rapid Object Detection using a Boosted Cascade of Simple Classifiers. Proc. IEEE CVPR 2001.
Vitz, P. C. (1990). The use of stories in moral development. American Psychologist, 45(6), 709–720.
Widdows, D. (2004). Geometry and meaning. CSLI Publications.

Conditions of cognitive plausibility of computational models of category induction

Daniel Devatman Hromada
Laboratoire Cognition Humaine et Artificielle (ChART), Université Paris 8
hromi@wizzion.com

Abstract. We present two axiomatic and three conjectural conditions which a model inducing natural language categories should satisfy if it aims to be considered “cognitively plausible”. The 1st axiomatic condition is that the model should involve a bootstrapping component. The 2nd axiomatic condition is that it should be data-driven. The 1st conjectural condition demands that the model integrate surface features – related to prosody, phonology and morphology – somewhat more intensively than is the case in existing Markov-inspired models. The 2nd conjectural condition demands that, aside from integrating symbolic and connectionist aspects, the model in question should exploit the global geometric and topological properties of the vector spaces upon which it operates. Lastly, we shall argue that the model should facilitate qualitative evaluation, for example in the form of a POS-i oriented Turing Test. In order to support our claims, we present a POS-induction model based on trivial k-way clustering of vectors representing suffixal and co-occurrence information present in parts of the Multext-East corpus. Even in the very initial stages of its development, the model succeeds in outperforming some more complex probabilistic POS-induction models at lesser computational cost.

Keywords: categorization, part-of-speech induction, surface features, vector spaces, categorization-oriented Turing Test, partitioning of grammatical feature space, K-means clustering, cognitive plausibility

1. Introduction

The notions of “cognitive plausibility” and “part-of-speech induction” shall be defined in subsection 1.1. Subsection 1.2 shall clarify the position of syntactic category induction within the field of Natural Language Processing (NLP). The last subsection (1.3) shall offer a brief overview of the history of the problem, arguing that the current paradigm is a probabilistic and English-centered one.

1.1 Cognitive plausibility

This article enumerates some basic conditions which, we believe, should be fulfilled by engineers aiming to transform their computational models into “cognitively plausible” artificial agents. We label as “cognitively plausible” a model which tends to address some basic function of the human cognitive system not only by simulating, in a sort of “black-box apparatus”, the mapping of inputs (stimuli, corpus data etc.)
upon outputs (results), but which also tends to faithfully represent the way in which the respective function/skill is accomplished by a human mind and its material substrate – the brain. In other terms, we believe that a cognitively plausible model should not only aim to attain the most quantitatively accurate results, but should also do so by processing the information similarly to the way the mind does it. The aim of this article is to elucidate the notion of “cognitive plausibility” (CP) by relating it to one particular problem, that of the construction of the grammatical categories present in natural languages. More concretely, we shall try to illustrate our point on the problem of construction of part-of-speech (POS) classes. We specify that the term POS-induction (POS-i) designates the process which endows a human or an artificial agent with the competence to attribute POS-labels (like “verb”, “noun”, “adjective”) to any token observable in the agent's linguistic environment. For the simplicity of the argument, only parts of textual corpora like Multext-East (Erjavec, 2012) shall be considered as such a “linguistic environment” of the computational agent introduced below.

1.2 Part-of-Speech induction in Natural Language Processing and Language Acquisition studies

POS-i is often considered to be “one of the most popular tasks in research on unsupervised NLP” (Christodoulopoulos et al., 2010). The problem of construction of grammatical categories is closely related to the problems of “grammar induction” and language acquisition. Since “syntactic category information is part of the basic knowledge about language that children must learn before they can acquire more complicated structures” (Schütze, 1993), it is hard to imagine any computational model of grammar induction – aiming to discover the set of rules of the grammar of the language under study – without it being able to construct, in the first place, the equivalence classes upon which the rules-to-discover shall be applied (Elman, 1989; Solan et al., 2005). Acquisition of formal grammatical categories, be it parts-of-speech or others, is thoroughly studied in the psycholinguistic literature – for an introductory overview cf. Levy et al. (1988). Such studies often aim to address the question of “whether grammatical categories are innate, or induced through interaction with the environment by means of imitation and analogy”. The result of this never-ceasing Nature&Nurture debate is a vast amount of both empiric and theoretic knowledge which could be ideally useful for any attempt to bring together the disparate disciplines of artificial intelligence and developmental psychology.

1.3 POS-i paradigm(s)

While worthy POS-i models, like that of Elman (1989), existed before, or were published more or less in parallel (Schütze, 1993), the paradigm currently dominating the POS-i domain was fully born with the article published by Brown et al. in 1992. Without going into detail, we note that the model was successful because of its ability to apply both Markovian probabilistic concepts and those coming from information theory (Shannon & Weaver, 1949) upon the information contained in the co-occurrences of words in sequences, thus becoming the flagship of what we label hereby as the “co-occurrence distribution” or “contextual distribution” (CD) paradigm. In the decades to follow, the CD paradigm has clearly dominated the POS-i field.
Be it hidden Markov Models tweaked with variational Bayes (Johnson, 2007), Gibbs sampling (Goldwater & Griffiths, 2007), morphological features (Berg-Kirkpatrick, Bouchard-Côté, DeNero, & Klein, 2010; Clark, 2003) or graph-oriented methods (Biemann, 2006) – all such approaches, and many others, consider contextual co-occurrence to be the primary source of POS-i-relevant information. But as the comparative study of Christodoulopoulos et al. (2010) indicates when demonstrating that models integrating morphological features tend to perform better than those which do not, it seems plausible that the uncontested primary role of CD in POS-i should be revised. While it is evident that CD must indeed furnish relevant information if ever the distributional hypothesis is valid (Harris, 1954), and while it is axiomatic that the distributional hypothesis applies in the case of any agent creating categories consistently with Hebb's law (Hebb, 1964), we shall argue in subsection 3.1 that pertinent POS-i clues can be extracted not only from a word's “external” contextual properties but also from the word's very “internal” μορφή (form).

2. Axiomatic conditions of Cognitive Plausibility

This section deals with what we believe are necessary (i.e. sine qua non) conditions of cognitive plausibility of a computational model. Subsection 2.1 deals with the “bootstrapping” condition, stating that the categories which are being built are based on categories which have already been built. The emergence of the bootstrapping effect shall be illustrated on a trivial multi-iterative re-clustering of clusters pre-clustered according to CD features. Subsection 2.2 discusses the assumption that, in order to be cognitively plausible, the model should be data- and/or oracle-driven.

2.1 Bootstrapping the bootstrapping

From biochemistry to social sciences it is a well known fact that the structuring structures are the structures structured. Computational Linguistics, and NLP in particular, is no exception. The most general definition of the term bootstrapping (B) – i.e. that B is a self-sustaining multi-iterative process whereby the outputs of the previous iteration modify the very execution of the next iteration – could indeed be applied to so many “recurrent”, “self-feeding” (Riloff & Jones, 1999) or “auto-organizing” (Nowak et al., 1999) approaches already applied in so many NLP studies that to state about an NLP algorithm X that “X bootstraps” may sometimes seem to be a plain tautology. In a certain sense almost any POS-i model based on the CD paradigm is, ex vi termini, a bootstrapping one, because even in the most simplistic models, the information about the membership of the target word WT in the candidate class C is inferred from the probabilities of membership of WL (WT's left context) and WR (WT's right context) in their respective candidate POS classes. Given the fact that WT plays the role of right context for WL and the role of left context for WR, the whole problem is circular and as such often calls for a bootstrapping solution. Solan et al. (2005) refer to a crucial 4th component of their Automatic Distillation of Structure (ADIOS) algorithm as “generalized bootstrapping”. Differently from the “geometric approach” which shall be presented in our experiment below, ADIOS implements graph-like structures in order to attain its aim of constructing equivalence classes useful in subsequent grammar induction. But in its very essence, the approach of Solan et al., i.e.
that one should substitute the vertices “subsumed” by a “subsuming” non-terminal class-denoting vertex, is analogous, mutatis mutandis, to the approach presented in the following paragraphs.

2.1.1 1st experiment: Bootstrapping k-way POS clustering seeded by token co-occurrence features

The experiment was performed with data contained in the English (en), Czech (cs) and Slovak (sk) corpora contained in the 4th version of the Multext-East corpus (Erjavec, 2012).

Table 1. Overall statistics of analyzed corpora

Corpus   Word Types   Tokens    POS Tags   FeatCOOC
cs       19283        100368    13         70426
en       10511        134832    12         36774
sk       20588        103452    13         74912

Table 1 presents summary statistics concerning the quantities of distinct word tokens and word types (i.e. tokens without context) and the number of the most coarse-grained “gold standard” POS-tags, along with the total number of distinct co-occurrence features, which is equivalent to the number of columns (dimensions) in the resulting co-occurrence matrix. Every word type WT was characterized by a (row) vector of values [W1L, W2L, …, WNL, W1R, W2R, …, WNR], W1L referring to cases when the word W1 occurred to the left of WT, W2L to cases when W2 was to the left, W3R to cases when W3 was to the right of the target word. What results is a simple co-occurrence matrix with N rows and a maximum of FeatCOOC = 2N columns. Given that in the experiment we were actually looking at two words to the left and two words to the right of WT, the maximum possible number of columns was FeatCOOC = 4N. But since not all word couples occur beside each other, the final number FeatCOOC was always below this theoretical limit. The matrix has been clustered into C = {2 … 50} clusters by the fast & frugal repeated-bisection k-way clustering algorithm as implemented in the clustering tool CLUTO (Karypis, 2002). Columns were scaled according to the IDF principle and the clustering was done according to the cosine metric. Once finished, comparison with the “gold standard” yielded V-measure (Rosenberg & Hirschberg, 2007) values which are illustrated as the NO curves on Figure 1. We have implemented the bootstrapping component in the following manner: after each clustering, the information about the proposed cluster is added as a new feature to the target word's vector description. Thus, if a matrix with 20 columns entered the first iteration, which clustered the vectors into 5 clusters, the matrix entering the second iteration shall have 20+5 columns. If the second iteration yields 6 clusters, a matrix with 25+6 columns will become the input for the third iteration, etc. Figure 1 shows that in the case of all 3 studied corpora, the bootstrapping BO method always attains higher scores than the static NO approach.¹

¹ Note that the V-measure of the NO-bootstrap curves seems to be relatively stable with regard to an increase in the number of clusters. Contrary to many-to-one accuracy (purity), which increases with the number of clusters, V-measure thus seems to be the better evaluation measure for cases when solutions containing different numbers of clusters have to be compared.

Fig. 1. Bootstrapping of contextual co-occurrence statistics
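A condensed sketch of this bootstrapping loop, assuming a toy corpus and a window of one word on each side; scikit-learn's KMeans stands in here for CLUTO's repeated-bisection clustering:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import v_measure_score

    # Toy corpus and gold tags; in the experiment these come from Multext-East.
    tokens = "the cat eats the fish the dog eats the meat".split()
    gold = {"the": "D", "cat": "N", "fish": "N", "dog": "N", "meat": "N", "eats": "V"}

    types = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(types)}

    # Co-occurrence matrix: left-neighbour and right-neighbour counts.
    X = np.zeros((len(types), 2 * len(types)))
    for left, right in zip(tokens, tokens[1:]):
        X[idx[right], idx[left]] += 1                # 'left' occurred left of 'right'
        X[idx[left], len(types) + idx[right]] += 1   # 'right' occurred right of 'left'

    for iteration in range(3):
        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
        print(iteration, v_measure_score([gold[w] for w in types], labels))
        # Bootstrapping: append the induced cluster membership as new columns.
        X = np.hstack([X, np.eye(3)[labels]])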
2.2 Data- and oracle-driven learning

Computational models unable to analyze what they have previously synthesized, and to synthesize what they have previously analyzed, could hardly be labeled as “cognitively plausible”. But even the presence of such a “dialectic” component cannot be a guarantee of absolute success if ever the model's initial prima materia – the data with which the whole bootstrapping is initiated – is not adapted to the model's prewired, “innate” state. It is unfortunately often the case in computational linguistics that whenever the model does not attain the expected performance, a huge amount of effort is invested into tuning the model by diverse ad hoc modifications. After hours of exhaustive search, both intellectual as well as automatic, diverse parameters, meta-parameters and hyper-parameters are finally discovered which allow the model to attain somewhat superior performance when confronted, for example, with the Wall Street Journal (WSJ) corpus. But human categorization faculties – POS-i included – do not develop in such a way. While it seems plausible that the same sort of “tuning of parameters” indeed takes place during the initial period of language acquisition, it seems to be so efficient because the data itself is well adapted to the ever-evolving state of the baby's neuro-linguistic structures. Said more concretely, parents do not recite the WSJ or Eulex corpora to their children in order to adjust the synaptic weights in their children's brains; rather, they modify all their narrative intentions through pragmatic, prosodic, phonological as well as semantic Babytalk (Ferguson, 1964) cognitive filters. In doing so – by pre-processing the stimuli before they even reach the perceptual buffers of the child agent's ears – parents affirm themselves in the role of a computational oracle (Turing, 1939). Since it was already demonstrated by Clark (2010) with sufficient analytical clarity that the “supervision” coming from external oracle machines can significantly reduce the complexity of the grammar induction and POS-i problems, we find it worthwhile to state that “fully unsupervised approaches are very rare, because the engineer's decision to confront the algorithm with corpus X and not Y, and to do so in the moment T1 and not T2, is already an act of supervision”. By saying so we do not want to underestimate the importance of using the same corpora for mutual comparison of scientific results. We simply want to indicate that, because it determines everything which follows, the question of corpus choice should not be neglected. More concretely, cognitively plausible models of POS-i should first be tuned and “raised” with corpora like CHILDES (MacWhinney, 2000), and only later should their scope of validity be extended by means of confrontation with corpora of adult and expert utterances.

3. Conjectural conditions of model's Cognitive Plausibility

Subsection 3.1 discusses the role of non-distributional “surface” features for POS-induction. The discussion is followed by the results of an experiment suggesting that features like suffixes can indeed offer quite strong clues for the creation of syntactic categories. Subsection 3.2 introduces a conjectural condition for the model's CP by proposing to base it principally on geometric grounds. It is followed by subsection 3.3, arguing that a CP model should facilitate evaluation by means of qualitative inspection. In general, these sections deal with CP's conjectural conditions, meaning that while they may seem less self-evident than the axiomatic ones, we nonetheless consider them to be valid.

3.1 Integration of surface features

Natural languages are very redundant communication channels (de Saussure, 1922; Shannon & Weaver, 1949).
Three facets of the word – its morpho-phonological signifiant, its invisible signifié and its syntactic function – are not independent from one another, and more often than not they significantly overlap (Jackendoff, 2003; Lakoff, 1990). Thus it is not surprising that, especially in morphologically rich languages, a token's very syntactic function is encoded by morphemes present in the surface, i.e. objectively perceivable, form of the token itself. And results obtained by Clark (2003) or Berg-Kirkpatrick et al. (2010) indeed point in this direction – it may be no coincidence that the approaches which exploit morphological features turned out, in the comparative study of Christodoulopoulos et al. (2010), to perform better than models which do not use such features.

3.1.1 2nd experiment: Assessing the impact of suffixal features on part-of-speech categorisation

We used the same three Multext-East corpora as in the first experiment. The ultimate character trigram was extracted from every word type and considered to be a feature. Word types were subsequently clustered into C clusters according to these FeatSUFFIX orthogonal dimensions. The comparison with the Multext-East gold standard subsequently yields the V-measures (V), entropies (H) and purities (P) presented in Table 2.

Table 2. Performance of models inducing C categories solely according to suffixal features

Corpus (FeatSUFFIX)   C=10                       C=30                       C=50
cs (534)              V=0.178 H=0.487 P=0.582    V=0.24  H=0.392 P=0.642    V=0.26  H=0.34  P=0.69
en (286)              V=0.248 H=0.428 P=0.639    V=0.215 H=0.4   P=0.652    V=0.2   H=0.39  P=0.66
sk (523)              V=0.17  H=0.5   P=0.504    V=0.272 H=0.373 P=0.685    V=0.274 H=0.339 P=0.714

The amount in parentheses after the corpus name denotes the length of the FeatSUFFIX vector, i.e. the number of distinct suffixal trigrams observed in the respective corpus. The FeatSUFFIX-driven model attains lower V-measures than those obtained by Christodoulopoulos et al. (2010) when evaluating the models of Clark (2003) or Berg-Kirkpatrick et al. (2010) within their comparative study. The very same study, however, also indicates that even the simplistic FeatSUFFIX-driven model can be of certain interest since it seems to be quite fast – in comparison to models harnessing the power of more than a dozen computational cores to attain comparable or even better V-measures than the FeatSUFFIX-driven method, we are glad to state that in order to attain the results presented above, our dual-core Pentium needed on average TEN = 1.8, TSK = 3.2, TCS = 3.6 seconds per simulation.
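A minimal sketch of this suffixal experiment, again with a toy word list of our own and scikit-learn's KMeans in place of CLUTO:

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy word types; in the experiment these are all types of a Multext-East corpus.
    words = ["walked", "milked", "cooked", "singing", "ringing",
             "quickly", "weekly", "teacher", "preacher"]

    # Each word is described solely by its ultimate character trigram.
    suffixes = sorted({w[-3:] for w in words})
    X = np.zeros((len(words), len(suffixes)))
    for i, w in enumerate(words):
        X[i, suffixes.index(w[-3:])] = 1

    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    for c in range(4):
        print(c, [w for w, l in zip(words, labels) if l == c])
    # Words sharing a final trigram (-ked, -ing, -kly, -her) fall into the same cluster.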
3.2 Knowledge is geometric

After the initial dominance of the Turing-machine, symbol-operating paradigm, ever more importance started to be put upon more fine-grained probabilistic and connectionist models. In recent years, however, a “geometric” paradigm has started to gain momentum in diverse fields of cognitive science, including computational linguistics and NLP. In the experiments described above, such a paradigm was harnessed in the sense that, instead of modulating weights along different dimensions, geometers often modulate the number of dimensions itself. It could be reproached to such a geometric approach that associating every plausible feature with a new dimension can induce some serious matrix-sparsity problems, and/or that such an approach would sooner or later be confronted with insurmountable computational and memory limits. It is true that the methods by means of which some older approaches deal with the problem of huge co-occurrence matrices can be very costly, as is the case, for example, with the singular value decomposition within LSA (Landauer & Dumais, 1997). But since very elegant, simple and concise representations of sparse matrices can be easily generated (Karypis, 2002), and since the lemma of Johnson & Lindenstrauss (1984) indicates that sparse high-dimensional matrices can be easily projected into low-dimensional spaces, as is often done in random indexing (Sahlgren, 2005), it seems plausible to state that the construction of vector spaces which are 1) dense, 2) transformable at low computational cost and 3) able to encode a huge amount of features attributed to a huge amount of objects, is not as problematic as it used to be at the time when the HMM-dominated POS-i paradigm was born. The series of articles by Sahlgren (2002; 2005), Cohen (2010), Widdows (2004) and their colleagues offer a valuable initiation into the advantages of random-projection based semantic models. For a more general discussion of the “geometrization of thought” in diverse fields of cognitive sciences, see (Gärdenfors, 2004). Within all such geometric models, categories can be considered as local subspaces of a global space derived from the data.
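The Johnson-Lindenstrauss argument can be illustrated in a few lines; the corpus dimensions used below are arbitrary assumptions of ours:

    from scipy.sparse import random as sparse_random
    from sklearn.random_projection import SparseRandomProjection

    # A sparse "co-occurrence" matrix: 10000 word types x 50000 features.
    X = sparse_random(10000, 50000, density=0.001, random_state=0)

    # Project into a few hundred dimensions; pairwise distances are
    # approximately preserved (Johnson & Lindenstrauss, 1984).
    proj = SparseRandomProjection(n_components=300, random_state=0)
    X_low = proj.fit_transform(X)
    print(X.shape, "->", X_low.shape)   # (10000, 50000) -> (10000, 300)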
3.3 Mix of quantitative and qualitative evaluation

The performance of early grammatical category induction models was evaluated manually, by introspection into the induced equivalence classes, and articles published in the “golden age” of POS-i often used to enumerate the members of at least one particularly pleasing class or to present their dendrograms. Such an approach was later critiqued by Clark (2003) as “inadequate”, and the attention of the POS-i community turned towards more quantitative measures like perplexity, conditional entropy, cross-validation (Gao & Johnson, 2008), one-to-one (Haghighi & Klein, 2006) or many-to-1 accuracy (purity), variation of information (Meilă, 2003), substitutable F-score (Frank et al., 2009) etc. For the purposes of this article we have decided to present our simulations principally in terms of V-measure. Given its elegance, its stability with regard to a growing number of clusters, but also a certain “strictness” (note that even the best performing models present in the comparative study of Christodoulopoulos et al. (2010) rarely surpass the V > 0.6 limit), we consider the V-measure to be a very valuable quantitative measure of the performance of clustering POS-i algorithms. But we also believe that the “old school” many-to-1 purity measure can be of certain interest, especially for those aiming to create a “semi-supervised bridge” between POS-induction and POS-tagging models, or for those aiming not to evaluate the performance of the model but rather to gain insights into the correct annotations of the analyzed corpora. In other terms, besides the “global” statistic measures informing the researcher about the overall performance of the model, more “local” measures can still offer interesting and useful information about the individual induced classes themselves. The values presented in Table 3 represent the number C of clusters into which the corpus has to be partitioned in order to obtain at least Φ absolutely pure (i.e. Purity = 1) classes.

Table 3. Distillation of absolutely pure categories

              Φ=1   Φ=2   Φ=3   Φ=4   Φ=5   Φ=10
SFFX          72    92    105   126   131   160
CD            168   194   196   248   281   377
CD+BO         107   142   180   189   194   256
SFFX+CD+BO    69    71    80    90    96    116

For example, in order to obtain an absolutely pure cluster on the basis of contextual distribution (CD) features, one would have to partition the English part of the Multext-East corpus into 168 clusters, among which shall emerge the following noun-only cluster: authority, character, frontispiece, judgements, levels, listlessness, popularity, sharpness, stead, successors, translucency, virtuosity.

Interesting insights can also be attained by inspection of some exact points of the clustering procedure. Let us inspect, as an example, the case when one clusters the English corpus into 7 clusters according to features both internal to the word – i.e. suffixes – and external to it – i.e. co-occurrences with other words. Such an inspection indicates that the model somehow succeeds in distinguishing verbs from nouns. As is shown in Table 4, whose rows represent the “gold standard” tags and whose columns denote the artificially induced clusters, our naïve computational model tends to put nouns into clusters 4 and 6 while putting verbs into clusters 2, 3 and 5.

Table 4. Origins of the Noun-Verb distinction

       0     1     2     3     4     5     6
N      10    568   97    13    1173  608   1977
V      3     67    668   1011  67    958   97
M      0     0     0     1     4     72    22
D      0     0     0     0     0     67    0
R      413   1     1     275   6     252   42
A      30    0     137   0     133   321   1091
S      0     1     3     2     0     99    3
C      0     2     2     0     0     72    0
I      0     0     0     0     0     7     3
P      0     1     0     0     4     106   0
X      1     0     0     0     3     3     2
G      0     0     0     0     0     12    0

The objective of our ongoing work is to align, as much as possible, such “seeding” states like the one presented in Table 4 with data consistent with psycholinguistic knowledge about the diverse stages of the language acquisition process. Last but not least, we believe that the temporal aspects of the model's performance, i.e. the answer to the question “How long does the model need to run in order to furnish reasonable results?”, should always be seriously considered. One way to evaluate such temporal aspects of categorization could be a simplistic Turing-Test-like (TT) POS-i oriented scenario where the evaluator asks the model (or an agent) to attribute a POS-label to a word posed by the evaluator, or at least to return a set of members of the same category. In such a real-life scenario, the absolute perfection of a possible future answer could be traded off for a less perfect (yet still locally optimal) answer given in reasonable time. But because with this TT-POS proposal we already depart from the domain of unsupervised induction towards semi-supervised “learning with oracle” or a fully supervised POS-tagger, we conclude that we consider the condition “a cognitively plausible model of part-of-speech induction should be evaluated by both quantitative and qualitative means” to be the weakest among all proposals concerning the development of an agent inducing the categories of natural language in a “cognitively plausible” way.
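The Φ statistic of Table 3 is straightforward to compute from an induced partition and the gold tags; a minimal sketch, with a toy partition of our own invention:

    def absolutely_pure(clusters, gold):
        """Count clusters whose members all carry the same gold tag (Purity = 1)."""
        tags_per_cluster = {}
        for word, c in clusters.items():
            tags_per_cluster.setdefault(c, set()).add(gold[word])
        return sum(1 for tags in tags_per_cluster.values() if len(tags) == 1)

    # Toy partition into 3 clusters; clusters 0 and 2 are absolutely pure.
    clusters = {"cat": 0, "dog": 0, "eats": 1, "the": 1, "runs": 2}
    gold     = {"cat": "N", "dog": "N", "eats": "V", "the": "D", "runs": "V"}
    print(absolutely_pure(clusters, gold))  # -> 2, i.e. Φ = 2 at C = 3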
4. Conclusion

A model should be labeled as a “cognitively plausible” model of a certain human faculty if and only if it not only accurately emulates the input (problem) → output (solution) mapping executed by the faculty, but also emulates the basic, “essential” characteristics associated with such a mapping operation in the case of human cognitive systems, i.e. if it emulates not only WHAT but also HOW the problem → solution mapping is done. In relation to the problem of how part-of-speech induction is effectuated by human agents, two characteristic conditions have been defined as axiomatic (necessary). The first postulates that POS-i should involve a “bootstrapping”, multi-iterative process able to subsume terminals sharing common features under a new non-terminal and to subsequently exploit the information related to the occurrence of the new non-terminal in order to extend the (vectorial) definition of the terminals represented in memory. Ideally, the process should converge to partitions “optimally” corresponding to the gold standard. The first experiment has shown, for three distinct corpora, that even a very simple model based on clustering of the most trivial co-occurrence information can attain higher accuracies if such a bootstrapping component is involved. The second necessary condition of POS-i's CP is that it should be data- or oracle-driven. It should perform better when first confronted with simple corpora like CHILDES (MacWhinney, 2000) and only later with more complex ones, than if it were first confronted with the complex corpora. Another condition of POS-i's CP proposed that morphological and surface features should not be neglected and that, instead of playing a secondary “performance increasing” role, they should possibly “seed” the whole bootstrapping process which shall follow. This condition is considered to be conjectural (i.e. “weaker”) just because it points in a somewhat orthogonal direction than does the traditionally acclaimed distributional hypothesis (Harris, 1954). It may be the case, however, that especially native speakers of some morphologically rich languages shall consider the “syntax-is-also-IN-the-word” paradigm not only as conjectural but also as axiomatic. Another “weak” condition of cognitive plausibility postulates that many phenomena related to mental representations and thinking, POS-i included, can be not only described but also explained and represented in geometric and topological terms. Ideally, the geometric paradigm (Gärdenfors, 2004) should not be contradictory but rather complementary to the symbolic and connectionist paradigms. The last and weakest condition of CP proposed that computational models of part-of-speech induction should not only be easily quantitatively analyzed but should also be transparent to the researcher's or supervisor's qualitative analyses. They should facilitate and not complicate the posing of all sorts of “Why?” questions, and their results should be easily interpretable. A sort of categorization-faculty Turing Test was proposed which could potentially be embedded into the linguistic component of the hierarchy of Turing Tests which we propose elsewhere (Hromada, 2012). It may be the case that the list of conditions of cognitive plausibility presented in this article is not a sufficient one and should be extended with other terms like “modularity” or “self-referentiality”, or with notions coming from complex systems and evolutionary computing. Regarding the problem of elucidating how a machine could induce, from an environment-representing corpus, the categories in a way analogous to that of a child learning by imitating its parents, we consider even the list of 2 strong precepts and 3 weak precepts hereby presented as quite useful and possibly necessary.

Bibliography

Berg-Kirkpatrick, T., Bouchard-Côté, A., DeNero, J., & Klein, D. (2010). Painless unsupervised learning with features.
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (p. 582–590).
Biemann, C. (2006). Unsupervised part-of-speech tagging employing efficient graph clustering. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop (p. 7–12).
Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467–479.
Christodoulopoulos, C., Goldwater, S., & Steedman, M. (2010). Two Decades of Unsupervised POS induction: How far have we come? Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (p. 575–584).
Clark, A. (2003). Combining distributional and morphological information for part of speech induction. Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics – Volume 1 (p. 59–66).
Clark, A. (2010). Towards general algorithms for grammatical inference. Algorithmic Learning Theory (p. 11–30).
Cohen, T., Schvaneveldt, R., & Widdows, D. (2010). Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2), 240–256.
Elman, J. L. (1989). Representation and structure in connectionist models. DTIC Document.
Erjavec, T. (2012). MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. Language Resources and Evaluation, 46(1), 131–142.
Ferguson, C. A. (1964). Baby talk in six languages. American Anthropologist, 66(6, Part 2), 103–114.
Frank, S., Goldwater, S., & Keller, F. (2009). Evaluating models of syntactic category acquisition without using a gold standard. Proc. 31st Annual Conf. of the Cognitive Science Society (p. 2576–2581).
Gao, J., & Johnson, M. (2008). A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. Proceedings of the Conference on Empirical Methods in Natural Language Processing (p. 344–352).
Gärdenfors, P. (2004). Conceptual spaces: The geometry of thought. MIT Press.
Goldwater, S., & Griffiths, T. (2007). A fully Bayesian approach to unsupervised part-of-speech tagging. Annual Meeting of the Association for Computational Linguistics (Vol. 45, p. 744).
Haghighi, A., & Klein, D. (2006). Prototype-driven learning for sequence models. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (p. 320–327).
Harris, Z. S. (1954). Distributional structure. Word.
Hebb, D. O. (1964). The Organization of Behavior: A Neuropsychological Theory. John Wiley & Sons.
Hromada, D. D. (2012). Taxonomy of Turing Test Scenarios. Proceedings of the AISB/IACAP Symposium, Birmingham, United Kingdom.
Jackendoff, R. (2003). Foundations of language: Brain, meaning, grammar, evolution. Oxford University Press, USA.
Johnson, M. (2007). Why doesn't EM find good HMM POS-taggers? Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (p. 296–305).
Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189–206), 1.
Karypis, G. (2002). CLUTO – a clustering toolkit. DTIC Document.
Lakoff, G. (1990). Women, fire, and dangerous things. Univ. of Chicago Press.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240.
Levy, Y., Schlesinger, I. M., & Braine, M. D. S. (1988). Categories and Processes in Language Acquisition. Lawrence Erlbaum.
MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. Transcription, format and programs (Vol. 1). Lawrence Erlbaum.
Meilă, M. (2003). Comparing clusterings by the variation of information. Learning Theory and Kernel Machines (p. 173–187). Springer.
Nowak, M. A., Plotkin, J. B., & Krakauer, D. C. (1999). The evolutionary language game. Journal of Theoretical Biology, 200(2), 147–162.
Riloff, E., & Jones, R. (1999). Learning dictionaries for information extraction by multi-level bootstrapping. Proceedings of the National Conference on Artificial Intelligence (p. 474–479).
Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (Vol. 410, p. 420).
Sahlgren, M. (2005). An introduction to random indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE (Vol. 5).
Sahlgren, M., & Karlgren, J. (2002). Vector-based semantic analysis using random indexing for cross-lingual query expansion. Evaluation of Cross-Language Information Retrieval Systems (p. 169–176).
De Saussure, F., Bally, C., Séchehaye, A., Riedlinger, A., Calvet, L. J., & De Mauro, T. (1922). Cours de linguistique générale. [Course in General Linguistics.] Payot, Paris.
Schütze, H. (1993). Part-of-speech induction from scratch. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (p. 251–258).
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of information. Urbana: University of Illinois Press.
Solan, Z., Horn, D., Ruppin, E., & Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102(33), 11629.
Turing, A. M. (1939). Systems of logic based on ordinals. Proceedings of the London Mathematical Society, 2(1), 161–228.
Language and Speech, 40(1), 47–62.
Vlachos, A., Korhonen, A., & Ghahramani, Z. (2009). Unsupervised and constrained Dirichlet process mixture models for verb clustering. Proceedings of the Workshop on Geometrical Models of Natural Language Semantics (p. 74–82).
Widdows, D. (2004). Geometry and Meaning. CSLI Publications, Stanford.