Conditions of cognitive plausibility of computational models of category induction

Daniel Devatman Hromada
Laboratoire Cognition Humaine et Artificielle (ChART), Universite Paris 8
hromi@wizzion.com

Abstract. We present two axiomatic and three conjectural conditions which a model inducing natural language categories should satisfy if it ever aims to be considered "cognitively plausible". The 1st axiomatic condition is that the model should involve a bootstrapping component. The 2nd axiomatic condition is that it should be data-driven. The 1st conjectural condition demands that the model integrate surface features – related to prosody, phonology and morphology – somewhat more intensively than is the case in existing Markov-inspired models. The 2nd conjectural condition demands that, besides integrating symbolic and connectionist aspects, the model in question exploit the global geometric and topological properties of the vector spaces upon which it operates. Finally, we shall argue that the model should facilitate qualitative evaluation, for example in the form of a POS-i oriented Turing Test. In order to support our claims, we present a POS-induction model based on trivial k-way clustering of vectors representing the suffixal and co-occurrence information present in parts of the Multext-East corpus. Even in the very initial stages of its development, the model succeeds in outperforming some more complex probabilistic POS-induction models at a lower computational cost.

Keywords: categorization, part-of-speech induction, surface features, vector spaces, categorization-oriented Turing Test, partitioning of grammatical feature space, K-means clustering, cognitive plausibility

1. Introduction

The notions of "cognitive plausibility" and "part-of-speech induction" shall be defined in subsection 1.1. Subsection 1.2 shall clarify the position of syntactic category induction within the field of Natural Language Processing (NLP). The last subsection (1.3) shall offer a brief overview of the history of the problem, arguing that the current paradigm is a probabilistic and English-centered one.

1.1 Cognitive plausibility

This article enumerates some basic conditions which, we believe, should be fulfilled by engineers aiming to transform their computational models into "cognitively plausible" artificial agents. We label as "cognitively plausible" a model which addresses some basic function of the human cognitive system not only by simulating, in a sort of "black-box apparatus", the mapping of inputs (stimuli, corpus data, etc.) onto outputs (results), but which also aims to faithfully represent the way in which the respective function or skill is accomplished by a human mind and its material substrate – the brain. In other terms, we believe that a cognitively plausible model should not only aim to attain the most quantitatively accurate results, but should also do so by processing the information similarly to the way the mind does it.

The aim of this article is to elucidate the notion of "cognitive plausibility" (CP) by relating it to one particular problem: that of the construction of the grammatical categories present in natural languages. More concretely, we shall try to illustrate our point on the problem of the construction of part-of-speech (POS) classes.
We specify that the term POS-induction (POS-i) designates the process which endows a human or an artificial agent with the competence to attribute POS-labels (like "verb", "noun", "adjective") to any token observable in the agent's linguistic environment. For the simplicity of the argument, only parts of textual corpora like Multext-East (Erjavec, 2012) shall be considered as the "linguistic environment" of the computational agent introduced below.

1.2 Part-of-Speech induction in Natural Language Processing and Language Acquisition studies

POS-i is often considered to be "one of the most popular tasks in research on unsupervised NLP" (Christodoulopoulos et al., 2010). The problem of construction of grammatical categories is closely related to the problems of "grammar induction" and language acquisition. Since "syntactic category information is part of the basic knowledge about language that children must learn before they can acquire more complicated structures" (Schütze, 1993), it is hard to imagine any computational model of grammar induction – aiming to discover the set of rules of the grammar of the language under study – which would not be able to construct, in the first place, the equivalence classes upon which the rules-to-discover shall be applied (Elman, 1989; Solan et al., 2005).

The acquisition of formal grammatical categories, be they parts-of-speech or others, is thoroughly studied in the psycholinguistic literature – for an introductory overview cf. Levy et al. (1988). Such studies often address the question of "whether grammatical categories are innate, or induced through interaction with the environment by means of imitation and analogy". The result of this never-ceasing Nature & Nurture debate is a vast amount of both empirical and theoretical knowledge which could be ideally useful for any attempt to bring together the disparate disciplines of artificial intelligence and developmental psychology.

1.3 POS-i paradigm(s)

While worthy POS-i models already existed before (e.g. Elman, 1989) or were published more or less in parallel (Schütze, 1993), the paradigm currently dominating the POS-i domain was fully born with the article published by Brown et al. in 1992. Without going into detail, we note that the model was successful because of its ability to apply both Markovian probabilistic concepts and concepts coming from information theory (Shannon & Weaver, 1949) to the information contained in the co-occurrences of words in sequences, thus becoming the flagship of what we hereby label the "co-occurrence distribution" or "contextual distribution" (CD) paradigm.

In the decades to follow, the CD paradigm has clearly dominated the POS-i field. Be it hidden Markov models tweaked with variational Bayes (Johnson, 2007), Gibbs sampling (Goldwater & Griffiths, 2007), morphological features (Berg-Kirkpatrick, Bouchard-Côté, DeNero, & Klein, 2010; Clark, 2003) or graph-oriented methods (Biemann, 2006) – all such approaches, and many others, consider contextual co-occurrence to be the primary source of POS-i relevant information. But as the comparative study of Christodoulopoulos et al. (2010) indicates when demonstrating that models integrating morphological features tend to perform better than those that do not, it seems plausible that the uncontested primary role of CD in POS-i should be revised.
While it is evident that CD must indeed furnish relevant information if the distributional hypothesis is valid (Harris, 1954), and while it is axiomatic that the distributional hypothesis applies in the case of any agent creating categories consistently with Hebb's law (Hebb, 1964), we shall argue in subsection 3.1 that pertinent POS-i clues can be extracted not only from a word's "external" contextual properties but also from the word's very "internal" form, its μορφή.

2. Axiomatic conditions of Cognitive Plausibility

This section deals with what we believe are necessary (i.e. sine qua non) conditions of cognitive plausibility of a computational model. Subsection 2.1 deals with the "bootstrapping" condition, stating that the categories which are being built are based on categories which have already been built. The emergence of the bootstrapping effect shall be illustrated by a trivial multi-iterative re-clustering of clusters pre-clustered according to CD features. Subsection 2.2 discusses the assumption that in order to be cognitively plausible, the model should be data- and/or oracle-driven.

2.1 Bootstrapping the bootstrapping

From biochemistry to the social sciences it is a well known fact that structuring structures are the structures structured. Computational Linguistics, and NLP in particular, is not an exception. The most general definition of the term bootstrapping (B) – i.e. that B is a self-sustaining multi-iterative process whereby the outputs of the previous iteration modify the very execution of the next iteration – could indeed be applied to so many computational "recurrent", "self-feeding" (Riloff & Jones, 1999) or "auto-organizing" (Nowak et al., 1999) approaches already deployed in NLP studies that to state about an NLP algorithm X that "X bootstraps" may sometimes seem to be a plain tautology.

In a certain sense, almost all POS-i models based on the CD paradigm are, ex vi termini, bootstrapping ones, because even in the most simplistic models the information about the membership of the target word WT in a candidate class C is inferred from the probabilities of membership of WL (WT's left context) and WR (WT's right context) in their respective candidate POS classes. Given the fact that WT plays the role of right context for WL and the role of left context for WR, the whole problem is circular and as such often calls for a bootstrapping solution.

Solan et al. (2005) refer to a crucial 4th component of their automatic distillation of structure (ADIOS) algorithm as "generalized bootstrapping". Differently from the "geometric approach" which shall be presented in our experiment below, ADIOS implements graph-like structures in order to attain its aim of constructing equivalence classes useful in subsequent grammar induction. But in its very essence, the approach of Solan et al. – i.e. that one should substitute the vertices "subsumed" by a "subsuming" non-terminal class-denoting vertex – is analogous, mutatis mutandis, to the approach presented in the following paragraphs.

2.1.1 1st experiment: Bootstrapping k-way POS clustering seeded by token co-occurrence features

The experiment was performed with the data contained in the English (en), Czech (cs) and Slovak (sk) corpora of the 4th version of the Multext-East corpus (Erjavec, 2012).

Table 1. Overall statistics of analyzed corpora

    Corpus        Cs        En        Sk
    Word types    19283     10511     20588
    Tokens        100368    134832    103452
    POS tags      13        12        13
    FeatCOOC      70426     36774     74912

Table 1 presents summary statistics concerning the quantities of word tokens, of distinct word types (i.e.
tokens considered without regard to their context) and of the most coarse-grained "gold standard" POS-tags, together with the total number FeatCOOC of distinct co-occurrence features, which equals the number of columns (dimensions) of the resulting co-occurrence matrix.

Every word type WT was characterized by a (row) vector of values [W1L, W2L ... WNL, W1R, W2R ... WNR], where W1L counts the cases when the word W1 occurred to the left of WT, W2L the cases when W2 occurred to the left, W3R the cases when W3 occurred to the right of the target word, and so on. What results is a simple co-occurrence matrix with N rows and, for a window of one word on each side, at most FeatCOOC = 2*N columns. Given that in the experiment we were actually looking at two words to the left and two words to the right of WT, the maximum possible number of columns was FeatCOOC = 4*N. But since not all word couples occur beside each other, the final number FeatCOOC always stayed below this theoretical limit.

The matrix was clustered into C = {2 ... 50} clusters by the fast & frugal repeated-bisection k-way clustering algorithm as implemented in the clustering tool CLUTO (Karypis, 2002). Columns were scaled according to the IDF principle and the clustering was done according to the cosine metric. Once finished, the comparison with the "gold standard" yielded the V-measure (Rosenberg & Hirschberg, 2007) values which are illustrated as the NO curves in Figure 1.

We have implemented the bootstrapping component in the following manner: after each clustering, the information about the proposed cluster is added as a new feature to the target word's vector description. Thus, if a matrix with 20 columns entered the first iteration, which clustered the vectors into 5 clusters, the matrix entering the second iteration shall have 20+5 columns. If the second iteration yields 6 clusters, a matrix with 25+6 columns will become the input of the third iteration, etc. Figure 1 shows that in the case of all 3 studied corpora, the bootstrapping (BO) method always attains higher scores than the static (NO) approach.1

1 Note that the V-measure of the NO-bootstrap curves seems to be relatively stable with regard to the increase in the number of clusters. Contrary to many-to-one accuracy (purity), which increases with the number of clusters, V-measure thus seems to be a better evaluation measure for cases when solutions containing different numbers of clusters have to be compared.

Fig. 1. Bootstrapping of contextual co-occurrence statistics
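For concreteness, the following sketch illustrates the bootstrapping loop of subsection 2.1.1. It is not the exact experimental setup: the toy corpus, the window of one word on each side and the use of scikit-learn's KMeans on L2-normalised raw counts (instead of CLUTO's repeated-bisection clustering with IDF scaling and cosine similarity) are simplifying assumptions made for illustration only.

    # Sketch of the bootstrapping loop of subsection 2.1.1, under simplifying
    # assumptions: toy corpus, window of one word on each side, and scikit-learn's
    # KMeans on L2-normalised raw counts standing in for CLUTO's repeated bisection.
    from collections import defaultdict
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import v_measure_score
    from sklearn.preprocessing import normalize

    tokens = ["the", "dog", "barks", "the", "cat", "sleeps", "a", "dog", "sleeps"]  # toy corpus
    gold   = ["D",   "N",   "V",     "D",   "N",   "V",      "D", "N",   "V"]       # toy gold tags

    types = sorted(set(tokens))
    t2i = {w: i for i, w in enumerate(types)}

    # Row = word type, columns = "word Wi occurred to the left / to the right of the type".
    cooc = np.zeros((len(types), 2 * len(types)))
    for i, w in enumerate(tokens):
        if i > 0:
            cooc[t2i[w], t2i[tokens[i - 1]]] += 1                # left neighbour
        if i + 1 < len(tokens):
            cooc[t2i[w], len(types) + t2i[tokens[i + 1]]] += 1   # right neighbour

    # Majority gold tag of every type, used only for the V-measure evaluation.
    votes = defaultdict(list)
    for w, g in zip(tokens, gold):
        votes[w].append(g)
    gold_of_type = [max(set(votes[w]), key=votes[w].count) for w in types]

    X = cooc
    for k in (2, 3):                                             # placeholder cluster counts
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(normalize(X))
        print(k, "clusters, V-measure:", v_measure_score(gold_of_type, labels))
        # Bootstrapping step: append one indicator column per induced cluster, so that
        # the next iteration also "sees" the categories induced by the previous one.
        X = np.hstack([X, np.eye(k)[labels]])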
2.2 Data and oracle-driven learning

Computational models unable to analyze what they have previously synthesized, and to synthesize what they have previously analyzed, could hardly be labeled "cognitively plausible". But even the presence of such a "dialectic" component cannot guarantee absolute success if the model's initial prima materia – the data with which the whole bootstrapping is initiated – is not adapted to the model's prewired "innate" state. It is unfortunately often the case in computational linguistics that whenever the model does not attain the expected performance, a huge amount of effort is invested into tuning the model by diverse ad hoc modifications. After hours of exhaustive search, both intellectual and automatic, diverse parameters, meta-parameters and hyper-parameters are finally discovered which allow the model to attain somewhat superior performance when confronted, for example, with the Wall Street Journal (WSJ) corpus. But human categorization faculties – POS-i included – do not develop in such a way.

While it seems plausible that the same sort of "tuning of parameters" indeed takes place during the initial period of language acquisition, it seems to be so efficient because the data itself is well adapted to the ever-evolving state of the baby's neuro-linguistic structures. Said more concretely, parents do not recite the WSJ or Eulex corpora to their children in order to adjust the synaptic weights in their children's brains; rather, they filter all their narrative intentions through pragmatic, prosodic, phonological as well as semantic Babytalk (Ferguson, 1964) cognitive filters. In doing so – by pre-processing the stimuli before they even reach the perceptual buffers of the child agent's ears – parents assume the role of a computational oracle (Turing, 1939).

Since it was already demonstrated by Clark (2010) with sufficient analytical clarity that the "supervision" coming from external oracle machines can significantly reduce the complexity of the grammar induction and POS-i problems, we find it worthwhile to state that "fully unsupervised approaches are very rare, because the engineer's decision to confront the algorithm with corpus X and not Y, and to do so at moment T1 and not T2, is already an act of supervision". By saying so we do not want to underestimate the importance of using the same corpora for the mutual comparison of scientific results. We simply want to indicate that, because it determines everything which follows, the question of corpus choice should not be neglected. More concretely, cognitively plausible models of POS-i should first be tuned and "raised" with corpora like CHILDES (MacWhinney, 2000), and only later should their scope of validity be extended by means of confrontation with corpora of adult and expert utterances.

3. Conjectural conditions of model's Cognitive Plausibility

Subsection 3.1 discusses the role of non-distributional "surface" features for POS-induction. The discussion is followed by the results of an experiment suggesting that features like suffixes can indeed offer quite strong clues for the creation of syntactic categories. Subsection 3.2 introduces a conjectural condition for a model's CP by proposing to base it principally on geometric grounds. It is followed by subsection 3.3, arguing that a CP model should facilitate evaluation by means of qualitative inspection. In general, these sections deal with CP's conjectural conditions, meaning that while they may seem less self-evident than the axiomatic ones, we nonetheless consider them to be valid.

3.1 Integration of surface features

Natural languages are very redundant communication channels (de Saussure, 1922; Shannon & Weaver, 1949). Three facets of the word – its morpho-phonological signifiant, its invisible signifié and its syntactic function – are not independent from one another, and more often than not they significantly overlap (Jackendoff, 2003; Lakoff, 1990). Thus it is not surprising that, especially in morphologically rich languages, a token's syntactic function is encoded by morphemes present in the surface, i.e. objectively perceivable, form of the token itself. And the results obtained by Clark (2003) or Berg-Kirkpatrick et al. (2010) indeed point in this direction – it may be no coincidence that approaches which exploit morphological features turned out, in the comparative study of Christodoulopoulos et al. (2010), to perform better than models which do not use such features.
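Before describing the second experiment, a minimal sketch may make the notion of suffixal surface features more concrete. The word list, the choice of the number of clusters and the use of scikit-learn's KMeans (instead of the CLUTO toolkit actually used) are illustrative assumptions; the sketch merely shows how every word type can be mapped onto a one-hot vector over its word-final character trigram and then clustered. The second experiment, described next in subsection 3.1.1, then assesses the impact of such features quantitatively.

    # Sketch: representing word types solely by their word-final character trigram.
    # The word list is illustrative; scikit-learn's KMeans stands in for the
    # CLUTO-based clustering used in the experiment of subsection 3.1.1.
    import numpy as np
    from sklearn.cluster import KMeans

    words = ["walking", "talking", "sleeping", "slowly", "quickly",
             "happiness", "darkness", "tables", "chairs"]

    def final_trigram(word):
        """Word-final character trigram (the whole word if it is shorter)."""
        return word[-3:]

    suffixes = sorted({final_trigram(w) for w in words})
    s2i = {s: i for i, s in enumerate(suffixes)}

    # One orthogonal (one-hot) FeatSUFFIX dimension per observed suffixal trigram.
    X = np.zeros((len(words), len(suffixes)))
    for i, w in enumerate(words):
        X[i, s2i[final_trigram(w)]] = 1.0

    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    for c in sorted(set(labels)):
        print(c, [w for w, l in zip(words, labels) if l == c])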
3.1.1 2nd experiment: Assessing the impact of suffixal features on part-of-speech categorisation

We used the same three Multext-East corpora as in the first experiment. The word-final character trigram was extracted from every word type and considered to be a feature. Word types were subsequently clustered into C clusters according to these FeatSUFFIX orthogonal dimensions. The comparison with the Multext-East gold standard subsequently yields the V-measures (V), entropies (H) and purities (P) presented in Table 2.

Table 2. Performance of models inducing C categories solely according to suffixal features

            Cs (534)                     En (286)                     Sk (523)
    C=10    V=0.178 H=0.487 P=0.582      V=0.248 H=0.428 P=0.639      V=0.17  H=0.5   P=0.504
    C=30    V=0.24  H=0.392 P=0.642      V=0.215 H=0.4   P=0.652      V=0.272 H=0.373 P=0.685
    C=50    V=0.26  H=0.34  P=0.69       V=0.2   H=0.39  P=0.66       V=0.274 H=0.339 P=0.714

The number in parentheses next to the corpus name denotes the length of the FeatSUFFIX vector, i.e. the number of distinct suffixal trigrams observed in the respective corpus. The FeatSUFFIX-driven model attains lower V-measures than those obtained by Christodoulopoulos et al. (2010) when evaluating the models of Clark (2003) or Berg-Kirkpatrick et al. (2010) within their comparative study. The very same study, however, also indicates that even the simplistic FeatSUFFIX-driven model can be of certain interest because of its speed: while the better-performing models harness the power of more than a dozen computational cores to attain comparable or even better V-measures, we are glad to state that, in order to attain the results presented above, our dual-core Pentium needed on average TEN = 1.8, TSK = 3.2 and TCS = 3.6 seconds per simulation.

3.2 Knowledge is geometric

After the symbol-operating paradigm of the Turing machine, importance has gradually shifted towards ever more fine-grained modular, probabilistic and connectionist models. But in recent years, a "geometric" paradigm has started to gain momentum in diverse fields of cognitive science, including computational linguistics and NLP. In the experiments described above, such a paradigm was harnessed in the sense that instead of modulating weights along different dimensions, geometers often modulate the number of dimensions itself.

It could possibly be objected to such a geometric approach that associating every plausible feature with a new dimension can induce some serious matrix-sparsity problems and/or that such an approach would, sooner or later, be confronted with insurmountable computational and memory limits. It is true that the methods by means of which some older approaches deal with the problem of huge co-occurrence matrices can be very costly, as is the case, for example, of the singular value decomposition within LSA (Landauer & Dumais, 1997). But since very elegant, simple and concise representations of sparse matrices can be very easily generated (Karypis, 2002), and since the Johnson-Lindenstrauss lemma (W. B. Johnson & Lindenstrauss, 1984) indicates that sparse high-dimensional matrices can be easily projected into low-dimensional spaces, as is often done in random indexing (Sahlgren, 2005), it seems plausible to state that the construction of vector spaces which are 1) dense, 2) transformable at low computational cost and 3) able to encode a huge amount of features attributed to a huge amount of objects is not as problematic as it used to be at the time when the HMM-mastered POS-i paradigm was born.
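As a minimal illustration of this point, the following sketch projects a sparse, high-dimensional feature matrix into a low-dimensional space by means of a sparse random projection; the matrix sizes, the density and the target dimensionality are arbitrary placeholders, and the availability of scipy and scikit-learn is assumed.

    # Sketch: Johnson-Lindenstrauss-style reduction of a sparse, high-dimensional
    # feature matrix by a sparse random projection. Matrix sizes and density are
    # arbitrary placeholders; scipy and scikit-learn are assumed to be available.
    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.metrics.pairwise import euclidean_distances
    from sklearn.random_projection import SparseRandomProjection

    # A sparse matrix standing in for a large object x feature co-occurrence matrix.
    X = sparse_random(1000, 50000, density=0.001, format="csr", random_state=0)

    # Project into a 300-dimensional space; pairwise distances are approximately preserved.
    projector = SparseRandomProjection(n_components=300, random_state=0)
    X_low = projector.fit_transform(X)

    print("original distances:\n", np.round(euclidean_distances(X[:5]), 2))
    print("projected distances:\n", np.round(euclidean_distances(X_low[:5]), 2))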
A series of articles by Sahlgren (2002; 2005), Cohen (2010), Widdows (2004) and their colleagues offers a valuable initiation into the advantages of random-projection based semantic models. For a more general discussion of the "geometrization of thought" in diverse fields of cognitive science, see (Gärdenfors, 2004). Within all such geometric models, categories can be considered as local subspaces of a global space derived from the data.

3.3 Mix of quantitative and qualitative evaluation

The performance of early grammatical category induction models was evaluated manually, by introspection into the induced equivalence classes, and articles published in the "golden age" of POS-i often used to enumerate the members of at least one particularly pleasing class or to present their dendrograms. Such an approach was later criticized by Clark (2003) as "inadequate", and the attention of the POS-i community turned towards more quantitative measures like perplexity, conditional entropy, cross-validation (Gao & Johnson, 2008), one-to-one accuracy (Haghighi & Klein, 2006), many-to-one accuracy (purity), variation of information (Meila, 2003), substitutable F-score (Frank et al., 2009), etc.

For the purposes of this article we have decided to present our simulations principally in terms of the V-measure. Given its elegance, its stability with regard to the growing number of clusters but also a certain "strictness" (note that even the best performing models present in the comparative study of Christodoulopoulos et al. (2010) rarely surpass the V > 0.6 limit), we consider the V-measure to be a very valuable quantitative measure of the performance of clustering POS-i algorithms. But we also believe that the "old school" many-to-one purity measure can be of certain interest, especially for those aiming to create a "semi-supervised bridge" between POS-induction and POS-tagging models, or for those aiming not to evaluate the performance of the model but rather to gain insights into the correct annotations of the analyzed corpora. In other terms, besides the "global" statistical measures informing the researcher about the overall performance of the model, more "local" measures can still offer interesting and useful information about the individual induced classes themselves.

The values presented in Table 3 represent the number C of clusters into which the corpus has to be partitioned in order to obtain at least Φ absolutely pure (i.e. Purity = 1) classes.

Table 3. Distillation of absolutely pure categories

                  Φ=1    Φ=2    Φ=3    Φ=4    Φ=5    Φ=10
    SFFX          72     92     105    126    131    160
    CD            168    194    196    248    281    377
    CD+BO         107    142    180    189    194    256
    SFFX+CD+BO    69     71     80     90     96     116

For example, in order to obtain an absolutely pure cluster on the basis of contextual distribution (CD) features, one would have to partition the English part of the Multext-East corpus into 168 clusters, among which the following noun-only cluster shall emerge: authority, character, frontispiece, judgements, levels, listlessness, popularity, sharpness, stead, successors, translucency, virtuosity.

Interesting insights can also be attained by inspection of some exact points of the clustering procedure. Let us inspect, as an example, the case when one clusters the English corpus into 7 clusters according to features both internal to the word – i.e. suffixes – and external to it – i.e. co-occurrences with other words. Such an inspection indicates that the model somehow succeeds in distinguishing verbs from nouns.
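The cross-tabulation underlying such an inspection is straightforward to compute. In the following sketch, the label sequences, like the use of scikit-learn's contingency_matrix, are illustrative assumptions rather than the actual experimental pipeline; the sketch builds a gold-tags × induced-clusters table of the kind shown in Table 4 and reports the purity of every induced cluster.

    # Sketch: qualitative inspection of induced clusters against gold POS tags.
    # "gold" and "induced" are placeholder label sequences (one entry per word type);
    # in the actual experiment they would come from the Multext-East annotation
    # and from the clustering tool respectively. scikit-learn is assumed.
    from sklearn.metrics.cluster import contingency_matrix

    gold    = ["N", "N", "V", "V", "N", "D", "D", "A", "V"]   # gold POS of each word type
    induced = [0,   0,   1,   1,   1,   2,   2,   0,   1]     # induced cluster of each type

    # Rows = gold tags (sorted), columns = induced clusters, as in Table 4.
    table = contingency_matrix(gold, induced)
    print(table)

    # Purity of each induced cluster: share taken by its majority gold tag.
    for c in range(table.shape[1]):
        col = table[:, c]
        print("cluster", c, "purity:", col.max() / col.sum())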
As is shown in Table 4, whose rows represent the "gold standard" tags and whose columns denote the artificially induced clusters, our naïve computational model tends to put nouns into clusters 4 and 6 while putting verbs into clusters 2, 3 and 5.

Table 4. Origins of the Noun-Verb distinction

          0      1      2      3      4      5      6
    N     10     568    97     13     1173   608    1977
    V     3      67     668    1011   67     958    97
    M     0      0      0      1      4      72     22
    D     0      0      0      0      0      67     0
    R     413    1      1      275    6      252    42
    A     30     0      137    0      133    321    1091
    S     0      1      3      2      0      99     3
    C     0      2      2      0      0      72     0
    I     0      0      0      0      0      7      3
    P     0      1      0      0      4      106    0
    X     1      0      0      0      3      3      2
    G     0      0      0      0      0      12     0

The objective of our ongoing work is to align, as much as possible, such "seeding" states like the one presented in Table 4 with data consistent with psycholinguistic knowledge about the diverse stages of the language acquisition process.

Last but not least, we believe that the temporal aspects of a model's performance – i.e. the answer to the question "How long does the model need to run in order to furnish reasonable results?" – should always be seriously considered. One way to evaluate such temporal aspects of categorization could be a simplistic Turing-Test-like (TT) POS-i oriented scenario where the evaluator asks the model (or an agent) to attribute a POS-label to a word posed by the evaluator, or at least to return a set of members of the same category. In such a real-life scenario, the absolute perfection of a possible future answer could be traded off for a less perfect (yet still locally optimal) answer given in reasonable time. But because with this TTPOS proposal we already depart from the domain of unsupervised induction towards semi-supervised "learning with an oracle" or a fully supervised POS-tagger, we consider the condition "a cognitively plausible model of part-of-speech induction should be evaluated by both quantitative and qualitative means" to be the weakest among all proposals concerning the development of an agent inducing the categories of natural language in a "cognitively plausible" way.

4. Conclusion

A model should be labeled a "cognitively plausible" model of a certain human faculty if and only if it not only accurately emulates the input (problem) → output (solution) mapping executed by the faculty, but also emulates the basic "essential" characteristics associated with such a mapping operation in human cognitive systems, i.e. emulates not only WHAT but also HOW the problem → solution mapping is done.

In relation to the problem of how part-of-speech induction is effectuated by human agents, two characteristic conditions have been defined as axiomatic (necessary). The first postulates that POS-i should involve a "bootstrapping" multi-iterative process able to subsume terminals sharing common features under a new non-terminal and to subsequently exploit the information related to the occurrence of the new non-terminal in order to extend the (vectorial) definition of the terminals represented in memory. Ideally, the process should converge to partitions "optimally" corresponding to the gold standard. The first experiment has shown, for three distinct corpora, that even a very simple model based on clustering of the most trivial co-occurrence information can attain higher accuracies if such a bootstrapping component is involved.

The second necessary condition of POS-i's CP is that it should be data- or oracle-driven. It should perform better when first confronted with simple corpora like CHILDES (MacWhinney, 2000) and only later with more complex ones than if it were first confronted with complex corpora.
Another condition of POS-i's CP proposed that morphological and surface features should not be neglected: instead of playing a secondary "performance-increasing" role, they could possibly "seed" the whole bootstrapping process which shall follow. This condition is considered to be conjectural (i.e. "weaker") only because it points in a somewhat orthogonal direction than the traditionally acclaimed distributional hypothesis (Harris, 1954). It may be the case, however, that especially native speakers of some morphologically rich languages shall consider the "syntax-is-also-IN-the-word" paradigm not only as conjectural but also as axiomatic.

Another "weak" condition of cognitive plausibility postulates that many phenomena related to mental representations and thinking, POS-i included, can be not only described but also explained and represented in geometric and topological terms. Ideally, the geometric paradigm (Gärdenfors, 2004) should not be contradictory but rather complementary to the symbolic and connectionist paradigms.

The last and weakest condition of CP proposed that computational models of part-of-speech induction should not only be easily quantitatively analyzable but should also be transparent to the researcher's or supervisor's qualitative analyses. They should facilitate, and not complicate, the posing of all sorts of "Why?" questions, and their results should be easily interpretable. A sort of categorization-faculty Turing Test was proposed which could potentially be embedded into the linguistic component of the hierarchy of Turing Tests which we propose elsewhere (Hromada, 2012).

It may be the case that the list of conditions of cognitive plausibility presented in this article is not a sufficient one and should be extended with other terms like "modularity", "self-referentiality" or notions coming from complex systems and evolutionary computing. Regarding the problem of elucidating how a machine could induce, from an environment-representing corpus, the categories in a way analogous to that of a child learning by imitating its parents, we consider even the list of 2 strong precepts and 3 weak precepts hereby presented as quite useful and possibly necessary.

Bibliography

Berg-Kirkpatrick, T., Bouchard-Côté, A., DeNero, J., & Klein, D. (2010). Painless unsupervised learning with features. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (p. 582–590).

Biemann, C. (2006). Unsupervised part-of-speech tagging employing efficient graph clustering. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop (p. 7–12).

Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467–479.

Christodoulopoulos, C., Goldwater, S., & Steedman, M. (2010). Two decades of unsupervised POS induction: How far have we come? Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (p. 575–584).

Clark, A. (2003). Combining distributional and morphological information for part of speech induction. Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics – Volume 1 (p. 59–66).

Clark, A. (2010). Towards general algorithms for grammatical inference. Algorithmic Learning Theory (p. 11–30).

Cohen, T., Schvaneveldt, R., & Widdows, D. (2010).
Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2), 240–256.

Elman, J. L. (1989). Representation and structure in connectionist models. DTIC Document.

Erjavec, T. (2012). MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. Language Resources and Evaluation, 46(1), 131–142.

Ferguson, C. A. (1964). Baby talk in six languages. American Anthropologist, 66(6, Part 2), 103–114.

Frank, S., Goldwater, S., & Keller, F. (2009). Evaluating models of syntactic category acquisition without using a gold standard. Proc. 31st Annual Conf. of the Cognitive Science Society (p. 2576–2581).

Gao, J., & Johnson, M. (2008). A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. Proceedings of the Conference on Empirical Methods in Natural Language Processing (p. 344–352).

Gärdenfors, P. (2004). Conceptual spaces: The geometry of thought. MIT Press.

Goldwater, S., & Griffiths, T. (2007). A fully Bayesian approach to unsupervised part-of-speech tagging. Annual Meeting of the Association for Computational Linguistics (Vol. 45, p. 744).

Haghighi, A., & Klein, D. (2006). Prototype-driven learning for sequence models. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (p. 320–327).

Harris, Z. S. (1954). Distributional structure. Word.

Hebb, D. O. (1964). The Organization of Behavior: A Neuropsychological Theory. John Wiley & Sons.

Hromada, D. D. (2012). Taxonomy of Turing Test Scenarios. Proceedings of the AISB/IACAP Symposium. Birmingham, United Kingdom.

Jackendoff, R. (2003). Foundations of language: Brain, meaning, grammar, evolution. Oxford University Press, USA.

Johnson, M. (2007). Why doesn't EM find good HMM POS-taggers? Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (p. 296–305).

Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189–206), 1.

Karypis, G. (2002). CLUTO – a clustering toolkit. DTIC Document.

Lakoff, G. (1990). Women, fire, and dangerous things. Univ. of Chicago Press.

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240.

Levy, Y., Schlesinger, I. M., & Braine, M. D. S. (1988). Categories and Processes in Language Acquisition. Lawrence Erlbaum.

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. Transcription, format and programs (Vol. 1). Lawrence Erlbaum.

Meilă, M. (2003). Comparing clusterings by the variation of information. Learning Theory and Kernel Machines (p. 173–187). Springer.

Nowak, M. A., Plotkin, J. B., & Krakauer, D. C. (1999). The evolutionary language game. Journal of Theoretical Biology, 200(2), 147–162.

Riloff, E., & Jones, R. (1999). Learning dictionaries for information extraction by multi-level bootstrapping. Proceedings of the National Conference on Artificial Intelligence (p. 474–479).

Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (Vol. 410, p. 420).

Sahlgren, M.
(2005). An introduction to random indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE (Vol. 5).

Sahlgren, M., & Karlgren, J. (2002). Vector-based semantic analysis using random indexing for cross-lingual query expansion. Evaluation of Cross-Language Information Retrieval Systems (p. 169–176).

De Saussure, F., Bally, C., Séchehaye, A., Riedlinger, A., Calvet, L. J., & De Mauro, T. (1922). Cours de linguistique générale. Payot, Paris.

Schütze, H. (1993). Part-of-speech induction from scratch. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (p. 251–258).

Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of Communication. Urbana: University of Illinois Press.

Solan, Z., Horn, D., Ruppin, E., & Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102(33), 11629.

Turing, A. M. (1939). Systems of logic based on ordinals. Proceedings of the London Mathematical Society, 2(1), 161–228.

Vlachos, A., Korhonen, A., & Ghahramani, Z. (2009). Unsupervised and constrained Dirichlet process mixture models for verb clustering. Proceedings of the Workshop on Geometrical Models of Natural Language Semantics (p. 74–82).

Widdows, D., & Kanerva, P. (2004). Geometry and meaning. CSLI Publications, Stanford.