Quantitative intercultural comparison by means of parallel
pageranking of diverse national wikipedias
Daniel Hromada
Ecole Pratique des Hautes Etudes / CHART / Lutin Userlab

Abstract
The aim of our study was to show that the distribution of hyperlinks within wikipedia corpora implicitly contains
information about the cultural preferences of their authors. We transformed wikipedia corpora written in 27
different languages into graph structures whose vertices correspond to wikipedia articles and whose edges
correspond to hyperlinks between these articles. We then calculated a PageRank vector for each of these graphs, thus
obtaining a so-called “intracultural importance list” for every linguistic community under study. Two data-mining
experiments were performed on the obtained data. “The top country” study indicated that articles about
countries associated with the linguistic community that created them are found in the top parts of the
respective intracultural lists and, conversely, that the top parts of these lists could potentially serve as a
stylometric means of identifying the community which created the corpus. “The world&corpus” study revealed that,
for the majority of countries of reference, the ranking of the article about that country within the intracultural
list of a given community correlates significantly with the geographic distance between the country of reference
and the supposed home country of that linguistic community. Both experiments indicate the presence of a morphism
between the wikipedia hyperlink graph and the factual world of its authors.

Keywords: PageRank, Wikipedia, graph theory, comparative culturology, quantitative anthropology, cultural
stylometry, world-corpus correlations

1. Introduction
The aim of this article is to propose a new quantitative method for comparing different
cultures by reducing culture-specific corpora to a common metric. We shall try to demonstrate
the feasibility of such an approach by using PageRank as the metric and the wikipedias of
diverse (mostly European) linguistic communities as the corpora to be compared.
Both Wikipedia and PageRank have lately received a substantial amount of attention from
different scientific fields. Considered by some to be «probably the most important single
contribution to the fields of information retrieval and Web search of the last ten years» (Esuli
and Sebastiani, 2007), the implementation of PageRank by Brin and Page (1998) was without a
doubt a key component of the ascent of Google to the very top of the most visited Internet sites.
Wikipedia, on the other hand, is based upon the very simple idea of self-organized collaboration
among a huge number of authors. The hypothesis that such a huge number will, in the long run,
approximate scientific truth better than a limited number of experts (Surowiecki, 2004) is far
from being conclusively proven. Nevertheless, Wikipedia is nowadays considered a reliable source
of information in many domains, and it is one of the most important freely available
encyclopaedic corpora. Its multilingual properties are increasingly exploited in NLP
JADT 2010 : 10 th International Conference on Statistical Analysis of Textual Data

research for word sense disambiguation (Mihalcea, 2007), question
answering (Ferrandez et al., 2007) and named entity recognition (Richman and Schone, 2008).
Only a few studies, however, have focused fully upon differences between diverse wiki corpora. And
even when such “exploiting asymmetries” (Filatova, 2009) or “information arbitrage” (Adar et
al., 2009) approaches were presented, their goal was to infer data from discrepancies in article content,
and not to compare corpora considered as consistent wholes.
The research presented in this paper aims to demonstrate that even such large-scale comparisons
can yield valid information. Our starting hypothesis can be stated as follows: Wikipedia may
not approximate scientific truth, but it certainly approximates the culture of its authors. In
more exact terms, supposing that 1) the very act of creating an article or a link presupposes
the existence of a biased preference within the author and 2) that wikipedia is a graph structure
whose vertices are equivalent to articles and edges to hypertext links between these articles,
we propose that such a graph is at least partially, but significantly, isomorphic with the associative
network of culturally determined meanings and values of its authors.
The proposal that culture, which can be conceived as a structure of symbols, artifacts, buildings,
institutions, social roles etc. that are mutually interconnected in a very specific way, can be
described by graph theory and later analyzed by network analysis is far from new (for
an overview, see Park, 2005). Validating such a hypothesis, however, is not easy, since it is
hard to find 1) a unique graph-like structure (i.e. a structure with vertices and edges) that 2)
represents the common activity of a huge number of culture-holders. And even when such a structure
is found, the question of whether it faithfully represents (is isomorphic with) a given culture is
difficult to answer.
But since it is nowadays widely accepted that a culture is in the first place distinct from other
cultures and that this distinction forms the very essence of a given culture (Bourdieu, 1979),
cultural graphs can always be compared with each other, even when it is almost impossible to
compare a cultural graph with the factual world itself. The results of this comparison can
subsequently be compared more easily with evident cultural distinctions of the factual world.
We propose that corpora of local wikipedias created by diverse linguistic communities can serve
as a basis for the construction of such «cultural graphs» and that these graphs can subsequently be
compared by means of the PageRank centrality measure.

2. “The top country” study
Since “corpus culturology” does not seem to be an explored scientific domain, the goal of
this preliminary analysis was to decide whether it is worth continuing with the implementation of
more robust statistical techniques, or whether the introductory hypothesis
“the hyperlink distribution of a wikipedia graph contains implicit information about the cultural
preferences of its authors” should be considered false. In other words, our primary intention was to
assess whether culture-specific information can be observed by applying the PageRank algorithm to the
wikipedia corpora of diverse linguistic communities.
2.1. Method
The database tables «pages» (containing the list of articles, i.e. vertices) and «pagelinks» (containing
the list of hypertext links, i.e. edges) were downloaded from the Wikimedia site.
All vertices and edges outside the namespaces 0 (article), 14 (category) and 100 (portal)
were removed from the tables; subsequently a page_from → page_to plaintext edge list was
generated. After this edge list was transformed into a graph G, the PageRank vector, which is in
fact the dominant eigenvector of the graph’s modified adjacency matrix, was calculated with the igraph library
(Csárdi and Nepusz, 2006). A damping factor d=0.77 was chosen for the calculation. These
transformations and calculations were repeated for 27 wikipedia corpora; the overall properties of
their respective graphs are presented in Tab. 1.
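The filtering step described above can be sketched as follows. This is only a minimal illustration, not the script used in the study: the row layouts of `pages` and `pagelinks`, the helper `build_edge_list`, the "namespace:title" vertex labels and the decision to drop links to non-existing pages are all assumptions made for the example.

```python
# Namespaces kept in the study: 0 (article), 14 (category), 100 (portal).
ALLOWED_NAMESPACES = {0, 14, 100}


def build_edge_list(pages, pagelinks):
    """Return (source, target) vertex pairs restricted to the allowed namespaces.

    pages:     iterable of (page_id, namespace, title) rows (simplified, assumed layout)
    pagelinks: iterable of (from_page_id, to_namespace, to_title) rows (simplified, assumed layout)
    """
    # Vertices are labelled "namespace:title" so that an article and a
    # category sharing one title do not collapse into a single vertex.
    kept = {
        page_id: f"{namespace}:{title}"
        for page_id, namespace, title in pages
        if namespace in ALLOWED_NAMESPACES
    }
    known_vertices = set(kept.values())

    edges = []
    for from_id, to_namespace, to_title in pagelinks:
        target = f"{to_namespace}:{to_title}"
        # Assumption: links from removed pages, to removed namespaces,
        # or to pages that do not exist are dropped.
        if from_id in kept and target in known_vertices:
            edges.append((kept[from_id], target))
    return edges


# Toy rows, invented purely for illustration.
pages = [
    (1, 0, "Prague"),
    (2, 0, "Czech_Republic"),
    (3, 14, "Cities"),
    (4, 2, "Some_user"),  # namespace 2 is not in the allowed set
]
pagelinks = [
    (1, 0, "Czech_Republic"),
    (1, 14, "Cities"),
    (4, 0, "Prague"),           # source page was removed
    (2, 2, "Some_user"),        # target namespace was removed
    (2, 0, "Missing_article"),  # target page does not exist
]

edges = build_edge_list(pages, pagelinks)
# One "source<TAB>target" line per edge, i.e. a plaintext edge list.
edge_list_text = "\n".join(f"{src}\t{dst}" for src, dst in edges)
print(edge_list_text)
```

Only the two links that start and end in retained pages survive the filter in this toy input.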
ISO 639   Name of     Number of             Number of
code      language    vertices (articles)   edges (hyperlinks)
AR        Arabic      234538                4963998
BG        Bulgarian   143439                3578973
CS        Czech       266854                7187995
DA        Danish      205245                4402963
DE        German      1939647               43782766
EL        Greek       82168                 1879300
ES        Spanish     1303273               23212253
ET        Estonian    126448                2580511
FI        Finnish     403380                7609470
FR        French      1996383               53003962
HE        Hebrew      245431                9103883
HR        Croatian    116515                3850220
HU        Hungarian   277518                9865769
LV        Latvian     67736                 1342180
NO        Norwegian   405039                8938168
NL        Dutch       877590                24881686
PL        Polish      903670                29731309
PT        Portuguese  1088962               24867864
RO        Romanian    307084                5392290
RU        Russian     1232353               27442593
SK        Slovak      173417                4873409
SL        Slovenian   146250                5236834
SR        Serbian     239904                5013264
SV        Swedish     623035                11515290
TR        Turkish     304853                9557808
UK        Ukrainian   322799                9158661
ZH        Chinese     609262                15838584

Table 1: Basic graph properties of analysed corpora and their corresponding ISO639-1 codes
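The PageRank calculation itself can be illustrated with a minimal power-iteration sketch. It is not the igraph routine used in the study: the function `pagerank`, the uniform redistribution of rank from pages without outgoing links, the convergence tolerance and the toy edge list are assumptions made for illustration; only the damping factor d=0.77 is taken from the text.

```python
def pagerank(edges, damping=0.77, tol=1e-12, max_iter=1000):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({vertex for edge in edges for vertex in edge})
    out_links = {vertex: [] for vertex in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {vertex: 1.0 / n for vertex in nodes}
    for _ in range(max_iter):
        # Assumption: the rank of pages without outgoing links is spread
        # uniformly over all pages.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {vertex: base for vertex in nodes}
        for source in nodes:
            targets = out_links[source]
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy page_from -> page_to edge list, invented for illustration.
edges = [
    ("Prague", "Czech_Republic"),
    ("Prague", "Europe"),
    ("Brno", "Czech_Republic"),
    ("Czech_Republic", "Europe"),
    ("Europe", "Czech_Republic"),
]

ranks = pagerank(edges, damping=0.77)
for title, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{title}\t{value:.4f}")
```

The returned values sum to 1 and can be read as the converged probabilities of the random surfer model; for corpora of the size listed in Tab. 1 a sparse-matrix implementation such as the one in igraph would be needed instead of this dictionary-based loop.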

For every corpus, all page titles were sorted by descending PageRank
value. We call such a list an intracultural list, and we call langrank the position of a
given item in its respective intracultural list. Hence, 27 intracultural lists were obtained, within
which the page with the highest PageRank has langrank 1, the page with the second highest has langrank 2, etc. To
summarize, a high langrank means low PageRank importance and vice versa.
To detect which country names are found at the very top of the intracultural lists (i.e.
have the lowest langrank), the following procedure was applied: the term with langrank 1 was
extracted from the list and translated into English by using wikipedia itself as the translator.
If it was not present in the ISO list of country names, the procedure continued with the term having
langrank 2, 3, etc. If it was in the list, the procedure moved on to country detection
in the next intracultural list, thus repeating itself 27 times.
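The detection procedure can be sketched as below. The PageRank values are the ten Portuguese items listed in Tab. 2; the helper names, the `to_english` dictionary (standing in for translation via wikipedia itself) and the three-element `country_names` set (standing in for the ISO list of country names) are hypothetical and serve only to make the example self-contained.

```python
def intracultural_list(pagerank):
    """Titles sorted by descending PageRank; index 0 holds langrank 1."""
    ordered = sorted(pagerank.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ordered]


def top_country(titles, to_english, country_names):
    """Return (langrank, title) of the first title that names a country."""
    for langrank, title in enumerate(titles, start=1):
        # Titles without a known translation are simply skipped.
        if to_english.get(title) in country_names:
            return langrank, title
    return None


# Top ten items of the Portuguese corpus as given in Tab. 2.
pt_pagerank = {
    "Wikipédia": 0.065305,
    "Proxy": 0.006393,
    "WP:TT": 0.003323,
    "Plantae": 0.002419,
    "Til": 0.001981,
    "Avaré": 0.001496,
    "População": 0.001492,
    "Invertebrados": 0.001435,
    "Área": 0.001433,
    "Brasil": 0.001412,
}

# Hypothetical stand-ins for the translation step and the ISO country list.
to_english = {"Brasil": "Brazil", "População": "Population", "Área": "Area"}
country_names = {"Brazil", "Portugal", "Spain"}

titles = intracultural_list(pt_pagerank)
result = top_country(titles, to_english, country_names)
print(result)
```

With these inputs the first country encountered is Brasil at langrank 10, which agrees with the PT row of Tab. 3.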
2.2. Results
27 intracultural PageRank vectors, one for each language community, were obtained and
subsequently sorted in descending order of the calculated PageRank (converged
probability) value. For illustration, Tab. 2 shows the «top 10» items of such lists for 2 Romance
and 2 Slavic corpora.
Portuguese
L    Title                             PageRank
1    Wikipédia                         0.065305
2    Proxy                             0.006393
3    WP:TT                             0.003323
4    Plantae                           0.002419
5    Til                               0.001981
6    Avaré                             0.001496
7    População                         0.001492
8    Invertebrados                     0.001435
9    Área                              0.001433
10   Brasil                            0.001412

Spanish
L    Title                             PageRank
1    España/Sección                    0.491755
2    Rural                             0.050179
3    Wikipedia                         0.001105
4    Wikipedia_en_español              0.000887
5    2001                              0.000555
6    Mayo                              0.000508
7    Wikimedia_Commons                 0.000337
8    GFDL                              0.000205
9    España                            0.000197
10   Rural                             0.000196

Czech
L    Title                             PageRank
1    Wikipedie                         0.00984
2    Wikimedia_Commons                 0.00816
3    GNU_Free_Documentation_License    0.00303
4    CC-BY-SA                          0.00141
5    CAPTCHA                           0.00132
6    Česko                             0.00109
7    IP_adresa                         0.00097
8    Spojené_státy_americké            0.00082
9    Zeměpisné_souřadnice              0.00079
10   Praha                             0.00069

Russian
L    Title                             PageRank
1    Википедия:Справка                 0.01519
2    Русская_Википедия                 0.00564
3    Германия                          0.00361
4    Общественное_достояние            0.00348
5    GNU_Free_Documentation_License    0.00295
6    Викисклад                         0.00277
7    Creative_Commons                  0.00276
8    Английский_язык                   0.00121
9    Россия                            0.00119
10   Фонд_свободного_програ            0.00112

Table 2: Top ten (i.e. langrank 1 – 10) items of 4 intracultural lists and their respective PageRanks

It can easily be observed from the data that Wikipedia itself holds one of the top positions (this
is also the case in the other 23 corpora). This is a trivial finding, since the wikipedia system
is designed in such a way that it refers in the first place to articles which concern the functioning
of the system itself. Slightly less trivial is the observation that articles about
countries or cities closely associated with the language of a given wikipedia corpus emerge at the
top positions of their respective intracultural lists.
Wiki corpus   Top country                 L
AR            (Egypt)                     17
BG            България (Bulgaria)         4
CS            Česko (Czech Republic)      6
DA            Danmark (Denmark)           34
DE            Deutschland (Germany)       16
EL            Ελλάδα (Greece)             7
ES            España (Spain)              9
ET            Eesti (Estonia)             5
FI            Suomi (Finland)             5
FR            France (France)             23
HE            (Israel)                    7
HR            Hrvatska (Croatia)          4
HU            Magyarország (Hungary)      18
LV            Latvija (Latvia)            6
NL            Frankrijk (France)          11
NO            Norge (Norway)              6
PL            Polska (Poland)             12
PT            Brasil (Brazil)             10
RO            România (Romania)           7
RU            Германия (Germany)          3
SK            Slovensko (Slovakia)        9
SL            Slovenija (Slovenia)        8
SR            Француска (France)          28
SV            USA                         35
TR            Türkiye (Turkey)            13
UK            Україна (Ukraine)           13
ZH            印度尼西亚 (Indonesia)       10

Table 3: Country names found at the top of their intracultural lists (i.e. having lowest langrank L )

Answers to the question «Which country is the first to occur at the top of a given corpus’
intracultural importance list?» are presented in Tab. 3. In 22 cases, extracting the first country
name from the top of the intracultural list corresponding to the graph of the wikipedia written in
language X yielded the name of a country where this very language X is an official language of
the state. The five exceptions are: Dutch, where Frankrijk (L=11) narrowly outran Nederland (L=14);
Russian, where Германия (L=3) outran Россия (L=9); Serbian, where Француска (L=28) far
outran Србија (L=70); Swedish, where USA (L=35) narrowly outran Sverige (L=37); and finally
Chinese, where Indonesia (L=10) is followed by Qatar (L=45), Micronesia (L=371), Brunei
(L=409), Taiwan (L=484) and only much later by mainland China 中国 (L=579).
2.3. Discussion
The observation that the vast majority of corpora (22 out of 27) yield, in the top positions of their
respective intracultural lists, the names of countries whose official language is identical to the
language of the corpus under study is a first indication that even a pure hyperlink analysis could
prove to be a fruitful method for obtaining overall information about the preferences
or interests of the authors of wikipedia corpora. In this manner it could possibly serve as a
means of «cultural stylometry», a technique which could make it possible to determine whether
an anonymous author (or group of authors) belongs to a given cultural or social unit.
For instance, our data indicate that the «central country of interest» for the authors of the PT corpus
is Brasil (L=10) and not Portugal, which emerges only later in the list (L=32), later than França
(L=12), Itália (L=14), Espanha (L=16) and even Estados Unidos (L=31). If the basic hypothesis
of this article, i.e. that langrank values represent the importance of a given term in
a given corpus, is not falsified, it could be proposed that Brasil plays, for the authors of the PT
corpus, a much more important role than Portugal, from which it could be inferred that the majority
of them are possibly from Brazil and not from Portugal. Analogous stylometric conclusions can
be drawn from the AR corpus, where Egypt (L=17) is followed by Jordan (L=27),
Spain (L=36), France (L=37) and Tunisia (L=47).
An interesting exception occurs for corpora in which the top-ranked country is not one whose
official language is the language of the corpus: the fact that the Netherlands is
narrowly outrun by France in the case of the Dutch corpus, and Sweden by the USA in the case of the Swedish
corpus, could be interpreted by proposing that overall global currents, related
more closely to cultural superpowers, are, for the wikipedia authors of these two highly developed
nations, of slightly more interest than local currents of a nationalist nature.
The results obtained for the Chinese intracultural list are intriguing. While the position of Indonesia at
the very top could be naively explained by the activity of Chinese expats in Jakarta who pass their
time writing wikipedia articles, the subsequent emergence of Qatar, Micronesia and Brunei
seems to be completely counterintuitive. These phenomena can, however, be explained by a well-known
caveat of PageRank algorithms related to the so-called linksink phenomenon. A linksink
can emerge during the PageRank vector calculation when the analyzed graph contains a densely
interconnected subgraph having only a few links to the rest of the graph. One way to deal
with linksink perturbations is to optimize the damping factor; these problems in relation to
our comparative cultural method will be addressed in subsequent articles.
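The linksink effect and its dependence on the damping factor can be reproduced on a toy graph. The sketch below is an illustration under stated assumptions, not an analysis of the actual corpora: the `pagerank` helper is a plain power iteration (not the igraph routine), and the six-vertex graph with a two-page cluster that receives one link and returns none is invented for the example.

```python
def pagerank(edges, damping, tol=1e-12, max_iter=5000):
    """Plain power-iteration PageRank (dangling rank spread uniformly)."""
    nodes = sorted({vertex for edge in edges for vertex in edge})
    out_links = {vertex: [] for vertex in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {vertex: 1.0 / n for vertex in nodes}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {vertex: base for vertex in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Invented toy graph: a four-page cycle A-B-C-D plus a closed cluster
# S1 <-> S2 that is entered through a single link and never left.
edges = [
    ("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"),
    ("A", "S1"),
    ("S1", "S2"), ("S2", "S1"),
]


def sink_share(damping):
    """Total PageRank captured by the closed cluster."""
    ranks = pagerank(edges, damping)
    return ranks["S1"] + ranks["S2"]


shares = {d: sink_share(d) for d in (0.5, 0.77, 0.95)}
for d, share in shares.items():
    print(f"d={d}: cluster holds {share:.3f} of the total PageRank")
```

Although the cluster contains only 2 of the 6 vertices, it captures a disproportionate share of the rank, and the share grows as the damping factor approaches 1. Lowering d is therefore one lever against such perturbations, at the price of flattening the ranking as a whole.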
Since the top of the Serbian intracultural list indicates that this corpus is subject to linksink
perturbations (the first 45 positions are occupied solely by astronomical terms), we consider this to
be an explanation for the observation that Serbia is far outrun by France. Since the Serbian corpus
is not a big one, the result can also be explained by the excessive activity of a small group of
authors biased more towards France-related phenomena than towards Serbia-related ones.
The striking fact that Germany occupies the third position in the Russian intracultural importance list is left
to the reader’s interpretation.

3. “The world&corpus” study
While the vast majority of results obtained in analysis 1 seem to be consistent with intuitive
expectations, their true scientific significance remains debatable. To address this issue, we
conceived a second analysis in which we decided to correlate the precalculated intracultural
lists with factual data. For this purpose we decided to use the real geographic (spatial)
distances between the country of a linguistic community under study and another country (i.e. the
country of reference). This choice was motivated by a simple hypothesis: wikipedia users
from home country B will more likely write articles and create hyperlinks concerning countries
of reference A and C, which are neighbours of B, than concerning countries of reference X or Y,
which are spatially distant. If such a tendency exists, and if PageRank is a sufficiently effective
technique for quantifying the “importance” of the countries of reference A, C, X and Y
within the scope of a corpus created by authors supposedly from home country B, then significant
correlations between intracultural lists and the |home country, country of reference| spatial distance
can be expected to occur.
3.1. Method
We defined 32 countries of reference: 27 of them were countries which we also considered
to be the home countries of our intracultural lists; 5 others were chosen at random, one
from every continent (Italy, Japan, Senegal, Argentina, Australia).
As the first dataset we used the 27 intracultural lists, one for each home country, calculated
during analysis 1. From every such list, the langrank (i.e. the position in the list sorted by descending
PageRank value) corresponding to the term denoting the country of reference was extracted.
For example, as Tab. 4 illustrates, Hrvatska was in the 4th position in the Croatian corpus and in the 74th
in the Slovenian corpus.
Language of     Langrank    Name of country    Spatial
home country    position    of reference       distance (km)
AR              532         -                  3464
BG              345         Хърватия           797
CS              281         Chorvatsko         509
DA              848         Kroatien           1265
DE              329         Kroatien           808
EL              271         Κροατία            870
ES              756         Croacia            1695
FI              456         Kroatia            2197
FR              1131        Croatie            1056
HE              1493        -                  2255
HR              4           Hrvatska           0
HU              268         Horvátország       403
LV              675         Horvātija          1472
NL              409         Kroatië            1083
NO              418         Kroatia            1907
PL              422         Chorwacja          828
PT              749         Croácia            2028
RO              469         Croaţia            746
RU              696         Хорватия           5533
SK              271         Chorvátsko         494
SL              74          Hrvaška            118
SR              110         Хрватска           455
SV              556         Kroatien           1874
TR              413         Hırvatistan        1747
UK              679         Хорватія           1320
ZH              3981        克罗地亚            7321

Table 4: Positions of the country of reference Croatia in the intracultural lists of diverse home countries
and their respective spatial distances
The Mathematica functions of the computational search engine «Wolfram Alpha» were used as the
source of the home country ↔ country of reference spatial distance data.
Pearson correlation coefficients were calculated between the two datasets. The whole procedure was
repeated 32 times, once for every country of reference.
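For one country of reference, the correlation step can be sketched as follows. The (langrank, distance) pairs are copied from Tab. 4 (country of reference Croatia, 26 home corpora); the helper `pearson` and the variable names are assumptions of this sketch, and no p-value is computed here.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# home corpus -> (langrank of the article on Croatia, distance in km), from Tab. 4
croatia = {
    "AR": (532, 3464), "BG": (345, 797), "CS": (281, 509), "DA": (848, 1265),
    "DE": (329, 808), "EL": (271, 870), "ES": (756, 1695), "FI": (456, 2197),
    "FR": (1131, 1056), "HE": (1493, 2255), "HR": (4, 0), "HU": (268, 403),
    "LV": (675, 1472), "NL": (409, 1083), "NO": (418, 1907), "PL": (422, 828),
    "PT": (749, 2028), "RO": (469, 746), "RU": (696, 5533), "SK": (271, 494),
    "SL": (74, 118), "SR": (110, 455), "SV": (556, 1874), "TR": (413, 1747),
    "UK": (679, 1320), "ZH": (3981, 7321),
}

langranks = [langrank for langrank, _ in croatia.values()]
distances = [distance for _, distance in croatia.values()]

r_croatia = pearson(distances, langranks)
print(f"Pearson r for Croatia: {r_croatia:.3f}")
```

On these 26 datapoints the coefficient comes out close to the value reported for Croatia in Tab. 5; repeating the same calculation for every country of reference gives one coefficient per country.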
3.2. Results
The obtained results, presented in Tab. 5, suggest significant correlations between intracultural lists
and geographic data for all countries of reference with the exception of Australia, China, Japan,
Russia and Slovakia.
3.3. Discussion
The obtained results show correlations between strictly empirical spatial measures and positions
within the “intracultural” lists. Since different wikipedia corpora are direct consequences of the
different creative preferences of human groups, these correlations have to be explained in terms
of these preferences. We propose that these preferences are culturally determined.
The previous analysis, even if it leads to interesting conclusions, is nevertheless questionable,
and a major caveat should be raised: Pearson’s correlation coefficients are sensitive to outlier
datapoints and, if these are present, the analysis cannot be considered robust (Rousseeuw
and Leroy, 2003).
Country of ref.   p            cor
Argentina         <0.003       0.549
Australia         0.165        -0.275
Bulgaria          <0.00026     0.648
Croatia           <2E-06       0.779
China             0.426        0.183
Czech R.          <7E-05       0.689
Denmark           <0.00044     0.629
Estonia           <1.5E-05     0.730
Finland           <1.74E-05    0.727
France            0.0015       0.577
Germany           <0.004       0.539
Greece            0.00019      0.657
Hungary           0.00015      0.664
Israel            0.0148       0.463
Italy             <0.005       0.525
Japan             0.711        -0.07
Latvia            <5.6E-05     0.696
Netherlands       <0.007       0.507
Norway            <0.0003      0.652
Poland            <0.0005      0.630
Portugal          <0.05        0.387
Romania           <6.8E-05     0.690
Russia            0.8987       0.025
S.Arabia          <0.0035      0.543
Senegal           <0.0007      0.617
Slovakia          0.1965       0.256
Slovenia          <6.63E-07    0.797
Serbia            <9.53E-05    0.680
Spain             <0.011       0.486
Sweden            <0.001       0.599
Turkey            <0.0004      0.635
Ukraine           <0.0005      0.629

Table 5: Overall p-values and Pearson correlation coefficients (d=25) for 32 countries of reference

As Fig. 1 illustrates, this was the case, for example, when Germany was chosen as
the country of reference. Simple removal of the zh (Chinese) datapoint from the top right corner (i.e.
high spatial distance, high langrank) caused a drastic change from (cor=0.539; p<0.004)
to (cor=-0.108; p=0.599). Since the majority of countries of reference in analysis 2 were European
ones, it can be expected that this outlier inflates the significance of our results in an
unwanted manner.
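The influence of a single datapoint can be checked with a leave-one-out pass. Note that this is a technique swapped in for illustration (the study itself reports a single removal for Germany, whose raw data are not listed here), and that the sketch uses the Croatia datapoints of Tab. 4 instead; the helper names are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# home corpus -> (langrank of the article on Croatia, distance in km), from Tab. 4
croatia = {
    "AR": (532, 3464), "BG": (345, 797), "CS": (281, 509), "DA": (848, 1265),
    "DE": (329, 808), "EL": (271, 870), "ES": (756, 1695), "FI": (456, 2197),
    "FR": (1131, 1056), "HE": (1493, 2255), "HR": (4, 0), "HU": (268, 403),
    "LV": (675, 1472), "NL": (409, 1083), "NO": (418, 1907), "PL": (422, 828),
    "PT": (749, 2028), "RO": (469, 746), "RU": (696, 5533), "SK": (271, 494),
    "SL": (74, 118), "SR": (110, 455), "SV": (556, 1874), "TR": (413, 1747),
    "UK": (679, 1320), "ZH": (3981, 7321),
}


def correlation(data):
    """Correlation between distance and langrank for a set of home corpora."""
    langranks = [langrank for langrank, _ in data.values()]
    distances = [distance for _, distance in data.values()]
    return pearson(distances, langranks)


r_all = correlation(croatia)
# Recompute the coefficient with each home corpus removed in turn.
leave_one_out = {
    code: correlation({k: v for k, v in croatia.items() if k != code})
    for code in croatia
}
most_influential = max(leave_one_out, key=lambda code: abs(leave_one_out[code] - r_all))

print(f"all corpora: r={r_all:.3f}")
print(f"without {most_influential}: r={leave_one_out[most_influential]:.3f}")
```

For Croatia, too, the zh datapoint turns out to be the most influential one, and removing it lowers the coefficient noticeably, although less drastically than in the Germany case described above.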
Another source of bias was identified as well. It is related to the fact that Wolfram Alpha uses
the cartographic center of a country as the point from which it measures the distance to/from a given
country. This is a useful feature in the case of countries whose population is evenly distributed. In
the case of a country like Russia, however, the ru “central point” is placed somewhere in central
Siberia, 4000 km east of Moscow. Whether such a point has anything to do with the cultural
preferences of wikipedia authors is open to debate.
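How strongly the choice of reference point affects a distance can be illustrated with a great-circle (haversine) calculation. This sketch does not reproduce the Wolfram Alpha procedure: the function name, the spherical Earth model and all coordinates are assumptions, and in particular the "center of Russia" used here is only a rough, assumed location in central Siberia.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates (degrees north, degrees east).
BERLIN = (52.52, 13.405)
MOSCOW = (55.756, 37.617)
RUSSIA_CENTER = (66.4, 94.25)  # assumed cartographic center, for illustration only

to_moscow = great_circle_km(*BERLIN, *MOSCOW)
to_center = great_circle_km(*BERLIN, *RUSSIA_CENTER)

print(f"Berlin -> Moscow:          {to_moscow:.0f} km")
print(f"Berlin -> assumed center:  {to_center:.0f} km")
```

Measured to the capital, Russia is a comparatively near neighbour of Germany; measured to a point in central Siberia it becomes one of the most distant countries in the sample, which shifts the corresponding datapoint far along the distance axis.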

Figure 1: Visualisation of langrank&distance correlations for Germany as the country of reference
when the «China» outlier is included (left) or excluded (right)

4. General Discussion
The aim of “the top country” study was to determine whether the method of parallel
pageranking of wikipedia graphs can yield relevant information concerning the basic overall
specificities of the corpora, and therefore of their authors. A simple look at the tops of the calculated
intracultural lists demonstrated that this is indeed the case: in 22 out of 27 corpora the
topmost-ranked country article was about a country whose official language is that in
which the corpus was produced.
The second, “world&corpus” study focused on the relation between implicit properties of
wikipedia corpora and geographic distances in the factual world. While the significance of the
obtained results suggests that there possibly exist some morphic relations between the overall
hyperlink structure of (wikipedia) corpora and the factual world, the outlier problem indicates
that the “world&corpus dilemma” will not be an easy dilemma to resolve.
What we denote here as the “world&corpus dilemma” is only very superficially related to the method
which we presented in our second study. In fact, it is much more closely related to the ancient
epistemological problem “What is knowledge and how is it represented?” than to some trivial
linear regression of two sets of datapoints which tend to have something in common.
In its weaker form, the question goes like this: “What is the relation between the corpus and the
world, given that the corpus is sufficiently big?”. The goal of our article was to indicate that
graph theory could possibly offer a provisional answer to this question: “If a graph of the
corpus is isomorphic with the graph of the world the corpus tends to describe, then it can be said
that such a corpus contains knowledge about that world”.
We say “a” graph because there are infinitely many ways to construct a graph from a
given corpus. For the purposes of this article, we have chosen the simplest way: inspired
by the “random surfer model”, we have completely ignored the information IN the Net (e.g. word
co-occurrences in the content) and focused on the information ON the Net.
An edge was created when a hyperlink existed between two vertices. We supposed that this
assumption should suffice as a point of departure: the very act of creating an article, or a
hyperlink, can be an interesting clue to the preferences of the one who creates it. A weak clue,
of course, but nonetheless one containing more information than pure accident.
Since it is well known that a well-aggregated linear combination of weak classifiers can result in
a highly effective strong classifier (Freund and Schapire, 1996), it can likewise be proposed that
a huge number of well-aggregated weak cultural clues can yield some strong ones.

References
Adar E., Skinner M. and Weld D.S. (2009). Information arbitrage across multi-lingual Wikipedia. In
Proceedings of the Second ACM International Conference on Web Search and Data Mining,
ACM, pp. 94-103.
Bourdieu P. (1979). La distinction: critique sociale du jugement. Paris: Ed. de Minuit.
Brin S. and Page L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer
networks and ISDN systems, 30 (1-7): 107-117.
Csárdi G. and Nepusz T. (2006). The igraph software package for complex network research. InterJournal
Complex Systems, 1695.
Esuli A. and Sebastiani F. (2007). PageRanking WordNet synsets: An application to opinion mining. In
Annual meeting-association for computational linguistics. pp. 424-431.
Ferrandez S., Muñoz R. and Palomar M. (2007). Applying Wikipedia’s multilingual knowledge to
cross-lingual question answering. Lecture Notes in Computer Science, 4592, pp. 352-363.
Filatova E. (2009). Directions for exploiting asymmetries in multilingual Wikipedia. In Proceedings
of the Third International Workshop on Cross Lingual Information Access: Addressing the
Information Need of Multilingual Societies, Association for Computational Linguistics, pp. 30-37.
Freund Y. and Schapire R.E. (1996). Experiments with a new boosting algorithm. In Machine learning-
international workshop then conference, Citeseer, pp. 148-156.
Mihalcea R. (2007). Using wikipedia for automatic word sense disambiguation. In Proceedings of
NAACL 2007 HLT.
Park H. (2005). Network Cultural Analysis: Texts, Graphs, and Tools. In Paper presented at the annual
meeting of the American Sociological Association, Philadelphia, PA.
Richman A.E. and Schone P. (2008). Mining wiki resources for multilingual named entity recognition.
Association for Computational Linguistics (ACL-08: HLT): 1-9.
Rousseeuw P.J. and Leroy A.M. (2003). Robust Regression and Outlier Detection. Hoboken, New
Jersey : J. Wiley & Sons.
Surowiecki J. (2004). The wisdom of crowds: Why the many are smarter than the few and how collective
wisdom shapes business, economies, societies, and nations. New York: Doubleday Books.
