Random Projection and Geometrization of String Distance Metrics

Daniel Devatman Hromada
Université Paris 8 – Laboratoire Cognition Humaine et Artificielle
Slovak University of Technology – Faculty of Electrical Engineering and Information Technology
hromi@giver.eu

Abstract

Edit distance is not the only way in which the distance between two character sequences can be calculated. Strings can also be compared in somewhat subtler, geometric ways. A procedure inspired by Random Indexing can attribute a D-dimensional geometric coordinate to any character N-gram present in the corpus and can subsequently represent a word as the sum of the vectors of the N-gram fragments which the string contains. Thus, any word can be described as a point in a dense D-dimensional space, and the calculation of the distance between two words can be realized by applying traditional Euclidean measures. A strong correlation exists, within the Keats Hyperion corpus, between such a cosine measure and Levenshtein distance. Overlaps between the centroid of the Levenshtein distance matrix space and the centroids of vector spaces generated by Random Projection were also observed. Contrary to the standard non-random "sparse" method of measuring cosine distances between two strings, the method based on Random Projection tends naturally to promote not the shortest but rather longer strings. The geometric approach yields a finer output range than Levenshtein distance, and the retrieval of the nearest neighbor of a text's centroid could have, due to the limited dimensionality of the Randomly Projected space, smaller complexity than other vector methods.

Μηδεὶς ἀγεωμέτρητος εἰσίτω μου τὴν στέγην ("Let no one ignorant of geometry enter under my roof")

1 Introduction

The transformation of qualities into still finer and finer quantities is among the principal hallmarks of the scientific method. In a world where even "deep" entities like "word meanings" are quantified and co-measured by an ever-growing number of researchers in computational linguistics (Kanerva et al., 2000; Sahlgren, 2005) and the cognitive sciences (Gärdenfors, 2004), it is of no surprise that "surface" entities like "character strings" can also be compared one with another according to a certain metric.

Traditionally, the distance between two strings is most often evaluated in terms of edit distance (ED), which is defined as the minimum number of operations like insertion, deletion or substitution required to change one string-word into the other. A prototypical example of such an edit distance approach is the so-called Levenshtein distance (Levenshtein, 1966). While many variants of Levenshtein distance (LD) exist, some extended with other operations like that of "metathesis" (Damerau, 1964), some exploiting probabilistic weights (Jaro, 1995), some introducing dynamic programming (Wagner & Fischer, 1974), all these ED algorithms take for granted that the notions of insertion, deletion etc. are crucial in order to operationalize the similarity between two strings.

Within this article we shall argue that one can successfully calculate the similarity between two strings without having recourse to any edit operation whatsoever. Instead of discrete insert & delete operations, we shall focus the attention of the reader upon a purely positive notion, that of the "occurrence of a part within the whole" (Harte, 2002). Any string-to-be-compared shall be understood as such a whole, and any continuous N-gram fragment observable within it shall be interpreted as its part.

2 Advantages of Random Projection
Random Projection is a method for projecting high-dimensional data into representations with fewer dimensions. In theoretical terms, it is founded on the Johnson-Lindenstrauss lemma (Johnson & Lindenstrauss, 1984), which states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that the distances between the points are nearly preserved. In practical terms, solutions based on Random Projection, or the closely related Random Indexing (RI), tend to yield high performance when confronted with diverse NLP problems like synonym-finding (Sahlgren & Karlgren, 2002), text categorization (Sahlgren & Cöster, 2004), unsupervised bilingual lexicon extraction (Sahlgren & Karlgren, 2005), discovery of implicit inferential connections (Cohen et al., 2010) or automatic keyword attribution to scientific articles (El Ghali et al., 2012). RP distinguishes itself from other word space models in at least one of these aspects:

1. Incremental: RP allows new data-points (words) or their ensembles (texts, corpora) to be injected on the fly into an already constructed vector space. One is not obliged to execute heavy computations (like Singular Value Decomposition in the case of Latent Semantic Analysis) every time new data is encountered.

2. Multifunctional: Like other vector-space models, RP can be used in many diverse scenarios. In RI, for example, words are often considered to be the terms and sentences are understood as the documents. In this article, words (or verses) shall be considered as documents, and the N-gram fragments which occur in them shall be treated as terms.

3. Generalizable: RP can be applied in any scenario where one needs to encode into vectorial form the set of relations between discrete entities observable at diverse levels of abstraction (words/documents, parts/wholes, features/objects, pixels/images, etc.).

4. Absolute: N-grams and terms, words and sentences, sentences and documents – in RP all these entities are encoded in the same randomly constructed yet absolute space. Similarity measurements can therefore be realized even among entities which would be considered incommensurable in more traditional approaches.1

1 In traditional word space models, words are considered to be represented by the rows (vectors/points) of the word-document matrix and documents by its columns (axes). In RP, both words (or word-fragments) and documents are represented by rows.

There is, of course, a price to be paid for these advantages. Primo, RP involves stochastic aspects and its application thus does not guarantee the replicability of results. Secundo, it involves two parameters, D and S, and the choice of these parameters can significantly modify the model's performance (in relation to the corpus upon which it is applied). Tertio, since even the most minute "features" are initially encoded in the same way as more macroscopic units like words, documents or texts, i.e. by a vector of length D "seeded" with S non-zero values, RP can be susceptible to certain limitations if ever applied to data discretisable into millions of distinct observable features.

3 Method

The method of geometrization of strings by means of Random Projection (RP) consists of four principal steps. Firstly, the strings contained within the corpus are "exploded" into fragments. Secondly, a random vector is assigned to every fragment according to RP's principles. Thirdly, the geometric representation of a string is obtained as the sum of its fragment-vectors. Finally, the distance between two strings can be obtained by calculating the cosine of the angle between their respective geometric representations.
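Before the individual steps are detailed in the following subsections, the whole pipeline can be summarized by a minimal sketch in Perl (the language of our PDL-based implementation). It is only an illustration written under stated assumptions: the parameter values D = 1000 and S = 5, the rand()-based seeding and the helper names (fragments, random_vector, word_vector, cosine) are chosen here for exposition and do not reproduce the exact code used for the experiments reported below.

    #!/usr/bin/perl
    # Illustrative sketch of the four-step procedure (not the experimental code).
    use strict;
    use warnings;
    use List::Util qw(shuffle);

    my ($D, $S) = (1000, 5);   # dimensionality and seed (number of non-zero elements)
    my %fragment_vector;        # fragment => its random "init" vector

    # Step 1: explode a word into all of its continuous N-gram fragments.
    sub fragments {
        my ($word) = @_;
        my @frags;
        for my $len (1 .. length $word) {
            for my $off (0 .. length($word) - $len) {
                push @frags, substr($word, $off, $len);
            }
        }
        return @frags;
    }

    # Step 2: attribute to every fragment a vector of length D whose S randomly
    # placed non-zero elements are either -1 or 1 (rand()-based seeding assumed).
    sub random_vector {
        my @v = (0) x $D;
        my @pos = (shuffle 0 .. $D - 1)[0 .. $S - 1];
        $v[$_] = rand() < 0.5 ? -1 : 1 for @pos;
        return \@v;
    }

    # Step 3: the geometric representation of a word is the sum of its fragment vectors.
    sub word_vector {
        my ($word) = @_;
        my @sum = (0) x $D;
        for my $frag (fragments($word)) {
            $fragment_vector{$frag} //= random_vector();
            $sum[$_] += $fragment_vector{$frag}[$_] for 0 .. $D - 1;
        }
        return \@sum;
    }

    # Step 4: the distance is the cosine of the angle between two word vectors.
    sub cosine {
        my ($x, $y) = @_;
        my ($dot, $nx, $ny) = (0, 0, 0);
        for my $i (0 .. $D - 1) {
            $dot += $x->[$i] * $y->[$i];
            $nx  += $x->[$i] ** 2;
            $ny  += $y->[$i] ** 2;
        }
        return ($nx && $ny) ? $dot / sqrt($nx * $ny) : 0;
    }

    printf "cosine(vessel, vessels) = %.3f\n",
        cosine(word_vector('vessel'), word_vector('vessels'));

Under these assumptions, two morphologically close strings such as "vessel" and "vessels" share most of their fragments and hence most of the non-zero coordinates of their vectors, which pushes their cosine towards 1.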
3.1 String Fragmentation

We define a fragment F of a word W of length N as any continuous 1-, 2-, 3-, ..., N-gram contained within W.2 Thus, a word of length 1 contains 1 fragment (the fragment is the word itself), words of length 2 contain 3 fragments and, more generally, there exist N(N+1)/2 fragments for a word of length N. The fragmentation algorithm can be expressed as the following Perl subroutine:

    sub fragmentator {
        my ($word) = @_;
        my @frags;
        for my $frag_length (1 .. length $word) {
            for my $offset (0 .. length($word) - $frag_length) {
                push @frags, substr($word, $offset, $frag_length);
            }
        }
        return @frags;
    }

where substr() is a function returning from the string $word a fragment of length $frag_length starting at the specified $offset.

2 Note that in this introductory article we exploit only continuous N-gram fragments. The interaction of RP with other possibly relevant patterns observable in the word – like N-grams with gaps or sequences of members of diverse equivalence classes (e.g. consonants/vowels) – shall, we hope, be addressed in our doctoral thesis or other publications.

3.2 Stochastic fragment-vector generation

Once the fragments are obtained, we transform them into geometric entities by following the fundamental precept of Random Projection: to every fragment-feature F present in the corpus, let us assign a random vector of length D containing D-S elements having zero values and S elements whose value is either -1 or 1. The number of dimensions (D) and the seed (S) are the parameters of the model. It is recommended that S << D.

While only a limited number of distinct whole-numbered values were observed among the Levenshtein distances of the word couples of the corpus, one could distinguish 252229 diverse real-numbered RPD values limited to the interval <0, 1>.

    RPD
    vessels – vessel
    comfort – comforts
    sorrows – sorrow
    'benign – benign
    temples – temple
    changing – unchanging
    stream – streams
    immortal's – immortal
    breathe – breath
    trance – tranced

Table 2: Ten most similar word couples according to the non-random "sparse" geometric distance (GD) and the Randomly Projected Distance (RPD)

Figure 1: Scatter plot displaying the relations between Levenshtein distances and cosine distances measured in the vector space constructed by RI1000,5

String distance measured in the space constructed by RP1000,5 also strongly correlates (Pearson correlation coefficient = 0.992; Spearman rho = 0.679; minimal p < 2.2e-16 for both tests) with the GD cosine measure exploiting a non-deformed fragment-word matrix. An important difference was observed, however, during a more "local" and qualitative analysis of the results produced by the two vectorial methods. More concretely: while the non-stochastic "sparse" cosine GD tends to promote as "closest" the couples of short strings, RPD yields the highest scores for couples of long words. This is indicated by the list of the most similar word couples presented in Table 2.

4.3 The "centroid" experiment

Three types of concrete word-centroids were extracted from the corpus. The string having the smallest overall LD to all other strings in the corpus shall be labeled the "Levenshtein centroid" (LC). The string having the maximal sum of cosines in relation to the other words shall be labeled the "Cosinal centroid" (CC). Contrary to the LC and the CC, for whose calculation one has to compute the distances with regard to all words in the corpus, the "Geometric Centroid" (GC) was determined as the word whose vector has the biggest cosine in regard to the "Theoretical Centroid" (TC), obtained in a purely geometric way as the sum of all word-vectors. The stochastic CC_RP and GC_RP calculation simulations were repeated in 100 runs with D=1000, S=5.
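To make the distinction between the Theoretical Centroid and the Geometric Centroid concrete, the following sketch reuses the word_vector() and cosine() helpers and the $D parameter from the listing in Section 3; the five-word list is purely illustrative. It computes the TC as the plain sum of all word vectors and then retrieves the corpus word closest to it by cosine, i.e. the GC.

    # Illustrative continuation of the earlier listing (assumes $D, word_vector(),
    # cosine()); the word list below is a toy example, not the Hyperion corpus.
    my @words = qw(are ore ate sole ease);

    # Theoretical Centroid: plain sum of all word vectors, no N x N distance matrix.
    my @tc = (0) x $D;
    for my $w (@words) {
        my $v = word_vector($w);
        $tc[$_] += $v->[$_] for 0 .. $D - 1;
    }

    # Geometric Centroid: the word whose vector has the biggest cosine with the TC.
    my ($gc, $best) = ('', -2);
    for my $w (@words) {
        my $c = cosine(word_vector($w), \@tc);
        ($gc, $best) = ($w, $c) if $c > $best;
    }
    print "Geometric Centroid candidate: $gc\n";

Finding the LC or the CC, by contrast, requires the full matrix of pairwise distances, which is what makes GC detection the cheapest of the three procedures, as discussed in Section 5.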
4.3.1 Results

The word "are" was determined to be the LC of the Hyperion corpus, with an average LD of 4.764 to all other words of the corpus. The same word "are" was ranked, by the non-stochastic "sparse" geometric distance algorithm, as the 3rd most central CC and as the 36th closest term to the GC. Table 3 shows the ten terms with the smallest overall LD to all other words of the corpus (LC), the ten terms with the biggest cosine in relation to all other terms of the corpus (CC_GD) and the ten terms with the biggest cosine in regard to the hypothetical Theoretical Centroid (GC_GD) of the sparse non-projected space obtained from the Hyperion corpus.

    Rank  LC    CC_GD  GC_GD
    1     are   charm  a
    2     ore   red    o
    3     ate   arm    I
    4     ere   a      '
    5     one   me     he
    6     toes  hard   to
    7     sole  had    at
    8     ease  reed   an
    9     lone  domed  me
    10    here  are    as

Table 3: Ten nearest neighbor words of three types of non-stochastic centroids

The shortest possible strings seem to be GC_GD's nearest neighbors. This seems to be analogous to the data presented in Table 2. In this sense the GC_GD method seems to differ from the CC_GD approach, which tends to promote longer strings. Such a marked difference in behavior between the GC and CC approaches was not observed in the case of spaces constructed by means of Random Projection. In 100 runs, both the GC-centered and the CC-centered approaches seemed to promote as central strings of comparable content and length.4 As indicated by Table 4, the LC "are" turned out to be the closest (i.e. Rank 1, when compared with Table 3) to all other terms in 6% of the Random Projection runs. In 6% of the runs the same term was labeled the nearest neighbor of the geometric centroid of the generated space. Other overlaps between all the methods used are marked in bold in Tables 3 and 4.

4 In fact, only in 22 runs did GC_RPD differ from CC_RPD.

    Word  CC_RPD  GC_RPD
    see   20      28
    he    11      8
    are   6       6
    ore   5       6
    ere   4       5
    set   6       5
    she   5       4
    sea   4       4
    a     9       4
    red   1       3

Table 4: Central terms of the Randomly Projected spaces and their frequency of occurrence in 100 runs

Analogically to the observation described in the last paragraph of Section 4.2.1, it can also be observed that the strings characterized as "closest" to the Theoretical Centroid of the vector spaces generated by Random Projection tend to be longer than the "minimal" strings nearest to the GC_GD determined in the traditional non-stochastic feature-word vector space scenario.

5 Discussion

When it comes to the CC_RP calculation, one detection run lasted, on average, 90 seconds, thus being almost twice as fast as the LC calculation executed on the very same computer and the very same corpus, which lasted 157 s. This indicates that the computational complexity of our PDL (Glazebrook et al., 1997) implementation of CC_RP detection is smaller than the complexity of the LC detection based on Perl's Text::Levenshtein implementation of LD.

When it comes to the computational complexity of the GC calculation, it is evident that the GC is determined faster and by a less complex process than the LC or the CC. This is so because, in order to determine the GC_RP of N words, there is no need to construct an N x N distance matrix. On the contrary, since every word is attributed coordinates in a randomly generated yet absolute space, the detection of a hypothetical Geometric Centroid of all words is a very straightforward and cheap process, as is the detection of the GC's nearest word neighbor. And since in RP the length of the GC-denoting vector is limited to a relatively low number of elements (i.e.
D = 1000 in the case of this paper), it is of no surprise that the string closest to the GC shall be found more slowly by the traditional "sparse vector" scenario whenever the number of features (columns) is greater than D. In our scenario, with NF = 22340 distinct features, it was almost 4 times faster to construct the vector space and find the word nearest to the GC of the Randomly Projected space than to use a "sparse" fragment-term matrix optimized by storing only non-zero values (GC_RPD-NN-detection ~ 6 s; GC_GD-NN-detection ~ 22 s).

Another point worthy of interest is that, contrary to the "sparse" method, which seems to give higher scores to shorter strings, somewhat longer strings seem to behave as if they were naturally "pushed towards the centroid" in the dense space generated by RP. If such is indeed the case, then we believe that the method presented hereby could be useful, for example, in the domain of gene sequence analysis or in other scenarios where the pattern-to-be-discovered is "spread out" rather than centralized. In practical terms, if querying in RP space turns out to have lower complexity than other vector models, our method could be useful within a hybrid system as a fast stochastic way to pre-select a limited set of "candidate" (possibly locally optimal) strings which could subsequently be confronted with more precise, yet more costly, non-stochastic metrics ultimately leading to the discovery of the global optimum.

Aside from the above-mentioned aspects, we believe that there exists at least one other theoretical reason for which the RP-based geometrization procedure could be deemed a worthy alternative to LD-like distance measures: the cardinality of the real-valued <0, 1> range of a cosine function is much higher than the whole-numbered <0, max(length(word))> range offered as output by Levenshtein Distance. In other terms, the outputs of string distance functions based on the trigonometry of RP-based vector spaces are subtler and more fine-grained than those furnished by traditional LD. While this advantage does not hold for "weighted" LD measures, we hope that this article can motivate future studies comparing "weighted" LD and RPD metrics.

When it comes to the feature-extracting "fragment explosion" approach, it could possibly be reproached to the method proposed hereby that 1) the fragmentation component, which blindly permutes through all the N-grams present in the corpus, yields too many "features"; 2) taking all of them into account during the calculation of the word's final vector is not necessary and could even turn out to be computationally counter-productive; or 3) bi-grams and tri-grams alone give better results than larger N (Manning et al., 2008). A primary answer to such an ensemble of reproaches could be that, by the very act of projecting the data upon a limited set of shared non-orthogonal dimensions, the noise could simply cancel itself out.5 Another possible answer could be that, while the bi- & tri-gram argument holds well for natural language structures, the method we propose here has the ambition to be used beyond NLP (e.g. bio-informatics) or pre-NLP (e.g. in early stages of language acquisition, where the very notion of an N-gram does not make sense because the very criterion of sequence segmentation & discretization has not yet been established).

5 And this "noise-canceling property" could be especially true for RP as defined in this paper, where the rare non-zero values in the random "init" vectors can point in opposite directions (i.e. either -1 or 1).
Last but not least, we could counter-argue by stating that algorithms based on an initial blind "computational explosion of the number of features" often perform better than those which do not perform such an explosion, especially when coupled with subsequent feature selection algorithms. Such is the case, for example, of the approach proposed by Viola & Jones (2001), which caused a revolution in computer vision by proposing that, in order to detect an object, one should look for combinations of pixels instead of individual pixels. In this paper, such combinations of "letter-pixels" were, mutatis mutandis, called "fragments".

Our method departs from the idea that one can, and should, associate random vectors with such fragments. But the idea can go further. Instead of looking for the occurrence of a part in the whole, a more advanced RI-based approach could replace the notion of a "fragment occurring in the word" by the more general notion of a "pattern which matches the sequence". Thus even the vector associated with the pattern /d.g/ could be taken into account during the construction of the vector representing the word "dog".

Recalling that RP-based models perform very well when it comes to offering solutions to quite "deep", signifié-oriented problems, we find it difficult to understand why the same algorithmic machinery could not be applied to "surface", signifiant-oriented problems, notably given the fact that some sort of dimensionality reduction has to occur whenever the mind tries to map >4D experiences onto the neural substrate of a brain embedded in 3D physical space.

All the observed correlations and centroid overlaps indicate that string distance calculation based on Random Projection could turn out to be a useful substitute for the LD measure or even for other, more fine-grained methods. And given that RP would not be possible if the Johnson-Lindenstrauss lemma were not valid, our results can also be interpreted as another empirical demonstration of the validity of the JL lemma.

Acknowledgments

The author would like to thank Adil El-Ghali for the introduction to Random Indexing as well as for his comments concerning the present paper; prof. Charles Tijus and doc. Ivan Sekaj for their support; and Aliancia Fair-Play for the permission to execute some code on their servers.

References

Trevor Cohen, Roger Schvaneveldt & Dominic Widdows. 2010. Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2), 240–256.

Fred J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), 171–176.

Adil El Ghali, Daniel D. Hromada & Kaoutar El Ghali. 2012. Enrichir et raisonner sur des espaces sémantiques pour l'attribution de mots-clés. JEP-TALN-RECITAL 2012, 77.

Peter Gärdenfors. 2004. Conceptual Spaces: The Geometry of Thought. MIT Press.

Karl Glazebrook, Jarle Brinchmann, John Cerney, Craig DeForest, Doug Hunt, Tim Jenness & Tuomas Lukka. 1997. The Perl Data Language. The Perl Journal, 5(5).

Verity Harte. 2002. Plato on Parts and Wholes: The Metaphysics of Structure. Clarendon Press, Oxford.

Matthew A. Jaro. 1995. Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7), 491–498.

William B. Johnson & Joram Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.

Pentti Kanerva, Jan Kristofersson & Anders Holst. 2000. Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society (Vol. 1036).
John Keats. 1819. The Fall of Hyperion: A Dream. In John Keats: Complete Poems and Selected Letters, 381–395.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10, 707–710.

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini & Chris Watkins. 2002. Text classification using string kernels. The Journal of Machine Learning Research, 2, 419–444.

Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Magnus Sahlgren. 2005. An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE), Vol. 5.

Magnus Sahlgren & Rickard Cöster. 2004. Using bag-of-concepts to improve the performance of support vector machines in text categorization. In Proceedings of the 20th International Conference on Computational Linguistics, p. 487.

Magnus Sahlgren & Jussi Karlgren. 2002. Vector-based semantic analysis using random indexing for cross-lingual query expansion. In Evaluation of Cross-Language Information Retrieval Systems, pp. 169–176.

Magnus Sahlgren & Jussi Karlgren. 2005. Automatic bilingual lexicon acquisition using random indexing of parallel corpora. Natural Language Engineering, 11(3), 327–341.

Alan M. Turing. 1936. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42(2), 230–265.

Paul Viola & Michael Jones. 2001. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE CVPR 2001.

Robert A. Wagner & Michael J. Fischer. 1974. The string-to-string correction problem. Journal of the ACM, 21(1), 168–173.