December 2016

Integer-based nomenclature for the ecosystem of lexically repetitive expressions in complete works of William Shakespeare

Daniel Devatman HROMADA a,1
a Berlin University of the Arts, Faculty of Design

Abstract. Repetition of morphological or lexical units is an established technique able to reinforce the impact of one’s argument upon the audience. Rhetorical tradition has canonized dozens of repetition-involving schemata as figures of speech. Our article shows how hitherto ignored repetition-involving schemata can be identified. It shows that certain classes of repetitive figures can be represented in terms of specific sequences of integers and, vice versa, how specific sets of integers can be translated into sets of regexes able to match repetition-involving expressions. A "Shakespeare number" S is defined as an integer with at least one repeated digit, in whose decimal representation no digit bigger than X occurs before the digit X itself has occurred. Hence, 121 is a Shakespeare number, while 123 or 211 are not. A set of "entangled numbers" is subsequently defined as a subset of "Shakespeare numbers" with the additional property that every digit which occurs in them occurs at least twice in the decimal representation of the number. Thus, 1212 is an entangled number while 1211 is not. A complete set E of entangled numbers of maximal length of 10 digits is subsequently generated and every member of E is translated into a regex. Each regex is subsequently exposed to all utterances in all works of William Shakespeare, allowing us to pinpoint 3667 instances of 172 distinct E-schemata. This nomenclature may allow scholars to lead a discussion about schemata which have escaped the attention of classical interpreters.

Keywords. repetitive figures of speech, regular expressions, William Shakespeare, back-references, integers, stylometry

1. Exordium

"A faulty argument repeated twice is already better: repeated twenty times, it is excellent. Our ears adapt to it as to any other music and we applaud it mechanically ... One repeats an argument as one hums a vaudeville: not because it is good, but because it has been often chanted." [3, note XXIII]

Repetitio mater studiorum, pater oratoriumque.

It was already known to the ancients that even the clearest reasoning can fail to convince the audience if the intended argument is not communicated with sufficient redundancy. And it is well known to the moderns that the cheapest yet most efficient way to attain such redundancy is the repetitive transfer of information [7] from sender to receiver [18]. What’s more, in human cognitive systems, repeated information is often amplified (Reference in camera ready version).

It is therefore hardly surprising that repetition plays a non-negligible role in the art of persuasion, commonly known as rhetoric. Thus, in practically every classical manual, the students of oratory and poetic disciplines are reminded to reassert their arguments; to mould forms which reflect their contents and utter contents which reflect their form; to make certain words and syllables appear and reappear; to repeat certain sounds or reactualize certain movements. Simply stated: to remember the figures by means of which one can reinforce one’s influence over one’s audience. Hence, schemata known under names as diverse as polysyndeton, anaphora, anadiplosis, epistrophe, symploce, antanaclasis, paronomasia or even antimetabole² are traditionally defined in terms of repetition of their components [22].

¹ Corresponding Author: Daniel Devatman Hromada, Faculty of Design, Berlin University of the Arts, 10823 Schoeneberg, Berlin, Bundesrepublik Deutschland, EU. Mails to: daniel at udk dash berlin dot de
But there is more, for one should also not forget repetitive figures (RFs) like alliteration, paregmenon, polyptoton, epizeuxis or even good old psittacism. Hence, dozens of RFs are sure to exist, but their scholastic nomenclature complicates any further communication with more computationally and NLP-oriented researchers. The objective of this article is to bridge this gap.

2. Introduction

In literature studies it is fairly common to speak about so-called "rhyme schemes" like AAAA for monorhymes, ABAB for alternate rhyme, ABBA for enclosed rhymes etc. It is therefore not very surprising that analogous formalisms - that is, formalisms that involve alphabetic indices - have been adopted by scholars aiming to formalize a subgroup of rhetoric figures known as the group of schemes. For example, [11] use the following formalism:

    [W]a ... [W]b ... [W]b ... [W]a

to denote the rhetoric figure known as antimetabole. Subsequent studies in automated chiasm identification and detection pursue a similar route and often use formulae like ABXBA, ABCBA, ABCXCBA to denote schemata corresponding to utterances such as "Drake love loons. Loons love Drake.", "All as one. One as all." ([12]) or "In prehistoric times women resembled men, and men resembled women." ([6]).

This being understood, the core idea behind this article is simple to explicate. For what shall be elucidated here is truly nothing more than the most basic formalistic quirk: a notational flip from alphabetic to numeric indices. Hence, A-indices are to be substituted by 1-indices, B-indices by 2-indices, C-indices by 3-indices et caetera. Hence and henceforth, one is free to use the form 1212 instead of ABAB, 1221 instead of ABBA and 12321 instead of ABCBA...
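The notational flip described above can be sketched in a few lines of Perl. This is only an illustration and not one of the article's listings: the helper name `scheme_to_number` is hypothetical, and the sketch assumes that indices are handed out in order of first appearance, so that a filler symbol such as X simply receives the next free index.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the article's listings): rewrite an
# alphabetic scheme such as "ABBA" as a numeric one such as "1221".
# Assumption: indices are assigned in order of first appearance, so the
# result does not depend on which letters were chosen ("XYYX" also gives 1221).
sub scheme_to_number {
    my ($scheme) = @_;
    my %index;       # letter -> numeric index
    my $next   = 1;  # next unused index
    my $number = '';
    for my $letter (split //, uc $scheme) {
        $index{$letter} = $next++ unless exists $index{$letter};
        die "more than nine distinct symbols in '$scheme'\n"
            if $index{$letter} > 9;
        $number .= $index{$letter};
    }
    return $number;
}

print "$_ -> ", scheme_to_number($_), "\n" for qw(ABAB ABBA ABCBA ABXBA);
```

Under these assumptions ABCBA and ABXBA both map to 12321, which is the sense in which the numeric form abstracts away from the particular letters used.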
² Note that certain RFs included in a so-called "chiasmatic suite" (Reference - this volume) are not only repetition-involving but also fractal-like, in the sense that they embed other repetitions which include yet other repetitions.

Such a change of notation may subsequently allow certain scholars to perceive and conceive a set of potentially interesting rhetoric schemata as a potentially infinite subset of the infinite but countable set [2,9,21] of natural numbers, that is, of positive integers.

The main implication of such a mapping of a set of surface-based, repetition-involving rhetoric figures onto the set of integers goes as follows: given that the set of integers is enumerable, the set of our integer-based formulae denoting RF-schemata is enumerable as well. And as shall be shown, developing a program which enumerates large amounts of such schemata is a fairly trivial enterprise which can fit into a dozen lines of code (c.f. listings 1 and 2).

A program generating such sets, however, was neither developed nor is it presented here just to satisfy some mathematician’s useless fancy. Rather the contrary is the case, and our objectives are to be considered more practical than theoretical. For such sets of potentially interesting RF-schemata can be translated - by yet another program (c.f. 4) - into so-called regular expressions ("regexes") which can subsequently be used to match and discover hitherto unknown repetition-based expressions occurring in attested natural language corpora [24,8]. Like that of the collected works of William Shakespeare, for example.

3. Definitions

3.1. Shakespeare number

A Shakespeare number S is a positive natural number (S ∈ N) whose decimal representation exhibits two properties:

• repetitive property: at least one digit occurs twice
• ascending property: S contains no digit n > 1 without containing a digit n − 1 to the left of the first occurrence of n

In order to see the principle more clearly, Table 1 enumerates the ten Shakespeare numbers with the smallest value.

    S-number  Alphabetic representation  Matchable expression
    11        AA                         "we split we split "
    111       AAA                        "we split we split we split "
    112       AAB                        "here here sir "
    121       ABA                        "to prayers to "
    122       ABB                        "trip audrey i attend i attend "
    1111      AAAA                       "justice justice justice justice "
    1112      AAAB                       "great great great pompey "
    1121      AABA                       "here here sir here "
    1122      AABB                       "gross gross fat fat "
    1123      AABC                       "he he and you "

Table 1. First ten Shakespeare numbers, their corresponding alphabetic representations and arbitrarily chosen Shakespearean expressions which can be subsumed under them.

As a counterexample, let us note that 22 is not a Shakespeare number because the digit 1 does not occur at all, and 221 is not a Shakespeare number because 2 occurs with no 1 to its left. These two numbers therefore do not satisfy the ascending property. On the other hand, numbers like 12, 13 or 123 are also not S-numbers because they do not include any repeated digit and therefore do not satisfy the repetitive property.

Listing 1 displays the source code of a routine able to generate the sequence of S-numbers from one to potential infinity. The sequence of the first 163553 S-numbers - id est those S-numbers whose value is less than 9999999999 - is available at the Online Encyclopedia of Integer Sequences [13] under sequence number A273977³. Deeper mathematical and number-theoretical properties of S-numbers are presented in [19].

³ https://oeis.org/A273977/b273977.txt

3.2. Entangled number

The set of entangled numbers is a subset of the set of Shakespeare numbers (E ⊂ S ⊂ N). E-numbers therefore satisfy the repetitive and ascending properties of S-numbers. In addition, the decimal representation of an entangled number E exhibits one further property:

• closure property: each digit of E occurs at least twice

In order to see the idea more clearly, Table 2 enumerates all seventeen entangled numbers whose digit-length is five or less.

    E-number  Alphabetic representation  Matchable expression
    11        AA                         "we split we split "
    111       AAA                        "we split we split we split "
    1111      AAAA                       "justice justice justice justice "
    1122      AABB                       "gross gross fat fat "
    1212      ABAB                       "to prayers to prayers "
    1221      ABBA                       "my hearts cheerly cheerly my hearts "
    11111     AAAAA                      "so so so so so "
    11122     AAABB                      "great great great pompey pompey "
    11212     AABAB                      "come come buy come buy "
    11221     AABBA                      "high day high day freedom freedom high day "
    11222     AABBB                      "o night o night alack alack alack "
    12112     ABAAB                      "too vain too too vain "
    12121     ABABA                      "come hither come hither come "
    12122     ABABB                      "come buy come buy buy "
    12211     ABBAA                      "freedom high day high day freedom freedom "
    12212     ABBAB                      "on whom it will it will on whom it will "
    12221     ABBBA                      "thou canst not hit it hit it hit it thou canst not "

Table 2. All Entangled numbers with no more than 5 digits, their corresponding alphabetic representations and arbitrarily chosen Shakespearean expressions which can be subsumed under them.

As a counterexample, let us note that numbers like 12, 13, 22 or 123 are not E-numbers because they are not even S-numbers. On the other hand, S-numbers like 121 or 1211 are not E-numbers because they contain a digit 2 which is not repeated.

Listing 2 displays the source code of a routine able to verify whether an S-number presented at the input is an E-number.
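The three defining properties can also be checked directly. The following is a minimal sketch rather than the code of listings 1 and 2: the function names `is_s_number` and `is_e_number` are hypothetical, and the sketch relies on the observation that, under the ascending property, the digits seen so far always form the range from 1 up to the largest digit seen.

```perl
use strict;
use warnings;

# Sketch of the repetitive and ascending properties (Section 3.1).
sub is_s_number {
    my ($n) = @_;
    return 0 unless $n =~ /^[1-9]+$/;    # a digit 0 can never satisfy the ascending property
    my ($max, $repeated) = (0, 0);
    for my $digit (split //, $n) {
        return 0 if $digit > $max + 1;    # digit n needs n-1 somewhere to its left
        $repeated = 1 if $digit <= $max;  # digit already seen, hence repeated
        $max = $digit if $digit > $max;
    }
    return $repeated;
}

# Sketch of the closure property (Section 3.2): every digit occurs at least twice.
sub is_e_number {
    my ($n) = @_;
    return 0 unless is_s_number($n);
    my %count;
    $count{$_}++ for split //, $n;
    return (grep { $_ < 2 } values %count) ? 0 : 1;
}

# Brute-force enumeration of the small cases shown in Tables 1 and 2.
my @s = grep { is_s_number($_) } 1 .. 1123;
my @e = grep { is_e_number($_) } 11 .. 99999;
print "S-numbers up to 1123: @s\n";
print scalar(@e), " E-numbers with at most 5 digits: @e\n";
```

The brute-force loop is only meant for small bounds; it reproduces the ten rows of Table 1 and the seventeen rows of Table 2.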
The sequence of the first 4360 E-numbers - id est those E-numbers whose value is less than 9999999999 - is available at the Online Encyclopedia of Integer Sequences [13] under sequence number A273978⁴. Deeper mathematical and number-theoretical properties of S-numbers are presented in [19].

⁴ https://oeis.org/A273978/b273978.txt

4. Method

The core idea behind our method can be stated as follows: any S- or E-number is to be "translated" into a backreference-endowed regular expression.

More concretely, every digit of an S- or E-number can be interpreted as a sort of element or "brick". In this article, we work with only one type of brick: sequences which are between two and twenty-three characters long⁵. The first occurrence of a novel brick can thus be represented as a PERL-compatible regular expression:

    (.{2,23})

However, any subsequent repeated occurrence of a digit in the S- or E-number is interpreted not as an occurrence of a new brick, but rather as a backreference to the brick which was already denoted by the same digit. Hence, the very first S-number 11 is NOT to be translated into the regex /(.{2,23}) (.{2,23})/, for this would imply the existence of two distinct bricks. Rather, the E-number 11 is to be translated into the regex:

    (.{2,23}) \1

wherein the expression \1 denotes the backreference to the content matched by the regex-brick specified in the first parentheses, i.e. brick no. 1.

Hence, the S-number 111 can be easily translated into the regex /(.{2,23}) \1 \1/, 1111 into the regex /(.{2,23}) \1 \1 \1/ etc. These, however, are cases which correspond only to the repetition of one single brick: 11 for duplication, 111 for triplication, 1111 for quadruplication etc.

⁵ Minimal (e.g. 2) and maximal (e.g. 23) brick length are the only parameters of our model and can be, of course, adequately tuned. Sometimes we shall denote this parameter couple with the term base. More in discussion.

In order to assure the application of the non-identity principle stating that

    "Each distinct digit corresponds to distinct content",

an additional adjustment is needed in case we want to translate S-numbers containing multiple digits of different kinds, that is, S-numbers like 121, 122 or 112. For if we did not care for the principle of non-identity, a number like 121 could be easily represented as /(.{2,23}) (.{2,23}) \1/ and a number like 122 could be translated into /(.{2,23}) (.{2,23}) \2/. It would turn out, however, that these regexes match the very same expressions as other, simpler regexes do as well (e.g. the expression "no no no" could be matched by /(.{2,23}) \1 \1/ as well as by /(.{2,23}) (.{2,23}) \1/ or /(.{2,23}) \1 (.{2,23})/). This is so because nowhere in such a regular expression is it specified that the first brick has to be different from the second brick, or the third brick from the second.

Luckily enough, the syntax of PCREs is rich enough to allow us to encode the non-identity constraint into the regexes themselves. This is attained by putting the backreference into a so-called negative lookahead, traditionally expressed by the formula (?!). Hence, by translating the S-number 121 into the regex

    (.{2,23}) (?!\1)(.{2,23}) \1

we can make sure that the content matched by the brick denoted by digit 2 shall be different from the content matched by the brick denoted by digit 1. Thus, the expression "no no no" shall not be matched by such a regex while the expression "no yes no"⁶ shall.

Going somewhat further, the S-number 12321 - which could be understood as an instance of the chiasmatic ABXBA - is to be translated into the regex

    (.{2,23}) (?!\1)(.{2,23}) (?!\1|\2)(.{2,23}) \2 \1

whereby the disjunctive backreference contained in the negative lookahead (?!\1|\2) assures that the content matched by brick no. 3 - corresponding to the filler X - shall be different from the content matched by the brick representing digit 1 as well as the brick representing digit 2.
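The effect of the negative lookahead can be tried out on the two toy expressions used above. The sketch below is an illustration written for this purpose and is not the article's listing 3: the helper `number_to_regex` is a hypothetical name, it assumes its input is a valid S-number (so that capture group k corresponds to digit k), and it joins bricks with single spaces exactly as in the regexes printed in the text.

```perl
use strict;
use warnings;

# Hypothetical translator following the description in the text.
# Assumption: $number is a valid S-number, so group k belongs to digit k.
sub number_to_regex {
    my ($number, $brick) = @_;
    $brick //= '(.{2,23})';
    my (%seen, @parts);
    for my $digit (split //, $number) {
        if ($seen{$digit}) {
            push @parts, "\\$digit";    # back-reference to a known brick
        } else {
            # forbid the content of every brick opened so far
            my $not = join '|', map { sprintf '\\%d', $_ } 1 .. $digit - 1;
            push @parts, ($not ne '' ? "(?!$not)" : '') . $brick;
            $seen{$digit} = 1;
        }
    }
    return join ' ', @parts;
}

my $naive  = qr/(.{2,23}) (.{2,23}) \1/;   # 121 without the non-identity constraint
my $strict = number_to_regex('121');       # 121 with the negative lookahead
print "strict regex for 121: $strict\n";

for my $utterance ("no no no", "no yes no") {
    printf "%-10s naive=%d strict=%d\n", $utterance,
        ($utterance =~ $naive     ? 1 : 0),
        ($utterance =~ /$strict/  ? 1 : 0);
}
```

One design consequence worth keeping in mind: the lookahead tests a prefix, so a second brick that merely begins with the text of the first brick is rejected as well, which is slightly stricter than plain non-identity.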
This being said, the method of translating S- or E-numbers into regexes which do not transgress the non-identity constraint is pretty much straightforward, and is fully described by the PERL code given in listing 3.

⁶ A cautious reader may now start to observe that non-repeated digits of an S-number in fact correspond to "filler" or "separator" expressions (e.g. "yes") which in many cases fill the space between the repeated elements themselves (e.g. "no").

5. Experiment

5.1. Corpus

A digital, unicode-encoded version of Craig’s edition of the "Complete works of William Shakespeare" [4] has been downloaded from a publicly available Internet source⁷. This corpus contains 17 txt files stored in the sub-folder "comedies", 10 txt files stored in the sub-folder "tragedies" and 10 txt files stored in the sub-folder "historical". What’s more, all utterances are annotated according to the following format:

    <PERSONA> Sentence 1. Sentence ... </PERSONA>
    <MIRANDA> O, wonder! How many goodly creatures are there here! How beauteous mankind is! O brave new world, That has such people in’t! </MIRANDA>

That is, a format highly reminiscent of the format of a valid XML document. This format, wherein diverse values of the tag <PERSONA> denote names of diverse dramatis personae (e.g. Miranda, Prospero), seems to be consistently and stringently followed across all files contained in the corpus. This is advantageous, since it implies that the content present between the opening and closing tag can be understood as a supraphrasal, meaning-encoding monadic unit: an utterance.

⁷ Downloaded from http://www.lexically.net/downloads/corpus_linguistics/ShakespearePlaysPlus.zip. Backup at http://sci.wizzion.com/ShakespearePlaysPlus.zip .

Verily, this is encouraging. It is encouraging for a theoretical (1.) as well as for a practical (2.) reason:

1. the school of thought to which our research tends to adhere is principally a constructivist, usage-based linguistic paradigm best manifested in [20]
2. the computational complexity of matching backreference-endowed regexes depends superlinearly or maybe even non-polynomially [1] on the length of the text being matched

Regarding the practical reason, it could be postulated that our article offers certain evidence for the hypothesis "backreferenced regex-parsing of Shakespearean utterances is computationally tractable in reasonable time", whereby the term "reasonable" denotes time scales between milliseconds and minutes. More in discussion.

Regarding the theoretical reason, it is worth making explicit that an implicit leitmotif of Tomasello’s theory is a definition stating:

    Utterance is the basic unit of linguistic interaction.

5.2. Processing

Dramatic pieces are divided into utterances. This is a natural consequence of the fact that dramatic pieces tend to represent scenarios within which diverse dramatis personae interact with each other. It is difficult to see any other literary genre where the division into utterances is as marked as in the case of drama⁸. And in the case of the digital version of the [4] Shakespeare corpus, such markedness tends to be even more pronounced.

Therefore, one simply needs to cut the corpus into utterances by interpreting the closing tag of the utterance (e.g. </PERSONA>, </MIRANDA> etc.) as the utterance separator. Even more concretely, one can simply consider the slash symbol / to be the utterance separator.

Subsequently, dividing the original dramatic text into utterances is, at least in PERL, as simple as defining the symbol / to be the default input separator. That is, in PERLish, by executing the following code:

    $/ = "/";

⁸ Plato’s dialogues are, of course, set aside as a very particular case. When it comes to film scripts and/or subtitles to other audiovisual media, these are principally understood as a particular subtype of dramatic pieces.

Only two further text-processing steps were executed during the initialization phase of the experiment presented here.
Primo, the content of each utterance has been put into lowercase. Secundo, non-alphabetic symbols (e.g. dot, comma, exclamation mark etc.) have been replaced by blank spaces. We are aware that such a replacement could potentially lead to a certain loss of prosody- or pathos-encoding information. However, we consider this step legitimate because the objective of our experiment was to focus on the repetition of lexical units.⁹

Once the pre-processing code has been executed, identification of expressions containing diverse types of lexical repetition is as simple as matching each Shakespearean utterance with each regex.

⁹ Enumerative generation of backreference-involving regexes focusing on repetitions of phonotactic clusters, syllables, phrases or potentially even sememes and prosodies is, in theory, also possible. We prefer, however, not to focus on this topic within the limited scope of this article.

6. Results

This section presents the results of exposing Shakespeare’s corpus to base(2,23) regular expressions generated out of all entangled numbers with max. length of 10 digits. We focus on E(2,23)-numbers because their closure property (i.e. "every digit contained in a valid E-number has to occur at least twice") gives an arbitrary E-number the ability to match a much rarer gem than an arbitrary S-number.

6.1. Quantitative

All in all, 3667 instances of a repetitive expression have been detected in Shakespeare’s complete works. These were contained in 2295 distinct utterances and corresponded to 172 distinct E(2,23) schemata. Among these, 71 matched more than one instance: these schemata could thus potentially correspond to a certain cognitive pattern or a habitus in Shakespeare’s mind. Table 3 summarizes the matching frequencies of the schemata which match at least five distinct utterances.

Table 3. Number of instances found in the collected works of William Shakespeare for each E-number schema matching at least five distinct utterances. The regexes were generated from E-numbers with individual bricks matching expressions not shorter than 2 characters and not longer than 23 characters.

    Instances  E(2,23)-number  Example
    2332       11              "bestir bestir "
    525        1212            "to prayers to prayers "
    170        111             "ha ha ha "
    100        123123          "cover thy head cover thy head "
    48         12121           "come hither come hither come "
    35         1221            "fond done done fond"
    32         12341234        "let him roar again let him roar again "
    32         1122            "with her with her hook on hook on "
    30         1111            "great great great great "
    23         121212          "come on come on come on "
    12         123231          "upholds this arm this arm upholds "
    12         1231231         "fubbed off and fubbed off and fubbed "
    11         121233          "trip audrey trip audrey i attend i attend "
    11         112323          "what what what ill luck ill luck "
    10         123312          "my hearts cheerly cheerly my hearts "
    10         11122           "lady lady lady alas alas "
    9          121323          "a lord to a lord a man to a man "
    8          12321434        "land rats and water rats land thieves and water thieves "
    8          11111           "so so so so so "
    7          12312312        "let me see let me see let me "
    6          11234234        "on on on to the breach to the breach "
    5          12123434        "i thank god i thank god is it true is it true "
    5          1112323         "barren barren barren beggars all beggars all "

Another phenomenon may be found noteworthy by a reader interested in the purely quantitative aspects of our research. For E-numbers consisting of a single repeated digit, the number of digits L seems to stand in a Zipf-like [25] relation to the number of occurrences of expressions which can be matched by such an E-number. For example, Shakespeare’s dramas seem to contain 2332 duplications (E = 11), 170 triplications (E = 111), 30 tetraplications (E = 1111), 8 pentaplications (E = 11111¹⁰), two hexaplications (E = 111111¹¹), one heptaplication (E = 1111111¹²) and zero octaplications.
It is worth mentioning, however, that the generic relation between the length (in digits) of an E-number X and the number of utterances which X matches seems not to be Zipfian. This is illustrated by Table 4.

¹⁰ E.g. "never never never never never " by Lear in King Lear.
¹¹ E.g. "kill kill kill kill kill kill " also by King Lear.
¹² E.g. "so so so so so so so " by Shallow in The Second Part of King Henry IV.

    Digits       2     3    4    5    6    7    8    9
    Theoretical  1     1    4    11   41   162  715  3425
    Matched      2332  170  622  91   211  56   86   67

Table 4. Schemata corresponding to E-numbers with an even number of digits match more frequently than those with an odd number of digits.

As indicated by Table 4, the observed preference for repetitive expressions including two, four, six or eight bricks cannot be explained in terms of the number-theoretical distribution of E-numbers themselves. For example, there exist eleven E-numbers with five digits and forty-one E-numbers of length six. However, when exposed to the Shakespeare corpus, base(2,23) regexes generated from E-numbers six digits long seem to match 211 utterances, while regexes five bricks long match only ninety-one of them. Whether this observed asymmetry is an artefact of our method and our definition of E-numbers, or whether it is due to a sort of cognitive bias, a sort of preference for balanced repetitions, is a question which we do not dare to tackle within the limited scope of the present article.

6.2. Qualitative

It may be said that the longer the E- or S-number is, the more complex a structure and the more cognitively salient, pathos-filled an entity it potentially represents. For this reason, this subsection offers the reader a few answers to the question: "What Shakespearean expressions can be matched with the longest possible E-number?" In all following examples, we will use the base(2,23) E-numbers, i.e. restrict the length of individual bricks to min. 2 and max. 23 characters.
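As a cross-check on the "Theoretical" row of Table 4, the number of E-numbers of each digit length can be recomputed directly from the definitions of Section 3. The sketch below is not part of the article's listings: the function name `count_e_numbers` is hypothetical, and instead of filtering all integers it grows digit strings from left to right, which keeps the search small.

```perl
use strict;
use warnings;

# Count E-numbers of a given digit length. The next digit may repeat any
# digit used so far or open the next unused one (ascending property);
# complete strings are kept only if every digit occurs at least twice
# (closure property, which also implies the repetitive property).
sub count_e_numbers {
    my ($length, $prefix, $max) = @_;
    $prefix //= '';
    $max    //= 0;
    if (length($prefix) == $length) {
        my %count;
        $count{$_}++ for split //, $prefix;
        return (grep { $_ < 2 } values %count) ? 0 : 1;
    }
    my $total = 0;
    for my $digit (1 .. ($max < 9 ? $max + 1 : 9)) {
        $total += count_e_numbers($length, $prefix . $digit,
                                  $digit > $max ? $digit : $max);
    }
    return $total;
}

my $sum = 0;
for my $digits (2 .. 9) {
    my $n = count_e_numbers($digits);
    $sum += $n;
    print "$digits digits: $n\n";
}
print "total up to 9 digits: $sum\n";
```

Under these assumptions the per-length counts agree with the "Theoretical" row (1, 1, 4, 11, 41, 162, 715, 3425), and their sum is the 4360 E-numbers with fewer than ten digits mentioned in the text.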
In the realm of comedies¹³, one can observe that the regex generated from the number 12343434 pin-points the following utterance from Stephano playing his role in The Tempest:

    Flout (1) ’em (2), and (3) scout ’em (4); and (3) scout ’em (4), and (3) flout ’em (4); Thought is free.

while the regex generated from the number 12343412 identifies Miranda’s:

    All (1) lost (2) to (3) prayers (4), to (3) prayers (4) all (1) lost (2).

or Caliban’s

    Freedom (1), high (2) day (3) ! high (2) day (3), freedom (1) ! freedom (1) ! high (2) day (3), freedom (1) ! ¹⁴

all appearing in the same play.

¹³ Link to the file containing all XXX expressions shall be published in the camera-ready version of the article.

¹⁴ It is important to realize that the very same expression can be matched by multiple regexes. Hence, the above-mentioned proclamation of Caliban can be analyzed not only as matching the base(2,23) E-number 1232311231, but also as matching E-numbers like 1221121 (if "high day" forms only one brick) etc. This is analogous, mutatis mutandis, to a sentence having multiple syntactic parses.

Another answer, corresponding to the E-number 122133144, is given by Dromio, a personage in Shakespeare’s "Comedy of Errors":

    She is so hot because (1) the meat is cold (2) ;
    The meat is cold (2) because (1) you come not home (3);
    You come not home (3) because (1) you have no stomach (4);
    You have no stomach (4), having broke your fast;

Analyzing the realm of tragedies, one may see Polonius - a character in the Hamlet drama - utter a 11231434231-matchable expression:

    The best actors in the world, either for tragedy, comedy, history, pastoral (1), pastoral (1) - comical (2), historical (3) - pastoral (1) , tragical (4) - historical (3), tragical (4) - comical (2) - historical (3) - pastoral (1) , scene individable, or poem unlimited: Seneca cannot be too heavy, nor Plautus too light. For the law of writ and the liberty, these are the only men. ¹⁵

or one can hear Hamlet himself pronouncing the following 1231414312-matchable sequence:

    Let your own discretion be your tutor: suit the (1) action (2) to (3) the (1) word (4), the (1) word (4) to (3) the (1) action (2)

¹⁵ Note that if the regexes had been constructed in a way that ignores suffixes, i.e. used bricks having a form like "(.{2,23})\w{0,4}", then this utterance could potentially be matched with a much longer number, because not only the adjectives (e.g. "historic-al") but also the preceding substantives (e.g. "histor-y") would be accounted for.

while Mercutio from the Romeo and Juliet narrative states:

    Come, come, thou art as hot a Jack in thy mood as any in Italy; and as soon (1) moved (2) to be (3) moody (4), and as soon (1) moody (4) to be (3) moved (2).

These examples, of course, are just the tip of an iceberg. Verily, only the tip of an iceberg, because many strongly marked repetitive expressions are also to be found in Shakespeare’s historical dramata. Among these, the dramata eternalizing the narratives of Henry IV. and Henry V. tend to top the list.
Hence, Gadshill reasons

    will strike (1) sooner (2) than (3) speak (4) and (5) speak (4) sooner (2) than (3) drink (6) and (5) drink (6) sooner (2) than (3) pray and yet i lie for they pray continually to their saint the commonwealth or rather not pray to her but prey on her

while Falstaff emphasizes:

    banish peto banish bardolph banish poins but for sweet jack falstaff kind jack falstaff true jack falstaff valiant jack falstaff and therefore more valiant being as he is old jack falstaff banish (1) not (2) him (3) thy (4) harry s (5) company (6) banish (1) not (2) him (3) thy (4) harry s (5) company (6) banish (1) plump jack and banish all the world

It is, however, a persona named Shallow who seems to be particularly fond of repetitions, once saying

    come (1) on (2) come (1) on (2) come (1) on (2) sir (3) give (4) me (5) your (6) hand (7) sir (3) give (4) me (5) your (6) hand (7)

and next time saying:

    where s (1) the roll (2) where s (1) the roll (2) where s (1) the roll (2) let (3) me (4) see (5) let (3) me (4) see (5) let (3) me (4) see (5) so (6) so (6) so (6) so (6) so (6) so (6) so (6) yea marry sir ralph mouldy let them appear as i call let them do so let them do so let me see where is mouldy

Given that Shallow appears in historical dramata, an interesting question could be rightfully posed: is Shallow’s tendency to produce repetitive utterances en masse just Shakespeare’s invention, or is it rather a sort of description of particular cognitive characteristics of a once existing historical personage?

7. Conclusion

Our article presents a way of mapping a subset of the set of all possible backreference-endowed regexes onto a set of natural numbers. It indicates that for every base of a certain kind, the set of regexes-to-be-generated is infinite but enumerable. A set of so-called Shakespeare numbers (S-numbers) is defined, as well as the set of "Entangled numbers", the second being a subset of the first satisfying one additional constraint: every distinct digit ("symbol") of an entangled number E_X occurs in E_X at least twice.

We have subsequently generated a list of all such S-numbers (c.f. listing 1) and E-numbers (c.f. listing 2) with at most 10 digits. After this, the E-numbers were translated into backreference-endowed regular expressions whose most elementary units, so-called "bricks", were no shorter than two and no longer than twenty-three characters. In the end, these regexes were exposed to a corpus containing the collected works of William Shakespeare.

This approach allowed us to pinpoint 3667 utterances matching at least one among 172 distinct repetitive formulae. We believe that at least some among these formulae could be of certain interest not only for Shakespearean [14] scholars in particular, but also for the wider fields of "digital humanities" [23] or stylometry.

The good news is that the whole matching process is also fairly fast. More concretely, matching all utterances with all base(2,23) regexes generated out of all 4360 E-numbers with less than 10 digits lasted 9555 seconds in the case of Shakespearean comedies, 6607 seconds in the case of tragedies and 6900 seconds in the case of historical dramata. All this on one single core of a 1.4 GHz CPU.

8. Peroratio

Rhetoric undoubtedly belongs among the five oldest scientific paradigms ever explicated by scholars of the occidental¹⁶ tradition. Even before Plato noted down discussions between Socrates and Gorgias and Socrates and Parmenides; even before Aristotle projected his point of view upon the realm of man, Athēnaia had already been venerated.

The longevity of rhetoric has positive as well as negative sides. Negative, for such a lengthy tradition implies potential impediments caused by centuries of terminological and methodological sediments.
We are convinced that, similiarly to diverse occult notations of pre-Mendelean chemistry, may alphabetic notation of BABAs and ABBAs be also considered to be such sediments in regards to rhetoric science. Hence, by a trivial act of switching notation from As to ones and Bs to twos, we aspire to do nothing else than to unblock this science from the state of terminological traffic jam to somewhat more fluid a state. Hence and thus, interesting and almost melodical17 verses of Shakespeare have been pin-pointed and juxtaposed side by side to each other. Being unsure of whether such juxtaposition has ever been explored in the depth their merit, we find our qualitative results worthy of not only exploring but also publishing. For who knows, maybe they shall even inspire some potential Shakespeare of the future ? Quantitative explorations may also turn out to be worthy of further exploration. Three axes of such exploration are immediately visible: 1. "universalia axis": study of language-independent invariants and rhetorical schemata which occur across many distinct languages and/or language groups [12] 2. "ontogenetic axis": exploration of processes by means of which complex eloquency of an individual locutor emerges out of simpler structures, from mind of a child to Shakespeare 3. "historical axis": study of different Digital Humanities resources in order to increase our knowledge about styles, fashions, crossovers and traditions popular during different epoches of human history In terms of Saussurian linguistics ([5]), one may consider the first axis to be synchronic one while the the second and third can be considered as "diachronic" ones. 16 Note, however, that rhetorics is far from being unknown to Orient as well. Known as Sarasvatı̄ in the sanskrit world, the goddess embodies knowledge, arts, music, melody, muse, language, rhetoric, eloquence, creative work ... [17] seems to be active already in vedic or even pre-vedic proto-indo-european times. 
¹⁷ It may be the case that the application of our method to musical scores, as stored in MIDI files, for example, shall also yield some worthy insights.

Listing 1: Perl code generating an ascending sequence of Shakespeare numbers. Code hereby transferred to the public domain under license CC BY-NC-SA for artistic use and mGPL license for general use.

    use strict;
    use warnings;

    my $i = 1;
    INCREMENT: while ( $i++ ) {                # runs until interrupted
        my %seen = ( 0 => 1 );                 # sentinel: digit 1 is always admissible
        my $repeated = 0;
        for my $digit ( split //, $i ) {
            # a digit may occur only if its predecessor has already occurred
            next INCREMENT unless exists $seen{ $digit - 1 };
            $repeated = 1 if $seen{$digit};    # at least one digit is repeated
            $seen{$digit} = 1;
        }
        print "$i\n" if $repeated;
    }

One may, for example, extend the work of [12] in the domain of "language-independent detection of figures of speech" and demonstrate that E-numbers of considerable length match expressions not only in Shakespeare, but also in Goethe, Molière, Milton or others. Or focus on so-called "sacred texts" like the Bible, the Koran or the Rigveda where repetitions, indeed, abound. Or pursue a somewhat more psycholinguistic, ontogeny-oriented line of research and study a corpus like CHILDES [15] in order to explore how complex eloquence emerges out of variations within the repetition of complex sequences (further references to be given in the camera-ready version).

Last but not least, we are convinced that our S- or E-number nomenclatures could be embedded into rhetorical figure ontologies [11,16]. Within such ontologies, antimetaboles could thus be "enriched" with attributes like "12321", "123321", "1234321" etc.; anadiplosis would be labeled with another set of numbers, antistrophe with yet another, etc. The advantage of such an enrichment is quite easy to see: such enriched elements would become "grounded" [10]. That is, when looking for, or inferring the presence of, a certain figure of speech F in a certain text T, one could consult the ontology and see whether F is labeled with S_F or E_F attributes. If yes, one could simply parse T with the corresponding S_F or E_F regexes.
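The lookup just described can be illustrated with a minimal sketch. It assumes a toy ontology held in a Perl hash; the hash %ontology and the helper names number_to_regex and find_figure are hypothetical and do not belong to any existing ontology software. The translation step follows the logic of Listing 3, and the sample utterance is a line from Macbeth, lowercased and stripped of punctuation in the manner of Listing 4 and padded with spaces, since the generated regexes expect a space before the first brick and a space or "<" after the last one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical miniature "ontology": figure name => numeric labels.
# The antimetabole labels are those named in the text; the hash layout
# itself is an assumption made for this sketch.
my %ontology = (
    antimetabole => [ '12321', '123321', '1234321' ],
);

# Translate an S- or E-number into a regex, following the logic of Listing 3.
sub number_to_regex {
    my ($number) = @_;
    my $brick = '(.{2,23})';
    my ( $re, $no, %seen ) = ( '', '' );
    for my $digit ( split //, $number ) {
        $re .= ' ';
        if ( $seen{$digit} ) {
            $re .= '\\' . $digit;    # repeated digit: back-reference
        }
        else {
            if ( $digit > 1 ) {
                # a new brick must differ from all previously captured bricks
                $no .= ( $digit > 2 ? '|' : '' ) . '\\' . ( $digit - 1 );
                $re .= '(?!' . $no . ')';
            }
            $re .= $brick;
            $seen{$digit} = 1;
        }
    }
    return $re . '[ <]';
}

# Return those labels of figure $figure whose regex matches $text.
sub find_figure {
    my ( $figure, $text ) = @_;
    return grep { my $re = number_to_regex($_); $text =~ /$re/ }
        @{ $ontology{$figure} || [] };
}

# "Fair is foul, and foul is fair" (Macbeth), pre-processed and padded.
my $utterance = ' fair is foul and foul is fair ';
print join( ', ', find_figure( 'antimetabole', $utterance ) ), "\n";
```

On this utterance the sketch should report the labels 12321 and 1234321 but not 123321: since a brick may span several words, "foul and foul" can serve as the middle brick of 12321, whereas 123321 would require two identical adjacent bricks, which the line does not contain.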
One could thus establish a practical, functional, bidirectional bridge between the abstract realm of purely descriptive ontologies and the material reality of text corpora which are to be parsed and understood. And, of course, such nomenclatures, or nomenclatures of a similar vein, may allow communication between computational and classical scholars in unambiguous, precise, yet still concise and sufficiently explanatory terms.

This being said, we conclude this article with an expression of hope that the method hereby introduced shall make it possible to spot, identify, classify and study at a deeper level the intricacies of cognitive ecosystems populated with swarms and clusters of hitherto unknown psycholinguistic schemata traditionally known as "figures of speech".

Acknowledgments

TBD in the camera-ready version of the article.

Listing 2: Perl code checking whether a Shakespeare number given at the input is also an Entangled number. Code hereby transferred to the public domain under the mGPL license.

    use strict;
    use warnings;

    # Read one Shakespeare number per line; print it only if it is entangled.
    OUTER: while ( my $number = <> ) {
        chomp $number;
        my %count;
        $count{$_}++ for split //, $number;      # occurrences of each digit
        for my $digit ( keys %count ) {
            next OUTER if $count{$digit} < 2;    # a digit occurring only once
        }
        print "$number\n";
    }

Listing 3: Perl code translating S-numbers into syntactically correct regexes. Code hereby transferred to the public domain under the mGPL license.

    use strict;
    use warnings;

    my $base = '(.{2,23})';    # a "brick": 2 to 23 arbitrary characters
    my $n    = shift @ARGV;    # the number to translate
    die "usage: $0 NUMBER\n" unless defined $n;
    my $re = "";
    my $no = "";               # back-references which a new brick must not repeat
    my %h;                     # digits already bound to a capture group
    for my $i ( split //, $n ) {
        $re .= " ";
        if ( defined $h{$i} ) {
            $re .= '\\' . $i;  # repeated digit: back-reference
        }
        else {
            if ( $i > 1 ) {
                $no .= ( $i > 2 ? '|' : '' ) . '\\' . ( $i - 1 );
                $re .= '(?!' . $no . ')';
            }
            $re .= $base;
            $h{$i} = 1;
        }
    }
    $re .= '[ <]';
    print "$n translates into $re\n";

Listing 4: Perl code for utterance-oriented pre-processing of texts contained in the ShakespearePlaysPlus corpus. Code hereby transferred to the public domain under the mGPL license.

    use strict;
    use warnings;
    use open ":encoding(utf-16)";

    my %utterances;
    $/ = "/";    # consider the slash symbol to be the input record separator
    while (<>) {
        my $line = lc $_;                       # lowercase
        $line =~ s/[\r\n\t.,?!:;'"\-]+//g;      # remove punctuation and line breaks
        push @{ $utterances{$ARGV} }, $line;    # construct the utterance hash, keyed by file name
    }

References

[1] Alfred Vaino Aho. Algorithms for finding patterns in strings. Algorithms and Complexity, 1:255, 2014.
[2] Georg Cantor. Über eine elementare Frage der Mannigfaltigkeitslehre. Jahresbericht der Deutschen Mathematiker-Vereinigung, 1:75–78, 1892.
[3] Georges Caumont. Notes morales sur l’homme et sur la société. Sandoz & Fischbacher, Paris, 1872.
[4] William James Craig. The complete works of William Shakespeare. Oxford University Press, 1919.
[5] Ferdinand De Saussure. Cours de linguistique générale: Publié par Charles Bally et Albert Sechehaye avec la collaboration de Albert Riedlinger. Librairie Payot & Cie, 1916.
[6] Marie Dubremetz and Joakim Nivre. Rhetorical figure detection: the case of chiasmus. on Computational Linguistics for Literature, page 23, 2015.
[7] Luciano Floridi. The philosophy of information. Oxford University Press, 2011.
[8] Jeffrey E. F. Friedl. Mastering regular expressions. O’Reilly Media, Inc., 2002.
[9] Kurt Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38(1):173–198, 1931.
[10] Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990.
[11] Randy Harris and Chrysanne DiMarco. Constructing a rhetorical figuration ontology. In Persuasive Technology and Digital Behaviour Intervention Symposium, pages 47–52. Citeseer, 2009.
[12] Daniel Devatman Hromada. Initial experiments with multilingual extraction of rhetoric figures by means of Perl-compatible regular expressions. In RANLP Student Research Workshop, pages 85–90, 2011.
[13] OEIS Foundation Inc. The on-line encyclopedia of integer sequences, 2017. http://oeis.org.
[14] Sister Miriam Joseph. Shakespeare’s Use of the Arts of Language. Paul Dry Books, 2008.
[15] Brian MacWhinney. The CHILDES project: The database, volume 2. Psychology Press, 2000.
[16] Miljana Mladenović and Jelena Mitrović. Ontology of rhetorical figures for Serbian. In International Conference on Text, Speech and Dialogue, pages 386–393. Springer, 2013.
[17] John Muir. Original Sanskrit texts on the origin and history of the people of India, their religions and institutions. Trübner & Company, 1873.
[18] Claude E. Shannon and Warren Weaver. The mathematical theory of communication. 1949.
[19] NJA Sloane and Arndt Joerg. Counting words that are in "standard order", 2016. https://oeis.org/A278984/a278984.txt.
[20] Michael Tomasello. Constructing a language: A usage-based theory of language acquisition. Harvard University Press, 2009.
[21] Alan Mathison Turing. On computable numbers, with an application to the Entscheidungsproblem. J. of Math, 58(345-363):5, 1936.
[22] Alan Mathison Turing. Rhetorique. Grand Memento Encyclopedique, 1:687–689, 1936.
[23] Michael Ullyot. Review essay: Digital humanities projects. Renaissance Quarterly, 66(3):937–947, 2013.
[24] Larry Wall and Randal L. Schwartz. Programming Perl. O’Reilly & Associates, Sebastopol, CA, 1991.
[25] George Kingsley Zipf. The psycho-biology of language. 1935.