Initial Experiments with Multilingual Extraction of Rhetoric Figures
by means of PERL-compatible Regular Expressions

Daniel Devatman Hromada
Lutin Userlab – ChART – Paris 8 – EPHE - Slovak Technical University
hromi@kyberia.sk

Abstract
A language-independent method of figure-of-speech
extraction is proposed in order to reinforce
rhetoric-oriented considerations in natural
language processing studies. The method is based
upon a translation of the canonical form of
repetition-based figures of speech into the
language of PERL-compatible regular expressions.
Anadiplosis, anaphora and antimetabole were
translated into forms exploiting the back-reference
properties of PERL-compatible regular expressions,
while epiphora was translated into a formula
exploiting the recursive properties of this very
concise artificial language. These four figures
alone matched more than 7000 strings when
applied to dramatic and poetic corpora written in
English, French, German and Latin. Possible
usages, ranging from stylometric evaluation of
the translation quality of poetic works to the more
complex problem of semi-supervised figure-of-speech
induction, are briefly discussed.

1 Introduction

During the Middle Ages and before, the discipline of
rhetoric composed - along with grammar and
logic - a basic component of the so-called trivium.
Considered by Plato to be the “one single art
that governs all speaking” (Plato, trans. 1986) and
subsequently defined by Aristotle as
“the faculty of observing in any given case the
available means of persuasion” (Aristotle, trans.
1954), rhetoric and its basic postulates are still
kept alive by those active in domains as
diverse as politics, law, poetry, literary theory
(Dubois, 1970) or the humanities in general
(Perelman & Olbrechts-Tyteca, 1969).
When it comes to more “exact” scientific
disciplines like informatics or linguistics,
rhetoric seems to be somewhat ignored -
definitely more than its “grammar” and “logic”
trivium counterparts.
While contemporary rhetoric possesses a strong
theoretical background - whether in the form of
Rhetorical Structure Theory (Taboada, Mann, &
Back, 2006), “computational rhetoric” (Grasso,
2002) or computational models of natural
argument (Crosswhite & Fox, 2003) - a more
practically-oriented engineer nonetheless has to
agree with the statement that “the ancient study
of persuasion remain understudied and
underrepresented in current Natural Language
systems” (Harris & DiMarco, 2009).
The aim of this article is to reduce this
“under-representation” gap and, in a certain sense,
to augment the momentum of computational
rhetoric not by proposing a complex model of
argumentation, but by proposing a simple yet
efficient and language-independent method for
the extraction of certain rhetoric figures (RF) from
textual corpora.
RFs, also called “figures of speech”, are
one of the basic means of persuasion which an
orator has at his disposal. Traditionally, they
are divided into two categories: tropes - related
to deeper, i.e. semantic, features of the phrasal
constituents under consideration; and schemes -
related to layers closer to the actual material
expression of the proposition, i.e. to the
morphology, phonology or prosody of the
generated utterance.
The method proposed within this article
shall deal only with a reduced subset of the latter -
that is, with the detection of the rhetoric schemes
anadiplosis, anaphora, antimetabole and epiphora,
which are based on a repetition or reordering of a
given word, phrase or morpheme across multiple
subsequent clauses. While such a stylometric
approach was recently implemented with
encouraging results by Gawryjolek (2009), his
system is operational only when combined with
a probabilistic context-free grammar parser
adapted to the English language, and is hence
dysfunctional when applied to languages for
which such a parser does not exist.
In the following paragraphs of this
article we shall present a system of rhetoric
figure extraction which tends to be
language-independent, i.e. applicable to a textual
corpus written in any language. Ideally, no antecedent
knowledge about the grammar of a language is
necessary for successful extraction by means of
our method: 1) the prescriptive form of the
figure to be extracted and 2) the symbol
representing phrase and/or clause boundaries are
the only information necessary.
More concretely, our proposal is based
on a fairly simple translation of the canonical form
of the rhetoric figure in question into a
computer language, namely into the language of
PERL-compatible regular expressions (PCREs).
PCREs are, in their essence, simply strings of
characters which describe sets of other
strings of characters, i.e. they are a matching
form, a template, for many concrete character
strings. Like many other regular expression
engines, PCREs make this possible by reserving
special symbols - “the metacharacters” - for
quantifiers and classes. But in addition to these
features common to many finite state automata,
PCREs offer much more (Wall & Loukides,
2000). These are the reasons why we consider
PCREs to be appealing candidates for a
translation of rhetorical figures into a
computer-readable symbolic form:
• by implementing “back references”
  (Friedl, 2006), PCREs make it possible
  to refer to that which was already
  matched, hence allowing the construction of
  automata able to match repetitive forms
• by implementing (from PERL version
  5.10 on) “recursive matching”, PCREs
  make it possible to match very complex
  patterns without a need to have recourse
  to other means, external to PCREs
• since the language of PCREs is very
  concise, the resulting PCRE describing the
  rhetorical figure in question is
  usually a string of a few dozen
  characters which could eventually be
  constructed not by means of human
  intervention, as was the case in this
  article, but by means of unsupervised
  genetic programming (Koza, 1992) or
  another grammar induction
  engine (Solan, Horn, Ruppin, &
  Edelman, 2005)

Element      Meaning
W            word
...          arbitrary intervening material
<…>          phrase or clause boundaries
Subscripts   identity (same subscripts),
             nonidentity (different subscripts)

Table 1: part of RF-representation Formalism (RFRF)

2 Method

2.1 PERL-Compatible Rhetoric Figures

Four figures were chosen - namely anadiplosis,
anaphora, epiphora and antimetabole - in order
to demonstrate the feasibility of the “rhetoric
stylometry” approach. We have adopted the
Rhetoric Figure Representation Formalism
(RFRF) - initially conceived by Harris &
DiMarco (2009) - and reduced it in order to
describe only the four figures of interest. Basic
symbols of RFRF and their associated meanings
are presented in Table 1.
Since the goal of this article is primarily
didactic, we shall start this exposé with a very
simple anadiplosis involving just one
back-reference, and end our proposal with a
somewhat more complex recursive PCRE
matching epiphorae containing an arbitrary number
of constituents.
2.1.1 Anadiplosis

Anadiplosis occurs when a clause or phrase starts
with the word or phrase that ended the preceding
unit. It is formalized by RFRF as :
< . . . Wx >< Wx . . . >
We have translated this representation
into this PERL-Compatible Rhetoric Figure
(PCRF):

/((\w{3,})[.?!,] \2)/sig
The repetition-matching faculty is
ensured by a back-reference to an initial n-gram
composed of at least three word characters.
Therefore, this PCRE makes it possible to match
utterances like the one in Cicero's De Oratore:
Sed genus hoc totum orationis in eis
causis excellit, in quibus minus potest
inflammari animus iudicis acri et vehementi
quadam incitatione; non enim semper fortis
oratio quaeritur, sed saepe placida, summissa,
lenis, quae maxime commendat reos. Reos
autem appello non eos modo, qui arguuntur, sed
omnis, quorum de re disceptatur; sic enim olim
loquebantur.1
This is the simplest possible anadiplosis
figure since it matches only strings with two
occurrences of a repeated word. Therefore we
label this figure as anadiplosis{2}.
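
The behaviour of anadiplosis{2} can also be tried out outside of PERL. The following is a minimal sketch which assumes Python's built-in re module as a stand-in for a PCRE engine (the experiments reported here relied on PERL). The flags re.S and re.I play the role of the /s and /i modifiers and finditer emulates /g. The name ANADIPLOSIS and the shortened sample are chosen for illustration only.

```python
import re

# anadiplosis{2}, copied from the PCRF above. Python's re module is an
# assumption of this sketch; re.S and re.I mirror the /s and /i modifiers.
ANADIPLOSIS = re.compile(r"((\w{3,})[.?!,] \2)", re.S | re.I)

# Shortened excerpt of the De Oratore passage quoted above.
sample = ("lenis, quae maxime commendat reos. Reos autem appello "
          "non eos modo, qui arguuntur, sed omnis")

# finditer emulates the /g modifier: every non-overlapping match is kept.
matches = [m.group(1) for m in ANADIPLOSIS.finditer(sample)]
print(matches)
```

Because of the case-insensitive flag, the back-reference \2 accepts the capitalised "Reos" as a repetition of "reos".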

2.1.2 Anaphora

Anaphora is a rhetoric figure based upon a
repetition of a word or a sequence of words at the
beginnings of neighboring clauses. It is
formalized by RFRF as:
< Wx . . . >< Wx . . . >
We have translated this representation
into the following PCRE form:

/[.?!;,] (([A-Z]\w+) [^.?!;,]+[.?!;] \2 [^.?!;,]+[.?!;,] (\2 [^.?!;,]+[.?!;,])*)/sig
Like all RFs presented in this article, this
anaphora is also based on back-reference
matching. In contrast with anadiplosis, where the
dependency was of a very short-distance nature, in
the case of anaphora the second occurrence of the
word can be dozens of characters distant from
the initial occurrence. What's more, this RF takes
into account a possible third repetition of Wx,
which makes it possible to match utterances like
Cicero's:
Quid autem subtilius quam crebrae
acutaeque sententiae? Quid admirabilius quam
res splendore inlustrata verborum? Quid plenius
quam omni genere rerum cumulata oratio?2
Since this PCRF allows us to match
anaphorae with two or three occurrences of a
repeated word, it seems appropriate to
label it as anaphora{2,3}.

1 “For vigorous language is not always wanted, but
often such as is calm, gentle, mild: this is the kind
that most commands the parties. By 'parties' I
mean not only persons impeached, but all whose
interests are being determined, for that was how
people used the term in the old days.”
2 “Is there something more subtle than a rapid
succession of pointed reflections? Is there
something more wonderful than the heating-up of
a topic by verbal brilliance, something richer
than a discourse cumulating material of every
sort?”

2.1.3 Antimetabole

Antimetabole is a rhetoric figure which occurs
when words are repeated in successive clauses in
reversed order. In terms of RFRF, one can
formalize it as follows:
< WA WB WC . . . WC WB WA >
We have translated this representation
into the following PCRE form:

/((\w{3,}) (.{0,23}) (\w{3,})[^\.!?]{0,23} \4 \3 \2)/sig
Unlike the previous examples, in which
only one element was matched and
back-referenced, three elements - A, B, C - are
determined in the initial phases of matching this
chiasmatic antimetabole. Subsequently, the order
of A and C is switched while B is considered to be
identical material intervening between
A and C and between C and A. Since the possible
occurrence of other material intervening between
ABC and CBA (i.e. ABCxCBA) is also taken into
account, this PCRF has successfully matched
expressions like:
Alle wie einer, einer wie alle.3
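
The two back-reference patterns of sections 2.1.2 and 2.1.3 can be exercised in the same way. The sketch below again assumes Python's re module instead of PERL, with each pattern written on a single line. The names ANAPHORA and ANTIMETABOLE are illustrative, and the word "Dixit." is a hypothetical prefix added only because anaphora{2,3} requires a punctuation mark and a space before the first clause.

```python
import re

FLAGS = re.S | re.I  # counterparts of the /s and /i modifiers

# anaphora{2,3} and the antimetabole PCRF, copied from sections 2.1.2
# and 2.1.3 and compiled with Python's re (an assumption of this sketch).
ANAPHORA = re.compile(
    r"[.?!;,] (([A-Z]\w+) [^.?!;,]+[.?!;] \2 [^.?!;,]+[.?!;,] "
    r"(\2 [^.?!;,]+[.?!;,])*)",
    FLAGS,
)
ANTIMETABOLE = re.compile(
    r"((\w{3,}) (.{0,23}) (\w{3,})[^\.!?]{0,23} \4 \3 \2)", FLAGS
)

# "Dixit." is a hypothetical prefix supplying the required left boundary.
cicero = ("Dixit. Quid autem subtilius quam crebrae acutaeque sententiae? "
          "Quid admirabilius quam res splendore inlustrata verborum? "
          "Quid plenius quam omni genere rerum cumulata oratio?")
anaphora = ANAPHORA.search(cicero)
repeated_word = anaphora.group(2)              # the word Wx
clause_count = anaphora.group(1).count("Quid")

antimetabole = ANTIMETABOLE.search("Alle wie einer, einer wie alle.")
a, b, c = antimetabole.group(2, 3, 4)          # elements A, B and C
print(repeated_word, clause_count, (a, b, c))
```

In the antimetabole match the comma is absorbed by the [^\.!?]{0,23} part, which is what allows intervening material between ABC and CBA.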
2.1.4 Epiphora

Epiphora or epistrophe is an RF defined as
“ending a series of phrases or clauses with the
same word or words”. It is formalized by RFRF
as:
< . . . Wx >< . . . Wx >
We have translated this representation
into the following PCRE form:
/([A-Z][^\.\?!;]+ (\w{2,}+)([\.\?!;] ?[A-Za-z][^\.\?!;]+ (?:\2|(?-1))*)\2[\.\?!;])/sig
In contrast with the anaphora{2,3} figure
presented in 2.1.2, the epiphora figure hereby
proposed exploits the “recursive matching”
properties of the latest versions of PCRE (Perl
5.10+) engines. In other words, the expression
(?:\2|(?-1)) matches any number of subsequent
phrases or clauses which end with Wx, and not
just three, as was the case with anaphora{2,3}.
Hence, a quadruple epiphora:
Je te dis toujou la même chose, parce
que c'est toujou la même chose, et si ce n'était
pas toujours la même chose, je ne te dirais pas
toujou la même chose.4
was detected by this recursive PCRF
when it was applied to the corpus of Molière's
works.
Since recursive matching allows us
to create a sort of “greedy” epiphora, we propose
to label it as epiphora{2,} in a possible future
taxonomy of PCRFs.

3 “All as one, one as all.”
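
Python's built-in re module offers no recursive matching, so the recursive PCRF above cannot be run there as it stands. The following sketch therefore swaps in a different and simpler technique: a plain quantifier combined with a back-reference, which likewise accepts two or more clauses ending in the same word. It is not the epiphora{2,} pattern of this article. The name EPIPHORA_ITERATIVE and the English sample are hypothetical.

```python
import re

# Non-recursive approximation (an assumption of this sketch, NOT the
# recursive PCRF above): a first clause ending in a word of at least two
# characters, followed by one or more clauses ending in that same word.
EPIPHORA_ITERATIVE = re.compile(
    r"([A-Z][^.?!;]+ (\w{2,})[.?!;](?: ?[A-Za-z][^.?!;]+ \2[.?!;])+)",
    re.S | re.I,
)

# Hypothetical sample text, not taken from any of the corpora.
sample = "I came home. You came home. We all came home."

match = EPIPHORA_ITERATIVE.search(sample)
final_word = match.group(2)                   # the repeated Wx
clauses = match.group(1).count(final_word)    # clauses ending in Wx
print(final_word, clauses)
```

Since the quantifier + is unbounded, the number of matched clauses is not limited in advance, which is the property that the label epiphora{2,} refers to.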
2.2 Corpora

In order to demonstrate the language-independence
of the rhetoric stylometry method
hereby proposed, we confronted the matching
faculties of the initial “PERL Compatible Rhetoric
Figures” (PCRF) with corpora written in
diverse languages.
More precisely, we have performed the
rhetoric stylometry analysis of 4 corpora written
by poets and orators who are often considered as
exemplary cases of mastering their respective
languages.
For the English language, the complete works of
William Shakespeare were downloaded from
Project Gutenberg (Hart, 2000). The same site
served us as the source of 40 works of Johann
Wolfgang Goethe written in the German language.
When it comes to the original works of Jean-Baptiste
Molière, 39 of them were recursively
downloaded from the French site toutmoliere.net.
Finally, the basic Latin manual of rhetoric,
Cicero's “De Oratore”, was extracted from the
corpus of the Perseus Project (Crane, 1998) in order
to demonstrate that the PCRF-based approach can
yield interesting results even when applied to
corpora written in ancient languages.
Corpora from Project Gutenberg were
downloaded as pure utf8-encoded text. No
filtering of data was performed in order to
analyze the data in their rawest possible form.
The only exception was the stripping away of
possible HTML tags by means of the standard
HTML::Strip filter.
Before the matching, the whole
corpus was split into fragments whenever the
frontier \n[^\w+] (i.e. a new-line followed by at
least one non-word character) was detected.
Shakespeare’s corpus was split into 109492
fragments, Goethe’s into 46597 fragments,
Cicero’s into 970 fragments, while the works of
Molière yielded 6639 fragments.

4 “I always tell you the same thing because it is
always the same thing and if it wasn't always the
same thing I would not have been telling you the
same thing.”
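
The fragment splitting described above can be sketched as follows, once more assuming Python's re module in place of PERL. The frontier is copied verbatim from the text. Note that inside a character class the + sign is a literal character, so as written the class matches a single character which is neither a word character nor a plus sign. The function name split_into_fragments and the sample string are hypothetical.

```python
import re

# Frontier copied verbatim from the text; inside the class the "+" is a
# literal, so the class excludes word characters and the plus sign.
FRONTIER = re.compile(r"\n[^\w+]")

def split_into_fragments(corpus):
    """Split a raw corpus at every frontier and drop empty fragments."""
    return [frag for frag in FRONTIER.split(corpus) if frag.strip()]

# Hypothetical miniature corpus: a wrapped line, a blank line, an indent.
sample = "Line one\ncontinues\n\nSecond para\n Third"
fragments = split_into_fragments(sample)
print(fragments)
```

A new-line followed by a word character, as in a wrapped verse line, does not start a new fragment, while a blank line or an indented line does.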

3 Results

In total, more than 7000 strings were matched by
4 PCRFs within 4 corpora containing 17
megabytes of text split into more than 163040
textual fragments.
         Anadiplosis{2}  Anaphora{2,3}  Antimetabole{abcXbca}  Epiphora{2,}
Cicero   0.00309         0.2711         0                      0.0144
Goethe   0.00242         0.0717         0.0003                 0.0042
Molière  0.01129         0.1634         0.000602               0.0210
Shkspr   0.00087         0.008          0.000219               0.008

Table 2: Relative frequencies of occurrence of diverse
PCRFs within diverse corpora (PCRF per fragment)

As is indicated in Table 2, instances
of anadiplosis, anaphora, antimetabole and
epiphora were found in all 4 corpora involved in
this study, the only exception being the absence
of antimetabole in Cicero. In general,
anaphora{2,3} seems to be the most frequent
one: the number of cases in which this PCRF
succeeded in matching greatly exceeds that of the
other three figures, especially in the case of the
Latin and Romance language authors - i.e. almost
every sixth fragment from Molière and every fourth
from Cicero was matched by anaphora{2,3}.
The only exception to this “dominance
of anaphora” seems to be Shakespeare, whose
complete works yielded exactly the same
frequency of epiphora and anaphora occurrences.
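
Reading a relative frequency as "one fragment in N" is simple arithmetic, sketched below in Python. The fragment counts are those given in section 2.2 and the two frequencies are the anaphora{2,3} values of Table 2. The variable names are illustrative only.

```python
# Fragment counts from section 2.2.
fragments = {"Shakespeare": 109492, "Goethe": 46597,
             "Cicero": 970, "Molière": 6639}

# anaphora{2,3} matches per fragment, taken from Table 2.
anaphora_freq = {"Cicero": 0.2711, "Molière": 0.1634}

# Total number of fragments over the four corpora.
total_fragments = sum(fragments.values())

# A frequency f corresponds to roughly one match per 1/f fragments.
one_in = {author: round(1 / f, 1) for author, f in anaphora_freq.items()}
print(total_fragments, one_in)
```

The values 3.7 and 6.1 correspond to the "every fourth" and "almost every sixth" fragment mentioned above, and the total of 163698 fragments agrees with the stated "more than 163040".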

         Anadiplosis{2}  Anaphora{2,3}  Antimetabole{abcXbca}  Epiphora{2,}
Cicero   20              1              4                      19
Goethe   44              3              33                     287
Molière  57              1              29                     65
Shkspr   7               2              17                     64

Table 3: Elapsed time (in seconds) of different
PCRF/corpus runs on average PC desktop

As is indicated in Table 3, the computational
demands of the PCRF-based approach are not high
in the case of anaphora{2,3}. On the contrary, the
recursive epiphora{2,} is much more demanding. As
the recursive structure of this PCRF indicates, the
matching time grows non-polynomially with the
length of the textual fragment to which the PCRF
is applied, and therefore the choice of the correct
fragment separator token (cf. 2.2) seems to be of
utmost importance.

4 Discussion

We propose a language-independent,
parser-free method of extracting instances of
rhetoric figures from natural language corpora by
means of PERL-compatible regular expressions.
The fact that PCREs implement features like
back-references or recursive matching makes
them good candidates for the detection and
extraction of rhetoric figures which cannot be
matched by simpler finite state automata or
context-free languages.
In order to demonstrate the feasibility of
such an approach, we have therefore “translated”
the canonical definitions of anadiplosis,
anaphora, antimetabole and epiphora into four
PERL-compatible rhetoric figures - namely
anadiplosis{2}, anaphora{2,3}, epiphora{2,} and
antimetabole{abcXbca} - and applied them to
Latin, English, French and German corpora. All
four PCRFs successfully matched some strings in
at least three of the four corpora, indicating that
repetition-based rhetoric figures can possibly
belong to the set of linguistic universals
(Greenberg, 1957). Anaphora{2,3} surpassed all
the other figures in frequency of occurrence,
the only exception being Shakespeare, in whose
case the number of matched epiphorae was equal
to the number of matched anaphorae.
We do not pretend that the PCRFs presented
hereby are the most adequate translations of
traditional anadiplosis, anaphora, antimetabole
or epiphora into an artificial language. Since
PCREs can contain quantifiers and classes, it is
evident that for any set of strings - which is in
our case the set F of all the occurrences of a given
figure within its respective corpus - more than
one possible regexp could be constructed in
order to match all members of the set F.
Therefore it may be the case that the PCRFs that we
have proposed in this “proof of concept” article
are neither the most specific ones nor the fastest
ones.
When it comes to specificity, a
closer look at the extracted data
indicates that the PCRFs proposed hereby have
produced some “false positives”, i.e. have
matched strings which are not rhetorical figures.
For example, the expression “FIRST LORD. O
my sweet lord” was matched by epiphora{2,}
when applied to Shakespeare's corpus, but is
definitely not a rhetoric figure, since the substring
in capital letters simply denotes the name of the
dramatic persona pronouncing the following
statement and not a clause of the statement
itself.
When it comes to speed, it is established
that matching regular expressions with an unbounded
number of back-references is NP-complete (Aho, 1991),
and this may well be the reason for the very high
run-times of the recursive epiphora{2,} in contrast to
its non-recursive PCRF counterparts. From a
practical point of view it therefore seems more
suitable - especially in the case of analysis of huge
corpora - to stick to non-recursive PCRFs. The
other possible way to speed up the
parsing - and in certain cases even to prevent the
machine from falling into an “infinite recursion loop” -
is the tuning of the “splitting parameter” so that the
corpus is split into fragments of such a size that
the NP-complexity of the matching PCRE does
not have observable implications for the real
run-time of the rhetoric figure detection process.
There are at least three different ways
in which PCRFs could be useful. Firstly,
since PCRFs are very fast and
language-independent, they can allow scholars to
extract huge numbers of instances of rhetoric
figures from diverse corpora in order to create an
exhaustive compendium of rhetoric figures. For
example, the corpus of >7000 strings which were
extracted from the corpora mentioned in this article
(downloadable from http://www.lutin-userlab.fr/rhetoric/)
could be easily put to use not only by
teachers of language or rhetoric, but possibly
also by those who aim to develop a
semi-supervised system of rhetoric figure induction
(cf. last paragraph). Manual annotation of such a
compendium and subsequent attempts at such a
figure of speech induction shall be presented in
our forthcoming article.
Secondly, the extracted information
concerning the quantities of various PCRFs
within different corpora could serve as an input
element (i.e. a feature) for classifying or
clustering algorithms. PCRFs could therefore
facilitate stylometric tasks like authorship
attribution, author name disambiguation or
maybe even plagiarism detection.
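
How PCRF counts might be turned into features can be sketched as follows. The sketch assumes Python's re module rather than PERL and uses only the two simplest patterns of section 2.1, with their outer capturing group removed. The function name pcrf_features, the dictionary PCRFS and the three sample fragments are hypothetical; the third fragment is invented so that it contains no figure.

```python
import re

FLAGS = re.S | re.I

# Two PCRFs of section 2.1 without their outer group (Python's re is an
# assumption of this sketch; the back-reference numbers shift by one).
PCRFS = {
    "anadiplosis{2}": re.compile(r"(\w{3,})[.?!,] \1", FLAGS),
    "antimetabole": re.compile(
        r"(\w{3,}) (.{0,23}) (\w{3,})[^.!?]{0,23} \3 \2 \1", FLAGS),
}

def pcrf_features(fragments):
    """Return the number of matches per fragment for every PCRF."""
    n = len(fragments)
    return {name: sum(1 for frag in fragments for _ in rx.finditer(frag)) / n
            for name, rx in PCRFS.items()}

fragments = [
    "Alle wie einer, einer wie alle.",
    "Quae maxime commendat reos. Reos autem appello.",
    "Nothing repeats here.",   # invented fragment without any figure
]
features = pcrf_features(fragments)
print(features)
```

The first fragment contributes to both features, because "einer, einer" also satisfies anadiplosis{2}. Overlaps of this kind have to be kept in mind when such vectors are handed to a classifier.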
Thirdly, due to their language
independence, the PCRFs presented hereby can be
thought of as a means for the evaluation of
differences between two different languages, or
two different states of the same language. One
can for example apply the PCRFs to two
different translations T1 and T2 and see that the
distribution of PCRFs within T1 is more similar
to the distribution of PCRFs in the original than
the distribution in T2. Therefore, one could
possibly state that from a rhetoric, stylistic or even
poetic standpoint, T1 is a more adequate
translation of the original text than T2. On the
other hand, when we speak about comparing two
different states of the same language, we
propose to perform PCRF-based analysis not
only upon a corpus representing the état de l'art
state of the language - like that of a Shakespeare,
for example - but also to compare such a state
with more initial states of language
development, as represented by the CHILDES
(MacWhinney & Snow, 1985) corpus.
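
Such a comparison of distributions can be sketched in a few lines of Python. Euclidean distance is an assumption of this sketch, since no similarity measure is prescribed here. As no translations were analysed, three rows of Table 2 serve merely as stand-in profiles, and the function names are hypothetical.

```python
import math

# Rows of Table 2 used as stand-in profiles, in the order anadiplosis,
# anaphora, antimetabole, epiphora. They are NOT translations.
PROFILES = {
    "Cicero":  (0.00309, 0.2711, 0.0,      0.0144),
    "Goethe":  (0.00242, 0.0717, 0.0003,   0.0042),
    "Molière": (0.01129, 0.1634, 0.000602, 0.0210),
}

def profile_distance(p, q):
    """Euclidean distance between two PCRF frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def closest_profile(original, candidates):
    """Name of the candidate whose profile is nearest to the original."""
    return min(candidates,
               key=lambda name: profile_distance(original, candidates[name]))

candidates = {name: PROFILES[name] for name in ("Goethe", "Molière")}
closest = closest_profile(PROFILES["Cicero"], candidates)
print(closest)
```

With real data, the original text would take the place of the Cicero row and the translations T1 and T2 the place of the two candidates. Since the anaphora column dominates the distance, some normalisation of the features may be advisable.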
Finally, by considering PCRFs to be a
method which could possibly be used as a tool for the
analysis of the development of language faculties
in a human baby, we come closer to a further and
somewhat “cognitive” application. This
application - which is the subject of our
current research - is based upon the belief that it is
not unreasonable to imagine that PCRFs could
be constructed not manually, but
automatically by means of the genetic programming
paradigm (Koza, 1992). Given the fact that the
PCRE language is one of the most concise
programming languages possible and
conceivable, and given the fact that 1) the speed
of execution, 2) the specificity and 3) the sensitivity
could serve as the input parameters of a
function evaluating the fitness of a possible
PCRF candidate, it is possible that the research
initiated by our current proposal could result in a
full-fledged and possibly unsupervised method
of rhetoric figure induction. In such a way
our PCRFs could become something a little bit
more than just another tool for the stylometric
analysis of textual corpora - in such a way they
could possibly help to answer a somewhat more
fundamental question: “What is the essence of
figures of speech and how could they be
represented within and by an artificial and/or
organic symbol-manipulating agent?”

Acknowledgments
The author wishes to express his gratitude to
University Paris 8 – St. Denis and Lutin Userlab
for support without which the research hereby
presented would not be possible, as well as to
thank the philologists and comparativists of École
Pratique des Hautes Études and ÉNS for keeping
alive the Tradition within which Language is
considered to be something more than just an
object of parsing and POS-tagging.

References

Aho, A. V. (1991). Algorithms for finding patterns in
strings. Handbook of theoretical computer science
(vol. A): algorithms and complexity. MIT Press,
Cambridge, MA.
Aristotle. (1954). Rhetoric. 1355b.
Crane, G. (1998). The Perseus Project and Beyond:
How Building a Digital Library Challenges the
Humanities and Technology. D-Lib Magazine, 1,
18.
Crosswhite, J., Fox, J., Reed, C., Scaltsas, T., &
Stumpf, S. (2003). Computational models of
rhetorical argument. Argumentation Machines:
New Frontiers in Argument and Computation,
175–209.
Dubois, J. (1970). Rhétorique générale: Par le
groupe μ. Larousse.
Friedl, J. (2006). Mastering regular expressions.
OʼReilly Media, Inc. Sebastopol, CA, USA.
Gawryjolek, J. (2009). Automated annotation and
visualization of rhetorical figures.
Grasso, F. (2002). Towards computational rhetoric.
Informal Logic, 22(3).
Greenberg, J. H. (1957). The nature and uses of
linguistic typologies. International Journal of
American Linguistics, 23(2), 68–77.
Harris, R., & DiMarco, C. (2009). Constructing a
Rhetorical Figuration Ontology. Persuasive
Technology and Digital Behaviour Intervention
Symposium.
Hart, M. (2000). Project Gutenberg. Project
Gutenberg.
Koza, J. R. (1992). Genetic programming: on the
programming of computers by means of natural
selection. The MIT Press.
MacWhinney, B., & Snow, C. (1985). The child
language data exchange system. Journal of Child
Language, 12(02), 271–295.
Perelman, C., & Olbrechts-Tyteca, L. (1969). The
new rhetoric: A treatise on argumentation.
Plato. (1986). Phaedrus. 261e.
Solan, Z., Horn, D., Ruppin, E., & Edelman, S.
(2005). Unsupervised learning of natural
languages. Proceedings of the National Academy
of Sciences, 102(33), 11629.
Taboada, M., Mann, W. C., & Back, L. (2006).
Rhetorical Structure Theory. Citeseer.
Wall, L., & Loukides, M. (2000). Programming Perl.
OʼReilly Media, Inc. Sebastopol, CA, USA.