## On the use of Thisted & Efron’s technique to determine authorship

#### with Robert Matthews

(1986-87?) previously unpublished; © The Estate of Eric Sams

*The present text must be regarded as an unfinished and incomplete draft of a never published essay, included here for the sake of completeness. Dr. Robert Matthews wrote to me that "The Thisted-Efron paper was never published, as it contained major errors; the Thisted-Efron method itself is now regarded with suspicion by many." [EB]*

**Abstract**

We review the statistical test of authorship based on rare word usage proposed recently by Ronald Thisted and Bradley Efron, and in particular its use in determining the author of the Elizabethan poem *Shall I die?*. We show that despite its theoretical sophistication, the technique is susceptible to a substantial number of systematic errors, which have the power to invalidate any conclusions based upon its use. Considerable caution should be exercised in the application of the technique, particularly in its computerised form.

**Introduction**

Researchers seeking to establish the authorship of an anonymous work frequently turn to statistical techniques in an effort to inject some objectivity into this notoriously subjective problem. The reduction of a work of literature to long tables of numbers which supposedly measure some or other aspect of style has not, however, been greeted with much enthusiasm by some literary scholars.

This may partly be ascribed to the suspicion many of them hold that such techniques, though couched in hard mathematical form, are in fact a good deal less objective than the equations would have one believe.

Recently, two statisticians, Ronald Thisted of Chicago University and Bradley Efron of Stanford University, proposed a new authorship determination technique based on rare word frequency. It is based on research they carried out in an earlier paper (henceforth referred to as T&E I) on estimating the number of words Shakespeare knew but never used.

Explained in detail in Thisted and Efron's 1986 paper (henceforth T&E II), the rare word technique has attracted considerable interest among those investigating the possible Shakespearean origins of certain anonymous works.

It has even been the subject of articles in the national media, following its application in that paper to the anonymous 429-word Elizabethan poem which starts with the words "Shall I die?". Thisted and Efron concluded that the rare word usage in the poem was "consistent with the hypothesis of Shakespearean authorship".

Mathematically, the technique is based on an empirical Bayesian statistics model that leads to a formula capable of predicting the frequency of rare word usage expected if the text under study is a work of a particular author.

In particular, the technique gives an estimate of the number of words never before seen in the author's work that should appear in the text, and a prediction of the number of words seen once before, twice before etc in the known works that should also appear in the text.

Thisted and Efron provided apparently impressive support for their technique by showing that it appears to have the power to discriminate between works known to be of Shakespearean and non‑Shakespearean origin. In T&E II, they show that four poems in the established Shakespearean canon passed the tests, while poems by Jonson, Marlowe and Donne all failed.

At first sight, it seems that the Thisted-Efron technique is a powerful addition to the armoury of literary detection. As its inventors point out, "unusual" words are very common in Shakespeare. Two-thirds of the 31,534 distinct words in the canon occur no more than three times.

However, as we shall now show, the technique is prey to a considerable number of vagaries that seriously undermine its power. These range from the effects of mathematical devices which have to be employed to get sensible results out of the equations, to our lack of knowledge of how Shakespeare's vocabulary usage changed with time.

We begin with a brief description of the basis of the Thisted-Efron technique, and how it may be used on a specific text. We then look in detail at the systematic errors to which the technique is susceptible, and estimate their likely effect on the conclusions drawn using the technique. We conclude with a re-examination of the use of the technique to determine the authorship of the poem *Shall I die?*

**The basis of the Thisted-Efron technique**

The Thisted-Efron technique is an extension of a famous statistical investigation by the renowned statistician R. A. Fisher into the number of animal species that exist in nature as yet unseen. Using the notation of T&E II and their hunting analogy, let n_{x} be the number of word-types seen exactly x times after reading the whole of an author's canon. Assuming that the "trapping" of the word‑types is a Poisson process, it can then be shown that the expected number of words seen exactly x times in the canon that will be picked up by scouring a text of Y words is given by

$$v_x = \sum_{k=1}^{\infty} (-1)^{k+1} \binom{x+k}{k}\, n_{x+k}\, t^{k} \qquad (1)$$

where t is Y divided by N, the number of words in the existing canon. According to Spevack, N for Shakespeare is 884647.

Equation (1) is a series which converges well for any sensible values of t; in particular, no terms beyond t^{2} have to be considered for any play or poem that is likely to require testing for Shakespearean authorship.

The series enables the number of words occurring with frequency x in an author's known canon, denoted by n_{x}, to be used to calculate similar quantities for a text of length Nt. Note especially the subscripts on the quantities v_{x} and n_{x+k}, and the effects of the summation. They imply that it is possible to predict the number of words never before seen in a work by the author, v_{0}, that should crop up in a new work of length Nt. In addition, by inserting the values of n_{x+k} from the concordance into (1), it is possible to build up a table of rare word frequencies from x = 0 (never before seen) to x = 99, say (seen 99 times in the canon).
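As a rough illustration of how such a table entry might be computed, the following Python sketch evaluates a truncated version of the series. It assumes the alternating binomial form v_x = Σ_k (−1)^{k+1} C(x+k, k) n_{x+k} t^k; the function name `predicted_count`, the dictionary layout and the cut-off `k_max` are illustrative choices of ours, and the only counts used are the values of n_{1} and n_{2} quoted elsewhere in this essay.

```python
from math import comb

def predicted_count(n, x, t, k_max=2):
    """Truncated series for the expected number of word-types that were seen
    exactly x times in the canon and turn up in a new text of relative
    length t. `n` maps a canon frequency to the number of word-types with
    that frequency; frequencies missing from `n` are treated as zero."""
    total = 0.0
    for k in range(1, k_max + 1):
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(x + k, k) * n.get(x + k, 0) * t ** k
    return total

# Counts quoted in this essay from the table in T&E I.
canon_counts = {1: 14376, 2: 4343}
N = 884647        # canon size in words, as used in the worked examples
t = 429 / N       # relative length of a 429-word poem

v0 = predicted_count(canon_counts, x=0, t=t)
print(round(v0, 2))   # about 6.97 new words
```

For a text this short the t^{2} term changes the result by roughly one part in seven thousand, which is why truncation at k = 2 is harmless here.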

Thisted and Efron use (1) to generate three tests. First, by adding up all the values of v_{x} from, say, 1 to 99, it can be used to carry out a comparison of theoretical and observed values of the total number of words used 99 times or fewer. This they regard as their least powerful test, and we shall consider it no further.

The second test is a comparison of the number of new words predicted to occur in the text under study, given by v_{0} in (1), and the number actually seen.

The third test is a comparison of the rare word usage predicted if the work is by a specific author, and the observed usage. As already noted, Shakespeare was a prodigious user of unusual words, so one would expect any text written by him to show a marked increase in word numbers as x tends to zero.

This last comparison, which Thisted and Efron regard as the best discriminator of Shakespearean authorship, can be made quantitative by carrying out a regression analysis. Let m_{x} be the number of distinct words in a text of length Nt that occur exactly x times in the canon, and let v_{x} be the corresponding predicted value. Then, if the m_{x} have independent Poisson distributions with means μ_{x}, they can be compared to their predicted counterparts v_{x} via the relationship given in T&E II:

$$\mu_x = v_x\, e^{\beta_0}\, (x+1)^{\beta} \qquad (2)$$

where β_{0} is a constant. So, a regression analysis of ln(μ_{x}/v_{x}) against ln(x+1) makes β equal to the gradient of the resulting linear relationship. If theoretical prediction of word frequencies matches observation (i.e. the text under test exhibits the same word usage as the author's known canon of works), then β will be zero. This is the basis of what Thisted and Efron call the "slope test".
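The regression just described can be sketched in a few lines of Python. This is a minimal illustration assuming ordinary, unweighted least squares, which may differ from the fitting procedure actually used in T&E II; the function name `slope` and all the numbers are invented for the example, and the observed counts stand in for the unknown means μ_{x}.

```python
from math import log

def slope(observed, predicted):
    """Ordinary least-squares gradient of ln(observed_x / predicted_x)
    against ln(x + 1), where the list index is x."""
    xs = [log(x + 1) for x in range(len(observed))]
    ys = [log(m / v) for m, v in zip(observed, predicted)]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    return s_xy / s_xx

# Invented predictions for x = 0..4, for illustration only.
predicted = [7.0, 4.2, 3.3, 2.8, 2.5]

# A text that matches the predictions exactly gives a gradient of zero.
beta_same = slope(predicted, predicted)

# A text whose counts fall away faster than predicted gives a negative gradient.
steeper = [v * (x + 1) ** -0.5 for x, v in enumerate(predicted)]
beta_steeper = slope(steeper, predicted)
```

In the second case the counts were constructed with an exponent of -0.5, and the fitted gradient recovers that value.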

In their 1986 paper, Thisted and Efron use (1) and (2) to show that *Shall I die?* has a new word count and gradient statistically similar to that expected on the basis that the poem is a work of Shakespeare. To give an indication of how their technique was applied, consider the prediction of the number of new words that should appear in a poem of this length, were it Shakespearean.

The value of the subscript x is zero in this case, corresponding to words which have never appeared in Shakespeare before. The version of *Shall I die?* analysed in T&E II contained 429 words, so that t = 0.0004849. Expanding the series given in (1), we see that we need only the value of n_{1} from the concordance. T&E I contains a table of the values of n_{x} drawn primarily from Spevack's concordance which shows that n_{1} = 14376. The predicted number of new words is therefore

$$v_0 \approx n_1 t = 14376 \times 0.0004849 \approx 6.97$$

The actual observed value is 9 in the version of *Shall I die?* used in T&E II, and 10 in the Oxford version.

The same paper shows that one of the similar length works by the other Elizabethan authors mentioned above failed this new words test, while all failed the more complex slope test.

In all cases a measure of the statistical significance of the results was given. However, we contend that the conclusions drawn on the basis of any of the three tests are subject to systematic errors which far outweigh the effect of the random error indicated by the various significance parameters used by Thisted and Efron.

We now consider the sources of these systematic errors, and their effects on the tests.

**The sources of systematic errors**

Any literary detection technique designed to help in studies of authorship must be based on comparisons with proven works of a specific author. It is obvious that any such technique must compare like with like.

Comparative techniques such as that of Thisted and Efron inevitably have to rely on a concordance giving word counts for different works in a canon. Immediately, we run into problems with investigations of suspected works of Shakespeare. Almost four centuries after his death, scholars are still arguing about which works are truly Shakespearean, which works are collaborations, and which are simply interlopers.

For example, the recent Oxford edition [fn: William Shakespeare: *The Complete Works*; General editors, Stanley Wells and Gary Taylor (Oxford University Press, 1986)] of the complete works denies or disputes Shakespeare's sole authorship of 1-3 *Henry VI*, *Titus Andronicus*, *Henry VIII*, *Macbeth*, *Timon of Athens*, *The Taming of the Shrew* and *Pericles*. All these are included in Spevack's concordance, on which the computerised version of Thisted and Efron's technique relies.

There also lie more subtle traps for the unwary user of concordances. How reliable a guide to the author's original word usage are such compilations?

Spevack's concordance, to take the work most relevant to our research, is in fact based not on a direct word-count drawn from the earliest folios of Shakespeare, but on a compilation based on word-counts of various editions of the works collated by Blakemore Evans [fn: *The Riverside Shakespeare*, edited by G. B. Evans (Boston, Mass., 1974)]. Such crucial stylistic determinants as spelling and hyphenation thus stand at one remove, at least, from the original.

Spevack further fails to discriminate between homographs. Thus the adjectival and verbal meanings of the word *bare* are brought together within the category of words occurring 59 times. Furthermore, this word occurs uniquely as a noun in *Shall I die?*, a fact completely ignored in Thisted and Efron's analysis of the poem.

There are thousands of such homographs in Spevack. There are thousands more cases of words recorded in Spevack as being editorial emendations of the basic copy-texts.

Spevack's treatment of voiced or unvoiced *–ed* word endings is far from clear. Indeed, Thisted and Efron appear to have completely misunderstood it in making their own classifications for their analysis of *Shall I die?*. For example, they have included *passed* among the words which occur three times in the canon. But that form is in fact Spevack's convention for recording the disyllable *passed*. The word in line 37 of *Shall I die?* is the monosyllable *pass'd*, as demonstrated by its scansion and its rhyme with *last*. Its appearance in the poem should therefore be recorded in the category of words which occur 45 times, not 3, in the canon.

If the Thisted and Efron technique is to be applied to plays rather than poems, the question of whether or not to include stage directions (and the names of characters) has also to be addressed.

Finally, the technique is based on the assumption that Shakespeare's style, as measured by word usage, stays constant over time. But Spevack at least gives no indication of the truth of this assumption. For this he can hardly be blamed: we cannot put undisputed dates to all the works in any case.

One might try to estimate whether the slope test gives an upper or lower bound on word usage by arguing that later works draw on a larger vocabulary accumulated by Shakespeare over time. But can anyone assert this with confidence?

So far, we have considered what might be termed literary sources of systematic error. They affect the values of the observed quantities n_{x} and μ_{x} and the calculated quantity v_{x} on which the technique depends. However, there are also sources of error which derive from mathematical features of Thisted and Efron's approach.

For small works, some observed word frequencies μ_{x} are likely to be zero. The Oxford Shakespeare edition of *Shall I die?* in fact records no fewer than 37 zero values of μ_{x} for values of x from 0 to 99. Equation (2) shows that the regression analysis requires the calculation of the natural logarithm of the ratio of μ_{x} and v_{x}. But this quantity becomes divergent for zero values of μ_{x} (or v_{x}). Thisted and Efron give no indication of how they tackled this problem. It seems certain, however, that any solution must undermine to some extent the validity of the comparison.
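To see why the treatment of zero counts matters, consider the following Python sketch. It compares two ad hoc remedies on invented data: discarding every x with a zero observed count, and adding a constant of 0.5 to every observed count. Neither is known to be what Thisted and Efron did; the counts, the constant and the function name `slope` are assumptions made purely for illustration, and the fit is an ordinary unweighted least-squares one.

```python
from math import log

def slope(points):
    """Ordinary least-squares gradient through a list of (X, Y) pairs."""
    n = len(points)
    x_bar = sum(px for px, _ in points) / n
    y_bar = sum(py for _, py in points) / n
    s_xy = sum((px - x_bar) * (py - y_bar) for px, py in points)
    s_xx = sum((px - x_bar) ** 2 for px, _ in points)
    return s_xy / s_xx

# Invented observed and predicted counts for x = 0..5; two observations are zero.
observed = [9, 5, 0, 3, 0, 2]
predicted = [7.0, 4.2, 3.3, 2.8, 2.5, 2.2]
pairs = list(enumerate(zip(observed, predicted)))

# Remedy A: discard every x whose observed count is zero.
dropped = [(log(x + 1), log(m / v)) for x, (m, v) in pairs if m > 0]

# Remedy B: add an arbitrary constant (here 0.5) to every observed count.
offset = [(log(x + 1), log((m + 0.5) / v)) for x, (m, v) in pairs]

beta_dropped = slope(dropped)
beta_offset = slope(offset)
```

On these toy figures the two remedies give gradients that differ by several tenths, so the choice of remedy can by itself move the statistic on which the slope test depends.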

Thisted and Efron explicitly tackle another statistical affliction of their technique: the instability in the values of v_{x} for x greater than about 15. They adopt the device of a "local linear smoother" which has the effect of ironing out the fluctuations in the v_{x} arising from the instability of the n_{x} from which they were calculated via (1). The result is a more or less orderly decrease in their value as x increases.

One must question the validity of any comparison based on such smoothed data. Detailed calculations using equation (1) and the table of smoothed v_{x} cited in T&E II show that almost 1 in 7 of the v_{x} differ from their "raw" values by at least 30 per cent.

Of course, one could simply counter all of the above by saying that the effects of the deficiencies we raise are small. We shall therefore now show mathematically that, in fact, they are likely to considerably weaken the conclusions drawn on the basis of the Thisted and Efron technique, especially the slope test.

**Quantitative effects of the systematic errors**

From the above discussion, we propose that there are significant systematic errors in the two observational parameters used by the Thisted and Efron technique, that is, n_{x} and μ_{x}, and in the v_{x} through the smoothing process used to damp down statistical fluctuations.

What is the quantitative effect of such errors on the tests derived by Thisted and Efron, namely, the new words test and the slope test?

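Consider first the new words test. By (1), with x = 0, we find that

$$v_0 = \sum_{k=1}^{\infty} (-1)^{k+1}\, n_k\, t^{k} \qquad (3)$$

By partial differentiation we find that the error in v_{0}, δv_{0}, arising from errors in t and n_{x} is given by

$$\delta v_0 = \frac{\partial v_0}{\partial t}\,\delta t + \sum_{k=1}^{\infty} \frac{\partial v_0}{\partial n_k}\,\delta n_k \qquad (4)$$

Carrying out the partial differentiation, we then find that

$$\frac{\partial v_0}{\partial t} = \sum_{k=1}^{\infty} (-1)^{k+1}\, k\, n_k\, t^{k-1}, \qquad \frac{\partial v_0}{\partial n_k} = (-1)^{k+1}\, t^{k}$$

So that by (4)

$$\delta v_0 = \sum_{k=1}^{\infty} (-1)^{k+1} \left( k\, n_k\, t^{k-1}\,\delta t + t^{k}\,\delta n_k \right)$$

For all cases of interest, t is sufficiently small for the summation to be terminated at k = 2. Ignoring the similarly small term δn_{2}t^{2} and substituting for n_{1} and n_{2} from the table in T&E I (14376 and 4343 words respectively), we finally arrive at the equation

$$\delta v_0 \approx (14376 - 8686\,t)\,\delta t + t\,\delta n_1 \qquad (5)$$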

Now t can be affected by changes in the works considered to be Shakespearean that are included in the concordance, as this will change N, the total number of words used by Shakespeare, and by editorial emendations to the text under study, which will change the total number of words in the text, W. So, with t = W/N, we can write

$$\frac{\delta t}{t} = \frac{\delta W}{W} - \frac{\delta N}{N} \qquad (6)$$

Let us now calculate the error in the number of new words predicted to occur in the case of an anonymous play of 15,000 words. We have t = 15000/884647 = 0.017. Expanding (3) to the k = 2 term we then find v_{0} = n_{1}t - n_{2}t^{2} ≈ 243; that is, about 243 words never before seen in Shakespeare should appear in the play.

Now, suppose (conservatively) that 3 works of the same length which are included in the concordance used by Thisted and Efron are not, in fact, by Shakespeare. Then δN/N ≈ -0.05, i.e. 5 per cent of the current canon is wrongly attributed to him. Taking the words used only once in Shakespeare, i.e. those in category n_{1}, to be roughly evenly distributed among the plays, we can similarly estimate δn_{1} at -0.05 × 14376, i.e. about 719 words fewer.

Now consider that editorial emendation cuts the number of words in the text by 1 in 100 (this is representative of the change in the total number of words in *Shall I die?* between the version in T&E II and the Oxford version). Then δW/W = -0.01, and so by (6) δt/t ≈ 0.04, leading to δt ≈ 0.0007.

Using all these values in (5) to estimate the total error in v_{0}, and taking the two contributions to add in magnitude, we find that δv_{0} ≈ 22 words, giving a percentage error of about 10 per cent.
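The arithmetic of this worked example can be checked with a few lines of Python. The sketch uses only the figures quoted in the text, with t rounded to 0.017 as above; the variable names are ours, and reporting both a signed sum and a sum of magnitudes is an assumption about how the two contributions are meant to be combined.

```python
n1, n2 = 14376, 4343   # word-types seen once and twice in the canon (T&E I)
t = 0.017              # 15,000-word play relative to the canon, rounded as in the text

# Predicted number of new words, truncating the series at k = 2.
v0 = n1 * t - n2 * t ** 2

# Fractional errors assumed in the text.
dN_over_N = -0.05      # 5 per cent of the canon wrongly attributed
dW_over_W = -0.01      # 1 word in 100 removed by emendation
dn1 = dN_over_N * n1   # change in n1, about -719

# With t = W/N, the fractional errors combine as dt/t = dW/W - dN/N.
dt = (dW_over_W - dN_over_N) * t

# First-order contributions to the error in v0 (the small dn2 * t**2 term is ignored).
from_t = (n1 - 2 * n2 * t) * dt
from_n1 = t * dn1

signed_total = from_t + from_n1
worst_case = abs(from_t) + abs(from_n1)
relative_error = worst_case / v0
```

On these figures the two contributions have opposite signs, so the signed total amounts to only a few words, whereas the sum of magnitudes comes to about 22 words, or roughly 9 per cent of the predicted 243.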

Because it relies on relatively few data, the new words test has quite a small, and quite possibly an acceptable error. The more complex slope test, which Thisted and Efron believe to be the most powerful discriminator of authorship, is, however, far more susceptible to systematic error from the sources outlined above.

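It will be recalled that the key parameter for the slope test is the gradient of the regression line given by taking natural logarithms of both sides of (2). This means that β is found from the standard least-squares regression equations

$$\beta = \frac{\sum_x (X_x - \bar{X})(Y_x - \bar{Y})}{\sum_x (X_x - \bar{X})^2}, \qquad X_x = \ln(x+1), \quad Y_x = \ln(\mu_x / v_x) \qquad (6a)$$

where the bars denote means over the values of x included in the fit.

All the terms containing only x can be calculated explicitly, so that, after some reduction, and writing ln(μ_{x}/v_{x}) = ln R_{x}, we obtain

$$\beta = \sum_x w_x \ln R_x, \qquad w_x = \frac{\ln(x+1) - \bar{X}}{\sum_j \left( \ln(j+1) - \bar{X} \right)^2} \qquad (7)$$

We can now calculate the error in β arising from an error in R_{x} from δβ = (dβ/dR_{x})δR_{x}, so that

$$\delta\beta = \sum_x w_x\, \frac{\delta R_x}{R_x} \qquad (8)$$

where

$$\frac{\delta R_x}{R_x} = \frac{\delta \mu_x}{\mu_x} - \frac{\delta v_x}{v_x} \qquad (9)$$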
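The first-order error propagation described here can be sketched in Python as follows. The sketch assumes an unweighted least-squares fit of ln R_{x} against ln(x+1) over x = 0 to 99, so that the gradient is a weighted sum of the ln R_{x}; the function names and the 10 per cent error pattern are invented for illustration and are not taken from T&E II.

```python
from math import log

def weights(n_points):
    """Least-squares weights w_x such that the gradient of Y against
    ln(x + 1) equals sum(w_x * Y_x) for x = 0 .. n_points - 1."""
    xs = [log(x + 1) for x in range(n_points)]
    x_bar = sum(xs) / n_points
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    return [(a - x_bar) / s_xx for a in xs]

def slope_error(rel_err_mu, rel_err_v):
    """First-order change in the gradient caused by proportional errors in
    the observed counts (rel_err_mu) and the predicted counts (rel_err_v)."""
    w = weights(len(rel_err_mu))
    return sum(wx * (dm - dv) for wx, dm, dv in zip(w, rel_err_mu, rel_err_v))

n_points = 100

# Invented example: predictions for x >= 15 are all 10 per cent too high,
# while the observed counts are taken as exact.
rel_err_v = [0.10 if x >= 15 else 0.0 for x in range(n_points)]
rel_err_mu = [0.0] * n_points
d_beta = slope_error(rel_err_mu, rel_err_v)

# A uniform proportional error at every x shifts the intercept only.
d_beta_uniform = slope_error([0.0] * n_points, [0.10] * n_points)
```

Because the weights sum to zero, an error that is the same at every x leaves the gradient untouched; only errors that vary with x, such as those confined to the smoothed range, feed through to β.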

We now use these last two equations to investigate the effects on the slope test of two sources of systematic error. First, we consider the "linear smoother" adopted by Thisted and Efron to overcome the statistical fluctuations in the calculated v_{x}.

In T&E II, the smoothing process is introduced for all values of x above 14. This means that the effect it has on β is to be found by taking the summation in (8) from 15 to 99. The series in (1) was used to calculate the unsmoothed values of v_{x} for *Shall I die?*, and the smoothed values given in T&E II were used to calculate the proportional error δv_{x}/v_{x} for 14 < x ≤ 99. The error in the gradient arising solely from the smoothing of the v_{x} was then calculated from (8), i.e. δμ_{x}/μ_{x} in (9) was set at zero.

It was found that the smoothing process alone leads to a δβ of +0.062, which is no less than 82 per cent of the value of β found for the poem in T&E II.

It should be remembered that we have only included here the changes in the v_{x} resulting from the smoothing. As we saw in the calculation of the error in the value of v_{0}, these quantities are subject to systematic error in the value of t and n_{x} as well.

The error in β becomes larger still if we include the effects of editorial emendations between the version of the poem in T&E II and the Oxford Shakespeare. It turns out that no fewer than 14 of the 100 values of μ_{x} used in the calculation of β for the slope test are changed between the two versions.

This time setting δv_{x}/v_{x} = 0 in (9) and calculating the error in β arising from the μ_{x} alone, we find an error of +0.039, or 51 per cent of the value of β calculated in T&E II.

**Conclusions**

The above calculations suggest that the new words test proposed by Thisted and Efron may be sufficiently robust against such effects to prove valuable in literary investigations. However, the potentially more powerful slope test seems to us to rely too greatly on quantities which cannot, at least as yet, be derived with confidence from a concordance.

The results of statistical tests are invariably accompanied by some parameter, be it a probability or a value of χ^{2}, which gives a measure of their significance. The Thisted and Efron technique of investigating authorship on the basis of rare word usage is such a test. What we hope we have shown is that the effect of systematic errors should not, indeed cannot, be neglected in any statistical investigation, least of all in the field of literary detection.