On the use of Thisted & Efron’s technique to determine authorship

with Robert Matthews

(1986-87?) previously unpublished; © The Estate of Eric Sams

The present text must be regarded as an unfinished and incomplete draft of a never published essay, included here for the sake of completeness. Dr. Robert Matthews wrote to me that "The Thisted-Efron paper was never published, as it contained major errors; the Thisted-Efron method itself is now regarded with suspicion by many." [EB]

Abstract

We review the statistical test of authorship based on rare word usage proposed recently by Ronald Thisted and Bradley Efron, and in particular its use in determining the author of the Elizabethan poem Shall I die?. We show that despite its theoretical sophistication, the technique is susceptible to a substantial number of systematic errors, which have the power to invalidate any conclusions based upon its use. Considerable caution should be exercised in the application of the technique, particularly in its computerised form.

Introduction

Researchers seeking to establish the authorship of an anonymous work frequently turn to statistical techniques in an effort to inject some objectivity into this notoriously subjective problem. The reduction of a work of literature to long tables of numbers which supposedly measure some or other aspect of style has not, however, been greeted with much enthusiasm by some literary scholars.

This may partly be ascribed to the suspicion many of them hold that such techniques, though couched in hard mathematical form, are in fact a good deal less objective than the equations would have one believe.

Recently, two statisticians, Ronald Thisted of the University of Chicago and Bradley Efron of Stanford University, proposed a new authorship determination technique based on rare word frequency. It builds on research reported in an earlier paper (henceforth referred to as T&E I) on estimating the number of words Shakespeare knew but never used.

Explained in detail in Thisted and Efron's 1986 paper (henceforth T&E II), the rare word technique has attracted considerable interest among those investigating the possible Shakespearean origins of certain anonymous works.

It has even been the subject of articles in the national media, following its application in that paper to the anonymous 429-word Elizabethan poem which starts with the words "Shall I die?". Thisted and Efron concluded that the rare word usage in the poem was "consistent with the hypothesis of Shakespearean authorship".

Mathematically, the technique is based on an empirical Bayesian statistics model that leads to a formula capable of predicting the frequency of rare word usage expected if the text under study is a work of a particular author.

In particular, the technique gives an estimate of the number of words never before seen in the author's work that should appear in the text, and a prediction of the number of words seen once before, twice before etc in the known works that should also appear in the text.

Thisted and Efron provided apparently impressive support for their technique by showing that it appears to have the power to discriminate between works known to be of Shakespearean and non‑Shakespearean origin. In T&E II, they show that four poems in the established Shakespearean canon passed the tests, while poems by Jonson, Marlowe and Donne all failed.

At first sight, it seems that the Thisted-Efron technique is a powerful addition to the armoury of literary detection. As its inventors point out, "unusual" words are very common in Shakespeare. Two-thirds of the 31,534 distinct words in the canon occur no more than three times.

However, as we shall now show, the technique is prey to a considerable number of vagaries that seriously undermine its power. These range from the effects of mathematical devices which have to be employed to get sensible results out of the equations, to our lack of knowledge of how Shakespeare's vocabulary usage changed with time.

We begin with a brief description of the basis of the Thisted-Efron technique, and how it may be used on a specific text. We then look in detail at the systematic errors to which the technique is susceptible, and estimate their likely effect on the conclusions drawn using the technique. We conclude with a re-examination of the use of the technique to determine the authorship of the poem Shall I die?

The basis of the Thisted-Efron technique

The Thisted-Efron technique is an extension of a famous statistical investigation by the renowned statistician R. A. Fisher into the number of animal species that exist in nature as yet unseen. Using the notation of T&E II and their hunting analogy, let n_x be the number of word-types seen exactly x times after reading the whole of an author's canon. Assuming that the "trapping" of the word‑types is a Poisson process, it can then be shown that the expected number of words seen exactly x times in the canon that will be picked up by scouring a text of Y words is given by

$$ v_x = \sum_{k=1}^{\infty} (-1)^{k+1} \binom{x+k}{k} \, t^{k} \, n_{x+k} \qquad (1) $$

where t is Y divided by N, the number of words in the existing canon. According to Spevack, N for Shakespeare is 884647.

Equation (1) is a series which converges well for any sensible values of t; in particular, no terms beyond t² have to be considered for any play or poem that is likely to require testing for Shakespearean authorship.

The series enables the number of words occurring with frequency x in an author's known canon, denoted by n_x, to be used to calculate similar quantities for a text of length Nt. Note especially the subscripts on the quantities v_x and n_{x+k}, and the effects of the summation. They imply that it is possible to predict the number of words never before seen in a work by the author, v₀, that should crop up in a new work of length Nt. In addition, by inserting the values of n_{x+k} from the concordance into (1), it is possible to build up a table of rare word frequencies from x = 0 (never before seen), to x = 99, say, (seen 99 times in the canon).

Thisted and Efron use (1) to generate three tests. First, by adding up all the values of v_x from, say, 1 to 99, it can be used to carry out a comparison of theoretical and observed values of the total number of words used 99 times or fewer. This they regard as their least powerful test, and we shall consider it no further.

The second test is a comparison of the number of new words predicted to occur in the text under study, given by v₀ in (1), and the number actually seen.

The third test is a comparison of the rare word usage predicted if the work is by a specific author, and the observed usage. As already noted, Shakespeare was a prodigious user of unusual words, so one would expect any text written by him to show a marked increase in word numbers as x tends to zero.

This last comparison, which Thisted and Efron regard as the best discriminator of Shakespearean authorship, can be made quantitative by carrying out a regression analysis. Let m_x be the number of distinct words in a text of length Nt that appear exactly x times in the canon, and let v_x be the corresponding predicted value. Then, if the m_x have independent Poisson distributions with means μ_x, they can be compared to their predicted counterparts v_x via the relationship given in T&E II:

$$ \mu_x = v_x \, e^{\beta_0} \, (x+1)^{\beta} \qquad (2) $$

where β₀ is a constant. So, a regression analysis of ln(μ_x/v_x) against ln(x+1) makes β equal to the gradient of the resulting linear relationship. If the theoretical prediction of word frequencies matches observation (ie the text under test exhibits the same word usage as the author's known canon of works), then β will be zero. This is the basis of what Thisted and Efron call the "slope test".
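The regression just described can be sketched in a few lines. This is a minimal illustration only: it assumes an ordinary least-squares fit of ln(m_x/v_x) on ln(x+1), the function name `slope_estimate` is hypothetical, the numbers are invented, and empty categories are simply skipped, which is an assumption rather than anything stated in T&E II.

```python
import math

def slope_estimate(m, v):
    """Least-squares slope of ln(m_x / v_x) regressed on ln(x + 1).

    m and v are equal-length sequences indexed by x = 0, 1, 2, ...
    Categories with m_x == 0 or v_x <= 0 are skipped (an assumption:
    the logarithm is undefined there).
    """
    pts = [(math.log(x + 1), math.log(mx / vx))
           for x, (mx, vx) in enumerate(zip(m, v))
           if mx > 0 and vx > 0]
    n = len(pts)
    sx = sum(p[0] for p in pts)
    sy = sum(p[1] for p in pts)
    sxx = sum(p[0] ** 2 for p in pts)
    sxy = sum(p[0] * p[1] for p in pts)
    return (n * sxy - sx * sy) / (n * sxx - sx ** 2)

# Invented predictions v_x for x = 0..5 (not taken from any concordance).
v = [7.0, 4.2, 3.3, 2.8, 2.5, 2.2]

# Observed counts equal to the predictions: the slope should vanish.
same_usage = slope_estimate(v, v)

# Observed counts tilted by (x + 1) ** -0.3: the slope should be about -0.3.
tilted = slope_estimate([vx * (x + 1) ** -0.3 for x, vx in enumerate(v)], v)

print(same_usage, tilted)
```

A text whose rare word usage matches the canon gives a slope near zero, while a systematic deficit or surplus of the rarest words tilts the fitted line away from zero.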

In their 1986 paper, Thisted and Efron use (1) and (2) to show that Shall I die? has a new word count and gradient statistically similar to that expected on the basis that the poem is a work of Shakespeare. To give an indication of how their technique was applied, consider the prediction of the number of new words that should appear in a poem of this length, were it Shakespearean.

The value of the subscript x is zero in this case, corresponding to words which have never appeared in Shakespeare before. The version of Shall I die? analysed in T&E II contained 429 words, so that t = 429/884647 ≈ 0.0004849. Expanding the series given in (1), we see that we need only the value of n₁ from the concordance, since the term in t² is negligible. T&E I contains a table of the values of n_x, drawn primarily from Spevack's concordance, which shows that n₁ = 14376. The predicted number of new words is therefore

$$ v_0 \approx n_1 t = 14376 \times 0.0004849 \approx 6.97 $$

The actual observed value is 9 in the version of Shall I die? used in T&E II, and 10 in the Oxford version.
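The prediction for the poem can be checked numerically. The sketch below assumes the alternating binomial series for v_x, that is, v_x = Σ (-1)^(k+1) C(x+k, k) t^k n_{x+k}; the function name `predicted_count` and the dictionary layout are illustrative choices, and only the two counts n₁ and n₂ quoted in this essay are supplied.

```python
from math import comb

def predicted_count(n, t, x, k_max=2):
    """Series for v_x, truncated after k_max terms.

    n maps a canon frequency j to n_j, the number of word-types seen
    exactly j times in the canon; t is text length / canon length.
    Frequencies missing from n are treated as zero.
    """
    total = 0.0
    for k in range(1, k_max + 1):
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(x + k, k) * t ** k * n.get(x + k, 0)
    return total

n = {1: 14376, 2: 4343}   # n_1 and n_2 as quoted from T&E I
N = 884647                # words in the canon, after Spevack
t = 429 / N               # Shall I die? as analysed in T&E II

v0 = predicted_count(n, t, x=0)
print(f"t = {t:.7f}, predicted new words = {v0:.2f}")
```

For a text this short the t² term changes the result only in the third decimal place, which is why n₁ alone suffices.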

The same paper shows that one of the similar length works by the other Elizabethan authors mentioned above failed this new words test, while all failed the more complex slope test.

In all cases a measure of the statistical significance of the results was given. However, we contend that the conclusions drawn on the basis of any of the three tests are subject to systematic errors which far outweigh the effect of the random error indicated by the various significance parameters used by Thisted and Efron.

We now consider the sources of these systematic errors, and their effects on the tests.

The sources of systematic errors

Any literary detection technique designed to help in studies of authorship must be based on comparisons with proven works of a specific author. It is obvious that any such technique must compare like with like.

Comparative techniques such as that of Thisted and Efron inevitably have to rely on a concordance giving word counts for different works in a canon. Immediately, we run into problems with investigations of suspected works of Shakespeare. Almost four centuries after his death, scholars are still arguing about which works are truly Shakespearean, which works are collaborations, and which are simply interlopers.

For example, the recent Oxford edition [fn: William Shakespeare: The Complete Works; General editors, Stanley Wells and Gary Taylor (Oxford University Press, 1986)] of the complete works denies or disputes Shakespeare's sole authorship of 1-3 Henry VI, Titus Andronicus, Henry VIII, Macbeth, Timon of Athens, The Taming of the Shrew and Pericles. All these are included in Spevack's concordance, on which the computerised version of Thisted and Efron's technique relies.

There also lie more subtle traps for the unwary user of concordances. How reliable a guide to the author's original word usage are such compilations?

Spevack's concordance, to take the work most relevant to our research, is in fact based not on a direct word-count drawn from the earliest folios of Shakespeare, but on a compilation based on word-counts of various editions of the works collated by Blakemore Evans [fn: The Riverside Shakespeare, edited by G. B. Evans (Boston, Mass., 1974)]. Such crucial stylistic determinants as spelling and hyphenation thus stand at one remove, at least, from the original.

Spevack further fails to discriminate between homographs. Thus the adjectival and verbal meanings of the word bare are brought together within the category of words occurring 59 times. Furthermore, this word occurs uniquely as a noun in Shall I die?, a fact completely ignored in Thisted and Efron's analysis of the poem.

There are thousands of such homographs in Spevack. There are thousands more cases of words recorded in Spevack as being editorial emendations of the basic copy-texts.

Spevack's treatment of voiced or unvoiced –ed word endings is far from clear. Indeed, Thisted and Efron appear to have completely misunderstood it in making their own classifications for their analysis of Shall I die?. For example, they have included passed among the words which occur three times in the canon. But that spelling is in fact Spevack's convention for recording the disyllable passed. The word in line 37 of Shall I die? is the monosyllable pass'd, as demonstrated by its scansion and its rhyme with last. Its appearance in the poem should therefore be recorded in the category of words which occur 45 times, not 3, in the canon.

If the Thisted and Efron technique is to be applied to plays rather than poems, the question of whether or not to include stage directions (and the names of characters) has also to be addressed.

Finally, the technique is based on the assumption that Shakespeare's style, as measured by word usage, stays constant over time. But Spevack at least gives no indication of the truth of this assumption. For this he can hardly be blamed: we cannot put undisputed dates to all the works in any case.

One might try to estimate whether the slope test gives an upper or lower bound on word usage by arguing that later works draw on a larger vocabulary accumulated by Shakespeare over time. But can anyone assert this with confidence?

So far, we have considered what might be termed literary sources of systematic error. They affect the values of the observed quantities n_x and μ_x and the calculated quantity v_x on which the technique depends. However, there are also sources of error which derive from mathematical features of Thisted and Efron's approach.

For small works, some observed word frequencies μ_x are likely to be zero. The Oxford Shakespeare edition of Shall I die? in fact records no fewer than 37 zero values of μ_x for values of x from 0 to 99. Equation (2) shows that the regression analysis requires the calculation of the natural logarithm of the ratio of μ_x and v_x. But this quantity becomes divergent for zero values of μ_x (or v_x). Thisted and Efron give no indication of how they tackled this problem. It seems certain, however, that any solution must undermine to some extent the validity of the comparison.
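The difficulty with empty categories can be made concrete. The sketch below uses invented counts and two ad hoc workarounds, dropping the empty categories or adding 0.5 to every observed count; neither is documented as Thisted and Efron's choice, and the helper `ols_slope` is hypothetical. The point is only that the fitted gradient depends on which workaround is picked.

```python
import math

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    return sxy / sxx

# Invented observed counts and predictions for x = 0..5.
m = [9, 7, 5, 0, 4, 0]
v = [7.0, 4.2, 3.3, 2.8, 2.5, 2.2]

# The logarithm of a zero count cannot be evaluated at all.
try:
    math.log(m[3] / v[3])
    log_zero_fails = False
except ValueError:
    log_zero_fails = True

# Workaround A (assumption): drop the empty categories.
keep = [x for x in range(len(m)) if m[x] > 0]
slope_drop = ols_slope([math.log(x + 1) for x in keep],
                       [math.log(m[x] / v[x]) for x in keep])

# Workaround B (assumption): add 0.5 to every observed count.
all_x = range(len(m))
slope_shift = ols_slope([math.log(x + 1) for x in all_x],
                        [math.log((m[x] + 0.5) / v[x]) for x in all_x])

print(log_zero_fails, slope_drop, slope_shift)
```

With these invented numbers the two workarounds even disagree on the sign of the gradient, which illustrates why an unreported treatment of zeros matters for a test that turns on whether the gradient differs from zero.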

Thisted and Efron explicitly tackle another statistical affliction of their technique: the instability in the values of v_x for x greater than about 15. They adopt the device of a "local linear smoother" which has the effect of ironing out the fluctuations in the v_x arising from the instability of the n_x from which they were calculated via (1). The result is a more or less orderly decrease in their value as x increases.

One must question the validity of any comparison based on such smoothed data. Detailed calculations using equation (1) and the table of smoothed v_x cited in T&E II show that almost 1 in 7 of the v_x differ from their "raw" values by at least 30 percent.
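To show how a smoother of this general kind alters individual values, here is a minimal sketch of one possible local linear smoother. The window width, the function name `local_linear_smooth` and the input sequence are all invented for illustration; the details of the smoother actually used in T&E II are not given here and may well differ.

```python
def local_linear_smooth(y, half_width=3):
    """Replace each y[i] by the value at i of a least-squares line
    fitted to the points lying within half_width positions of i."""
    out = []
    for i in range(len(y)):
        lo = max(0, i - half_width)
        hi = min(len(y), i + half_width + 1)
        xs = list(range(lo, hi))
        ys = y[lo:hi]
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((a - mean_x) ** 2 for a in xs)
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
        out.append(mean_y + (sxy / sxx) * (i - mean_x))
    return out

# Invented, jittery sequence standing in for raw v_x values.
raw = [1.9, 2.3, 1.6, 2.0, 1.2, 1.7, 1.1, 1.4, 0.8, 1.2]
smooth = local_linear_smooth(raw)

# Fractional change introduced by the smoothing at each position.
changed = [abs(s - r) / r for s, r in zip(smooth, raw)]
print([round(c, 2) for c in changed])
```

The fractional changes play the role of δv_x/v_x: they are the quantities whose size determines how far the smoothed predictions can be trusted in a comparison with observed counts.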

Of course, one could simply counter all of the above by saying that the effects of the deficiencies we raise are small. We shall therefore now show mathematically that, in fact, they are likely to considerably weaken the conclusions drawn on the basis of the Thisted and Efron technique, especially the slope test.

Quantitative effects of the systematic errors

From the above discussion, we propose that there are significant systematic errors in the two observational parameters used by the Thisted and Efron technique, that is, n_x and μ_x, and in the v_x through the smoothing process used to damp down statistical fluctuations.

What is the quantitative effect of such errors on the tests derived by Thisted and Efron, namely, the new words test and the slope test?

Consider first the new words test. Setting x = 0 in (1), we find that

$$ v_0 = \sum_{k=1}^{\infty} (-1)^{k+1} \, n_k \, t^{k} \qquad (3) $$

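By partial differentiation we find that the error in v₀, δv₀, arising from errors in t and the n_k is given by

$$ \delta v_0 = \frac{\partial v_0}{\partial t} \, \delta t + \sum_{k} \frac{\partial v_0}{\partial n_k} \, \delta n_k \qquad (4) $$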

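Carrying out the partial differentiation, we then find that

$$ \frac{\partial v_0}{\partial t} = \sum_{k=1}^{\infty} (-1)^{k+1} \, k \, n_k \, t^{k-1}, \qquad \frac{\partial v_0}{\partial n_k} = (-1)^{k+1} \, t^{k} $$

So that by (4)

$$ \delta v_0 = \sum_{k=1}^{\infty} (-1)^{k+1} \left( k \, n_k \, t^{k-1} \, \delta t + t^{k} \, \delta n_k \right) $$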

For all cases of interest, t is sufficiently small for the summation to be terminated at k = 2. Ignoring the similarly small term t²δn₂ and substituting for n₁ and n₂ from the table in T&E I (14376 and 4343 words respectively), we finally arrive at the equation

$$ \delta v_0 \approx (14376 - 8686 \, t) \, \delta t + t \, \delta n_1 \qquad (5) $$

Now t can be affected by changes in the works considered to be Shakespearean that are included in the concordance, as this will change N, the total number of words used by Shakespeare, and by editorial emendations to the text under study, which will change the total number of words in the text, W. So, with t = W/N, we can write

$$ \frac{\delta t}{t} = \frac{\delta W}{W} - \frac{\delta N}{N} \qquad (6) $$

Let us now calculate the error in the number of new words predicted to occur in the case of an anonymous play of 15,000 words. We have t = 15000/884647 ≈ 0.017. Expanding (3) to the k = 2 term we then find v₀ ≈ n₁t - n₂t² ≈ 243, so that about 243 words never before seen in Shakespeare should appear in the play.

Now, suppose (conservatively) that 3 works of the same length which are included in the concordance used by Thisted and Efron are not, in fact, by Shakespeare. Then δN/N ≈ -0.05, ie 5 per cent of the current canon is wrongly attributed to him. Taking the words used only once in Shakespeare, ie those in category n₁, to be roughly evenly distributed among the plays, we can similarly estimate δn₁ at -0.05 × 14376 ≈ -719, ie 719 words fewer.

Now consider that editorial emendation cuts the number of words in the text by 1 in 100 (this is representative of the change in the total of words in Shall I die? between the version in T&E II and the Oxford version). Then δW/W = -0.01, and so by (6) δt/t ≈ 0.04, leading to δt ≈ 0.0007.

Using all these values in (5) to estimate the total error in v₀, we find that δv₀ ≈ 22 words, giving a percentage error of about 10 per cent.
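The worked example can be retraced numerically. The sketch below assumes the truncated forms v₀ ≈ n₁t - n₂t², δv₀ ≈ (n₁ - 2n₂t)δt + tδn₁ and δt/t = δW/W - δN/N, which follow when the series is cut off at k = 2; the variable names are illustrative. With these inputs the two contributions to δv₀ carry opposite signs, and the figure of about 22 words is reproduced when their magnitudes are added, as in a worst case. Which combination was intended is not stated, so the signed sum is printed alongside for comparison.

```python
N = 884647             # words in the canon, after Spevack
n1, n2 = 14376, 4343   # word-types seen once and twice (T&E I)

W = 15000              # length of the hypothetical anonymous play
t = W / N
v0 = n1 * t - n2 * t ** 2            # series truncated at k = 2

dN_over_N = -0.05                    # three plays removed from the canon
dn1 = dN_over_N * n1                 # singletons assumed evenly spread
dW_over_W = -0.01                    # emendation trims 1 word in 100
dt = t * (dW_over_W - dN_over_N)     # fractional errors in t = W/N

term_t = (n1 - 2 * n2 * t) * dt      # contribution from the error in t
term_n1 = t * dn1                    # contribution from the error in n_1

worst_case = abs(term_t) + abs(term_n1)
signed_sum = term_t + term_n1

print(f"v0 = {v0:.1f}")
print(f"dt = {dt:.5f}, dn1 = {dn1:.0f}")
print(f"worst case = {worst_case:.1f} ({100 * worst_case / v0:.0f} per cent)")
print(f"signed sum = {signed_sum:.1f}")
```

Because the removal of works from the canon lowers n₁ but raises t, the two effects partly offset each other in the signed sum; the worst-case figure is therefore the more cautious of the two estimates.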

Because it relies on relatively few data, the new words test has quite a small, and quite possibly acceptable, error. The more complex slope test, which Thisted and Efron believe to be the most powerful discriminator of authorship, is, however, far more susceptible to systematic error from the sources outlined above.

It will be recalled that the key parameter for the slope test is the gradient of the regression line given by taking natural logarithms of both sides of (2). This means that β is found from the standard least-squares regression equations

(6)

All the terms containing only x can be calculated explicitly, so that, after some reduction, and writing ln(μ_x/v_x) = ln R_x, we obtain

(7)

We can now calculate the error in β arising from an error in R_x from δβ = (dβ/dR_x)δR_x, so that

(8)

where

$$ \frac{\delta R_x}{R_x} = \frac{\delta \mu_x}{\mu_x} - \frac{\delta v_x}{v_x} \qquad (9) $$
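For orientation, the least-squares gradient and its first-order error might take the following form. This is a hedged reconstruction, not the authors' own display: it assumes an ordinary least-squares fit over the categories x = 0, ..., 99, and the symbols X_x, \bar{X} and w_x are introduced here purely for illustration.

```latex
% Assumed notation: X_x = \ln(x+1), \bar{X} = mean of X_x over the categories used
\begin{align}
  \beta &= \sum_x w_x \ln R_x ,
  \qquad
  w_x = \frac{X_x - \bar{X}}{\sum_{x'} \left( X_{x'} - \bar{X} \right)^2 } , \\
  \delta\beta &\approx \sum_x \frac{\partial \beta}{\partial R_x} \, \delta R_x
              = \sum_x w_x \, \frac{\delta R_x}{R_x} .
\end{align}
```

In this form the weights w_x depend only on x, which is consistent with the remark that all the terms containing only x can be calculated explicitly.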

We now use these last two equations to investigate the effects on the slope test of two sources of systematic error. First, we consider the "linear smoother" adopted by Thisted and Efron to overcome the statistical fluctuations in the calculated v_x.

In T&E II, the smoothing process is introduced for all values of x above 14. This means that the effect it has on β is to be found by taking the summation in (8) from 15 to 99. The series in (1) was used to calculate the unsmoothed values of v_x for Shall I die?, and the smoothed values given in T&E II were used to calculate the proportional error δv_x/v_x for 14 ≤ x ≤ 99. The error in the gradient arising solely from the smoothing of the v_x was then calculated from (8), ie δμ_x/μ_x in (9) was set at zero.

It was found that the smoothing process alone leads to a δβ of +0.062, which is no less than 82 per cent of the value of β found for the poem in T&E II.
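The way fractional errors in the v_x feed into the gradient can be sketched as follows. The code assumes an ordinary least-squares fit over x = 0 to 99, for which the gradient is a weighted sum of the ln R_x; the helper `slope_weights` is hypothetical, and the uniform 30 per cent error applied above x = 14 is an invented input, not the set of values taken from T&E II, so the resulting number is not a reproduction of the figure quoted above.

```python
import math

def slope_weights(xs):
    """Weights w_x such that the least-squares slope of ln R_x on
    ln(x + 1) equals sum(w_x * ln R_x)."""
    X = [math.log(x + 1) for x in xs]
    mean = sum(X) / len(X)
    sxx = sum((a - mean) ** 2 for a in X)
    return [(a - mean) / sxx for a in X]

xs = range(100)                 # categories x = 0..99
w = slope_weights(xs)

# Invented fractional errors: none below x = 15, +30 per cent from 15 to 99.
dv_over_v = [0.0 if x < 15 else 0.30 for x in xs]
dmu_over_mu = [0.0] * 100       # observed counts left untouched

# First-order change in the slope: sum of w_x * (dmu/mu - dv/v).
delta_beta = sum(wx * (dm - dv)
                 for wx, dm, dv in zip(w, dmu_over_mu, dv_over_v))
print(round(delta_beta, 3))
```

Since the weights sum to zero, an error common to every category leaves the gradient unchanged; only errors that vary with x, such as those confined to the smoothed range, shift it.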

It should be remembered that we have only included here the changes in the v_x resulting from the smoothing. As we saw in the calculation of the error in the value of v₀, these quantities are subject to systematic error in the value of t and n_x as well.

The error in β becomes larger still if we include the effects of editorial emendations between the version of the poem in T&E II and the Oxford Shakespeare. It turns out that no fewer than 14 of the 100 values of μ_x used in the calculation of β for the slope test are changed between the two versions.

This time setting δv_x/v_x = 0 in (9) and calculating the error in β arising from the μ_x alone, we find an error of +0.039, 51 per cent of the value of β calculated in T&E II.

Conclusions

The above calculations suggest that the new words test proposed by Thisted and Efron may be sufficiently robust against such effects to prove valuable in literary investigations. However, the potentially more powerful slope test seems to us to rely too greatly on quantities which cannot, at least as yet, be derived with confidence from a concordance.

The results of statistical tests are invariably accompanied by some parameter, be it a probability or a value of χ², which gives a measure of their significance. The Thisted and Efron technique of investigating authorship on the basis of rare word usage is such a test. What we hope we have shown is that the effect of systematic errors should not, indeed cannot, be neglected in any statistical investigation, least of all in the field of literary detection.