In Chinese literature, the authorship of the classic novel “Dream of the Red Chamber” has long been disputed, with the claim that Cao Xueqin wrote the first 80 chapters while the last 40 chapters may have been written by someone else. Let’s see how statistics can help to find out the truth behind the dispute.
Text mining is a statistical approach to analysing text. One common technique examines word frequency distributions by counting the number of times specified words occur in a text.
A well-known example of text mining is the attempt to tackle the authorship attribution problem of “Dream of the Red Chamber”, one of the four great classics of Chinese literature. One approach is to first divide the 120 chapters of the novel into 12 sections of 10 chapters each, identify several hundred of the most frequent words (e.g. di “的”, liao “了”, ren “人”, bu “不”, etc.) and count the number of times each word occurs in each section. Cluster analysis is then carried out on the data collected, i.e. the sections are grouped according to the similarity of their word distributions. The results indicated discrepancies in the style of writing between the first 80 chapters and the last 40 chapters, leading to the conclusion that the novel was written by more than one author.
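The procedure described above (split into sections, count frequent words, cluster the sections) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the actual study: the function names are hypothetical, each "word" is treated as a single character so that no word segmentation is needed, the clustering rule (average linkage on Euclidean distance) is one of many possible choices, and the sections are synthetic strings rather than text from the novel.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical vocabulary: four of the frequent words named in the text.
VOCABULARY = ["的", "了", "人", "不"]


def frequency_profile(text, vocabulary):
    """Relative frequency of each vocabulary item in `text`.

    Assumption: every "word" is a single character, so counting characters
    is enough. Multi-character words would need word segmentation first.
    """
    counts = Counter(text)
    total = len(text) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in vocabulary]


def euclidean(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(profiles, cluster_a, cluster_b):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def cluster_sections(profiles, k):
    """Merge the two closest clusters repeatedly until only k remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                profiles, clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


def toy_section(de, le, ren=1, bu=1, filler=10):
    """Build a synthetic 'section' with chosen character counts."""
    return "的" * de + "了" * le + "人" * ren + "不" * bu + "天" * filler


# Synthetic data, NOT taken from the novel: sections 0-7 favour "的",
# sections 8-11 favour "了", so two groups exist by construction.
sections = [toy_section(6 + i % 2, 2) for i in range(8)]
sections += [toy_section(2, 6 + i % 2) for i in range(4)]

profiles = [frequency_profile(s, VOCABULARY) for s in sections]
clusters = cluster_sections(profiles, k=2)
print(clusters)
```

Because the toy sections were built with two distinct word-usage patterns, the sketch groups sections 0 to 7 together and sections 8 to 11 together. With real text, the sections would be the twelve 10-chapter blocks, the vocabulary would hold several hundred frequent words, and the grouping would not be known in advance.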