Some thoughts on classification

Let's start with a specific question: how do you classify an e-book? (The same problem extends to how to classify information in general.)

The traditional method is a top-down catalogue classification (taxonomy). In China this is the Chinese Library Classification (4th edition); in the United States it is the Library of Congress Classification. However, these schemes are very complex (the full list of subheadings runs to thousands of pages), which makes them costly to implement. Moreover, they do not fit the reality of e-books well. First, the divisions are too fine: a book often ends up in a fourth- or fifth-level subcategory. Second, the divisions are uneven. In the Chinese Library Classification, category A is "Marxism, Leninism, Mao Zedong Thought, Deng Xiaoping Theory" and category I is "Literature". For an e-library, giving A a top-level category of its own is clearly wasteful, while I could be split into at least two major categories, "Chinese literature" and "foreign-language literature".

Another inherent weakness of catalogue classification is that it is sometimes unclear which category a book belongs to. For example, should "Selected Poems of 18th Century England (bilingual)" be a "language" book or a "literature" book? One solution would be to place it in both categories, but this would create a great deal of redundant work.

In short, catalogue classification is not an ideal method for classifying huge amounts of information. It is, however, intuitive and convenient, and in that respect other classification methods find it hard to match.

With the development of the Internet, a new classification method has emerged: folksonomy. A typical example is the website Del.icio.us.

In a folksonomy, users attach free-form tags to each piece of information, and the most frequently used tags are taken to be the ones that best describe its characteristics.
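The idea that the most used tags describe an item best can be sketched as a simple frequency count. This is only a minimal illustration: the function name `top_tags` and the sample tags are assumptions made up for the example, not something taken from Del.icio.us.

```python
from collections import Counter


def top_tags(taggings, n=3):
    """Return the n tags applied most often to one item.

    `taggings` holds one list of tags per user who tagged the item.
    """
    counts = Counter(tag for user_tags in taggings for tag in user_tags)
    return [tag for tag, _ in counts.most_common(n)]


# Hypothetical tags that four users attached to the same e-book.
taggings = [
    ["poetry", "english", "bilingual"],
    ["poetry", "18th-century"],
    ["poetry", "english"],
    ["literature", "english", "poetry"],
]

print(top_tags(taggings, n=2))  # ['poetry', 'english']
```

Here "poetry" was applied four times and "english" three times, so those two tags end up representing the book.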

Tagging is very convenient, and tags are easy to combine, but there are some problems.

(1) Different users often interpret the same tag differently. For example, under the "Tools" tag you may find items that have nothing to do with each other.

(2) The problem of synonyms. Users may tag the same thing with different words, such as "tv/television", "Holland/Netherlands/Dutch", "Supergirl/Supergirl". In English, there is also the problem of singular and plural forms.

(3) The problem of polysemy, where one word has several meanings. For example, does the tag "china" refer to the country or to porcelain?

(4) The sheer variety of user-created tags can produce a lot of "noise", which increases the burden on the system and reduces the accuracy of classification.

Therefore, the best solution should be to combine top-down catalogue classification with tag-based folksonomy, and to restrict tags to a controlled vocabulary, so that not every word can be used as a tag.
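As a rough sketch of what such a combination might look like, the example below stores one catalogue category per book and accepts only tags that map onto a small controlled vocabulary. The vocabulary, the names `CONTROLLED_VOCABULARY`, `normalize_tags` and `classify`, and the sample data are all assumptions made for illustration. It handles synonyms and plural forms through a lookup table but does nothing about polysemy.

```python
# Hypothetical controlled vocabulary: each accepted tag maps to its known variants.
CONTROLLED_VOCABULARY = {
    "television": {"tv", "television", "televisions"},
    "netherlands": {"holland", "netherlands", "dutch"},
    "tool": {"tool", "tools"},
}

# Reverse index from every variant to its canonical tag.
VARIANT_TO_TAG = {
    variant: tag
    for tag, variants in CONTROLLED_VOCABULARY.items()
    for variant in variants
}


def normalize_tags(raw_tags):
    """Split raw user tags into canonical tags and rejected words."""
    accepted, rejected = [], []
    for raw in raw_tags:
        tag = VARIANT_TO_TAG.get(raw.strip().lower())
        if tag is None:
            rejected.append(raw)      # not in the vocabulary: treated as noise
        elif tag not in accepted:
            accepted.append(tag)      # synonyms collapse into one canonical tag
    return accepted, rejected


def classify(category, raw_tags):
    """Combine one top-down catalogue category with vocabulary-checked tags."""
    accepted, rejected = normalize_tags(raw_tags)
    return {"category": category, "tags": accepted, "rejected": rejected}


record = classify("Literature", ["TV", "Holland", "Dutch", "asdf"])
print(record)
```

With this input, "Holland" and "Dutch" collapse into the single tag "netherlands", and "asdf" is rejected because it is not in the vocabulary. Who maintains the vocabulary and how new words get added is left open here.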

Of course, this is the ideal situation, and the technical implementation seems very difficult.

(Note: This article is used to organize my thoughts for later revision and additions.)
