ETL 505 Module 4 – Vocabularies

  • Information seekers typically search for information by subject
  • Controlled subject vocabularies are standardised vocabularies applied to the subject metadata so that resources on a particular subject are grouped together, enabling more efficient searching and information retrieval
  • Hider 2012 – “The aim is for each concept to be represented by one, and only one, particular term, and for each term to mean only one particular concept” (p.152).
  • Allocating subject terms is difficult when it is open to interpretation, and a resource’s subject can sometimes be hard to identify
  • Some resources may also cover more than one topic or have sub-topics. E.g. Australian history – do you use the single heading ‘Australian History’, or two separate headings, ‘Australia’ and ‘History’?
  • Boolean searching – terms are combined with ‘and’, ‘or’ and ‘not’ operators, so less pre-coordination of terms is needed at the point of search, but the results are likely to be less precise than if terms were pre-coordinated vocabulary strings
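The post-coordinate combination of terms can be sketched with a toy inverted index (all terms and record numbers below are invented for illustration):

```python
# Toy inverted index: term -> set of record numbers (invented for illustration).
index = {
    "australia": {1, 2, 4},
    "history":   {1, 3, 4},
    "geology":   {2, 3},
}

def boolean_search(index, *, all_of=(), any_of=(), none_of=()):
    """Post-coordinate Boolean search: AND (all_of), OR (any_of), NOT (none_of)."""
    if any_of:
        results = set().union(*(index.get(t, set()) for t in any_of))
    else:
        results = set().union(*index.values())  # start from every record
    for term in all_of:
        results &= index.get(term, set())       # AND narrows the result set
    for term in none_of:
        results -= index.get(term, set())       # NOT excludes records
    return results

# 'australia AND history' retrieves every record indexed under both terms.
print(boolean_search(index, all_of=["australia", "history"]))  # {1, 4}
```

Note that the AND of two free terms retrieves any record indexed under both, whereas a pre-coordinated string such as ‘Australia – History’ would pin down the relationship between the terms.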

Main Types of Controlled Vocabulary used by Information Agencies

1. Subject Headings Lists
  • Had origins at time of card catalogues where patrons could look under subject headings for relevant records
  • When these cards were copied and distributed, common or controlled subject vocabularies became the norm
  • Library of Congress Subject Headings (LCSH) is one of the most widely utilised in English-speaking libraries
  • Now used as access points for users conducting subject searches in bibliographic databases and online catalogues
  • LCSH provides for pre-coordination, and the order of the subdivisions in each string is fixed
  • Strings can be difficult to construct and can be problematic for computers to process so there have been attempts to break up LCSH strings (e.g. Faceted Application of Subject Terminology – FAST)
  • “Some LCSH can describe form…as opposed to subject. That is, the vocabulary includes terms to represent not only what a resource is about, but what it is” (Hider, 2012, p.158) – Library of Congress Genre/Form Terms for Library and Archival Materials (LCGFT)
  • Criticisms of LCSH – it is inconsistent and biased towards American culture and terminology; however, it is so widely used that creating a new system would be too costly
  • Other subject heading lists do exist for non-English-speaking countries and other agencies, but they do not have the coverage of LCSH (e.g. the SCIS Subject Headings List – SCISSHL – a controlled vocabulary suited to the needs of school libraries in Australia/New Zealand)
2. Subject Thesauri
  • Designed for automated retrieval systems so the term ‘descriptors’ is used instead of ‘headings’ (which relates to the card cataloguing system)
  • It helps indexers and searchers to choose words consistently to describe things or concepts
  • It helps standardise the use of terminology to improve indexing and retrieval
  • “Subject thesauri are also more than just lists; they represent structures based on systematic cross-referencing” (Hider, 2012, p.159).
  • Differ from subject heading lists: although heading lists include many cross-references, they, unlike subject thesauri, were not primarily designed as systems of interrelated concepts
  • Like a regular thesaurus, a subject thesaurus deals in synonyms, but it is more selective in the vocabulary it covers (based on literary and user warrant)
  • Take time to produce and maintain
  • Large number available, covering all kinds of subject areas (e.g. The Schools Online Thesaurus – SCOT which is used by Australian/New Zealand schools)
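The systematic cross-referencing described above (USE/UF, broader, narrower and related terms) can be sketched as a small data structure; the descriptors below are invented, not taken from SCOT or any real thesaurus:

```python
# Invented descriptors showing a thesaurus's cross-reference structure:
# UF = used for (non-preferred synonyms), BT/NT = broader/narrower terms,
# RT = related terms.
thesaurus = {
    "Mammals":    {"UF": ["Mammalia"], "BT": ["Animals"],
                   "NT": ["Marsupials"], "RT": ["Zoology"]},
    "Marsupials": {"UF": ["Pouched mammals"], "BT": ["Mammals"],
                   "NT": [], "RT": []},
}

def preferred_descriptor(term):
    """Map a user's term to the descriptor the thesaurus prescribes."""
    for descriptor, refs in thesaurus.items():
        if term == descriptor or term in refs["UF"]:
            return descriptor
    return None  # term is outside the controlled vocabulary

print(preferred_descriptor("Pouched mammals"))  # Marsupials
```

This is what makes a thesaurus a system of interrelated concepts rather than a flat list: indexers and searchers who start from different synonyms are steered to the same descriptor.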
3. Subject Classification Schemes
  • Used for arrangement (e.g. Dewey Decimal Classification, WebDewey, Library of Congress Classification – LCC, Universal Decimal Classification – UDC) to group the actual resources, such as books, that are about similar subjects. The grouping of digital content is also becoming a key issue for many libraries – e.g. a virtual bookshelf might mirror the arrangement of the physical books on a shelf
  • “The structure of a faceted thesaurus is a classification scheme, and many subject thesauri assign notation to each of their descriptors so that resources can be arranged in this structure. In this way, subject thesauri can double as classification schemes and provide two vocabularies: one for indexing and one for arrangement” (Hider, 2012, p.163).
  • Some classification schemes are appropriate for particular contexts (e.g. National Library of Medicine Classification – NLM)
  • Faceted classification (where the notation for any facet of a subject can be combined with the notation for any other facet) allows for more flexibility and is more conducive to computer processing. Many older schemes may become more faceted in the future; the key is to combine facets in a consistent way
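As a rough sketch of how faceted notation combines, the toy tables below join one notation per facet in a fixed citation order; the notation loosely echoes UDC-style class marks but is not drawn from any real scheme:

```python
# Invented facet tables; the citation order is fixed so that every indexer
# combines facets the same way.
FACETS = ["discipline", "place", "period"]  # fixed citation order
TABLES = {
    "discipline": {"history": "94"},
    "place":      {"australia": "(94)"},
    "period":     {"20th century": '"19"'},
}

def class_mark(**facet_values):
    """Join the notation for each supplied facet, in citation order."""
    return "".join(TABLES[f][facet_values[f]] for f in FACETS if f in facet_values)

print(class_mark(discipline="history", place="australia"))  # 94(94)
```

Because any facet can be omitted or combined with any other, new compound subjects do not need their own enumerated class numbers.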

Taxonomies and Ontologies

  • Library classification schemes can be used to arrange links to resources in online directories, although they are not always the most appropriate means of arranging digital collections.
  • There are schemes designed specifically for online collections (e.g. Open Directory Project)
  • When website and intranet pages have their links logically arranged, the resulting schemes are known as taxonomies in the field of information architecture.
  • Library classification schemes use artificial notation to maintain a particular order; taxonomies might be considered distinct because they do not need notation, being independent of the location of the resources they represent (i.e. a web page does not need to be removed from its location to be used)
  • “A taxonomy’s label (i.e. vocabulary) may also be used for indexing and ‘search’ purposes” (Hider, 2012, p.172).
  • Can be polyhierarchical (allowing for various citation orders)
  • “An ontology is a knowledge structure as conceived (and labelled) by people. Such a structure may not be limited to taxonomic relationships, of the ‘x is a type of y’ type. All sorts of relationships might be specified. For example, x might produce y” (Hider, 2012, pp.172-173)
  • Can be difficult to construct and use
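The “all sorts of relationships” point can be sketched by storing an ontology as typed subject–relation–object triples (the triples below are invented):

```python
# An ontology stored as typed subject-relation-object triples (invented).
triples = [
    ("Novelist", "is_a", "Author"),       # taxonomic: x is a type of y
    ("Author", "produces", "Book"),       # non-taxonomic: x produces y
    ("Publisher", "publishes", "Book"),
]

def related(subject, relation):
    """Everything the subject is linked to by the given relation."""
    return [obj for s, rel, obj in triples if s == subject and rel == relation]

print(related("Author", "produces"))  # ['Book']
```

Unlike a pure taxonomy, the relation label itself carries meaning, which is part of what makes ontologies harder to construct and use.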

Advantages of controlled vocabularies

  • Controlled vocabularies can “eliminate or reduce ambiguity; control the use of synonyms; establish formal relationships among terms; and test and validate terms.” (Hillmann & Marker, 2008, p.17)
  • “In contrast, the metadata created by librarians using standards-based metadata approaches is considered to be of relatively higher quality in light of its accuracy, completeness, and consistency” (Alemu et al., 2012a, p.313)

Disadvantages of controlled vocabularies

  • Controlled vocabularies “do not always reflect the search terms of users and social metadata might better reflect the diversity of cultural, linguistic and local perspectives” (Alemu et al., 2012a, p.312)
  • Controlled vocabularies can be difficult and expensive to create and maintain

Advantages of user-generated metadata

  • “Socially constructed metadata approaches can be looked at from two dimensions: user-generated (explicit) and machine-generated (implicit) metadata. Thus far, tagging is considered the most dominant type of user-generated metadata. Tagging is the process of attaching labels to objects to make identification and retrieval easier. In the context of information systems, tagging refers to the process of characterizing (describing) an information object with user-chosen keywords…This resultant metadata from tag aggregation and analysis is often referred to as folksonomy” (Alemu et al., 2012a, p.320).
  • It is inexpensive to generate (free)
  • It might better represent the vocabulary of searchers and, hence, improve discovery
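A minimal sketch of how tags aggregate into a folksonomy, using invented tagging events; note the uncontrolled synonyms (‘movies’ vs ‘films’) that a controlled vocabulary would collapse:

```python
from collections import Counter

# Invented (resource, tag) events from different users.
tag_events = [
    ("resource-1", "movies"), ("resource-1", "films"),
    ("resource-1", "movies"), ("resource-1", "classics"),
]

def folksonomy(events, resource):
    """Aggregate the free-text tags applied to one resource."""
    return Counter(tag for res, tag in events if res == resource)

# The most popular label surfaces, but 'films' remains a separate tag.
print(folksonomy(tag_events, "resource-1").most_common(1))  # [('movies', 2)]
```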

Disadvantages of user-generated metadata

  • “Socially constructed metadata approaches such as tagging, rating, reviews, and recommendations have their own limitations including their lack of structure (synonym/homonym) and authoritative control” (Alemu et al., 2012a, p.316)
  • “User-generated metadata is considered to lack structure and reliability due to the absence of editorial quality. Major limitations of socially constructed metadata approaches include ambiguity that results from its lack of hierarchical (broader/narrower/related/homonym/synonym) relations, idiosyncratic nature of user-generated metadata” (Alemu et al., 2012a, p.313)
  • “Socially constructed metadata (Web 2.0) approaches are criticised for being flat, one-dimensional and plagued with inconsistencies” (Alemu et al., 2012b, p.551)

Natural Language Approach

  • Taken from the resource itself
  • Advantages – often more up-to-date than a thesaurus; requires less time, and therefore less cost, because no translation into a controlled vocabulary is needed; and might be more indicative of the vocabulary used by searchers
  • Disadvantages – Lack of control might lead to less precise search results

Mechanisms for supporting natural language searching:

  • Keyword searching (e.g. by fields such as author, title, etc.)
  • Records enhancement
  • “The idea here is simple: one way to improve the recall for subject searches is to provide more terms for the user to search on; so for surrogate records in library systems we should add more terms – that is, we should increase the size of the vocabulary” (Hider & Harvey, 2008, p.157)
  • Records are created using information from other parts of the resource (e.g. contents page)
  • More detailed records are more costly to produce
  • Abstracts can be useful in records for providing information and enabling decision-making about the value of a resource. They need to be concise, comprehensive and unambiguous. They can be created by the author of the resource, in which case subjectivity can be an issue, or by a professional abstractor, in which case cost may be a factor

Automated indexing – uses the computer to derive information about the resource

  • Limitations – can only derive words present in the resource (no interpretation)
  • “In extraction indexing, terms are extracted from the information resource for inclusion in the index on the basis of how frequently they occur in the information resource. The most frequently occurring words are included in the index. Stems (for example, think, thinking) can be recognised.
  • In assignment indexing, the terms in the index are not necessarily found in the text of the information resource, but can be matched against a thesaurus. Profile terms (non-preferred terms) are matched against the thesaurus terms (descriptors) and if there is a match, then the descriptor is allocated for the purposes of determining whether a term should be included in the index on the basis of word frequency.” (Hider & Harvey, 2008, p. 161)
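A rough sketch of both approaches, with an invented text, stopword list and thesaurus (real systems add stemming and term weighting):

```python
from collections import Counter

STOPWORDS = {"the", "of", "a", "and", "in"}  # invented stopword list

def extraction_index(text, n=2):
    """Extraction indexing: pick the n most frequent non-stopwords in the text."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(n)]

# Invented thesaurus mapping non-preferred entry terms to descriptors.
thesaurus = {"koalas": "Marsupials", "kangaroos": "Marsupials"}

def assignment_index(text):
    """Assignment indexing: allocate descriptors that need not occur in the text."""
    return {thesaurus[w] for w in text.lower().split() if w in thesaurus}

text = "the koalas and kangaroos of australia koalas love australia"
print(extraction_index(text))   # ['koalas', 'australia']
print(assignment_index(text))   # {'Marsupials'}
```

The contrast is visible in the output: extraction can only surface words present in the text, while assignment allocates the descriptor ‘Marsupials’ even though it never occurs in the resource.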


Alemu, G., Stevens, B., Ross, P., & Chandler, J. (2012a). The social space of metadata: Perspectives of LIS academics and postgraduates on standard-based and socially constructed metadata approaches. Journal of Library Metadata, 12(4), 311-344. doi: 10.1080/19386389.2012.735523

Alemu, G., Stevens, B., Ross, P., & Chandler, J. (2012b). Linked data for libraries: Benefits of a conceptual shift from library-specific record structures to RDF-based data models. New Library World, 113(11-12), 549-570. doi: 10.1108/03074801211282920

Hider, P. (2012). Chapter 8: Vocabularies. In Information resource description: Creating and managing metadata (pp. 151-180). London: Facet.

Hider, P., & Harvey, R. (2008). Chapter 8: Alphabetical subject access mechanisms. In Organising knowledge in a global society (pp. 133-164). London: Chandler.

Hillmann, D., & Marker, R. (2008). Metadata standards and applications. The Serials Librarian, 54(1-2), 7-21. doi: 10.1080/03615260801973364

Pelkie, T. (2009). Folksonomies and tagging. Retrieved from https://www.youtube.com/watch?v=e8zajIMPVQE