Friday, 27 February 2015

Thesaurus Debate needs to move on

Surprise, surprise - last Thursday's debate on this proposition was a pushover for the opposition. To defeat any argument of the form “XXX has no place in YYY”, all you have to provide is one counter-example.
Just for starters:
  •  The UK Data Archive, powered by the HASSET thesaurus,
  •  the FAO’s AGRIS database, searchable using AGROVOC, and
  •  EUROVOC, used for searching publications of the EU institutions and others

were among 11 such examples that Leonard Will managed to cram on to one slide. He could have gone on to cite dozens more cases where a thesaurus provides sophisticated and indispensable search capabilities.
The “expert witness” Philip Carlisle backed him up by describing the nine vocabularies and related services that English Heritage built and maintains for the heritage community. Contributions from the floor drew attention to the power of a thesaurus to cross language boundaries, not to mention image searching, where indexing with a controlled vocabulary still outperforms all the other methods.  
But simply overthrowing the proposition misses the point – the role of the thesaurus in modern information retrieval has shrunk from what it once was. The high development and maintenance costs of an extensive controlled vocabulary deter most potential implementers. Most users simply do not want to know about such a complicated-looking beast, and so the shy thesaurus needs to perform discreetly but cost-effectively behind the scenes. Given a discerning team of developers, curators, IT support staff and indexers, this sophisticated tool can and should function interoperably alongside statistical algorithms, NLP techniques, data mining, clustering, latent semantic indexing, linked data, etc. Networking and collaboration, not rivalry, are the future.
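To make "behind the scenes" concrete, here is a minimal sketch of a thesaurus quietly expanding a user's query before a simple term-frequency ranking takes over. Everything in it is an assumption for illustration: the three-entry vocabulary, the sample documents and the function names are hypothetical, and the crude substring count merely stands in for whatever statistical ranking a real system would use.

```python
# Hypothetical mini-thesaurus: preferred term -> its relationships.
# UF = "use for" (non-preferred synonyms), NT = narrower terms, BT = broader terms.
THESAURUS = {
    "cattle": {
        "UF": ["cows"],
        "NT": ["dairy cattle", "beef cattle"],
        "BT": ["livestock"],
    },
    "dairy cattle": {"BT": ["cattle"]},
    "beef cattle": {"BT": ["cattle"]},
}

# Hypothetical document collection.
DOCS = {
    "d1": "Cows grazing on upland pasture",
    "d2": "Dairy cattle breeding programmes",
    "d3": "Wheat yields in dry years",
}


def build_entry_index(thesaurus):
    """Map every preferred and non-preferred (UF) term to its preferred term."""
    index = {}
    for preferred, relations in thesaurus.items():
        index[preferred] = preferred
        for alternative in relations.get("UF", []):
            index[alternative] = preferred
    return index


def expand_query(term, thesaurus, include_narrower=True):
    """Return the search terms implied by one user term.

    Terms unknown to the thesaurus fall back to plain free-text search.
    """
    preferred = build_entry_index(thesaurus).get(term.lower())
    if preferred is None:
        return [term.lower()]
    relations = thesaurus[preferred]
    expanded = [preferred] + relations.get("UF", [])
    if include_narrower:
        expanded += relations.get("NT", [])
    return expanded


def search(term, documents, thesaurus):
    """Rank documents by a crude count of expanded-term occurrences."""
    terms = expand_query(term, thesaurus)
    scores = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        score = sum(lowered.count(t) for t in terms)
        if score:
            scores[doc_id] = score
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

In this sketch a user who types "cows" never sees the vocabulary, yet the search also retrieves the document indexed under "dairy cattle", which a literal match on "cows" would miss.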
As the professional body that has grown up around classification, indexing, use of thesauri and other knowledge organization systems, ISKO has a mandate to mark out that future. Follow-up activities could usefully explore:
  •  the contexts in which the thesaurus is or is not a useful tool;
  •  how to choose between a thesaurus and another type of knowledge organization system;
  •  how to integrate a thesaurus with the other components of a modern information retrieval system;
  •  how to adapt a standard thesaurus to the needs of special contexts;
  •  features of the software needed for thesaurus management.

The knowledge organizer with a grasp of these topics is ideally placed to develop the hybrid vocabulary structures (e.g. a layer of thesaurus model hooked on to upper level ontologies and coated with taxonomy features) needed in today’s networked environments.

1 comment:

Stella Dextre Clarke said...

This comment has been posted on behalf of Birger Hjørland:

"First, I would like to congratulate you on this fine initiative!

We really need in information science to consider what we are doing and the basic premises on which we are acting.

I’ll provide a few comments here, but I would find it better and more satisfactory to provide comments in our journal “Knowledge Organization”. Therefore here only some short points:

In my recent paper, “Are relations in thesauri “context-free, definitional, and true in all possible worlds”?”, I criticize the claim that “paradigmatic relationships are those that are context-free, definitional, and true in all possible worlds” and that paradigmatic relations are the kinds of semantic relations used in thesauri and other knowledge organization systems. In other words: I see problems in some common norms in standards and understandings of relations in thesauri.

In another paper in press in Knowledge Organization “Theories are knowledge organizing systems (KOS)”, I consider the relations between thesauri and ontologies and argue that “it does not follow that thesauri would not improve, if these characteristics from ontologies were adapted. The question is why thesauri are limited to the relatively few kinds of semantic relations (and therefore tend to bundle different relationships)? As far as I know, there has never been put forward arguments or research demonstrating the functionality of such a bundling. The set of relations used in thesauri have to my knowledge never been theoretically motivated! (They may be intuitively motivated by the need of searchers in online databases to increase “recall” and “precision” but this function has never been properly examined and for me it seems unlikely that a broader set of specified semantic relations should not provide better results).”

There is much more to say about controlled vocabularies in general and the challenge they face from Google-like systems, and our community needs to explore it. But my attitude tends to support the claim “that the traditional thesaurus has no place in modern information retrieval”.

Let us continue this important debate!"

Birger Hjørland