Indexes on the Web

Website Indexing

Website indexing, in the indexing profession, refers to creating a back-of-the-book style index on a website, with index entries and subentries hyperlinked to the relevant web pages or to anchors within pages. The website index page (not to be confused with the home page, which typically has the file name index.htm or index.html) is referred to as a “site index” or “A-Z index.” It complements the “site map,” which functions somewhat as a table of contents. Search engines usually achieve the desired results for the web as a whole, where billions of pages are indexed and the user merely wants some information on a topic rather than everything on it. Search engines on individual websites, however, generally do not achieve the results that users would like. In the mid- and late 1990s, some websites adopted a site A-Z index as an additional means for users to find information on their sites. Software to aid indexers in creating website indexes was also developed in the late 1990s.
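To make the structure of such an index concrete, the following is a minimal sketch in Python of how hyperlinked entries and subentries might be rendered as an HTML fragment grouped under letter headings. The entry data, the file names, and the function name build_az_index are hypothetical and chosen only for illustration; this is not how any particular indexing software works.

```python
from html import escape
from itertools import groupby

# Hypothetical index data: (main entry, subentry or None, target page or page#anchor)
ENTRIES = [
    ("membership", "dues", "join.html#dues"),
    ("conferences", None, "events.html"),
    ("membership", "benefits", "join.html#benefits"),
    ("awards", None, "about.html#awards"),
]


def build_az_index(entries):
    """Return an HTML fragment with hyperlinked entries grouped by first letter."""
    # Sort by main entry, then by subentry, ignoring case.
    ordered = sorted(entries, key=lambda e: (e[0].lower(), (e[1] or "").lower()))
    lines = []
    for letter, letter_group in groupby(ordered, key=lambda e: e[0][0].upper()):
        lines.append(f"<h2>{letter}</h2>")
        lines.append("<ul>")
        for main, main_group in groupby(letter_group, key=lambda e: e[0]):
            rows = list(main_group)
            direct = [url for _, sub, url in rows if sub is None]
            subs = [(sub, url) for _, sub, url in rows if sub is not None]
            # A main entry links directly only if it has its own target.
            if direct:
                head = f'<a href="{escape(direct[0])}">{escape(main)}</a>'
            else:
                head = escape(main)
            if subs:
                lines.append(f"<li>{head}<ul>")
                for sub, url in subs:
                    lines.append(f'<li><a href="{escape(url)}">{escape(sub)}</a></li>')
                lines.append("</ul></li>")
            else:
                lines.append(f"<li>{head}</li>")
        lines.append("</ul>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_az_index(ENTRIES))
```

In this sketch, a main entry with subentries becomes a nested list, and each link may point either to a whole page or to an anchor within a page, mirroring the description above. Deciding which entries and subentries to include remains the indexer's work; the code only handles layout.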

Websites suitable for such A-Z indexes are comparable in size to a small-to-medium book, from 25 to a few hundred pages. Index maintenance is an issue, so most of the web pages need to be static to be suitable for an A-Z index. Website indexes are not difficult to update, but the skilled indexer who created the index is often no longer on staff to make periodic minor updates, and it is not worth the trouble to contract an indexer for 15-20 minutes every now and then.

Website index example: A-Z of the American Society for Indexing website.

Website indexing, however, did not become the commercial endeavor that indexers had hoped for. Websites have grown too large, too fast, with changing underlying technologies. A-Z indexes can still be found on some websites, typically where someone connected to the website owner knows how to index and can maintain the index. On a public website, such an index is a good way for an indexer to showcase their skills.

Web content management systems ultimately provided the findability options needed on large and changing websites, where search alone had failed. If implemented properly, a web content management system (such as Drupal, WordPress, Joomla!, or proprietary software such as Adobe Experience Manager, and SharePoint for intranets) can include a taxonomy, thesaurus, or other controlled metadata. When web pages are added or changed, these controlled vocabulary terms are applied (indexed) to the page or uploaded document. The onsite search engine then gives more weight to search strings that match controlled vocabulary terms and their variants than to mere full-text keyword matches. Unlike traditional website indexing, the indexing is to the page level only and not to an anchor within a page. This assigning of controlled vocabulary metadata is more akin to periodical/database indexing than it is to back-of-the-book indexing. Since an information professional would have created the taxonomy or other metadata along with instructions for indexing, those applying the terms/tags don’t need to be professional indexers. Those interested in the development of taxonomies for web content management should see the Taxonomies & Controlled Vocabularies SIG of ASI.
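The weighting idea can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the vocabulary, the sample pages, the weights (3.0 and 1.0), and the function names do not come from Drupal, WordPress, or any other system named above, and real onsite search engines use far more elaborate ranking.

```python
# Hypothetical controlled vocabulary: preferred term -> variant terms
VOCABULARY = {
    "indexing software": ["indexing tools", "indexing programs"],
    "taxonomies": ["taxonomy", "controlled vocabularies"],
}

# Hypothetical pages, each with page-level tags (applied terms) and full text
PAGES = [
    {"url": "/tools", "tags": {"indexing software"},
     "text": "Reviews of programs used by indexers."},
    {"url": "/blog/2019", "tags": set(),
     "text": "We tried some indexing tools last year."},
    {"url": "/about", "tags": set(),
     "text": "About the society."},
]

# Assumed weights: a controlled vocabulary match counts more than a text match
TAG_WEIGHT = 3.0
TEXT_WEIGHT = 1.0


def resolve_term(query, vocabulary):
    """Map a query to its preferred term if it is a known term or variant."""
    q = query.strip().lower()
    for preferred, variants in vocabulary.items():
        if q == preferred or q in variants:
            return preferred
    return None


def search(query, pages, vocabulary):
    """Return page URLs ranked by score, highest first."""
    preferred = resolve_term(query, vocabulary)
    q = query.strip().lower()
    results = []
    for page in pages:
        score = 0.0
        if preferred is not None and preferred in page["tags"]:
            score += TAG_WEIGHT   # page was indexed with the matching term
        if q in page["text"].lower():
            score += TEXT_WEIGHT  # plain full-text keyword match
        if score > 0:
            results.append((score, page["url"]))
    results.sort(key=lambda r: (-r[0], r[1]))
    return [url for _, url in results]


if __name__ == "__main__":
    print(search("indexing tools", PAGES, VOCABULARY))
```

In this sketch, a search for the variant “indexing tools” ranks the page tagged with the preferred term “indexing software” above a page that merely contains the words, even though the tagged page never uses the phrase in its text. Tags attach to whole pages, consistent with the page-level indexing described above.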

See also the page: Website Index Best Practices

Indexed Documents on the Web

There is also a role for indexes, sometimes called “web-mounted” indexes, in documents posted on the web as HTML or PDF files that are large enough to benefit from an internal index. These could be large single files or a collection of files/pages. This includes online books, which are not the same as ebooks, as the latter include navigation features utilized in an ereader application. As web content grows in all forms, the number of large documents on the web is also increasing. There are different methods for creating indexes for such web documents, depending on whether a print version index existed previously, would be created at the same time, or would not be created at all.

Resources

Software tools:

  • HTML Indexer – stand-alone software for indexing websites or collections of HTML documents. (Brown Inc.)

Web resources:

Articles:

Books and book chapters: