Search Engine Optimization

Indexing SobekCM

Search and indexing robots are currently allowed to index all pages through the web application. However, a robot identification algorithm is applied to each incoming request, and any request identified as coming from a search engine indexing robot is handled quite differently from a human request.
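As an illustration only, the check applied to each request might look something like the sketch below. The library's actual identification algorithm is described on its own page and is not reproduced here; the User-Agent substring test, the signature list, and the function names are all assumptions made for this example.

```python
# Hypothetical User-Agent fragments, chosen for illustration only.
# The library's real identification algorithm may use different signals.
ROBOT_SIGNATURES = ("googlebot", "bingbot", "slurp", "crawler", "spider")


def is_robot(user_agent):
    """Return True when the User-Agent header looks like an indexing robot."""
    if not user_agent:
        # Assumption: a missing header is treated as a human request.
        return False
    agent = user_agent.lower()
    return any(signature in agent for signature in ROBOT_SIGNATURES)


def route_request(user_agent):
    """Pick a handling path for a request based on the robot check."""
    return "robot" if is_robot(user_agent) else "human"
```

Because the check runs on every request, a cheap test such as a substring scan keeps the added overhead small.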

When a robot requests a single item, a small static HTML fragment page for that item is read and displayed through the web application. This allows the URL to remain the same while keeping execution very quick (usually between one and three milliseconds of application time). These pages also make the full text of the item, as well as the full citation, available for indexing. The HTML served to robots and the HTML served to human requestors appear very similar at the top, although the robot page includes much more indexable data below the primary citation.

When a robot requests the browse of a single collection, a simple list of titles and URLs is provided so that the robot can index the list and follow the links to the individual resources.

Robots cannot perform searches against the library or against individual items within the library.
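The three cases above (single item, collection browse, search) can be sketched as a small dispatcher. This is a simplified model rather than the web application's code: the in-memory dictionaries stand in for the static fragment files and the collection lists, and the function names and HTTP status codes are assumptions.

```python
from html import escape


def render_browse(items):
    """Build a plain list of links from (title, url) pairs."""
    links = [
        '<a href="%s">%s</a><br />' % (escape(url, quote=True), escape(title))
        for title, url in items
    ]
    return "\n".join(links)


def handle_robot_request(kind, key, static_pages, browse_lists):
    """Return (status, body) for a request already identified as a robot."""
    if kind == "item":
        # Serve the pre-generated fragment as-is; nothing is built per request.
        page = static_pages.get(key)
        return (200, page) if page is not None else (404, "")
    if kind == "browse":
        return 200, render_browse(browse_lists.get(key, []))
    # Searches are not available to robots (the 403 status is an assumption).
    return 403, ""
```

Serving a stored fragment unchanged is what keeps the per-request cost low: the handler only looks up and returns text.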

Permanent URLs and URL Rewrite

Because of the way URL rewrite is implemented in this library, the main URL displayed when viewing any item is also the item's PURL. This PURL is used to reference the item both internally and externally by users, and it is also the URL used by search engines. This approach pushes more traffic directly to the item while allowing for simpler URLs, and users do not need to look for the PURL within the citation to reference the resource correctly. It also means that no forwarding is required for any user, whether human or robot.

Link Advertisement

Several approaches are taken to advertise the links to prospective search engine indexers.

On each item aggregation home page, there is a link which provides a list of all items within the aggregation. This link is identical to the standard ALL ITEMS browse link. Just as this aids in human discovery of the resources within an aggregation, it also works for robot discovery.

Two RSS feeds are provided for each aggregation within this library. One RSS feed generally lists the last twenty items added to the aggregation. The other (particularly useful for indexing) lists every item within the aggregation. In addition, a particularly unwieldy RSS feed is provided with links to every item within this library.
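The difference between the two per-aggregation feeds can be shown with one function and an optional limit. This is a minimal sketch using Python's standard library; the tuple layout, the function name, and the channel description text are assumptions, and the library's real feeds may carry more fields.

```python
import xml.etree.ElementTree as ET

RECENT_LIMIT = 20  # "last twenty items", as stated in the text


def build_feed(title, link, items, limit=None):
    """Build an RSS 2.0 document from (title, url, added) tuples.

    `added` is any sortable value, for example an ISO date string.
    With limit=None every item is listed.
    """
    ordered = sorted(items, key=lambda item: item[2], reverse=True)
    if limit is not None:
        ordered = ordered[:limit]
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Items in " + title
    for item_title, item_url, _added in ordered:
        entry = ET.SubElement(channel, "item")
        ET.SubElement(entry, "title").text = item_title
        ET.SubElement(entry, "link").text = item_url
    return ET.tostring(rss, encoding="unicode")
```

Calling build_feed with limit=RECENT_LIMIT gives the "recent additions" feed, while leaving the limit out gives the complete feed that is more useful to indexers.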

Sitemaps are also generated to list all the aggregation home pages and to link to each individual item within the digital library. To keep each sitemap reasonably small, thirty thousand links are provided per sitemap, resulting in ten sitemaps for the resources alone. The date each item was last modified is included in the sitemap to make incremental updates simpler for the indexing robots. An additional sitemap is provided for the collection home pages and static web content pages.
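The splitting described above can be sketched as follows. The XML element names and namespace come from the public sitemap protocol; the function names and the (url, last_modified) tuple layout are assumptions for this example, not the library's own generator.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
LINKS_PER_SITEMAP = 30000  # figure quoted in the text


def chunk(entries, size):
    """Yield consecutive slices of at most `size` entries."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


def build_sitemap(entries):
    """Serialize (url, last_modified) pairs; dates are YYYY-MM-DD strings."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        # lastmod lets a robot skip items unchanged since its last visit.
        ET.SubElement(node, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


def build_all_sitemaps(entries, size=LINKS_PER_SITEMAP):
    """Return one sitemap document per chunk of entries."""
    return [build_sitemap(part) for part in chunk(entries, size)]
```

With thirty thousand links per file, and assuming every file but the last is full, ten sitemaps correspond to somewhere between 270,001 and 300,000 item links.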

While the sitemaps are registered individually with several of the major search engines, they are also listed in the robots.txt file for this site. Not all search engines support this option, but it is increasingly used by the major search sites, and most robots.txt readers are prepared to skip unrecognized instructions.
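A robots.txt file that lists sitemaps simply adds one "Sitemap:" line per sitemap URL. The sketch below generates such a file and reads the entries back with Python's standard-library parser. The permissive "Disallow:" rule and the function names are assumptions for illustration; this site's actual robots.txt may contain other rules.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(sitemap_urls):
    """Return robots.txt text that allows all paths and lists each sitemap."""
    lines = ["User-agent: *", "Disallow:"]
    lines.extend("Sitemap: " + url for url in sitemap_urls)
    return "\n".join(lines) + "\n"


def listed_sitemaps(robots_txt):
    """Read the Sitemap entries back with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []
```

A reader that does not understand the Sitemap line can ignore it and still apply the User-agent and Disallow rules, which is why adding these lines is low risk.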

Cache Management

To manage the cache and robot behavior, the following rules are now in place:

Individual items are not built, and thus are not cached, when requested by a robot.
Collection Groups, Collections, Subcollections, and Institutional objects are cached for only one minute for robot requests (rather than the usual fifteen).
Browse results (within an aggregation) are not cached at all.
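The rules above amount to a small lookup from object type to cache lifetime. The sketch below models only robot requests, since the text does not give lifetimes for every object type under human requests; the type labels and the function name are assumptions.

```python
NORMAL_AGGREGATION_MINUTES = 15  # usual lifetime, shown for comparison only
ROBOT_AGGREGATION_MINUTES = 1    # shortened lifetime for robot requests

# Hypothetical labels for the object types named in the rules.
AGGREGATION_TYPES = {"collection group", "collection", "subcollection", "institution"}
UNCACHED_FOR_ROBOTS = {"item", "browse"}


def robot_cache_minutes(object_type):
    """Return the cache lifetime in minutes for a robot request.

    None means the object is not cached at all.
    """
    kind = object_type.lower()
    if kind in UNCACHED_FOR_ROBOTS:
        return None
    if kind in AGGREGATION_TYPES:
        return ROBOT_AGGREGATION_MINUTES
    raise ValueError("unknown object type: " + object_type)
```

Keeping robot-triggered objects out of the cache, or in it only briefly, stops a crawl over many collections from crowding out the objects that human users are actually viewing.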

Usage Statistics

Hits from indexing robots are carefully excluded from the overall usage statistics of this digital library, using the same identification algorithm as the web application. In general, web site managers can expect hits from search engine robots to far outnumber hits from real users. For example, below are the numbers of real-user and robot hits for the last several months.

DATE	REAL USERS	SEARCH ROBOTS
December 2010	2,097,208	30,503,950
January 2011	1,898,726	26,898,312
February 2011	1,718,275	18,602,354
March 2011	2,028,125	17,844,012
April 2011	2,114,984	18,320,589
May 2011	2,443,209	17,003,036
June 2011	2,207,457	19,625,401
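To put the table in perspective, the ratio of robot hits to real-user hits can be computed directly from its figures. The numbers below are copied from the table; the variable and function names are only for this example.

```python
# (month, real users, search robots), copied from the table above.
MONTHLY_HITS = [
    ("December 2010", 2097208, 30503950),
    ("January 2011", 1898726, 26898312),
    ("February 2011", 1718275, 18602354),
    ("March 2011", 2028125, 17844012),
    ("April 2011", 2114984, 18320589),
    ("May 2011", 2443209, 17003036),
    ("June 2011", 2207457, 19625401),
]


def robot_ratio(real_users, robot_hits):
    """Return how many robot hits occurred per real-user hit."""
    return robot_hits / real_users


for month, real_users, robot_hits in MONTHLY_HITS:
    ratio = robot_ratio(real_users, robot_hits)
    print("%-14s %5.1f robot hits per real-user hit" % (month, ratio))
```

Over these seven months the ratio ranges from roughly 7 to about 14.5, which is why leaving robot traffic in the totals would badly distort the usage statistics.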

More Information

Want to drill down into some of these topics in more detail? Here are some related child pages:

Robot Identification Algorithm