Harvesting & Record Feeds

Automatic Discovery and Harvesting

Automated discovery and harvesting of records or data from this digital library is available via several means.

Harvesting through OAI-PMH is supported for each of the main aggregations within the library.
All searches and browses can be read as basic XML, which provides the pointers to all digital resources within the result set.
The metadata for each item is publicly available by reading the METS file, which can be found through the citation tab of each item.
A simple JSON interface is provided for pulling information about the individual images and text for each digital resource.
A MARCXML feed of records is available for each collection group.

Harvesting through OAI-PMH

The URL below provides access to the metadata for each item in this library:

http://sobekrepository.org/sobekcm_oai.aspx?verb=Identify

To query the OAI-PMH interface, use the standard OAI-PMH verbs. Note, however, that items in this digital library can only be listed within a set: you cannot list the records or the identifiers for the library as a whole.
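
As a minimal sketch of how such requests might be formed, the Python below builds OAI-PMH URLs against the endpoint given above. The verbs and the oai_dc metadata prefix come from the OAI-PMH standard; the helper name oai_url and the set name "asia1" are assumptions made for illustration, so check the ListSets response for the set names this library actually exposes.

```python
from urllib.parse import urlencode

# Endpoint taken from the Identify URL above.
OAI_BASE = "http://sobekrepository.org/sobekcm_oai.aspx"


def oai_url(verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(params)
    return OAI_BASE + "?" + urlencode(query)


# Ask for the available sets first, since records can only be listed per set.
list_sets = oai_url("ListSets")

# "asia1" is an assumed set name, used only for illustration.
list_records = oai_url("ListRecords", metadataPrefix="oai_dc", set="asia1")

print(list_sets)
print(list_records)
```

The same helper can build a ListIdentifiers request by changing the verb and keeping the set argument.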

If you browse the OAI-PMH interface in a web browser, an XSLT stylesheet gives the responses a clean display. Be aware, however, that the XSLT sometimes appears to fail, leaving a blank screen in the browser. If that happens, view the page source to see the Dublin Core XML.

Searches and Browses as XML

This system is built modularly. A data layer performs searches, browses, and other actions, and the result of each action is passed to a main writer. The most common writer is the html_writer, but searches and browses may also be passed to the XML_writer. To switch to the XML writer, follow the steps below:

From the main interface, navigate to the aggregation of interest and perform either a browse or a search.
Note the URL, then add "xml" directly after the base URL.

For example, the standard URL for a browse of all material in the asia1 aggregation is http://sobekrepository.org/asia1/all. To switch this to the XML view, insert "xml" after the sobekrepository.org portion of the URL (e.g., http://sobekrepository.org/xml/asia1/all). It is best to be logged out when doing this.

Changing a search works the same way. The URL to search the JUV aggregation for any appearance of the word "kittens" is http://sobekrepository.org/juv/results/?t=kittens. Adding "xml" after the base portion of the URL produces an XML version (e.g., http://sobekrepository.org/xml/juv/results/?t=kittens).
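
The URL rewrite described above can be sketched in a few lines of Python. The function name with_writer is hypothetical; it simply places the writer code as the first path segment and leaves the query string untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def with_writer(url, writer="xml"):
    """Insert a writer code (e.g. "xml") as the first path segment of a URL."""
    parts = urlsplit(url)
    path = parts.path if parts.path.startswith("/") else "/" + parts.path
    new_path = "/" + writer + path
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))


print(with_writer("http://sobekrepository.org/asia1/all"))
print(with_writer("http://sobekrepository.org/juv/results/?t=kittens"))
```

Run against the two example URLs from the text, this reproduces the XML URLs given there.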

The BibID and VID values in the XML point directly to the online item. The URL for an item is the base URL, followed by the BibID, and then the VID (or volume ID). For example, if the XML for an item looks like the example below, the direct URL for the item is http://sobekrepository.org/UF00078891/00001:

    <SobekCM_Item>
      <BibID>UF00078891</BibID> 
      <VID>00001</VID> 
      <Title /> 
      <CreateDate>2011-02-06T15:02:58.3741409-05:00</CreateDate> 
      <Resource_Link>http://ufdcimages.uflib.ufl.edu/UF\00\07\88\91/00001</Resource_Link> 
    </SobekCM_Item>
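
A rough sketch of turning such XML into item URLs with Python's standard library follows. It assumes the SobekCM_Item elements carry no XML namespace and sit inside some wrapping root element; the wrapper name "results" in the sample is made up for illustration, as is the function name item_urls.

```python
import xml.etree.ElementTree as ET

BASE_URL = "http://sobekrepository.org"


def item_urls(xml_text):
    """Return the direct item URL for every SobekCM_Item in the XML."""
    root = ET.fromstring(xml_text)
    urls = []
    for item in root.iter("SobekCM_Item"):
        bib_id = item.findtext("BibID")
        vid = item.findtext("VID")
        if bib_id and vid:  # skip incomplete entries
            urls.append(f"{BASE_URL}/{bib_id.strip()}/{vid.strip()}")
    return urls


# The <results> wrapper is an assumed root element, not taken from the real output.
sample = """<results>
  <SobekCM_Item>
    <BibID>UF00078891</BibID>
    <VID>00001</VID>
    <Title />
  </SobekCM_Item>
</results>"""

print(item_urls(sample))
```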

JSON Interface to Images and Raw Text

When an item is discovered through the usual web interface, the item is displayed by the html_writer class. Another useful writer that can be invoked over HTTP is the JSON writer, which serves pointers to the item's JPEG and text files. This format can easily be parsed or read into Java or JavaScript objects.

To view the list of JPEG and text files for a single item, invoke the JSON writer through the URL, as with the XML writer.

To view an item by BibID and VID (possibly obtained by parsing the XML output of a browse or search), the appropriate URL is http://sobekrepository.org/json/[BibID]/[VID]/text. For example, to view the JSON output for UF00078891:00001, use the URL http://sobekrepository.org/json/UF00078891/00001/text.

If you are currently viewing an item, just place "json" directly after the base URL and be sure to set the viewer code at the end to "text". It is best to be logged out when doing this substitution.

The resulting JSON (excerpt shown below) is an array of item_page objects, each with a numeric position index, a string with the URL of the image file, and a string with the URL of the text file.

    [ {"item_page":
             {   "position":1,
                 "image_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00000.jpg",
                 "text_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00000.txt"
             }},
      {"item_page":
             {   "position":2,
                 "image_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00001.jpg",
                 "text_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00001.txt"
             }},
      {"item_page":
             {   "position":3,
                 "image_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00002.jpg",
                 "text_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00002.txt"
             }},
      {"item_page":
             {   "position":4,
                 "image_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00003.jpg",
                 "text_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00003.txt"
              }} ]

This method can be used to pull the actual JPEG images or the actual raw text files for analysis.
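
As an illustration, the sketch below builds the JSON writer URL for an item and reads the page list into Python tuples. The function names json_url and page_files are hypothetical, and the sample input is the first two pages of the excerpt above rather than a live response.

```python
import json

BASE_URL = "http://sobekrepository.org"


def json_url(bib_id, vid):
    """Build the JSON writer URL for one item, using the "text" viewer code."""
    return f"{BASE_URL}/json/{bib_id}/{vid}/text"


def page_files(json_text):
    """Return (position, image_url, text_url) tuples ordered by page position."""
    pages = [entry["item_page"] for entry in json.loads(json_text)]
    pages.sort(key=lambda page: page["position"])
    return [(p["position"], p["image_url"], p["text_url"]) for p in pages]


# First two pages of the excerpt above, used as sample input.
sample = """[
  {"item_page": {"position": 1,
    "image_url": "http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00000.jpg",
    "text_url": "http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00000.txt"}},
  {"item_page": {"position": 2,
    "image_url": "http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00001.jpg",
    "text_url": "http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00001.txt"}}
]"""

print(json_url("UF00078891", "00001"))
for position, image_url, text_url in page_files(sample):
    print(position, image_url, text_url)
```

Each image_url or text_url can then be downloaded with any HTTP client, subject to the harvesting limits described in this document.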

Limitations to Harvesting

There are many legitimate and accepted reasons for harvesting our metadata and resource files. While we make these files freely available, we reserve the right to limit harvesting by IP address or any other means at any time. In addition, we request that queries to our server be limited to no more than one every 100 milliseconds, and we encourage users designing robots to program responsibly.
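
One way a harvester might respect the 100 millisecond spacing is a small throttle that sleeps before each request. The sketch below is an assumption about how a client could be written, not part of this system; the class name Throttle is hypothetical, and the clock and sleep parameters exist only so the behavior can be tested without real delays.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=0.1, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; 0.1 s = 100 milliseconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle()
for url in ["http://sobekrepository.org/xml/asia1/all",
            "http://sobekrepository.org/xml/juv/results/?t=kittens"]:
    throttle.wait()
    # An HTTP request for `url` would be issued here.
    print("ready to fetch", url)
```

A longer interval than the requested minimum is a reasonable default for bulk harvesting.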

Record Feed

The MARCXML feed can be used to add all records to a catalog system. For example, Trove from the National Library of Australia includes all records. The XML can be parsed to select records using specific parameters. NINES and 18thConnect select related records from the Baldwin Library of Historical Children's Literature and the Digital Library of the Caribbean.
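
As a hedged illustration of selecting records by a parameter, the sketch below filters MARCXML records on one datafield and subfield. It assumes the feed uses the standard MARC 21 "slim" XML namespace; the function name select_records and the two sample records are made up for the example and are not taken from the actual feed.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace; assumed to be what the feed uses.
MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}


def select_records(marcxml_text, tag, code, needle):
    """Return records whose datafield `tag`, subfield `code` contains `needle`."""
    root = ET.fromstring(marcxml_text)
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    matches = []
    for record in root.iter(f"{{{MARC_URI}}}record"):
        values = [sf.text or "" for sf in record.findall(path, MARC_NS)]
        if any(needle.lower() in value.lower() for value in values):
            matches.append(record)
    return matches


# Made-up sample records, for illustration only.
sample = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">The three little kittens</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">A history of Florida</subfield>
    </datafield>
  </record>
</collection>"""

# Field 245, subfield a holds the title in MARC 21.
print(len(select_records(sample, "245", "a", "kitten")))
```

Because the feeds are large, a streaming parser such as ET.iterparse may be a better fit than loading a whole feed into memory.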

There are MARCXML record feeds for all items hosted in SobekCM. The feeds are quite large and are available by collection group:

http://sobekrepository.org/AA00025497/

Please contact us if different or updated feeds are needed, or with any questions.