Harvesting & Record Feeds

Automatic Discovery and Harvesting

Automated discovery and harvesting of records or data from this digital library is available via several means.

Harvesting through OAI-PMH

The URL below provides access to the metadata for each item in this library:

http://sobekrepository.org/sobekcm_oai.aspx?verb=Identify

To query the OAI-PMH interface, use the standard OAI-PMH commands. Note, however, that items in this digital library can only be listed within a set. You cannot list the records or the identifiers for the library as a whole.

If you are navigating the OAI-PMH interface from a web browser, the XSLT stylesheet provides a clean display. Be aware, however, that the XSLT sometimes fails, resulting in a blank screen in your browser. If that happens, view the page source to see the Dublin Core XML.

Searches and Browses as XML

This system is built modularly. A data layer performs searches, browses, and other actions, and the result of each action is passed to a main writer. The most common writer is the html_writer. However, searches and browses may also be passed to the XML_writer. To switch to the XML writer, follow the steps below:

For example, the standard URL for a browse of all material in the asia1 aggregation is http://sobekrepository.org/asia1/all. To switch this to the XML view, insert "xml" after the base portion of the URL (e.g., http://sobekrepository.org/xml/asia1/all). It is best to be logged off when doing this.

Changing a search works the same way. The URL to search the JUV aggregation for any appearance of the word "kittens" is http://sobekrepository.org/juv/results/?t=kittens. Adding "xml" after the base portion of the URL results in an XML version (e.g., http://sobekrepository.org/xml/juv/results/?t=kittens).

The BibID and VID values in the XML point directly to the online item. The URL for an item is the base URL, followed by the BibID, followed by the VID (or volume ID). For example, if the XML for an item looks like the following, then the direct URL for the item is http://sobekrepository.org/UF00078891/00001:

    <SobekCM_Item>
      <BibID>UF00078891</BibID>
      <VID>00001</VID>
      <Title />
      <CreateDate>2011-02-06T15:02:58.3741409-05:00</CreateDate>
      <Resource_Link>http://ufdcimages.uflib.ufl.edu/UF\00\07\88\91/00001</Resource_Link>
    </SobekCM_Item>

JSON Interface to Images and Raw Text

When an item is discovered through the usual web interface, the item is displayed by the html_writer class. Another useful writer that can be invoked over HTTP is the JSON writer, which serves pointers to the item's JPEG and text files. This format can be easily parsed or read into Java or JavaScript objects.

To view the list of JPEG and text files for a single item, you must again invoke the JSON writer through the URL. To view an item by BibID and VID (possibly obtained by parsing the XML output of a browse or search), the appropriate URL is http://sobekrepository.org/json/[BibID]/[VID]/text. For example, to view the JSON output for UF00078891:00001, change the URL to http://sobekrepository.org/json/UF00078891/00001/text.
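
The step from XML results to item and JSON URLs can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `item_urls` and the wrapping `Results` element in the sample are hypothetical, and the real XML output may use a different root element or namespaces. Only the `SobekCM_Item`, `BibID`, and `VID` element names and the URL patterns come from the description above.

```python
import xml.etree.ElementTree as ET

# Base URL taken from the examples in this help page.
BASE_URL = "http://sobekrepository.org"


def item_urls(xml_text, base_url=BASE_URL):
    """Return a list of (item_url, json_url) pairs, one per SobekCM_Item."""
    root = ET.fromstring(xml_text)
    urls = []
    # iter() also matches the root itself if it is a SobekCM_Item element.
    for item in root.iter("SobekCM_Item"):
        bib_id = (item.findtext("BibID") or "").strip()
        vid = (item.findtext("VID") or "").strip()
        if not bib_id or not vid:
            continue  # skip incomplete entries rather than build a bad URL
        item_url = f"{base_url}/{bib_id}/{vid}"
        json_url = f"{base_url}/json/{bib_id}/{vid}/text"
        urls.append((item_url, json_url))
    return urls


# Hypothetical wrapper element; the inner item mirrors the example above.
SAMPLE = """<Results>
  <SobekCM_Item>
    <BibID>UF00078891</BibID>
    <VID>00001</VID>
    <Title />
  </SobekCM_Item>
</Results>"""

if __name__ == "__main__":
    for item_url, json_url in item_urls(SAMPLE):
        print(item_url, json_url)
```

Keeping the URL construction in one small function makes it easy to adjust if the base URL or the viewer code at the end of the JSON URL differs on a given installation.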

If you are currently viewing an item, just place "json" in the appropriate spot in the URL and be sure to set the viewer code at the end to "text". It is best to be logged out when doing this substitution.

The resulting JSON (excerpt shown below) is an array of item_page objects, each with a numeric position index, a string for the image file URL, and a string for the text file URL.

    [
      {"item_page": {
        "position":1,
        "image_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00000.jpg",
        "text_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00000.txt"
      }},
      {"item_page": {
        "position":2,
        "image_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00001.jpg",
        "text_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00001.txt"
      }},
      {"item_page": {
        "position":3,
        "image_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00002.jpg",
        "text_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00002.txt"
      }},
      {"item_page": {
        "position":4,
        "image_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00003.jpg",
        "text_url":"http://ufdcimages.uflib.ufl.edu/UF/00/07/88/91/00001/00003.txt"
      }}
    ]

This method can be used to pull the actual JPEG images or the actual raw text files for analysis.

Limitations to Harvesting

There are many legitimate and accepted reasons for harvesting our metadata and resource files. While we make these files freely available, we reserve the right to limit harvesting by IP address or any other means at any time. In addition, we request that queries to our server be limited to no more than one every 100 milliseconds, and we encourage users designing robots to program responsibly.

Record Feed

The MARCXML feed can be used to add all records to a catalog system. For example, Trove from the National Library of Australia includes all records. The XML can be parsed to select records using specific parameters.
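
As an illustration of selecting records from a MARCXML feed, the sketch below keeps only records whose chosen field contains a given word. It assumes the feed uses the standard MARC 21 "slim" XML namespace. The function names `select_records` and `control_number` are hypothetical, and the sample records are invented placeholders, not real entries from this library.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace (assumed to be what the feed uses).
MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}


def select_records(marcxml_text, tag, code, needle):
    """Return record elements whose datafield `tag`, subfield `code`
    contains `needle` (case-insensitive)."""
    root = ET.fromstring(marcxml_text)
    xpath = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    selected = []
    for record in root.iter(f"{{{MARC_URI}}}record"):
        values = [sf.text or "" for sf in record.findall(xpath, MARC_NS)]
        if any(needle.lower() in value.lower() for value in values):
            selected.append(record)
    return selected


def control_number(record):
    """Return the 001 control number of a record, or None if absent."""
    return record.findtext("marc:controlfield[@tag='001']", namespaces=MARC_NS)


# Invented sample data, for illustration only.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">EXAMPLE0001</controlfield>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">A kitten story</subfield>
    </datafield>
  </record>
  <record>
    <controlfield tag="001">EXAMPLE0002</controlfield>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">Caribbean maps</subfield>
    </datafield>
  </record>
</collection>"""

if __name__ == "__main__":
    for rec in select_records(SAMPLE, "245", "a", "kitten"):
        print(control_number(rec))
```

Because the feeds are described as quite large, a streaming parser such as `xml.etree.ElementTree.iterparse` may be a better fit than loading a whole feed into memory as this sketch does.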

NINES and 18thConnect select related records from the Baldwin Library of Historical Children's Literature and the Digital Library of the Caribbean. There are MARCXML record feeds for all items hosted in SobekCM. The feeds are quite large and are available by collection group. Please contact us if different or updated feeds are needed or if you have any questions.
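
Returning to the JSON interface and the request limit described earlier, the following sketch shows one way to read the page list and space requests at least 100 milliseconds apart. The names `page_files`, `Throttle`, and `fetch_all_text` are hypothetical helpers written for this example, and the UTF-8 decoding of the text files is an assumption. Only the `item_page`, `position`, `image_url`, and `text_url` keys come from the JSON excerpt.

```python
import json
import time
from urllib.request import urlopen

# Minimum spacing between requests, in seconds (the 100 ms limit above).
MIN_INTERVAL = 0.1


def page_files(json_text):
    """Return (position, image_url, text_url) tuples sorted by position."""
    pages = [entry["item_page"] for entry in json.loads(json_text)]
    pages.sort(key=lambda page: page["position"])
    return [(p["position"], p["image_url"], p["text_url"]) for p in pages]


class Throttle:
    """Block as needed so that calls to wait() are spaced min_interval apart."""

    def __init__(self, min_interval=MIN_INTERVAL,
                 clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all_text(json_text, throttle):
    """Download every page's raw text, one throttled request at a time."""
    texts = []
    for position, _image_url, text_url in page_files(json_text):
        throttle.wait()
        with urlopen(text_url) as response:
            # Assumption: text files are UTF-8; undecodable bytes are replaced.
            texts.append((position, response.read().decode("utf-8", errors="replace")))
    return texts
```

A harvester would fetch the JSON for an item once, then pass it to `fetch_all_text` with a single shared `Throttle` so that the spacing applies across all requests to the server, not per item.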