Migrating from ContentDM

Resource Types

If you are migrating single image files from ContentDM to SobekCM, you should be able to use the Spreadsheet importer to easily create the new resources within SobekCM. Then, you need only write a small script to move the images into folders named with the new BibID_VID and drop those into the SobekCM Builder.
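A minimal sketch of such a script, in Python. It assumes you already have a list pairing each image filename with the BibID and VID assigned to it (for example, taken from the spreadsheet used for the import). The function name `stage_images` and its parameters are hypothetical and not part of SobekCM.

```python
import shutil
from pathlib import Path


def stage_images(mapping, source_dir, builder_inbound):
    """Move each image into a folder named '<BibID>_<VID>'.

    mapping: iterable of (image_filename, bib_id, vid) tuples (assumed input).
    source_dir: folder holding the images copied out of ContentDM.
    builder_inbound: folder that the SobekCM Builder picks packages up from.
    Returns the list of destination paths.
    """
    source_dir = Path(source_dir)
    builder_inbound = Path(builder_inbound)
    moved = []
    for filename, bib_id, vid in mapping:
        # One folder per resource, named BibID_VID
        package_dir = builder_inbound / f"{bib_id}_{vid}"
        package_dir.mkdir(parents=True, exist_ok=True)
        destination = package_dir / filename
        shutil.move(str(source_dir / filename), str(destination))
        moved.append(destination)
    return moved
```

Running it against a scratch copy first is a cheap way to confirm the folder names come out as expected before anything is dropped into the Builder.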

This gets somewhat more complicated when working with complex multi-page documents and supplementary materials.

Migration Notes

What follows are notes regarding a migration from ContentDM to SobekCM in October of 2013. These notes are posted in the hope that they will make future migrations simpler. Since I have never had a collection in ContentDM myself, there is a very good chance that a better way exists. If you know of anything that would make this process easier, please do not hesitate to contact me at Mark.V.Sullivan at Gmail.com.

Preparing the collection folders for import

I received an exact copy of the collection folders from ContentDM. If you are not already working from a copy of your collection folders, make a copy now.
Step into each collection folder and delete all the subfolders EXCEPT image and supp.
Delete all the small "icon" JPEG images.
Pull all the TIFF, XML, and CPD files out of the image subfolder and into the root collection folder.
Use Adobe Photoshop batching with actions to convert any remaining JPEG images to TIFF (to allow the SobekCM Builder to create its own derivatives).
Once this work is complete and confirmed, move the new TIFFs into the collection folder and delete the remaining image subfolder.
I then moved all the prepped source folders into a new folder for the processing described below.

All of the work listed above was done by hand, although much of it could easily be automated with simple scripts.
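As an illustration of that kind of automation, here is a rough Python sketch of the folder cleanup for one collection folder. It is not what was used in the migration. It assumes the "icon" images can be recognized by a filename prefix (`icon_prefix` is a hypothetical parameter; check your own export), and it leaves the JPEG-to-TIFF conversion as a manual step by returning the JPEGs that still need converting. Run it only on a copy.

```python
import shutil
from pathlib import Path

KEEP_SUBFOLDERS = {"image", "supp"}
ROOT_EXTENSIONS = {".tif", ".tiff", ".xml", ".cpd"}
JPEG_EXTENSIONS = {".jpg", ".jpeg"}


def prep_collection_folder(collection_dir, icon_prefix="icon"):
    """Clean one collection folder; return JPEGs still awaiting TIFF conversion."""
    collection_dir = Path(collection_dir)

    # Delete every subfolder except image and supp
    for child in sorted(collection_dir.iterdir()):
        if child.is_dir() and child.name.lower() not in KEEP_SUBFOLDERS:
            shutil.rmtree(child)

    image_dir = collection_dir / "image"
    if not image_dir.is_dir():
        return []

    leftover_jpegs = []
    for item in sorted(image_dir.iterdir()):
        if not item.is_file():
            continue
        suffix = item.suffix.lower()
        if suffix in JPEG_EXTENSIONS:
            if item.name.lower().startswith(icon_prefix):
                item.unlink()  # small "icon" image (assumed naming)
            else:
                leftover_jpegs.append(item)  # convert to TIFF separately
        elif suffix in ROOT_EXTENSIONS:
            # Pull TIFF, XML, and CPD files up into the collection root
            shutil.move(str(item), str(collection_dir / item.name))
    return leftover_jpegs
```

The image subfolder is deliberately left in place, since it should only be deleted once the converted TIFFs have been checked and moved.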

Preparing the collection-level metadata for processing

I received collection-level metadata output from ContentDM, which included the text of the pages and links to each of the files associated with each object. This metadata was exported using the Exporting to Tab-delimited Text Files method. One of the most interesting things about this format is that each individual page for a complex, multi-page object is listed AND the multi-page complex object is also referenced, in the same file.
Convert each individual collection output (txt) into an Excel file to make it easier to work with.
Add a new column at the beginning of each Excel file containing the new SobekCM collection code.
Combine all of the separate collection-level spreadsheets into a single spreadsheet so that everything can be processed at the same time.
Add a new column at position zero named ID and fill it with a series starting at 1 (1, 2, 3, and so on).
For rows that are multiple issues of the same title (as in a newspaper or periodical), set the ID to be identical.
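The spreadsheet steps above were done in Excel, but the same combination could be sketched in Python directly from the tab-delimited exports. Everything named here is an assumption for illustration: the `Title` and `Collection` column names, the UTF-8 encoding, and the `serial_titles` set, which stands in for the manual decision about which titles are serials.

```python
import csv


def combine_exports(exports, title_column="Title", serial_titles=()):
    """Merge tab-delimited exports into one list of rows.

    exports: dict mapping a SobekCM collection code to the path of that
             collection's tab-delimited export (assumed layout).
    serial_titles: titles whose rows are issues of one serial and so
                   should share a single ID.
    Each returned row starts with an ID column and a Collection column.
    """
    combined = []
    next_id = 1
    serial_ids = {}
    for code, path in exports.items():
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                title = row.get(title_column, "")
                if title in serial_titles:
                    # Issues of the same serial reuse the first ID assigned
                    if title not in serial_ids:
                        serial_ids[title] = next_id
                        next_id += 1
                    row_id = serial_ids[title]
                else:
                    row_id = next_id
                    next_id += 1
                combined.append({"ID": row_id, "Collection": code, **row})
    return combined
```

The result can be written back out with `csv.DictWriter` and opened in Excel for review before any further processing.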

Process the files and metadata

Using the code included here, check that all the files exist (see Verify_Resource_Files_Exist() within the code).
Create text files from the text in the spreadsheet (see Add_Text_Files() within the code).
Step through and process any referenced CPD files. Move each CPD file and all related images and text into their own subfolder for processing as a single item (see Process_CPD_Files() within the code). Note: this means that the next time we go through the spreadsheet, a row that references a page within a CPD file will point to a file that is no longer in its original location. This is why we checked that all files existed at the beginning of this process.
Finally, build the complete METS packages from the spreadsheet, CPD folders, and loose files (see Create_SobekCM_METS() within the code below).
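To make the ordering of these steps concrete, here is a small Python sketch of the first and third steps. It is not the ContentDM_Importer code, and it rests on two assumptions that should be checked against your own data: that a CPD file is XML listing each page in a `pagefile` element, and that the images and text files for a page share that page's filename stem (which also covers images converted from JPEG to TIFF). The function names are hypothetical.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def find_missing_files(filenames, source_dir):
    """Return the referenced filenames that do not exist in source_dir."""
    source_dir = Path(source_dir)
    return [name for name in filenames if not (source_dir / name).is_file()]


def gather_cpd_item(cpd_path):
    """Move a CPD file and its page files into a subfolder named after it."""
    cpd_path = Path(cpd_path)
    folder = cpd_path.parent
    item_dir = folder / cpd_path.stem
    item_dir.mkdir(exist_ok=True)

    # Collect the stems of every page listed in the CPD (assumed structure)
    root = ET.parse(cpd_path).getroot()
    page_stems = {
        Path(el.text.strip()).stem
        for el in root.iter("pagefile")
        if el.text and el.text.strip()
    }

    moved = []
    for candidate in sorted(folder.iterdir()):
        if not candidate.is_file() or candidate.suffix.lower() == ".cpd":
            continue
        if candidate.stem in page_stems:
            shutil.move(str(candidate), str(item_dir / candidate.name))
            moved.append(candidate.name)

    # Move the CPD itself last so the whole item travels together
    shutil.move(str(cpd_path), str(item_dir / cpd_path.name))
    return item_dir, moved
```

Calling `find_missing_files` before `gather_cpd_item` mirrors the order described above: once pages have been moved into item subfolders, a later existence check against the original folder would report them as missing.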

C# Code

The code below essentially follows the steps listed above for the final processing of the metadata and images.


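 // Read the prepared Excel spreadsheet into a DataTable.
 // (Needs the System and System.Data namespaces, plus the SobekCM classes described below.)
 // A single variable keeps the workbook that is read and the workbook whose
 // sheets are listed in sync; adjust the name if your workbook is saved as .xlsx.
 string excelFile = "Complete.xls";
 ExcelBibliographicReader xlsReader = new ExcelBibliographicReader();
 xlsReader.Filename = excelFile;
 xlsReader.Sheet = xlsReader.GetExcelSheetNames(excelFile)[0];
 DataTable importTbl = xlsReader.Check_Source();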
 
 // Check that all files exist
 ContentDM_Importer contentDm = new ContentDM_Importer(importTbl, @"\\ad.ufl.edu\....\College\source");
 contentDm.Verify_Resource_Files_Exist();
 Console.WriteLine();
 
 // Since the text is in the spreadsheet, write out the text files for
 // indexing within Sobek
 int text_files_written = contentDm.Add_Text_Files();
 Console.WriteLine("Wrote " + text_files_written + " text files");
 Console.WriteLine();
 
 // Process all the CPD files referenced
 int cpd_files_handled = contentDm.Process_CPD_Files();
 Console.WriteLine(cpd_files_handled + " CPD files handled");
 Console.WriteLine();
 
 // Create the METS packages ready for SobekCM
 contentDm.Create_SobekCM_METS(@"\\ad.ufl.edu\....\College\ready\");
 
 Console.WriteLine("COMPLETE");
 Console.ReadLine();

This code uses the classes found in the ZIP file below, as well as the SobekCM_Resource_Object library, which is available in the SobekCM source code from our GitHub site.

Download ContentDM_Importer C# class.

Trademarks

ContentDM is trademarked by OCLC Online Computer Library Center, Inc. and its affiliates.

Photoshop is trademarked by Adobe Systems Incorporated.