SobekCM Builder / Bulk Loader: Process Description
Execution Process
This process runs when the SobekCM Builder / Bulk Loader is launched (usually 4 am):
- Any loader logs more than 10 days old are deleted from the builder and the web server
- All RSS Feeds, item lists, and site maps are recreated
- The MarcXML feed is recreated, unless builder settings suppress it
- Any incoming FDA reports are read and stored in the database
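The log cleanup step above can be sketched as a simple age filter. This is a minimal illustration in Python, not SobekCM's actual code: the function name `stale_logs`, the `(name, modified)` pair representation, the sample file names, and the choice to compare full timestamps rather than calendar dates are all assumptions.

```python
from datetime import datetime, timedelta

def stale_logs(logs, now, max_age_days=10):
    """Return the names of logs last modified more than max_age_days before now.

    logs: iterable of (name, modified) pairs, where modified is a datetime.
    The 10-day default mirrors the retention period described above.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [name for name, modified in logs if modified < cutoff]

# Hypothetical log names and dates, for illustration only.
now = datetime(2024, 1, 20, 4, 0)
logs = [
    ("builder_2024_01_05.log", datetime(2024, 1, 5, 23, 0)),   # older than 10 days
    ("builder_2024_01_15.log", datetime(2024, 1, 15, 23, 0)),  # within 10 days
]
print(stale_logs(logs, now))
```

A log that is exactly 10 days old is kept in this sketch, since the description says "more than 10 days old".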
This process runs continually, pausing 60 seconds between executions:
- Refresh the settings and the item list
- Iterate through any recent loads requiring additional work
- Pre-Process any incoming resource files
- Convert any Word and PowerPoint documents to PDF
- Auto-extract full-text and create thumbnails (dimensions per library-wide settings) for all PDFs
- Auto-extract full-text from HTML files
- Auto-extract full-text from XML files
- Run OCR on any incoming TIFF files which do not have text
- Clean any text files and check the text for Social Security numbers (SSNs)
- Ensure any JPEG files have related thumbnails (dimensions per library-wide settings)
- Create derivatives from any TIFFs as per defined library settings
- Perform any pre-archiving file deletions per library setting (by default *.QC.jpg are deleted here)
- Archive all files
- Perform any post-archiving file deletions per library setting (by default *.tif are deleted here)
- Load the latest METS
- Add thumbnail and aggregation information from the database
- Resource file updates
- Load file attributes for each JPEG2000 and JPEG file
- Ensure all non-image files are linked to the METS file (non-image files defined per library settings)
- Ensure all new image files are linked to the METS
- Try to assign a main thumbnail, if there is none
- If there are no page images associated with this item, determine page count from any PDFs
- Save all updated metadata
- SobekCM service METS
- SobekCM citation METS
- MarcXML
- Save update to the database (determine size as needed here)
- Save update to Solr/Lucene
- Save the static HTML file to the web server and image server
- Mark all additional work for the item as completed
- If the item is born digital, has files, and is currently public, close out the digitization milestones completely
- Move appropriate inbound packages to processing
- Step through each builder source folder listed in the builder incoming folders settings
- Validate and classify packages in process folders
- Incoming packages = any non-delete folder with resources and/or metadata
- Deletes = METS files requesting deletes
- Packages that are invalid or do not validate are moved to failures
- Iterate through all non-delete resources ready for processing
- Pre-Process any incoming resource files
- Convert any Word and PowerPoint documents to PDF
- Auto-extract full-text and create thumbnails (dimensions per library-wide settings) for all PDFs
- Auto-extract full-text from HTML files
- Auto-extract full-text from XML files
- Run OCR on any incoming TIFF files which do not have text
- Clean any text files and check the text for Social Security numbers (SSNs)
- Ensure any JPEG files have related thumbnails (dimensions per library-wide settings)
- Create derivatives from any TIFFs as per defined library settings
- Rename any received METS file (i.e., recd_YYYY_MM_DD.mets.xml)
- Perform any pre-archiving file deletions per library setting (by default *.QC.jpg are deleted here)
- Archive all files
- Perform any post-archiving file deletions per library setting (by default *.tif are deleted here)
- Move all the files to the image server
- Load the latest METS
- Add thumbnail and aggregation information from the database
- Resource file updates
- Load file attributes for each JPEG2000 and JPEG file
- Ensure all non-image files are linked to the METS file (non-image files defined per library settings)
- Ensure all new image files are linked to the METS
- Try to assign a main thumbnail, if there is none
- If there are no page images associated with this item, determine page count from any PDFs
- Save all updated metadata
- SobekCM service METS
- SobekCM citation METS
- MarcXML
- Save update to the database (determine size as needed here)
- Save update to Solr/Lucene
- Save the static HTML file to the web server and image server
- Mark all additional work for the item as completed
- If the item is born digital, has files, and is currently public, close out the digitization milestones completely
- Process all delete requests (iterate through all deletes)
- Move all files into a DELETED folder where they will sit until deleted manually by an admin
- Delete from the database
- Delete from Solr/Lucene
- Publish the log file to the web server
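A few of the steps above lend themselves to small helper functions: classifying a package, renaming a received METS file, and selecting files for the pre- and post-archiving deletions. The following Python sketch is illustrative only and is not taken from the SobekCM code base. All function names are hypothetical, the boolean inputs to `classify_package` stand in for checks whose details are not given here, treating an empty non-delete folder as a failure is an assumption, and so is the case-insensitive pattern matching.

```python
from datetime import date
from fnmatch import fnmatchcase

def classify_package(has_content, requests_delete, is_valid):
    """Classify a package folder as 'incoming', 'delete', or 'failure'.

    has_content: the folder holds resources and/or metadata.
    requests_delete: the package's METS file requests a delete.
    is_valid: the package passed validation.
    """
    if not is_valid:
        return "failure"
    if requests_delete:
        return "delete"
    if has_content:
        return "incoming"
    # Assumption: an empty, non-delete folder is treated as a failure.
    return "failure"

def received_mets_name(received):
    """Build the renamed METS file name in the recd_YYYY_MM_DD.mets.xml form."""
    return f"recd_{received:%Y_%m_%d}.mets.xml"

def matching_files(names, patterns):
    """Return the file names matching any wildcard pattern, ignoring case."""
    return [
        name for name in names
        if any(fnmatchcase(name.lower(), p.lower()) for p in patterns)
    ]

# Hypothetical file names for one item folder.
names = ["00001.tif", "00001.jpg", "00001.QC.jpg", "00001.jp2"]
pre_archive = matching_files(names, ["*.QC.jpg"])   # default pre-archiving pattern
post_archive = matching_files(names, ["*.tif"])     # default post-archiving pattern
print(pre_archive, post_archive)
print(received_mets_name(date(2024, 3, 5)))
print(classify_package(has_content=True, requests_delete=False, is_valid=True))
```

Keeping the deletion patterns as data, rather than hard-coding the extensions, matches the description of these deletions as a per-library setting.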
Every ten minutes, as needed, the following process runs after handling all new incoming resource files:
- Recreate aggregation XML and RSS feeds
- Only build aggregations that were affected by the previous processes
- Recreate library-wide XML and RSS feeds
- Rebuild the all.rss and all.xml files, which contain all files within the library
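The "every ten minutes, as needed" behavior above can be sketched as a small scheduler that remembers which aggregations were touched and rebuilds only those once the interval has elapsed. This is a hedged Python illustration rather than the builder's real logic: the class name `FeedRebuilder`, its methods, the aggregation codes, and the reading of "as needed" as "only when at least one aggregation was affected" are all assumptions.

```python
class FeedRebuilder:
    """Track affected aggregations and rebuild their feeds at most once per interval."""

    def __init__(self, interval_seconds=600):
        self.interval = interval_seconds
        self.last_run = None      # time of the last rebuild, in seconds
        self.affected = set()     # aggregation codes touched since the last rebuild

    def mark_affected(self, aggregation_code):
        self.affected.add(aggregation_code)

    def run_if_due(self, now):
        """Return the list of rebuilt targets, or [] when nothing is rebuilt.

        now: current time in seconds (any monotonically increasing clock).
        """
        if not self.affected:
            return []  # nothing changed, so no rebuild is needed
        if self.last_run is not None and now - self.last_run < self.interval:
            return []  # less than ten minutes since the last rebuild
        # Affected aggregation feeds first, then the library-wide files.
        rebuilt = sorted(self.affected) + ["all.rss", "all.xml"]
        self.affected.clear()
        self.last_run = now
        return rebuilt

# Hypothetical aggregation codes, with times given in seconds.
rebuilder = FeedRebuilder()
rebuilder.mark_affected("maps")
first = rebuilder.run_if_due(0)       # rebuilds "maps" plus the library-wide files
rebuilder.mark_affected("photos")
second = rebuilder.run_if_due(300)    # too soon, nothing rebuilt
third = rebuilder.run_if_due(600)     # interval elapsed, rebuilds "photos"
print(first, second, third)
```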
This process runs at the end of the day before the builder stops all execution (usually around 11 pm):
- Recreate the cached links between aggregations and metadata
- Initiate Solr/Lucene index optimization
- On even days, the document core is optimized
- On odd days, the page core is optimized
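The alternating optimization schedule above amounts to a parity check on the date. The Python sketch below assumes that "even" and "odd" refer to the day of the month, which the description does not state explicitly. The function name `core_to_optimize` and the returned labels are hypothetical.

```python
from datetime import date

def core_to_optimize(day):
    """Pick the Solr/Lucene core to optimize: document on even days, page on odd days.

    Assumption: parity is taken from the day of the month.
    """
    return "document" if day.day % 2 == 0 else "page"

print(core_to_optimize(date(2024, 1, 2)))   # day 2 is even
print(core_to_optimize(date(2024, 1, 31)))  # day 31 is odd
```

Under this day-of-month assumption, a 31-day month is followed by another odd day (the 1st), so the page core would be optimized on two consecutive days at that boundary.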