eprints2bags 1.10.3

Creator: bradpython12

eprints2bags
A program for downloading records from an EPrints server and creating BagIt packages out of them.
Authors: Michael Hucka, Betsy Coles
Repository: https://github.com/caltechlibrary/eprints2bags
License: BSD/MIT derivative – see the LICENSE file for more information




Table of Contents

Introduction
Installation instructions
Using eprints2bags
Getting help and support
Do you like it?
Contributing — info for developers
History
Acknowledgments
Copyright and license

☀ Introduction
Materials in EPrints must be extracted before they can be moved to a long-term preservation system or dark archive. Eprints2bags is a self-contained program that encapsulates the processes needed to download records and documents from EPrints, bundle up individual records in BagIt packages, and create single-file archives (e.g., in ZIP format) of each bag. The program is written in Python 3 and works over a network using an EPrints server's REST API.
✺ Installation instructions
The instructions below assume you have a Python interpreter installed on your computer; if that's not the case, please first install Python and familiarize yourself with running Python programs on your system.
On Linux, macOS, and Windows operating systems, you should be able to install eprints2bags with pip. If you don't have the pip package or are uncertain if you do, first run the following command in a terminal command line interpreter:
sudo python3 -m ensurepip

Then, to install eprints2bags from the Python package repository, run the following command:
python3 -m pip install eprints2bags --user --upgrade

As an alternative to getting it from PyPI, you can instruct pip to install eprints2bags directly from the GitHub repository:
python3 -m pip install git+https://github.com/caltechlibrary/eprints2bags.git --user --upgrade

On Linux and macOS systems, assuming that the installation proceeds normally, you should end up with a program called eprints2bags in a location normally searched by your terminal shell for commands.
▶︎ Using Eprints2bags
For help with usage at any time, run eprints2bags with the option -h (or /h on Windows).
eprints2bags contacts an EPrints REST server whose network API is accessible at the URL given by the command-line option -a (or /a on Windows). A typical EPrints server URL has the form https://somename.yourinstitution.edu/rest. This program will automatically add /eprint to the URL path given, so omit that part of the URL in the value given to -a. The -a (or /a) option is required; the program cannot infer the server address on its own.
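As a rough illustration of how the value of -a relates to the address of an individual record, the sketch below builds a per-record URL. The function name record_url and the ".xml" suffix are assumptions made for this example; only the automatic addition of /eprint comes from the description above.

```python
def record_url(api_url, record_id):
    """Build the URL of one record from the value given to -a (sketch)."""
    base = api_url.rstrip("/")  # tolerate a trailing slash in the -a value
    # The README states that /eprint is added automatically; the
    # "<number>.xml" suffix is an assumption about the server's REST layout.
    return f"{base}/eprint/{record_id}.xml"


print(record_url("https://somename.yourinstitution.edu/rest", 85447))
```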
Specifying which records to get
The EPrints records to be written will be limited to the list of EPrints numbers found in the file given by the option -i (or /i on Windows). If no -i option is given, this program will download all the contents available at the given EPrints server. The value of -i can also be one or more integers separated by commas (e.g., -i 54602,54604), or a range of numbers separated by a dash (e.g., -i 1-100, which is interpreted as the list of numbers 1, 2, ..., 100 inclusive), or some combination thereof. In those cases, the records written will be limited to those numbered.
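The accepted forms of the -i value (single numbers, comma-separated lists, dash ranges, and combinations) can be sketched as a small parser. This is a minimal sketch, not the code eprints2bags itself uses; the name parse_id_list is hypothetical, and the file-name form of -i is not handled here.

```python
from typing import List


def parse_id_list(spec: str) -> List[int]:
    """Expand a value such as "54602,54604" or "1-100" into record numbers."""
    ids: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
            if start > end:
                raise ValueError(f"range start exceeds end: {part}")
            ids.extend(range(start, end + 1))  # ranges are inclusive
        else:
            ids.append(int(part))
    return ids


print(parse_id_list("1-3,54602,54604"))  # [1, 2, 3, 54602, 54604]
```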
If the -l option (or /l on Windows) is given, the records will be additionally filtered to return only those whose last-modified date/time stamp is no older than the given date/time description. Valid descriptors are those accepted by the Python dateparser library. Make sure to enclose descriptions within single or double quotes. Examples:
eprints2bags -l "2 weeks ago" -a ....
eprints2bags -l "2014-08-29" -a ....
eprints2bags -l "12 Dec 2014" -a ....
eprints2bags -l "July 4, 2013" -a ....
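The "no older than" comparison can be sketched with the standard library alone. This sketch assumes the cutoff has already been turned into a datetime; the real program relies on dateparser to interpret free-form descriptions such as "2 weeks ago", which is not reproduced here. The record dictionaries and the function name modified_since are hypothetical.

```python
from datetime import datetime


def modified_since(records, cutoff):
    """Keep only records whose last-modified time is at or after the cutoff."""
    return [r for r in records if r["lastmod"] >= cutoff]


# Hypothetical records, for illustration only.
records = [
    {"id": 43, "lastmod": datetime(2014, 8, 1)},
    {"id": 44, "lastmod": datetime(2014, 8, 29)},
    {"id": 45, "lastmod": datetime(2015, 1, 15)},
]
cutoff = datetime.fromisoformat("2014-08-29")
print([r["id"] for r in modified_since(records, cutoff)])  # [44, 45]
```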

If the -s option (or /s on Windows) is given, the records will also be filtered to include only those whose <eprint_status> element value is one of the listed status codes. Comparisons are done in a case-insensitive manner. Putting a caret character (^) in front of the status (or status list) negates the sense, so that eprints2bags will only keep those records whose <eprint_status> value is not among those given. Examples:
eprints2bags -s archive -a ...
eprints2bags -s ^inbox,buffer,deletion -a ...
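The status filter, including the caret negation and case-insensitive comparison, can be sketched as follows. The function names are hypothetical and the code is an illustration of the rule described above, not the program's own implementation.

```python
def parse_status_option(value):
    """Split a -s value into (set of lowercase statuses, negate flag)."""
    negate = value.startswith("^")
    if negate:
        value = value[1:]
    statuses = {s.strip().lower() for s in value.split(",") if s.strip()}
    return statuses, negate


def keep_record(record_status, statuses, negate):
    """Decide whether a record with the given <eprint_status> is kept."""
    listed = record_status.lower() in statuses
    return not listed if negate else listed


statuses, negate = parse_status_option("^inbox,buffer,deletion")
print(keep_record("archive", statuses, negate))  # True
print(keep_record("Inbox", statuses, negate))    # False
```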

Both lastmod and status filtering are done after the -i argument is processed.
By default, eprints2bags stops execution if an error occurs when requesting a record from the EPrints server. Common causes of errors include missing records implied by the arguments to -i, missing files associated with a given record, and files inaccessible due to permissions errors. If the option -k (or /k on Windows) is given, eprints2bags will attempt to keep going upon encountering missing records, or missing files within records, or similar errors. Option -k is particularly useful when giving a range of numbers with the -i option, as it is common for EPrints records to be updated or deleted and gaps to be left in the numbering. (Running without -i will skip over gaps in the numbering because the available record numbers will be obtained directly from the server, which is unlike the user providing a list of record numbers that may or may not exist on the server. However, even without -i, errors may still result from permissions errors or other causes.)
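The difference between the default behavior and -k can be sketched as a loop that either re-raises the first error or records the failure and continues. Everything here is hypothetical: fetch_all, fake_fetch, and the use of LookupError to stand for a missing record are illustration choices, not details of eprints2bags.

```python
def fetch_all(record_ids, fetch, keep_going=False):
    """Fetch each record; stop at the first failure unless keep_going is set."""
    fetched, skipped = [], []
    for record_id in record_ids:
        try:
            fetched.append(fetch(record_id))
        except LookupError:
            if not keep_going:
                raise                  # default: the first error ends the run
            skipped.append(record_id)  # with -k: note the gap and continue
    return fetched, skipped


def fake_fetch(record_id):
    """Stand-in for a network request; record 2 does not exist."""
    available = {1: "record 1", 3: "record 3"}
    return available[record_id]        # raises KeyError, a LookupError subclass


print(fetch_all([1, 2, 3], fake_fetch, keep_going=True))
```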
Specifying what to do with the records
This program writes its output in subdirectories under the directory given by the command-line option -o (or /o on Windows). If the directory does not exist, this program will create it. If no -o is given, the current directory where eprints2bags is running is used. Whatever the destination is, eprints2bags will create subdirectories in the destination, with each subdirectory named according to the EPrints record number (e.g., /path/to/output/43, /path/to/output/44, /path/to/output/45, ...). If the -n option (/n on Windows) is given, the subdirectory names are changed to have the form NAME-NUMBER, where NAME is the text string provided to the -n option and NUMBER is the EPrints number for a given entry (meaning, /path/to/output/NAME-43, /path/to/output/NAME-44, /path/to/output/NAME-45, ...).
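The naming rule for record subdirectories can be sketched in a few lines. The function name record_dir is hypothetical; the resulting paths match the examples given above.

```python
from pathlib import PurePosixPath


def record_dir(output_dir, number, name_base=None):
    """Return the subdirectory for one record, with an optional -n prefix."""
    name = f"{name_base}-{number}" if name_base else str(number)
    return PurePosixPath(output_dir) / name


print(record_dir("/path/to/output", 43))          # /path/to/output/43
print(record_dir("/path/to/output", 43, "NAME"))  # /path/to/output/NAME-43
```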
Each directory will contain an EPrints XML file and additional document file(s) associated with the EPrints record in question. Documents associated with each record will be fetched over the network. The list of documents for each record is determined from the XML file, in the <documents> element. Certain EPrints internal documents such as indexcodes.txt and preview images are ignored.
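Extracting document URLs from the <documents> element can be sketched with the standard library's XML parser. The sample XML is a simplified, hypothetical record; the namespace URI and the <document>/<files>/<file>/<url> nesting are assumptions about the EPrints export format, and only indexcodes.txt is filtered here (preview images are not detected).

```python
import xml.etree.ElementTree as ET

EP_NS = "http://eprints.org/ep2/data/2.0"  # assumed EPrints XML namespace
IGNORED_FILES = {"indexcodes.txt"}

SAMPLE = f"""
<eprints xmlns="{EP_NS}">
  <eprint>
    <eprintid>43</eprintid>
    <documents>
      <document><files><file>
        <filename>paper.pdf</filename>
        <url>https://example.org/43/1/paper.pdf</url>
      </file></files></document>
      <document><files><file>
        <filename>indexcodes.txt</filename>
        <url>https://example.org/43/2/indexcodes.txt</url>
      </file></files></document>
    </documents>
  </eprint>
</eprints>
"""


def document_urls(xml_text):
    """Return URLs of files listed under <documents>, minus ignored names."""
    root = ET.fromstring(xml_text.strip())
    urls = []
    for documents in root.iter(f"{{{EP_NS}}}documents"):
        for file_el in documents.iter(f"{{{EP_NS}}}file"):
            filename = file_el.findtext(f"{{{EP_NS}}}filename", default="")
            url = file_el.findtext(f"{{{EP_NS}}}url", default="")
            if url and filename not in IGNORED_FILES:
                urls.append(url)
    return urls


print(document_urls(SAMPLE))  # ['https://example.org/43/1/paper.pdf']
```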
By default, each record and associated files downloaded from EPrints will be placed in a directory structure that follows the BagIt specification, and this bag will then be put into its own single-file archive. The default archive file format is ZIP with compression turned off (see next paragraph). Option -b (/b on Windows) can be used to change this behavior. This option takes a keyword value; possible values are none, bag and bag-and-archive, with the last being the default. Value none will cause eprints2bags to leave the downloaded record content in individual directories without bagging or archiving, and value bag will cause eprints2bags to create BagIt bags but not single-file archives from the results. Everything will be left in the output directory (the location given by the -o or /o option). Note that creating bags is a destructive operation: it replaces the individual directories of each record with a restructured directory corresponding to the BagIt format.
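To make the "destructive operation" concrete, the sketch below restructures a directory in place into a bare-bones BagIt-style layout: the original files move into data/, and a declaration file and a SHA-256 manifest are written next to it. This is a hand-rolled illustration using only the standard library; eprints2bags uses the bagit library, which also writes bag-info.txt and tag manifests that this sketch omits. The function name make_minimal_bag is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def make_minimal_bag(directory):
    """Restructure `directory` in place into a minimal BagIt-style layout."""
    directory = Path(directory)
    payload = list(directory.iterdir())    # everything present before bagging
    data_dir = directory / "data"
    data_dir.mkdir()                       # assumes no existing entry named "data"
    for item in payload:
        item.rename(data_dir / item.name)  # the original layout is not kept

    (directory / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(directory).as_posix()}\n")
    (directory / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    record = Path(tmp) / "43"
    record.mkdir()
    (record / "43.xml").write_text("<eprints/>", encoding="utf-8")
    make_minimal_bag(record)
    print(sorted(p.relative_to(record).as_posix() for p in record.rglob("*")))
```

After the call, the record's XML file is found only under data/, which is why the -b none setting exists for users who want the plain directories left untouched.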
The type of archive made when bag-and-archive mode is used for the -b option can be changed using the option -t (or /t on Windows). The possible values are: compressed-zip, uncompressed-zip, compressed-tar, and uncompressed-tar. As mentioned above, the default is uncompressed-zip (used if no -t option is given). ZIP is the default because it is more widely recognized and supported than tar format, and uncompressed ZIP is used because file corruption is generally more damaging to a compressed archive than an uncompressed one. Since the main use case for eprints2bags is to archive contents for long-term storage, avoiding compression seems safer.
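The four archive types map naturally onto Python's zipfile and tarfile modules, as the following sketch shows. It is an illustration under assumptions: the function name archive_directory and the file extensions chosen are hypothetical, and eprints2bags may build its archives differently.

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path

ARCHIVE_TYPES = {"uncompressed-zip", "compressed-zip",
                 "uncompressed-tar", "compressed-tar"}


def archive_directory(directory, archive_type="uncompressed-zip"):
    """Archive `directory` next to itself and return the archive's path."""
    if archive_type not in ARCHIVE_TYPES:
        raise ValueError(f"unknown archive type: {archive_type}")
    directory = Path(directory)
    parent = directory.parent
    compressed = archive_type.startswith("compressed")
    if archive_type.endswith("zip"):
        dest = parent / f"{directory.name}.zip"
        method = zipfile.ZIP_DEFLATED if compressed else zipfile.ZIP_STORED
        with zipfile.ZipFile(dest, "w", compression=method) as zf:
            for path in sorted(directory.rglob("*")):
                zf.write(path, path.relative_to(parent).as_posix())
    else:
        dest = parent / (f"{directory.name}.tar.gz" if compressed
                         else f"{directory.name}.tar")
        with tarfile.open(dest, "w:gz" if compressed else "w") as tf:
            tf.add(directory, arcname=directory.name)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "43"
    bag.mkdir()
    (bag / "bagit.txt").write_text("BagIt-Version: 1.0\n", encoding="utf-8")
    print(archive_directory(bag).name)                    # 43.zip
    print(archive_directory(bag, "compressed-tar").name)  # 43.tar.gz
```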
The ZIP archive file will be written with a text comment describing the contents of the archive. This comment can be viewed by ZIP utilities (e.g., using zipinfo -z on Unix/Linux and macOS). The following is an example of a comment and the information it contains:
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
About this archive file:

This is an archive of a file directory organized in BagIt v1.0 format.
The bag contains the content from the EPrints record located at
http://resolver.caltech.edu/CaltechAUTHORS:SHIjfm98

The software used to create this archive file was:
eprints2bags version 1.3.1 <https://github.com/caltechlibrary/eprints2bags>

The following is the metadata contained in bag-info.txt:
Bag-Software-Agent: bagit.py v1.7.0 <https://github.com/LibraryOfCongress/bagit-python>
Bagging-Date: 2018-12-13
External-Description: Archive of EPrints record and document files
External-Identifier: http://resolver.caltech.edu/CaltechAUTHORS:SHIjfm98
Internal-Sender-Identifier: https://authors.library.caltech.edu/id/eprint/355
Payload-Oxum: 4646541.2
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Archive comments are a feature of the ZIP file format and not available with tar.
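Besides zipinfo -z, the archive comment can be read programmatically. The sketch below writes a small in-memory ZIP with a comment and reads it back using Python's zipfile module; the comment text and member name are placeholders, not output from eprints2bags.

```python
import io
import zipfile


def write_zip_with_comment(buffer, comment_text):
    """Create a tiny ZIP archive in `buffer` carrying a text comment."""
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("43/bagit.txt", "BagIt-Version: 1.0\n")
        zf.comment = comment_text.encode("utf-8")  # comments are stored as bytes


def read_zip_comment(buffer):
    """Return the archive-level comment as text."""
    with zipfile.ZipFile(buffer) as zf:
        return zf.comment.decode("utf-8")


buffer = io.BytesIO()
write_zip_with_comment(buffer, "About this archive file:\n(placeholder text)")
print(read_zip_comment(buffer))
```

The ZIP format limits the comment to 65,535 bytes, so a program writing long descriptions would need to keep the text within that bound.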
Finally, the overall collection of EPrints records (whether the records are bagged and archived, or just bagged, or left as-is) can optionally be itself put into a bag and/or put in a ZIP archive. This behavior can be changed with the option -e (/e on Windows). Like -b, this option takes the possible values none, bag, and bag-and-archive. The default is none. If the value bag is used, a top-level bag containing the individual EPrints bags is created out of the output directory (the location given by the -o option); if the value bag-and-archive is used, the bag is also put into a single-file archive. (In other words, the result will be a ZIP archive of a bag whose data directory contains other ZIP archives of bags.) For safety, eprints2bags will refuse to do bag or bag-and-archive unless a separate output directory is given via the -o option; otherwise, this would restructure the current directory where eprints2bags is running – with potentially unexpected or even catastrophic results. (Imagine if the current directory were the user's home directory!)
Generating checksum values can be a time-consuming operation for large bags. By default, during the bagging step, eprints2bags will use a number of processes equal to one-half of the available CPUs on the computer. The number of processes can be changed using the option -c (or /c on Windows).
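The default of one-half of the available CPUs can be sketched as below. How eprints2bags rounds, and what it does on a single-CPU machine, are assumptions in this sketch.

```python
import os


def default_process_count():
    """Half of the available CPUs, but never fewer than one process."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cpus // 2)


print(default_process_count())
```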
The use of separate options for the different stages provides some flexibility in choosing the final output. For example,
eprints2bags --bag-action none --end-action bag-and-archive

will create a ZIP archive containing a single bag directory whose data/ subdirectory contains the set of (unbagged) EPrints records retrieved by eprints2bags from the server.
Server credentials
Downloading documents usually requires supplying a user login and password to the EPrints server. By default, this program uses the operating system's keyring/keychain functionality to get a user name and password. If the information does not exist from a previous run of eprints2bags, it will query the user interactively for the user name and password, and unless the -K argument (/K on Windows) is given, store them in the user's keyring/keychain so that it does not have to ask again in the future. It is also possible to supply the information directly on the command line using the -u and -p options (or /u and /p on Windows), but this is discouraged because it is insecure on multiuser computer systems.
If a given EPrints server does not require a user name and password, do not use -u or -p and leave the name and password blank when prompted for them by eprints2bags. Empty user name and password are allowed values.
To reset the user name and password (e.g., if a mistake was made the last time and the wrong credentials were stored in the keyring/keychain system), add the -R (or /R on Windows) command-line argument to a command. When eprints2bags is run with this option, it will query for the user name and password again even if an entry already exists in the keyring or keychain.
Other options
eprints2bags produces color-coded diagnostic output as it runs, by default. However, some terminals or terminal configurations may make it hard to read the text with colors, so eprints2bags offers the -C option (/C on Windows) to turn off colored output.
If given the -@ argument (/@ on Windows), this program will output a detailed trace of what it is doing, and will also drop into a debugger upon the occurrence of any errors. The debug trace will be written to the given destination, which can be a dash character (-) to indicate console output, or a file path.
If given the -V option (/V on Windows), this program will print the version and other information, and exit without doing anything else.
Basic usage examples
Running eprints2bags then consists of invoking the program like any other program on your system. The following is a simple example showing how to get a single record (#85447) from Caltech's CODA EPrints server (with user name and password blanked out here for security reasons):
# eprints2bags -o /tmp/eprints -i 85447 -a https://authors.library.caltech.edu/rest -u XXXXX -p XXXXX

Beginning to process 1 EPrints entry.
Output will be written under directory "/tmp/eprints"
======================================================================
Getting record with id 85447
Creating /tmp/eprints/85447
Downloading https://authors.library.caltech.edu/85447/1/1-s2.0-S0164121218300517-main.pdf
Making bag out of /tmp/eprints/85447
Creating tarball /tmp/eprints/85447.tgz
======================================================================
Done. Wrote 1 EPrints record to /tmp/eprints/.

[Screencast: a terminal recording that gives a sense of what it is like to run eprints2bags.]



Summary of command-line options
The following table summarizes all the command line options available. (Note: on Windows computers, / must be used as the prefix character instead of -):



| Short | Long form | Meaning | Default | Notes |
|---|---|---|---|---|
| -a A | --api-url A | Use A as the server's REST API URL | | ⚑ |
| -b B | --bag-action B | Do B with each record directory | Bag and archive | ✦ |
| -c C | --processes C | No. of processes during bag creation | ½ the number of CPUs | |
| -e E | --end-action E | Do E with the entire set of records | Nothing | ✦ |
| -h | --help | Print help info and exit | | |
| -i I | --id-list I | Records to get (can be a file name) | Fetch all records from the server | |
| -k | --keep-going | Don't count missing records as an error | Stop on a missing record | |
| -l L | --lastmod L | Filter by last-modified date/time | Don't filter by date/time | |
| -n N | --name-base N | Prefix directory names with N | Use record number only | |
| -o O | --output-dir O | Write outputs in the directory O | Write in the current directory | |
| -q | --quiet | Don't print info messages while working | Be chatty while working | |
| -s S | --status S | Filter by status(es) in S | Don't filter by status | |
| -u U | --user U | User name for EPrints server login | | |
| -p P | --password P | Password for EPrints server login | | |
| -t T | --arch-type T | Use archive type T | Uncompressed ZIP | ♢ |
| -C | --no-color | Don't color-code the output | Use colors in the terminal output | |
| -K | --no-keyring | Don't use a keyring/keychain | Store login info in keyring | |
| -R | --reset | Reset user login & password used | Reuse previous credentials | |
| -V | --version | Print program version info and exit | Do other actions instead | |
| -@ OUT | --debug OUT | Debugging mode; write trace to OUT | Normal mode | ⚐ |




⚑   Required argument.
✦   Possible values: none, bag, bag-and-archive.
♢   Possible values: uncompressed-zip, compressed-zip, uncompressed-tar, compressed-tar.
⚐   To write to the console, use the character - as the value of OUT; otherwise, OUT must be the name of a file where the output should be written.
Additional notes and considerations
Beware that some file systems have limitations on the number of subdirectories that can be created, which directly impacts how many record subdirectories can be created by this program. eprints2bags attempts to guess the type of file system where the output is being written and warn the user if the number of records exceeds known maximums (e.g., 31,998 subdirectories for the ext2 and ext3 file systems in Linux), but its internal table does not include all possible file systems and it may not be able to warn users in all cases. If you encounter file system limitations on the number of subdirectories that can be created, a simple solution is to manually create an intermediate level of subdirectories under the destination given to -o, then run eprints2bags multiple times, each time indicating a different subrange of records to the -i option and a different subdirectory to -o, such that the number of records written to each destination is below the file system's limit on total number of directories.
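The workaround of running eprints2bags several times over subranges can be planned with a short helper. This is a sketch: the function name plan_batches, the batch-START-END directory naming, and the batch size of 30,000 are choices made for this example, with the size picked only to stay below the 31,998 limit mentioned above.

```python
def plan_batches(first_id, last_id, batch_size, output_root):
    """Pair each -i subrange with its own output subdirectory for -o."""
    batches = []
    for start in range(first_id, last_id + 1, batch_size):
        end = min(start + batch_size - 1, last_id)
        batches.append((f"{start}-{end}", f"{output_root}/batch-{start}-{end}"))
    return batches


for id_range, out_dir in plan_batches(1, 70000, 30000, "/path/to/output"):
    # The server URL is elided here, as in the examples elsewhere in this README.
    print(f"eprints2bags -i {id_range} -o {out_dir} -a ...")
```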
For maximum performance, the debug logging code that implements option -@ can be skipped completely at run-time by running Python with optimization turned on. One way to do this is to run eprints2bags using an invocation such as the following:
python -O -m eprints2bags ...other arguments...
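The reason -O can remove logging entirely is that Python sets the built-in constant __debug__ to False under -O and drops code guarded by "if __debug__:" at compile time. The sketch below shows the general pattern; the function trace is hypothetical, and whether eprints2bags guards its logging in exactly this way is an assumption.

```python
import io
import sys


def trace(message, stream=sys.stderr):
    """Write a debug message; the guarded body is compiled out under -O."""
    if __debug__:
        print(f"[debug] {message}", file=stream)


sink = io.StringIO()
trace("fetching record 85447", sink)
# Without -O the message is captured; with "python -O" the sink stays empty.
print(repr(sink.getvalue()))
```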

⁇ Getting help and support
If you find an issue, please submit it in the GitHub issue tracker for this repository.
★ Do you like it?
If you like this software, don't forget to give this repo a star on GitHub to show your support!
♬ Contributing — info for developers
We would be happy to receive your help and participation with enhancing eprints2bags! Please visit the guidelines for contributing for some tips on getting started.
❡ History
In 2018, Betsy Coles wrote a set of Perl scripts and described a workflow for bagging contents from Caltech's EPrints-based Caltech Collection of Open Digital Archives (CODA) server. The original code is still available in this repository in the historical subdirectory. In late 2018, Mike Hucka sought to expand the functionality of the original tools and generalize them in anticipation of having to stop using DPN because on 2018-12-04, DPN announced they were shutting down. Thus was born Eprints2bags.
☺︎ Acknowledgments
The vector artwork of a bag used as a logo for this repository was created by StoneHub from the Noun Project. It is licensed under the Creative Commons CC-BY 3.0 license.
We thank the following people for suggestions and ideas that led to improvements in eprints2bags: Robert Doiel, Tom Morrell, Tommy Keswick.
eprints2bags makes use of numerous open-source packages, without which it would have been effectively impossible to develop eprints2bags with the resources we had. We want to acknowledge this debt. In alphabetical order, the packages are:

bagit – Python library for working with BagIt style packages
colorama – makes ANSI escape character sequences work under MS Windows terminals
dateparser – parse dates in almost any string format
humanize – helps write large numbers in a more human-readable form
ipdb – the IPython debugger
keyring – access the system keyring service from Python
lxml – an XML parsing library for Python
plac – a command line argument parser
psutil – process and system utilities
requests – an HTTP library for Python
setuptools – library for setup.py
termcolor – ANSI color formatting for output in terminal
twine – Twine is a utility for publishing Python packages on PyPI
urllib3 – HTTP client library for Python
validators – data validation package for Python

☮︎ Copyright and license
Copyright (C) 2019–2023, Caltech. This software is freely distributed under a BSD/MIT type license. Please see the LICENSE file for more information.

