sumfolder1 0.0.2

Creator: bradpython12

sumfolder1 is a utility for use within the archival and digital preservation
community to generate checksums for file system directories, and to generate
an overall "collection" checksum for a given set of files.

Why?
  Archival questions
  Structural questions
  Forensics questions
How?
  Reference set
  Reference implementation
  Merkle trees
  Terminology
  New folder attributes
  Sensitivity
DROID
  DROID in Siegfried
  DROID as an inspiration
  Writing about sumfolder1
Installation
Usage
  Demo output
  Use with a DROID csv
  Outputting the reference CSV
License

Why?
Conventionally, checksums exist for files but not for directories. Directories
have no payload of their own that can be hashed to calculate a digest/checksum.
If it were possible to create checksums for folders, or a global checksum for a
collection of objects, it would become possible to ask the following questions:
Archival questions

What is the collection checksum for a given set of files and folders?
What is the checksum for a given folder?
Given a collection of objects online, am I looking at an authentic listing?
Have I downloaded a collection in its entirety?

Structural questions

Is file/folder hash(x) included in the collection set?
Is file/folder hash(y) (non-existent) part of the entire set?
Is file hash(x) part of folder(y) where the collection has arbitrary depth?
Where are duplicate checksums located within a collection?
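
As a rough illustration of the structural questions above, the sketch below answers two of them against a flat listing of paths and checksums. The checksums are taken from the reference set shown later in this document; the paths, and the helper names `find_duplicates` and `contains`, are assumptions made for this example and are not part of sumfolder1.

```python
from collections import defaultdict

# Checksums come from the reference set; the paths are an assumed layout.
listing = {
    "collection/sub_dir_1/file_empty": "d41d8cd98f00b204e9800998ecf8427e",
    "collection/sub_dir_1/file_0003": "dc7f828c5fe622925181d06edada350f",
    "collection/sub_dir_6/file_empty": "d41d8cd98f00b204e9800998ecf8427e",
}


def find_duplicates(listing):
    """Map each checksum seen more than once to the sorted paths carrying it."""
    by_checksum = defaultdict(list)
    for path, checksum in listing.items():
        by_checksum[checksum].append(path)
    return {c: sorted(p) for c, p in by_checksum.items() if len(p) > 1}


def contains(listing, checksum):
    """Answer 'is hash(x) included in the collection set?' for a flat listing."""
    return checksum in listing.values()


print(find_duplicates(listing))
print(contains(listing, "dc7f828c5fe622925181d06edada350f"))
```

A flat listing like this can only answer questions about files. Asking the same questions about folders requires the folder checksums described in the next section.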

Forensics questions

Has a digital object been removed from the collection?
Did the collection contain at least one empty directory?

How?
Given a set of file paths and existing checksums it is possible to compute a
checksum for a folder by creating a checksum of the given checksums.
Given checksum 1) 7c1f9f9a4d0ce9a72ee63f37a1b7f694 and checksum 2)
aececec0bc3f515039aec9e60c413cd3 an MD5 can be computed as:
82f9e9a4305714fffdd7932783980cbc.
We can see this illustrated for a small collection as follows:
📁 folder_1 82f9e9a4305714fffdd7932783980cbc
   📄 checksum_1 7c1f9f9a4d0ce9a72ee63f37a1b7f694
   📄 checksum_2 aececec0bc3f515039aec9e60c413cd3

If we follow this approach through an entire directory structure we can create
checksums for all sub-directories and for the collection as a whole.
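
A minimal sketch of the "checksum of checksums" idea follows. It assumes the hex strings are sorted, encoded as UTF-8 and fed into a single MD5 digest. The exact bytes sumfolder1 hashes are not specified here, so the value printed may differ from the folder checksum quoted above. The name `folder_checksum` is hypothetical.

```python
import hashlib


def folder_checksum(checksums):
    """Digest a list of hex checksums into one MD5 hex digest.

    Assumption: checksums are sorted and hashed as UTF-8 text.
    """
    digest = hashlib.md5()
    for checksum in sorted(checksums):
        digest.update(checksum.encode("utf-8"))
    return digest.hexdigest()


children = [
    "7c1f9f9a4d0ce9a72ee63f37a1b7f694",
    "aececec0bc3f515039aec9e60c413cd3",
]
print(folder_checksum(children))
```

Sorting before hashing means the result does not depend on the order in which the file system happens to list the children.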
Reference set
A reference set is provided with this repository: reference set.
We can iterate through the directory tree to create sets of directory checksums
and a collection checksum: 52b94608dc70813aa88dae01176dc73b.
The reference set then looks as follows:
📁 collection 93778c524035d5d3e429a2fe43b7700a
   📄 file_0001 14118ff9ad0344decb37960809b2f17a
   📄 file_0000 8cfda2609b880a553759cd6200823f3b
   📄 file_0002 a4501ee1a5c711ea9db78a800a24e830
   📁 sub_dir_1 82301616d7e24f474dbe21de93af0a34
      📄 file_empty d41d8cd98f00b204e9800998ecf8427e
      📄 file_0003 dc7f828c5fe622925181d06edada350f
      📄 file_0004 e3d90a4bf14a9b355f0e69ba08df522d
      📁 sub_1_dir_1 1c7ba27edf1356d097a3f568032430c2
         📄 file_0005 637a3fb7da1ab61d10e96336d9758416
   📁 sub_dir_2 1ccb49edc4e873f1a8affd4bad5e9b90
   📁 sub_dir_3 2a60541cede91a36e9dc5bab7a97dd6e
      📁 sub_3_empty_1 db9d848b4f83ff3cb3faa4df0a59e3e1
      📁 sub_3_empty_2 1ccb49edc4e873f1a8affd4bad5e9b90
   📁 sub_dir_4 272d45767d534335163f220c1d40e559
      📄 file_0006 2b43227486ec8744cd5d4c955d269743
      📄 file_0007 c5a1973a70e08bf1eee13b8090f790ad
      📄 file_0008 fdffe4dd2d39c7d9986dbf5c6ec5ad39
   📁 sub_dir_5 d818d29b75f89a9b5d8d1c5a4c70dbbb
      📁 sub_5_dir_1 82f9e9a4305714fffdd7932783980cbc
         📄 file_0009 7c1f9f9a4d0ce9a72ee63f37a1b7f694
         📄 file_0010 aececec0bc3f515039aec9e60c413cd3
   📁 sub_dir_6 74be16979710d4c4e7c6647856088456
      📄 file_empty d41d8cd98f00b204e9800998ecf8427e

Reference implementation
The reference implementation for sumfolder1 does the following:
From the lowest sub-directory in the tree:

1. Check for sub-directories and add the checksums for these to a hash digest in
   alphabetical order by checksum.
2. For files in the directory add these to the hash digest in alphabetical order
   by checksum.
3. Create a digest for the list of checksums.

Repeat, processing each folder backwards up to the top level.

NB. If a folder is completely empty it is assigned a constant value
chosen in the code: 2600_EMPTY_DIRECTORY. This evaluates to an MD5 value of
1ccb49edc4e873f1a8affd4bad5e9b90.
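
The steps above can be sketched as a bottom-up recursion. This is an illustration under stated assumptions, not the sumfolder1 source. The nested-dict layout and the names `sum_folder` and `md5_hex` are hypothetical, and joining the hex strings as UTF-8 text before hashing is an assumption. Only the empty-directory constant is taken from the text above.

```python
import hashlib

EMPTY_DIRECTORY = "2600_EMPTY_DIRECTORY"  # constant named in the text above


def md5_hex(text):
    """Return the MD5 hex digest of a string (assumed UTF-8 encoding)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sum_folder(folder, results, path="collection"):
    """Return the checksum for `folder`, recording every folder in `results`."""
    # Children are processed first, so the lowest folders are summed first.
    folder_hashes = sorted(
        sum_folder(child, results, f"{path}/{name}")
        for name, child in folder.get("folders", {}).items()
    )
    file_hashes = sorted(folder.get("files", {}).values())
    if not folder_hashes and not file_hashes:
        checksum = md5_hex(EMPTY_DIRECTORY)
    else:
        # Folder checksums first, then file checksums, each group sorted.
        checksum = md5_hex("".join(folder_hashes + file_hashes))
    results[path] = checksum
    return checksum


tree = {
    "folders": {
        "sub_dir_2": {},  # completely empty folder
        "sub_dir_5": {
            "folders": {
                "sub_5_dir_1": {
                    "files": {
                        "file_0009": "7c1f9f9a4d0ce9a72ee63f37a1b7f694",
                        "file_0010": "aececec0bc3f515039aec9e60c413cd3",
                    }
                }
            }
        },
    }
}

results = {}
sum_folder(tree, results)
for folder_path, checksum in sorted(results.items()):
    print(folder_path, checksum)
```

Because only checksums enter the digest, renaming `file_0009` in this sketch leaves every folder checksum unchanged.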

Merkle trees
The concept I have used here is based on Merkle trees and a loose understanding
of techniques used in blockchains and in the source control system Git.
A good video summary of Merkle trees can be found on YouTube:

Gaurav Sen on Merkle Trees

And a Python tutorial I found useful in starting this work:

Dan Nolan on Merkle Trees

The technique required for a directory tree is a little more convoluted than
that of a Merkle tree which uses binary nodes and evaluates checksums from left
to right. I believe the implementation used for sumfolder1 is more closely
aligned to that of a "Radix Tree" or "Patricia Tree", however, this is to be
explored more.

NB. Merkle trees are typically used for performance, for example to verify part
of a data set without rehashing all of it; sumfolder1 does not yet have a
performance use-case.

Terminology
The reference implementation introduces some terminology that helps with
understanding the approach:

- Active-tree: the side of a directory tree that we're querying about a given
  hash.
- Non-active-tree: the trees at root node (Rn+1) that do not contain the digital
  object that we're querying.
- Root-node (Rn): the name of the top-level node, i.e. collection folder. This
  is either artificially created for a set of directories all at the same level,
  or exists as a function of the given collection set.
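
To make the terminology concrete, the sketch below walks a small tree and returns the branch that leads to a queried hash, i.e. the active-tree. Everything not on that branch is the non-active-tree. The node layout (`name`, `hash`, `children`) and the function name `active_tree` are assumptions for this example. The checksums are those listed for sub_dir_1 in the reference set, with sub_dir_1 treated as the root-node of the excerpt.

```python
def active_tree(node, target):
    """Return the names from the root-node down to the first node whose
    checksum equals `target`, or None when the checksum is absent."""
    if node["hash"] == target:
        return [node["name"]]
    for child in node.get("children", []):
        branch = active_tree(child, target)
        if branch is not None:
            return [node["name"]] + branch
    return None


root = {
    "name": "sub_dir_1",
    "hash": "82301616d7e24f474dbe21de93af0a34",
    "children": [
        {"name": "file_empty", "hash": "d41d8cd98f00b204e9800998ecf8427e"},
        {"name": "file_0003", "hash": "dc7f828c5fe622925181d06edada350f"},
        {"name": "file_0004", "hash": "e3d90a4bf14a9b355f0e69ba08df522d"},
        {
            "name": "sub_1_dir_1",
            "hash": "1c7ba27edf1356d097a3f568032430c2",
            "children": [
                {"name": "file_0005", "hash": "637a3fb7da1ab61d10e96336d9758416"},
            ],
        },
    ],
}

print(active_tree(root, "637a3fb7da1ab61d10e96336d9758416"))
# ['sub_dir_1', 'sub_1_dir_1', 'file_0005']
```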

New folder attributes
Folder objects need to be given additional attributes to enable the algorithm
to work.

- Folder-depth, so directories can be grouped and distinguished from
  one-another by level in the hierarchy.
- Hash, the goal of this tool is to enable a hash to be calculated for
  an entire collection.
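
One way to picture these attributes is a small record per folder. The `Folder` dataclass, its field names and the `make_folder` helper below are hypothetical; they only show how a depth attribute lets the deepest folders be processed first.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class Folder:
    path: str
    depth: int = 0
    checksum: Optional[str] = None  # filled in once the folder has been summed


def make_folder(path):
    """Build a Folder whose depth is the number of path components."""
    return Folder(path=path, depth=len(PurePosixPath(path).parts))


folders = [
    make_folder(p)
    for p in (
        "collection",
        "collection/sub_dir_5",
        "collection/sub_dir_5/sub_5_dir_1",
        "collection/sub_dir_2",
    )
]

# Deepest folders come first, so children are summed before their parents.
deepest_first = sorted(folders, key=lambda f: f.depth, reverse=True)
for folder in deepest_first:
    print(folder.depth, folder.path)
```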

Sensitivity
I am trying to make this code as portable as possible, i.e. while it works with
DROID-style reports today, it might also work with other checksum-based outputs
tomorrow. Additionally, to be able to compare folder structures, this utility
may also work with DROID-style reports later on in a transfer workflow; at which
point, folders and files may have been renamed, but their content remains
consistent.
To calculate a single folder checksum we currently do the following:

1. If there are folders in the directory, order their hashes alphabetically
   and add to a list.
2. File checksums are then ordered alphabetically and added to the end of the
   list.
3. The checksums in the list are then hashed together to create a new
   folder-level checksum.

DROID
sumfolder1 uses the DROID format identification report to generate folder level
checksums.
DROID can be found at The National Archives UK website:

DROID @ The National Archives

DROID in Siegfried
sumfolder1 can also be used with DROID compatible reports created by Siegfried
using a command such as follows:
sf --hash=md5 --droid <collection_folder>
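
A DROID-style report can then be read with Python's standard `csv` module. The sketch below is not taken from sumfolder1. It assumes a cut-down report using the column names `ID`, `PARENT_ID`, `FILE_PATH`, `NAME`, `TYPE` and `MD5_HASH`; a real report carries more columns, and the hash column name may vary with the algorithm chosen. The function name `files_by_parent` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Assumed, cut-down DROID-style report; real exports contain more columns.
SAMPLE = """ID,PARENT_ID,FILE_PATH,NAME,TYPE,MD5_HASH
1,,/collection,collection,Folder,
2,1,/collection/sub_dir_5,sub_dir_5,Folder,
3,2,/collection/sub_dir_5/sub_5_dir_1,sub_5_dir_1,Folder,
4,3,/collection/sub_dir_5/sub_5_dir_1/file_0009,file_0009,File,7c1f9f9a4d0ce9a72ee63f37a1b7f694
5,3,/collection/sub_dir_5/sub_5_dir_1/file_0010,file_0010,File,aececec0bc3f515039aec9e60c413cd3
"""


def files_by_parent(report):
    """Group file checksums by the ID of the folder that contains them."""
    grouped = defaultdict(list)
    for row in csv.DictReader(report):
        if row["TYPE"] == "File":
            grouped[row["PARENT_ID"]].append(row["MD5_HASH"])
    return {parent: sorted(hashes) for parent, hashes in grouped.items()}


print(files_by_parent(io.StringIO(SAMPLE)))
```

Grouping by parent in this way gives each folder the list of file checksums it needs before a folder-level checksum can be calculated.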

DROID as an inspiration
File format reports provide a means of statically analyzing collections of
digital objects. A DROID report satisfies the pre-conditions required to create
reliable folder- and collection-level checksums for digital collections:

A collection is static, i.e. unlikely to change.
Digital objects within the collection have checksums.


NB: A collection need not be static to be analyzed but it is not the primary
use-case of this utility.

More information about the different uses for a file-format identification
report can be found in my paper in the Code4Lib journal.

Fractal in detail: What information is in a file format identification report?

Writing about sumfolder1
I wrote a blog post describing the utility on the OPF website.

What is the checksum of a directory?

Installation
sumfolder1 is available on PyPI and can be installed as follows:
pip install -U sumfolder1

Usage
sumfolder1 has the following usage instructions:
usage: sumfolder1.py [-h] [--csv CSV] [--demo] [--ref] [-v]

Calculate checksums for folders in a collection of objects using a DROID format
identification report

options:
  -h, --help          show this help message and exit
  --csv CSV           Single DROID CSV to read.
  --demo              Run demo queries and output a tree to demo.txt
  --ref, --reference  Write reference set to stdout.
  -v, --version       Return version information.

Demo output
sumfolder1's demo output can be invoked as follows:
python sumfolder1 --demo

JSON will be output to stdout describing a handful of queries generated using
the reference collection.
A visualization of the collection tree will be output (for demo purposes) to
stderr.
Use with a DROID csv
Given a DROID csv the tool can be invoked as follows:
python sumfolder1 --csv <droid_csv_file>

Outputting the reference CSV
A reference CSV can be output to stdout. Ideally it is piped to some other
file using a command such as follows:
python sumfolder1 --ref > <output_file>

License
This work is licensed under: GNU GENERAL PUBLIC LICENSE Version 3.
