irods-capability-automated-ingest 0.5.0


iRODS Automated Ingest Framework
The automated ingest framework gives iRODS an enterprise solution that solves two major use cases: getting existing data under management and ingesting incoming data hitting a landing zone.
Based on the Python iRODS Client and Celery, this framework can scale up to match the demands of data coming off instruments, satellites, or parallel filesystems.
Example diagrams in the project's README show a filesystem scanner and a landing zone.


Usage options
Redis options



| option | effect | default |
| --- | --- | --- |
| redis_host | Domain or IP address of Redis host | localhost |
| redis_port | Port number for Redis | 6379 |
| redis_db | Redis DB number to use for ingest | 0 |
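
For example, a sync job pointed at a non-default Redis instance might pass these options explicitly (hypothetical host, source directory, and destination collection; each option is assumed to be passed to the start command in the --option_name form used throughout this document):

python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/dest \
    --redis_host redis.example.org --redis_port 6379 --redis_db 0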



S3 options
Scanning an S3 bucket minimally requires --s3_keypair and a source path of the form /bucket_name/path/to/root/folder.



| option | effect | default |
| --- | --- | --- |
| s3_keypair | path to S3 keypair file | None |
| s3_endpoint_domain | S3 endpoint domain | s3.amazonaws.com |
| s3_region_name | S3 region name | us-east-1 |
| s3_proxy_url | URL to proxy for S3 access | None |
| s3_insecure_connection | Do not use SSL when connecting to S3 endpoint | False |
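
For instance, a scan of an S3 bucket might look like the following sketch (bucket, destination collection, and keypair path are placeholders; note that the source path begins with the bucket name):

python -m irods_capability_automated_ingest.irods_sync start /my_bucket/path/to/root/folder /tempZone/home/rods/s3_data \
    --s3_keypair /path/to/keypair_file --s3_region_name us-east-1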



Logging/Profiling options



| option | effect | default |
| --- | --- | --- |
| log_filename | Path to output file for logs | None |
| log_level | Minimum level of message to log | None |
| log_interval | Time interval with which to rollover ingest log file | None |
| log_when | Type/units of log_interval (see TimedRotatingFileHandler) | None |
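
As a hypothetical example, logging to a daily-rotated file might combine these options as follows (the log path is a placeholder, the level names are assumed to be the standard Python logging levels, and "D" is the days unit from TimedRotatingFileHandler):

python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/dest \
    --log_filename /var/log/ingest.log --log_level INFO --log_interval 1 --log_when D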



--profile allows you to use vis to visualize a profile of the Celery workers over the course of an ingest job.



| option | effect | default |
| --- | --- | --- |
| profile_filename | Specify name of profile filename (JSON output) | None |
| profile_level | Minimum level of message to log for profiling | None |
| profile_interval | Time interval with which to rollover ingest profile file | None |
| profile_when | Type/units of profile_interval (see TimedRotatingFileHandler) | None |
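
A profiling run might look like the following sketch (the output path is a placeholder; as noted above, the profile output is JSON):

python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/dest \
    --profile --profile_filename /tmp/ingest_profile.json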



Ingest start options
These options are used at the "start" of an ingest job.



| option | effect | default |
| --- | --- | --- |
| job_name | Reference name for ingest job | a generated uuid |
| interval | Restart interval (in seconds). If absent, will only sync once. | None |
| file_queue | Name for the file queue. | file |
| path_queue | Name for the path queue. | path |
| restart_queue | Name for the restart queue. | restart |
| event_handler | Path to event handler file | None (see "event_handler methods" below) |
| synchronous | Block until sync job is completed | False |
| progress | Show progress bar and task counts (must have --synchronous flag) | False |
| ignore_cache | Ignore last sync time in cache - like starting a new sync | False |
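
As a sketch of how these fit together (paths, job names, and the handler file are placeholders; each option is assumed to be passed as --option_name):

# one-shot sync that blocks until completion and shows progress
python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/dest \
    --job_name my_sync_job --synchronous --progress

# periodic sync that re-scans the source every hour with a custom event handler
python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/dest \
    --job_name nightly_sync --interval 3600 --event_handler /path/to/event_handler.py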



Optimization options



| option | effect | default |
| --- | --- | --- |
| exclude_file_type | types of files to exclude: regular, directory, character, block, socket, pipe, link | None |
| exclude_file_name | a list of space-separated python regular expressions defining the file names to exclude such as "(\S+)exclude" "(\S+).hidden" | None |
| exclude_directory_name | a list of space-separated python regular expressions defining the directory names to exclude such as "(\S+)exclude" "(\S+).hidden" | None |
| files_per_task | Number of paths to process in a given task on the queue | 50 |
| initial_ingest | Use this flag on initial ingest to avoid check for data object paths already in iRODS | False |
| irods_idle_disconnect_seconds | Seconds to hold open iRODS connection while idle | 60 |
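
For example, a first-time ingest of a large tree might combine these options as in this sketch (the exclusion patterns and counts are illustrative only):

python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/dest \
    --exclude_file_type link --exclude_file_name '(\S+)\.tmp' '(\S+)\.partial' \
    --files_per_task 200 --initial_ingest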



Available --event_handler methods



| method | effect | default |
| --- | --- | --- |
| pre_data_obj_create | user-defined python | none |
| post_data_obj_create | user-defined python | none |
| pre_data_obj_modify | user-defined python | none |
| post_data_obj_modify | user-defined python | none |
| pre_coll_create | user-defined python | none |
| post_coll_create | user-defined python | none |
| pre_coll_modify | user-defined python | none |
| post_coll_modify | user-defined python | none |
| character_map | user-defined python | none |
| as_user | takes action as this iRODS user | authenticated user |
| target_path | set mount path on the irods server which can be different from client mount path | client mount path |
| to_resource | defines target resource request of operation | as provided by client environment |
| operation | defines the mode of operation | Operation.REGISTER_SYNC |
| max_retries | defines max number of retries on failure | 0 |
| timeout | defines seconds until job times out | 3600 |
| delay | defines seconds between retries | 0 |



Event handlers can use logger to write logs. See structlog for available logging methods and signatures.
Operation mode



| operation | new files | updated files |
| --- | --- | --- |
| Operation.REGISTER_SYNC (default) | registers in catalog | updates size in catalog |
| Operation.REGISTER_AS_REPLICA_SYNC | registers first or additional replica | updates size in catalog |
| Operation.PUT | copies file to target vault, and registers in catalog | no action |
| Operation.PUT_SYNC | copies file to target vault, and registers in catalog | copies entire file again, and updates catalog |
| Operation.PUT_APPEND | copies file to target vault, and registers in catalog | copies only appended part of file, and updates catalog |
| Operation.NO_OP | no action | no action |



--event_handler usage examples can be found in the examples directory.
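
To make these concrete, here is a minimal, hypothetical event handler sketch. It selects an operation, pins a target resource, and logs after each data object is created. The method signatures and the meta dictionary key shown here are assumptions modeled on the bundled examples; treat the files in the examples directory as authoritative.

from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

class event_handler(Core):
    @staticmethod
    def operation(session, meta, **options):
        # choose the mode of operation (see the Operation mode table above)
        return Operation.PUT_SYNC

    @staticmethod
    def to_resource(session, meta, **options):
        # target resource for the operation ("demoResc" is a placeholder)
        return "demoResc"

    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, meta, **options):
        # the framework passes a structlog-style logger; 'path' is assumed
        # to hold the source path of the ingested file
        logger.info("ingested new data object", path=meta["path"])
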
Character Mapping option
If an application should require that iRODS logical paths produced by the ingest process exclude subsets of the
range of possible Unicode characters, we can add a character_map method that returns a dict object. For example:
import re
from irods_capability_automated_ingest.core import Core

class event_handler(Core):
    @staticmethod
    def character_map():
        return {
            re.compile('[^a-zA-Z0-9]'): '_'
        }
    # ...

The returned dictionary, in this case, indicates that the ingest process should replace all non-alphanumeric (as
well as non-ASCII) characters with an underscore wherever they may occur in an otherwise normally generated logical path.
The substitution process also applies to the intermediate (i.e. collection name) elements of a logical path, and a suffix is
appended to affected path elements to avoid potential collisions with other remapped object names.
Each key of the returned dictionary indicates a character or set of characters needing substitution.
Possible key types include:

character

# substitute backslashes with underscores
'\\': '_'


tuple of characters

# any character of the tuple is replaced by a Unicode small script x
('\\','#','-'): '\u2093'


regular expression

# any character with a code point above 255 (i.e. outside the 0-255 range) becomes an underscore
re.compile('[\u0100-\U0010ffff]'): '_'


callable accepting a character argument and returning a boolean

# ASCII codes above 'z' become ':'
(lambda c: ord(c) in range(ord('z')+1,128)): ':'

In the event that the order-of-substitution is significant, the method may instead return a list of key-value tuples.
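
For instance, an order-sensitive character_map might look like the following sketch (the substitutions themselves are illustrative only): listing the whitespace rule first means it is considered before the catch-all rule.

import re
from irods_capability_automated_ingest.core import Core

class event_handler(Core):
    @staticmethod
    def character_map():
        # returned as a list of (key, value) tuples so that the
        # whitespace rule takes precedence over the catch-all rule
        return [
            (re.compile(r'\s'), '_'),
            (re.compile('[^a-zA-Z0-9_]'), '-'),
        ]
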
UnicodeEncodeError
Any file whose filesystem path causes a UnicodeEncodeError to be raised during ingest (e.g. due to the inclusion of an
unencodable UTF8 sequence) will be automatically renamed using a base-64 sequence to represent the original,
unmodified vault path.
Additionally, data objects that have had their names remapped, whether via the character map or via a UnicodeEncodeError, will be
annotated with an AVU of the form
Attribute: "irods::automated_ingest::" + ANNOTATION_REASON
Value: A PREFIX plus the base64-converted "bad filepath"
Units: "python3.base64.b64encode(full_path_of_source_file)"
where:

ANNOTATION_REASON is either "UnicodeEncodeError" or "character_map" depending on why the remapping occurred.
PREFIX is either "irods_UnicodeEncodeError_" or blank(""), again depending on the re-mapping cause.

Note that the UnicodeEncodeError type of remapping is unconditional, whereas the character remapping is contingent on
an event handler's character_map method being defined. Also, if a UnicodeEncodeError-style ingest is performed on a
given object, this precludes character mapping being done for the object.
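
As an illustration of how this annotation can be consumed later, the original source path can be recovered by stripping the PREFIX (when present) from the AVU value and base64-decoding the remainder. The helper below is hypothetical and not part of the package:

import base64

def original_path_from_avu_value(value, prefix="irods_UnicodeEncodeError_"):
    # character_map annotations use an empty prefix, so only strip the
    # UnicodeEncodeError prefix when it is present
    if value.startswith(prefix):
        value = value[len(prefix):]
    # the decoded bytes are the original filesystem path; decoding them to
    # str may itself fail in the UnicodeEncodeError case
    return base64.b64decode(value)
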
Manual Deployment
Configure python-irodsclient environment
python-irodsclient (PRC) is used by the Automated Ingest tool to interact with iRODS. The configuration and client environment files used for a PRC application apply here as well.
If you are using PAM authentication, remember to use the Client Settings File.
iRODSSessions are instantiated using an iRODS client environment file. The client environment file used can be controlled with the IRODS_ENVIRONMENT_FILE environment variable. If no such environment variable is set, the file is expected to be found at ${HOME}/.irods/irods_environment.json. A secure connection can be made by making the appropriate configurations in the client environment file.
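
For reference, a minimal client environment file might look like the following sketch (host, port, zone, and user name are placeholders; consult the python-irodsclient documentation for the full set of settings, including the SSL- and PAM-related ones):

{
    "irods_host": "irods.example.org",
    "irods_port": 1247,
    "irods_user_name": "rods",
    "irods_zone_name": "tempZone"
}

To point the ingest tool at a file in a non-default location:

export IRODS_ENVIRONMENT_FILE=/path/to/irods_environment.json
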
Starting Redis Server
Install Redis server: https://redis.io/docs/latest/get-started
Start the Redis server from a package installation:
redis-server

Or, daemonized:
sudo service redis-server start

sudo systemctl start redis

The Redis GitHub page also describes how to build and run Redis: https://github.com/redis/redis?tab=readme-ov-file#running-redis
The Redis documentation also recommends an additional step:

Make sure to set the Linux kernel overcommit memory setting to 1. Add vm.overcommit_memory = 1 to /etc/sysctl.conf and then reboot or run the command sysctl vm.overcommit_memory=1 for this to take effect immediately.

This allows the Linux kernel to overcommit virtual memory even if this exceeds the physical memory on the host machine. See kernel.org documentation for more information.
Note: If running in a distributed environment, make sure the Redis server accepts connections by editing the bind line in /etc/redis/redis.conf or /etc/redis.conf.
Setting up virtual environment
You may need to upgrade pip:
pip install --upgrade pip

Install virtualenv:
pip install virtualenv

Create a virtualenv with python3:
virtualenv -p python3 rodssync

Activate virtual environment:
source rodssync/bin/activate

Install this package
pip install irods_capability_automated_ingest

Set up environment for Celery:
export CELERY_BROKER_URL=redis://<redis host>:<redis port>/<redis db> # e.g. redis://127.0.0.1:6379/0
export PYTHONPATH=`pwd`

Start celery worker(s):
celery -A irods_capability_automated_ingest.sync_task worker -l error -Q restart,path,file -c <num workers>

Note: Make sure queue names match those of the ingest job (default queue names shown here).
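
For example, if an ingest job overrides the default queue names, the worker must be started with the same names. A hypothetical pairing (queue names and worker count are placeholders, and the options are assumed to be passed as --option_name):

celery -A irods_capability_automated_ingest.sync_task worker -l error -Q my_restart,my_path,my_file -c 4

python -m irods_capability_automated_ingest.irods_sync start /data/src /tempZone/home/rods/dest \
    --restart_queue my_restart --path_queue my_path --file_queue my_file
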
Using the sync script
Start sync job
python -m irods_capability_automated_ingest.irods_sync start <source dir> <destination collection>

List jobs
python -m irods_capability_automated_ingest.irods_sync list

Stop jobs
python -m irods_capability_automated_ingest.irods_sync stop <job name>

Watch jobs (same as using --progress)
python -m irods_capability_automated_ingest.irods_sync watch <job name>

Run tests
Note: The tests start and stop their own Celery workers, and they assume a clean Redis database.
python -m irods_capability_automated_ingest.test.test_irods_sync

See docker/ingest-test/README.md for how to run tests with Docker Compose.

