servicex-databinder 0.5.0

Creator: bradpython12


ServiceX DataBinder
Release v0.5.0

servicex-databinder is a user-analysis data management package using a single configuration file.
Samples with external data sources (e.g. RucioDID or XRootDFiles) utilize ServiceX to deliver user-selected columns with optional row filtering.

The following table shows the ServiceX transformers supported by DataBinder:



| Input format | Code generator | Transformer | Output format |
|---|---|---|---|
| ROOT Ntuple | func-adl | uproot | root or parquet |
| ATLAS Release 21 xAOD | func-adl | atlasr21 | root |
| ROOT Ntuple | python function | python | root or parquet |




Prerequisite

Access to a ServiceX instance
Python 3.7+

Installation
pip install servicex-databinder

Configuration file
The configuration file is a YAML file that holds all the information DataBinder needs.
The following example configuration file contains only the minimal fields. You can also download the servicex-opendata.yaml file into your working directory, rename it to servicex.yaml, and run DataBinder on OpenData without an access token.
General:
  ServiceXName: servicex-opendata
  OutputFormat: parquet

Sample:
  - Name: ggH125_ZZ4lep
    XRootDFiles: "root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets\
      /2020-01-22/4lep/MC/mc_345060.ggH125_ZZ4lep.4lep.root"
    Tree: mini
    Columns: lep_pt, lep_eta

The General block requires two mandatory options (ServiceXName and OutputFormat), as in the example above.
The input dataset for each Sample can be defined by RucioDID, XRootDFiles, or LocalPath.
A ServiceX query can be constructed with either TCut syntax or func-adl.
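
The rules above can be checked before anything is sent to ServiceX. The following is a minimal sketch, not DataBinder's own validation: the helper name validate_config is hypothetical, and treating the three input options as mutually exclusive is an assumption based on the wording above. The configuration is written as the dictionary a YAML loader would produce for the minimal example.

```python
MANDATORY_GENERAL = ("ServiceXName", "OutputFormat")
INPUT_KEYS = ("RucioDID", "XRootDFiles", "LocalPath")


def validate_config(config):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    general = config.get("General") or {}
    for key in MANDATORY_GENERAL:
        if key not in general:
            problems.append(f"General block is missing mandatory option {key}")
    for sample in config.get("Sample") or []:
        name = sample.get("Name", "<unnamed>")
        sources = [key for key in INPUT_KEYS if key in sample]
        # Assumption: each Sample defines exactly one input source.
        if len(sources) != 1:
            problems.append(
                f"Sample {name} must define one of {', '.join(INPUT_KEYS)}; "
                f"found {sources or 'none'}"
            )
    return problems


# The minimal example configuration, expressed as a Python dictionary.
config = {
    "General": {"ServiceXName": "servicex-opendata", "OutputFormat": "parquet"},
    "Sample": [
        {
            "Name": "ggH125_ZZ4lep",
            "XRootDFiles": "root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets"
            "/2020-01-22/4lep/MC/mc_345060.ggH125_ZZ4lep.4lep.root",
            "Tree": "mini",
            "Columns": "lep_pt, lep_eta",
        }
    ],
}

problems = validate_config(config)

# A deliberately incomplete configuration: no OutputFormat, no input source.
broken = validate_config(
    {"General": {"ServiceXName": "servicex-opendata"}, "Sample": [{"Name": "x"}]}
)
```

Returning a list of problems rather than raising on the first one lets a user fix several mistakes in a single pass.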

Options for TCut syntax: Filter and Columns (Filter works only for scalar-type TBranches)
Option for func-adl expression: FuncADL

The output format can be either Apache Parquet or ROOT ntuple for the uproot backend. Only the ROOT ntuple format is supported for the xAOD backend.
The following options are available:




| Option for General block | Description | DataType |
|---|---|---|
| ServiceXName* | ServiceX backend name in your servicex.yaml file | String |
| OutputFormat* | Output file format of ServiceX delivered data (parquet or root for uproot / root for xaod) | String |
| Transformer | Set transformer for all Samples. Overwrites the default transformer in the servicex.yaml file. | String |
| Delivery | Delivery option; LocalPath (default) or LocalCache or ObjectStore | String |
| OutputDirectory | Path to a directory for ServiceX delivered files | String |
| WriteOutputDict | Name of an output yaml file containing a Python nested dictionary of output file paths (located in the OutputDirectory) | String |
| IgnoreServiceXCache | Ignore the existing ServiceX cache and force new ServiceX requests | Boolean |



*Mandatory options



| Option for Sample block | Description | DataType |
|---|---|---|
| Name | Sample name defined by a user | String |
| Transformer | Transformer for the given sample | String |
| RucioDID | Rucio Dataset Id (DID) for a given sample; can be multiple DIDs separated by commas | String |
| XRootDFiles | XRootD files (e.g. root://) for a given sample; can be multiple files separated by commas | String |
| Tree | Name of the input ROOT TTree; can be multiple TTrees separated by commas (uproot ONLY) | String |
| Filter | Selection in the TCut syntax, e.g. jet_pt > 10e3 && jet_eta < 2.0 (TCut ONLY) | String |
| Columns | List of columns (or branches) to be delivered; multiple columns separated by commas (TCut ONLY) | String |
| FuncADL | Func-adl expression for a given sample | String |
| LocalPath | File path taken directly from a local path (NO ServiceX transformation) | String |





A config file can be simplified by using a Definition block. Placeholders defined under the Definition block replace every matching placeholder in the values of the Sample block. Note that placeholder names must start with DEF_.
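
The substitution described above can be pictured with a short sketch. It is an illustration of the idea rather than DataBinder's implementation: the function name apply_definitions is hypothetical, plain substring replacement is an assumption, and longer placeholder names are applied first as a precaution against one name being a prefix of another.

```python
from copy import deepcopy


def apply_definitions(config):
    """Return a copy of config with DEF_ placeholders expanded in Sample values."""
    definitions = config.get("Definition") or {}
    # Longest names first, so a short name never clobbers part of a longer one.
    names = sorted(definitions, key=len, reverse=True)
    expanded = deepcopy(config)
    for sample in expanded.get("Sample", []):
        for key, value in sample.items():
            if not isinstance(value, str):
                continue
            for name in names:
                if not name.startswith("DEF_"):
                    continue  # placeholders must start with DEF_
                value = value.replace(name, str(definitions[name]))
            sample[key] = value
    return expanded


config = {
    "Sample": [
        {"Name": "Background1", "XRootDFiles": "DEF_ggH_input", "Tree": "mini"},
    ],
    "Definition": {
        "DEF_ggH_input": "root://eospublic.cern.ch//eos/opendata/atlas/"
        "OutreachDatasets/2020-01-22/4lep/MC/mc_345060.ggH125_ZZ4lep.4lep.root",
    },
}

expanded = apply_definitions(config)
print(expanded["Sample"][0]["XRootDFiles"])
```

The input dictionary is left untouched, so the original placeholders remain available for inspection.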
Each Sample can be sourced with a different ServiceX transformer.
The default transformer is set by the type in servicex.yaml; Transformer in the General block overrides it if present, and Transformer in an individual Sample overrides any previous transformer selection.
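
The precedence just described amounts to a three-level fallback. The sketch below only illustrates that order; resolve_transformer and backend_default are hypothetical names, and the default value used here stands in for whatever servicex.yaml would supply.

```python
def resolve_transformer(sample, general, backend_default):
    """Pick the transformer for one Sample: Sample first, then General, then default."""
    return sample.get("Transformer") or general.get("Transformer") or backend_default


general = {"ServiceXName": "servicex-uc-af", "Transformer": "uproot"}
samples = [
    {"Name": "Signal"},                                  # falls back to General
    {"Name": "Background2", "Transformer": "atlasr21"},  # Sample-level setting wins
]

chosen = {
    s["Name"]: resolve_transformer(s, general, backend_default="uproot")
    for s in samples
}
print(chosen)  # {'Signal': 'uproot', 'Background2': 'atlasr21'}

# With no Transformer anywhere in the config, the backend default is used.
fallback = resolve_transformer({"Name": "x"}, {}, backend_default="atlasr21")
```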
The following example configuration shows how to use each option.
General:
  ServiceXName: servicex-uc-af
  Transformer: uproot
  OutputFormat: root
  OutputDirectory: /Users/kchoi/data_for_MLstudy
  WriteOutputDict: fileset_ml_study
  IgnoreServiceXCache: False

Sample:
  - Name: Signal
    RucioDID: user.kchoi:user.kchoi.signalA,
              user.kchoi:user.kchoi.signalB,
              user.kchoi:user.kchoi.signalC
    Tree: nominal
    FuncADL: DEF_ttH_nominal_query
  - Name: Background1
    XRootDFiles: DEF_ggH_input
    Tree: mini
    Filter: lep_n>2
    Columns: lep_pt, lep_eta
  - Name: Background2
    Transformer: atlasr21
    RucioDID: DEF_Zee_input
    FuncADL: DEF_Zee_query
  - Name: Background3
    LocalPath: /Users/kchoi/Work/data/background3
  - Name: Background4
    Transformer: python
    RucioDID: user.kchoi:user.kchoi.background4
    Function: |
      def run_query(input_filenames=None):
          import awkward as ak, uproot
          tree_name = "nominal"
          o = uproot.lazy({input_filenames:tree_name})
          return {"nominal": o}

Definition:
  DEF_ttH_nominal_query: "Where(lambda e: e.met_met>150e3). \
    Select(lambda event: {'el_pt': event.el_pt, 'jet_e': event.jet_e, \
    'jet_pt': event.jet_pt, 'met_met': event.met_met})"
  DEF_ggH_input: "root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets\
    /2020-01-22/4lep/MC/mc_345060.ggH125_ZZ4lep.4lep.root"
  DEF_Zee_input: "mc15_13TeV:mc15_13TeV.361106.PowhegPythia8EvtGen_AZNLOCTEQ6L1_Zee.\
    merge.DAOD_STDM3.e3601_s2576_s2132_r6630_r6264_p2363_tid05630052_00"
  DEF_Zee_query: "SelectMany('lambda e: e.Jets(\"AntiKt4EMTopoJets\")'). \
    Where('lambda j: (j.pt() / 1000) > 30'). \
    Select('lambda j: j.pt() / 1000.0'). \
    AsROOTTTree('junk.root', 'my_tree', [\"JetPt\"])"

Deliver data
from servicex_databinder import DataBinder
sx_db = DataBinder('<CONFIG>.yml')
out = sx_db.deliver()

The function deliver() returns a Python nested dictionary that contains delivered files.
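
Because the result is a nested dictionary, a small recursive walk is enough to list every delivered file. The sketch below makes no claim about the exact layout deliver() produces: the sample dictionary, its Sample/Tree nesting, and the file paths are hypothetical stand-ins, and iter_delivered_files is an illustrative helper name.

```python
def iter_delivered_files(node, path=()):
    """Yield (keys, leaf) pairs for every leaf of a nested dict/list structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_delivered_files(child, path + (key,))
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_delivered_files(item, path)
    else:
        yield path, node


# Hypothetical stand-in for the dictionary returned by sx_db.deliver().
out = {
    "Signal": {
        "nominal": [
            "/data/Signal/nominal/file1.root",
            "/data/Signal/nominal/file2.root",
        ]
    },
    "Background1": {"mini": ["/data/Background1/mini/file1.root"]},
}

files = list(iter_delivered_files(out))
for keys, filename in files:
    print("/".join(keys), "->", filename)
```

Walking the structure generically keeps the helper usable even if the nesting depth differs from the one assumed here.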

The input configuration can also be passed as a Python dictionary.
Delivered Samples and files in the OutputDirectory are always synced with the DataBinder config file.

Error handling
failed_requests = sx_db.get_failed_requests()

If any ServiceX request fails, deliver() prints the number of failed requests along with the Sample name, the Tree (if present), and the input dataset of each. The get_failed_requests() function returns the full list of failed samples with the error message for each. If the message is not clear, you can browse the Logs on the ServiceX instance webpage for details.
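
In a batch job it can be useful to stop as soon as any request has failed instead of relying on printed messages. The wrapper below is a sketch under two assumptions: that get_failed_requests() returns an empty, list-like value when nothing failed, and that its entries can simply be printed. FakeBinder is a hypothetical stand-in so the example runs without a ServiceX instance; in real use a DataBinder object would be passed instead.

```python
def deliver_or_raise(binder):
    """Run deliver() and raise if any ServiceX request failed."""
    out = binder.deliver()
    failed = binder.get_failed_requests()
    # Assumption: the return value is empty (falsy) when nothing failed.
    if failed:
        raise RuntimeError(f"{len(failed)} ServiceX request(s) failed: {failed}")
    return out


class FakeBinder:
    """Stand-in exposing only the two methods named in this README."""

    def __init__(self, failed):
        self._failed = failed

    def deliver(self):
        return {"Signal": {"nominal": []}}

    def get_failed_requests(self):
        return self._failed


result = deliver_or_raise(FakeBinder(failed=[]))

try:
    deliver_or_raise(FakeBinder(failed=["hypothetical error entry"]))
    raised = False
except RuntimeError:
    raised = True
```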
Useful tools
Create Rucio container for multiple DIDs
The current ServiceX generates one request per Rucio DID.
It's often the case that a physics analysis needs to process hundreds of DIDs.
In such cases, the script (scripts/create_rucio_container.py) can be used to create one Rucio container per Sample from a yaml file.
An example yaml file (scripts/rucio_dids_example.yaml) is included.
Here is the usage of the script:
usage: create_rucio_containers.py [-h] [--dry-run DRY_RUN]
                                  infile container_name version

Create Rucio containers from multiple DIDs

positional arguments:
  infile              yaml file contains Rucio DIDs for each Sample
  container_name      e.g. user.kchoi:user.kchoi.<container-name>.Sample.v1
  version             e.g. user.kchoi:user.kchoi.fcnc_ana.Sample.<version>

optional arguments:
  -h, --help          show this help message and exit
  --dry-run DRY_RUN   Run without creating new Rucio container

Acknowledgements
Support for this work was provided by the U.S. Department of Energy, Office of High Energy Physics under Grant No. DE-SC0007890.

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
