Dataset management
This section mostly concerns CMS users.
Grid
In order to gain access to the Worldwide LHC Computing Grid (WLCG), or just "the grid", you need to obtain a grid certificate as described here. The grid certificate has to be renewed annually.
Before opening the grid proxy, you have to source this file:
/cvmfs/grid.cern.ch/umd-c7ui-latest/etc/profile.d/setup-c7-ui-example.sh
The proxy can be opened with:
voms-proxy-init -voms cms -valid 24:00  # open a proxy (validity given in hours:minutes)
voms-proxy-info                         # check the status of the current proxy
voms-proxy-destroy                      # close the proxy
With an open grid proxy you can:
- Submit jobs to the grid;
- Make queries to DAS;
- Manage datasets with Rucio;
- Access individual files interactively.
To illustrate the last point, if a file is stored on disk (and not only on tape), you can access it immediately by prefixing it with root://cms-xrd-global.cern.ch/.
For example:
root -b -l root://cms-xrd-global.cern.ch//store/path/to/some/file.root
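If you'd rather copy the file to your working area instead of opening it remotely, the same prefix works with xrdcp (assuming the xrootd client tools are available in your environment); the path below is the same placeholder as above:
xrdcp root://cms-xrd-global.cern.ch//store/path/to/some/file.root .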
The remaining capabilities of an open grid proxy are discussed in the next sections.
CRAB
Jobs are submitted to the grid via the CRAB3 interface, which can be set up as follows:
source /cvmfs/cms.cern.ch/crab3/crab.sh prod
To verify whether you can actually write to Tallinn T2 servers, run:
crab checkwrite --site T2_EE_Estonia
Note
Follow these instructions to check whether your CERN username is mapped to your grid certificate, and how to proceed in case it is not.
Contact the site admins in case you still get an error.
CRAB jobs are steered with a Python configuration file. If you want the output of your jobs to be written to the Tallinn T2 servers, you have to specify the following:
import os
from CRABClient.UserUtilities import config, getUsernameFromCRIC

config = config()
config.Site.storageSite = 'T2_EE_Estonia'
config.Data.outLFNDirBase = f'/store/user/{getUsernameFromCRIC()}/JOB_DIRECTORY'
config.General.transferOutputs = True
config.General.transferLogs = True
# For storing the payload files in your home directory:
config.General.workArea = os.path.join(os.path.expanduser('~'), 'crab_projects')
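Assuming the configuration above is saved in a file called crab_config.py (the file name is arbitrary), the jobs can be submitted and monitored with the usual CRAB commands, for example:
crab submit -c crab_config.py                    # submit the task
crab status -d crab_projects/<task directory>    # check the progress of the jobs
crab resubmit -d crab_projects/<task directory>  # resubmit any failed jobs
Here <task directory> refers to the task subdirectory that CRAB creates under the workArea specified above.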
Since the storage is mounted locally under /cms, you can access the output files from /cms/store/user/<your CERN username>/JOB_DIRECTORY.
Obviously, pick something unique for the job directory name so that you are able to distinguish between different runs of the same job.
For further information, consult the CRAB3 documentation.
DAS
Metadata about files, datasets, sites, etc. is accessed from the Data Aggregation System (DAS), which has a web interface as well as a command line interface called dasgoclient. The latter becomes available after you've opened your grid proxy and sourced /cvmfs/cms.cern.ch/cmsset_default.sh.
Here are some common examples:
- Retrieve an alphabetically sorted list of available datasets (some of which may still be in production):
dasgoclient -query="dataset dataset=${INPUT_RGX} status=* | grep dataset.name | grep dataset.dataset_access_type" | grep -v DELETED | grep -v INVALID | sort
where ${INPUT_RGX} can be a simple regex. For instance, if you want to find HH MC NanoAOD samples for Run3, you can set it to /*HH*/Run3*/NANOAODSIM;
- Find the list of files belonging to ${INPUT_DATASET}, sort them by size:
dasgoclient -query="file dataset=${INPUT_DATASET} | grep file.name | grep file.nevents | grep file.size" | sort -nr -k3
Optionally, you can transform the number of bytes into a more human-readable format by piping the above into:
awk '{cmd="numfmt --to=iec-i --suffix=B <<< "$NF; cmd | getline $NF; close(cmd); print}'
- Find data storage sites where a given file or a dataset is stored:
dasgoclient -query="site file=${INPUT_FILE_NAME}" | sort
dasgoclient -query="site dataset=${INPUT_DATASET_NAME}" | sort
This is useful to know in case you want to run jobs locally (in which case you'd want to see T2_EE_Estonia in that list), or on the grid (in which case you'd want to see FNAL T1 or T2 sites, since those can run grid jobs).
For an exhaustive list of DAS queries, run dasgoclient -examples.
If you want more detailed information about a certain dataset or file, you can ask dasgoclient to return the output in JSON format by appending the -json flag to the query command and parsing it with other tools.
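As a quick sketch, the JSON output can be pretty-printed or filtered with standard tools such as jq or python3 -m json.tool (neither is part of the CMS setup, so their availability is assumed here):
dasgoclient -query="file dataset=${INPUT_DATASET}" -json | jq .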
DAS queries can also be made via the corresponding Python 3 interface, as described here.
Rucio
The CMS datasets are nowadays managed through a service called Rucio, which succeeds the now deprecated PhEDEx system. Here are some quick links to the Rucio documentation:
- Official documentation
- Documentation for CMS users (restricted access)
- Web interface for managing CMS datasets (requires grid certificate)
A common acronym that appears in discussions related to Rucio is RSE, which stands for Rucio Storage Element.
In practical terms, it refers to the storage site, which in our case is T2_EE_Estonia.
If you're a first-time user of Rucio, it is highly recommended that you read everything on this page before committing to anything.
Setup
In order to use the Rucio client, you need to open a grid proxy first (as described in the previous section) and then do the following:
source /cvmfs/cms.cern.ch/rucio/setup-py3.sh
Note
Rucio does not mesh with CMSSW. If you want to use programs like dasgoclient, which becomes available after setting up CMSSW, it is imperative that you source the necessary file for setting up CMSSW before you set up Rucio, and that you never run cmsenv in that session, as otherwise it would pollute the environment variables that Python needs. To summarize, if you want to use both rucio and dasgoclient in the same session, execute the following:
source /cvmfs/grid.cern.ch/umd-c7ui-latest/etc/profile.d/setup-c7-ui-example.sh
source /cvmfs/cms.cern.ch/cmsset_default.sh
source /cvmfs/cms.cern.ch/rucio/setup-py3.sh
If you don't need dasgoclient, you can skip the second line in the above.
To make your life easier, it is a good idea to add the following line to your ~/.bashrc:
export RUCIO_ACCOUNT=yourcmsusername
This way you don't necessarily have to specify your username when manipulating Rucio rules.
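To confirm which account the client is actually using (and that your proxy and environment are set up correctly), you can run:
rucio whoami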
Adding rules
Rules can be added with the following command:
rucio add-rule cms:/dataset/or/file 1 T2_EE_Estonia
You can create copy rules for individual files (which always start with /store/) or full datasets, as long as they are tracked by DAS.
Beware of the cms: prefix that always precedes the file or dataset name.
If you have the list of files or datasets you'd like to copy over to T2_EE_Estonia in a text file called, say, input.txt, with one file or dataset specified per line, you should add rules for them with the following command, because it's faster than executing the add-rule command separately for each file or dataset:
rucio add-rule $(cat input.txt | sed 's/^/cms:/g' | tr '\n' ' ') 1 T2_EE_Estonia
The second-to-last argument specifies the number of copies (which you should always set to 1), followed by the site name.
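As an illustration, such an input file can be prepared directly from a DAS query like the one shown in the previous section (the regex below is just a placeholder):
dasgoclient -query="dataset dataset=/*HH*/Run3*/NANOAODSIM" | sort > input.txt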
However, before adding any rules, you first need to check whether you have been granted disk quota:
rucio list-account-limits $RUCIO_ACCOUNT
If the above command doesn't show anything for T2_EE_Estonia, it means that your personal CMS account has not been granted disk quota.
In that case, you have three options:
- Submit the rule with the --ask-approval flag:
rucio add-rule --ask-approval cms:/dataset/or/file 1 T2_EE_Estonia
In general, you should avoid it, because each rule would have to be individually approved by the site admins.
- Ask the site admins for a quota. In order to figure out who the site admins are, run:
rucio list-rse-attributes T2_EE_Estonia
- (most preferred) Contact the site admins and ask them to add you under the group account t2_ee_estonia_local_users. You can add rules under the group account by setting RUCIO_ACCOUNT equal to the group account name, either by exporting it in the current shell session or by temporarily changing it while adding the rules (which would be preferable):
RUCIO_ACCOUNT=t2_ee_estonia_local_users rucio add-rule \
  cms:/dataset/or/file 1 T2_EE_Estonia
Things to keep in mind when adding new rules:
- Create new Rucio rules for datasets that are large and/or that you need to access frequently;
- Make sure that the datasets you're requesting won't exhaust the designated quota. If you're copying the data under the group account, coordinate and cooperate with your team members if you plan to use a significant percentage of the allowed disk space;
- Always add a comment to the rule you're creating, which justifies or adds context as to why you want to copy the data:
rucio add-rule cms:/dataset/or/file 1 T2_EE_Estonia --comment "My comment"
In case you forgot to add the comment, you can always update the rule later. If site admins discover user-created rules with no comments, it's for them to decide whether they're going to keep the dataset.
- Creating rules for individual files is justified if
- You want to test your code on a single file or a handful of files; and
- The individual files make up a fraction of the whole dataset.
Inspecting rules
You can view the existing rules with:
rucio list-rules --account $RUCIO_ACCOUNT | less -S
The advantage of piping the output of rucio list-rules to less -S is that it won't wrap long lines, which certainly occur in the output.
You can also save the output in CSV format with the --csv flag, which can be parsed with ease.
There is a wrapper for the above command, which lists not only the basic information that is made available with rucio list-rules, but also the comments, dataset and file sizes, event counts, etc.
The script also allows grouping rules for individual files under the datasets that those files belong to.
Example:
rucio-list-rules --account $RUCIO_ACCOUNT --verbose --sort size --group | \
less -S
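To inspect a single rule in more detail (its state, locks, expiration date, etc.), you can also query it directly by its ID, which is the first column in the list-rules output:
rucio rule-info $RULE_ID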
Updating rules
The rules can be updated with rucio update-rule.
If you want to modify the existing rules that were created under the group account, make sure that the client software knows this:
RUCIO_ACCOUNT=t2_ee_estonia_local_users rucio update-rule $RULE_ID <options>
The most relevant attributes you might want to change in the existing rules are:
- the comments, which you can add or change with the --comment option;
- the lifetime of the rule in seconds, which is specified after the --lifetime option.
If the rule already has an expiration date, but you want to remove it, you can achieve this with --lifetime none.
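For example (the rule ID is a placeholder and the values are arbitrary):
rucio update-rule $RULE_ID --comment "Needed a bit longer for the ongoing analysis"
rucio update-rule $RULE_ID --lifetime 2592000  # let the rule expire in 30 days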
Removing rules
Rules can be deleted by simply issuing:
rucio delete-rule $RULE_ID
Again, the same comment about changing RUCIO_ACCOUNT when creating or modifying rules also applies when deleting them.
There is no way of deleting rules in bulk.
Your best bet is to collect the rules you want to delete into a text file and then iterate over them one by one:
rucio list-rules --account $RUCIO_ACCOUNT --csv \
| <filter> | awk -F, '{print $1}' > rules.txt # NOTICE the <filter> part!!
for RULE_ID in `cat rules.txt`; do rucio delete-rule $RULE_ID; done
For admins
To change the quota of a user:
rucio-admin account set-limits <someone's account> T2_EE_Estonia <bytes>
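For example, to grant a hypothetical user jdoe 5 TB of space and then verify the change:
rucio-admin account set-limits jdoe T2_EE_Estonia 5000000000000  # limit is given in bytes
rucio-admin account get-limits jdoe T2_EE_Estonia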
If you want to add people to or remove people from the group account (t2_ee_estonia_local_users), you have to modify the corresponding cms-EE_Estonia-local e-group accordingly.