Dataset management

This section mostly concerns CMS users.

Grid

In order to gain access to the Worldwide LHC Computing Grid (WLCG), or just "the grid", you need to obtain a grid certificate as described here. The grid certificate has to be renewed annually.
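If you are unsure when your current certificate expires, you can check it locally, assuming it is installed in the usual location (~/.globus/usercert.pem):

openssl x509 -enddate -noout -in ~/.globus/usercert.pem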

Before opening the grid proxy, you have to source this file:

/cvmfs/grid.cern.ch/umd-c7ui-latest/etc/profile.d/setup-c7-ui-example.sh

The proxy can be opened with:

voms-proxy-init -voms cms -valid 24:00 # in hours:minutes

By default, the proxy is valid for 12 hours. To check the status of your grid proxy, run:

voms-proxy-info

The grid proxy can be closed explicitly with:

voms-proxy-destroy

With an open grid proxy you can:

  • Submit jobs to the grid;
  • Make queries to DAS;
  • Manage datasets with Rucio;
  • Access individual files interactively.

To illustrate the last point, if a file is stored on disk (and not only on tape), you can access it immediately by prefixing it with root://cms-xrd-global.cern.ch/. For example:

root -b -l root://cms-xrd-global.cern.ch//store/path/to/some/file.root
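If you would rather copy the file locally than stream it over the network every time, the same redirector works with xrdcp; the file path below is the same placeholder as above:

xrdcp root://cms-xrd-global.cern.ch//store/path/to/some/file.root .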

The remaining capabilities of an open grid proxy are discussed in the next sections.

CRAB

Jobs are submitted to the grid via the CRAB3 interface, which can be set up as follows:

source /cvmfs/cms.cern.ch/crab3/crab.sh prod

To gain access to the CRAB commands, you must also set up your CMSSW area.
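If you don't have a CMSSW area yet, a minimal sketch of setting one up looks like this, where CMSSW_X_Y_Z is a placeholder for whichever release you actually need:

source /cvmfs/cms.cern.ch/cmsset_default.sh
cmsrel CMSSW_X_Y_Z
cd CMSSW_X_Y_Z/src
cmsenv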

To verify whether you can actually write to Tallinn T2 servers, run:

crab checkwrite --site T2_EE_Estonia

Note

Follow these instructions to check whether your CERN username is mapped to your grid certificate, and to learn how to proceed in case it is not.

Contact the site admins in case you still get an error.

CRAB jobs are steered with a python configuration file. If you want the output of your jobs to be written to Tallinn T2 servers, you have to specify the following:

import os  # needed for config.General.workArea below

from CRABClient.UserUtilities import config, getUsernameFromCRIC

config = config()
config.Site.storageSite = 'T2_EE_Estonia'
config.Data.outLFNDirBase = f'/store/user/{getUsernameFromCRIC()}/JOB_DIRECTORY'
config.General.transferOutputs = True
config.General.transferLogs = True
# For storing the payload files in your home directory:
config.General.workArea = os.path.join(os.path.expanduser('~'), 'crab_projects')

Because CMS data storage is mounted at /cms locally, you can access the output files from /cms/store/user/<your CERN username>/JOB_DIRECTORY. Obviously, pick something unique for the job directory name so that you can distinguish between different runs of the same job.
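Once your full configuration file is ready (the snippet above only covers the output destination and work area), the task is typically submitted and monitored as follows; the file name crab_config.py and the task directory name are placeholders:

crab submit -c crab_config.py
crab status -d ~/crab_projects/crab_<task name>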

For further information, see:

DAS

Metadata about files, datasets, sites, etc. are all accessed from the Data Aggregation System (DAS), which has a web interface as well as a command line interface called dasgoclient. The latter becomes available after you've opened your grid proxy and sourced /cvmfs/cms.cern.ch/cmsset_default.sh.

Here are some common examples:

  • Retrieve an alphabetically sorted list of available datasets (some of which may still be in production):
    dasgoclient -query="dataset dataset=${INPUT_RGX} status=* | grep dataset.name | grep dataset.dataset_access_type" | grep -v DELETED | grep -v INVALID | sort
    
    where ${INPUT_RGX} can be a simple regex. For instance, if you want to find HH MC NanoAOD samples for Run3, you can set it to /*HH*/Run3*/NANOAODSIM;
  • Find the list of files belonging to ${INPUT_DATASET}, sort them by size:
    dasgoclient -query="file dataset=${INPUT_DATASET} | grep file.name | grep file.nevents | grep file.size" | sort -nr -k3
    
    Optionally, you can transform the number of bytes into a more human-readable format by piping the above into:
    awk '{cmd="numfmt --to=iec-i --suffix=B <<< "$NF; cmd | getline $NF; close(cmd); print}'
    
  • Find data storage sites where a given file or a dataset is stored:
    dasgoclient -query="site file=${INPUT_FILE_NAME}" | sort
    dasgoclient -query="site dataset=${INPUT_DATASET_NAME}" | sort
    
    This is useful to know in case you want to run jobs locally (in which case you'd want to see T2_EE_Estonia in that list), or on the grid (in which case you'd want to see FNAL T1 or T2 sites, since those can run grid jobs).

For an exhaustive list of DAS queries, run dasgoclient -examples.

If you want more detailed information about a certain dataset or file, you can ask dasgoclient to return the output in JSON format by appending the -json flag to the query command and parsing the result with other tools.
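For example, to get a quick feel for the structure of the JSON output, you can simply pretty-print it (${INPUT_DATASET} is a placeholder, as above):

dasgoclient -json -query="dataset dataset=${INPUT_DATASET}" | python3 -m json.tool | less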

DAS queries can also be made via the corresponding python3 interface as described here.

Rucio

The CMS datasets are nowadays managed through a service called Rucio, which succeeds the now deprecated PhEDEx system. Here are some quick links to Rucio documentation

A common acronym that appears in discussions related to Rucio is RSE, which stands for Rucio Storage Element. In practical terms, it refers to the storage site, which in our case is T2_EE_Estonia.

If you're a first-time user of Rucio, it is highly recommended that you read everything on this page before committing to anything.

Setup

In order to use the Rucio client, you will need to open a grid proxy first (as described in the Grid section), then do the following:

source /cvmfs/cms.cern.ch/rucio/setup-py3.sh

Note

Rucio does not mesh with CMSSW. If you want to use programs like dasgoclient, which become available after setting up CMSSW, it is imperative that you source the necessary file for setting up CMSSW before you set up Rucio, and that you never run cmsenv in that session, as otherwise it would pollute the environment variables that python needs. To summarize, if you want to use both rucio and dasgoclient in the same session, execute the following:

source /cvmfs/grid.cern.ch/umd-c7ui-latest/etc/profile.d/setup-c7-ui-example.sh
source /cvmfs/cms.cern.ch/cmsset_default.sh
source /cvmfs/cms.cern.ch/rucio/setup-py3.sh

If you don't care about dasgoclient, you can skip the second line in the above.

To make your life easier, it is a good idea to add the following line to your ~/.bashrc:

export RUCIO_ACCOUNT=yourcmsusername

This way you don't necessarily have to specify your username when manipulating Rucio rules.
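To double-check which account the Rucio client will actually use, you can run the following, which prints, among other things, the account it has picked up:

rucio whoami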

Adding rules

Rules can be added with the following command:

rucio add-rule cms:/dataset/or/file 1 T2_EE_Estonia

You can create copy rules for individual files (whose names always start with /store/) or for full datasets, as long as they are tracked by DAS. Note the cms: prefix that always precedes the file or dataset name. If you have the list of files or datasets you'd like to copy over to T2_EE_Estonia in a text file called, say, input.txt, with one file or dataset specified per line, you should add rules for them with the following command, because it's faster than executing the add-rule command separately for each file or dataset:

rucio add-rule $(cat input.txt | sed 's/^/cms:/g' | tr '\n' ' ') 1 T2_EE_Estonia

Even though it's just a single command, it will create a separate rule for every dataset and file that's listed in input.txt. The second-to-last argument specifies the number of copies (which you should always set to 1), which is followed by the site name.

However, before adding any rules, you first need to check whether you have been granted disk quota:

rucio list-account-limits $RUCIO_ACCOUNT

If the above command doesn't show anything for T2_EE_Estonia, it means that your personal CMS account has not been granted disk quota. In that case, you have three options:

  1. Submit the rule with the --ask-approval flag:
    rucio add-rule --ask-approval cms:/dataset/or/file 1 T2_EE_Estonia
    
    In general, you should avoid this option, because each rule would have to be individually approved by the site admins.
  2. Ask the site admins for a quota. In order to figure out who the site admins are, run:
    rucio list-rse-attributes T2_EE_Estonia
    
  3. (most preferred) Contact the site admins and ask them to add you under the group account t2_ee_estonia_local_users. You can add rules under the group account by setting RUCIO_ACCOUNT equal to the group account name, either by exporting it in the current shell session or by temporarily changing it while adding the rules (which would be preferable):
    RUCIO_ACCOUNT=t2_ee_estonia_local_users rucio add-rule \
      cms:/dataset/or/file 1 T2_EE_Estonia
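
Whichever account you end up using, it is worth keeping an eye on how much of the granted quota is already in use; one way to do that is:

rucio list-account-usage $RUCIO_ACCOUNT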
    

Things to keep in mind when adding new rules:

  1. Create new Rucio rules for datasets that are large and/or that you need to access frequently;
  2. Make sure that the datasets you're requesting won't exhaust the designated quota. If you're copying the data under the group account, coordinate and cooperate with your team members if you plan to use a significant percentage of the allowed disk space;
  3. Always add a comment to the rule you're creating, which justifies or adds context as to why you want to copy the data:
    rucio add-rule cms:/dataset/or/file 1 T2_EE_Estonia --comment "My comment"
    
    In case you forgot to add the comment, you can always update the rules. If site admins discover user-created rules with no comments, it's for them to decide if they're going to keep the dataset.
  4. Creating rules for individual files is justified if
    • You want to test your code on a single file or a handful of files; and
    • The individual files make up only a small fraction of the whole dataset.

Inspecting rules

You can view the existing rules with:

rucio list-rules --account $RUCIO_ACCOUNT | less -S

The advantage of piping the output of rucio list-rules to less -S is that it won't wrap long lines, which certainly occur in the output. You can also save the output in CSV format with the --csv flag, which can be parsed with ease.

There is a wrapper for the above command, which lists not only the basic information that is made available with rucio list-rules, but also the comments, dataset and file sizes, event counts, etc. The script also allows grouping rules for individual files under the datasets that those files belong to. Example:

rucio-list-rules --account $RUCIO_ACCOUNT --verbose --sort size --group | \
  less -S

Updating rules

The rules can be updated with rucio update-rule. If you want to modify the existing rules that were created under the group account, make sure that the client software knows this:

RUCIO_ACCOUNT=t2_ee_estonia_local_users rucio update-rule $RULE_ID <options>

The most relevant attributes you might want to change in the existing rules are:

  • the comments, which you can add or change with the --comment option;
  • the lifetime of the rule in seconds, which is specified after the --lifetime option. If the rule already has an expiration date, but you want to remove it, you can achieve this with --lifetime none.
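For example, to add a comment to an existing rule, or to set its lifetime to 30 days (2592000 seconds) from now, where $RULE_ID stands for the actual rule ID:

rucio update-rule --comment "Needed a while longer for the ongoing analysis" $RULE_ID
rucio update-rule --lifetime 2592000 $RULE_ID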

Removing rules

Rules can be deleted by simply issuing:

rucio delete-rule $RULE_ID

Again, the same comment about changing RUCIO_ACCOUNT when creating or modifying rules also applies when deleting them. There is no way to delete rules in bulk. Your best bet is to collect the rules you want to delete into a text file and then iterate over them one by one:

rucio list-rules --account $RUCIO_ACCOUNT --csv \
   | <filter> | awk -F, '{print $1}' > rules.txt   # NOTICE the <filter> part!!
for RULE_ID in `cat rules.txt`; do rucio delete-rule $RULE_ID; done

For admins

To change the quota of a user:

rucio-admin account set-limits <someone's account> T2_EE_Estonia <bytes>
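The limit is given in bytes; as an example (the account name here is hypothetical), granting a 10 TB quota would look like:

rucio-admin account set-limits someuser T2_EE_Estonia 10000000000000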

If you want to add to or remove people from the group account (t2_ee_estonia_local_users), you have to modify the corresponding cms-EE_Estonia-local e-group accordingly.