Data Format and Packaging
Kurt Biery
04 May 2020
DUNE DAQ/SC Meeting
Reminders
I primarily want to talk about the format and packaging of the raw data at the hand-off (interface) between online and offline.
I’ve used “event” and “Trigger Record” interchangeably in these slides.
In the interest of time, I suggest that we defer detailed discussions to follow-up meetings.
04-May-2020 · Data Format and Packaging
Outline
Introductory comments on
• Data Format
• Data Packaging
• Metadata and Manifest files
A sample HDF5 file with protoDUNE SP data
Data packaging examples – 3 so far
Sample metadata and manifest ‘files’
Topics for follow-up meetings; Ideas for testing HDF5 at PDSP
Data Format
We are investigating a DUNE-specific binary format stored in HDF5 files.
Eric Flumerfelt has done initial work in creating sample code to write/read data fragments in HDF5 (artdaq-demo-hdf5 package).
I’ve hacked that code to provide some sample HDF5 files with data from protoDUNE SP run 11037 (Cosmics run type).
I will describe the (tentative, non-binding) choices that I made later in this talk…
** For the purposes of this talk, let’s say that the high level ‘data format’ corresponds to the technology/tools that are used to read and write the files.
Data Packaging Intro
There have been a number of discussions over time about data packaging (Data Model workshop, DAQ workshop, etc.)
With help from Alessandro, I have come to believe that there are a few common themes regarding data packaging…
** Data packaging = ‘chunking’ or ‘slicing’ or ‘deciding which pieces of a Trigger Record or Stream go into which file’
Data Packaging Abstractions
Goal:
• Identify a set of parameters (or “choices” or “questions”) that we can use to specify the packaging model for each type of Trigger (e.g. Beam or SNB) or Stream (TP stream or WIB debug stream)
Once those parameters are identified, we can design and build the DAQ (Dataflow) infrastructure to support ‘configurable’ packaging by Trigger and Stream.
• Parameter values will be specified later
• [note to self: focus on the online/offline interface]
Data Packaging Parameters
For Triggered data:
1. Whether each file on disk will have an integer number of Trigger Records, or whether each file can have a fractional number of Trigger Record(s)
For both Triggered and Streamed data:
2. Whether or not the data in each file on disk will have geographically complete coverage (superset; the Trigger Decision has details)
   - If not GeoCmplt, A) what subdivision will be used, and B) should the file boundaries match between the different subdivided pieces
3. The maximum size of files that will be created
4. The maximum time interval/duration that will be stored in a single file (data time or wall clock time both seem possible)
We will need to specify priority among these for each Trigger/Stream…
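The four parameters above lend themselves to a per-Trigger/Stream configuration record. A minimal sketch of that idea (the class and field names here are mine, not an agreed DAQ schema):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the four packaging parameters, one policy per
# Trigger type or Stream. Names and defaults are illustrative only.
@dataclass
class PackagingPolicy:
    integer_trigger_records: bool          # param 1: whole Trigger Records per file?
    geo_complete: bool                     # param 2: geographically complete coverage?
    subdivision: Optional[str] = None      # param 2A: e.g. "APA" when not GeoCmplt
    correlated_boundaries: bool = False    # param 2B: matching file boundaries?
    max_file_bytes: int = 6 * 1024**3      # param 3: max file size
    max_time_span_ns: Optional[int] = None # param 4: max time span per file

# Example policies, made up for illustration
policies = {
    "Beam":      PackagingPolicy(integer_trigger_records=True, geo_complete=True),
    "TP stream": PackagingPolicy(integer_trigger_records=False, geo_complete=False,
                                 subdivision="APA", correlated_boundaries=False),
}
```

The Dataflow infrastructure could then look up the policy by Trigger/Stream name at file-writing time.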
Data Packaging Samples
I will describe some data files that demonstrate sample packaging models later in this talk…
Metadata and Manifest Files
The goal is to have meta-information that describes the raw data in files
• ‘metadata file’ – one-to-one with each raw data file
• ‘manifest file’ – one-to-many; provides list(s) when Trigger Records can span multiple files
The meta-information may not need to be in a separate file
• Some sample choices appear in the later examples; different choices are certainly possible
One particular type of meta-information: indicating a region-of-interest when a Trigger Decision specifies that for a particular Trigger Record. I have an example of one way to do that.
HDF5 Samples
Reminders about PDSP data now
art/ROOT files
Begin processing the 40th record. run: 11037 subRun: 1 event: 40 at 27-Mar-2020 11:07:38 CDT
PRINCIPAL TYPE: Event
PROCESS NAME | MODULE LABEL | PRODUCT INSTANCE NAME | DATA PRODUCT TYPE | PRODUCT FRIENDLY TYPE | SIZE
DAQ | daq | TIMING | std::vector<artdaq::Fragment> | artdaq::Fragments | 1
DAQ | daq | ContainerFELIX | std::vector<artdaq::Fragment> | artdaq::Fragments | 50
DAQ | daq | ContainerCRT | std::vector<artdaq::Fragment> | artdaq::Fragments | 4
DAQ | daq | ContainerTPC | std::vector<artdaq::Fragment> | artdaq::Fragments | 10
DAQ | TriggerResults | | art::TriggerResults | art::TriggerResults | -
DAQ | daq | ContainerCTB | std::vector<artdaq::Fragment> | artdaq::Fragments | 1
DAQ | daq | ContainerPHOTON | std::vector<artdaq::Fragment> | artdaq::Fragments | 24
• Vectors of artdaq::Fragments, grouped by Product Instance Name
• ‘Container’ fragments for the majority of the types (pull-mode readout)
• artdaq::Fragment is a header plus the verbatim payload from the electronics
Short intro to HDF5
HDF5 files contain Groups; Groups contain other Groups and Datasets
• Similar to a directory structure (directories contain directories and files)
• Groups and Datasets can each have Attributes (key/value pairs)
HDF5 Datasets can support lots of different data structures
• Simplifying assumption in these samples: use Datasets with 1-dimensional arrays of unsigned 64-bit integers
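To illustrate the 1-dimensional uint64 simplification, here is a small stdlib-only sketch (not the artdaq-demo-hdf5 code) of packing a raw fragment payload into 64-bit words, zero-padding the tail:

```python
import struct

def payload_to_uint64_words(payload: bytes) -> list:
    """Pack a raw fragment payload into little-endian unsigned 64-bit words,
    padding the tail with zero bytes, mirroring the 1-D uint64 Dataset
    simplification used in the sample files."""
    padded = payload + b"\x00" * (-len(payload) % 8)
    return list(struct.unpack("<%dQ" % (len(padded) // 8), padded))

# 9 payload bytes become two words: 0x01 and 0xff (second word zero-padded)
words = payload_to_uint64_words(b"\x01\x00\x00\x00\x00\x00\x00\x00\xff")
```

Such a word array is what one would hand to the HDF5 library as the Dataset contents.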
Comments about the samples
I followed the pattern set by Eric – artdaq::Fragment header fields become Dataset attributes; the artdaq::Fragment payload is the Dataset contents
• And for the event-based examples, I kept the same high-level Group as Eric – an event
I took some liberties with mid-level constructs:
• Got rid of Container Groups when only 1 fragment is inside
• Picked names for Groups and Datasets that seemed reasonable to me; better suggestions welcome (e.g. “APA6.0”, “TimeSliceN”)
• Created a Dataset Attribute that has the artdaq::Fragment timestamp (which corresponds to the Trigger time) in a human-readable string
• Missing data/empty fragments are simply skipped
Intro to the event-based layout
file.hdf5
Ø event N
  Ø subdetector 1 (e.g. CRT)
    Ø geometric unit 1
      Ø time slice 0
      Ø … time slice P
    Ø geometric unit 2
    Ø …
  Ø subdetector 2 (e.g. CTB)
    Ø time slice 0
    Ø … time slice Q
  Ø subdetector 3 (e.g. PDS) or electronics type 3 (e.g. FELIX or TPC [RCE])
    Ø geometric unit 1
    Ø … geometric unit R
  Ø subdetector 4 (e.g. Timing)
Ø event M
I chose to minimize the number of levels in each case, at the expense of giving up strict parallelism.
Of course, other choice(s) are possible.
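The layout above amounts to a Group path per Dataset, with levels added only when present. A hypothetical sketch (path names like “Event00040” are illustrative, not the names used in the sample files):

```python
def dataset_path(event: int, subdetector: str, geo_unit: str = None,
                 time_slice: int = None) -> str:
    """Build an HDF5 Group/Dataset path for the event-based layout sketched
    above. Levels are only added when present, so strict parallelism is not
    enforced (e.g. Timing has neither a geographic unit nor a time slice)."""
    parts = ["", "Event%05d" % event, subdetector]  # "Event%05d" is my guess at a name
    if geo_unit is not None:
        parts.append(geo_unit)
    if time_slice is not None:
        parts.append("TimeSlice%d" % time_slice)
    return "/".join(parts)

# A CRT dataset keeps all levels; a Timing dataset sits directly under the event
p1 = dataset_path(40, "CRT", geo_unit="geo1", time_slice=0)
p2 = dataset_path(40, "Timing")
```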
Tools
HDFView
• HDF ‘file browser’; easy ‘install’; screenshots later in this talk; I can give a demo
• A couple of notes: lexical sort; GB in HDFView is (1000)³ bytes, not (1024)³ [my talks always use (1024)³]
h5dump
• Prints out everything, so grep is typically needed
h5copy
• Allowed me to trivially copy individual files into one
Top-level view of one event
Attributes for each event
If/when I remake the examples, I will add attributes for
• Trigger timestamp
• Requested detector components
• Successfully read out detector components
Datasets within each event
Datasets within each event (2)
Other options are certainly available; for example, APA/1/link/0
Example of Dataset Attributes
artdaq::Fragment header fields
Sample Timing Dataset
“NoBeamTrig” & Cookie
Trigger timestamp
Event counter & checksum
32-bit version, unused bits
Other timestamps
Sample FELIX/APA5.0 Dataset
Crate/slot/fiber/version/sof?
Fragment metadata
Timestamp
Packaging examples
Packaging examples
Remember, these are ones that I just made up – no implied official endorsement.
If we want to try different schemes for further study, great, let’s do that.
I can already see ways to make the Attributes assigned to various Groups and Datasets more common between event-based and time-based models.
Packaging example 1
“Fully-built events, primary file split by max file size”
1. always an integer number of events (Trigger Records) per file
2. fully-built/geographically-complete events
3. max file size of N GB
4. no limit on the time span covered by the events in each file
Four sample files so far (size limit is 6 GB):
• np04_raw_run011037_GeoCmplt_6GB_0001.hdf5, 5.97 GB, events 1 to 45
• np04_raw_run011037_GeoCmplt_6GB_0002.hdf5, 5.97 GB, events 46 to 90
• np04_raw_run011037_GeoCmplt_6GB_0003.hdf5, 5.96 GB, events 91 to 135
• np04_raw_run011037_GeoCmplt_6GB_0004.hdf5, 5.97 GB, events 136 to 180
The data structure was already shown in the “HDF5” part of the talk…
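The “integer number of events per file, split by max size” rule can be sketched as a simple assignment loop (the sizes and the limit here are illustrative integers, not the real 6 GB accounting):

```python
def split_events_into_files(event_sizes, max_bytes):
    """Assign whole events to files in order, starting a new file whenever
    adding the next event would exceed max_bytes -- an integer number of
    events per file, as in packaging example 1."""
    files, current, used = [], [], 0
    for evt, size in event_sizes:
        if current and used + size > max_bytes:
            files.append(current)   # close the current file
            current, used = [], 0
        current.append(evt)
        used += size
    if current:
        files.append(current)
    return files

# 6 events of 4 (arbitrary units) with a limit of 10 -> two events per file
layout = split_events_into_files([(i, 4) for i in range(1, 7)], 10)
```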
Packaging example 2
“Data split by APA, primary file split by max file size”
1. always an integer number of Trigger Record fragments per file
2. geographically-separated data
   - APA1 data in one file (TPC & PDS)
   - APA2 data in a separate file (TPC & PDS)
   - APA3 data in a separate file (TPC & PDS)
   - APA4 data in a separate file (TPC & PDS)
   - APA5 data in a separate file (TPC & PDS)
   - APA6 data in a separate file (TPC & PDS)
   - Timing, CTB, and CRT data in a 7th file
3. correlated splitting of the files, by Trigger Record
4. max file size N GB
5. no limit on the time span covered by the events in each file
Further info on example 2
With geographically-separated data, there are at least two options for closing one file and opening another one:
1. Each set of files (for example, the TPC+PDS data from APA1) is independent, and files get closed when they individually reach N GB
2. Files are correlated, and all files get closed when one of them reaches N GB
For this example, I chose option 2.
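Option 2 can be sketched as a shared file-sequence counter that rolls all streams over together (a toy model for illustration, not the Dataflow implementation):

```python
class CorrelatedFileSet:
    """Option 2 sketch: one size counter per geographic stream; when any
    stream reaches max_bytes, every file in the set rolls over together,
    keeping file boundaries aligned by Trigger Record."""
    def __init__(self, streams, max_bytes):
        self.max_bytes = max_bytes
        self.sizes = {s: 0 for s in streams}
        self.sequence = 1                  # shared file-sequence number, e.g. _0001

    def add_trigger_record(self, fragment_sizes):
        """fragment_sizes: bytes written to each stream for this record.
        Returns the sequence number the NEXT record will be written into."""
        for s, n in fragment_sizes.items():
            self.sizes[s] += n
        if any(n >= self.max_bytes for n in self.sizes.values()):
            self.sequence += 1             # close ALL files, open a new set
            self.sizes = {s: 0 for s in self.sizes}
        return self.sequence

fs = CorrelatedFileSet(["APA1", "APA2"], max_bytes=100)
```

After two records of 60 bytes on APA1, APA1 crosses the limit and both streams roll over to sequence 2 together, even though APA2 holds only 20 bytes.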
Example 2 files
Each set of files was closed when any one of them approached 6 GB.
• np04_raw_run011037_GeoSplit_APA1_0001.hdf5, 5.94 GB, events 1 to 225
• np04_raw_run011037_GeoSplit_APA2_0001.hdf5, 3.61 GB, events 1 to 225
• np04_raw_run011037_GeoSplit_APA3_0001.hdf5, 2.35 GB, events 1 to 225
• np04_raw_run011037_GeoSplit_APA4_0001.hdf5, 5.97 GB, events 1 to 225
• np04_raw_run011037_GeoSplit_APA5_0001.hdf5, 5.99 GB, events 1 to 225
• np04_raw_run011037_GeoSplit_APA6_0001.hdf5, 5.97 GB, events 1 to 225
• np04_raw_run011037_GeoSplit_Other_0001.hdf5, 0.01 GB, events 1 to 225

• np04_raw_run011037_GeoSplit_APA1_0002.hdf5, 5.94 GB, events 226 to 450
• np04_raw_run011037_GeoSplit_APA2_0002.hdf5, 3.61 GB, events 226 to 450
• np04_raw_run011037_GeoSplit_APA3_0002.hdf5, 2.34 GB, events 226 to 450
• np04_raw_run011037_GeoSplit_APA4_0002.hdf5, 5.98 GB, events 226 to 450
• np04_raw_run011037_GeoSplit_APA5_0002.hdf5, 5.99 GB, events 226 to 450
• np04_raw_run011037_GeoSplit_APA6_0002.hdf5, 5.97 GB, events 226 to 450
• np04_raw_run011037_GeoSplit_Other_0002.hdf5, 0.01 GB, events 226 to 450
Views of a few Example 2 files
Packaging example 3
“Fully-built Time Slices, primary file split TBD”
1. packaged by Time Slice, so no requirement on an integer number of Trigger Records per file
2. fully-built/geographically-complete events
3. overall time span of Time Slices per file TBD
4. max file size TBD
Example 3 time-based layout
file.hdf5
Ø time slice 0
  Ø subdetector 1 (e.g. FELIX)
    Ø geometric unit 1
    Ø …
    Ø geometric unit T
Ø time slice 1
Ø …
It was not clear to me how to subdivide the non-FELIX PDSP data fragments, so I’ve only included FELIX data in this sample so far.
View of the Example 3 file
I didn’t have access to a readily available set of protoDUNE SP data with a long readout window, so I stuck with run 11037 and broke the 3 msec events into three 1 msec chunks. (1 ms / 20 ns = 50,000 ticks)
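The slicing arithmetic can be sketched directly (assuming the 20 ns tick implied by the 1 ms / 20 ns = 50,000 figure above):

```python
TICK_NS = 20  # assumed tick period, from the slide's 1 ms / 20 ns arithmetic

def slice_boundaries(start_tick, window_ticks, slice_ms):
    """Split a readout window into fixed-length time slices; with a 1 ms
    slice that is 1_000_000 ns / 20 ns = 50_000 ticks per slice."""
    ticks_per_slice = slice_ms * 1_000_000 // TICK_NS
    edges = range(start_tick, start_tick + window_ticks, ticks_per_slice)
    return [(t, min(t + ticks_per_slice, start_tick + window_ticks))
            for t in edges]

# a 3 ms event (150_000 ticks) becomes three 1 ms chunks
slices = slice_boundaries(0, 150_000, 1)
```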
Additional view for Example 3
APA-level data
Example 3 file
• np04_raw_run011037_TimeBased_GeoCmplt_0001.hdf5, 367 MB, 9 time slices
I wanted to get feedback before adding more time slices to this file
• Bugger the timestamps to get continuous time slices?
Additional examples are possible
Maybe the discussion of additional examples should be delegated to a group that includes offline folks?
Metadata files
• Started with JSON metadata files from PDSP
• Created sample JSON files for the HDF5 sample files
• Some details have been glossed over
• Format and location (e.g. inside the HDF5 file) can be changed
np04_raw_run011037_GeoSplit_APA1_0001.json
{
  "file_name": "np04_raw_run011037_GeoSplit_APA1_0001.hdf5",
  "start_time": "<at PDSP, Linux server time. still useful? If so, which process? HLF? Something upstream?>",
  "end_time": "<still useful?>",
  "earliest_trigger_time": 79184253665845104,
  "latest_trigger_time": 79184255392990149,
  "information_about_the_version_of_the_DAQ_software_used": {},
  "data_stream": "Cosmics",
  "data_tier": "raw",
  "dune_data.daqconfigname": "CRT_noprescale_delay_00008",
  "dune_data.is_fake_data": 0,
  "dune_data.detector_config": "ohFelix100:ohFelix101:ssp101:ssp102:ssp103:ssp104:wib101:wib102:wib103:wib104:wib105",
  "possible_new_way_of_describing_electronics_included_in_partition": {
    "TPC": [ "APA1.0", "APA1.1", "APA1.2", "APA1.3", "APA1.4", "APA1.5", "APA1.6", "APA1.7", "APA1.8", "APA1.9" ],
    "PDS": [ "APA1.0", "APA1.1", "APA1.2", "APA1.3" ]
  },
  "event_count": 225,
  "events": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,…216,217,218,219,220,221,222,223,224,225],
  "file_type": "detector",
  "first_event": 1,
  "last_event": 225,
  "runs": [[11037,1,"protodune-sp"]]
}
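Generating such a one-to-one sidecar metadata file is straightforward with a JSON library. A sketch (the key names follow the sample above but are not a settled schema, and many fields are omitted):

```python
import json

def make_metadata(hdf5_name, events, run, trig_times):
    """Sketch of building a per-file metadata record like the sample above.
    Key names follow the example; the helper itself is hypothetical."""
    return {
        "file_name": hdf5_name,
        "earliest_trigger_time": min(trig_times),
        "latest_trigger_time": max(trig_times),
        "data_tier": "raw",
        "event_count": len(events),
        "events": events,
        "first_event": events[0],
        "last_event": events[-1],
        "runs": [[run, 1, "protodune-sp"]],
    }

md = make_metadata("np04_raw_run011037_GeoSplit_APA1_0001.hdf5",
                   list(range(1, 226)), 11037,
                   [79184253665845104, 79184255392990149])
text = json.dumps(md)  # the sidecar .json file would hold this string
```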
np04_raw_run011037_GeoSplit_Manifest_0001.json
{
  "files_for_each_trigger_record": {
    "1": {
      "CRT": "np04_raw_run011037_GeoSplit_Other_0001.hdf5",
      "CTB": "np04_raw_run011037_GeoSplit_Other_0001.hdf5",
      "PDS": {
        "APA1": "np04_raw_run011037_GeoSplit_APA1_0001.hdf5",
        "APA2": "np04_raw_run011037_GeoSplit_APA2_0001.hdf5",
        "APA3": "np04_raw_run011037_GeoSplit_APA3_0001.hdf5",
        "APA4": "np04_raw_run011037_GeoSplit_APA4_0001.hdf5",
        "APA5": "np04_raw_run011037_GeoSplit_APA5_0001.hdf5",
        "APA6": "np04_raw_run011037_GeoSplit_APA6_0001.hdf5"
      },
      "Timing": "np04_raw_run011037_GeoSplit_Other_0001.hdf5",
      "TPC": {
        "APA1": "np04_raw_run011037_GeoSplit_APA1_0001.hdf5",
        "APA2": "np04_raw_run011037_GeoSplit_APA2_0001.hdf5",
        "APA3": "np04_raw_run011037_GeoSplit_APA3_0001.hdf5",
        "APA4": "np04_raw_run011037_GeoSplit_APA4_0001.hdf5",
        "APA5": "np04_raw_run011037_GeoSplit_APA5_0001.hdf5",
        "APA6": "np04_raw_run011037_GeoSplit_APA6_0001.hdf5"
      }
    },
    "2": {
      "CRT": "np04_raw_run011037_GeoSplit_Other_0001.hdf5",
      "CTB": "np04_raw_run011037_GeoSplit_Other_0001.hdf5",
      …
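Building this one-to-many manifest from the packaging choices is mechanical. A sketch (only one APA and two Trigger Records shown; the helper name is mine):

```python
def build_manifest(trigger_records, apa_files, other_file):
    """Sketch of the one-to-many manifest: for each Trigger Record, map each
    subdetector (and each APA, for TPC/PDS) to the file holding its fragments.
    In example 2 every record in a file set maps to the same files."""
    entry = {
        "CRT": other_file,
        "CTB": other_file,
        "Timing": other_file,
        "PDS": dict(apa_files),
        "TPC": dict(apa_files),
    }
    return {"files_for_each_trigger_record":
            {str(tr): entry for tr in trigger_records}}

manifest = build_manifest(
    [1, 2],
    {"APA1": "np04_raw_run011037_GeoSplit_APA1_0001.hdf5"},
    "np04_raw_run011037_GeoSplit_Other_0001.hdf5")
```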
Indicating region of interest
My idea so far is to add Attributes to each event for ‘requested_detector_components’ and ‘successful_detector_components’, which would indicate which pieces of the overall detector were requested to be part of this event (from the Trigger Decision) and were successfully included in this event.
My thought so far is to use a JSON string like in the metadata files…
"requested_detector_components": { "TPC" : [ "APA1.0", "APA1.1", "APA1.2", "APA1.3", "APA1.4", "APA1.5", "APA1.6", "APA1.7", "APA1.8", "APA1.9" ], "PDS" : [ "APA1.0", "APA1.1", "APA1.2", "APA1.3" ] }
If this seems reasonable, I can go back and add it to the sample HDF5 files that have been created so far.
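If both Attributes are present, offline code could diff them to find incomplete events. A sketch (assuming the JSON structure above, parsed into dicts; the helper is hypothetical):

```python
def missing_components(requested, successful):
    """Compare the two proposed event Attributes: return the requested
    detector components that did not make it into the event."""
    missing = {}
    for subdet, units in requested.items():
        lost = sorted(set(units) - set(successful.get(subdet, [])))
        if lost:
            missing[subdet] = lost
    return missing

# Toy inputs: one TPC unit was requested but not successfully read out
requested  = {"TPC": ["APA1.0", "APA1.1"], "PDS": ["APA1.0"]}
successful = {"TPC": ["APA1.0"], "PDS": ["APA1.0"]}
```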
Topics for follow-up meetings
Are we interested in pursuing these directions?
Are the existing sample files useful, or is it better for folks to first meet and decide on better Groups and Datasets?
With agreed-upon Groups and Datasets, should we remake the examples, or focus more on writing some HDF5 data from the DAQ at PDSP?
How do folks envision accessing any HDF5 data that we produce (beyond looking at it with HDFView)? Reconstituted art/ROOT events? How would that work for time slices?
Possibly writing HDF5 files at PDSP
Roland, Phil, Adam, and I have discussed various possibilities of writing long-window data at PDSP for later studies. (Binary dump from FELIX BR [Roland/Adam], HDF5 [me], stitching together 3 msec events [Phil])
• I’ve made progress in creating some routines in dune-artdaq/dune-raw-data that could be used for the HDF5 option. If we want to pursue that, we should discuss it. As I suggested on the previous slide, learning about the expected data access will be a useful ingredient.
-rw-r--r-- 1 np04daq np-comp    123800 May 1 23:13 FelixReceiver_r11166_364.hdf5  (ohFelix600)
-rw-r--r-- 1 np04daq np-comp    123800 May 1 23:13 FelixReceiver_r11166_369.hdf5  (ohFelix601)
-rw-r--r-- 1 np04daq np-comp 111909192 May 1 23:13 FelixReceiver_r11167_364.hdf5
-rw-r--r-- 1 np04daq np-comp 111909192 May 1 23:13 FelixReceiver_r11167_369.hdf5
-rw-r--r-- 1 np04daq np-comp  83946568 May 1 23:13 FelixReceiver_r11168_364.hdf5
-rw-r--r-- 1 np04daq np-comp  83946568 May 1 23:13 FelixReceiver_r11168_369.hdf5
-rw-r--r-- 1 np04daq np-comp 125823736 May 1 23:13 FelixReceiver_r11169_364.hdf5
-rw-r--r-- 1 np04daq np-comp 125823736 May 1 23:13 FelixReceiver_r11169_369.hdf5
Backup Slides
Reminder about Tom’s requirements
Tom has summarized the following requirements:
1. longevity of support
2. integrity checks – for the file format as well as the data fragments
3. ability to read in small subsets of the trigger records and drop from memory data no longer being used
4. ability to navigate through a trigger record to get the adjacent time or space samples
5. compression tools
6. browsable with a lightweight, interactive tool
7. ability to handle evolution of data formats and structure gracefully with backward compatibility ensured
https://wiki.dunescience.org/wiki/Project_Requirement_Brainstorming#Data_Format
Ideas for common Attributes
In the packaging examples, different attributes were demonstrated for the Datasets. For the event-based grouping, the Attributes were mainly artdaq::Fragment header fields. For the time-based grouping, the Attributes were more focused on timestamps.
Can I/we come up with a superset that works for both types of groupings? I believe so, yes.
• Data size, fragment ID, timestamp, time string
• Fragment type, valid and complete flags
• (FELIX) Number of frames, first frame timestamp, last frame timestamp
For the event or timeslice highest-level grouping, can we come up with a common set? Run number, time window start and end, is_complete. Event ID is redundant in the event-based grouping, so it could be dropped.
The earliest frame and latest frame Attributes that are part of the highest-level time-based grouping in example 3 should instead be part of the FELIX group.
Further work and ideas
Sample HDF5 data files for long-window Trigger Records?
Sample HDF5 data files for Stream data?
Implementation ideas in Dataflow subsystem (Dispatcher)
Other ideas
1. Data challenge in Feb 2021
2. Metadata and manifest files…
   - Metadata file for each raw data file
   - Manifest file for each TR that spans multiple files
   - Metadata could instead be internal to the raw data file
   - Sample metadata information for SNB files:
     • the trigger number/identifier
     • the APA number (or whatever geographic identifier(s) are appropriate)
     • the beginning and ending timestamps of the trigger window (or start time and window size)
     • the beginning and ending timestamps of the interval that is covered by the individual file (or start time and window size)
Another idea for time-based layout
file.hdf5
Ø macro time slice 0
  Ø subdetector 1 (e.g. CRT)
    Ø geometric unit 1
      Ø micro time slice 0
      Ø … micro time slice J
    Ø … geometric unit R
  Ø subdetector 2 (e.g. PDS)
    Ø geometric unit 1
    Ø … geometric unit S
  Ø subdetector 3 (e.g. FELIX)
    Ø geometric unit 1
    Ø … geometric unit T
Ø macro time slice 1
Ø …