List of important messages#
Here we collect the error and fatal messages that have appeared in the recent past, together with instructions on what to do if you observe one of them during your shift. You should have set the appropriate filter settings as described here; they filter out messages that the shifter should ignore.
If an error or fatal message appears that is not described on this page even though the appropriate filter settings are applied, please always create a bookkeeping entry for the run as described here.
Hint:
Magic numbers in the message body are of course not fixed. Do not search for the whole message; search only for the words that do not change. For example, the message
DPL: Splitting a huge cluster: chipID 233, rows 302:318 cols 468:596
can of course come from different chipIDs and different rows and columns. So rather search for "DPL: Splitting a huge cluster".
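The hint can be sketched in a few lines of Python: replacing every run of digits with a placeholder shows that two messages differing only in their numbers share the same fixed text. The helper name `stable_form`, the placeholder `N`, and the second example message are illustrative assumptions and not part of the InfoLogger.

```python
import re

def stable_form(message: str) -> str:
    """Replace every run of digits with 'N' (hypothetical helper)."""
    return re.sub(r"\d+", "N", message)

# First message is the example from the text; the second is made up
# with different numbers to show that only the numbers vary.
a = "DPL: Splitting a huge cluster: chipID 233, rows 302:318 cols 468:596"
b = "DPL: Splitting a huge cluster: chipID 17, rows 5:9 cols 100:228"

# Both messages reduce to the same stable form.
print(stable_form(a) == stable_form(b))

# The fixed fragment is what you would type into a search field.
print("DPL: Splitting a huge cluster" in b)
```

Both checks print `True`, which is why searching for the fixed fragment finds every variant of the message.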
TfBuilder error#
Severity | System | Facility | Message |
---|---|---|---|
Error | | datadist/tfbuilder | Merging STFs error: STF first orbits do not match [...] |
STFs from different detectors are out of sync. Something is wrong with the data sent from the FLPs. This can cause problems when reconstructing the run. Report it to the FLP oncall.
Framework / DPL#
Severity | System | Facility | Message |
---|---|---|---|
Error | DPL | readout-proxy | Could not find management segment for shmid '4cc14c93' |
All above errors should be ignored at the moment. They are being investigated by experts.
DataDistribution#
Severity | System | Facility | Message |
---|---|---|---|
Error | | datadist/tfscheduler | addStfInfo: this TimeFrame was already built [...] |
These errors can be ignored at the moment; they are being investigated by the experts. With the default filter settings you should not even see them.
ROOT#
Severity | System | Facility | Message |
---|---|---|---|
Error | STDERR | can be any | stderr: Error in %: VariableMetricBuilder Initial matrix not pos.def. |
These are messages that ROOT writes to stderr and that get picked up by the STDERR monitor tool. Please ignore these messages for now.
QC#
Severity | System | Facility | Message |
---|---|---|---|
Error | QC | % | Could not register to ServiceDiscovery. |
Please ignore this type of error message for now.
Severity | System | Facility | Message |
---|---|---|---|
Error | QC | post/% | An error occurred during retrieval |
Error | DPL | qc-% | Requested resource does not exist: ali-qcdb.cern.ch:8083/qc/% |
Please check first whether the detector connected to the error has put some information mentioning it on their known issues page. If not, please create a bookkeeping entry tagging, in addition to QC/PDP shifter, the detector linked to the error message (you may have to enable the Detector column in the InfoLogger to see it), and copy the full line from the IL (at least time, hostname, system, facility, detector, partition, run number). Only the "Requested resource does not exist" message is relevant. The "An error occurred during retrieval" message simply follows the other one and does not add any helpful information for the expert.
Please do not tag PDP in your bookkeeping entry related to missing objects for QC tasks; tag only QC/PDP shifter and the detector associated with the message.
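As a rough illustration of the checklist above, the following Python sketch verifies that a bookkeeping entry carries all the fields listed in the text and keeps only the message that is useful to the expert. The dictionary keys, helper names, and placeholder values are assumptions made for this example; they do not correspond to any bookkeeping or InfoLogger API.

```python
# Fields the text asks for; the key names are illustrative assumptions.
REQUIRED_FIELDS = ("time", "hostname", "system", "facility",
                   "detector", "partition", "run_number")

def missing_fields(entry):
    """Return the required fields that are absent or empty in `entry`."""
    return [name for name in REQUIRED_FIELDS if not entry.get(name)]

def is_relevant(message):
    """Only the 'Requested resource does not exist' line helps the expert."""
    return "Requested resource does not exist" in message

# Placeholder values, invented purely for illustration.
entry = {
    "time": "<time>",
    "hostname": "<hostname>",
    "system": "DPL",
    "facility": "qc-%",
    "detector": "",  # e.g. Detector column not enabled in the InfoLogger
    "partition": "<partition>",
    "run_number": "<run number>",
}

print(missing_fields(entry))                               # ['detector']
print(is_relevant("An error occurred during retrieval"))   # False
```

An empty result from `missing_fields` means the copied line contains everything the text asks for.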
MFT#
Severity | System | Facility | Message |
---|---|---|---|
Error | DPL | mft-stf-decoder | link cruID:0x0808/lID10 feeID:0x0123 (cable 17) has IR=BCid: 594 Orbit: 67398822 for current majority IR=BCid: 594 Orbit: 69495997 -> Old ROF, discarding |
Please create a bookkeeping entry tagging only MFT (and not PDP).
MID#
Severity | System | Facility | Message |
---|---|---|---|
Error | DPL | MIDRawDecoder | RAWPARSER: Incomplete HBF - jump in packet counter 255 to 1 (1 total RawParser errors) |
These errors can be ignored at the moment; they are being investigated by the experts.
EMCAL#
Severity | System | Facility | ErrSource | Message |
---|---|---|---|---|
Error | DPL | EMCALRawToCellConverterSpec | AltroDecoder.cxx | Error while decoding RCU trailer: Last RCU trailer word not found! |
Please create a bookkeeping entry tagging "EMCAL" and "PDP" in addition to "QC/PDP Shifter". In your entry, copy+paste the full line from the InfoLogger, including at least the run number and hostname. If other issues appear simultaneously, for example bad EMCAL QC plots, call the EMCAL oncall.
Severity | System | Facility | Message |
---|---|---|---|
Error | DPL | EMCALRawToCellConverterSpec | Not all EMC active links contributed in global BCid=261923256227: mask=0000000000000000000000000000001000000000000000 |
Please ignore the "Not all EMC active links contributed in" error for now.
PHOS#
Severity | System | Facility | ErrSource | Message |
---|---|---|---|---|
Error | DPL | PHOSRawToCellConverterSpec | RawReaderMemory.cxx | Trailer decoding error: Last RCU trailer word not found! |
Please ignore these errors at the moment. Message from expert: "Old error of PHOS SRU firmware. In rare cases the SRU does not finish the event correctly, thus the RCU trailer is missing. To fix it we need the new SRU firmware, which will not happen in the near future".
MCH#
Severity | System | Facility | Message |
---|---|---|---|
Error | QC | post/% | An error occurred during retrieval |
Error | DPL | qc-pp-% | Requested resource does not exist: ali-qcdb.cern.ch:8083/qc/MCH/MO/% |
These errors appear due to a misconfiguration of the MCH QC json, which the MCH team needs to fix. To be ignored by the shifters for now.
CCDB upload problems#
Severity | System | Facility | ErrSource | Message |
---|---|---|---|---|
Error | DPL | ccdb-populator | CCDBPopulatorSpec.h | failed on uploading to http://ccdb-test.cern.ch:8080 / FT0/Calib/TimeSpectraInfo for [1689213645332:1691805955826] |
If this happens during a SYNTHETIC run, meaning the upload to http://ccdb-test.cern.ch:8080 fails (see message text), create a bookkeeping entry tagging CCDB. If this happens during a PHYSICS run, please call the PDP oncall.
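The rule above depends only on the run type, so it can be written down as a small lookup. This is a minimal sketch for illustration; the function name is hypothetical, and the text does not say what to do for other run types, so the sketch raises an error for them instead of guessing.

```python
def ccdb_upload_action(run_type: str) -> str:
    """Map the run type to the shifter action for a failed CCDB upload.

    Hypothetical helper that mirrors the two cases described in the text.
    """
    run_type = run_type.strip().upper()
    if run_type == "SYNTHETIC":
        return "Create a bookkeeping entry tagging CCDB"
    if run_type == "PHYSICS":
        return "Call the PDP oncall"
    # Other run types are not covered by the instructions.
    raise ValueError(f"No instruction for run type {run_type!r}")

print(ccdb_upload_action("SYNTHETIC"))
print(ccdb_upload_action("physics"))
```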
CCDB access problems#
Severity | System | Facility | ErrSource | Message |
---|---|---|---|---|
Error | DPL | dpl/ft0-reconstructor | CcdbApi.cxx | DPL: Requested resource does not exist: http://o2-ccdb.internal//download/83b37330-28ff-11ec-ab62-2a010e0a09fb |
Error | DPL | dpl/ft0-reconstructor | CcdbApi.cxx | DPL: Curl request to http://o2-ccdb.internal//FT0/Calibration/ChannelTimeOffset/1635477272416/ failed |
If the run can continue, create a bookkeeping entry for the run tagging PDP and CCDB and including the full message. If not, i.e. too many EPNs crash and the run needs to be stopped because of this, call the PDP oncall.
GPU reconstruction issues#
Severity | System | Facility | ErrSource | Message |
---|---|---|---|---|
Error | DPL | dpl/gpu-reconstruction | GPUChainTracking.cxx | DPL: GPUReconstruction suffered from an error in the CPU part |
Error | DPL | dpl/gpu-reconstruction | GPUErrors.cxx | DPL: GPU Error Code (0:20) ERROR_CF_PEAK_OVERFLOW : 282017 / 271507 / 0 |
Create a bookkeeping entry tagging PDP. As always, copy+paste the full line from the InfoLogger. At least the run number and the full error message must be included.
Error messages at EOR#
Severity | System | Facility | Message |
---|---|---|---|
Error | DPL | % | Some Lifetime::Timeframe data got dropped starting at X |
Runs cannot be stopped cleanly at the moment. Therefore there will always be a flood of errors at EOR in the InfoLogger which should be ignored. The above message may also appear at SOR or during a run. At the moment please ignore this. It is under expert investigation.
Severity | System | Facility | Message |
---|---|---|---|
Error | | datadist/tfscheduler | [FMQ] Uncaught exception reached the top of DeviceRunner: Invalid transition: END: failed to change state of a fairmq device |
The above message might appear if a run is not stopped cleanly. It can be ignored at EOR. If it appears in the middle of an ongoing run, please contact the EPN team.
Messages in case processes crash#
If a process crashes on the EPNs, this can have different reasons (segmentation violation, OOM killer, unexpected input data, ...), and the reason is not always obvious from the message. Each crash leads to an error or fatal message in the InfoLogger. Whenever you spot such a message, create a bookkeeping entry with the relevant information for the experts. A template entry is shown here.