The LHCb Stripping Project: Sustainable Legacy Data Processing for High-Energy Physics

Journal: Computing and Software for Big Science, Volume: 9, Issue: 1
DOI: https://doi.org/10.1007/s41781-025-00151-6
PMID: https://pubmed.ncbi.nlm.nih.gov/41323503
Publication Date: 2025-11-28
Author(s): N. A. Grieser et al.
Primary Topic: Scientific Computing and Data Management

Overview

The LHCb Stripping project is a central part of the LHCb experiment's data processing framework: it reduces the experiment's extensive collision data to manageable samples for offline analysis. The project supports the re-analysis of legacy data from Runs 1 and 2, maintains the software stack, and executes (re-)Stripping campaigns. As the collaboration's focus shifts to newer datasets, the project is committed to optimizing the infrastructure for both legacy and live data processing.

This paper presents a detailed overview of the Stripping framework, covering its Python-configurable architecture, its integration with LHCb computing systems, and its ability to manage large-scale campaigns. Key advancements include GitLab-based workflows, continuous integration, automation, and parallelized processing, which address various computational challenges. The authors also reflect on lessons learned throughout the project and propose a roadmap to ensure that the LHCb collaboration retains efficient access to its essential legacy physics datasets.

Introduction

The LHCb Stripping project is a crucial component of the data processing pipeline for the LHCb experiment: it selects the particle interactions of interest from the vast amounts of recorded data. Through a flexible Python interface, researchers can tailor selection criteria to individual physics studies. Since its integration into the LHCb Data Processing & Analysis (DPA) framework in 2020, the Stripping project has been a key element of Work Package 5 (WP5), which is dedicated to preserving legacy software and data from CERN’s Large Hadron Collider (LHC) operations during Runs 1 (2010-2012) and 2 (2015-2018).
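To make the idea of Python-configurable selection criteria concrete, the following is a minimal sketch in plain Python. It is not the LHCb Stripping API: the `SelectionLine` class, the `make_line` and `run_lines` helpers, and the event variables `pt` and `vertex_chi2` are all hypothetical names chosen for illustration. It only shows how a named selection can be built from tunable thresholds and applied to a list of events.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical flat event record: variable name -> value.
Event = Dict[str, float]


@dataclass(frozen=True)
class SelectionLine:
    """A named selection: an event passes only if every cut returns True."""
    name: str
    cuts: Tuple[Callable[[Event], bool], ...]

    def accepts(self, event: Event) -> bool:
        return all(cut(event) for cut in self.cuts)


def make_line(name: str, min_pt: float, max_chi2: float) -> SelectionLine:
    """Build a line from tunable thresholds (variable names are illustrative)."""
    return SelectionLine(
        name=name,
        cuts=(
            lambda ev: ev["pt"] > min_pt,
            lambda ev: ev["vertex_chi2"] < max_chi2,
        ),
    )


def run_lines(events: List[Event], lines: List[SelectionLine]) -> Dict[str, List[Event]]:
    """Return, for each line name, the events that the line accepts."""
    return {line.name: [ev for ev in events if line.accepts(ev)] for line in lines}


# Made-up events, purely for demonstration.
events = [
    {"pt": 2500.0, "vertex_chi2": 3.0},
    {"pt": 400.0, "vertex_chi2": 1.0},
    {"pt": 5000.0, "vertex_chi2": 12.0},
]
loose = make_line("LooseLine", min_pt=300.0, max_chi2=15.0)
tight = make_line("TightLine", min_pt=2000.0, max_chi2=5.0)
selected = run_lines(events, [loose, tight])
```

In this sketch, two studies reuse the same code with different thresholds, which is the kind of flexibility a configurable interface is meant to provide.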

WP5 ensures that legacy data remains accessible for future analyses by maintaining the software stack and organizing (re-)Stripping campaigns. These campaigns, which occur alongside data collection or during End-of-Year periods, focus on skimming data (discarding events that fail the selection) and slimming it (discarding per-event information that is not needed), so that only the information essential for physics analysis is retained. The Stripping project’s architecture is designed to condense raw data efficiently into manageable subsets, optimizing signal retention while minimizing background. Additionally, modern organizational tools, such as GitLab Milestones, are used to monitor progress and ensure that tasks are completed on time.
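The two reduction steps can be sketched in a few lines of plain Python, assuming the usual meaning of the terms: skimming drops whole events, slimming drops unneeded fields from the events that remain. The functions `skim` and `slim` and the field names below are hypothetical and are not taken from the LHCb software.

```python
from typing import Callable, Dict, List, Set

# Hypothetical flat event record: variable name -> value.
Event = Dict[str, float]


def skim(events: List[Event], keep: Callable[[Event], bool]) -> List[Event]:
    """Skimming: keep only the events for which the predicate is True."""
    return [ev for ev in events if keep(ev)]


def slim(events: List[Event], fields: Set[str]) -> List[Event]:
    """Slimming: keep only the requested fields of each event."""
    return [{k: v for k, v in ev.items() if k in fields} for ev in events]


# Made-up records, purely for demonstration.
raw = [
    {"mass": 5280.0, "pt": 3100.0, "n_tracks": 42.0, "detector_word": 7.0},
    {"mass": 1020.0, "pt": 150.0, "n_tracks": 18.0, "detector_word": 3.0},
]
skimmed = skim(raw, lambda ev: ev["pt"] > 1000.0)   # fewer events
reduced = slim(skimmed, {"mass", "pt"})             # fewer fields per event
```

Skimming reduces the number of stored events and slimming reduces the size of each one, so the two steps compound.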

Methods

A re-Stripping campaign is structured into three primary phases: line development, validation testing, and final production approval. Historically, these campaigns have been characterized by large-scale coordination and adherence to strict timelines. The latest campaign, however, substantially modified both the workflow and the management procedures, with the aim of making campaign operations more efficient and adaptable.

Discussion

The LHCb data processing model provides a comprehensive framework for managing the vast amounts of data generated by proton-proton collisions at the LHC. The Stripping software framework plays a pivotal role in this model by filtering and selecting events of interest, thereby reducing the data volume from approximately 1 TB per second to manageable subsets suitable for physics analysis. This process involves multiple applications, including Brunel for data reconstruction and DaVinci for user-level analysis, which work in tandem to ensure efficient data handling and storage. The modular design of the software allows flexible processing and systematic updates through Stripping campaigns, which are categorized as full or incremental depending on the nature of the data and the analysis requirements.

The recent incremental re-Stripping campaigns have significantly improved data output efficiency, achieving reduction factors of 4.2 and 7 in output sizes across different datasets. The campaigns are organized as a collaboration between the Physics Working Groups (PWGs) and a central coordination team, which ensures that updated selection criteria are integrated into the data processing workflow. The transition to a YAML-based production management system and the use of GitLab for communication and validation have modernized the workflow, improving reusability and operational efficiency. Overall, the LHCb Stripping framework offers a robust and adaptable approach to managing legacy data while supporting ongoing research and development within the collaboration.
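To put the quoted reduction factors in perspective, the short sketch below converts a factor into the fraction of output size saved. It assumes that a reduction factor is the ratio of the output size before the change to the size after it; the summary does not spell out the definition, and the function names are illustrative.

```python
def reduction_factor(size_before: float, size_after: float) -> float:
    """Ratio of the old output size to the new one (assumed definition)."""
    if size_before <= 0 or size_after <= 0:
        raise ValueError("sizes must be positive")
    return size_before / size_after


def fraction_saved(factor: float) -> float:
    """Fraction of the original size that is no longer needed."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 - 1.0 / factor


# The two factors quoted in the text.
for factor in (4.2, 7.0):
    print(f"factor {factor}: {fraction_saved(factor):.1%} of the original size saved")
```

Under that assumption, factors of 4.2 and 7 correspond to savings of roughly 76% and 86% of the original output size.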