CODE-ACCORD: A Corpus of building regulatory data for rule generation towards automatic compliance checking

Journal: Scientific Data, Volume: 12, Issue: 1
DOI: https://doi.org/10.1038/s41597-024-04320-x
PMID: https://pubmed.ncbi.nlm.nih.gov/39880815
Publication Date: 2025-01-29
Author(s): Hansi Hettiarachchi et al.
Primary Topic: Occupational Health and Safety Research

Overview

The section presents an overview of the challenges and solutions related to Automatic Compliance Checking (ACC) in the Architecture, Engineering, and Construction (AEC) sector, particularly focusing on the automation of building regulation interpretation. The authors introduce CODE-ACCORD, a dataset comprising 862 self-contained sentences derived from the building regulations of England and Finland. These sentences are crucial for ACC as they express complete rules independently, thus facilitating their conversion into machine-readable formats.

To enhance the dataset’s utility, each sentence was meticulously annotated by a team of 12 annotators, identifying 4,297 entities and 4,329 relations across various categories. This curation ensures a robust ground truth that supports multiple machine learning (ML) and natural language processing (NLP) tasks, such as text classification, entity recognition, and relation extraction. By leveraging CODE-ACCORD, researchers can apply advanced techniques, including deep neural networks and large language models, to improve the efficacy of ACC in interpreting building regulations.

Introduction

The paper's introduction discusses the critical role of building codes in ensuring the safety and reliability of structures throughout their lifecycle. Traditionally, compliance with these codes has been verified manually, a process that is both time-consuming and resource-intensive. The emergence of semantically rich Building Information Models (BIMs) has spurred interest in Automatic Compliance Checking (ACC). However, a significant challenge lies in converting the text of building codes into machine-readable formats, which requires extracting the underlying information from unstructured natural language.

Early efforts to formalize building regulations into machine-readable rules involved manual processes, such as RASE tagging and object-oriented procedures, but these approaches were labor-intensive and required domain expertise. Recent advancements have shifted towards supervised learning techniques, particularly deep learning models like Bidirectional Long Short-Term Memory (Bi-LSTM) and transformers, which have shown promise in extracting information from regulatory texts. Despite these advancements, the reliance on high-quality annotated datasets remains a challenge, as existing datasets are often limited to specific domains or regions. To address these limitations, the study introduces a novel annotation strategy aimed at generalizing across various building codes. This strategy focuses on identifying entities and relations within the text, culminating in the creation of the CODE-ACCORD dataset, which comprises 862 self-contained regulatory sentences from the building codes of England and Finland, annotated for entities and relations to facilitate modern NLP applications in ACC.

Methods

The CODE-ACCORD corpus was developed through a two-stage process: data collection and data annotation. Data collection focused on extracting sentences that describe building regulations from publicly available sources in England and Finland, specifically targeting documents that facilitate the automatic generation of machine-readable rules. The English regulations were sourced from the UK Department for Levelling Up, Housing and Communities, while the Finnish regulations were obtained from the Ministry of the Environment, using English translations to create a unified corpus. The dataset draws on 33 documents: 23 from England totaling 1,548 pages and 10 from Finland covering 140 pages, addressing various aspects of building safety, health, and environmental standards.

In the data annotation phase, a team of 12 annotators with backgrounds in computer science and civil engineering focused on identifying named entities and relations within the text, which are crucial for understanding the regulatory content. The annotation was conducted using the LightTag platform, ensuring a systematic approach to capturing relevant information. The entity annotation process involved marking text spans and categorizing them into four predefined categories, designed for generalizability across different subdomains such as structure, fire safety, and accessibility. This methodology aims to enhance the automation of compliance checking through machine learning, with detailed guidelines provided in an accompanying annotation manual.
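A minimal sketch of how one annotated sentence could be represented in code, with entities as labeled character spans and relations as labeled links between entities. The class names, the example sentence, and the labels (`LABEL_A`, `REL_1`, and so on) are illustrative placeholders only; they are not the corpus's actual schema, category names, or file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Entity:
    start: int   # character offset where the span begins (inclusive)
    end: int     # character offset where the span ends (exclusive)
    label: str   # placeholder category name, not the corpus's label set


@dataclass(frozen=True)
class Relation:
    head: int    # index of the source entity in AnnotatedSentence.entities
    tail: int    # index of the target entity in AnnotatedSentence.entities
    label: str   # placeholder relation type


@dataclass
class AnnotatedSentence:
    text: str
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def span_text(self, entity: Entity) -> str:
        """Return the surface text covered by an entity span."""
        return self.text[entity.start:entity.end]

    def is_valid(self) -> bool:
        """Check that spans lie inside the text and relations point at real entities."""
        spans_ok = all(0 <= e.start < e.end <= len(self.text) for e in self.entities)
        n = len(self.entities)
        links_ok = all(0 <= r.head < n and 0 <= r.tail < n for r in self.relations)
        return spans_ok and links_ok


def make_entity(text: str, phrase: str, label: str) -> Entity:
    """Build an entity from the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Entity(start, start + len(phrase), label)


def build_example() -> AnnotatedSentence:
    # Invented sentence for illustration; it is not taken from the corpus.
    text = "The door width shall be at least 850 mm."
    entities = [
        make_entity(text, "door", "LABEL_A"),
        make_entity(text, "width", "LABEL_B"),
        make_entity(text, "850 mm", "LABEL_C"),
    ]
    relations = [Relation(0, 1, "REL_1"), Relation(1, 2, "REL_2")]
    return AnnotatedSentence(text, entities, relations)
```

Storing relations as indices into the entity list keeps each relation tied to exactly two annotated spans, which makes consistency checks such as `is_valid` straightforward.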

Discussion

In this section, the authors detail the methodology employed for collecting and processing regulatory sentences from English and Finnish building regulations. The initial dataset comprised 20,674 sentences, of which 5,043 were selected by automatic filtering based on criteria targeting quantitative, subjective, and deontic requirements. Quantitative requirements contain numerical stipulations, subjective requirements use flexible language for recommendations, and deontic requirements are normative statements about permissions and obligations. Automatic filtering was followed by manual curation to ensure the accuracy of the sentences, resulting in 1,246 self-contained sentences suitable for annotation.
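The automatic filtering step can be illustrated with a simple rule-based pass. This is a rough sketch, not the authors' implementation: the regular expressions and keyword lists below are assumptions chosen only to show the idea of keeping sentences that carry at least one quantitative, subjective, or deontic cue.

```python
import re

# Assumed cue patterns; the summary does not list the authors' actual criteria.
NUMERIC = re.compile(r"\d+(?:\.\d+)?")
DEONTIC = re.compile(r"\b(?:shall|must|may)\b", re.IGNORECASE)
SUBJECTIVE = re.compile(r"\b(?:adequate|suitable|reasonable|appropriate)\b", re.IGNORECASE)


def requirement_types(sentence: str) -> set:
    """Return the requirement types whose cue pattern appears in the sentence."""
    types = set()
    if NUMERIC.search(sentence):
        types.add("quantitative")
    if SUBJECTIVE.search(sentence):
        types.add("subjective")
    if DEONTIC.search(sentence):
        types.add("deontic")
    return types


def auto_filter(sentences: list) -> list:
    """Keep only sentences that match at least one requirement type."""
    return [s for s in sentences if requirement_types(s)]
```

A pass like this over-selects (any digit counts as a quantitative cue, for example), which is consistent with the summary's note that manual curation followed the automatic step.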

The annotation process involved identifying entities and the relations between them within the selected sentences, with relations categorized into ten types. A total of 862 sentences underwent this comprehensive annotation, yielding 4,297 entities and 4,329 relations. The authors also implemented quality control measures, including multiple annotation rounds and inter-annotator agreement (IAA) calculations, to ensure the reliability of the annotations. The final dataset, structured in machine-readable formats, is available for future research on automating compliance checking against building regulations.
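As an illustration of what an IAA calculation can look like, the sketch below computes Cohen's kappa for two annotators who labeled the same items. The summary does not state which agreement measure the authors used, so the choice of Cohen's kappa, the function name, and the toy labels are all assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels for illustration only.
annotator_1 = ["X", "X", "Y", "Y"]
annotator_2 = ["X", "X", "Y", "X"]
kappa = cohen_kappa(annotator_1, annotator_2)  # 0.5 for these toy labels
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when some labels are much more frequent than others.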