AECBench: A hierarchical benchmark for knowledge evaluation of large language models in the AEC field

Journal: Advanced Engineering Informatics, Volume: 71
DOI: https://doi.org/10.1016/j.aei.2026.104314
Publication Date: 2026-01-16
Author(s): Liang Chen et al.
Primary Topic: Natural Language Processing Techniques

Overview

This research paper introduces AECBench, a benchmark designed to evaluate the performance of large language models (LLMs) within the Architecture, Engineering, and Construction (AEC) sector. Recognizing the increasing adoption of LLMs in this safety-critical domain, the authors developed a five-level cognitive evaluation framework encompassing Knowledge Memorization, Knowledge Understanding, Knowledge Reasoning, Knowledge Calculation, and Knowledge Application. The benchmark includes 23 tasks derived from authentic AEC practices and a dataset of 4,800 questions, validated through expert review. An innovative “LLM-as-a-Judge” methodology was also established to assess complex responses using expert-defined rubrics.
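As a rough illustration of how results could be organized under the five-level framework, the sketch below groups per-question outcomes by cognitive level and reports accuracy for each level. The `Level` enum, the `Result` record, the task names, and the sample data are all hypothetical; none of them come from the AECBench code or dataset.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """The five cognitive levels, in the order the summary lists them."""
    MEMORIZATION = 1
    UNDERSTANDING = 2
    REASONING = 3
    CALCULATION = 4
    APPLICATION = 5


@dataclass(frozen=True)
class Result:
    """One graded question (hypothetical record layout)."""
    task: str
    level: Level
    correct: bool


def accuracy_by_level(results):
    """Return {Level: fraction correct} for every level that has results."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r.level] += 1
        hits[r.level] += int(r.correct)
    return {level: hits[level] / totals[level] for level in sorted(totals)}


# Made-up outcomes for illustration only; these are not AECBench results.
sample = [
    Result("code_recall", Level.MEMORIZATION, True),
    Result("code_recall", Level.MEMORIZATION, True),
    Result("table_lookup", Level.UNDERSTANDING, True),
    Result("table_lookup", Level.UNDERSTANDING, False),
    Result("load_calc", Level.CALCULATION, False),
]

for level, acc in accuracy_by_level(sample).items():
    print(f"{level.name:<14} {acc:.2f}")
```

Reporting accuracy per level rather than as one overall number is what makes a decline across levels visible, since a single aggregate would hide it.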

The evaluation of nine LLMs revealed a significant decline in performance as cognitive complexity increased, particularly in tasks requiring complex reasoning, calculations, and the generation of domain-specific documents. While LLMs excelled in memorizing knowledge, they struggled with interpreting tabular data from building codes and lacked factual accuracy in professional document generation. The study identified systematic biases in LLM performance and proposed calibration methods to enhance reliability. Future research directions include addressing the challenges of multimodal data integration, such as architectural drawing recognition, to further improve LLM capabilities in the AEC field. AECBench will be made open-source, inviting community contributions for its enhancement.
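The summary does not say which calibration method the authors propose. Purely as an illustration, and assuming the bias in question is a consistent offset between automated scores and expert scores, one generic option is to fit a linear map on a small expert-scored subset and apply it to the remaining scores. The function names and the paired scores below are hypothetical.

```python
def fit_linear_calibration(judge_scores, expert_scores):
    """Least-squares fit of: expert is approximately slope * judge + intercept."""
    n = len(judge_scores)
    if n != len(expert_scores) or n < 2:
        raise ValueError("need two or more paired scores")
    mean_j = sum(judge_scores) / n
    mean_e = sum(expert_scores) / n
    var_j = sum((j - mean_j) ** 2 for j in judge_scores)
    if var_j == 0:
        raise ValueError("judge scores must not all be identical")
    cov = sum(
        (j - mean_j) * (e - mean_e)
        for j, e in zip(judge_scores, expert_scores)
    )
    slope = cov / var_j
    intercept = mean_e - slope * mean_j
    return slope, intercept


def calibrate(score, slope, intercept, lo=0.0, hi=1.0):
    """Apply the fitted map and clamp the result to the valid score range."""
    return min(hi, max(lo, slope * score + intercept))


# Made-up paired scores: this automated scorer is consistently too generous.
judge = [0.5, 0.7, 0.9]
expert = [0.4, 0.6, 0.8]
slope, intercept = fit_linear_calibration(judge, expert)
print(round(slope, 3), round(intercept, 3))
print(round(calibrate(0.8, slope, intercept), 3))
```

A linear map can only correct a constant offset or a uniform scaling; a bias that varies by task type or answer length would need a richer model.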

Introduction

The introduction of this research paper highlights the complexities and specialized nature of the Architecture, Engineering, and Construction (AEC) industry, which necessitates extensive interdisciplinary knowledge throughout the project lifecycle. It emphasizes the interconnectedness of various domains, such as structural design, architectural aesthetics, and regulatory compliance, which collectively influence project outcomes. The paper identifies a critical gap in the evaluation of large language models (LLMs) within this high-stakes environment, underscoring the need for a standardized, domain-specific benchmark to validate their reliability and proficiency.

To address this gap, the authors present AECBench, a comprehensive benchmark designed to evaluate LLMs in real-world AEC scenarios. Key innovations include a hierarchical cognitive framework that reflects the varying cognitive demands of AEC tasks; it is structured into five levels, including specialized levels for Knowledge Reasoning and Knowledge Calculation. The authors also introduce a high-quality benchmark dataset of 4,800 curated questions across 23 evaluation tasks, chosen to be relevant to real-world scenarios and to cover diverse question formats. They further propose an automated evaluation pipeline based on an “LLM-as-a-Judge” approach, which makes assessments faster and more scalable while maintaining expert oversight. The entire AECBench resource, including code and dataset, will be released as open source to benefit the broader community.
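To make the rubric-based judging idea concrete, here is a minimal sketch of one way such a pipeline could be wired together. It is not the authors' implementation: the `Criterion` record, the prompt wording, the weighted YES/NO scoring rule, and the `stub_judge` function (which stands in for a real model call) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric item written by a domain expert (hypothetical layout)."""
    description: str
    weight: float


def build_prompt(question: str, answer: str, criterion: Criterion) -> str:
    """Ask the judge whether a single criterion is met."""
    return (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Criterion: {criterion.description}\n"
        "Reply with YES or NO."
    )


def judge_score(
    question: str,
    answer: str,
    rubric: Sequence[Criterion],
    judge: Callable[[str], str],
) -> float:
    """Weighted fraction of rubric criteria the judge marks as satisfied."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    earned = 0.0
    for c in rubric:
        verdict = judge(build_prompt(question, answer, c))
        if verdict.strip().upper().startswith("YES"):
            earned += c.weight
    return earned / total


def stub_judge(prompt: str) -> str:
    """Stand-in for a model call: YES if the quoted keyword is in the answer."""
    fields = dict(
        line.split(": ", 1) for line in prompt.splitlines() if ": " in line
    )
    keyword = fields["Criterion"].split("'")[1]
    return "YES" if keyword in fields["Answer"].lower() else "NO"


# Hypothetical rubric and answer, for illustration only.
rubric = [Criterion("mentions 'load'", 2.0), Criterion("mentions 'fire'", 1.0)]
score = judge_score(
    "How is the beam verified?",
    "The beam load is checked against the design limit.",
    rubric,
    stub_judge,
)
print(f"{score:.3f}")
```

Passing the judge in as a plain callable keeps the scoring logic testable without network access; a real deployment would replace `stub_judge` with a call to whichever model serves as the judge.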

Discussion

The discussion section highlights the integration of LLMs across the AEC lifecycle, noting a transition from basic query handling to sophisticated workflows. Across phases such as design, structural engineering, construction, and operation, LLMs are employed for tasks ranging from automating building energy models to enhancing safety management through accident report analysis. However, current evaluations of LLMs are limited to specific scenarios, and no standardized benchmark exists to assess their capabilities across diverse AEC applications.

To address this gap, the paper emphasizes the need for domain-specific evaluation benchmarks that can provide quantitative metrics for LLM performance in the AEC sector. Existing benchmarks from other fields, such as law and medicine, are deemed insufficient due to the unique multidisciplinary nature of AEC. Initial attempts to create benchmarks, such as aice.AEC-Bench and Civil-Eval, have focused on licensure examination questions but remain limited in scope and fail to encompass the complexity of real-world AEC tasks. The proposed hierarchical evaluation framework adapts Bloom’s Taxonomy to the AEC context, introducing levels that assess knowledge memorization, understanding, reasoning, calculation, and application. This framework aims to capture the cognitive demands of AEC tasks, ultimately facilitating a more comprehensive evaluation of LLMs tailored to the industry’s specific challenges.