Multimodal AI for Yuan Buddhist sculpture chronology and style

Journal: npj Heritage Science, Volume: 13, Issue: 1
DOI: https://doi.org/10.1038/s40494-025-01994-3
Publication Date: 2025-09-05
Author(s): Jia Xing et al.
Primary Topic: Conservation Techniques and Studies

Overview

This section outlines the urgent need for advanced methods to analyze and preserve grotto sculptures, particularly in light of climate change and a shrinking pool of domain experts. It introduces ChronoStyleNet (CSN), a pioneering multimodal large model designed specifically for sculptural heritage and trained on 295 expert-annotated statues from the Feilai Peak Grottoes and 2.46 GB of archaeological literature. Evaluated on Yuan-dynasty samples within a six-dimensional, ontology-aligned framework, CSN outperforms five mainstream multimodal large language models in stylistic recognition and cultural interpretation.

The paper emphasizes the significance of grotto sculptures as vital records of historical and cultural narratives, particularly from the Yuan dynasty, when a distinctive Sino-Tibetan syncretic style emerged. The field, however, faces a generational gap among experts and the limits of traditional classification methods, which depend heavily on trained archaeologists. The ongoing deterioration of grotto sites makes timely documentation and analysis necessary, underscoring the potential of domain-adapted AI such as CSN to bridge expertise gaps and strengthen conservation practice in vulnerable heritage contexts. The work sets a scalable benchmark for applying AI to cultural heritage and offers a replicable methodology for preserving endangered monuments worldwide.

Introduction

The study examines how well CSN recognizes and analyzes Yuan-dynasty grotto statues, testing both in-domain and out-of-domain generalization. The in-domain set consists of 18 statues from the Feilai Peak area with consistent stylistic features; the out-of-domain set comprises 4 statues from other Hangzhou grottoes that share similar styles but were not part of the training set. Visual heatmaps and score comparisons show that CSN scores markedly higher on in-domain samples than on out-of-domain samples across all evaluation dimensions, indicating strong adaptation to the geographical context of its training data. Its weaker out-of-domain performance, however, shows that its generalization capabilities still need to be improved.

In contrast, mainstream models such as GPT-4o and Claude 3.5 show more homogeneous performance, with low variance in evaluation scores across samples. This uniformity appears as solid color blocks in the heatmaps and reflects their limited ability to discern subtle regional differences and complex stylistic variation in the out-of-domain samples. The statistical results, detailed in Supplementary Table 4, confirm CSN's stronger expressive ability in familiar contexts while highlighting the difficulty it has adapting to unfamiliar data.
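Both observations in the Introduction reduce to simple statistics over a table of per-sample scores: the in-domain versus out-of-domain gap is a difference of means per dimension, and the "solid color block" pattern is a near-zero variance across samples. The Python sketch below computes both. It is an illustration under stated assumptions: the short dimension labels, the per-sample dictionary layout, the variance cut-off of 0.1, and all model names and numbers in the usage lines are hypothetical, since this summary does not give the paper's scoring scale or data format.

```python
from statistics import mean, pvariance

# Short labels for the six evaluation dimensions (labels are illustrative).
DIMENSIONS = (
    "period", "style", "visual_detail",
    "terminology", "cultural_interpretation", "coherence",
)

def domain_gap(in_domain, out_of_domain):
    """Mean in-domain score minus mean out-of-domain score, per dimension.

    Each argument is a list of per-sample dicts keyed by DIMENSIONS.
    A positive value means the model scored higher on in-domain samples.
    """
    return {
        dim: mean(s[dim] for s in in_domain) - mean(s[dim] for s in out_of_domain)
        for dim in DIMENSIONS
    }

def score_spread(scores_by_model):
    """Population variance of each model's scores across samples.

    A variance near zero means every sample received about the same score,
    which renders as a single solid color block in a heatmap row.
    """
    return {model: pvariance(scores) for model, scores in scores_by_model.items()}

def flag_homogeneous(scores_by_model, threshold=0.1):
    """Models whose score variance is below a hypothetical cut-off."""
    spread = score_spread(scores_by_model)
    return sorted(model for model, var in spread.items() if var < threshold)

# Usage with toy numbers (not data from the paper):
in_domain = [{d: 4 for d in DIMENSIONS}, {d: 5 for d in DIMENSIONS}]
out_of_domain = [{d: 3 for d in DIMENSIONS}]
print(domain_gap(in_domain, out_of_domain)["style"])  # 1.5
print(flag_homogeneous({"model_a": [3, 3, 3, 3], "model_b": [5, 2, 4, 1]}))  # ['model_a']
```

Population variance is used because the score table is treated as the complete set of evaluated samples rather than a sample from a larger population; with only 4 out-of-domain statues, any such statistic should be read cautiously.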

Methods

The “Methods” section outlines how CSN was built and evaluated. The model was trained on 295 expert-annotated statues from the Feilai Peak Grottoes, selected for their state of preservation from a dataset of 345 grotto sculptures, together with 2.46 GB of archaeological literature.

Evaluation used a six-dimensional, ontology-aligned framework covering historical period identification, stylistic classification, visual detail description, terminological accuracy, cultural interpretation, and linguistic coherence, under which CSN was compared with five general-purpose multimodal models. Generalization was tested on 18 in-domain statues from the Feilai Peak area and 4 out-of-domain statues from other Hangzhou grottoes that were not part of the training set; the resulting scores are visualized as heatmaps, and the statistical comparison is reported in Supplementary Table 4.

Results

The results highlight significant limitations in how CSN recognizes and interprets Yuan-dynasty Buddhist statues. Although the model generally identifies the period of a statue correctly, it often falls back on vague descriptors such as “complex decoration” and “delicate carving” and omits critical artistic features such as the “five-leafed crown” and the “royal ease pose (lalitasana).” This lack of specificity undermines the model's credibility and analytical depth, as shown by its failure to recognize the doctrinally significant elements of the Vairocana statue that are essential to understanding its religious identity.
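The gap between generic and specific vocabulary described here can be screened for mechanically. The Python sketch below counts, in a model's output, the generic descriptors and the specific iconographic terms quoted in this summary, and flags outputs that use only the generic ones. The two phrase lists are hypothetical and far from a complete lexicon, and plain case-insensitive substring matching is a simplifying assumption that misses paraphrases and variant spellings (for example "lalitāsana" with a macron); it is a rough screening aid, not the terminological-accuracy scoring used in the study.

```python
# Hypothetical phrase lists built from the examples quoted in this summary;
# a real lexicon would need many more terms and their variant spellings.
VAGUE_PHRASES = ("complex decoration", "delicate carving")
SPECIFIC_TERMS = ("five-leafed crown", "royal ease pose", "lalitasana")

def audit_description(text):
    """Count generic descriptors and specific iconographic terms in text.

    Matching is case-insensitive substring counting (str.count), so
    paraphrases and alternative spellings are not detected.
    """
    lowered = text.lower()
    vague = {p: lowered.count(p) for p in VAGUE_PHRASES}
    specific = {t: lowered.count(t) for t in SPECIFIC_TERMS}
    return {
        "vague": vague,
        "specific": specific,
        # True when the output leans on generic wording with no specific term.
        "vague_only": sum(vague.values()) > 0 and sum(specific.values()) == 0,
    }

report = audit_description(
    "It can be seen from the complex decoration that this is the style of the Yuan Dynasty"
)
print(report["vague_only"])  # True
```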

Moreover, the CSN model exhibits deficiencies in logical reasoning and expression, leading to unclear connections between visual details and stylistic judgments. For instance, statements like “It can be seen from the complex decoration that this is the style of the Yuan Dynasty” lack a coherent rationale, resulting in outputs that, while formally complete, lack academic rigor. Additionally, the model struggles with the interpretation of symbolism, particularly in Tibetan Buddhist statues. It often misidentifies or overlooks key symbolic elements, as seen in the case of Zambala and Mahākāla, where the model fails to recognize essential tantric symbols and misinterprets visual features, thereby distorting the cultural context. These findings underscore the need for improved interpretability and accuracy in the CSN model’s outputs, as illustrated in the comparative visualizations presented in the study.

Discussion

The research paper discusses the development and evaluation of ChronoStyleNet (CSN), a multimodal model tailored to identifying and interpreting grotto sculptures, with a specific focus on Yuan-dynasty artworks from Feilai Peak in Hangzhou. The dataset comprises 345 grotto sculptures, of which 295 well-preserved samples were selected for analysis, representing a substantial share of the statues documented in the West Lake region. The study employs a six-dimensional evaluation framework that assesses CSN's performance in historical period identification, stylistic classification, visual detail description, terminological accuracy, cultural interpretation, and linguistic coherence. Results indicate that CSN outperforms five general-purpose multimodal models in several dimensions, particularly historical recognition and cultural interpretation, demonstrating its potential as an assistive tool for heritage research.
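The claim that CSN outperforms the general-purpose models "in several dimensions" amounts to comparing mean scores model by model within each of the six dimensions. The Python sketch below performs that comparison. The dimension keys follow the six dimensions listed in the paragraph above; the nested data layout, the alphabetical tie-breaking rule, and the model names and scores in the usage lines are hypothetical and not taken from the paper.

```python
from statistics import mean

# The six evaluation dimensions named in the summary.
DIMENSIONS = (
    "historical_period", "stylistic_classification", "visual_detail",
    "terminological_accuracy", "cultural_interpretation", "linguistic_coherence",
)

def best_model_per_dimension(scores):
    """scores[model][dimension] is a list of per-sample scores.

    Returns {dimension: model with the highest mean score}. Ties go to the
    model name that sorts first, so the result is deterministic.
    """
    winners = {}
    for dim in DIMENSIONS:
        means = {model: mean(by_dim[dim]) for model, by_dim in scores.items()}
        winners[dim] = min(means, key=lambda m: (-means[m], m))
    return winners

def dimensions_won(scores, model):
    """Number of dimensions in which `model` has the highest mean score."""
    return sum(1 for w in best_model_per_dimension(scores).values() if w == model)

# Usage with toy numbers (not data from the paper):
toy = {
    "domain_model": {d: [4, 4] for d in DIMENSIONS},
    "general_model": {d: [3, 3] for d in DIMENSIONS},
}
toy["domain_model"]["linguistic_coherence"] = [2, 2]
print(dimensions_won(toy, "domain_model"))  # 5
```

Counting wins per dimension keeps the six dimensions separate instead of collapsing them into one weighted total, which matches how the summary reports strengths (historical recognition, cultural interpretation) and weaknesses (descriptive richness) dimension by dimension.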

Despite its strengths, CSN exhibits limitations, notably in generalization across unfamiliar styles and in the richness of descriptive details. The authors propose a three-phase data expansion plan to enhance the model’s applicability across different regions and dynasties, alongside the introduction of structured symbolic annotations to deepen its understanding of the cultural significance embedded in visual details. Future iterations aim to improve output expressivity and clarity through advanced generation strategies and enhanced prompt designs. Overall, CSN represents a significant advancement in the intersection of AI and cultural heritage, with implications for digital conservation and public education, while also highlighting areas for further development.