The construction and refined extraction techniques of knowledge graph based on large language models

Journal: Scientific Reports, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41598-026-38066-w
PMID: https://pubmed.ncbi.nlm.nih.gov/41667618
Publication Date: 2026-02-10
Author(s): Li Peng et al.
Primary Topic: Advanced Graph Neural Networks

Overview

The paper presents a framework for constructing and refining specialized knowledge graphs to strengthen decision-support systems, addressing the limits of conventional knowledge management in handling domain-specific information. The authors note that general-purpose large language models (LLMs) often misinterpret technical parameters and operational guidelines, because this material is highly specialized and drawn from diverse sources. To address this, the framework fine-tunes LLMs on domain-specific datasets, improving their grasp of complex terminology and semantic nuance. It also introduces a multimodal knowledge integration pipeline that combines rule-based systems with ontological structures to extract and link entities from varied data sources, yielding an adaptive knowledge network.

Experimental results indicate that the fine-tuned model significantly outperforms general-purpose LLMs in relation extraction accuracy, while the constructed knowledge graph performs strongly on semantic coherence and operational reasoning. The study concludes that the approach effectively integrates knowledge graphs with domain-adaptive LLMs, paving the way for improved knowledge management in specialized domains. Future work aims to extend the knowledge graph to more scenarios and additional data types, increasing its usefulness in decision-making. The findings underscore the framework’s potential for fostering cross-functional collaboration and improving operational efficiency in complex environments.

Introduction

The introduction presents a hierarchical LoRA adaptation framework for parameter-efficient fine-tuning of large language models (LLMs) in domain-specific applications, particularly complex battlefield environments. The framework draws on multi-source corpora to build prior knowledge for the model's inputs and optimizes the parameters of each hierarchical module using task-specific data labels. It comprises three main clusters of LoRA modules (BM-LoRA, TL-LoRA, and TA-LoRA), with distinct functions such as semantic network formation, operational rule embedding, and dynamic task adaptation. A Knowledge Routing Network (KRN) orchestrates these modules, supporting real-time decision-making through efficient parameter management and lightweight pruning.
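
As a rough illustration of how a routing network might combine several LoRA adapters, the sketch below adds a gate-weighted sum of low-rank updates to a frozen base layer. It is not the paper's implementation: the softmax gating, the fixed router logits, the toy matrix sizes, and all function names are assumptions made for this example. Only the three module names come from the text above.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lora_delta(adapter, x):
    """Standard LoRA update (alpha / r) * B @ (A @ x)."""
    rank = len(adapter["A"])              # A has shape (r, d_in)
    low = matvec(adapter["A"], x)         # project down to rank r
    up = matvec(adapter["B"], low)        # project back up to d_out
    return [adapter["alpha"] / rank * u for u in up]


def routed_forward(base_weight, adapters, router_logits, x):
    """Frozen base layer plus a gate-weighted sum of adapter updates."""
    gates = softmax(router_logits)        # hypothetical stand-in for the KRN
    out = matvec(base_weight, x)
    for gate, adapter in zip(gates, adapters):
        delta = lora_delta(adapter, x)
        out = [o + gate * d for o, d in zip(out, delta)]
    return out, gates


# Toy 2x2 base layer and three rank-1 adapters (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    {"A": [[1.0, 0.0]], "B": [[0.5], [0.0]], "alpha": 1.0},  # stands in for BM-LoRA
    {"A": [[0.0, 1.0]], "B": [[0.0], [0.5]], "alpha": 1.0},  # stands in for TL-LoRA
    {"A": [[1.0, 1.0]], "B": [[0.1], [0.1]], "alpha": 1.0},  # stands in for TA-LoRA
]
output, gates = routed_forward(W, adapters, [0.0, 0.0, 0.0], [1.0, 2.0])
```

With equal router logits each adapter receives the same gate. In a real system the logits would presumably depend on the input or the task, which is what lets the router favor one module cluster over another.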

The study also emphasizes the importance of a high-quality, structured training corpus for LLMs in high-security domains. It outlines a data collection strategy covering command logs, equipment documentation, and simulation data, all processed to ensure reliability and consistency. The proposed framework for domain knowledge graph construction strengthens knowledge extraction through a layered, rule-driven approach that integrates heterogeneous data sources to meet the multidimensional knowledge needs of intelligent decision-making systems. The framework is validated through automated quality verification of the knowledge graph, which shows a high reliability rate for the extracted knowledge and provides a foundation for future applications in tactical decision support.

Methods

The methods section describes a controlled experiment that evaluates how desensitization affects knowledge graph quality and model performance. The experiment uses desensitized and non-desensitized versions of datasets for knowledge question answering, tactical planning, and threat assessment. Metrics include standard task performance indicators (e.g., BERTScore, Kendall’s Tau) and two new ones: a Privacy Score, assessed through k-anonymity and l-diversity, and an Information Retention Rate, which measures how well critical semantic elements survive desensitization. The fine-tuned LLM (DeepSeek-R1 70B LoRA) is trained and evaluated on both dataset variants, and the performance differences are analyzed to quantify any degradation caused by desensitization.
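
To make the privacy metrics concrete, the following sketch computes the k-anonymity and (distinct) l-diversity levels of a small table, along with a simple retention ratio. The record fields, the choice of quasi-identifiers, and the retention formula are assumptions for illustration. The summary does not say how the paper defines its Privacy Score or Information Retention Rate.

```python
from collections import defaultdict


def privacy_levels(records, quasi_identifiers, sensitive):
    """Return (k, l) for a table.

    k: size of the smallest group sharing the same quasi-identifier values.
    l: fewest distinct sensitive values found in any such group.
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record[sensitive])
    k_anon = min(len(values) for values in groups.values())
    l_div = min(len(set(values)) for values in groups.values())
    return k_anon, l_div


def retention_rate(critical_terms, desensitized_text):
    """Share of critical terms still present after desensitization.

    Hypothetical definition, used here only as an example.
    """
    kept = sum(1 for term in critical_terms if term in desensitized_text)
    return kept / len(critical_terms)


# Made-up records: "region" and "unit_type" act as quasi-identifiers,
# "mission" as the sensitive attribute.
records = [
    {"region": "R1", "unit_type": "armor", "mission": "recon"},
    {"region": "R1", "unit_type": "armor", "mission": "escort"},
    {"region": "R2", "unit_type": "infantry", "mission": "recon"},
    {"region": "R2", "unit_type": "infantry", "mission": "patrol"},
    {"region": "R2", "unit_type": "infantry", "mission": "recon"},
]
k_anon, l_div = privacy_levels(records, ["region", "unit_type"], "mission")
rate = retention_rate(["radar", "range", "altitude"], "radar range of [MASKED] km")
```

In this toy table the smallest group has two records and every group contains two distinct missions, so the table is 2-anonymous and 2-diverse. Two of the three critical terms survive the masking.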

Additionally, a multi-level confidence evaluation framework is introduced to assess the reliability of knowledge graph triplets. This framework evaluates triplet credibility through entity-level, relationship-level, and global confidence metrics, employing a multilayer perceptron (MLP) to output a final confidence score. The results indicate that while non-desensitized data slightly outperforms desensitized data across various tasks, the differences are minimal (1-2%), suggesting that desensitization techniques effectively maintain task efficacy while significantly enhancing privacy protections. The desensitized data meets stringent privacy criteria, achieving k-anonymity and l-diversity thresholds that substantially reduce re-identification risks. Overall, the findings emphasize the balance between data security and functional performance, making the desensitized knowledge graph framework suitable for deployment in sensitive environments. Future research will aim to optimize desensitization for dynamic data streams to improve real-time adaptability.
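
The multi-level confidence idea can be sketched as a tiny multilayer perceptron that maps three per-triple scores to one value between 0 and 1. The layer sizes, the weights, and the function names below are invented for the example. A real model would learn its parameters from labeled triples, and the paper's architecture may differ.

```python
import math


def dense(weights, bias, inputs):
    """Fully connected layer: one output per weight row."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def sigmoid(value):
    """Squash a logit into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def triple_confidence(entity_conf, relation_conf, global_conf, params):
    """Combine entity-level, relation-level, and global scores into one score."""
    features = [entity_conf, relation_conf, global_conf]
    hidden = relu(dense(params["W1"], params["b1"], features))
    logit = dense(params["W2"], params["b2"], hidden)[0]
    return sigmoid(logit)


# Hypothetical fixed parameters: 3 inputs -> 2 hidden units -> 1 output.
PARAMS = {
    "W1": [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]],
    "b1": [-1.5, 0.0],
    "W2": [[2.0, -1.0]],
    "b2": [0.0],
}

strong = triple_confidence(0.9, 0.9, 0.9, PARAMS)
weak = triple_confidence(0.2, 0.2, 0.2, PARAMS)
```

With these illustrative weights, a triple whose three component scores are all high receives a higher final score than one whose scores are all low. A downstream threshold on that score would then separate high-confidence triples from the rest.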

Results

The results section reports on the construction of a domain-specific knowledge graph with a hybrid framework that integrates fine-tuned large language models (LLMs) and multimodal data processing. The graph comprises approximately 1.2 million entities and 3.5 million relationships, covering critical aspects such as tactical operations and equipment specifications. Evaluation indicates high structural accuracy and semantic coherence, with an average node degree of 5.8 and a clustering coefficient of 0.67, properties that support efficient real-time decision-making.
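
The reported structural figures can be checked with a short calculation. If each relationship is counted as an undirected edge (an assumption, since the summary does not say how degree was computed), the average degree is twice the edge count divided by the node count, which agrees with the reported 5.8 for the rounded entity and relationship totals. The sketch also shows how an average clustering coefficient is computed on a toy graph. The toy graph is invented and is unrelated to the paper's data.

```python
def average_degree(num_nodes, num_edges):
    """Mean degree of an undirected graph: every edge touches two nodes."""
    return 2 * num_edges / num_nodes


def local_clustering(adjacency, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = list(adjacency[node])
    count = len(neighbours)
    if count < 2:
        return 0.0
    links = sum(
        1
        for i in range(count)
        for j in range(i + 1, count)
        if neighbours[j] in adjacency[neighbours[i]]
    )
    return 2 * links / (count * (count - 1))


def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)


# Rounded totals from the text: about 1.2 million entities, 3.5 million relations.
reported = average_degree(1_200_000, 3_500_000)

# Toy graph: a triangle a-b-c with one extra node d attached to a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
toy_clustering = average_clustering(toy)
```

Here `reported` rounds to 5.8. In the toy graph, nodes b and c each have a local coefficient of 1, node a has 1/3, and node d has 0, so the average is 7/12.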

An ablation study further assesses the contributions of various components within the framework, revealing that the complete model outperforms all ablated variants across three tasks: knowledge question answering, tactical planning, and threat assessment. Notably, the removal of the Task-Adaptive LoRA module resulted in the most significant performance decline in threat assessment, while the absence of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) prompting also led to marked reductions in knowledge question answering and tactical planning, respectively. The analysis of high-confidence triples generated by each variant underscores the importance of the integrated approach, with the full model achieving 91.3% high-confidence triples compared to significantly lower percentages for the ablated versions. These findings validate the necessity of each component in enhancing the reliability and structural integrity of the knowledge graph, thereby supporting mission-critical decision-making processes effectively.

Discussion

The discussion section highlights the transformative role of large language models (LLMs) in automated knowledge graph (KG) construction. Traditional methods, which rely on manual annotation and rule-based systems, are often costly and scale poorly. In contrast, LLMs such as GPT-4 and LLaMA draw on extensive pre-training to extract structured knowledge from unstructured text, showing superior performance in tasks such as relation extraction and entity recognition. For example, frameworks like REBEL generate entity-relation triples without predefined ontologies, achieving significantly higher coverage than traditional approaches. Specialized applications in domains such as healthcare and finance further illustrate how LLMs can build KGs tailored to specific needs, although challenges remain in high-security contexts, where domain-specific extraction is still developing.
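
Generative extractors typically emit triples as a flat string that must be parsed back into structured form. The sketch below parses a REBEL-style linearization in which `<triplet>` introduces a head entity, `<subj>` a tail entity, and `<obj>` the relation label. The marker layout is assumed here for illustration, and the sample sentence and the `Triple` type are hypothetical.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def parse_linearized(text: str) -> List[Triple]:
    """Parse '<triplet> head <subj> tail <obj> relation ...' into triples.

    A head may be followed by several '<subj> tail <obj> relation' groups.
    """
    triples: List[Triple] = []
    parts = {"head": [], "tail": [], "relation": []}
    state = None

    def flush() -> None:
        # Emit a triple only when all three slots have been filled.
        if all(parts.values()):
            triples.append(
                Triple(
                    " ".join(parts["head"]),
                    " ".join(parts["relation"]),
                    " ".join(parts["tail"]),
                )
            )

    for token in text.split():
        if token == "<triplet>":
            flush()
            parts["head"], parts["tail"], parts["relation"] = [], [], []
            state = "head"
        elif token == "<subj>":
            flush()
            parts["tail"], parts["relation"] = [], []
            state = "tail"
        elif token == "<obj>":
            parts["relation"] = []
            state = "relation"
        elif state is not None:
            parts[state].append(token)
    flush()
    return triples


sample = (
    "<triplet> radar unit <subj> air defense battery <obj> part of "
    "<subj> 40 km <obj> detection range"
)
triples = parse_linearized(sample)
```

Because no ontology constrains the relation labels, whatever text follows `<obj>` becomes the relation. This openness is what allows broad coverage, and it is also why a later confidence or normalization step is useful.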

The section also details the construction of a high-quality, multi-source training corpus designed to support the deployment of LLMs in tactical decision-making and threat assessment. This corpus integrates diverse data types, including tactical communication logs and battlefield simulation data, structured to facilitate multi-task adaptation. A stringent desensitization process ensures data privacy, while an enhanced nested JSON architecture supports efficient training across various tasks. The study emphasizes the importance of a well-structured dataset in enhancing the model’s generalization capabilities and operational efficiency, ultimately aiming to improve decision-making in intelligent systems. The hybrid construction method for KGs combines LLM-based extraction with hierarchical rule-driven approaches, ensuring accurate and contextually relevant knowledge representation.
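
A nested JSON record that serves several tasks, together with a desensitization pass over it, might look like the sketch below. The field names, the sample text, and the masking patterns are all hypothetical, because the summary does not describe the actual schema or the desensitization rules.

```python
import json
import re

# Hypothetical multi-task record: one source passage, several task views.
RECORD = {
    "source": "tactical_communication_log",
    "content": {"text": "Unit Alpha-7 reports contact at grid 4512 8830."},
    "tasks": {
        "knowledge_qa": {
            "question": "Which unit reported contact?",
            "answer": "Unit Alpha-7",
        },
        "threat_assessment": {"label": "medium"},
    },
}

# Illustrative masking rules: (pattern, replacement).
PATTERNS = [
    (re.compile(r"\bUnit [A-Z][a-z]+-\d+\b"), "[UNIT]"),
    (re.compile(r"\bgrid \d{4} \d{4}\b"), "grid [COORD]"),
]


def desensitize(value):
    """Recursively mask sensitive spans in every string of a nested structure."""
    if isinstance(value, str):
        for pattern, mask in PATTERNS:
            value = pattern.sub(mask, value)
        return value
    if isinstance(value, dict):
        return {key: desensitize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [desensitize(item) for item in value]
    return value  # numbers, booleans, and None pass through unchanged


clean = desensitize(RECORD)
serialized = json.dumps(clean, ensure_ascii=False)
```

The function returns a new structure and leaves the original record untouched, so desensitized and non-desensitized variants of the same corpus can be kept side by side for the kind of comparison described in the methods section.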