Mechanistic indicators of understanding in large language models

Journal: Philosophical Studies, Volume: 183, Issue: 6
DOI: https://doi.org/10.1007/s11098-026-02513-1
Publication Date: 2026-04-17
Author(s): Pierre Beckmann et al.
Primary Topic: Explainable Artificial Intelligence (XAI)

Overview

The paper examines how large language models (LLMs) should be understood in light of mechanistic interpretability (MI), challenging the notion that LLMs merely mimic linguistic patterns without true comprehension. The authors propose a tiered framework that distinguishes three hierarchical forms of understanding according to their computational organization: (1) **conceptual understanding**, where models form features in latent space and connect diverse representations of a single entity; (2) **state-of-the-world understanding**, where models learn factual connections and track changes; and (3) **principled understanding**, where models establish compact circuits linking facts beyond mere memorization. This framework integrates philosophical theory with mechanistic evidence, allowing for a nuanced account of AI understanding that acknowledges both similarities to and divergences from human cognition.

In conclusion, the authors assert that LLMs possess internal structures analogous to aspects of human understanding, particularly in recognizing connections among entities, concepts, and principles. They emphasize that although LLMs exhibit forms of understanding, their mechanisms differ significantly from human cognition, notably in lacking the drive for parsimony. This prompts a shift from binary debates about whether AI understands to a more detailed investigation of the specific types of understanding LLMs achieve, their failure modes, and the implications for epistemic trust. Ultimately, this approach fosters a new comparative epistemology that is both transparent and distinct from human cognitive processes.

Introduction

The paper's introduction surveys the capabilities of large language models (LLMs) and asks whether their performance reflects genuine understanding or merely statistical mimicry of human intelligence. The prevailing deflationary view holds that LLMs operate much like traditional n-gram models, relying on superficial statistical patterns to predict the next token in a sequence without any true comprehension of the underlying concepts. However, recent advances in mechanistic interpretability (MI) challenge this perspective, revealing that LLMs may exhibit sophisticated internal mechanisms that extend beyond mere statistical correlations.
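To make the deflationary picture concrete, the following is a minimal sketch of the kind of n-gram predictor that view compares LLMs to: a bigram model that predicts the next token purely from co-occurrence counts. The toy corpus and the function names `train_bigram` and `predict_next` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent continuation of `prev`, or None if unseen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]


# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))  # most frequent token seen after "the"
print(predict_next(model, "dog"))  # "dog" never occurs, so nothing to predict
```

A model of this kind holds nothing beyond its count table, so it has no basis for handling a context it has never seen. That is the behavior against which the MI findings discussed in the paper are contrasted.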

The authors propose a hierarchical framework for understanding LLM cognition, suggesting that while LLMs employ low-tier mechanisms devoid of understanding, they can also develop higher-tier cognitive structures that warrant a more nuanced attribution of understanding. They identify three distinct forms of understanding that LLMs can achieve: (1) **conceptual understanding**, where models form internal representations akin to human concepts; (2) **state-of-the-world understanding**, which involves recognizing empirical connections between features; and (3) **principled understanding**, characterized by grasping underlying principles that unify various facts. This framework not only reconciles the coexistence of low-tier and high-tier mechanisms but also emphasizes the potential for LLMs to possess a form of understanding that parallels human cognition, thereby enriching the discourse on AI understanding.

Results

Research on Othello-GPT, a language model based on GPT-2, shows that it dynamically updates internal representations of the game state in Othello, despite being trained solely on game transcripts without direct exposure to the board or its rules. Othello-GPT consistently predicts legal moves for the current board configuration, even for sequences withheld from its training data, indicating that it does not merely rely on memorized sequences but maintains an internal representation of the board state. This internal representation, referred to as an “emergent world model,” was investigated with probing techniques, which showed that the model encodes the state of each of the 64 squares on the board. High probe accuracy was first achieved with non-linear probes, and later with linear probes once square states were expressed relative to the current player (the player's own pieces versus the opponent's) rather than by absolute color.
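The role of relative states in linear probing can be illustrated with a toy sketch. It is not a reproduction of the Othello-GPT experiments: the activations are synthetic, and the assumption built into the data is that a single direction encodes "mine" versus "yours" for one square. All names and constants (`DIM`, `make_example`, `train_probe`, the noise level) are hypothetical. Under that assumption, a linear probe recovers the relative label almost perfectly, while the same probe trained on absolute color stays near chance, because color flips with whose turn it is.

```python
import math
import random

random.seed(0)
DIM = 8  # hypothetical activation width, far smaller than a real model's

# Assumed ground truth for the toy data: one fixed unit direction encodes
# whether a square holds the current player's piece (+1) or the opponent's (-1).
raw = [random.gauss(0, 1) for _ in range(DIM)]
norm = math.sqrt(sum(v * v for v in raw))
direction = [v / norm for v in raw]


def make_example():
    """Return (activation, relative_label, absolute_label) for one position."""
    relative = random.choice([1, -1])  # +1 = "mine", -1 = "yours"
    to_move = random.choice([1, -1])   # +1 = black to move, -1 = white to move
    absolute = relative * to_move      # +1 = black piece, -1 = white piece
    activation = [2.0 * relative * d + random.gauss(0, 0.3) for d in direction]
    return activation, relative, absolute


def train_probe(xs, ys, epochs=5, lr=0.1):
    """Fit a logistic-regression probe (weights plus bias) with plain SGD."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            p = 1.0 / (1.0 + math.exp(-z))
            err = (1.0 if y == 1 else 0.0) - p
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of examples whose predicted sign matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += (1 if z > 0 else -1) == y
    return hits / len(xs)


train = [make_example() for _ in range(400)]
test = [make_example() for _ in range(400)]
x_train = [a for a, _, _ in train]
x_test = [a for a, _, _ in test]

rel_probe = train_probe(x_train, [r for _, r, _ in train])
abs_probe = train_probe(x_train, [c for _, _, c in train])

acc_relative = accuracy(*rel_probe, x_test, [r for _, r, _ in test])
acc_absolute = accuracy(*abs_probe, x_test, [c for _, _, c in test])
print(f"relative-label probe: {acc_relative:.2f}")
print(f"absolute-label probe: {acc_absolute:.2f}")
```

The probe is deliberately the simplest linear classifier available. If a linear readout succeeds, the information is present in the activations in a directly usable form, which is a stronger result than success with a non-linear probe that could compute the answer itself.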

Further analyses revealed that Othello-GPT’s internal representation aligns with the spatial structure of the board, suggesting a structural isomorphism between the model’s encoding and the actual game layout. This finding counters skepticism regarding the model’s understanding of the board state: the model not only reconstructs the board state from move sequences but does so in a manner intelligible to human players. However, the study also highlights limitations, such as “end-game degradation,” where the model relies less on its internal representation as the game progresses. This suggests that while LLMs like Othello-GPT can maintain dynamic state-of-the-world understanding, they may switch to simpler heuristics when computational efficiency is favored, so their tracking capabilities vary with context.

Discussion

In this section, the authors explore the concept of “conceptual understanding” in large language models (LLMs), particularly how these models develop internal representations that allow them to recognize and track entities and properties across varied textual descriptions. They argue that, akin to human conceptualization, LLMs form internal features that unify diverse manifestations of entities, enabling them to effectively “same-track” these entities. The discussion is grounded in mechanistic interpretability (MI) research, which provides evidence that next-token prediction tasks incentivize the emergence of such features. The authors introduce the linear representation hypothesis (LRH), which posits that features are encoded as directions in the model’s latent space. This allows multiple features to be represented simultaneously through superposition, despite the model’s finite resources.
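The idea of features as directions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' method: the three feature names and their hand-picked directions in a 3-dimensional space are hypothetical, and the directions are chosen to be orthonormal so that readout is exact. A latent vector is built as a weighted sum of feature directions, and each feature's strength is read back by projecting onto its direction.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical, hand-picked feature directions in a 3-dimensional latent space.
# They are orthonormal but deliberately not aligned with single coordinates,
# so no individual "neuron" corresponds to a feature.
directions = {
    "is_plural": [0.6, 0.8, 0.0],
    "is_past_tense": [-0.8, 0.6, 0.0],
    "is_proper_noun": [0.0, 0.0, 1.0],
}


def encode(strengths):
    """Build a latent vector as a weighted sum of feature directions."""
    latent = [0.0, 0.0, 0.0]
    for name, strength in strengths.items():
        latent = [x + strength * d for x, d in zip(latent, directions[name])]
    return latent


def read(latent, name):
    """Read a feature's strength by projecting onto its direction."""
    return dot(latent, directions[name])


latent = encode({"is_plural": 2.0, "is_proper_noun": 0.5})
readout = {name: read(latent, name) for name in directions}
print(readout)
```

Because the directions here are orthogonal, each readout returns the encoded strength up to floating-point error, and the absent feature reads as approximately zero. Superposition arises when there are more features than dimensions, so the directions can no longer all be orthogonal.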

The authors further elaborate on how LLMs can accommodate a vast number of features through superposition, where overlapping feature directions allow the model to represent more features than it has neurons. This mechanism, while efficient, introduces interference among features, which complicates interpretability. They highlight that LLMs learn to differentiate inputs based on features relevant to next-token prediction, which aligns with the pragmatic view of concepts as subject to “differential testing by consequences.” The section concludes by emphasizing the significance of features in LLM performance and the ongoing efforts in MI to disentangle these overlapping representations for a better understanding of the models’ internal workings.
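Superposition and the interference it causes can be sketched with random directions. The following is an illustration under assumptions, not a model of any real network: the sizes `DIM` and `N_FEATURES`, the random choice of directions, and the indices of the active features are all hypothetical. It packs four times as many feature directions as there are dimensions, measures how much distinct directions overlap, and then reads back a sparse pair of active features by projection.

```python
import math
import random

random.seed(0)
DIM = 64          # hypothetical latent width
N_FEATURES = 256  # four times more features than dimensions


def random_unit_vector(dim):
    """Sample a direction uniformly at random on the unit sphere."""
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


features = [random_unit_vector(DIM) for _ in range(N_FEATURES)]

# Interference: overlap (cosine) between distinct feature directions.
# It cannot be zero for every pair once N_FEATURES exceeds DIM.
max_interference = max(
    abs(dot(features[i], features[j]))
    for i in range(N_FEATURES)
    for j in range(i + 1, N_FEATURES)
)

# Activate a sparse, arbitrary pair of features and superpose them.
active = {3, 17}
latent = [sum(features[i][k] for i in active) for k in range(DIM)]

# Read every feature back by projection; inactive ones pick up small noise.
readouts = [dot(latent, f) for f in features]
inactive_noise = sum(
    abs(r) for i, r in enumerate(readouts) if i not in active
) / (N_FEATURES - len(active))

print(f"max pairwise interference: {max_interference:.2f}")
print(f"active readouts: {[round(readouts[i], 2) for i in sorted(active)]}")
print(f"mean |readout| of inactive features: {inactive_noise:.2f}")
```

The active features read back close to 1, while inactive features read back small but nonzero values. That residue is the interference described above, and it is why individual neurons or directions in a superposed representation are hard to interpret in isolation.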