GeoNav: Empowering MLLMs with dual-scale geospatial reasoning for language-goal aerial navigation

Journal: Pattern Recognition, Volume: 177
DOI: https://doi.org/10.1016/j.patcog.2026.113365
Publication Date: 2026-02-28
Author(s): Zhenyun Du et al.
Primary Topic: Speech and dialogue systems

Overview

In this research, the authors introduce GeoNav, a multimodal agent for long-range aerial navigation of unmanned aerial vehicles (UAVs) in complex urban environments, guided by textual instructions. Navigation methods developed for indoor settings scale poorly to cities because of challenges such as ambiguous objects and limited fields of view. GeoNav addresses these issues with a three-phase approach that emulates human spatial reasoning patterns: landmark navigation, target search, and precise localization.

GeoNav constructs dual-scale spatial representations to facilitate its navigation tasks. The first representation is a global cognitive map that integrates prior geographic knowledge with visual cues, allowing for efficient navigation to landmark regions through intuitive map-based reasoning. The second representation is a local scene graph that captures hierarchical spatial relationships among landmarks and objects, essential for accurate target localization. Additionally, GeoNav employs a spatial chain-of-thought mechanism to enhance the interpretability and efficiency of decision-making across its operational stages. On the CityNav benchmark, GeoNav demonstrates a significant improvement over the current state-of-the-art (SOTA), achieving up to an 18.4% increase in success rate and markedly reducing navigation errors. Ablation studies further emphasize the critical role of structured spatial perception in enhancing UAV navigation capabilities.
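To make the two scales concrete, the following is a minimal sketch of how a coarse global map and a fine local scene graph could be held in memory. It is an illustration only, not the authors' implementation: the class names `CognitiveMap` and `SceneNode`, the grid layout, and the relation strings are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class CognitiveMap:
    """Global, coarse-scale memory: a 2D grid of semantic labels (hypothetical layout)."""
    width: int
    height: int
    cells: dict = field(default_factory=dict)  # (x, y) -> label

    def mark(self, x: int, y: int, label: str) -> None:
        """Record a label (e.g. from a geographic prior or a visual cue) in one cell."""
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise ValueError("cell outside the map")
        self.cells[(x, y)] = label

    def cells_with(self, label: str) -> list:
        """Return all cells carrying the given label, in sorted order."""
        return sorted(xy for xy, lab in self.cells.items() if lab == label)


@dataclass
class SceneNode:
    """Local, fine-scale memory: one node in a landmark -> object hierarchy."""
    name: str
    relation: str = "root"  # spatial relation to the parent (hypothetical strings)
    children: list = field(default_factory=list)

    def add(self, name: str, relation: str) -> "SceneNode":
        """Attach a child node and return it so hierarchies can be chained."""
        child = SceneNode(name, relation)
        self.children.append(child)
        return child

    def find(self, name: str):
        """Depth-first search; returns the node path from here to the match, or None."""
        if self.name == name:
            return [self]
        for child in self.children:
            path = child.find(name)
            if path is not None:
                return [self] + path
        return None
```

In this sketch the map answers "where is the landmark region?", while the path returned by `find` answers "how is the target related to the landmark?", which mirrors the division of labor described above.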

Introduction

The introduction addresses the emerging field of language-goal aerial navigation, where unmanned aerial vehicles (UAVs) must interpret natural language instructions to navigate towards targets in unfamiliar outdoor environments. This capability is particularly relevant for urban governance and public services, including emergency response and aerial logistics. Although rich geographic information about points of interest (POIs) is available in urban settings, existing aerial navigation methods often overlook this data, relying instead on an egocentric vision-language matching approach that is better suited to smaller environments.

The paper highlights the challenges faced in aerial navigation, particularly in urban contexts characterized by ambiguous visual semantics and the need for multi-scale spatial reasoning. Urban environments present visually similar structures that complicate identification tasks, necessitating the use of geospatial referring expressions. Additionally, effective navigation requires integrating macro-scale elements, such as road networks, with micro-scale entities, like individual buildings and vehicles. The authors emphasize that these challenges are significant in the context of the newly established benchmark, CityNav, which aims to advance the field of aerial navigation by leveraging landmark geographic priors.

Methods

In this section, the authors outline the experimental settings and baseline methods used for comparison with their proposed GeoNav approach. They categorize the methods into three main groups: Rule-based Methods, Learning-based Methods, and Zero-shot Methods. Rule-based Methods include a random selection strategy (RS) and a greedy method that navigates directly to the landmark centroid. Learning-based Methods encompass established aerial Vision-and-Language Navigation (VLN) techniques such as Seq2Seq, CMA, and the state-of-the-art (SOTA) method MGP, with performance metrics presented for each.
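The two rule-based baselines are simple enough to sketch. The code below is a rough illustration under stated assumptions, not the benchmark's implementation: the action names, the 2D coordinates, the step size, and the use of a vertex average as the landmark centroid are all hypothetical choices made here.

```python
import math
import random

# Hypothetical discrete action set; the summary does not list the real one.
ACTIONS = ("forward", "turn_left", "turn_right", "ascend", "descend", "stop")


def random_policy(rng: random.Random) -> str:
    """Random selection baseline: pick any action uniformly at random."""
    return rng.choice(ACTIONS)


def landmark_centroid(vertices):
    """Average of the polygon vertices, used here as a simple centroid approximation."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def greedy_step(position, target, step_size):
    """Move at most step_size along the straight line towards target."""
    dx, dy = target[0] - position[0], target[1] - position[1]
    dist = math.hypot(dx, dy)
    if dist <= step_size:
        return target
    scale = step_size / dist
    return (position[0] + dx * scale, position[1] + dy * scale)


def greedy_path(start, landmark_vertices, step_size=5.0, max_steps=1000):
    """Greedy baseline: fly straight to the landmark centroid and stop there."""
    goal = landmark_centroid(landmark_vertices)
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        path.append(greedy_step(path[-1], goal, step_size))
    return path
```

The greedy baseline uses only the landmark prior and ignores the target description entirely, which is what makes it a useful reference point for the other methods.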

The Zero-shot Methods consist of four navigation strategies: NavGPT and Qwen-VL-Max, which use egocentric views, and TopV-Nav and STMR, which rely on top-down maps. To ensure a fair comparison, the authors incorporate the MNS module into TopV-Nav and STMR, which lets these methods reach landmarks more efficiently and gives them access to prior landmark descriptions. The performance metrics for each method are detailed, highlighting GeoNav’s competitive results against both human performance and the other baseline methods.
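As a rough illustration of how metrics such as success rate and navigation error are typically computed for this kind of task, consider the sketch below. The function names, the 2D positions, and the default `success_radius` of 20 m are assumptions made for the example; the summary itself does not state the threshold or the exact metric definitions used in the paper.

```python
import math


def navigation_error(final_pos, target_pos):
    """Euclidean distance between where the agent stopped and the target."""
    return math.dist(final_pos, target_pos)


def evaluate(episodes, success_radius=20.0):
    """Aggregate metrics over (final_pos, target_pos) pairs.

    success_radius is an assumed threshold: an episode counts as a success
    when the agent stops within this distance of the target.
    """
    if not episodes:
        raise ValueError("no episodes to evaluate")
    errors = [navigation_error(final, target) for final, target in episodes]
    successes = sum(err <= success_radius for err in errors)
    return {
        "success_rate": successes / len(errors),
        "mean_error": sum(errors) / len(errors),
    }
```

Reporting both numbers matters because a method can lower its mean error without crossing the success threshold more often, and the reverse is also possible.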

Discussion

In this section, the authors present GeoNav, a novel zero-shot agentic approach designed for long-horizon aerial navigation in urban environments. GeoNav integrates urban geographic knowledge with embodied perception, employing multi-modal large language models (MLLMs) to enhance spatial reasoning through structured memory and task decomposition. The navigation process is structured into three stages: (1) Landmark Navigation, where the UAV moves towards coarse geographic landmarks; (2) Target Search, which involves searching for objects within the vicinity of the landmark; and (3) Precise Localization, where the UAV determines the exact location of the target. This staged approach, supported by a schematic cognitive map (SCM) and a hierarchical scene graph (HSG), allows for more efficient navigation compared to traditional end-to-end methods.
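The three-stage process can be pictured as a small state machine. The sketch below is a simplified illustration, not the authors' scheduler: the boolean signals `at_landmark`, `target_visible`, and `target_localized` are hypothetical stand-ins for whatever checks the MLLM-based agent actually performs, and the transitions are assumed to be strictly forward.

```python
from enum import Enum


class Stage(Enum):
    LANDMARK_NAVIGATION = 1
    TARGET_SEARCH = 2
    PRECISE_LOCALIZATION = 3
    DONE = 4


def next_stage(stage, at_landmark, target_visible, target_localized):
    """Advance one stage when that stage's (hypothetical) completion signal fires."""
    if stage is Stage.LANDMARK_NAVIGATION:
        return Stage.TARGET_SEARCH if at_landmark else stage
    if stage is Stage.TARGET_SEARCH:
        return Stage.PRECISE_LOCALIZATION if target_visible else stage
    if stage is Stage.PRECISE_LOCALIZATION:
        return Stage.DONE if target_localized else stage
    return Stage.DONE


def run(observations):
    """Feed a sequence of observation dicts through the scheduler and record the stages."""
    stage = Stage.LANDMARK_NAVIGATION
    trace = [stage]
    for obs in observations:
        stage = next_stage(stage, **obs)
        trace.append(stage)
        if stage is Stage.DONE:
            break
    return trace
```

Keeping the stage explicit means each step only has to reason about one subtask at a time, which is the intuition behind decomposing the task rather than solving it end to end.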

The authors highlight three key contributions of GeoNav: a stage-aware scheduling and reasoning mechanism that decomposes navigation tasks into manageable subtasks; the fusion of textual geographic priors with visual observations to create dual-scale spatial representations; and extensive evaluations demonstrating GeoNav’s superior performance on the CityNav benchmark, significantly improving navigational success rates and efficiency over existing baselines. The discussion also contextualizes GeoNav within the broader landscape of aerial vision-and-language navigation and MLLM-driven spatial reasoning, emphasizing its advancements in handling complex urban navigation tasks that previous models struggled to address.