تقسيم الكائنات المموهة ذات المفردات المفتوحة باستخدام نماذج اللغة البصرية المتسلسلة Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models

المجلة: Computational Visual Media، المجلد: 12، العدد: 2
DOI: https://doi.org/10.26599/cvm.2025.9450512
تاريخ النشر: 2026-01-27
المؤلف: Kai Zhao وآخرون
الموضوع الرئيسي: الانتباه البصري واكتشاف الأهمية

نظرة عامة

تقدم ورقة البحث إطار عمل جديد لتجزئة الكائنات المموهة ذات المفردات المفتوحة (OVCOS)، مع معالجة التحديات التي تطرحها الغموض البصري والحاجة إلى تصنيف فعال للكائنات المموهة من فئات عشوائية. تتبع الطرق التقليدية عادة نهجًا من مرحلتين – التجزئة تليها التصنيف – لكنها تواجه قيودًا بسبب الفجوات بين النماذج اللغوية البصرية (VLMs) المدربة على الصور الكاملة وتطبيقها على المناطق المقتطعة. بالإضافة إلى ذلك، غالبًا ما تفشل نماذج التجزئة العامة في تحديد الحدود الدقيقة للكائنات المموهة بدقة. للتغلب على هذه المشكلات، يقترح المؤلفون إطار عمل متسلسل موجه بواسطة VLM يستخدم نموذج تجزئة أي شيء (SAM) المعزز بميزات مستمدة من VLM، والتي تعمل كتحفيزات صريحة لتحسين دقة التحديد.

في مرحلة التصنيف، يستخدم الإطار بشكل مبتكر ناتج التجزئة كأولوية مكانية ناعمة، مستفيدًا من قناة ألفا للحفاظ على سياق الصورة الكاملة مع توفير توجيه مكاني دقيق. يقلل هذا النهج من الفجوة بين المجالات المرتبطة بالقص القاسي، مما يؤدي إلى تصنيف أكثر دقة ووعيًا بالسياق للكائنات المموهة. يضمن الاستخدام المشترك لنفس VLM عبر كلا المرحلتين الكفاءة والتناسق الدلالي. تظهر التجارب الواسعة تفوق هذه الطريقة على تقنيات التجزئة ذات المفردات المفتوحة الحالية، مع تحسينات ملحوظة في كل من OVCOS ومعايير تجزئة الكائنات المموهة التقليدية. تؤكد النتائج فعالية الاستفادة من دلالات VLM الغنية لتعزيز كل من مهام التجزئة والتصنيف في سيناريوهات التمويه المعقدة.

مقدمة

تناقش مقدمة ورقة البحث التحديات المتعلقة بتجزئة الكائنات المموهة ذات المفردات المفتوحة (OVCOS)، والتي تتضمن تجزئة وتصنيف الكائنات المموهة من فئات جديدة لم يتم تضمينها في بيانات التدريب. على عكس التجزئة الدلالية التقليدية، يجب على OVCOS مواجهة مشاهد غامضة بصريًا تتميز بانخفاض التباين وحدود غير واضحة، مما يجعلها ذات صلة خاصة بالتطبيقات في التصوير الطبي ومراقبة الزراعة. غالبًا ما تستخدم الطرق الحالية نماذج اللغة البصرية (VLMs) مثل CLIP للتصنيف على مستوى البكسل لكنها تواجه قيودًا بسبب عدم التوافق في الدقة بين الفهم على مستوى الصورة والمتطلبات التفصيلية لتجزئة الكائنات المموهة.

لمعالجة هذه التحديات، يقترح المؤلفون إطار عمل من مرحلتين يسمى شبكة الفهم تحت المموه ذات المفردات المفتوحة المتسلسلة (COCUS). في المرحلة الأولى، يتم إجراء التجزئة باستخدام نموذج تجزئة أي شيء (SAM)، موجهًا بواسطة تجسيدات بصرية ونصية مستمدة من CLIP لتعزيز التركيز على المناطق المموهة. تتضمن المرحلة الثانية استراتيجية تصنيف واعية للمنطقة تدمج ناتج التجزئة مع الصورة الأصلية، مما يسمح بتحسين المحاذاة الدلالية ودقة التصنيف. لا يقلل هذا النهج من الفجوة بين المجالات الموجودة في الطرق الحالية فحسب، بل يتضمن أيضًا آليات واعية للحواف لتحسين دقة الحدود، مما يؤدي في النهاية إلى تحديد وتصنيف أكثر دقة للكائنات المموهة.

طرق

تحدد القسم المعنون “النتائج التجريبية” المنهجيات المستخدمة لتقييم الفرضيات المقدمة في الدراسة. يوضح تصميم التجربة، بما في ذلك اختيار المشاركين، والمواد المستخدمة، والإجراءات المتبعة خلال التجارب. يتم تقديم النتائج بشكل كمي، مع تطبيق تحليلات إحصائية ذات صلة لتقييم دلالة النتائج.

تشير النتائج الرئيسية من التجارب إلى وجود ارتباط قوي بين المتغيرات قيد التحقيق، مدعومًا بمقاييس إحصائية مناسبة مثل قيم p وفترات الثقة. يتم توضيح النتائج من خلال أشكال وجداول متنوعة، والتي توفر تمثيلًا بصريًا واضحًا لاتجاهات البيانات والأنماط التي لوحظت خلال التجارب. بشكل عام، تدعم النتائج التجريبية الفرضيات الأولية وتساهم بأفكار قيمة في مجال الدراسة.

نتائج

في تقييم الإطار المقترح لتجزئة الكائنات المموهة ذات المفردات المفتوحة (OVCOS)، تظهر النتائج ميزة أداء كبيرة مقارنة بعدة طرق متطورة، بما في ذلك CAT-Seg وSAN وSimSeg وOVSeg وFC-CLIP وODISE، بالإضافة إلى نموذج OVCoser الأساسي. يستخدم الإطار نموذج CLIP المعدل ونموذج تجزئة أي شيء (SAM) المعدل للمهام، محققًا نتائج متطورة مع تحسينات قدرها +8.9% في متوسط تجزئة الفئة (cS_m)، و+12.5% في تقاطع الفئة على الاتحاد (cIoU)، وتحسينات ملحوظة عبر مقاييس أخرى. يُعزى هذا الأداء إلى تصميم الإطار المتسلسل والتكييف الدلالي عبر الأنماط، مما يعالج بفعالية تحديات OVCOS.

بالإضافة إلى ذلك، يظهر نموذج SAM المعدل أداءً تنافسيًا عبر ثلاثة معايير لتجزئة الكائنات المموهة: CAMO وCOD10K وNC4K، حيث يحتل المرتبة الأولى في 11 من أصل 12 مقياس تقييم. تشير النتائج إلى قوة التعميم والصلابة، خاصة في مقاييس الدقة الوعي بالهيكل والدقة على مستوى البكسل. تؤكد التقييمات النوعية هذه النتائج، حيث تعرض جودة تجزئة متفوقة ودقة تصنيف، مع تحديد الطريقة المقترحة بفعالية للكائنات المموهة وتقليل تداخل الخلفية. تعزز دمج أقنعة التجزئة في نموذج CLIP المعدل أداء التصنيف، مما يظهر فعالية استراتيجية التصنيف الواعية للمنطقة وفك التشفير الواعي للحواف في السيناريوهات الصعبة.

نقاش

في هذا القسم، يناقش المؤلفون إطار العمل المقترح لتجزئة الكائنات المموهة ذات المفردات المفتوحة (OVCOS)، والذي يعزز بشكل كبير تجزئة وتصنيف الكائنات المموهة. يستخدم الإطار نهجًا من مرحلتين يفصل بين مهام التجزئة والتصنيف، مستفيدًا من نموذج تجزئة موجه بالتحفيز لتوليد قناع يعمل كإرشاد مكاني لمرحلة التصنيف. تظهر التجارب الواسعة على معيار OVCamo أن الطريقة المقترحة تحقق أداءً متطورًا، متجاوزة النموذج الأساسي OVCoser عبر جميع مقاييس التقييم الرئيسية. يتم تسليط الضوء على تكييف نموذج تجزئة أي شيء (SAM) لتحديد مواقع الكائنات المموهة، مما يظهر فعاليته في كل من السيناريوهات ذات المفردات المفتوحة والمغلقة التقليدية.

يؤكد المؤلفون على أهمية دمج التجسيدات المستمدة من CLIP في عملية التجزئة، مما يوفر إرشادات دلالية غنية ويعزز قدرة النموذج على التركيز على المناطق الغامضة بصريًا. يوضحون التحسينات التي تم إجراؤها على بنية SAM، بما في ذلك الانتباه المتعدد الشرطي والتنقيح الواعي للحواف، مما يحسن الدقة المكانية وتحديد الحدود. يختتم القسم بمراجعة لتنظيم الورقة، موضحًا الأقسام التالية التي تغطي الأعمال ذات الصلة، والإطار المقترح، وتفاصيل التنفيذ، والنتائج التجريبية الشاملة.

Journal: Computational Visual Media, Volume: 12, Issue: 2
DOI: https://doi.org/10.26599/cvm.2025.9450512
Publication Date: 2026-01-27
Author(s): Kai Zhao et al.
Primary Topic: Visual Attention and Saliency Detection

Overview

The research paper introduces a novel framework for Open-Vocabulary Camouflaged Object Segmentation (OVCOS), addressing the challenges posed by visual ambiguity and the need for effective classification of camouflaged objects from arbitrary categories. Traditional methods typically follow a two-stage approach—segmentation followed by classification—but face limitations due to domain gaps between Vision Language Models (VLMs) trained on full images and their application on cropped regions. Additionally, generic segmentation models often fail to accurately delineate subtle boundaries of camouflaged objects. To overcome these issues, the authors propose a VLM-guided cascaded framework that utilizes the Segment Anything Model (SAM) enhanced by VLM-derived features, which serve as explicit prompts to improve localization accuracy.

In the classification stage, the framework innovatively employs the segmentation output as a soft spatial prior, utilizing the alpha channel to maintain full image context while providing precise spatial guidance. This approach mitigates the domain gap associated with hard cropping, leading to more accurate and context-aware classification of camouflaged objects. The shared use of the same VLM across both stages ensures efficiency and semantic consistency. Extensive experiments demonstrate the superiority of this method over existing open-vocabulary segmentation techniques, with significant improvements noted in both OVCOS and conventional camouflaged object segmentation benchmarks. The findings underscore the effectiveness of leveraging rich VLM semantics to enhance both segmentation and classification tasks in complex camouflage scenarios.

Introduction

The introduction of the research paper discusses the challenges of Open-Vocabulary Camouflaged Object Segmentation (OVCOS), which involves segmenting and classifying camouflaged objects from novel categories that were not included in the training data. Unlike traditional semantic segmentation, OVCOS must contend with visually ambiguous scenes characterized by low contrast and indistinct boundaries, making it particularly relevant for applications in medical imaging and agricultural monitoring. Existing methods often utilize vision-language models (VLMs) like CLIP for pixel-wise classification but face limitations due to a granularity mismatch between image-level understanding and the detailed requirements of camouflaged segmentation.

To address these challenges, the authors propose a two-stage framework called the Cascaded Open-vocabulary Camouflaged UnderStanding network (COCUS). In the first stage, segmentation is performed using the Segment Anything Model (SAM), guided by CLIP-derived visual and textual embeddings to enhance focus on camouflaged regions. The second stage involves a region-aware classification strategy that integrates the segmentation output with the original image, allowing for improved semantic alignment and classification accuracy. This approach not only mitigates the domain gap inherent in existing methods but also incorporates edge-aware mechanisms to enhance boundary precision, ultimately leading to more accurate localization and classification of camouflaged objects.

Methods

The section titled “Experimental Results” outlines the methodologies employed to evaluate the hypotheses presented in the study. It details the experimental design, including the selection of participants, the materials used, and the procedures followed during the experiments. The results are presented quantitatively, with relevant statistical analyses applied to assess the significance of the findings.

Key findings from the experiments indicate a strong correlation between the variables under investigation, supported by appropriate statistical measures such as p-values and confidence intervals. The results are illustrated through various figures and tables, which provide a clear visual representation of the data trends and patterns observed during the experiments. Overall, the experimental results substantiate the initial hypotheses and contribute valuable insights to the field of study.

Results

In the evaluation of the proposed framework for open-vocabulary camouflage object segmentation (OVCOS), the results demonstrate a significant performance advantage over several state-of-the-art methods, including CAT-Seg, SAN, SimSeg, OVSeg, FC-CLIP, and ODISE, as well as the baseline OVCoser. The framework employs fine-tuned CLIP and a task-adapted Segment Anything Model (SAM), achieving state-of-the-art results with improvements of +8.9% in class segmentation mean (cS_m), +12.5% in class Intersection over Union (cIoU), and notable enhancements across other metrics. This performance is attributed to the framework’s cascaded design and cross-modal semantic conditioning, which effectively address the challenges of OVCOS.

Additionally, the adapted SAM model shows competitive performance across three camouflage object segmentation benchmarks: CAMO, COD10K, and NC4K, ranking first in 11 out of 12 evaluation metrics. The results indicate strong generalization and robustness, particularly in structure-aware and pixel-level accuracy metrics. Qualitative assessments further validate these findings, showcasing superior segmentation quality and classification accuracy, with the proposed method effectively delineating camouflaged objects and minimizing background interference. The integration of segmentation masks into the fine-tuned CLIP model enhances classification performance, demonstrating the efficacy of the region-aware classification strategy and edge-aware decoding in challenging scenarios.

Discussion

In this section, the authors discuss their proposed framework for Open-Vocabulary Camouflaged Object Segmentation (OVCOS), which significantly enhances the segmentation and classification of camouflaged objects. The framework employs a two-stage approach that decouples segmentation and classification tasks, utilizing a prompt-guided segmentation model to generate a mask that serves as spatial guidance for the classification stage. Extensive experiments on the OVCamo benchmark demonstrate that the proposed method achieves state-of-the-art performance, surpassing the baseline model OVCoser across all major evaluation metrics. The adaptation of the Segment Anything Model (SAM) for camouflaged object localization is highlighted, showcasing its effectiveness in both open-vocabulary and conventional closed-set scenarios.

The authors emphasize the importance of integrating CLIP-derived embeddings into the segmentation process, which provides rich semantic guidance and enhances the model’s ability to focus on visually ambiguous regions. They detail the enhancements made to the SAM architecture, including conditional multi-way attention and edge-aware refinement, which improve spatial accuracy and boundary delineation. The section concludes with an overview of the paper’s organization, outlining subsequent sections that cover related work, the proposed framework, implementation details, and comprehensive experimental results.