Automatic object detection for behavioural research using YOLOv8

Journal: Behavior Research Methods, Volume: 56, Issue: 7
DOI: https://doi.org/10.3758/s13428-024-02420-5
PMID: https://pubmed.ncbi.nlm.nih.gov/38750389
Publication Date: 2024-05-15
Author(s): Frouke Hermens
Primary Topic: Animal Behavior and Welfare Studies

Overview

The study investigates how well YOLOv8, a state-of-the-art object detection model, annotates objects in video recordings for behavioral research. The findings indicate that YOLOv8 achieves near-perfect object detection even when trained on a small dataset of 100 to 350 images. However, the model’s performance drops when the same object is presented against a background that differs from the one in the training images. To mitigate this issue, the study recommends training the detector on images that cover a variety of backgrounds, which significantly improves detection accuracy.

These results suggest that YOLOv8 could greatly benefit behavioral studies that rely on video annotation, such as mobile eye tracking and object handling research. To maintain performance, the model should be retrained whenever either the object or the laboratory background changes. Additionally, the author cautions that the findings may not apply to other computer vision tasks such as image segmentation and pose estimation, which warrant separate investigation. Overall, the ease of use and effectiveness of YOLOv8 make it a valuable tool in the field.
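The recommendation to train on varied backgrounds can be illustrated with a small sampling sketch. It is not taken from the paper: the function name, the (filename, background) pair layout, and the background labels are all hypothetical. The idea shown is simply to pick training frames in round-robin fashion so that every background is represented about equally.

```python
import random
from collections import defaultdict


def sample_across_backgrounds(images, n_total, seed=0):
    """Pick n_total images while spreading them evenly over backgrounds.

    `images` is a list of (filename, background_label) pairs; both the
    pair layout and the background labels are hypothetical.
    """
    rng = random.Random(seed)
    by_background = defaultdict(list)
    for filename, background in images:
        by_background[background].append(filename)

    # Shuffle within each background so the pick is random but reproducible.
    for filenames in by_background.values():
        rng.shuffle(filenames)

    # Round-robin over backgrounds until enough images are selected.
    selected = []
    pools = [by_background[b] for b in sorted(by_background)]
    while len(selected) < n_total and any(pools):
        for pool in pools:
            if pool and len(selected) < n_total:
                selected.append(pool.pop())
    return selected


# Toy catalogue: 3 backgrounds with 40 frames each (made-up names).
catalogue = [(f"{bg}_{i:03d}.jpg", bg)
             for bg in ("lab_a", "lab_b", "lab_c") for i in range(40)]
training_images = sample_across_backgrounds(catalogue, n_total=90)
```

With the toy catalogue above, each of the three backgrounds contributes 30 of the 90 selected frames. If one background has fewer frames than its share, the remaining backgrounds fill the gap.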

Introduction

The introduction discusses the significance of video annotation in behavioral research, particularly for analyzing eye-tracking data and object handling in contexts that range from everyday activities to surgical procedures. It highlights the importance of object detection, a well-established computer vision task whose accuracy has improved substantially thanks to advances in algorithms and hardware and the availability of large annotated datasets. The author notes that while traditional object detection methods often required programming skills, recent developments, such as the Ultralytics package, have made these techniques more accessible to behavioral researchers.

The study aims to investigate the conditions necessary for training object detectors, focusing specifically on the YOLOv8 algorithm. It examines surgical tool detection in controlled lab settings and addresses several research questions: how many annotated images are required for effective training, how well detectors perform on unseen videos, and how different backgrounds affect detection accuracy. The author emphasizes that the findings could provide insights into training object detectors for behavioral research, potentially leading to more reliable applications in various contexts.

Methods

In this study, the author employs YOLOv8 (Jocher et al., 2023) to analyze videos from an eye-hand coordination experiment in which novices used a surgical simulator. The primary objective is to identify potential indicators in the instrument and eye movements that could predict performance in the surgical task. Specifically, the research concentrates on detecting the position of the surgical instrument, identified as the tooltip of a surgical grasper, within the recorded video footage. This approach aims to improve understanding of the relationship between visual and motor coordination in surgical contexts.
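For readers who want to try a comparable setup, the following is a minimal sketch of fine-tuning a pre-trained YOLOv8 nano model with the Ultralytics Python package. It is not the paper's own script: the folder layout, the class name "tooltip", the number of epochs, and the image size are assumptions chosen for illustration, and both helper functions are hypothetical.

```python
def dataset_yaml_text(root, class_name="tooltip"):
    """Return a minimal Ultralytics-style dataset description.

    Assumes images live in <root>/images/train and <root>/images/val,
    with YOLO-format label files in the matching labels/ folders.
    """
    return (
        f"path: {root}\n"
        "train: images/train\n"
        "val: images/val\n"
        "names:\n"
        f"  0: {class_name}\n"
    )


def train_tooltip_detector(yaml_path, epochs=50, image_size=640):
    """Fine-tune the pre-trained nano model and return validation scores.

    The epochs and image_size defaults are illustrative assumptions.
    """
    # Imported here so the helper above works without the package installed.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")  # pre-trained nano weights
    model.train(data=str(yaml_path), epochs=epochs, imgsz=image_size)
    metrics = model.val()  # evaluates on the "val" split of the dataset
    return {"mAP50": metrics.box.map50, "mAP50-95": metrics.box.map}
```

A typical use would be to save the text returned by `dataset_yaml_text("datasets/tooltip")` to a file such as `tooltip.yaml` and then call `train_tooltip_detector("tooltip.yaml")`. Training requires the `ultralytics` package and downloads the pre-trained weights on first use.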

Results

The results indicate that the model detected tooltips effectively across the image set sizes tested. Specifically, the model trained on a dataset of 150 images detected the tooltip in every image of a random selection from the validation set, as illustrated in Figure 6. This consistent performance underscores the model’s robustness in identifying tooltips across different scenarios.

Discussion

In this study, eye-tracking data and video recordings were collected from 38 participants to analyze object detection performance using a YOLOv8 model for surgical instrument tracking. Participants, who had no prior experience with surgical tasks, performed a bead-moving task using a surgical grasper while their eye movements were recorded. A total of 436 images were extracted from the video recordings and annotated with bounding boxes around the tooltips of the instruments. The YOLOv8 model was trained and validated using various image set sizes, revealing that performance generally improved as the number of annotated images increased, particularly in the mean average precision (mAP) scores.
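The comparison across image set sizes can be sketched as follows. The total of 436 annotated images comes from the text; the step of 50 between training set sizes, the fixed random seed, and the choice to validate on whatever images are left over are assumptions made for illustration, not details confirmed by the summary.

```python
import random

TOTAL_IMAGES = 436                      # annotated frames reported in the study
TRAIN_SIZES = [50, 100, 150, 200, 250, 300, 350]  # assumed step of 50


def make_splits(n_images, train_sizes, seed=0):
    """Return {train_size: (train_ids, val_ids)} for each requested size.

    Assumption: each split trains on the first k images of one shuffled
    ordering and validates on the remaining images.
    """
    rng = random.Random(seed)
    order = list(range(n_images))
    rng.shuffle(order)
    return {k: (order[:k], order[k:]) for k in train_sizes}


splits = make_splits(TOTAL_IMAGES, TRAIN_SIZES)
for size, (train_ids, val_ids) in splits.items():
    print(f"train={len(train_ids):3d}  val={len(val_ids):3d}")
```

Because every split is cut from the same shuffled ordering, the smaller training sets are nested inside the larger ones, so differences between set sizes reflect the added images rather than a different random draw.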

The results indicated that smaller image set sizes (50 to 150 images) achieved high mAP50 scores, but performance dropped for sets of 200 and 250 images before recovering at larger sizes (300 and 350 images). Notably, the model showed a high degree of overlap, measured as intersection over union (IoU), between detected and annotated bounding boxes, averaging around 80-90% when the tool was detected. False negatives were significantly reduced with larger training sets, and no false positives were recorded. Additionally, the study explored the impact of different pre-trained model sizes and concluded that the nano model performed comparably to the larger models, which offered no significant improvement. Overall, the findings underscore the importance of adequate training data for effective object detection in surgical contexts.
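The overlap and error measures mentioned above can be made concrete with a short sketch. The IoU formula is standard, and the 0.5 threshold matches the convention behind mAP50. The corner-coordinate box format, the assumption of at most one tool per frame, and the `tally` helper are illustrative choices rather than the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tally(frames, threshold=0.5):
    """Count hits, false negatives and false positives over frames.

    Each frame is (annotated_box, detected_box); either may be None.
    Assumes at most one tool per frame.
    """
    counts = {"hit": 0, "false_negative": 0, "false_positive": 0}
    for annotated, detected in frames:
        if annotated is not None and detected is not None:
            if iou(annotated, detected) >= threshold:
                counts["hit"] += 1
            else:
                # A misplaced box misses the tool and marks a wrong region.
                counts["false_negative"] += 1
                counts["false_positive"] += 1
        elif annotated is not None:
            counts["false_negative"] += 1
        elif detected is not None:
            counts["false_positive"] += 1
    return counts


# Made-up frames: a good detection, a missed tool, and an empty frame.
frames = [
    ((10, 10, 50, 50), (12, 12, 52, 52)),
    ((10, 10, 50, 50), None),
    (None, None),
]
counts = tally(frames)
```

In the first made-up frame the two boxes are shifted by two pixels in each direction, which gives an IoU of roughly 0.82, inside the 80-90% range reported when the tool was detected.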