Comparing YOLOv8 and Mask R-CNN for instance segmentation in complex orchard environments


Journal: Artificial Intelligence in Agriculture, Volume: 13
DOI: https://doi.org/10.1016/j.aiia.2024.07.001
Published: 2024-07-17

Ranjan Sapkota*, Dawood Ahmed and Manoj Karkee*
Center for Precision and Automated Agricultural Systems, Washington State University, 24106 N Bunn Rd, Prosser, 99350, Washington, USA

Abstract

Instance segmentation, an important image processing operation for automation in agriculture, is used to precisely delineate individual objects of interest within images, which provides foundational information for various automated or robotic tasks such as selective harvesting and precision pruning. This study compares the one-stage YOLOv8 and the two-stage Mask R-CNN machine learning models for instance segmentation under varying orchard conditions across two datasets. Dataset 1, collected in the dormant season, includes images of dormant apple trees, which were used to train multi-object segmentation models delineating tree branches and trunks. Dataset 2, collected in the early growing season, includes images of apple tree canopies with green foliage and immature (green) apples (also called fruitlets), which were used to train single-object segmentation models delineating only immature green apples. The results showed that YOLOv8 performed better than Mask R-CNN, achieving good precision and near-perfect recall across both datasets at a confidence threshold of 0.5. Specifically, for Dataset 1, YOLOv8 achieved a precision of 0.90 and a recall of 0.95 for all classes. In comparison, Mask R-CNN achieved a precision of 0.81 and a recall of 0.81 for the same dataset. With Dataset 2, YOLOv8 achieved a precision of 0.93 and a recall of 0.97. Mask R-CNN, in this single-class scenario, achieved a precision of 0.85 and a recall of 0.88. Additionally, the inference times for YOLOv8 were 10.9 ms for multi-class segmentation (Dataset 1) and 7.8 ms for single-class segmentation (Dataset 2), compared with 15.6 ms and 12.8 ms for Mask R-CNN, respectively. These findings show YOLOv8's superior accuracy and efficiency in machine learning applications compared with two-stage models, specifically Mask R-CNN, which suggests its suitability for developing smart and automated orchard operations, particularly when real-time operation is necessary, as in robotic harvesting and robotic thinning of immature green fruit.

Keywords: Machine learning, Deep learning, YOLOv8, Mask R-CNN, Automation, Robotics, Artificial intelligence, Machine vision

1. Introduction

Instance segmentation is a powerful computer vision technique that combines the benefits of object detection and semantic segmentation [1]. One of the key benefits of instance segmentation in agricultural applications is its ability to measure plant and crop structures accurately [2], which can provide valuable information on plant growth, disease identification, and yield estimation, and can form the basis for key research and development areas such as robotic green (immature) fruit thinning [3]. Instance segmentation can provide precise measurements of plant features, such as leaf area, stem length, and plant height, with a high level of accuracy and efficiency [4], [5]. Traditional methods for instance segmentation in agricultural images relied mostly on handcrafted features and classical image processing techniques such as the watershed transform [6], graph-based segmentation [7], active contours (or snakes) [8], [9], level sets [10], [11], [12], region growing [10], [11], [12], morphological operations [13], [14], and clustering methods [15], [16]. However, these methods require a great deal of manual setup and refinement, which makes them time-consuming and less reliable [17]. In addition, they could not learn easily from new data, which made them less flexible and hard to adapt to different scenarios. Furthermore, they involved multiple disconnected image processing stages, such as noise removal, contrast adjustment, image enhancement, and refinement, along with the manual definition and extraction of specific features such as edges, textures, or colors.

The shift from traditional image processing methods to deep learning techniques in instance segmentation represents a significant evolution in agricultural image analysis. Traditional methods such as the watershed transform [18], graph-based segmentation, and active contours rely heavily on predefined algorithms that segment images based on intensity gradients, color, texture, or connectivity, and often require extensive manual tuning to extract the features specific to different crops or conditions [18], [19]. These techniques, although fundamental in the early days of computer vision, require iterative adjustments and are limited by their inability to learn dynamically from new data or adapt to diverse environments. In contrast, deep learning models, particularly convolutional neural networks (CNNs), provide layers of learnable filters that automatically extract and learn the most informative features from vast amounts of data. Unlike traditional methods, which depend on manually defined features and are therefore prone to human bias and error, deep learning systems learn to recognize the underlying patterns and variations in the data/images, which makes them more robust and accurate. This capability is particularly beneficial in agricultural applications (e.g., canopy segmentation), where variability in plant appearance and environmental conditions can be high [20]. CNN models offer end-to-end learning, which not only reduces processing time but also enhances the models' ability to adapt to new and unseen scenarios, an essential feature for generalizable and scalable agricultural applications. This capacity for dynamic learning is a substantial improvement over traditional methods, which are static and constrained by their algorithmic rigidity.

More specifically, DL network architectures, including U-Net [21], Mask R-CNN [22], and YOLO [23], are increasingly used for a range of applications in agriculture. A key advantage of these DL techniques is the end-to-end learning approach, which enables direct mapping from raw images to segmentation results and thereby improves consistency and reliability. Furthermore, transfer learning techniques allow models pretrained on large datasets to be adapted to specific agricultural tasks, which reduces training times and data requirements. Using these features of DL models, a variety of agricultural applications have been investigated, including plant disease identification [24], yield estimation [25], [26], pest detection [27], [28], soil health assessment [29], crop maturity analysis [30], and site-specific weed control [31], [32], [33], demonstrating their versatility and efficiency in modern agricultural practices.

As mentioned before, instance segmentation techniques have been widely applied in crop disease management [34]. Early detection of plant diseases is crucial for maintaining crop productivity and quality. Using instance segmentation, researchers can quantify symptoms such as leaf spots and discoloration and monitor the progression of these diseases over time [35]. This capability is essential in developing effective disease management strategies, including targeted treatments and breeding for disease-resistant varieties. Instance segmentation has also proven pivotal in estimating crop yield accurately. Accurate yield estimation is necessary for farmers and breeders to make informed decisions about crop management and trait selection when breeding new varieties. Instance segmentation techniques can be used to count and size individual fruits or other canopy objects accurately from images. This information facilitates accurate yield estimation and provides key insights into varietal characteristics [34]. Previous studies have demonstrated the effectiveness of these techniques in related applications such as apple flower segmentation [35], segmentation and localization of strawberries for harvesting, and segmentation of guava fruits and branches [36]. The data derived from these studies help improve crop management strategies, including the optimal application of water and fertilizer and the identification of high-yielding varieties [35].

In addition, instance segmentation has been widely applied in developing machine vision systems for agricultural robots because it enables robots to detect, localize, and track individual objects of interest in agricultural fields, such as fruits, branches, flowers, vegetables, and livestock, using images or videos [37]. Detecting and tracking plant parts such as leaves, stems, trunks, branches, flowers, and fruits is essential for a robot to automatically perform tasks such as harvesting, canopy management, and crop-load management. In the past few years, several studies have used deep learning-based instance segmentation to develop robotic solutions for a variety of agricultural applications, such as dormant-season tree pruning [38], fruit and vegetable picking [39], [40], [41], flower and fruitlet thinning, and weed identification and killing [42], [43], among others.

Among the wide applications of deep learning techniques in agriculture, there has been a focus on two specific architectures: YOLO (You Only Look Once) and Mask Region-Based Convolutional Neural Network (Mask R-CNN). These models, known for their effectiveness in instance segmentation, have been pivotal in advancing tasks such as crop detection, pest and disease management, weed identification, tree canopy segmentation, and detection of canopy objects (e.g., branches and fruits). These tasks, which are crucial in precision and automated agriculture, benefit substantially from the capabilities of these two deep learning models. Many recent studies in agricultural applications have used Mask R-CNN-based instance segmentation [44] for tasks such as crop detection [45], [46], pest and disease detection [47], [48], [49], weed detection [50], [51], tree canopy segmentation [52], [53], and tree branch detection [52], [53]. Meanwhile, the YOLO family of models has been widely used for object detection because of its ability to handle object detection, image classification, and instance segmentation simultaneously with one-stage networks. Unlike Mask R-CNN, a two-stage model well suited to segmentation tasks [54], YOLO streamlines the overall processing to deliver the speed and efficiency needed for real-time agricultural applications such as robotic pruning [55], thinning [56], and pesticide application [57].

A number of recent studies have focused on segmenting tree trunks and branches using different deep learning approaches. For example, [58], [59] used deep learning to detect branches automatically in jujube trees, and [60] used region-based convolutional neural network (R-CNN) models along with depth features to detect branches in fruiting apple trees. Segmentation of plant canopy parts in dormant vineyards has been studied extensively using various deep learning techniques (e.g., [61], [100]). Other models have emerged, such as ViNet [62], which offers deep learning solutions for estimating vine structures. Further developments include the application of deep learning and geometric constraints for occluded branch segmentation and 3D reconstruction [63], as well as the use of space colonization algorithms for jujube pruning [59]. A deep learning-based sensing system (called SPGnet) for jujube by Baojian et al. [58], branch detection in apple trees using R-CNN by Zhang et al., 2014 [64], and the tiny Mask R-CNN of Lin et al. for guava branch reconstruction [65] are other recent studies in this area. In addition, Aguiar et al. [66] explored trunk segmentation using a deep learning-based semantic segmentation approach with a Single Shot MultiBox Detector (SSD). Compared with the performance measures reported by these recent and innovative methodologies in the literature, the YOLOv8 model presented in this study performed better in tree trunk segmentation in terms of precision (0.95), recall (0.97), and […]. Furthermore, although the Mask R-CNN model performed relatively worse than YOLOv8, its performance was similar to or better than that of many recent studies on trunk and branch detection, including [60], [61], [62], [63], [67], [68].

Both models have been studied extensively, as discussed above and as shown in Table 1, which highlights 23 publications from the last three years that focus specifically on analyzing canopy images of modern apple orchards.

Table 1: Studies conducted in the last three years on YOLO and Mask R-CNN in different apple orchard environments.

References	Year	DL model	Objectives
[69], [70]	2021	YOLO-V4	Apple detection in a complex scene
[71], [72]	2021	Mask R-CNN	Deep learning-based apple detection
[71], [73]	2021	YOLO-V3	Green fruit detection (apple, mango)
[73], [74]	2021	YOLO-V5	Fruitlet detection for fruit thinning
[75]	2022	Mask R-CNN	Branch identification and junction point detection in apple trees; trunk identification and segmentation
[76], [77]	2022	YOLO-V4	Apple detection, counting, and tree trunk tracking in modern orchards
[78]	2022	YOLO-V4	Detection of ripe/immature apples on tree structures with dense foliage for early crop-load estimation
[79]	2022	YOLO-V5	Method for identifying the apple growth pattern in the orchard
[80]	2022	YOLO-V5	Tree trunk and obstacle detection in apple orchards
[81], [82]	2022	Mask R-CNN	Segmentation of ripe and green apples in orchards
[83]	2022	Mask R-CNN	Tree and tree crown segmentation in orchards
[84]	2023	YOLO-V3	Apple fruit quality detection
[85]	2023	YOLO-V7	Detection and counting of small target apples
[56], [82]	2023	Mask R-CNN	Green apple segmentation

Based on this background of the wide application of YOLOv8 and Mask R-CNN models, the main goal of this study is to compare and evaluate the performance of these two models for instance segmentation tasks in modern commercial apple orchards. Through this comprehensive comparison, this research aims to provide insights into the suitability, efficiency, and potential challenges associated with deploying each model in agricultural automation applications. To achieve this goal, the study pursues the following specific objectives:

To compare the performance of YOLOv8 and Mask R-CNN in single-class instance segmentation, specifically of green apples (fruitlets), in images collected from varying orchard environments in the early growing season; and
To evaluate the capabilities of the two models in multi-class instance segmentation, specifically of the primary branches and trunks of apple trees, in images collected from a typical apple orchard during the dormant season.

The comparison between YOLOv8 and Mask R-CNN in this study rests on significant advances in the YOLO architecture that extend its capabilities beyond bounding-box detection. Traditionally, YOLO models were known primarily for their speed and efficiency in object detection, but newer versions, particularly YOLOv8, have integrated features that support instance segmentation. This adaptation allows YOLOv8 not only to predict bounding boxes but also to generate precise object masks, which aligns its functionality more closely with that of Mask R-CNN, long a benchmark in instance segmentation. Because both models now offer robust instance segmentation, comparing their performance in agricultural applications, where both detection speed and segmentation accuracy are critical, is highly relevant and scientifically justified.

The rest of this paper is organized to provide a comprehensive comparison between YOLOv8 and Mask R-CNN, with a particular focus on their application in commercial apple orchards. First, a “Background” section outlines the theoretical frameworks of one-stage (YOLOv8) and two-stage (Mask R-CNN) detectors, setting the stage for a deeper understanding of these complex models. A “Materials and methods” section then describes the experimental design, data acquisition, and analytical methodologies. Next, a “Results and discussion” section reports the findings and critically discusses model performance in the different segmentation tasks, including efficiency and effectiveness. A “Conclusion” section summarizes the research methods and findings, and the paper ends with a “Future work” section that outlines potential further comparisons with other state-of-the-art models.

2. Deep learning models

Deep learning models used for object detection are generally categorized into two distinct approaches: one-stage detectors and two-stage detectors [86]. Two-stage detectors, such as Mask R-CNN, first generate regions of interest (ROIs) in an initial stage using a Region Proposal Network (RPN) [44]. These regions are then classified and refined in the second stage to provide precise object localization and classification. This approach is known for its high accuracy owing to the focused refinement of detected objects. One-stage detectors such as YOLO, on the other hand, streamline this process by predicting object classes and bounding boxes in a single pass through the network, trading some accuracy for substantial gains in speed. One-stage models do not separate detection into distinct region proposal and refinement stages, which lets them run faster and makes them well suited to applications that require real-time processing [86]. Both approaches have continued to evolve in recent years to address the trade-off between speed and accuracy.

Many one-stage and two-stage detectors have been developed over the past decade, each designed for specific performance criteria in terms of speed and accuracy. Some of the most widely used two-stage detectors include Fast R-CNN (which improves the efficiency of feature use), Faster R-CNN (which integrates a Region Proposal Network), and R-FCN and Cascade R-CNN (which enhance localization and classification accuracy through specialized networks). Similarly, the most widely used one-stage detectors include SSD (Single Shot MultiBox Detector), RetinaNet (which introduced focal loss to handle class imbalance), and the YOLO (You Only Look Once) model family, including YOLOv5 and YOLOv6 (which focus on detection speed). Although numerous one-stage and two-stage models are available, YOLOv8 and Mask R-CNN have been the most widely used in agricultural applications, with highly impactful results [87], [88], [89], [90], [91], [92], [93]. Comparative studies have clearly shown their superior performance in detecting and segmenting agricultural objects under diverse conditions, as shown in Table 1 of this manuscript. These models balance the trade-offs among speed, accuracy, and reliability, which makes them particularly suitable for the dynamic environments encountered in agricultural settings. Based on these previous studies, Mask R-CNN (for its precise segmentation capability) and YOLOv8 (for its exceptional speed) were selected for this study.

2.1 Mask R-CNN

Mask R-CNN is a deep learning model designed for object detection and instance segmentation and is known for its accuracy and efficiency. Its strength lies in its ability to identify and delineate each object in an image precisely, which makes it highly effective in complex image analysis tasks. The model was developed by researchers at Facebook AI Research in 2017 and builds on the Faster R-CNN object detection model by adding a branch that predicts object masks in parallel with the existing branch for bounding-box detection [44]. The Mask R-CNN architecture consists of three main components: a backbone network, a Region Proposal Network (RPN), and two parallel branches for bounding-box detection and mask prediction, as shown in Figure 1. The backbone is typically a convolutional neural network (CNN) that extracts features from the input images and is shared by both branches. The RPN generates a set of region proposals that are likely to contain objects, based on the feature maps produced by the backbone. The bounding-box branch predicts the object class and bounding-box coordinates for each region proposal, whereas the mask branch predicts a binary mask for each object within the bounding box.

However, applying Mask R-CNN and other deep learning models in agriculture comes with several challenges. First, model performance depends heavily on the quality and diversity of the training dataset [94]. Agricultural environments are highly variable, with changes in lighting, weather conditions, and plant growth stages, all of which can affect model accuracy [95]. Moreover, Mask R-CNN requires substantial computational resources for training and inference [96], which can be a limitation in real-time, on-farm applications where such resources are limited [14], [16], [17].

Ongoing studies focus on addressing these challenges and improving deep learning models such as Mask R-CNN for better efficiency and reliability. These efforts include integrating more adaptive and scalable neural network architectures, improving data augmentation techniques to make the model more resilient to environmental variation, and developing lightweight versions of the models that maintain high accuracy while using fewer resources. For example, researchers have adapted Mask R-CNN to identify and segment pineapples accurately against complex backgrounds [84], which is crucial for automated harvesting systems that aim to reduce labor costs and increase harvest accuracy. Similarly, Mask R-CNN has been deployed to locate specific picking points on tea plants [97] to help harvest high-quality tea leaves while minimizing damage to the plants. The model has also shown promising results in various horticultural applications such as strawberry ripeness assessment [98]. By combining Mask R-CNN with region segmentation techniques, that system was able to distinguish between ripeness stages, which lets growers optimize picking times for better market prices and less waste. In apple orchards, Mask R-CNN has been used to detect flowers and identify the king flower, which is crucial for targeted pollination strategies [99]. This application is expected to help improve pollination efficiency, leading to higher fruit yield and quality. Mask R-CNN has also been applied to crop stress monitoring, such as estimating fruit surface temperature using Internet of Things (IoT) sensors. This technique enables real-time monitoring and management of fruit crops, which helps orchard managers mitigate the effects of heat stress and maintain fruit quality and yield.

Figure 1: Mask R-CNN architecture: (a) architecture diagram highlighting the backbone network, RPN, and bounding-box and mask prediction branches; and (b) detailed view of the Region Proposal Network (RPN).

2.2 YOLOv8

The YOLO (You Only Look Once) family of object detection and instance segmentation models has evolved rapidly over the past few years, with each new version bringing improvements in accuracy and/or speed. YOLOv8 (Figure 2), the latest one-stage model, builds on the foundations laid by earlier YOLO models such as YOLOv3 and YOLOv5. In contrast to two-stage models, YOLOv8 predicts bounding boxes and class probabilities directly, without a separate region proposal network, which simplifies the object detection process. One of the key innovations in YOLOv8 is the adoption of an anchor-free, center-based approach to object detection, which offers several advantages over traditional anchor-based methods such as YOLOv5, YOLOv6, and YOLOv7. YOLOv8 implements Pseudo Ensemble or Pseudo Supervision (PS), a method that involves training multiple models with distinct configurations on the same dataset to generate a more diverse set of predictions, which improves the accuracy and reliability of the final prediction. In addition, YOLOv8 takes advantage of the Darknet-53 architecture, a 53-layer deep neural network optimized for feature extraction and object detection. One significant architectural change in YOLOv8 is the replacement of the C3 module with the C2F module. The C3 module, also known as the convolution module, processes the input data through a series of convolutional operations. The C2F module, an improved version of the C3 module, improves accuracy and processing time compared with earlier models. Furthermore, YOLOv8 replaces the 6 × 6 convolution layer with a 3 × 3 layer in the model structure, which reduces the number of parameters and yields a more computationally efficient network. YOLOv8 also uses a decoupled head, which separates the tasks of predicting object presence and classifying object types, improving both accuracy and processing speed. These improvements position YOLOv8 as an effective solution for both object detection and instance segmentation in computer vision.
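To make the anchor-free, center-based idea concrete, the following is a minimal sketch of how a box can be decoded from a single feature-map cell. It assumes the head outputs four distances (left, top, right, bottom) in grid units from the cell center; the function name, the stride value, and the example numbers are illustrative assumptions and are not taken from the study or from any specific YOLOv8 implementation.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in input-image pixels


def decode_anchor_free(col: int, row: int, stride: int,
                       ltrb: Tuple[float, float, float, float]) -> Box:
    """Turn per-cell side distances into a box (hypothetical helper).

    (col, row) indexes a cell of the feature map, stride is the downsampling
    factor of that map, and ltrb holds the predicted distances (in grid units)
    from the cell center to the left, top, right, and bottom box edges.
    """
    # Center of the cell, expressed in input-image pixels.
    cx = (col + 0.5) * stride
    cy = (row + 0.5) * stride
    left, top, right, bottom = ltrb
    return (cx - left * stride, cy - top * stride,
            cx + right * stride, cy + bottom * stride)


# Illustrative values only: cell (10, 4) on a stride-8 feature map.
box = decode_anchor_free(col=10, row=4, stride=8, ltrb=(1.5, 2.0, 2.5, 1.0))
print(box)
```

No predefined anchor shapes appear anywhere in this decoding, which is the practical difference from anchor-based heads that regress offsets relative to a fixed set of box templates.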

Figure 2: YOLOv8 architecture, showing its design for object detection and segmentation (https://deci.ai/blog/history-yolo-object-detection-models-from-yolov1-yolov8/).

YOLOv8 offers several configurations to meet different needs in terms of computational speed and accuracy [100], [101]: YOLOv8-Tiny for fast processing at the cost of some accuracy, ideal for real-time applications on resource-constrained devices; YOLOv8-Small, which balances speed with more detailed detection capabilities; YOLOv8-Standard for robust performance in diverse environments; and YOLOv8-Large, which prioritizes high accuracy for critical applications where detail and precision are paramount [90].

Recent developments in YOLOv8 have facilitated its adoption in a variety of agricultural applications, demonstrating its effectiveness in addressing the specific challenges of the different farming environments discussed earlier. With improved processing of low-level features, YOLOv8 becomes a powerful tool for early detection of subtle signs of agricultural pests and diseases, which is crucial for maintaining crop health [102]. For example, improved versions of YOLOv8 have been applied to detect diseases in greenhouse vegetables [103] to support early detection and management of plant diseases. Other innovations include integrating attention mechanisms into YOLOv8 to strengthen its object detection capabilities, which was tested by [104] to improve tomato detection accuracy in cluttered agricultural environments. Similarly, [93] focused on combining YOLOv8 with lightweight transformer architectures to improve feature extraction, which helped improve strawberry ripeness detection in field environments. In orchard environments, YOLOv8 has been used with shape-fitting techniques to accurately estimate the size of immature green apples, which is vital for yield estimation and growth monitoring [106]. Similarly, a specialized YOLOv8 model was developed to monitor the ripening process of color-changing melons, marking a significant step toward tailored applications in agriculture [107].

3. Materials and methods

This study consisted of four major steps, as shown in Figure 3a, starting with the acquisition of RGB images from commercial orchards in two distinct seasons (Figure 3b, dormant season; Figure 3c, early growing season). These images were captured under varying environmental conditions, such as sunny and cloudy days, and then manually labeled to create the training and test datasets. The training dataset was then used to train the two deep learning models mentioned above, and their instance segmentation performance was evaluated on the test dataset.

Figure 3: (a) Overall workflow used in this research; (b) example image of an apple orchard during the dormant season (22 November 2022); (c) example image of an apple orchard during the early fruit growth season (18 June 2023); (d) Intel RealSense 435i camera used to acquire images for training and testing the instance segmentation models; (e) example of the trunks and branches used to label dormant-season images; and (f) example of the immature green fruits (fruitlets) used to label early growing-season images.

3.1 Study site and data acquisition

This study was conducted in a commercial apple orchard (Figures 3b and 3c) owned and operated by Allan Brothers Fruit Company, located in Prosser, Washington State, USA. The orchard was planted in 2009 with the Scilate apple cultivar at a row spacing of 9.0 ft and a plant spacing of 3.0 ft, and was trained to a V-trellis architecture. Two sets of RGB images were acquired using an Intel RealSense 435i camera (Intel Corporation, California, USA): one in November 2022 to create the dormant-season dataset, as shown in Figures 3b and 3e, and the other in June 2023 (before manual fruit thinning), which provided the early growing-season dataset, as shown in Figures 3c and 3f. The Intel RealSense camera was chosen for RGB image capture because its software development kit (Intel RealSense SDK) allows parameter adjustment and high-quality image capture.

3.2 Data preparation

Two types of datasets comprising a total of 1,553 RGB images were prepared, capturing a range of orchard lighting conditions for deep learning model analysis. Dataset 1 consists of 474 dormant-season images, which were manually labeled to represent multi-class objects: the tree trunk and the primary branches growing from the trunks (Figure 4). In total, 1,141 tree trunk labels and 2,369 tree branch labels were created manually by drawing polygons over the desired objects in these images using the Labelbox image annotation software. Similarly, Dataset 2 consists of 1,079 images from the green-fruit growing season, in which 5,921 labels were created for immature green apples. During the image processing stage in Labelbox, all labels were formatted according to the COCO dataset specification, which meets the requirements of both YOLOv8 and Mask R-CNN for image segmentation. Furthermore, to facilitate model training and validation, the images in both datasets were resized to […] pixels, and both datasets were randomly split into training, validation, and test subsets according to a […] ratio for each object class.
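As an illustration of the random partitioning step, the sketch below shuffles a list of image names with a fixed seed and cuts it into training, validation, and test subsets. The split ratio used in the study is not recoverable from the text, so the `(0.8, 0.1, 0.1)` ratio, the file names, and the function name are placeholders chosen only for illustration. The sketch splits per image and does not attempt the per-class balancing that the text mentions.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence[str],
                  ratios: Tuple[float, float, float],
                  seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items and cut them into train/validation/test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test subset
    return train, val, test


# 474 dormant-season images (Dataset 1); the file names are hypothetical.
images = [f"dormant_{i:04d}.png" for i in range(474)]
# PLACEHOLDER ratio for illustration only, not the ratio used in the study.
train, val, test = split_dataset(images, ratios=(0.8, 0.1, 0.1))
print(len(train), len(val), len(test))
```

Assigning the remainder to the last subset guarantees that every image lands in exactly one subset even when the ratios do not divide the dataset size evenly.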

Figure 4: Workflow diagram showing the two types of datasets used in the study. Dataset 1 included dormant-season apple trees with multi-class objects (trunk and branch), and Dataset 2 included growing-season apple tree canopies with immature green fruits.

3.3 Deep learning model implementation

Both the YOLOv8 and Mask R-CNN models were trained on a workstation with an Intel Xeon® W-2155 CPU @ 3.30 GHz x20, an NVIDIA TITAN Xp Collector's Edition/PCIe/SSE2 graphics card, 31.1 GB of memory, and the Ubuntu 16.04 LTS 64-bit operating system. The backend framework for model implementation was PyTorch, running on Linux. To optimize performance, a learning rate of 0.001, a batch size of 32, and a dropout rate of 0.5 (to mitigate overfitting) were used. Training was run for up to 1000 epochs and was stopped earlier if model performance on the validation dataset did not improve for 20 consecutive epochs, which helped reduce overfitting to the training dataset and improve generalization. An initial learning rate of 0.01 was used in training both models, with momentum and weight decay of 0.937 and 0.0005, respectively. These parameter settings were chosen to speed up the training process while reducing the chance of overfitting to the training dataset. During the first three epochs, a warm-up phase with a momentum of 0.8 and a bias learning rate of 0.1 was used to stabilize model optimization and reduce the risk of getting stuck in a poor local minimum.
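The stopping rule described above (halt when the validation score has not improved for 20 consecutive epochs, with a cap of 1000 epochs) can be sketched as follows. This is a generic patience-based early-stopping loop, not the authors' training code; the function name and the synthetic validation curve are assumptions made for the example.

```python
from typing import Iterable, Tuple


def early_stopping_epoch(val_scores: Iterable[float],
                         patience: int = 20,
                         max_epochs: int = 1000) -> Tuple[int, int]:
    """Return (epoch at which training stops, best epoch), both 1-indexed.

    val_scores yields one validation score per epoch (higher is better).
    """
    best_score = float("-inf")
    best_epoch = 0
    epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        # Stop once `patience` epochs have passed without a new best score,
        # or once the epoch cap is reached.
        if epoch - best_epoch >= patience or epoch >= max_epochs:
            break
    return epoch, best_epoch


# Synthetic validation curve: improves for 50 epochs, then plateaus.
scores = [min(e, 50) / 50 for e in range(1, 1001)]
stopped_at, best = early_stopping_epoch(scores)
print(stopped_at, best)  # 70 50
```

In practice the weights saved at the best epoch, rather than those at the stopping epoch, would normally be kept for evaluation.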

During training, various augmentation techniques were applied to improve model robustness and generalization, including hue augmentation (0.015), saturation augmentation (0.7), value augmentation (0.4), translation (0.1), scale variation (0.5), and left-right flips with a probability of 0.5. In addition, mosaic augmentation was applied with a probability of 1.0. After model training was complete, the model outputs were converted to TorchScript format to simplify post-processing for evaluating the performance of both YOLOv8 and Mask R-CNN in terms of precision, recall, mean average precision (mAP), and area under the curve (AUC), as described below. The specific data augmentation techniques and hyperparameter values used during training are detailed in Table 2:

Table 2: Data augmentation and regularization parameters used to train the models in this study

Method	Value
Hue augmentation (fraction)	0.015
Saturation augmentation (fraction)	0.7
Value augmentation (fraction)	0.4
Rotation	0.0
Translation	0.1
Scale	0.5
Flip left-right (probability)	0.5
Mosaic (probability)	1.0
Weight decay	0.0005
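The settings in Table 2 and the training description can be gathered into a single configuration, sketched below. The values come from the text; the key names follow the training-argument names of the Ultralytics package, which is an assumption, since the paper does not state which YOLOv8 implementation was used. The `train_yolov8_seg` function, the `yolov8n-seg.pt` weights file, and the dataset YAML path are likewise hypothetical. The text reports both a learning rate of 0.001 and an initial learning rate of 0.01; the sketch uses the latter.

```python
# Values reported in the text and Table 2; the key names follow the
# Ultralytics training-argument naming, which is an assumption here.
train_args = {
    "epochs": 1000,          # upper bound on training epochs
    "patience": 20,          # early-stopping patience (epochs)
    "batch": 32,             # batch size
    "lr0": 0.01,             # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "warmup_epochs": 3,      # warm-up phase length
    "warmup_momentum": 0.8,
    "warmup_bias_lr": 0.1,
    "hsv_h": 0.015,          # hue augmentation (fraction)
    "hsv_s": 0.7,            # saturation augmentation (fraction)
    "hsv_v": 0.4,            # value augmentation (fraction)
    "degrees": 0.0,          # rotation
    "translate": 0.1,        # translation
    "scale": 0.5,            # scale variation
    "fliplr": 0.5,           # left-right flip probability
    "mosaic": 1.0,           # mosaic probability
}


def train_yolov8_seg(data_yaml: str, weights: str = "yolov8n-seg.pt"):
    """Hypothetical entry point; requires the `ultralytics` package."""
    from ultralytics import YOLO  # imported lazily so the dict is usable alone
    model = YOLO(weights)
    return model.train(data=data_yaml, **train_args)
```

Keeping the hyperparameters in one dictionary makes it straightforward to log them alongside the results or to reuse the augmentation values when configuring the second model.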

The number of training epochs was determined through a preliminary test. In this test, model performance on the validation dataset was monitored during training, and the epoch at which validation accuracy began to decline or plateau was chosen as the optimal number of epochs, which was expected to help avoid overfitting.

3.4 Performance evaluation

To evaluate the instance segmentation capabilities of the Mask R-CNN and YOLOv8 models, five distinct metrics were used: precision, recall, mean average precision at an intersection-over-union threshold of 0.5 (mAP@0.5 IoU), area under the receiver operating characteristic curve (AUC), and inference speed. Precision is defined as the ratio of correctly identified positive instances to all predicted positive instances, as given in Equation 1. Similarly, recall, given in Equation 2, measures the proportion of correctly identified positive instances among all actual instances of the target objects. Mean average precision (mAP), the mean of AP across the k classes (Equation 4), was critical in assessing model accuracy at an overlap threshold of 0.5 between the predicted and ground-truth boundaries/bounding boxes of objects. The classification effectiveness of the model across all possible thresholds was evaluated using the area under the curve (AUC), defined in Equation 5. The efficiency of the model in processing images and producing predictions was measured by inference speed, which is inversely related to the time taken to analyze each image.
These metrics are calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

$$AP = \int_{0}^{1} \text{Precision}(\text{Recall}) \, d(\text{Recall}) \qquad (3)$$

$$mAP = \frac{1}{k} \sum_{i=1}^{k} (AP)_i \qquad (4)$$

$$AUC = \int_{0}^{1} TPR \, d(FPR) \qquad (5)$$

where TP, FP, and FN are the numbers of true positive, false positive, and false negative object instances, respectively. The variable k is the total number of object classes, and $(AP)_i$ is the average precision computed for the i-th of these k classes. AP is the area under the precision-recall curve for a given class. TPR is the true positive rate, FPR is the false positive rate, and t is the time taken by a model to produce results for a single image, so that inference speed is proportional to 1/t.
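The definitions above can be written out as a short, dependency-free sketch. The counts and curve points in the example are hypothetical, box IoU is used for brevity (mask IoU follows the same intersection-over-union logic on pixel sets), and the trapezoidal rule is one simple way to approximate an area under a curve; evaluation toolkits often use interpolated variants instead.

```python
from typing import Sequence


def precision(tp: int, fp: int) -> float:
    """Equation 1: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Equation 2: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def area_under_curve(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Trapezoidal area under a curve; xs must be sorted in ascending order.

    With (recall, precision) points this approximates AP; with (FPR, TPR)
    points it approximates AUC.
    """
    area = 0.0
    for i in range(1, len(xs)):
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2
    return area


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """Equation 4: unweighted mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


def images_per_second(t_ms: float) -> float:
    """Inference speed as the inverse of per-image time t (milliseconds)."""
    return 1000.0 / t_ms


# Hypothetical counts and curve points, for illustration only.
print(precision(tp=90, fp=10), recall(tp=95, fn=5))
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
print(area_under_curve([0.0, 0.5, 1.0], [1.0, 1.0, 0.0]))
print(images_per_second(10.9))  # per-image time reported for YOLOv8 on Dataset 1
```

A prediction counts as a true positive at mAP@0.5 when its IoU with an unmatched ground-truth object is at least 0.5, which is how the IoU function connects to the TP, FP, and FN counts used by the other functions.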

4. Results and discussion

4.1 Single-class instance segmentation of immature green apples (fruitlets)

For single-class segmentation of immature green fruit, the precision-confidence curves in Figure 5 showed that the YOLOv8 model reached a maximum precision of 1.00 at a confidence threshold of 0.929 (Figure 5a). Similarly, the recall-confidence curves for the two models are presented in Figure 6, which showed that YOLOv8 reached a recall of 0.97 at the lowest confidence threshold of 0.000. This high recall, or sensitivity, indicates the model's ability to correctly identify a high proportion of the actual objects, and shows its effectiveness in segmenting green fruits even at the lowest confidence levels. In addition, YOLOv8 outperformed Mask R-CNN in mean average precision (mAP), achieving 0.939 at an IoU threshold of 0.5 for the green fruit class and for all classes, compared with an mAP of 0.902 achieved with Mask R-CNN (Figure 7).
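The shape of precision-confidence and recall-confidence curves follows directly from how detections are filtered by confidence. The sketch below uses hypothetical detections to show why recall peaks at the lowest threshold while precision tends to reach 1.00 only at high thresholds; the data, the function name, and the convention of reporting precision as 1 when no detections remain are assumptions for illustration.

```python
from typing import List, Tuple


def curves_vs_confidence(detections: List[Tuple[float, bool]],
                         n_ground_truth: int,
                         thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each confidence threshold.

    detections holds (confidence, is_true_positive) pairs, where the flag
    says whether the detection matched a ground-truth object at IoU >= 0.5.
    """
    rows = []
    for thr in thresholds:
        kept = [is_tp for conf, is_tp in detections if conf >= thr]
        tp = sum(kept)
        fp = len(kept) - tp
        # Assumed convention: with no detections kept, precision is set to 1.
        prec = tp / (tp + fp) if kept else 1.0
        rec = tp / n_ground_truth if n_ground_truth else 0.0
        rows.append((thr, prec, rec))
    return rows


# Hypothetical detections for an image set with 4 ground-truth fruits.
dets = [(0.95, True), (0.90, True), (0.60, False), (0.40, True), (0.10, False)]
rows = curves_vs_confidence(dets, 4, [0.0, 0.5, 0.93])
for thr, p, r in rows:
    print(f"conf>={thr:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold discards low-confidence detections, which removes false positives (raising precision) but also drops some correct detections (lowering recall), so the two curves move in opposite directions.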
The performance differences between YOLOv8 and Mask R-CNN largely reflect the distinct nature of their architectures and the way they process images. YOLOv8, as a one-stage detector, is designed for speed and accuracy, and it was able to exclude look-alike non-target regions, as observed in the segmentation results (Figure 8b). Its direct approach to object detection avoids the region proposal step, which led to fewer false positives in canopy regions that resemble the target fruits in color. Mask R-CNN, on the other hand, uses a two-stage process that generates region proposals before classifying and segmenting objects. This can sometimes lead to the inclusion of non-target regions, such as leaves and stems misclassified as fruits (Figure 8c). Moreover, its performance appears to be more sensitive to lighting variation, which can lead to errors in object delineation under extreme lighting conditions such as bright direct sunlight and dark shadows (Figure 9c). Despite these differences, there are cases in which Mask R-CNN may remain the preferred choice. Its two-stage process, particularly the region proposal step, can be beneficial in complex segmentation tasks where precision is critical and objects are close together or partially occluded. In the past, green fruit segmentation has been investigated using a variety of approaches. The D2D framework of Wei et al. [108], GHFormer with its focus on night-time detection [109], the FCOS model of Liu et al. for occluded fruits [110], the ResNet-based FoveaMask model of Jia et al. [111], and the combination of GrabCut and Ncut algorithms by Sun et al. [112] offered solutions to specific segmentation challenges such as low accuracy and high computational cost. Some studies also explored semi-automated models [113]. However, the YOLOv8 model in this study outperformed the previous studies reviewed. Likewise, the performance of Mask R-CNN in segmenting immature green fruits, although not as high as that of YOLOv8, still exceeded that of many recent approaches [78], [84], [109], [110], [113], [114].

Figure 5: Precision-confidence curves for single-class segmentation of immature green apples (fruitlets) using (a) YOLOv8 and (b) Mask R-CNN.

Figure 6: Recall-confidence curves for single-class segmentation of green apple fruit using (a) YOLOv8 and (b) Mask R-CNN.

Figure 7: Precision-recall curves for single-class segmentation of green apple fruit at mAP@0.5: (a) YOLOv8 and (b) Mask R-CNN.

Figure 8: Example images illustrating the performance of the two methods in segmenting immature green fruit under orchard conditions: (a) original images; (b) YOLOv8 instance segmentation results; and (c) Mask R-CNN instance segmentation results. Some problematic regions in the canopy images (yellow circles) were incorrectly segmented as green fruit by Mask R-CNN but correctly left as background by YOLOv8.

Figure 9: False detections under growing-season orchard conditions, with the yellow region marking the area of focus: (a) original image 1; (b) YOLOv8 instance segmentation results; and (c) Mask R-CNN instance segmentation results.

Among recently published results, Wan et al. [115] detected multi-class fruit with a robotic vision system and compared YOLOv3, Faster-R-CNN, and Improved Faster-R-CNN, reporting mAP (%) values of […], respectively. Based on the performance metrics achieved in this study (for example, […] for YOLOv8 and […] for Mask R-CNN) on the single-class dataset, YOLOv8 and Mask R-CNN appear capable of substantially better performance than these other models. However, a further study comparing all of these models on the same dataset would be needed to support this finding.

4.2 Multi-class instance segmentation in dormant apple tree images

As with the single-class instance segmentation discussed above, YOLOv8 performed better than Mask R-CNN in segmenting dormant apple tree images into multiple object classes (trunks and branches). YOLOv8 achieved a precision of 1.00 at a confidence threshold of 0.906, as shown in Figure 10. Similarly, Figure 11 shows that YOLOv8 recall reached 0.95 at the lowest confidence threshold, indicating that it captured most of the complex structures in dormant tree canopies. Mask R-CNN reached a precision of 1.00 at a lower confidence threshold of 0.813, indicating a strong ability to correctly detect target objects at that confidence level (Figure 10b). In addition, Mask R-CNN recall, shown in Figure 11, was 0.837 at the lowest confidence threshold, indicating a higher false-negative rate than YOLOv8. Likewise, the precision-recall curve (Figure 12a) shows that YOLOv8 achieved a mean average precision (mAP) of 0.845 across all object classes at an intersection over union (IoU) of 0.5, with per-class values of 0.971 for trunks and 0.719 for branches. Mask R-CNN also performed comparatively worse in multi-class segmentation. As shown in Figure 12b (precision-recall curve), the model achieved an all-class mAP of 0.748 at an IoU of 0.5, with per-class values of 0.828 for trunk segmentation and 0.673 for branch segmentation.
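The all-class mAP is conventionally the unweighted mean of the per-class average precision values. The short Python sketch below applies that convention to the per-class figures reported above. The function name `mean_average_precision` is hypothetical, and the assumption that the study averaged classes with equal weight is ours; the source does not state the averaging procedure.

```python
from typing import Dict


def mean_average_precision(per_class_ap: Dict[str, float]) -> float:
    """Unweighted mean of per-class average precision (AP) values."""
    if not per_class_ap:
        raise ValueError("at least one class is required")
    return sum(per_class_ap.values()) / len(per_class_ap)


# Per-class AP at IoU 0.5 as reported in the text above.
yolov8_ap = {"trunk": 0.971, "branch": 0.719}
mask_rcnn_ap = {"trunk": 0.828, "branch": 0.673}

print(f"YOLOv8 all-class mAP@0.5:     {mean_average_precision(yolov8_ap):.4f}")
print(f"Mask R-CNN all-class mAP@0.5: {mean_average_precision(mask_rcnn_ap):.4f}")
```

Under this assumption the YOLOv8 values average to 0.845, matching the reported all-class mAP, while the Mask R-CNN values average to about 0.75, close to the reported 0.748.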

Example images illustrating the comparative successes and failures of the two models (YOLOv8 and Mask R-CNN) in segmenting trunks and branches are shown in Figures 13 and 14. As indicated earlier by mAP and the other metrics, trunks were segmented more accurately by YOLOv8 than by Mask R-CNN, as shown in the typical cases in Figures 13b and 13c, respectively. Specifically, the branch highlighted inside the yellow dotted rectangle (Figure 13a, b, and c) was detected successfully by YOLOv8 but not by Mask R-CNN, demonstrating YOLOv8's better performance in low-light conditions. The example in Figure 13 also shows that YOLOv8 was more effective at segmenting trunks. Similarly, Figure 14 presents examples of successful and failed segmentations of both trunks and branches, showing that YOLOv8 was more accurate (fewer false detections) than Mask R-CNN, especially in regions with challenging lighting and complex backgrounds (e.g., the rectangular box in Figure 14b). By comparison, Mask R-CNN performed worse under these conditions, with its limitations most apparent in poorly lit regions and complex backgrounds (e.g., Figure 14c). The branch segmentation inside the yellow rectangle (Figure 14d) further highlights YOLOv8's ability to detect features despite variable lighting caused by shadows and color changes, an area where Mask R-CNN was less robust in segmenting the desired objects (Figure 14e).

Figure 10: Precision-confidence curves for multi-class segmentation of tree trunks and branches: (a) YOLOv8 and (b) Mask R-CNN.

Figure 11: Recall-confidence curves for multi-class segmentation of trunks and branches using (a) YOLOv8 and (b) Mask R-CNN.

Figure 12: Precision-recall curves for multi-class segmentation of trunks and branches of dormant apple trees at mAP@0.5 using (a) YOLOv8 and (b) Mask R-CNN.

Computational speed is one of the key performance criteria for these models, especially when they are used in real-time field applications such as automated pruning or thinning. The inference times (processing time per image during testing) required to segment green fruit and multi-class objects (trunks and branches) with the YOLOv8 and Mask R-CNN models are presented in Table 3. YOLOv8 took only 7.8 ms for single-class segmentation and 10.9 ms for multi-class segmentation per test image on a workstation with an Intel Xeon® W-2155 processor @ 3.30 GHz x20, an NVIDIA TITAN Xp Collector's Edition/PCIe/SSE2 graphics card, 31.1 GB of memory, and the Ubuntu 16.04 LTS 64-bit operating system. These inference times correspond to inference speeds of about 128 frames per second (FPS) and 92 FPS for single-class and multi-class segmentation, respectively. By comparison, Mask R-CNN inference times were higher: 12.8 ms for single-class segmentation, which translates to about 78 FPS, and 15.6 ms for multi-class segmentation, or about 64 FPS. These differences in processing time demonstrate the suitability of YOLOv8 for real-time applications in both single-class and multi-class segmentation.
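The frame rates quoted above follow directly from the per-image inference times, since FPS = 1000 / (inference time in ms). The sketch below reproduces that arithmetic; the helper name `fps_from_latency_ms` is hypothetical, and the calculation assumes one image is processed at a time with no batching or pipeline overhead.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Convert a per-image inference time in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Inference times (ms per image) as reported in the text above.
inference_ms = {
    ("YOLOv8", "single-class"): 7.8,
    ("YOLOv8", "multi-class"): 10.9,
    ("Mask R-CNN", "single-class"): 12.8,
    ("Mask R-CNN", "multi-class"): 15.6,
}

for (model, task), ms in inference_ms.items():
    print(f"{model:<11} {task:<13} {ms:5.1f} ms -> {fps_from_latency_ms(ms):6.1f} FPS")
```

The results round to about 128, 92, 78, and 64 FPS, in line with the figures given in the text.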

Figure 13: Example results of multi-class segmentation of trunks (yellow circle) and branches (yellow rectangle) in dormant-season orchard images: (a) original images; (b) YOLOv8 segmentation results; and (c) Mask R-CNN segmentation results. Qualitatively, this example shows slightly lower segmentation performance for Mask R-CNN than for YOLOv8.

Figure 14: Multi-class segmentation examples: (a) original image 1; (b) YOLOv8 segmentation; (c) Mask R-CNN segmentation; (d) original image 2; (e) YOLOv8 segmentation; (f) Mask R-CNN segmentation.

Figure 15: Area under the curve (AUC) of the segmentation results for both datasets: the immature green fruit (apple) dataset on the left and the dormant-season orchard dataset on the right.

Table 2: Summary of performance metrics for the YOLOv8 and Mask R-CNN models, including precision, recall, mAP@0.5, inference time, and FPS, for the single-class and multi-class instance segmentation tasks in this study.

Model	Precision (%)	Recall (%)	mAP@0.5	Inference time (ms)	Frames per second (FPS)
YOLOv8 (single-class)	92.9	97	0.902	7.8	128.21
Mask R-CNN (single-class)	84.7	88	0.85	12.8	78.13
YOLOv8 (multi-class)	90.6	95	0.74	10.9	91.74
Mask R-CNN (multi-class)	81.3	83.7	0.700	15.6	64.10

According to a recent comparative study of immature green apple detection using machine learning and deep learning models [116], the Fully Convolutional One-Stage (FCOS) model with a ResNet101 RFPN backbone achieved a precision of […]. The SSD model, using VGG16, achieved […], while YOLOv3 with Darknet-53 reached […]. Faster-R-CNN and RetinaNet, both using ResNet101-FPN, achieved […], respectively. Finally, the CenterNet model, using an Hourglass-104 backbone, recorded […]. Compared with these results, the precision recorded by both YOLOv8 and Mask R-CNN in this study (92.9 and 84.7) is higher. Similarly, other recent studies on detecting and segmenting branches in apple trees, such as one using Unet++ [117], reported a precision of […]. Other studies [118] also targeted similar segmentation tasks, but their results fell short of those obtained here, where YOLOv8 and Mask R-CNN showed higher precision of […], respectively, for branch and trunk segmentation.

4.3 Discussion

Although Mask R-CNN shows commendable precision in segmenting complex agricultural images, it has a slight drawback in terms of speed. In this study, which analyzed performance on green apple fruit during the canopy (growing) season and on trunks and branches during the dormant season, Mask R-CNN achieved 78 FPS for single-class segmentation and 64 FPS for multi-class segmentation on the System76 workstation. Although this inference speed may be sufficient for most offline applications where relatively high computing power can be provided, it may pose challenges in real-time agricultural operations, such as automated pruning and rapid decision-making for fertilization, when computing resources are limited. Nevertheless, its detailed segmentation capability makes it well suited to applications where precision and detailed object delineation are essential.
YOLOv8 stands out for its speed, achieving 128 FPS (about 1.64 times faster than Mask R-CNN) for single-class segmentation and 92 FPS (about 1.43 times faster than Mask R-CNN) for multi-class segmentation on the same imaging and computing infrastructure used to test Mask R-CNN. This higher inference speed is a particular advantage for real-time agricultural tasks, as discussed above. The high precision and recall metrics also confirm its robust performance across diverse environments, including variable lighting conditions. However, while YOLOv8 offers substantial improvements in speed and accuracy, it may sacrifice some segmentation detail compared with two-stage models such as Mask R-CNN, making it slightly less suitable where fine detail matters more than processing speed.
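The relative speed of the two models can be checked from the reported inference times, because the speedup factor is simply the ratio of the slower latency to the faster one. The following sketch is illustrative only; the function name `speedup` is hypothetical, and the inputs are the per-image times reported in this study.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / candidate_ms


# Mask R-CNN is the baseline; YOLOv8 is the candidate (ms per image).
single_class = speedup(baseline_ms=12.8, candidate_ms=7.8)
multi_class = speedup(baseline_ms=15.6, candidate_ms=10.9)

print(f"single-class speedup: {single_class:.2f}x")
print(f"multi-class speedup:  {multi_class:.2f}x")
```

These ratios come out at roughly 1.64 for single-class and 1.43 for multi-class segmentation. The same values result from dividing the corresponding FPS figures.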

YOLOv8's faster inference rates are particularly beneficial for time-sensitive tasks such as automated pruning, especially in low-light conditions, underscoring its suitability for operational efficiency in precision agriculture.

Overall, these results show that the two models evaluated in this study can serve as effective and efficient tools for developing a variety of precise, automated agricultural tools, with potential applications extending to crops beyond apples, which could play a crucial role in improving crop management and increasing crop yield and quality through machine learning. In particular, YOLOv8 showed good adaptability across different orchard conditions, a crucial benefit for advancing robust machine learning-based solutions for future innovations in smart agriculture. Integrating machine learning is key to meeting global needs for agricultural sustainability and food security.

5. Conclusion

In recent years, research, development, and adoption of sensing, precision, automation, and robotics technologies in agricultural operations have increased, driven by the need to reduce farming inputs, including labor, and to increase crop yield and quality. Through comprehensive experiments in commercial orchards, this study presented comparative performance metrics for two state-of-the-art and widely used machine learning (deep learning) models, YOLOv8 and Mask R-CNN, for instance segmentation as applied to crop monitoring and automated canopy and crop-load management tasks (e.g., automated pruning and thinning of immature green fruit). Based on the results, the following specific conclusions can be drawn.

Segmentation performance under varied conditions: Both YOLOv8 and Mask R-CNN effectively segmented apple tree canopy images from both the dormant season and the early growing season. YOLOv8 performed slightly better in environments where objects and backgrounds have similar color characteristics and under variable lighting conditions.
Single-class segmentation (immature green fruit): YOLOv8 excelled in single-class segmentation of immature green fruit, achieving a precision of 0.92 and a recall of 0.97. By comparison, Mask R-CNN showed slightly less effective segmentation, with a precision of 0.84 and a recall of 0.88.
Multi-class segmentation (trunk and branch detection): In detecting both trunks and branches, YOLOv8 was more accurate, achieving precision and recall of 0.90 and 0.95, respectively. Mask R-CNN achieved lower precision and recall, 0.81 and 0.83 respectively, indicating lower effectiveness in multi-class segmentation tasks.
Inference speed for multi-class segmentation: YOLOv8 maintained strong performance in multi-class segmentation scenarios at 91.74 FPS. In contrast, the slower inference speed of Mask R-CNN, 64.10 FPS, suggests limitations in applications that require rapid responses.

6. Future work

Building on the current study, future research could examine the evolving capabilities of newer object detection models, such as YOLOv9, released in February 2024, and YOLOv10 [119], [120], released in May 2024, including their accuracy, efficiency, and adaptability for agricultural image processing. It is essential to test YOLOv9 and YOLOv10 across diverse agricultural datasets covering different crop growth stages, different levels of occlusion, and varying environmental conditions to assess their effectiveness in real agricultural environments. Such a study could specifically explore how YOLOv9 and YOLOv10 handle complex detection tasks, such as identifying subtle phenotypic changes in crops under challenging lighting or at different times of day, which are typical situations in outdoor farming environments. Furthermore, integrating YOLOv9 and YOLOv10 with Internet of Things (IoT) technologies could be explored to develop advanced monitoring and decision-support systems for agriculture.

Acknowledgments

This research was funded by the National Science Foundation and the United States Department of Agriculture, National Institute of Food and Agriculture, through the “AI Institute for Agriculture” program (award number AWD003473). The authors thank Dave Allan (Allan Bros., Inc.) for providing access to the orchards during data collection and field evaluation. The authors also thank Kristin Kromar, Bonnie Copeland, and Patrick Scharf for their essential support with the logistics of this project.

Author contributions

Ranjan Sapkota: Conceptualization, Investigation, Visualization, Methodology, Writing (original draft, review and editing). Dawood Ahmed: Investigation, Visualization, and Methodology. Manoj Karkee: Supervision, Funding acquisition, Writing (review and editing).

Declaration of generative AI and AI-assisted technologies in the writing process: During the preparation of this work, the authors used ChatGPT to correct grammar and language. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

REFERENCES

[1] A. M. Hafiz and G. M. Bhat, ‘A survey on instance segmentation: state of the art’, Int J Multimed Inf Retr, vol. 9, no. 3, pp. 171-189, 2020.
[2] Q. Zhang, Y. Liu, C. Gong, Y. Chen, and H. Yu, ‘Applications of deep learning for dense scenes analysis in agriculture: A review’, Sensors, vol. 20, no. 5, p. 1520, 2020.
[3] J. Champ, A. Mora-Fallas, H. Goëau, E. Mata-Montero, P. Bonnet, and A. Joly, ‘Instance segmentation for the fine detection of crop and weed plants by precision agricultural robots’, Appl Plant Sci, vol. 8, no. 7, p. e11373, 2020.
[4] Y. Chen, S. Baireddy, E. Cai, C. Yang, and E. J. Delp, ‘Leaf segmentation by functional modeling’, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, p. 0.
[5] N. Lüling, D. Reiser, and H. W. Griepentrog, ‘Volume and leaf area calculation of cabbage with a neural network-based instance segmentation’, in Precision agriculture’21, Wageningen Academic Publishers, 2021, pp. 2719-2745.
[6] C. Niu, H. Li, Y. Niu, Z. Zhou, Y. Bu, and W. Zheng, ‘Segmentation of cotton leaves based on improved watershed algorithm’, in Computer and Computing Technologies in Agriculture IX: 9th IFIP WG 5.14 International Conference, CCTA 2015, Beijing, China, September 27-30, 2015, Revised Selected Papers, Part I 9, Springer, 2016, pp. 425-436.
[7] V. H. Pham and B. R. Lee, ‘An image segmentation approach for fruit defect detection using k-means clustering and graph-based algorithm’, Vietnam Journal of Computer Science, vol. 2, pp. 25-33, 2015.
[8] M. G. S. Jayanthi and D. R. Shashikumar, ‘Leaf disease segmentation from agricultural images via hybridization of active contour model and OFA’, Journal of Intelligent Systems, vol. 29, no. 1, pp. 35-52, 2017.
[9] J. Clement, N. Novas, J.-A. Gazquez, and F. Manzano-Agugliaro, ‘An active contour computer algorithm for the classification of cucumbers’, Comput Electron Agric, vol. 92, pp. 75-81, 2013.
[10] Y. A. N. Gao, J. F. Mas, N. Kerle, and J. A. Navarrete Pacheco, ‘Optimal region growing segmentation and its effect on classification accuracy’, Int J Remote Sens, vol. 32, no. 13, pp. 3747-3763, 2011.
[11] N. Jothiaruna, K. J. A. Sundar, and B. Karthikeyan, ‘A segmentation method for disease spot images incorporating chrominance in comprehensive color feature and region growing’, Comput Electron Agric, vol. 165, p. 104934, 2019.
[12] J. Ma, K. Du, L. Zhang, F. Zheng, J. Chu, and Z. Sun, ‘A segmentation method for greenhouse vegetable foliar disease spots images using color information and region growing’, Comput Electron Agric, vol. 142, pp. 110-117, 2017.
[13] V. Gupta, N. Sengar, M. K. Dutta, C. M. Travieso, and J. B. Alonso, ‘Automated segmentation of powdery mildew disease from cherry leaves using image processing’, in 2017 International Conference and Workshop on Bioinspired Intelligence (IWOBI), IEEE, 2017, pp. 1-4.
[14] S. D. Khirade and A. B. Patil, ‘Plant disease detection using image processing’, in 2015 International conference on computing communication control and automation, IEEE, 2015, pp. 768-771.
[15] K. Tian, J. Li, J. Zeng, A. Evans, and L. Zhang, ‘Segmentation of tomato leaf images based on adaptive clustering number of K-means algorithm’, Comput Electron Agric, vol. 165, p. 104962, 2019.
[16] T. Arsan and M. M. N. Hameez, ‘A clustering-based approach for improving the accuracy of UWB sensor-based indoor positioning system’, Mobile Information Systems, vol. 2019, pp. 1-13, 2019.
[17] L. C. Ngugi, M. Abelwahab, and M. Abo-Zahhad, ‘Recent advances in image processing techniques for automated leaf pest and disease recognition-A review’, Information processing in agriculture, vol. 8, no. 1, pp. 27-51, 2021.
[18] Q. Zeng, Y. Miao, C. Liu, and S. Wang, ‘Algorithm based on marker-controlled watershed transform for overlapping plant fruit segmentation’, Optical Engineering, vol. 48, no. 2, p. 27201, 2009.
[19] M. G. S. Jayanthi and D. R. Shashikumar, ‘Leaf disease segmentation from agricultural images via hybridization of active contour model and OFA’, Journal of Intelligent Systems, vol. 29, no. 1, pp. 35-52, 2019.
[20] S. Coulibaly, B. Kamsu-Foguem, D. Kamissoko, and D. Traore, ‘Deep learning for precision agriculture: A bibliometric analysis’, Intelligent Systems with Applications, vol. 16, p. 200102, 2022.
[21] N. Siddique, S. Paheding, C. P. Elkin, and V. Devabhaktuni, ‘U-net and its variants for medical image segmentation: A review of theory and applications’, IEEE Access, vol. 9, pp. 82031-82057, 2021.
[22] K. He, G. Gkioxari, P. Dollár, and R. Girshick, ‘Mask r-cnn’, in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961-2969.
[23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘You only look once: Unified, real-time object detection’, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779-788.
[24] J. Rashid, I. Khan, G. Ali, F. Alturise, and T. Alkhalifah, ‘Real-Time Multiple Guava Leaf Disease Detection from a Single Leaf Using Hybrid Deep Learning Technique.’, Computers, Materials & Continua, vol. 74, no. 1, 2023.
[25] Y. Tian, G. Yang, Z. Wang, E. Li, and Z. Liang, ‘Instance segmentation of apple flowers using the improved mask R-CNN model’, Biosyst Eng, vol. 193, pp. 264-278, 2020.
[26] A. K. Maji, S. Marwaha, S. Kumar, A. Arora, V. Chinnusamy, and S. Islam, ‘SlypNet: Spikelet-based yield prediction of wheat using advanced plant phenotyping and computer vision techniques’, Front Plant Sci, vol. 13, p. 889853, 2022.
[27] J. Liu and X. Wang, ‘Tomato diseases and pests detection based on improved Yolo V3 convolutional neural network’, Front Plant Sci, vol. 11, p. 898, 2020.
[28] M. Lippi, N. Bonucci, R. F. Carpio, M. Contarini, S. Speranza, and A. Gasparri, ‘A yolo-based pest detection system for precision agriculture’, in 2021 29th Mediterranean Conference on Control and Automation (MED), IEEE, 2021, pp. 342-347.
[29] X. Qu, J. Wang, X. Wang, Y. Hu, T. Zeng, and T. Tan, ‘Gravelly soil uniformity identification based on the optimized Mask R-CNN model’, Expert Syst Appl, vol. 212, p. 118837, 2023.
[30] L. Zu, Y. Zhao, J. Liu, F. Su, Y. Zhang, and P. Liu, ‘Detection and segmentation of mature green tomatoes based on mask R-CNN with automatic image acquisition approach’, Sensors, vol. 21, no. 23, p. 7842, 2021.
[31] Q. Wang, M. Cheng, S. Huang, Z. Cai, J. Zhang, and H. Yuan, ‘A deep learning approach incorporating YOLO v5 and attention mechanisms for field real-time detection of the invasive weed Solanum rostratum Dunal seedlings’, Comput Electron Agric, vol. 199, p. 107194, 2022.
[32] H. Li et al., ‘Design of field real-time target spraying system based on improved YOLOv5’, Front Plant Sci, vol. 13, p. 1072631, 2022.
[33] C. Hu, J. A. Thomasson, and M. V Bagavathiannan, ‘A powerful image synthesis and semi-supervised learning pipeline for site-specific weed detection’, Comput Electron Agric, vol. 190, p. 106423, 2021.
[34] S. Chen et al., ‘An approach for rice bacterial leaf streak disease segmentation and disease severity estimation’, Agriculture, vol. 11, no. 5, p. 420, 2021.
[35] Y. Tian, G. Yang, Z. Wang, E. Li, and Z. Liang, ‘Instance segmentation of apple flowers using the improved mask R-CNN model’, Biosyst Eng, vol. 193, pp. 264-278, 2020.
[36] G. Lin, Y. Tang, X. Zou, and C. Wang, ‘Three-dimensional reconstruction of guava fruits and branches using instance segmentation and geometry analysis’, Comput Electron Agric, vol. 184, p. 106107, 2021.
[37] K. Jha, A. Doshi, P. Patel, and M. Shah, ‘A comprehensive review on automation in agriculture using artificial intelligence’, Artificial Intelligence in Agriculture, vol. 2, pp. 1-12, 2019.
[38] A. You et al., ‘Semiautonomous Precision Pruning of Upright Fruiting Offshoot Orchard Systems: An Integrated Approach’, IEEE Robot Autom Mag, 2023.
[39] W. Jia, Y. Tian, R. Luo, Z. Zhang, J. Lian, and Y. Zheng, ‘Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot’, Comput Electron Agric, vol. 172, p. 105380, 2020.
[40] Y. Yu, K. Zhang, L. Yang, and D. Zhang, ‘Fruit detection for strawberry harvesting robot in non-structural environment based on Mask-RCNN’, Comput Electron Agric, vol. 163, p. 104846, 2019.
[41] L. Zu, Y. Zhao, J. Liu, F. Su, Y. Zhang, and P. Liu, ‘Detection and segmentation of mature green tomatoes based on mask R-CNN with automatic image acquisition approach’, Sensors, vol. 21, no. 23, p. 7842, 2021.
[42] S. Xie, C. Hu, M. Bagavathiannan, and D. Song, ‘Toward robotic weed control: detection of nutsedge weed in bermudagrass turf using inaccurate and insufficient training data’, IEEE Robot Autom Lett, vol. 6, no. 4, pp. 7365-7372, 2021.
[43] J. Champ, A. Mora-Fallas, H. Goëau, E. Mata-Montero, P. Bonnet, and A. Joly, ‘Instance segmentation for the fine detection of crop and weed plants by precision agricultural robots’, Appl Plant Sci, vol. 8, no. 7, p. e11373, 2020.
[44] K. He, G. Gkioxari, P. Dollár, and R. Girshick, ‘Mask r-cnn’, in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961-2969.
[45] S. Wang, G. Sun, B. Zheng, and Y. Du, ‘A crop image segmentation and extraction algorithm based on Mask RCNN’, Entropy, vol. 23, no. 9, p. 1160, 2021.
[46] P. Ganesh, K. Volle, T. F. Burks, and S. S. Mehta, ‘Deep orange: Mask R-CNN based orange detection and segmentation’, IFAC-PapersOnLine, vol. 52, no. 30, pp. 70-75, 2019.
[47] U. Afzaal, B. Bhattarai, Y. R. Pandeya, and J. Lee, ‘An instance segmentation model for strawberry diseases based on mask R-CNN’, Sensors, vol. 21, no. 19, p. 6565, 2021.
[48] T.-L. Lin, H.-Y. Chang, and K.-H. Chen, ‘The pest and disease identification in the growth of sweet peppers using faster R-CNN and mask R-CNN’, Journal of Internet Technology, vol. 21, no. 2, pp. 605-614, 2020.
[49] Z. U. Rehman et al., ‘Recognizing apple leaf diseases using a novel parallel real-time processing framework based on MASK RCNN and transfer learning: An application for smart agriculture’, IET Image Process, vol. 15, no. 10, pp. 2157-2168, 2021.
[50] G. H. Krishnan and T. Rajasenbagam, ‘A Comprehensive Survey for Weed Classification and Detection in Agriculture Lands’, Journal of Information Technology, vol. 3, no. 4, pp. 281-289, 2021.
[51] K. Osorio, A. Puerto, C. Pedraza, D. Jamaica, and L. Rodríguez, ‘A deep learning approach for weed detection in lettuce crops using multispectral images’, AgriEngineering, vol. 2, no. 3, pp. 471-488, 2020.
[52] T. Zhao, Y. Yang, H. Niu, D. Wang, and Y. Chen, ‘Comparing U-Net convolutional network with mask RCNN in the performances of pomegranate tree canopy segmentation’, in Multispectral, hyperspectral, and ultraspectral remote sensing technology, techniques and applications VII, SPIE, 2018, pp. 210-218.
[53] A. Safonova, E. Guirado, Y. Maglinets, D. Alcaraz-Segura, and S. Tabik, ‘Olive tree biovolume from UAV multi-resolution image segmentation with mask R-CNN’, Sensors, vol. 21, no. 5, p. 1617, 2021.
[54] P. Soviany and R. T. Ionescu, ‘Optimizing the trade-off between single-stage and two-stage deep object detectors using image difficulty prediction’, in 2018 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), IEEE, 2018, pp. 209-214.
[55] A. You et al., ‘Semiautonomous Precision Pruning of Upright Fruiting Offshoot Orchard Systems: An Integrated Approach’, IEEE Robot Autom Mag, 2023.
[56] M. Hussain, L. He, J. Schupp, D. Lyons, and P. Heinemann, ‘Green fruit segmentation and orientation estimation for robotic green fruit thinning of apples’, Comput Electron Agric, vol. 207, p. 107734, 2023.
[57] J. Seol, J. Kim, and H. Il Son, ‘Field evaluations of a deep learning-based intelligent spraying robot with flow control for pear orchards’, Precis Agric, vol. 23, no. 2, pp. 712-732, 2022.
[58] B. Ma, J. Du, L. Wang, H. Jiang, and M. Zhou, ‘Automatic branch detection of jujube trees based on 3D reconstruction for dormant pruning using the deep learning-based method’, Comput Electron Agric, vol. 190, p. 106484, 2021.
[59] Y. Fu et al., ‘Skeleton extraction and pruning point identification of jujube tree for dormant pruning using space colonization algorithm’, Front Plant Sci, vol. 13, p. 1103794, 2023.
[60] J. Zhang, L. He, M. Karkee, Q. Zhang, X. Zhang, and Z. Gao, ‘Branch detection for apple trees trained in fruiting wall architecture using depth features and Regions-Convolutional Neural Network (R-CNN)’, Comput Electron Agric, vol. 155, pp. 386-393, 2018.
[61] P. Guadagna et al., ‘Using deep learning for pruning region detection and plant organ segmentation in dormant spur-pruned grapevines’, Precis Agric, pp. 1-23, 2023.
[62] T. Gentilhomme, M. Villamizar, J. Corre, and J.-M. Odobez, ‘Towards smart pruning: ViNet, a deep-learning approach for grapevine structure estimation’, Comput Electron Agric, vol. 207, p. 107736, 2023.
[63] E. Kok, X. Wang, and C. Chen, ‘Obscured tree branches segmentation and 3D reconstruction using deep learning and geometrical constraints’, Comput Electron Agric, vol. 210, p. 107884, 2023.
[64] J. Zhang, L. He, M. Karkee, Q. Zhang, X. Zhang, and Z. Gao, ‘Branch detection for apple trees trained in fruiting wall architecture using depth features and Regions-Convolutional Neural Network (R-CNN)’, Comput Electron Agric, vol. 155, pp. 386-393, 2018.
[65] G. Lin, Y. Tang, X. Zou, and C. Wang, ‘Three-dimensional reconstruction of guava fruits and branches using instance segmentation and geometry analysis’, Comput Electron Agric, vol. 184, p. 106107, 2021.
[66] A. S. Aguiar et al., ‘Bringing semantics to the vineyard: An approach on deep learning-based vine trunk detection’, Agriculture, vol. 11, no. 2, p. 131, 2021.
[67] S. Tong, Y. Yue, W. Li, Y. Wang, F. Kang, and C. Feng, ‘Branch Identification and Junction Points Location for Apple Trees Based on Deep Learning’, Remote Sens (Basel), vol. 14, no. 18, p. 4495, 2022.
[68] R. Xiang, M. Zhang, and J. Zhang, ‘Recognition for stems of tomato plants at night based on a hybrid joint neural network’, Agriculture, vol. 12, no. 6, p. 743, 2022.
[69] L. Wu, J. Ma, Y. Zhao, and H. Liu, ‘Apple detection in complex scene using the improved YOLOv4 model’, Agronomy, vol. 11, no. 3, p. 476, 2021.
[70] W. Chen, J. Zhang, B. Guo, Q. Wei, and Z. Zhu, ‘An apple detection method based on des-YOLO v4 algorithm for harvesting robots in complex environment’, Math Probl Eng, vol. 2021, pp. 1-12, 2021.
[71] Z. Huang, P. Zhang, R. Liu, and D. Li, ‘Immature apple detection method based on improved Yolov3’, ASP Transactions on Internet of Things, vol. 1, no. 1, pp. 9-13, 2021.
[72] Y. Liu, G. Yang, Y. Huang, and Y. Yin, ‘SE-Mask R-CNN: An improved Mask R-CNN for apple detection and segmentation’, Journal of Intelligent & Fuzzy Systems, vol. 41, no. 6, pp. 6715-6725, 2021.
[73] A. Kuznetsova, T. Maleva, and V. Soloviev, ‘YOLOv5 versus YOLOv3 for apple detection’, in Cyber-Physical Systems: Modelling and Intelligent Control, Springer, 2021, pp. 349-358.
[74] D. Wang and D. He, ‘Channel pruned YOLO V5s-based deep learning approach for rapid and accurate apple fruitlet detection before fruit thinning’, Biosyst Eng, vol. 210, pp. 271-281, 2021.
[75] S. Tong, Y. Yue, W. Li, Y. Wang, F. Kang, and C. Feng, ‘Branch Identification and Junction Points Location for Apple Trees Based on Deep Learning’, Remote Sens (Basel), vol. 14, no. 18, p. 4495, 2022.
[76] F. Gao et al., ‘A novel apple fruit detection and counting methodology based on deep learning and trunk tracking in modern orchard’, Comput Electron Agric, vol. 197, p. 107000, 2022.
[77] C. Zhang, F. Kang, and Y. Wang, ‘An improved apple object detection method based on lightweight YOLOv4 in complex backgrounds’, Remote Sens (Basel), vol. 14, no. 17, p. 4150, 2022.
[78] S. Lu, W. Chen, X. Zhang, and M. Karkee, ‘Canopy-attention-YOLOv4-based immature/mature apple fruit detection on dense-foliage tree architectures for early crop load estimation’, Comput Electron Agric, vol. 193, p. 106696, 2022.
[79] J. Lv et al., ‘A visual identification method for the apple growth forms in the orchard’, Comput Electron Agric, vol. 197, p. 106954, 2022.
[80] F. Su et al., ‘Tree Trunk and Obstacle Detection in Apple Orchard Based on Improved YOLOv5s Model’, Agronomy, vol. 12, no. 10, p. 2427, 2022.
[81] D. Wang and D. He, ‘Fusion of Mask RCNN and attention mechanism for instance segmentation of apples under complex background’, Comput Electron Agric, vol. 196, p. 106864, 2022.
[82] W. Jia et al., ‘Accurate segmentation of green fruit based on optimized mask RCNN application in complex orchard’, Front Plant Sci, vol. 13, p. 955256, 2022.
[83] P. Cong, J. Zhou, S. Li, K. Lv, and H. Feng, ‘Citrus Tree Crown Segmentation of Orchard Spraying Robot Based on RGB-D Image and Improved Mask R-CNN’, Applied Sciences, vol. 13, no. 1, p. 164, 2022.
[84] M. Karthikeyan, T. S. Subashini, R. Srinivasan, C. Santhanakrishnan, and A. Ahilan, ‘YOLOAPPLE: Augment Yolov3 deep learning algorithm for apple fruit quality detection’, Signal Image Video Process, pp. 1-10, 2023.
[85] L. Ma, L. Zhao, Z. Wang, J. Zhang, and G. Chen, ‘Detection and Counting of Small Target Apples under Complicated Environments by Using Improved YOLOv7-tiny’, Agronomy, vol. 13, no. 5, p. 1419, 2023.
[86] M. Carranza-García, J. Torres-Mateo, P. Lara-Benítez, and J. García-Gutiérrez, ‘On the performance of one-stage and two-stage object detectors in autonomous vehicles using camera data’, Remote Sens (Basel), vol. 13, no. 1, p. 89, 2020.
[87] G. Yang, J. Wang, Z. Nie, H. Yang, and S. Yu, ‘A lightweight YOLOv8 tomato detection algorithm combining feature enhancement and attention’, Agronomy, vol. 13, no. 7, p. 1824, 2023.
[88] X. Yue, K. Qi, X. Na, Y. Zhang, Y. Liu, and C. Liu, ‘Improved YOLOv8-Seg Network for Instance Segmentation of Healthy and Diseased Tomato Plants in the Growth Stage’, Agriculture, vol. 13, no. 8, p. 1643, 2023.
[89] B. Jabir, K. El Moutaouakil, and N. Falih, ‘Developing an Efficient System with Mask R-CNN for Agricultural Applications’, Agris on-line Papers in Economics and Informatics, vol. 15, no. 1, pp. 61-72, 2023.
[90] H. Duong-Trung and N. Duong-Trung, ‘Integrating YOLOv8-agri and DeepSORT for Advanced Motion Detection in Agriculture and Fisheries’, EAI Endorsed Transactions on Industrial Networks and Intelligent Systems, vol. 11, no. 1, pp. e4-e4, 2024.
[91] B. Xu et al., ‘Livestock classification and counting in quadcopter aerial images using Mask R-CNN’, Int J Remote Sens, vol. 41, no. 21, pp. 8121-8142, 2020.
[92] X. Mu, L. He, P. Heinemann, J. Schupp, and M. Karkee, ‘Mask R-CNN based apple flower detection and king flower identification for precision pollination’, Smart Agricultural Technology, vol. 4, p. 100151, 2023.
[93] C. Yu et al., ‘Segmentation and density statistics of mariculture cages from remote sensing images using mask R-CNN’, Information Processing in Agriculture, vol. 9, no. 3, pp. 417-430, 2022.
[94] P. Bharati and A. Pramanik, ‘Deep learning techniques-R-CNN to mask R-CNN: a survey’, Computational Intelligence in Pattern Recognition: Proceedings of CIPR 2019, pp. 657-668, 2020.
[95] G. Hoogenboom, ‘Contribution of agrometeorology to the simulation of crop production and its applications’, Agric For Meteorol, vol. 103, no. 1-2, pp. 137-157, 2000.
[96] L. Zhang, J. Wu, Y. Fan, H. Gao, and Y. Shao, ‘An efficient building extraction method from high spatial resolution remote sensing images based on improved mask R-CNN’, Sensors, vol. 20, no. 5, p. 1465, 2020.
[97] T. Wang et al., ‘Tea picking point detection and location based on Mask-RCNN’, Information Processing in Agriculture, vol. 10, no. 2, pp. 267-275, 2023.
[98] C. Tang et al., ‘A fine recognition method of strawberry ripeness combining Mask R-CNN and region segmentation’, Front Plant Sci, vol. 14, p. 1211830, 2023.
[99] B. R. Amogi, R. Ranjan, and L. R. Khot, ‘Mask R-CNN aided fruit surface temperature monitoring algorithm with edge compute enabled internet of things system for automated apple heat stress management’, Information Processing in Agriculture, 2023.
[100] Y. Li, Q. Fan, H. Huang, Z. Han, and Q. Gu, ‘A modified YOLOv8 detection network for UAV aerial image recognition’, Drones, vol. 7, no. 5, p. 304, 2023.
[101] G. Wang, Y. Chen, P. An, H. Hong, J. Hu, and T. Huang, ‘UAV-YOLOv8: a small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios’, Sensors, vol. 23, no. 16, p. 7190, 2023.
[102] L. Zhang, G. Ding, C. Li, and D. Li, ‘DCF-Yolov8: An Improved Algorithm for Aggregating Low-Level Features to Detect Agricultural Pests and Diseases’, Agronomy, vol. 13, no. 8, p. 2012, 2023.
[103] X. Wang and J. Liu, ‘Vegetable disease detection using an improved YOLOv8 algorithm in the greenhouse plant environment’, Sci Rep, vol. 14, no. 1, p. 4261, 2024.
[104] G. Yang, J. Wang, Z. Nie, H. Yang, and S. Yu, ‘A lightweight YOLOv8 tomato detection algorithm combining feature enhancement and attention’, Agronomy, vol. 13, no. 7, p. 1824, 2023.
[105] S. Yang, W. Wang, S. Gao, and Z. Deng, ‘Strawberry ripeness detection based on YOLOv8 algorithm fused with LW-Swin Transformer’, Comput Electron Agric, vol. 215, p. 108360, 2023.
[106] R. Sapkota, D. Ahmed, M. Churuvija, and M. Karkee, ‘Immature Green Apple Detection and Sizing in Commercial Orchards using YOLOv8 and Shape Fitting Techniques’, IEEE Access, vol. 12, pp. 43436-43452, 2024.
[107] G. Chen, Y. Hou, T. Cui, H. Li, F. Shangguan, and L. Cao, ‘YOLOv8-CML: A lightweight target detection method for Color-changing melon ripening in intelligent agriculture’, 2023.
[108] J. Wei, Y. Ding, J. Liu, M. Z. Ullah, X. Yin, and W. Jia, ‘Novel green-fruit detection algorithm based on D2D framework’, International Journal of Agricultural and Biological Engineering, vol. 15, no. 1, pp. 251-259, 2022.
[109] M. Sun, L. Xu, R. Luo, Y. Lu, and W. Jia, ‘GHFormer-Net: Towards more accurate small green apple/begonia fruit detection in the nighttime’, Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 7, pp. 4421-4432, 2022.
[110] M. Liu, W. Jia, Z. Wang, Y. Niu, X. Yang, and C. Ruan, ‘An accurate detection and segmentation model of obscured green fruits’, Comput Electron Agric, vol. 197, p. 106984, 2022.
[111] W. Jia et al., ‘FoveaMask: A fast and accurate deep learning model for green fruit instance segmentation’, Comput Electron Agric, vol. 191, p. 106488, 2021.
[112] S. Sun, M. Jiang, D. He, Y. Long, and H. Song, ‘Recognition of green apples in an orchard environment by combining the GrabCut model and Ncut algorithm’, Biosyst Eng, vol. 187, pp. 201-213, 2019.
[113] A. Prabhu and N. S. Rani, ‘Semiautomated Segmentation Model to Extract Fruit Images from Trees’, in 2021 International Conference on Intelligent Technologies (CONIT), IEEE, 2021, pp. 1-13.
[114] Y. Tian, G. Yang, Z. Wang, H. Wang, E. Li, and Z. Liang, ‘Apple detection during different growth stages in orchards using the improved YOLO-V3 model’, Comput Electron Agric, vol. 157, pp. 417-426, 2019.
[115] S. Wan and S. Goudos, ‘Faster R-CNN for multi-class fruit detection using a robotic vision system’, Computer Networks, vol. 168, p. 107036, 2020.
[116] M. Liu, W. Jia, Z. Wang, Y. Niu, X. Yang, and C. Ruan, ‘An accurate detection and segmentation model of obscured green fruits’, Comput Electron Agric, vol. 197, p. 106984, 2022.
[117] E. Kok, X. Wang, and C. Chen, ‘Obscured tree branches segmentation and 3D reconstruction using deep learning and geometrical constraints’, Comput Electron Agric, vol. 210, p. 107884, 2023.
[118] D.-H. Kim, C.-U. Ko, D.-G. Kim, J.-T. Kang, J.-M. Park, and H.-J. Cho, ‘Automated Segmentation of Individual Tree Structures Using Deep Learning over LiDAR Point Cloud Data’, Forests, vol. 14, no. 6, p. 1159, 2023.
[119] R. Sapkota et al., ‘YOLOv10 to Its Genesis: A Decadal and Comprehensive Review of The You Only Look Once Series’, arXiv preprint arXiv:2406.19407, 2024.
[120] A. Wang et al., ‘Yolov10: Real-time end-to-end object detection’, arXiv preprint arXiv:2405.14458, 2024.

Journal: Artificial Intelligence in Agriculture, Volume: 13
DOI: https://doi.org/10.1016/j.aiia.2024.07.001
Publication Date: 2024-07-17

Comparing YOLOv8 and Mask R-CNN for instance segmentation in complex orchard environments

Ranjan Sapkota*, Dawood Ahmed and Manoj Karkee*
Center for Precision & Automated Agricultural Systems, Washington State University, 24106 N Bunn Rd, Prosser, 99350, Washington, USA

Abstract

Instance segmentation, an important image processing operation for automation in agriculture, is used to precisely delineate individual objects of interest within images, which provides foundational information for various automated or robotic tasks such as selective harvesting and precision pruning. This study compares the one-stage YOLOv8 and the two-stage Mask R-CNN machine learning models for instance segmentation under varying orchard conditions across two datasets. Dataset 1, collected in the dormant season, includes images of dormant apple trees, which were used to train multi-object segmentation models delineating tree branches and trunks. Dataset 2, collected in the early growing season, includes images of apple tree canopies with green foliage and immature (green) apples (also called fruitlets), which were used to train single-object segmentation models delineating only immature green apples. The results showed that YOLOv8 performed better than Mask R-CNN, achieving good precision and near-perfect recall across both datasets at a confidence threshold of 0.5. Specifically, for Dataset 1, YOLOv8 achieved a precision of 0.90 and a recall of 0.95 for all classes. In comparison, Mask R-CNN demonstrated a precision of 0.81 and a recall of 0.81 for the same dataset. With Dataset 2, YOLOv8 achieved a precision of 0.93 and a recall of 0.97. Mask R-CNN, in this single-class scenario, achieved a precision of 0.85 and a recall of 0.88. Additionally, the inference times for YOLOv8 were 10.9 ms for multi-class segmentation (Dataset 1) and 7.8 ms for single-class segmentation (Dataset 2), compared to 15.6 ms and 12.8 ms, respectively, for Mask R-CNN. These findings show YOLOv8’s superior accuracy and efficiency in machine learning applications compared to two-stage models, specifically Mask R-CNN, which suggests its suitability for developing smart and automated orchard operations, particularly when real-time operation is necessary, as in robotic harvesting and robotic thinning of immature green fruit.

Keywords: Machine Learning, Deep learning, YOLOv8, Mask R-CNN, Automation, Robotics, Artificial intelligence, Machine Vision

1. Introduction

Instance segmentation is a powerful computer vision technique that combines the benefits of both object detection and semantic segmentation [1]. One of the key benefits of instance segmentation in agricultural applications is its ability to accurately quantify plant and crop structures [2], which can provide valuable information about plant growth, disease identification, and yield estimation, and can provide a foundation for various key areas of research and development such as robotic green (immature) fruit thinning [3]. Instance segmentation can provide precise measurements of plant features, such as leaf area, stem length, and plant height, with a high level of accuracy and efficiency [4], [5]. The traditional methods of instance segmentation in agricultural images were mostly based on hand-crafted features and classical image processing techniques such as Watershed Transform [6], Graph-based Segmentation [7], Active Contours (or Snakes) [8], [9], level set [10], [11], [12], Region Growing [10], [11], [12], Morphological Operation [13], [14] and Clustering-based methods [15], [16]. However, these methods require a lot of manual setups and refinements, making them time-consuming and less reliable [17]. Additionally, these methods could not easily learn from new data, making them less flexible and difficult to adapt to different scenarios. Moreover, these methods involved multiple disjointed image processing stages such as noise removal, contrast adjustment, image enhancement, refinement and manually defining and/or extracting specific features such as edge, texture, or colors.

Transitioning from traditional image processing methods to deep learning-based techniques in instance segmentation represents a significant evolution in the analysis of agricultural imagery. Traditional methods such as Watershed Transform [18], Graph-based Segmentation, and Active Contours rely heavily on predefined algorithms that segment images based on intensity gradients, color, texture or connectivity, which often necessitates extensive manual tuning to extract the specific characteristics of different crops or conditions [18], [19]. These techniques, while foundational in the early days of computer vision, require iterative adjustments and are limited by their inability to dynamically learn from new data or adapt to varied environments. In contrast, deep learning models, especially convolutional neural networks (CNNs), introduce layers of learning filters that automatically extract and learn the most informative features from vast amounts of data. Unlike traditional methods that depend on manual feature specification and are thus prone to human bias and error, deep learning systems learn to identify underlying patterns and irregularities in data/images, making them more robust and accurate. This capability is particularly beneficial in agricultural applications (e.g., canopy instance segmentation) where the variability in plant appearance and environmental conditions can be high [20]. CNN-based models offer end-to-end learning techniques, which not only reduce the processing time but also enhance the adaptability of the models to new, unseen scenarios, an essential feature for generalizable and scalable agricultural applications. This dynamic learning ability is a significant improvement over traditional methods that are static and constrained by their algorithmic rigidity.

More specifically, DL network architectures, including U-Net [21], Mask R-CNN [22], and YOLO [23], are increasingly utilized for a range of applications in agriculture. A key advantage of these DL techniques is their end-to-end learning approach, which enables direct mapping of raw images to segmentation results, thus enhancing consistency and reliability. Furthermore, transfer learning techniques allow for the adaptation of models pre-trained on extensive datasets to specific agricultural tasks, reducing both training times and data requirements. Utilizing these features of DL models, various agricultural applications have been investigated including plant disease identification [24], yield prediction [25], [26], pest detection [27], [28], soil health assessment [29], crop maturity analysis [30], and site-specific weed control application [31], [32], [33] showcasing their versatility and efficiency in modern agricultural practices.

As mentioned before, instance segmentation techniques have been widely applied to crop disease management [34]. Early detection of plant diseases is crucial for maintaining crop yield and quality. Utilizing instance segmentation, researchers can quantify symptoms such as leaf spots and discoloration and monitor the progression of these diseases over time [35]. This capability is instrumental in developing effective disease management strategies, including targeted treatments and breeding for disease-resistant cultivars. Instance segmentation has also proven pivotal for precise crop yield estimation. Accurate yield estimation is essential for growers and breeders to make informed decisions about crop management and to select traits for breeding new cultivars. Instance segmentation techniques can be used to accurately count and size individual fruits or other canopy objects from images. Such information facilitates precise yield estimation and provides key insights into cultivar characteristics [34]. Past studies have demonstrated the effectiveness of these techniques in various relevant applications such as the segmentation of apple flowers [35], segmentation and localization of strawberry fruits for harvesting, segmentation and counting of cranberries, and the segmentation of guava fruits and branches [36]. The data derived from these studies assist in optimizing crop management strategies, including optimal application of water and fertilizers, and in identifying high-yielding cultivars [35].

In addition, instance segmentation has been applied extensively to develop machine vision systems for agricultural robots because it allows robots to detect, delineate, and track individual objects of interest in agricultural fields using images or videos, such as fruits, branches, flowers, vegetables, and livestock [37]. Detecting and tracking plant parts such as leaves, stems, trunks, branches, flowers, and fruits is necessary for a robot to automatically perform various tasks such as harvesting and canopy and crop-load management operations. In the last few years, several studies have used deep learning-based instance segmentation techniques to develop robotic solutions for various agricultural applications such as tree pruning in the dormant season [38], picking fruits and vegetables [39], [40], [41], thinning flowers and fruitlets, and identifying and killing weeds [42], [43], among others.

Among the broad applications of deep learning techniques in agriculture, there has been a focus on the use of two specific architectures: YOLO (You Only Look Once) and Mask Region-Based Convolutional Neural Network (Mask R-CNN). These models, known for their effectiveness in instance segmentation, have been pivotal in advancing tasks such as crop detection, pest and disease management, weed identification, tree canopy segmentation, and canopy object (e.g., branch and fruit) detection. These tasks, critical in precision and automated agriculture, benefit immensely from the capabilities of these two deep learning models. As mentioned before, many recent studies conducted in agricultural applications used Mask-R-CNN-based [44] instance segmentation for tasks such as crop detection [45], [46], pest and disease detection [47], [48], [49], weed detection [50], [51], tree canopy segmentation [52], [53], and tree branch detection [52], [53]. Concurrently, the YOLO family of models has been used widely in object detection because of its ability to handle tasks like object detection, image classification, and instance segmentation simultaneously with one-stage networks. Unlike Mask R-CNN, a two-stage model suitable for segmentation tasks [54], YOLO optimizes the overall processing ensuring speed and efficiency crucial for real-time applications in agriculture such as robotic pruning [55], thinning [56], and pesticide application [57].

A number of recent studies focused on the segmentation of tree trunks and branches, employing various deep learning approaches. For example, [58], [59] used deep learning for automatic branch detection in jujube trees and [60] used Region-based Convolutional Neural Network (R-CNN) models alongside depth features for branch detection in fruiting wall apple trees. Segmenting plant canopy parts in dormant grapevines has also been studied widely using different deep learning techniques (e.g., [61], [100]). Other models like ViNet [62] have emerged, providing deep learning solutions for estimating grapevine structures. Further advancements include the application of deep learning and geometric constraints for obscured branch segmentation and three-dimensional reconstruction [63], as well as the use of space colonization algorithms for dormant pruning in jujube plants [59]. A deep learning-based sensing system (called SPGnet) for jujube plants by Baojian et al. [58], branch detection in apple trees using R-CNN by Zhang et al., 2014 [64], and Lin et al.’s tiny Mask R-CNN for guava branch reconstruction [65] are other recent studies in this field. Additionally, Aguiar et al. [66] explored trunk segmentation using a semantic segmentation-based deep learning approach with a Single Shot Multibox Detector (SSD). In comparison with the performance measures reported by these recent methodologies available in the literature, the YOLOv8 model presented in this study performed better in segmenting tree trunks in terms of precision (0.95), recall (0.97) and [...]. Furthermore, while the Mask R-CNN model achieved lower performance than YOLOv8, its performance was comparable to or better than that of many recent studies on trunk and branch detection, including [60], [61], [62], [63], [67], [68].

Both models are extensively studied, as discussed above and as shown in Table 1, highlighting 23 publications in the last 3 years focusing specifically on analyzing images of modern apple tree canopies.

Table 1: Studies conducted in the last three years applying YOLO and Mask R-CNN models in different apple orchard environments.

References	Year	DL model	Objectives
[69], [70]	2021	YOLO-V4	Apple detection in a complex scene
[71], [72]	2021	Mask R-CNN	Deep learning-based apple detection
[71], [73]	2021	YOLO-V3	Green fruit detection (apples, mangoes)
[73], [74]	2021	YOLO-V5	Apple fruitlet detection for fruitlet thinning
[75]	2022	Mask R-CNN	Branch identification and junction points localization in apple trees; Trunk identification and segmentation
[76], [77]	2022	YOLO-V4	Apple detection, counting, and tree trunk tracking in modern orchards
[78]	2022	YOLO-V4	Immature/mature apple detection on dense-foliage tree architectures for early crop-load estimation
[79]	2022	YOLO-V5	Identification method for the apple growth pattern in the orchard
[80]	2022	YOLO-V5	Tree trunk and obstacle detection in apple orchards
[81], [82]	2022	Mask R-CNN	Ripe and green apple segmentation in orchards
[83]	2022	Mask R-CNN	Tree and tree crown segmentation in orchards
[84]	2023	YOLO-V3	Apple fruit quality detection
[85]	2023	YOLO-V7	Detection and counting of small target apples
[56], [82]	2023	Mask R-CNN	Green apple segmentation

Building upon this background of widespread application of YOLOv8 and Mask R-CNN models, the primary goal of this study is to systematically compare and evaluate the performance of these two models for instance segmentation tasks in modern, commercial apple orchards. Through this comprehensive comparison, this research aims to provide insights into the suitability, efficiency, and potential challenges associated with implementing each model in agricultural automation applications. To achieve this goal, the following specific objectives will be pursued in this study:

To compare the performances of YOLOv8 and Mask R-CNN models in segmenting single-class objects, specifically green apples (fruitlets), in images collected from variable orchard environments in the early growing season; and
To evaluate the capabilities of these two models in segmenting multi-class objects, specifically primary branches and tree trunks of apple trees in images collected from a model apple orchard during the dormant season.

The comparison between YOLOv8 and Mask R-CNN in this study is founded on the significant advancements in the YOLO architecture that extend its capabilities beyond mere bounding box detection. Traditionally, while YOLO models were primarily known for their speed and efficiency in object detection, the latest iterations, particularly YOLOv8, have incorporated features supporting instance segmentation. This adaptation allows YOLOv8 to not only predict bounding boxes but also to generate precise object masks, aligning its functionalities more closely with those of Mask R-CNN, which has been a standard in instance segmentation. Therefore, comparing these two models is pertinent as both now offer robust solutions for instance segmentation, making the evaluation of their performance in agricultural applications, where both detection speed and segmentation accuracy are critical, highly relevant and scientifically justified.

The remainder of this paper is organized to provide a comprehensive comparison of YOLOv8 and Mask R-CNN models, specifically focusing on their application in commercial apple orchards. First, a “Background” section is presented to outline the theoretical frameworks of the one-stage (YOLOv8) and two-stage (Mask R-CNN) detectors, setting the stage for a deeper understanding of these complex models. Then a “Materials and Methods” section is provided to describe the experimental design, data acquisition, and analytical methodologies employed. Following the methodology, a “Results and Discussion” section is presented to report the results and critically discuss the models’ performance in various segmentation tasks, including their efficiency and efficacy. A “Conclusion” section then summarizes the research methods and the findings of the study. The paper ends with a “Future Work” section, which outlines potential further comparisons with other state-of-the-art models.

2. Deep Learning Models

Deep learning models used for object detection are generally categorized into two distinct approaches: one-stage and two-stage detectors [86]. Two-stage detectors, such as Mask R-CNN, first generate regions of interest (ROIs) in an initial stage, using a Region Proposal Network (RPN) [44]. These regions are then classified and refined in the second stage to provide precise object localization and classification. This approach is known for its high accuracy due to the focused refinement of detected objects. On the other hand, one-stage detectors like YOLO streamline this process by directly predicting object classes and bounding boxes in a single pass through the network, sacrificing some accuracy for significant gains in speed. One-stage models do not separate the detection into distinct region proposals and refinement stages, allowing them to operate faster, making them well-suited for applications requiring real-time processing [86]. Both methodologies have continued to evolve in recent years to address the trade-offs between speed and accuracy.
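To make the speed side of this trade-off concrete, the per-image inference times reported in the abstract of this paper can be converted into throughput and relative speed-up. The sketch below is only illustrative arithmetic: the helper names `frames_per_second` and `speedup` are hypothetical, and the figures ignore image loading, pre-processing and post-processing, so real end-to-end rates on a robot would likely be lower.

```python
# Per-image inference times in milliseconds, as reported in the abstract.
INFERENCE_MS = {
    ("YOLOv8", "Dataset 1"): 10.9,      # multi-class: trunks and branches
    ("YOLOv8", "Dataset 2"): 7.8,       # single-class: immature green apples
    ("Mask R-CNN", "Dataset 1"): 15.6,
    ("Mask R-CNN", "Dataset 2"): 12.8,
}


def frames_per_second(latency_ms: float) -> float:
    """Convert a per-image latency in milliseconds to images per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def speedup(slow_ms: float, fast_ms: float) -> float:
    """Return how many times faster the `fast_ms` model is than the `slow_ms` model."""
    return slow_ms / fast_ms


if __name__ == "__main__":
    for (model, dataset), ms in INFERENCE_MS.items():
        print(f"{model:10s} {dataset}: {ms:5.1f} ms  (~{frames_per_second(ms):.1f} images/s)")
    for dataset in ("Dataset 1", "Dataset 2"):
        ratio = speedup(INFERENCE_MS[("Mask R-CNN", dataset)],
                        INFERENCE_MS[("YOLOv8", dataset)])
        print(f"{dataset}: YOLOv8 is about {ratio:.2f}x faster than Mask R-CNN")
```

Under these numbers, both models exceed 60 images per second on inference alone, with YOLOv8 roughly 1.4 to 1.6 times faster depending on the dataset.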

Numerous one-stage and two-stage detectors have been developed over the past decade, each tailored for specific performance criteria in terms of speed and accuracy. Some of the most widely used two-stage detectors include Fast R-CNN (which improves the efficiency of feature usage), Faster R-CNN (which integrates a region proposal network), and R-FCN and Cascade R-CNN (which enhance localization and classification accuracy through specialized networks). Similarly, the most widely used one-stage detectors include SSD (Single Shot MultiBox Detector), RetinaNet (which introduced focal loss to handle class imbalance), and the YOLO (You Only Look Once) family of models, including YOLOv5 and YOLOv6 (which emphasize detection speed). Despite the availability of numerous one-stage and two-stage models, YOLOv8 and Mask R-CNN have been the most widely used models in agricultural applications with highly impactful results [87], [88], [89], [90], [91], [92], [93]. Comparative studies have clearly demonstrated their superior performance in detecting and segmenting agricultural objects under varied conditions, as outlined in Table 1 of this manuscript. These models balance the trade-offs between speed, accuracy, and robustness, making them especially suitable for the dynamic environments encountered in agricultural settings. Based on these convincing past studies, Mask R-CNN (for its precise segmentation capability) and YOLOv8 (for its exceptional speed) have thus been selected for this study.

2.1 Mask R-CNN

Mask R-CNN is a deep learning model designed for object detection and instance segmentation, renowned for its accuracy and efficiency. Its strength lies in its ability to precisely identify and delineate each object in an image, making it highly effective for complex image analysis tasks. The model was developed by researchers at Facebook AI Research in 2017 and builds on top of the Faster R-CNN object detection model by adding a branch for predicting object masks in parallel with the existing branch for bounding box detection [44]. The architecture of Mask R-CNN consists of three main components: a backbone network, a region proposal network (RPN), and two parallel branches for bounding box detection and mask prediction as shown in Figure 1. The backbone network is typically a convolutional neural network (CNN) that extracts features from the input images and is shared by both branches. The RPN generates a set of region proposals that are likely to contain objects, based on the feature maps generated by the backbone network. The bounding box branch predicts the class label and bounding box coordinates for each region proposal, while the mask branch predicts a binary mask for each object instance within the bounding box.
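Because the mask branch works inside each predicted box, its output has to be placed back into image coordinates before it can be used as an instance mask. The sketch below illustrates that final step in plain Python, under simplifying assumptions: the box-local mask probabilities are assumed to have already been resized to the box size, the box uses integer pixel coordinates, and the function name `paste_mask` and the 0.5 threshold are illustrative choices rather than details taken from this study.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # rows of 0/1 values, indexed as mask[y][x]


def paste_mask(
    local_probs: Sequence[Sequence[float]],
    box: Tuple[int, int, int, int],
    image_size: Tuple[int, int],
    threshold: float = 0.5,
) -> Mask:
    """Place a box-sized mask probability map into a full-image binary mask.

    box is (x0, y0, x1, y1) in pixels with x1 and y1 exclusive;
    image_size is (width, height). local_probs must have (y1 - y0) rows
    and (x1 - x0) columns, i.e. it is assumed to be already resized to the box.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    if len(local_probs) != y1 - y0 or any(len(r) != x1 - x0 for r in local_probs):
        raise ValueError("local_probs must match the box height and width")

    full = [[0] * width for _ in range(height)]
    for dy, row in enumerate(local_probs):
        y = y0 + dy
        if y < 0 or y >= height:
            continue  # skip rows of the box that fall outside the image
        for dx, prob in enumerate(row):
            x = x0 + dx
            if 0 <= x < width and prob >= threshold:
                full[y][x] = 1  # pixel belongs to this object instance
    return full


if __name__ == "__main__":
    probs = [[0.9, 0.2],
             [0.6, 0.7]]
    for row in paste_mask(probs, box=(1, 1, 3, 3), image_size=(4, 4)):
        print(row)
```

Each detected instance would receive its own full-image mask in this way, which is what distinguishes instance segmentation from a single semantic label map.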

However, the application of Mask R-CNN and other deep learning models in agriculture comes with several challenges. First, the performance of the model heavily relies on the quality and diversity of the training dataset [94]. Agricultural environments are highly variable, with changes in lighting, weather conditions, and plant growth stages, all of which can affect the model’s accuracy [95]. Moreover, Mask R-CNN requires substantial computational resources for training and inference [96], which can be a limitation in real-time applications on the farm where such resources are limited [14], [16], [17].

Ongoing studies are focusing on addressing these challenges and optimizing deep learning models such as Mask R-CNN for improved efficiency and robustness. These efforts include integrating more adaptive and scalable neural network architectures, improving data augmentation techniques to make the model more resilient to environmental variabilities, and developing lightweight versions of the models that maintain high accuracy while being more resource efficient. For instance, researchers have adapted Mask R-CNN to accurately identify and segment pineapples from complex backgrounds [84], which is crucial for automated harvesting systems aiming to reduce labor costs and increase yield precision. Similarly, Mask R-CNN has been deployed to identify specific picking points on tea plants [97] to aid robotic harvesting of quality tea-leaves while minimizing damage to plants. Furthermore, this model has shown promising results in various horticultural applications such as assessing the ripeness of strawberries [98]. By combining Mask R-CNN with region segmentation techniques, the system effectively distinguished between different ripeness stages, enabling growers to optimize the timing of strawberry picking for better market prices and reduced waste. In apple orchards, Mask R-CNN has been utilized for flower detection and identification of the king flower, which is critical for targeted pollination strategies [99]. This application is expected to help improve pollination efficiency leading to higher fruit yield and quality. Mask R-CNN has also been applied to monitor crop stress such as estimating fruit surface temperature using IoT sensors. This technology allows for real-time monitoring and management of fruit crops, thus helping orchard managers to mitigate the effects of heat stress and maintain fruit quality and yield.

Figure 1: Mask R-CNN architecture with (a) a structure diagram highlighting the backbone network, RPN, bounding box, and mask prediction branches; and (b) a detailed view of the Region Proposal Network (RPN).

2.2 YOLOv8

The YOLO (You Only Look Once) family of object detection and instance segmentation models has evolved rapidly over the last several years, with each new iteration introducing improvements in accuracy and/or speed. YOLOv8 (Figure 2), the latest one-stage model, was built on the foundations provided by previous YOLO models, such as YOLOv3 and YOLOv5. Compared to two-stage models, YOLOv8 directly predicts bounding boxes and class probabilities without the need for a separate region proposal network, streamlining the object detection process. One key innovation in YOLOv8 is the adoption of an anchor-free, center-based approach for object detection, which offers several advantages over traditional anchor-based methods such as YOLOv5, YOLOv6, and YOLOv7. YOLOv8 implements Pseudo Ensemble or Pseudo Supervision (PS), a method that involves training multiple models with distinct configurations on the same dataset to generate a more diverse set of predictions, improving the accuracy and robustness of the final prediction. Additionally, YOLOv8 leverages the Darknet-53 architecture, a 53-layer deep convolutional neural network optimized for feature extraction and object detection. One significant architectural change in YOLOv8 is the replacement of the C3 module with the C2f module. The C3 module, also known as the convolutional module, processes input data through a series of convolutional operations. The C2f module, an improved version of the C3 module, improves accuracy and processing time compared to previous models. Furthermore, YOLOv8 substitutes the 6 × 6 Convolutional (Conv) layer with a 3 × 3 Conv layer in the model backbone, reducing the number of parameters and creating a more compact, computationally efficient network. YOLOv8 also employs a decoupled head, which separates the tasks of predicting object presence and classifying object types, thereby improving both accuracy and processing speed. This refinement positions YOLOv8 as an effective solution for both object detection and instance segmentation in computer vision.
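The anchor-free, center-based idea mentioned above can be illustrated with a small decoding example. In one common formulation, each cell of a feature map acts as a reference point at its center, and the network predicts four distances (left, top, right, bottom) from that point to the box edges instead of offsets to predefined anchor boxes. The sketch below is a simplified illustration of that formulation only: the function name `decode_anchor_free` and the sample values are hypothetical, and the exact decoding used inside any particular YOLOv8 implementation may differ.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in input-image pixels


def decode_anchor_free(
    col: int,
    row: int,
    stride: int,
    distances: Tuple[float, float, float, float],
) -> Box:
    """Decode one anchor-free prediction into a pixel-space box.

    col, row  : index of the feature-map cell making the prediction
    stride    : size of one feature-map cell in input-image pixels
    distances : (left, top, right, bottom) distances from the cell center
                to the box edges, expressed in feature-map cells
    """
    left, top, right, bottom = distances
    if min(distances) < 0:
        raise ValueError("distances must be non-negative")

    # Reference point: the center of the cell, in feature-map units.
    cx = col + 0.5
    cy = row + 0.5

    # Step outward from the center, then scale to image pixels.
    return (
        (cx - left) * stride,
        (cy - top) * stride,
        (cx + right) * stride,
        (cy + bottom) * stride,
    )


if __name__ == "__main__":
    # Cell (col=2, row=3) on a stride-8 feature map.
    print(decode_anchor_free(2, 3, 8, (1.5, 0.5, 2.5, 1.5)))
```

Because no anchor box shapes need to be chosen in advance, this kind of head has fewer dataset-specific hyperparameters, which is one reason anchor-free designs are often considered easier to adapt to objects with unusual aspect ratios such as thin branches.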

Figure 2: YOLOv8 architecture showcasing its innovative design for object detection and segmentation (https://deci.ai/blog/history-yolo-object-detection-models-from-yolov1-yolov8/).

YOLOv8 offers several configurations to cater to different needs for computational speed and accuracy [100], [101]: YOLOv8-Tiny for fast processing at the cost of some accuracy, ideal for real-time applications on limited-resource devices; YOLOv8-Small, which balances speed with more detailed detection capabilities; YOLOv8-Standard for robust performance in diverse settings; and YOLOv8-Large, which prioritizes high accuracy for critical applications where details and precision are paramount [90].

The recent advancements in YOLOv8 have facilitated its adoption in diverse agricultural applications, demonstrating effectiveness in addressing specific challenges inherent to various farming environments discussed earlier. By optimizing the processing of low-level features, YOLOv8 becomes a powerful tool for the early detection of subtle signs of agricultural pests and diseases, critical for maintaining crop health [102]. For example, enhanced versions of YOLOv8 have been applied to detect diseases in vegetables within greenhouse environments [103], ensuring early detection and management of plant diseases. Further innovations include the integration of attention mechanisms into YOLOv8 to enhance its object detection capabilities, which was tested by [104] to improve tomato detection accuracy in cluttered agricultural environments. Similarly, [93] focused on the fusion of YOLOv8 with lightweight transformer architectures to enhance the feature extraction process, which helped improve strawberry ripeness detection in field environments. In orchard environments, YOLOv8 coupled with shape fitting techniques has been employed for the accurate detection and sizing of immature green apples, vital for yield estimation and growth monitoring [106]. Similarly, a specialized YOLOv8 model has been developed for monitoring the ripening process of color-changing melons, representing a significant step towards tailored applications in agriculture [107].

3. Materials and Methods

This study consisted of four major steps, as outlined in Figure 3a, beginning with RGB image acquisition from commercial orchards in two distinct seasons (Figure 3b for the dormant season and Figure 3c for the early growing season). These images, captured under varying environmental conditions such as bright and cloudy days, were then manually annotated to create the training and testing datasets. The training dataset was subsequently used to train the two deep learning models mentioned previously, and their performance in instance segmentation was evaluated using the test dataset.

Figure 3: (a) Overall workflow diagram used in this research; (b) An example image of an apple orchard during the dormant season (November 22, 2022); (c) An example image of an apple orchard during the early fruit growing season (June 18, 2023); (d) Intel RealSense 435i camera used to acquire images to train and test the instance segmentation models; (e) Example trunks and branches used to annotate the dormant season images; and (f) Example immature green fruits (fruitlets) used to annotate early growing season images.

3.1 Study site and data acquisition

This study was conducted in a commercial apple orchard (Figures 3b and 3c) owned and operated by Allan Brothers Fruit Company, located at Prosser, Washington State, USA. The orchard was planted in 2009 with the Scilate apple cultivar at a row spacing of 9.0 ft and a plant spacing of 3.0 ft, and was trained to a V-trellis architecture. Two sets of RGB images were acquired using an Intel RealSense 435i camera (Intel Corporation, California, USA): one in November 2022, creating the dormant season dataset shown in Figures 3b and 3e, and the other in June 2023 (just before manual fruitlet thinning), providing the early growing season dataset illustrated in Figures 3c and 3f. The Intel RealSense camera was selected for capturing RGB images because its software development kit (Intel RealSense SDK) allows camera parameters to be adjusted to capture high-quality images.

3.2 Data preparation

Two datasets comprising a total of 1,553 RGB images, captured under a variety of orchard lighting conditions, were prepared for analysis with the deep learning models. Dataset 1 comprised 474 images from the dormant season, which were annotated manually to represent multi-class objects: the tree trunk and the primary branches growing out from the trunks (Figure 4). Altogether, 1,141 annotations for tree trunks and 2,369 annotations for tree branches were generated manually by drawing polygons over the desired objects in these images using the image labeling software Labelbox. Likewise, Dataset 2 comprised 1,079 images from the green fruit growing season, in which 5,921 annotations of immature green apples were generated. During the image preprocessing stage in Labelbox, all these annotations were formatted in accordance with the COCO dataset specification, which meets the requirements of both the YOLOv8 and Mask R-CNN models for image segmentation. Furthermore, to facilitate model training and validation, both datasets were resized to

pixels, and both datasets were divided randomly into training, validation, and test subsets, following an

distribution ratio for each object class.
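The random partition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ratios` default, the `seed`, and the file names are hypothetical placeholders (they are not values reported in this study), and the sketch splits whole images rather than balancing each object class.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly split items into train/validation/test lists.

    `ratios` is a hypothetical placeholder, not the ratio used in the study.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # The test subset takes whatever remains, so no image is dropped.
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 474 dormant-season images.
dataset1 = [f"dormant_{i:04d}.png" for i in range(474)]
train, val, test = split_dataset(dataset1)
print(len(train), len(val), len(test))
```

Assigning the remainder to the test subset guarantees that the three subsets together cover every image exactly once, whatever rounding the ratios produce.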

Figure 4: Workflow diagram showing the two types of datasets used in the study; Dataset 1 included the dormant season apple trees with multi-class objects (Trunk and branch) and Dataset 2 included growing season apple tree canopies with immature green fruits.

3.3 Deep Learning Model Implementation

Both the YOLOv8 and Mask R-CNN models were trained on a workstation with an Intel Xeon® W-2155 CPU @ 3.30 GHz x20 processor, NVIDIA TITAN Xp Collector’s Edition/PCIe/SSE2 graphics card, 31.1 GiB memory, and Ubuntu 16.04 LTS 64-bit operating system. The backend framework for the model implementation was PyTorch. To optimize performance, the learning rate used was 0.001, the batch size was 32, and the dropout rate was 0.5 to mitigate overfitting. Training was set to run for up to 1,000 epochs and was stopped early if model performance on the validation dataset did not improve for 20 consecutive epochs, which helped minimize overfitting to the training dataset and improve generalization. An initial learning rate of 0.01 was used in training both models, whereas the momentum and weight decay were 0.937 and 0.0005, respectively, for the two models. These parameter settings were chosen to optimize the speed of the training process while minimizing the chances of overfitting the model to the training dataset. During the initial three epochs, a warm-up phase was employed, using a momentum of 0.8 and a bias learning rate of 0.1, to stabilize the model’s optimization and mitigate the risk of being stuck at a poor local minimum.
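The stopping rule described above (end training once the validation score has failed to improve for 20 consecutive epochs) can be illustrated with a minimal, framework-independent sketch. The class and function names are hypothetical, and the loop consumes a precomputed list of validation scores instead of training a real model.

```python
class EarlyStopping:
    """Signal a stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("-inf")
        self.best_epoch = -1

    def step(self, epoch, val_score):
        """Record one epoch and return True if training should stop."""
        if val_score > self.best:
            self.best = val_score
            self.best_epoch = epoch
        return epoch - self.best_epoch >= self.patience


def run_training(val_scores, max_epochs=1000, patience=20):
    """Simulate a training loop over precomputed validation scores."""
    stopper = EarlyStopping(patience)
    last_epoch = -1
    for epoch, score in enumerate(val_scores[:max_epochs]):
        last_epoch = epoch
        if stopper.step(epoch, score):
            break
    return last_epoch, stopper.best_epoch


# Synthetic scores: steady improvement for 30 epochs, then a plateau.
scores = [i / 100 for i in range(30)]
scores += [scores[-1]] * 100
last_epoch, best_epoch = run_training(scores)
print(last_epoch, best_epoch)
```

With the synthetic scores above, the best value occurs at epoch 29 and the loop ends 20 epochs later, well before the 1,000-epoch ceiling.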

During the training process, various augmentation techniques were applied to enhance model robustness and generalization, such as hue augmentation (0.015), saturation augmentation (0.7), value augmentation (0.4), translation adjustments (0.1), scaling variations (0.5), and a 0.5 probability for left-right flips. Additionally, mosaic augmentation was applied with a probability of 1.0. After model training was completed, the model outputs were converted to TorchScript format to simplify further processing for evaluating the performance of both the YOLOv8 and Mask R-CNN models in terms of precision, recall, mean average precision (mAP), and area under curve (AUC), as discussed below. The specific data augmentation techniques and hyperparameter values utilized during training are presented in Table 2:

Table 2: Data augmentation and regularization parameters used in training models in this study

Methods Applied	Value
Hue augmentation (fraction)	0.015
Saturation augmentation (fraction)	0.7
Value augmentation (fraction)	0.4
Rotation	0.0
Translation	0.1
Scale	0.5
Flip left-right (probability)	0.5
Mosaic (probability)	1.0
Weight decay	0.0005
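For readers who wish to reproduce a comparable setup, the values in Table 2 and the optimizer settings reported in Section 3.3 can be collected into a single configuration dictionary. The sketch below assumes the argument names of the Ultralytics training interface (`hsv_h`, `fliplr`, `lr0`, and so on); that mapping, the helper `validate_overrides`, and the file names in the closing comment are assumptions for illustration rather than the authors' actual script.

```python
# Values reported in Section 3.3 and Table 2. The keys follow the argument
# names of the Ultralytics trainer, which is an assumed mapping.
TRAIN_OVERRIDES = {
    "epochs": 1000,          # upper bound on training epochs
    "patience": 20,          # early-stopping patience (epochs)
    "batch": 32,             # batch size
    "lr0": 0.01,             # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "warmup_epochs": 3,
    "warmup_momentum": 0.8,
    "warmup_bias_lr": 0.1,
    "hsv_h": 0.015,          # hue augmentation (fraction)
    "hsv_s": 0.7,            # saturation augmentation (fraction)
    "hsv_v": 0.4,            # value augmentation (fraction)
    "degrees": 0.0,          # rotation
    "translate": 0.1,
    "scale": 0.5,
    "fliplr": 0.5,           # left-right flip probability
    "mosaic": 1.0,           # mosaic probability
}


def validate_overrides(cfg):
    """Check that fraction- and probability-like entries lie in [0, 1]."""
    for key in ("hsv_h", "hsv_s", "hsv_v", "translate", "fliplr", "mosaic"):
        if not 0.0 <= cfg[key] <= 1.0:
            raise ValueError(f"{key} must be in [0, 1], got {cfg[key]}")
    return True


validate_overrides(TRAIN_OVERRIDES)

# Assumed usage (not executed here; weight and data file names are hypothetical):
#   from ultralytics import YOLO
#   YOLO("yolov8n-seg.pt").train(data="orchard.yaml", **TRAIN_OVERRIDES)
```

Keeping the settings in one dictionary makes it straightforward to log them alongside the results and to apply the same augmentation policy to both datasets.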

The number of training epochs was determined through a preliminary test. In this test, the model performance with the validation dataset was monitored during the training process and the number of epochs when validation accuracy started to decrease or stay flat was chosen as the optimal number of epochs, which was expected to help avoid model overfitting.

3.4 Performance Evaluation

To evaluate the instance segmentation capabilities of the Mask R-CNN and YOLOv8 models, five distinct metrics were used: Precision, Recall, mean Average Precision (mAP) at 0.5 intersection over union (mAP@0.5 IoU), Area Under the receiver operating characteristic Curve (AUC), and Inference speed. Precision is defined as the proportion of correctly identified positive instances to the total predicted positive instances, as depicted by equation 1. Similarly, recall, depicted by equation 2, quantifies the proportion of correctly identified positive instances out of all actual instances of the target objects. Furthermore, the mean average precision (mAP), represented as the average of the AP across k categories (equation 4), was crucial in evaluating the model’s precision at a threshold of 0.5 IoU overlap between predicted and true object boundaries/bounding boxes. The area under the curve (AUC), defined by equation 5, assessed the model’s classification efficacy across all possible thresholds. The model’s efficiency in processing and delivering predictions was measured by the inference speed, which is inversely related to the time taken per image analysis.
These metrics are calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

$$mAP = \frac{1}{k} \sum_{i=1}^{k} AP_i \qquad (4)$$

$$AUC = \int_{0}^{1} TPR \; d(FPR) \qquad (5)$$

where TP, FP, and FN represent true positive, false positive, and false negative object instances, respectively. Variable $k$ represents the total number of object classes, and $AP_i$ refers to the average precision calculated for the $i$-th class among these $k$ classes. AP is the area under the precision-recall curve for a given class. TPR represents the true positive rate, FPR is the false positive rate, and $t$ indicates the time taken for the model to infer results for a given (single) image, so that the inference speed is proportional to $1/t$.
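The definitions above translate directly into code. The following sketch is a simplified illustration, not the evaluation code used in this study: AP is approximated by trapezoidal integration of a precision-recall curve (common toolkits use interpolated variants), and the function names are hypothetical. The per-class AP values in the example are the YOLOv8 multi-class values reported in Section 4.2.

```python
def precision(tp, fp):
    """Equation 1: fraction of predicted instances that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation 2: fraction of actual instances that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Area under a precision-recall curve (trapezoidal approximation).

    `recalls` must be sorted in ascending order and paired with `precisions`.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def mean_average_precision(ap_per_class):
    """Equation 4: mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Per-class AP@0.5 reported for YOLOv8 on the dormant-season dataset.
yolov8_ap = {"trunk": 0.971, "branch": 0.719}
print(round(mean_average_precision(yolov8_ap), 3))
```

Averaging the two reported per-class values reproduces the all-class mAP@0.5 of 0.845 given for YOLOv8 in Section 4.2.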

4. Results and Discussion

4.1 Single-class Object Segmentation of Immature Green Apples (Fruitlets)

For single-class segmentation of immature green fruits, the Precision-Confidence curves depicted in Figure 5 revealed that the YOLOv8 model achieved a maximum precision of 1.00 when the confidence threshold was 0.929 (Figure 5a). Correspondingly, the Recall-Confidence curves for the respective models are presented in Figure 6, which showed that YOLOv8’s recall reached 0.97 at the minimum confidence threshold of 0.000. This high recall rate, or sensitivity, indicates the model’s ability to correctly identify a high percentage of actual objects, demonstrating the model’s effectiveness in segmenting green fruits even at the lowest confidence levels. Additionally, YOLOv8 outperformed Mask R-CNN in terms of mean average precision (mAP), achieving 0.939 at a 0.5 IoU threshold for green fruits and overall categories, compared to an mAP of 0.902 achieved with Mask R-CNN (Figure 7).
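The behavior of the Precision-Confidence and Recall-Confidence curves discussed above (precision peaking at a high confidence threshold, recall peaking at a threshold of zero) follows from how detections are filtered by confidence. The sketch below illustrates this with a small set of hypothetical detections; the data, the function name, and the convention of reporting a precision of 1.0 when no detection survives the threshold are assumptions for illustration only.

```python
def precision_recall_at_threshold(detections, num_ground_truth, threshold):
    """Precision and recall after discarding detections below `threshold`.

    `detections` is a list of (confidence, is_true_positive) pairs for one
    class, already matched to ground truth at a fixed IoU (e.g., 0.5).
    """
    kept = [is_tp for conf, is_tp in detections if conf >= threshold]
    tp = sum(kept)
    # Assumed convention: precision is 1.0 when nothing is predicted.
    precision = tp / len(kept) if kept else 1.0
    recall = tp / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


# Hypothetical detections for one image set containing 5 ground-truth fruits.
detections = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.60, True), (0.30, False), (0.10, True),
]

for threshold in (0.0, 0.5, 0.85):
    p, r = precision_recall_at_threshold(detections, 5, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold removes low-confidence false positives, which raises precision, but it also discards some true positives, which lowers recall. This is why the two curves reach their maxima at opposite ends of the confidence axis.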
The performance differences between YOLOv8 and Mask R-CNN generally reflected the distinct nature of their architectures and the way they process images. YOLOv8, being a one-stage detector, is designed for speed and accuracy, making it capable of excluding similar non-target areas, as observed in the segmentation tasks (Figure 8b). Its direct approach to object detection avoids the region proposal step, leading to fewer false positives in areas of the canopy that resembled the target fruit in color. Mask R-CNN, on the other hand, uses a two-stage process, which involves generating region proposals before classifying and segmenting objects. This can sometimes result in the inclusion of non-target areas, such as leaves and stems being misclassified as fruits (Figure 8c). Moreover, its performance appears to be more sensitive to lighting variations, which can lead to errors in object identification under extreme lighting situations such as bright, direct sunlight and dark shadows (Figure 9c). Despite these differences, there are specific situations where Mask R-CNN could still be the preferred choice. Its two-stage process, particularly the region proposal step, can be advantageous in complex segmentation tasks where precision is critical and objects are densely packed or partially obscured. In the past, green fruit segmentation has been investigated using various approaches. Wei et al.’s D2D framework [108], GHFormer’s focus on night-time detection [109], Liu et al.’s FCOS model for obscured fruits [110], Jia et al.’s ResNet-based FoveaMask [111], and Sun et al.’s combination of GrabCut and Ncut algorithms [112] each offered solutions to specific segmentation challenges such as lower accuracy and higher computation cost. Some studies also explored semi-automated models [113]. However, the performance of the YOLOv8 model in this study exceeded that of the reviewed past studies. Likewise, the performance of the Mask R-CNN model in segmenting immature green fruits, while not as high as YOLOv8’s, still surpassed many recent approaches [78], [84], [109], [110], [113], [114].

Figure 5: Precision-Confidence curve for single-class segmentation of immature green apples (fruitlets) using: (a) YOLOv8; and (b) Mask R-CNN.

Figure 6: Recall-Confidence curve for single-class segmentation of green apple fruits using: (a) YOLOv8; and (b) Mask R-CNN.

Figure 7: Precision-Recall curve for single-class segmentation of green apple fruits at mAP@0.5: (a) YOLOv8; and (b) Mask R-CNN.

Figure 8: Example images showing the performance of two methods in segmenting immature green fruit in orchard condition; (a) Original images; (b) Instance segmentation results of YOLOv8; and (c) Instance segmentation results of Mask R-CNN. It is noted that some problematic regions in the canopy images (yellow circles) were incorrectly segmented as green fruit by Mask R-CNN but were correctly left as background by YOLOv8.

Figure 9: Examples of incorrect detections under growing-season orchard conditions, with the yellow region marking the focus area: (a) Original image; (b) Instance segmentation results of YOLOv8; and (c) Instance segmentation results of Mask R-CNN.

In a recently published study on multi-class fruit detection using a robotic vision system, Wan et al. [115] compared YOLOv3, Faster R-CNN, and Improved Faster R-CNN, and achieved an mAP% of

and

respectively. Based on the performance measures achieved in this study (e.g.,

for YOLOv8 and

for Mask R-CNN) for single-class datasets, it was observed that YOLOv8 and Mask R-CNN have the potential to achieve substantially better performance than that achieved with the other models. However, a further study comparing the performance of all these models on the same dataset would be essential to substantiate this finding.

4.2 Multi-class Object Segmentation in Images of Dormant Apple Trees

Similar to the single-class object segmentation discussed above, YOLOv8 performed better than Mask R-CNN in segmenting dormant apple tree images into multiple object classes (trunks and branches). YOLOv8 achieved a precision of 1.00 at a confidence threshold of 0.906, as shown in Figure 10a. Similarly, Figure 11 shows that the recall for YOLOv8 reached 0.95 at the minimal confidence threshold, indicating a high degree of accuracy in segmenting these complex structures of dormant tree canopies. Mask R-CNN reached a precision of 1.00 at a lower confidence threshold of 0.813, suggesting a strong ability to correctly detect target objects at this level of confidence (Figure 10b). Additionally, the recall of Mask R-CNN, as depicted in Figure 11, reached 0.837 at the lowest confidence threshold, indicating a higher rate of false negatives compared to YOLOv8. Similarly, the precision-recall curve (Figure 12a) showed that YOLOv8 achieved a mean average precision (mAP) of 0.845 over all object classes at an intersection over union (IoU) of 0.5, with values of 0.971 and 0.719 for the trunk and branch classes, respectively. Mask R-CNN achieved relatively lower performance in multi-class segmentation tasks as well. As seen in Figure 12b (precision-recall curve), the model achieved an all-class mAP of 0.748 at an IoU of 0.5, with individual mAP values of 0.828 for trunk segmentation and 0.673 for branch segmentation.

Example images demonstrating comparative successes and failures of these models (YOLOv8 and Mask R-CNN) in segmenting trunks and branches are depicted in Figures 13 and 14. As shown before with mAP and other measures, trunks were segmented with higher accuracy by YOLOv8 than by Mask R-CNN, as indicated by the sample cases shown in Figures 13b and 13c, respectively. Specifically, the branch highlighted within the yellow dotted rectangle (Figure 13a, b, and c) was successfully detected by YOLOv8 but not by Mask R-CNN, showing YOLOv8’s better performance in low-light conditions. The example in Figure 13 also shows that YOLOv8 was more effective in segmenting trunks. Similarly, Figure 14 presents examples of successful and failed segmentations of both trunks and branches, which showed that YOLOv8 was more precise (fewer false detections) than Mask R-CNN, particularly in areas with challenging lighting and complex backgrounds (e.g., the rectangular box in Figure 14b). Comparatively, Mask R-CNN exhibited lower performance under these conditions, with its limitations being more apparent in poorly lit areas with complex backgrounds (e.g., Figure 14c). The segmentation of the branch within the yellow rectangle (Figure 14d) also highlighted YOLOv8’s ability to detect features despite variable lighting conditions created by shadows and hue variations, an area where Mask R-CNN was less robust in segmenting the desired objects (Figure 14e).

Figure 10: Precision-Confidence curve for multi-class segmentation of trunks and branches: (a) YOLOv8; and (b) Mask R-CNN.

Figure 11: Recall-confidence curve for multi-class segmentation of trunks and branches achieved with; (a) YOLOv8; and (b) Mask R-CNN.

Figure 12: Precision-Recall curve for multi-class segmentation of trunks and branches of dormant apple trees at mAP@0.5; (a) with YOLOv8; and (b) with Mask R-CNN.

Computational speed is one of the major performance measures of these models, particularly when they are used for real-time field applications such as robotic pruning or thinning. The inference times (processing time per image during testing) required for segmenting green fruit and multi-class objects (trunks and branches) with the YOLOv8 and Mask R-CNN models are presented in Table 3. It was found that YOLOv8 took only 7.8 ms to complete single-class segmentation and 10.9 ms for multi-class segmentation per test image using an Intel Xeon® W-2155 CPU @ 3.30 GHz x20 processor, NVIDIA TITAN Xp Collector’s Edition/PCIe/SSE2 graphics card, 31.1 GiB memory, and Ubuntu 16.04 LTS 64-bit operating system. These inference times correspond to inference speeds of approximately 128 FPS and 92 FPS for single-class and multi-class segmentation, respectively. Comparatively, the inference time for Mask R-CNN was higher at 12.8 ms for single-class segmentation, which translates to an inference speed of approximately 78 FPS. For multi-class segmentation, the inference time increased to 15.6 ms for Mask R-CNN, or roughly 64 FPS. This difference in processing time showed the suitability of YOLOv8 for both single-class and multi-class instance segmentation in real-time applications.
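The conversion from per-image inference time to frames per second used above is a simple reciprocal, which the following sketch makes explicit. The function and dictionary names are hypothetical; the inference times are the values reported in this section.

```python
def fps_from_latency_ms(latency_ms):
    """Convert a per-image inference time in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def speedup(slow_ms, fast_ms):
    """How many times faster the model with the shorter inference time is."""
    return slow_ms / fast_ms


# Inference times (ms per image) reported in this study.
REPORTED_MS = {
    ("YOLOv8", "single-class"): 7.8,
    ("Mask R-CNN", "single-class"): 12.8,
    ("YOLOv8", "multi-class"): 10.9,
    ("Mask R-CNN", "multi-class"): 15.6,
}

for (model, task), latency in REPORTED_MS.items():
    print(f"{model} ({task}): {fps_from_latency_ms(latency):.1f} FPS")

for task in ("single-class", "multi-class"):
    ratio = speedup(REPORTED_MS[("Mask R-CNN", task)], REPORTED_MS[("YOLOv8", task)])
    print(f"YOLOv8 speed-up ({task}): {ratio:.2f}x")
```

Because FPS is the reciprocal of latency, the ratio of two FPS values equals the inverse ratio of the corresponding inference times, so either quantity can be used to compare the models.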

Figure 13: Example results for multiclass segmentation of trunks (yellow circle) and branches (yellow rectangle) in dormant season orchard images; (a) Original images ; (b) YOLOv8 segmentation results; and (c) Mask R-CNN segmentation results. This example showed slightly weaker segmentation performance of Mask R-CNN, qualitatively, compared to YOLOv8.

Figure 14: Examples illustrating multi-class segmentation: (a) Original image 1; (b) YOLOv8 segmentation; (c) Mask R-CNN segmentation; (d) Original image 2; (e) YOLOv8 segmentation; and (f) Mask R-CNN segmentation.

Figure 15: Area under curve (AUC) for the segmentation results of both datasets: Immature green fruit (Apple) dataset on left side; Dormant season orchard dataset on right side

Table 3: Summary of the performance metrics of YOLOv8 and Mask R-CNN models including precision, recall, mAP@0.5, inference times, and FPS for single and multi-class object segmentation tasks in this study.

Model	Precision (%)	Recall (%)	mAP@0.5	Inference Time (ms)	Frames Per Second (FPS)
YOLOv8 (Single-class)	92.9	97	0.902	7.8	128.21
Mask R-CNN (Single-class)	84.7	88	0.85	12.8	78.13
YOLOv8 (Multi-class)	90.6	95	0.74	10.9	91.74
Mask R-CNN (Multi-class)	81.3	83.7	0.700	15.6	64.10

According to a recent comparative study of immature green apple detection using machine learning and deep learning models [116], Fully Convolutional One-Stage (FCOS) with a ResNet101 RFPN backbone achieved a precision of

. SSD, utilizing VGG16, had a precision of

, while YOLOv3 with Darknet-53 reached

. Faster-R-CNN and RetinaNet, both employing ResNet101-FPN, achieved precisions of

and

, respectively. Lastly, CenterNet, using the Hourglass-104 backbone, recorded a precision of

. Compared with these results, the precision recorded by both YOLOv8 and Mask R-CNN (92.9 and 84.7, respectively) in this study is higher. Likewise, other recent studies have been conducted to detect and segment branches in apple trees, such as UNet++ [117] with an accuracy of

. Furthermore, studies[118] also aimed at similar segmentation tasks, yet their outcomes fall short of the results obtained in our study, where YOLOv8 and Mask R-CNN demonstrated higher precision rates of

and

respectively for branch and trunk segmentation.

4.3 Discussion

Mask R-CNN, while demonstrating commendable accuracy in segmenting complex agricultural images, has a slight disadvantage in terms of speed. In this study, which analyzed performance on green apple fruitlets during the canopy season and on tree trunks and branches during the dormant season, Mask R-CNN achieved 78 FPS for single-class and 64 FPS for multi-class segmentation using the System76 workstation. Though this level of inference speed might be sufficient for most off-line applications, where relatively higher computational capacity can be offered, it may pose challenges in real-time agricultural operations with limited computational resources, such as automated pruning and rapid decision-making for fertilization. However, its detailed segmentation capability makes it highly suitable for applications where precision and detailed object delineation are essential.
YOLOv8 stands out for its speed, achieving 128 FPS (approximately 1.64 times faster than Mask R-CNN) for single-class and 92 FPS (approximately 1.43 times faster than Mask R-CNN) for multi-class segmentation with the same imaging and computational infrastructure used to test Mask R-CNN. The comparatively higher inference speed of this model is particularly advantageous for real-time agricultural tasks, as discussed above. Its high precision and recall metrics further emphasize its robust performance across diverse environmental settings, including variable light conditions. However, while YOLOv8 offers substantial improvements in speed and accuracy, it may sacrifice some granularity in segmentation compared to two-stage models like Mask R-CNN, which makes it slightly less applicable where minute detail is more critical than processing speed.

YOLOv8’s faster inference rates are particularly beneficial for time-sensitive tasks such as automated pruning, especially in low-light conditions, underscoring its superior suitability for operational efficiency in precision agriculture.

In general, however, these findings showed that the two models evaluated in this study could be effective and efficient tools for developing various precision and automated agricultural technologies, with potential applications extending to crops beyond apples, which could play a crucial role in enhancing crop management and improving crop yield and quality through machine learning. In particular, YOLOv8 showed good adaptability across different orchard conditions, which is a critical benefit in advancing robust machine learning-based solutions for future innovations in smart farming. The incorporation of machine learning is key to meeting global agricultural sustainability and food security needs.

5. Conclusion

In recent years, there has been increased research, development, and adoption of sensing, precision, automation, and robotics technologies in agricultural operations, driven by the need to minimize farming inputs, including labor, and to increase crop yield and quality. This study, through a comprehensive experiment in commercial orchards, provided comparative performance measures of two recent and widely used deep learning models (YOLOv8 and Mask R-CNN) for instance segmentation as it relates to their applicability to various crop monitoring and automated canopy and crop-load management tasks (e.g., automated pruning and immature green fruit thinning). Based on the results, the following specific conclusions could be made.

Segmentation Performance in Diverse Conditions: Both YOLOv8 and Mask R-CNN effectively segmented apple tree canopy images from both the dormant and early growing seasons. YOLOv8 showed slightly better performance in environments with similar color features between objects and backgrounds and under varying light intensities.
Single-Class Segmentation (Immature Green Fruit): YOLOv8 outperformed Mask R-CNN in single-class segmentation of immature green fruits, achieving a precision of 0.92 and a recall of 0.97. In comparison, Mask R-CNN exhibited slightly less effective segmentation capabilities, with a precision of 0.84 and a recall of 0.88.
Multi-Class Segmentation (Trunk and Branch Detection): In the detection of both trunks and branches, YOLOv8 displayed higher accuracy, achieving precision and recall of 0.90 and 0.95, respectively. Mask R-CNN achieved lower precision and recall, at 0.81 and 0.83, respectively, indicating reduced effectiveness in multi-class segmentation tasks.
Inference Speed for Multi-Class Segmentation: YOLOv8 maintained robust performance in multi-class segmentation scenarios with a speed of 91.74 FPS. In contrast, Mask R-CNN’s slower inference speed of 64.10 FPS suggests limitations in handling applications requiring rapid responses.

6. Future Work

Building on the current study, future research could focus on the evolving capabilities of newer object detection models such as YOLOv9, released in February 2024, and YOLOv10 [119], [120], released in May 2024, and on their accuracy, efficiency, and adaptability to agricultural image processing. It is essential to test YOLOv9 and YOLOv10 across diverse agricultural datasets, which include various stages of crop growth, different levels of occlusion, and varying environmental conditions, to evaluate their effectiveness in actual agricultural environments. Such a study could particularly explore how YOLOv9 and YOLOv10 handle complex detection tasks such as identifying subtle phenotypic changes in crops under challenging light conditions or at different times of the day, situations that are typical in outdoor farming environments. Furthermore, the integration of YOLOv9 and YOLOv10 with Internet of Things (IoT) technologies could be explored to develop advanced systems for real-time monitoring and decision-making in agriculture.

Acknowledgement

This research is funded by the National Science Foundation and the United States Department of Agriculture, National Institute of Food and Agriculture through the “AI Institute for Agriculture” Program (Award No. AWD003473). The authors gratefully acknowledge Dave Allan (Allan Bros., Inc.) for providing access to the orchards during data collection and field evaluation. Additionally, the authors would like to thank Christine Cromar, Bonnie Copeland, and Patrick Scharf for their essential support with the logistics of this project.

Author’s Contribution

Ranjan Sapkota: Conceptualization, Investigation, Visualization, Methodology, Writing – original draft, review & editing. Dawood Ahmed: Investigation, Visualization & Methodology. Manoj Karkee: Supervision, Funding, writing – review & editing.

Declaration of generative AI and AI-assisted technologies in the writing process: During the preparation of this work, the authors used ChatGPT in order to correct grammar and language. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.