The development and evaluation of agricultural question-answering systems based on large language models

Journal: Scientific Reports, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41598-026-35003-9
PMID: https://pubmed.ncbi.nlm.nih.gov/41663434
Publication Date: 2026-02-09
Author(s): Ayşe Eldem et al.
Primary Topic: Topic Modeling

Overview

This study by Ayşe Eldem and Hüseyin Eldem evaluates large language models (LLMs) in agricultural question-answering systems. Although LLMs have performed strongly across many domains, their use in agriculture remains limited. The researchers developed a set of multiple-choice questions covering three topics (General, Horticulture, and Crop Production) at three difficulty levels (Easy, Medium, Difficult). Answers were generated with GPT-4o and Gemini-2.0-flash under four prompting strategies: Zero-Shot, Chain-of-Thought (CoT), Self-Consistency, and Tree-of-Thought (ToT). The prompts were additionally optimized automatically through an Automatic Prompt Engineer (APE) pipeline.

The study assessed the accuracy and consistency of the responses and found that, while the LLMs generally performed well, their effectiveness varied significantly with the chosen model and prompting strategy. Statistical analyses, including bootstrap confidence intervals, paired t-tests, ANOVA, and effect size measures (Cohen’s h and d), were used to evaluate the effects of model, prompting method, difficulty level, and question category. The research contributes one of the first domain-specific question-answering systems for agricultural professionals, establishing a foundation for advanced digital applications in the sector.
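Two of the statistics named above, the bootstrap confidence interval and Cohen’s h, can be illustrated with a minimal Python sketch. It assumes per-question outcomes are stored as 0/1 flags and uses a percentile bootstrap; the summary does not say which bootstrap variant or resample count the authors used, and the outcome lists below are made up for illustration, not data from the study.

```python
import math
import random

def accuracy(outcomes):
    """Fraction of correct answers; outcomes is a list of 0/1 flags."""
    return sum(outcomes) / len(outcomes)

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the questions with replacement and recompute accuracy each time.
    stats = sorted(
        accuracy([outcomes[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def cohens_h(p1, p2):
    """Cohen's h: effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Made-up per-question outcomes (1 = correct), NOT results from the paper.
zero_shot = [1] * 80 + [0] * 20
self_consistency = [1] * 95 + [0] * 5

print("zero-shot accuracy:", accuracy(zero_shot), "CI:", bootstrap_ci(zero_shot))
print("Cohen's h:", cohens_h(accuracy(self_consistency), accuracy(zero_shot)))
```

Cohen’s h is used for proportions such as accuracy because the arcsine transform makes a given difference comparable across the 0 to 1 range.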

Methods

The study developed an LLM-based system that answers agricultural questions drawn from a structured question-answer set. The methodology involved several steps. First, a dataset of multiple-choice questions, each with four answer options, was created and saved in CSV format. Next, prompts were generated using four methods: Zero-Shot, Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Self-Consistency. These were applied to two LLMs, GPT-4o and Gemini-2.0-flash. Each question was evaluated under every prompting technique, and the prompts were then optimized with Automatic Prompt Engineer (APE) to improve accuracy and consistency. The results were recorded and analyzed to determine the success and error rates of each approach.
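The first two steps, reading a CSV question and turning it into strategy-specific prompts, might look roughly like the sketch below. The column names, the sample question, and the instruction wording are all hypothetical; the summary does not give the study's actual CSV schema or prompt texts.

```python
import csv
import io

# Hypothetical CSV layout and made-up sample row; not taken from the study.
SAMPLE_CSV = """question,A,B,C,D,answer,topic,difficulty
Which nutrient is most associated with leaf growth?,Nitrogen,Calcium,Boron,Zinc,A,General,Easy
"""

# Illustrative instruction suffixes, not the prompts used in the paper.
STRATEGY_SUFFIX = {
    "zero_shot": "Answer with a single letter (A, B, C, or D).",
    "cot": "Think step by step, then give the final answer as a single letter.",
}

def format_question(row):
    """Render the question followed by its four lettered options."""
    options = "\n".join(f"{key}) {row[key]}" for key in "ABCD")
    return f"{row['question']}\n{options}"

def build_prompt(row, strategy):
    """Combine the formatted question with the strategy's instruction."""
    return f"{format_question(row)}\n\n{STRATEGY_SUFFIX[strategy]}"

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for strategy in STRATEGY_SUFFIX:
    print(f"--- {strategy} ---")
    print(build_prompt(rows[0], strategy))
```

Keeping the question formatting separate from the strategy suffix means every strategy sees an identical question body, so accuracy differences can be attributed to the instruction alone.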

The study chose GPT-4o and Gemini-2.0-flash for their advanced capabilities in processing agricultural terminology and their robust multilingual support. The prompting strategies were designed to maximize model performance: Zero-Shot asked for a direct answer, CoT had the model lay out its reasoning steps, Self-Consistency generated multiple answers for comparison, and ToT structured the response as a coherent thought process. APE refined these prompts so that the models produced reliable answers with minimal human intervention. The experimental results report accuracy and error rates, broken down by question difficulty and type, and show how effective each prompting method is for agricultural queries.
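The idea behind Self-Consistency, sampling several answers and keeping the most frequent one, can be sketched as follows. The `sample_answer` function is a hypothetical stand-in that draws a random letter instead of calling a model, and the sample count of nine is an arbitrary choice; neither reflects the study's implementation.

```python
import random
from collections import Counter

def sample_answer(prompt, rng):
    """Stand-in for one sampled model completion reduced to a letter.

    Hypothetical stub: a real system would call an LLM API with a nonzero
    temperature and parse the final letter out of the reasoning text.
    """
    return rng.choices("ABCD", weights=[0.6, 0.2, 0.1, 0.1])[0]

def self_consistency(prompt, n_samples=9, seed=0):
    """Sample several answers and return the majority vote."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(prompt, rng) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    # The agreement ratio is a rough indicator of how stable the answer is.
    return answer, count / n_samples, votes

answer, agreement, votes = self_consistency("Which nutrient ...?")
print(answer, agreement, dict(votes))
```

Each sample corresponds to a separate model call, so this method trades higher API usage for a more stable answer.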

Discussion

The study explored the application of large language models (LLMs) in agriculture, focusing on their effectiveness in question-answering systems. It reviewed prior studies highlighting the advantages of LLMs, such as their ability to generate meaningful outputs with minimal labeled data, in contrast to traditional machine learning methods that require extensive datasets and expert involvement. The study used two LLMs, GPT-4o and Gemini-2.0-flash, with four prompting techniques: Zero-Shot, Chain-of-Thought (CoT), Self-Consistency, and Tree-of-Thought (ToT). Results indicated that GPT-4o outperformed Gemini-2.0-flash, particularly with Self-Consistency, which achieved an accuracy of 95.3%, while Zero-Shot yielded lower accuracy.

The research also identified significant performance differences among the prompting techniques, with statistical analyses confirming that reasoning-based methods such as Self-Consistency and ToT outperformed the Zero-Shot approach. The findings emphasize that both the choice of LLM and the prompting strategy matter for the accuracy and reliability of agricultural question-answering systems. The study also addressed gaps in the literature on the systematic evaluation of LLMs in agriculture, advocating more rigorous statistical assessment to improve the reliability and reproducibility of future findings. Overall, it offers insight into the potential of LLMs in agricultural applications, highlighting their effectiveness in decision-making and knowledge dissemination.
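A comparison between two prompting techniques on the same questions is the kind of setting where a paired t-test and Cohen’s d apply. The sketch below assumes the pairing is over per-question correctness scores, which the summary does not confirm, and the score lists are made up for illustration.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic, Cohen's d (paired form), and degrees of freedom.

    Raises ZeroDivisionError if every paired difference is identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)                 # sample standard deviation of differences
    d = mean(diffs) / sd              # Cohen's d for paired samples
    t = d * math.sqrt(n)              # same as mean / (sd / sqrt(n))
    return t, d, n - 1

# Made-up correctness flags for the same ten questions, NOT study data.
zero_shot_scores        = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
self_consistency_scores = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]

t, d, df = paired_t(self_consistency_scores, zero_shot_scores)
print(f"t = {t:.3f}, d = {d:.3f}, df = {df}")
```

The t statistic would then be compared against a t distribution with the reported degrees of freedom to obtain a p-value.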

Limitations

The study offers valuable insights into applying large language models (LLMs) with various prompting techniques in agriculture, but it acknowledges several limitations. Notably, it did not compare the prompted LLMs against fine-tuned models, retrieval-augmented generation (RAG) pipelines, domain-specific models, or traditional methodologies. This choice was intentional: the aim was to assess how usable existing LLMs are for agricultural practitioners and to validate the results through a comprehensive analysis of prompting techniques. The findings therefore serve as a foundation for future research on LLM applications in agriculture.

Moreover, the study highlights the need for clear guidelines when deploying question-answering systems in agricultural contexts. It emphasizes consulting agricultural experts, especially in critical situations that require physical assessment, such as disease diagnosis or pesticide recommendations. It also notes that practical considerations, including the API costs of the different prompting methods and the response times of the resulting systems, are crucial for real-world use. A balanced approach is therefore essential, one that weighs the intended use, user demographics, prompting strategies, acceptable response times, and cost-effectiveness.
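The cost side of this trade-off can be made concrete with a back-of-the-envelope estimate. Every number below (calls per question, token counts, and prices) is a placeholder chosen for illustration; none comes from the study or from any provider's price list.

```python
from dataclasses import dataclass

@dataclass
class StrategyProfile:
    calls_per_question: int   # API calls needed to produce one answer
    input_tokens: int         # prompt tokens per call
    output_tokens: int        # completion tokens per call

def cost_per_question(profile, usd_per_1k_input, usd_per_1k_output):
    """Estimated API cost in USD for answering one question."""
    per_call = (profile.input_tokens * usd_per_1k_input
                + profile.output_tokens * usd_per_1k_output) / 1000
    return profile.calls_per_question * per_call

# Placeholder profiles: hypothetical values, not measurements.
PROFILES = {
    "zero_shot": StrategyProfile(1, 150, 5),
    "cot": StrategyProfile(1, 170, 200),
    "self_consistency": StrategyProfile(5, 170, 200),
}
PRICE_IN, PRICE_OUT = 0.001, 0.002  # placeholder USD per 1k tokens

for name, profile in PROFILES.items():
    print(name, round(cost_per_question(profile, PRICE_IN, PRICE_OUT), 6))
```

Under these assumptions a sampling-based strategy costs a fixed multiple of its single-call counterpart, which is why the number of samples is a natural lever for balancing accuracy against budget and response time.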