Abacus: A Cost-Based Optimizer for Semantic Operator Systems

Journal: Proceedings of the VLDB Endowment, Volume: 19, Issue: 5
DOI: https://doi.org/10.14778/3796195.3796215
Publication Date: 2026-01-01
Author(s): Matthew Russo et al.
Primary Topic: Topic Modeling

Overview

The paper introduces Abacus, a cost-based optimizer for semantic operator systems that process large collections of unstructured documents. Abacus optimizes the implementation of semantic operators (such as maps, filters, and joins) for quality, cost, and latency, optionally subject to constraints on those metrics. To estimate operator performance, it combines a small number of validation examples, prior beliefs about operator performance, and an LLM judge.

In evaluations on document processing tasks in the biomedical and legal domains, and on multi-modal question answering, systems optimized by Abacus achieved average quality improvements of 6.7% to 39.4% while being 10.8 times cheaper and 3.4 times faster than the next best competing system. The paper closes by highlighting Abacus’s algorithmic contributions: it adapts traditional multi-armed bandit and Cascades algorithms to support better performance estimation and constrained optimization, which reduces the number of samples needed to reach optimal results.

Introduction

The introduction discusses the growing use of large language models (LLMs) in both industry and academia for tasks that require semantic understanding, such as unstructured document processing, multi-modal question answering, and semantic search. To achieve good performance, practitioners often break these tasks into modular subtasks using programming frameworks such as Palimpzest, LOTUS, and DocETL. These frameworks are built on semantic operators, which are AI-driven data transformations specified in natural language. Semantic operators support data processing tasks such as information extraction and classification, and they let developers define logical plans that can be optimized into physical plans for execution.

The paper then introduces Abacus, a cost-based optimizer for semantic operator systems. Unlike existing frameworks, which primarily optimize for system quality, Abacus also accounts for cost and latency, both as optimization targets and as constraints. It uses a multi-armed bandit approach to explore the space of physical operators efficiently, and a new Pareto-Cascades algorithm to maintain the Pareto frontier of subplans during optimization. The results show that Abacus outperforms competing systems, with quality improvements of up to 39.4% while being 10.8 times cheaper and 3.4 times faster than the next best alternative. These findings indicate that Abacus is effective at optimizing semantic operator systems under a range of constraints.

Discussion

In this section, the authors give an overview of the Abacus optimizer. Abacus takes three inputs: an AI program, an optimization objective, and an input dataset. It begins by compiling the program into a logical plan. The optimization objective may be constrained or unconstrained, and it can target output quality, cost, or latency. Users can guide the optimizer by supplying validation datasets and prior beliefs about operator performance, both of which inform the optimization process.
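To make the distinction between constrained and unconstrained objectives concrete, here is a minimal Python sketch. It is not Abacus's API: the names `PlanEstimate`, `Objective`, `satisfies`, and `choose_plan` are hypothetical, and the sketch assumes that every candidate plan already has an estimated quality, cost, and latency.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PlanEstimate:
    """Hypothetical container for a plan's estimated metrics."""
    name: str
    quality: float   # higher is better
    cost: float      # lower is better
    latency: float   # lower is better


@dataclass(frozen=True)
class Objective:
    """Hypothetical objective: one target metric plus optional constraints."""
    target: str = "quality"              # "quality", "cost", or "latency"
    max_cost: Optional[float] = None
    max_latency: Optional[float] = None
    min_quality: Optional[float] = None


def satisfies(plan: PlanEstimate, obj: Objective) -> bool:
    """Return True if the plan meets every constraint that is set."""
    if obj.max_cost is not None and plan.cost > obj.max_cost:
        return False
    if obj.max_latency is not None and plan.latency > obj.max_latency:
        return False
    if obj.min_quality is not None and plan.quality < obj.min_quality:
        return False
    return True


def choose_plan(plans: List[PlanEstimate], obj: Objective) -> Optional[PlanEstimate]:
    """Pick the best feasible plan, or None if no plan meets the constraints."""
    feasible = [p for p in plans if satisfies(p, obj)]
    if not feasible:
        return None
    if obj.target == "quality":
        return max(feasible, key=lambda p: p.quality)
    # Cost and latency are minimized.
    return min(feasible, key=lambda p: getattr(p, obj.target))
```

With no constraints set, `choose_plan` simply optimizes the target metric. Setting `max_cost`, for example, turns the same call into "maximize quality subject to a cost budget".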

At its core, Abacus builds a search space of physical operators that correspond to the logical operators in the compiled plan. It applies transformation rules to generate functionally equivalent logical subplans and implementation rules to define their physical implementations. The operator sampling phase is crucial: it identifies high-quality physical operators that can be composed into plans meeting the user’s objectives. Abacus uses a multi-armed bandit approach to manage the exploration-exploitation trade-off when selecting operators, focusing on those that lie on the Pareto frontier of cost versus quality. Finally, the Pareto-Cascades algorithm guides the selection of the final plan, choosing the one that best balances the trade-offs implied by the user’s constraints and objectives.
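The two ideas in this paragraph, the Pareto frontier and bandit-style sampling, can be illustrated with a short Python sketch. This is not the paper's algorithm: `pareto_frontier` is a plain dominance filter over (cost, quality) pairs, and `ucb_pick` is the textbook UCB1 rule, used here only as a stand-in for the modified bandit the paper describes. All names and the example data layout are assumptions.

```python
import math
from typing import Dict, List, Tuple


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return names of operators not dominated on (cost, quality).

    `points` maps an operator name to (cost, quality). An operator is
    dominated if another one is no worse on both metrics and strictly
    better on at least one. The result is sorted by increasing cost.
    """
    frontier = []
    for name, (cost, quality) in points.items():
        dominated = any(
            c <= cost and q >= quality and (c < cost or q > quality)
            for other, (c, q) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


def ucb_pick(stats: Dict[str, Tuple[int, float]], total: int, c: float = 1.0) -> str:
    """Pick the next operator to sample using the generic UCB1 rule.

    `stats` maps an operator name to (num_samples, mean_quality) and
    `total` is the total number of samples drawn so far. Unsampled
    operators are tried first; otherwise the operator with the highest
    mean plus exploration bonus is chosen.
    """
    for name, (n, _) in stats.items():
        if n == 0:
            return name

    def score(name: str) -> float:
        n, mean = stats[name]
        return mean + c * math.sqrt(2.0 * math.log(total) / n)

    return max(stats, key=score)
```

In a loop, one would call `ucb_pick`, run the chosen operator on a validation example, update its statistics, and periodically restrict attention to the operators returned by `pareto_frontier`.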

Limitations

The limitations section highlights a key constraint of Abacus’s cost model: it assumes that the operators within a plan perform independently of one another. This assumption lets Abacus estimate the cost of logical plans that have not been sampled directly, but it may also make those estimates inaccurate.
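The sketch below illustrates what an independence assumption buys. If each operator's metrics are treated as independent of the others, a plan that was never run end to end can still be scored by combining per-operator estimates. The specific combination rules used here (multiplying qualities, summing costs and latencies) are illustrative assumptions, not formulas confirmed by the paper, and `OperatorEstimate` and `estimate_plan` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OperatorEstimate:
    """Hypothetical per-operator estimate gathered from sampling."""
    name: str
    quality: float   # assumed to lie in [0, 1]
    cost: float      # e.g. dollars per record
    latency: float   # e.g. seconds per record


def estimate_plan(ops: List[OperatorEstimate]) -> Tuple[float, float, float]:
    """Combine operator estimates into (quality, cost, latency) for a plan.

    Assumes operators are independent and run sequentially: qualities
    multiply, while costs and latencies add. Any interaction between
    operators (for example, an upstream filter changing what a
    downstream map sees) is ignored, which is exactly where such an
    estimate can go wrong.
    """
    quality, cost, latency = 1.0, 0.0, 0.0
    for op in ops:
        quality *= op.quality
        cost += op.cost
        latency += op.latency
    return quality, cost, latency
```

Under these assumptions, swapping one operator in a plan only requires re-combining the stored estimates, rather than re-running the whole plan.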

To address this limitation, the authors propose integrating Bayesian optimization techniques in future work. These methods have shown promise for performance estimation in similar declarative programming frameworks, which suggests a possible path to a more accurate cost model for Abacus.