Variable selection in data envelopment analysis: A random forest approach with data augmentation

Journal: Decision Science Letters, Volume: 15, Issue: 2
DOI: https://doi.org/10.5267/j.dsl.2026.2.008
Publication Date: 2026-01-01
Author(s): Tzu-Pu Chang
Primary Topic: Efficiency Analysis Using DEA

Overview

This paper addresses variable selection in data envelopment analysis (DEA), a problem that becomes acute when the number of decision-making units (DMUs) is small. The author proposes a two-stage method. The first stage runs a baseline DEA model and labels each DMU as efficient or inefficient. The second stage applies a random forest algorithm to those labels and measures the importance of each variable with permutation importance indices. The study stresses that robust variable selection matters most when few DMUs are available.

The author outlines two strategies based on permutation importance. The first selects the input and output variables with higher permutation importance, which keeps the ranking of DMU efficiency scores close to the baseline while reducing the number of variables. The second instead retains the variables with lower or negative permutation importance, with the stated aim of preserving significant influences on the efficiency order. Both strategies aim to reduce the number of efficient DMUs and to increase the variability of efficiency scores. The paper also applies data augmentation, specifically double-SMOTE and Tomek links, to improve robustness when the number of DMUs is small and when some inefficient DMUs have high efficiency scores. More broadly, the work integrates machine learning into DEA methodology and suggests that future advances in artificial intelligence could further refine DEA applications.
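As a rough illustration of the two strategies, the sketch below ranks variables by a permutation importance score and keeps either the highest-ranked ones (the first strategy) or the lowest-ranked ones (the second). The function name `select_variables`, the variable names, and the importance values are hypothetical, and the number of variables to keep is an assumption, since the summary does not state the paper's cutoff rule.

```python
from typing import Dict, List


def select_variables(importance: Dict[str, float], strategy: str, keep: int) -> List[str]:
    """Return `keep` variable names ranked by permutation importance.

    strategy "A": highest importance first.
    strategy "B": lowest (including negative) importance first.
    """
    if strategy not in ("A", "B"):
        raise ValueError("strategy must be 'A' or 'B'")
    ranked = sorted(importance, key=importance.get, reverse=(strategy == "A"))
    return ranked[:keep]


# Hypothetical permutation importance values, for illustration only.
input_importance = {"x1": 0.12, "x2": -0.01, "x3": 0.04}
output_importance = {"y1": 0.30, "y2": 0.02}

for strategy in ("A", "B"):
    # Inputs and outputs are ranked separately so that the reduced
    # DEA model always keeps at least one of each.
    chosen = (select_variables(input_importance, strategy, keep=2)
              + select_variables(output_importance, strategy, keep=1))
    print(strategy, chosen)
# A ['x1', 'x3', 'y1']
# B ['x2', 'x3', 'y2']
```

Ranking inputs and outputs separately is a design choice of this sketch: a DEA model needs at least one input and one output, so a single pooled ranking could produce an unusable variable set.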

Introduction

The introduction presents Data Envelopment Analysis (DEA), a non-parametric method that uses linear programming to assess the efficiency of decision-making units (DMUs) from multiple input and output variables. A significant challenge in DEA is choosing an appropriate set of these variables, especially when a dataset contains few DMUs relative to the number of inputs and outputs. This imbalance weakens the model’s discriminative power, which motivates variable selection or reduction methods. The author groups existing approaches into “plug-in” methods, which build variable reduction directly into the DEA model, and “add-on” methods, which assess variable importance after the DEA has been run.
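For reference, one standard linear program behind DEA is the input-oriented, constant-returns-to-scale (CCR) envelopment model. The summary does not say which model the paper uses as its baseline, so the choice here is an illustrative assumption. With $n$ DMUs, $m$ inputs and $s$ outputs, let $x_{ij}$ be input $i$ and $y_{rj}$ be output $r$ of DMU $j$, and let $o$ index the DMU under evaluation:

$$
\begin{aligned}
\min_{\theta,\,\lambda}\quad & \theta \\
\text{s.t.}\quad & \sum_{j=1}^{n} \lambda_j x_{ij} \le \theta\, x_{io}, && i = 1,\dots,m, \\
& \sum_{j=1}^{n} \lambda_j y_{rj} \ge y_{ro}, && r = 1,\dots,s, \\
& \lambda_j \ge 0, && j = 1,\dots,n.
\end{aligned}
$$

DMU $o$ is radially efficient when the optimal value is $\theta^{*} = 1$. Each additional input or output adds a constraint, so $\theta^{*}$ can only stay the same or rise. This is one way to see why many variables combined with few DMUs push more units to a score of 1 and weaken discrimination.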

To address this challenge, the paper proposes a two-stage method that first runs a baseline DEA model with all variables and then applies a random forest algorithm to evaluate variable importance. Random forests are known to work well when there are fewer observations than variables, and here the forest classifies DMUs as efficient or inefficient. The study aims to clarify how variable importance derived from a random forest differs from variable importance in DEA, and argues that the proposed method simplifies variable selection for researchers. The remaining sections of the paper detail the proposed procedure, apply it to empirical datasets, and offer concluding remarks.

Discussion

The paper presents a two-stage variable selection procedure that combines Data Envelopment Analysis (DEA) with random forest classification to improve the efficiency evaluation of decision-making units (DMUs). The first stage runs a baseline DEA model with $n$ DMUs, $m$ inputs ($x_{ij}$), and $s$ outputs ($y_{rj}$) to obtain efficiency scores and to label each DMU as efficient or inefficient. This stage leaves the model choice open: researchers can pick the DEA orientation and the returns-to-scale assumption. The second stage trains a random forest classifier and assesses each input and output variable by its permutation importance, which measures the change in prediction accuracy when the values of that variable are permuted. The author proposes two selection strategies: Strategy-A retains variables that yield an efficiency score pattern similar to the baseline DEA, while Strategy-B retains variables that significantly influence the model’s discrimination power.
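The two stages can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes an input-oriented, constant-returns-to-scale DEA model solved with SciPy's `linprog`, synthetic random data, a scikit-learn `RandomForestClassifier` with arbitrary settings, and permutation importance computed on the training data. The function name `ccr_input_efficiency` and the variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance


def ccr_input_efficiency(X, Y):
    """Input-oriented, constant-returns-to-scale efficiency of each DMU.

    X: (n, m) array of inputs, Y: (n, s) array of outputs.
    Returns an array of n scores, where 1 means efficient.
    """
    n, m = X.shape
    s = Y.shape[1]
    scores = np.empty(n)
    for o in range(n):
        # Decision vector: [theta, lambda_1, ..., lambda_n]; minimise theta.
        c = np.zeros(n + 1)
        c[0] = 1.0
        # Inputs: sum_j lambda_j * x_ij - theta * x_io <= 0
        A_in = np.hstack([-X[o].reshape(m, 1), X.T])
        # Outputs: -sum_j lambda_j * y_rj <= -y_ro
        A_out = np.hstack([np.zeros((s, 1)), -Y.T])
        res = linprog(
            c,
            A_ub=np.vstack([A_in, A_out]),
            b_ub=np.concatenate([np.zeros(m), -Y[o]]),
            bounds=[(0, None)] * (n + 1),
            method="highs",
        )
        scores[o] = res.fun
    return scores


# Synthetic data, for illustration only: 30 DMUs, 3 inputs, 2 outputs.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(30, 3))
Y = rng.uniform(1.0, 10.0, size=(30, 2))

# Stage 1: baseline DEA scores and efficient/inefficient labels.
scores = ccr_input_efficiency(X, Y)
labels = (scores >= 1.0 - 1e-6).astype(int)  # 1 = efficient

# Stage 2: random forest classifier and permutation importance.
features = np.hstack([X, Y])
names = ["x1", "x2", "x3", "y1", "y2"]
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(features, labels)
result = permutation_importance(
    forest, features, labels, n_repeats=20, random_state=0
)
for name, mean in zip(names, result.importances_mean):
    print(f"{name}: {mean:+.4f}")
```

The tolerance `1e-6` used to label a DMU as efficient is a numerical convenience of this sketch. Switching orientation or returns to scale would change only the linear program inside `ccr_input_efficiency`, not the second stage.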

Applying the methodology to two datasets (academic departments and hotel chains) demonstrates the effectiveness of the approach. In the academic departments dataset, the author identifies graduate students as the most important output variable, in line with previous literature. For the hotel chain dataset, a model with convenience as the input and value as the output provides a better classification than the baseline model. The author also addresses small sample sizes and borderline DMUs by using the Synthetic Minority Oversampling Technique (SMOTE) and Tomek links to improve robustness and classification accuracy. Overall, the research contributes to the DEA literature by integrating machine learning techniques, and it suggests that DEA methodology could benefit from further incorporation of artificial intelligence and data science methods.
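To make the two resampling ideas concrete, the sketch below implements simplified versions with NumPy only: SMOTE-style interpolation between minority-class neighbours, and detection of Tomek links (pairs of mutual nearest neighbours that carry different labels). It is not the paper's double-SMOTE procedure and not a library implementation. The helper names `smote_like` and `tomek_links` and the toy data are assumptions for illustration.

```python
import numpy as np


def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    points and one of their k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))            # position along the segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and belong to different classes."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nearest)
            if i < j and nearest[j] == i and y[i] != y[j]]


# Toy data: label 1 = efficient (minority), label 0 = inefficient.
# Point 3 is an inefficient unit lying very close to efficient point 0.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.05, 1.0],
              [3.0, 3.0], [3.2, 3.1], [2.8, 3.3], [3.5, 2.9]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

synthetic = smote_like(X[y == 1], n_new=5)
print(synthetic.shape)    # (5, 2)
print(tomek_links(X, y))  # [(0, 3)]
```

Which member of a Tomek link to drop (the majority-class point or both) is a separate decision that this sketch leaves open.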