Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods

Journal: Journal of Big Data, Volume: 11, Issue: 1
DOI: https://doi.org/10.1186/s40537-024-00905-w
Publication Date: 2024-03-26
Author(s): Huanjing Wang et al.
Primary Topic: Feature Selection for Credit Card Fraud Detection

Overview

This study investigates feature selection for high-dimensional credit card fraud data, with the goal of improving fraud detection model performance. It compares feature selection based on SHAP (SHapley Additive exPlanations) values against selection based on each model's built-in feature importance, across five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The evaluation metric is the Area Under the Precision-Recall Curve (AUPRC), and the experiments are run on the Kaggle Credit Card Fraud Detection Dataset. The findings indicate that selection based on built-in importance values generally outperforms selection based on SHAP values across classifiers and feature subset sizes.

Two classifiers deserve a closer look. XGBoost shows mixed performance: it outperforms SHAP-XGBoost at a feature subset size of 3 but underperforms it at a size of 10. CatBoost outperforms its SHAP counterpart at every feature subset size below 15. The study suggests that for larger datasets, using the model's built-in feature importance is more efficient and practical than computing SHAP values, which adds computational cost and complexity. The authors recommend exploring these feature selection methods in other application domains in future research.

Introduction

The introduction of this research paper emphasizes the critical need for effective credit card fraud detection within the finance industry, highlighting the pivotal role of transaction datasets. It identifies data quality as a significant challenge in both finance and machine learning, which directly impacts modeling and analysis outcomes. To address this, the study focuses on feature selection as a vital data cleansing step, aiming to eliminate irrelevant or redundant features to enhance model training efficiency and classifier performance.

The paper compares two feature selection methods: SHapley Additive exPlanations (SHAP)-value-based selection and traditional importance-based selection. SHAP uses game theory to determine feature importance through a two-step process: a model is trained first, and the features are ranked afterwards from the resulting SHAP values. Importance-based selection instead obtains feature importance as a by-product of model training. The study employs five classifiers, Extreme Gradient Boosting (XGBoost), Decision Tree (DT), CatBoost, Extremely Randomized Trees (ET), and Random Forest (RF), to evaluate these feature selection methods on the Credit Card Fraud Detection Dataset from Kaggle, which contains 284,807 transactions with only 492 (about 0.17%) labeled as fraudulent. The classifiers are assessed using the Area Under the Precision-Recall Curve (AUPRC) metric, and statistical significance is tested at a level of α = 0.01. The authors present this work as a novel empirical investigation into the comparative effectiveness of SHAP-value-based and importance-based feature selection in fraud detection and potentially other machine learning applications.
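To make the metric and the class imbalance concrete, the following is a minimal sketch of average precision, one common step-wise estimator of AUPRC. It is not taken from the paper: the function name `average_precision` and the toy labels and scores are illustrative assumptions, and the paper does not state which AUPRC estimator its tooling uses. Only the two dataset counts come from the text above.

```python
def average_precision(labels, scores):
    """Average precision: a step-wise estimate of the area under the PR curve.

    labels: 0/1 ground-truth values (1 = fraud).
    scores: model scores, higher = more likely fraud.
    Tied scores get no special treatment in this sketch.
    """
    # Rank transactions from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(label for _, label in ranked)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")

    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Accumulate precision at each rank where a positive is retrieved.
            precision_sum += true_positives / rank
    return precision_sum / total_positives


# Class counts reported for the Kaggle dataset in the text above.
N_TRANSACTIONS = 284_807
N_FRAUD = 492
prevalence = N_FRAUD / N_TRANSACTIONS  # roughly 0.0017

# Toy example with made-up scores (not from the paper).
toy_labels = [1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
```

Because a ranking with no skill has an expected AUPRC close to the positive-class prevalence, the useful reference point on this dataset is roughly 0.002 rather than 0.5, which is one reason a precision-recall metric suits such imbalanced data.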

Methods

The methodology section describes importance-based feature selection, which uses tree-based classifiers to identify and rank the relevant features in a dataset. The study employs five classifiers: Extreme Gradient Boosting (XGBoost), Decision Tree (DT), CatBoost, Extremely Randomized Trees (ET), and Random Forest (RF). Each of these classifiers has a built-in mechanism that determines feature importance during model fitting, so less relevant features can be dropped to improve model efficiency and accuracy. Notably, XGBoost calculates feature importance using the "gain" method, while CatBoost bases it on how often a feature is used for splits and on the improvement in model performance that those splits bring.
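The idea behind gain-style importance can be sketched on a toy scale. The code below is a simplified illustration, not the computation performed by XGBoost, CatBoost, or any other library: it scores each feature by the largest Gini impurity decrease that a single threshold split can achieve, whereas tree libraries aggregate such per-split improvements over the splits of the fitted model. All function names, feature names, and data values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_gain(values, labels):
    """Largest impurity decrease achievable by one threshold on one feature."""
    n = len(labels)
    parent = gini(labels)
    best = 0.0
    # Try every distinct value except the largest as a "<=" threshold.
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best


def rank_by_importance(columns, labels):
    """Return (feature names sorted by descending gain, gain per feature)."""
    gains = {name: best_split_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains, key=gains.get, reverse=True), gains


def select_top_k(ranking, k):
    """Keep the k highest-ranked features."""
    return ranking[:k]


# Made-up data: feat_a separates the classes, feat_b partly, feat_c not at all.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "feat_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "feat_b": [1, 5, 2, 6, 3, 7, 4, 8],
    "feat_c": [1, 1, 1, 1, 1, 1, 1, 1],
}
ranking, gains = rank_by_importance(columns, labels)
top_two = select_top_k(ranking, 2)
```

The selection step itself is just a truncation of the ranking, so no computation beyond model fitting is needed once the importance scores exist.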

The paper also discusses SHAP (SHapley Additive exPlanations) as a model-agnostic technique for interpreting machine learning predictions, one that quantifies how much each individual feature contributes to a specific prediction. The experimental design compares SHAP-value-based feature selection with traditional importance-based methods. The researchers built classification models on the feature subsets selected by these methods, using a distributed computing platform for implementation. They ranked features with ten selection methods (the built-in importance of each of the five classifiers and its SHAP-based counterpart) and evaluated model performance using the Area Under the Precision-Recall Curve (AUPRC), producing 250 AUPRC scores across the feature subset sizes and classifier runs.
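A small sketch can show what a Shapley value is and how a feature ranking might be derived from it. It rests on several assumptions that are not from the paper: features absent from a coalition are replaced by a fixed baseline value (a simplification of what SHAP implementations do with background data), features are ranked by mean absolute Shapley value (a common convention; the paper's exact aggregation is not given here), and the toy model, rows, and function names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline point.

    Features outside a coalition take their baseline value. The cost grows
    as 2**n_features, so this is only usable for tiny examples.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def rank_by_mean_abs_shap(predict, rows, baseline):
    """Rank feature indices by mean absolute Shapley value over the rows."""
    n = len(baseline)
    totals = [0.0] * n
    for row in rows:
        for i, phi_i in enumerate(shapley_values(predict, row, baseline)):
            totals[i] += abs(phi_i)
    mean_abs = [t / len(rows) for t in totals]
    order = sorted(range(n), key=lambda i: mean_abs[i], reverse=True)
    return order, mean_abs


# Hypothetical toy model with an interaction term (not from the paper).
def toy_model(v):
    return 3.0 * v[0] + 1.0 * v[1] + 2.0 * v[0] * v[2]


rows = [[1.0, 1.0, 1.0], [2.0, 0.0, 1.0], [0.0, 3.0, 2.0]]
baseline = [0.0, 0.0, 0.0]
order, mean_abs = rank_by_mean_abs_shap(toy_model, rows, baseline)
```

The exponential loop over coalitions is why exact computation is impractical beyond a handful of features. Even with faster tree-specific algorithms, SHAP remains an extra pass over a trained model, which is the additional cost the study weighs against built-in importance.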

Results

In this section, the authors present the results for the ten feature selection methods, each combined with five classifiers, focusing on the 15 highest-ranked features identified through SHAP values or built-in importance scores. The rankings are summarized in Tables 2 to 6: feature V14 ranks among the top three for every method, and feature V4 appears in the top 15 for every selection. Classification performance, measured by the Area Under the Precision-Recall Curve (AUPRC), is detailed in Tables 7 to 11, with each average taken over ten rounds of five-fold cross-validation.
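Ten rounds of five-fold cross-validation give 50 train/test splits, and therefore 50 scores to average, per configuration. The sketch below generates such splits in plain Python. Stratifying by class is an assumption made here because the data is so imbalanced; the summary does not say whether the authors stratified, and the function name and toy labels are hypothetical.

```python
import random


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated stratified k-fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so every fold keeps the class ratio.
            for position, idx in enumerate(shuffled):
                folds[position % n_splits].append(idx)
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield repeat, fold, train_idx, sorted(test_idx)


# Made-up imbalanced labels: 10 positives among 100 rows.
toy_labels = [1] * 10 + [0] * 90
splits = list(repeated_stratified_kfold(toy_labels))
```

Each of the 50 splits would be used to fit a classifier on the training indices and score it on the held-out indices, and the reported figure is the mean of those scores.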

The authors ran z-tests to compare models built on the features selected by SHAP against models built on the features selected by built-in importance scores, under the null hypothesis that the mean AUPRC scores do not differ. The results, indicated in Tables 7 and 8, show which feature selection method yielded the higher mean AUPRC at a significance level of α = 0.01. Differences that are not significant at this level are reported as ties.
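The comparison rule can be sketched as a two-sided, two-sample z-test on the per-split AUPRC scores. The exact form of the test used in the paper is not given in this summary, so the unpooled standard error below is an assumption, as are the function name and the made-up score lists.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z_test(sample_a, sample_b, alpha=0.01):
    """Two-sided z-test for a difference in means.

    Returns (z, p_value, verdict), where verdict is "a", "b", or "tie".
    """
    mean_a, mean_b = mean(sample_a), mean(sample_b)
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    if standard_error == 0.0:
        raise ValueError("both samples have zero variance")
    z = (mean_a - mean_b) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    if p_value >= alpha:
        verdict = "tie"  # difference not significant at the chosen level
    else:
        verdict = "a" if mean_a > mean_b else "b"
    return z, p_value, verdict


# Made-up AUPRC scores for 50 splits per method (not results from the paper).
scores_a = [0.80 + 0.01 * (i % 5) for i in range(50)]
scores_b = [0.70 + 0.01 * (i % 5) for i in range(50)]
scores_c = [0.80 + 0.01 * ((i + 2) % 5) for i in range(50)]

z_ab, p_ab, verdict_ab = two_sample_z_test(scores_a, scores_b)
z_ac, p_ac, verdict_ac = two_sample_z_test(scores_a, scores_c)
```

With these toy inputs the first comparison favors method a and the second is a tie, mirroring the win, loss, and tie outcomes that the tables report.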

Discussion

The study is a comparative analysis of two feature selection techniques, SHapley Additive exPlanations (SHAP) values and built-in feature importance, in the context of credit card fraud detection. The authors' literature review found few studies that use SHAP for feature selection in this domain, and no prior work that directly compares the two methods. The authors used the Credit Card Fraud Detection Dataset and five classifiers, assessing performance by the Area Under the Precision-Recall Curve (AUPRC) across various feature subset sizes.

The findings indicate that, overall, feature selection based on built-in importance values outperformed SHAP-based selection across the classifiers and feature subset sizes tested, with some exceptions. XGBoost outperformed SHAP-XGBoost at a feature subset size of 3, but SHAP-XGBoost outperformed XGBoost at a size of 10. Similarly, CatBoost outperformed SHAP-CatBoost at smaller feature subset sizes, while the two performed comparably at size 15. Given the computational cost of SHAP, the authors suggest that its use for feature selection may not be justified, particularly on large datasets, where built-in methods are a more efficient alternative. They recommend that future research examine these feature selection techniques in other application domains to validate the findings further.