A problem-agnostic approach to feature selection and analysis using SHAP

Journal: Journal Of Big Data, Volume: 12, Issue: 1
DOI: https://doi.org/10.1186/s40537-024-01041-1
Publication Date: 2025-01-24
Author(s): John Hancock et al.
Primary Topic: Machine Learning and Data Classification

Overview

In this research, the authors present a novel feature selection technique based on SHapley Additive exPlanations (SHAP) that applies across different label availability scenarios. Using the Kaggle Credit Card Fraud Detection dataset, they simulate three scenarios: no labeled data, labels for one class only, and labels for both classes. The study demonstrates that SHAP can rank feature importance regardless of label availability, which allows the top $k$ features to be selected. Specifically, they use Isolation Forest for unsupervised learning, a Gaussian Mixture Model (GMM) for one-class classification, and XGBoost for binary classification, and they analyze feature selection in each of these scenarios.
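The ranking step described above can be illustrated with a minimal sketch. It assumes a matrix of SHAP values (one row per explained transaction, one column per feature) has already been produced by some explainer; the toy numbers, the feature names, and the helper names `mean_abs_shap` and `top_k_features` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def mean_abs_shap(shap_values: Sequence[Sequence[float]]) -> List[float]:
    """Average the absolute SHAP value of each feature over all explained rows."""
    n_rows = len(shap_values)
    n_features = len(shap_values[0])
    return [
        sum(abs(row[j]) for row in shap_values) / n_rows
        for j in range(n_features)
    ]


def top_k_features(
    shap_values: Sequence[Sequence[float]],
    feature_names: Sequence[str],
    k: int,
) -> List[str]:
    """Return the k feature names with the largest mean |SHAP| value."""
    importance = mean_abs_shap(shap_values)
    ranked = sorted(
        zip(feature_names, importance), key=lambda pair: pair[1], reverse=True
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical SHAP matrix: 4 explained rows x 3 features (illustrative values).
toy_shap = [
    [0.10, -0.50, 0.02],
    [-0.20, 0.40, 0.01],
    [0.15, -0.45, -0.03],
    [-0.05, 0.60, 0.02],
]
toy_names = ["V1", "V2", "V3"]  # illustrative feature names

selected = top_k_features(toy_shap, toy_names, k=2)
print(selected)
```

Because the ranking only needs the SHAP matrix, the same helper could in principle be applied to the output of any of the three models, whichever label scenario it was trained under.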

The findings indicate that feature sets can be reduced substantially using SHAP without compromising model performance. The results show a consistent relationship between the feature importance assigned by SHAP and model performance: performance improves as features are added in order of their mean absolute SHAP values. The authors describe this research as the first to explore a feature analysis technique applicable in all three label availability contexts, and they take the results as confirmation that SHAP values meaningfully identify impactful features. They suggest that the approach can be extended to new application domains, which opens a path for future research on feature analysis methodologies.

Introduction

The introduction highlights the significant financial impact of credit card fraud in the United States, which motivates the need for effective detection techniques. The authors emphasize the challenges posed by the unlabeled nature of credit card transaction data, which calls for methods suited to different label availability scenarios: “noclass” (no labels), “one-class” (one labeled class), and “binaryclass” (two labeled classes). The study introduces a feature analysis technique that pairs SHapley Additive exPlanations (SHAP) with a machine learning algorithm chosen for each scenario, and it uses the well-known Kaggle Credit Card Fraud Detection dataset for demonstration.

The authors note that although the dataset’s attributes are derived from Principal Component Analysis (PCA) and come without detailed information about their origins, SHAP can still identify important features in all three scenarios. They assert that this research is the first to apply SHAP alongside these different machine learning approaches in the credit card fraud detection domain. The introduction concludes by clarifying that comparing model performance across scenarios is not appropriate, because the choice of classifier should follow from the availability of labeled data. The subsequent sections of the paper cover related work, the data, the algorithms used, and the experimental methodology, and then present the study’s conclusions.

Methods

The methodology applies machine learning techniques to data with varying levels of label information; the experiments run on a distributed computing system with advanced hardware configurations. The study employs three classifiers, Isolation Forest, Gaussian Mixture Model (GMM), and XGBoost, each matched to a different label availability scenario. SHAP (SHapley Additive exPlanations) is used to assess feature importance across these models, which gives insight into attribute contributions without requiring extensive prior knowledge of the dataset’s characteristics. The Credit Card Fraud Detection dataset, whose attributes are derived from Principal Component Analysis (PCA), serves as the primary data source, and the analysis aims to identify the key features that influence model performance.
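For orientation, the quantity that SHAP attributes to each feature can be written down using the standard Shapley value definition. The notation below ($F$ for the full feature set, $f_S$ for the model restricted to the feature subset $S$, $x^{(i)}$ for the $i$-th explained instance, $m$ for the number of explained instances) is assumed here for illustration and is not taken from the paper.

```latex
% Shapley value of feature j for one instance x
\phi_j(x) = \sum_{S \subseteq F \setminus \{j\}}
    \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
    \Bigl[ f_{S \cup \{j\}}\bigl(x_{S \cup \{j\}}\bigr) - f_S\bigl(x_S\bigr) \Bigr]

% Global importance of feature j: mean absolute SHAP value over m instances
I_j = \frac{1}{m} \sum_{i=1}^{m} \bigl| \phi_j\bigl(x^{(i)}\bigr) \bigr|
```

The first expression averages the marginal contribution of feature $j$ over all subsets of the remaining features; the second collapses the per-instance values into a single score per feature, which is the kind of score a mean-absolute-SHAP ranking sorts on.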

The experimental framework trains each classifier on the dataset’s attributes and applies the SHAP kernel explainer to derive SHAP values, which indicate the impact of each feature on the model’s predictions. Features are ranked by their mean absolute SHAP values, and models are evaluated through ten iterations of five-fold cross-validation, which yields AUC (Area Under the Receiver Operating Characteristic Curve) and AUPRC (Area Under the Precision-Recall Curve) scores for statistical analysis. This process is intended to validate mean absolute SHAP values as indicators of feature importance. In addition, the Kuncheva index is used to quantify the similarity between the sets of important features identified by different classifiers, which supports a deeper feature analysis and a better understanding of the underlying patterns in the data.
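The Kuncheva index mentioned above can be sketched as follows, using its usual definition for two equal-sized feature subsets: with subset size $k$, overlap $r$, and $n$ total features, the index is $(rn - k^2) / (k(n - k))$. The feature names and the value `n_features=30` are illustrative assumptions, not results from the paper.

```python
from typing import Iterable


def kuncheva_index(set_a: Iterable[str], set_b: Iterable[str], n_features: int) -> float:
    """Kuncheva consistency index for two feature subsets of equal size.

    Returns 1.0 for identical subsets and values near 0 when the overlap
    is what random selection would be expected to produce.
    """
    a, b = set(set_a), set(set_b)
    k = len(a)
    if len(b) != k:
        raise ValueError("both subsets must have the same size")
    if not 0 < k < n_features:
        raise ValueError("subset size must satisfy 0 < k < n_features")
    r = len(a & b)  # number of features shared by the two subsets
    return (r * n_features - k * k) / (k * (n_features - k))


# Hypothetical top-5 feature sets from two different classifiers.
top5_model_a = ["V14", "V4", "V12", "V10", "V17"]
top5_model_b = ["V14", "V4", "V12", "V3", "V7"]

similarity = kuncheva_index(top5_model_a, top5_model_b, n_features=30)
print(round(similarity, 2))  # 0.52
```

Unlike a raw overlap count, the index corrects for the overlap expected by chance, so two subsets with no shared features score below zero rather than exactly zero.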

Results

In the Results section, the authors present two kinds of findings: classification performance results and feature analysis. Classification performance is evaluated in the three label availability scenarios (no-class, one-class, and binary-class) using threshold-agnostic metrics, specifically the Area Under the Receiver Operating Characteristic Curve (AUC) and the Area Under the Precision-Recall Curve (AUPRC). The authors emphasize the importance of AUPRC for highly imbalanced data, because it gives a more nuanced picture of how the experimental factors influence classification performance.
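The difference between the two metrics on imbalanced data can be seen in a small sketch. It computes AUC from pairwise comparisons of positive and negative scores and estimates AUPRC with average precision, which is one common estimator and is an assumption here; the summary does not say how the paper computes AUPRC. The labels and scores are made up for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of the precision measured at each positive's rank.

    Simplification: tied scores are ranked in input order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits


# Toy imbalanced data: 2 positive cases among 20 examples (illustrative values).
labels = [1, 0, 0, 1] + [0] * 16
scores = [0.9, 0.8, 0.7, 0.6] + [0.1] * 16

auc = roc_auc(labels, scores)
auprc = average_precision(labels, scores)
print(round(auc, 3), round(auprc, 3))  # 0.944 0.75
```

In this toy case two negatives outrank one of the positives. AUC barely moves because the many low-scoring negatives dominate the pair count, while average precision drops noticeably, which is the behavior that makes AUPRC informative when positives are rare.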

The feature analysis complements the classification results by examining the features selected through the combination of machine learning models and SHAP (SHapley Additive exPlanations) feature importance ranking. Together, the two analyses show how well the models perform and which specific features matter in the classification tasks, covering both model performance and the underlying data characteristics.

Discussion

In the discussion section, the authors position their research against the existing literature on feature analysis for credit card fraud detection, particularly under varying label availability. They note a significant gap: no prior studies have combined feature selection techniques with unsupervised learning in this domain. The authors emphasize the limitations of threshold-dependent metrics such as accuracy, which give a less complete view of model performance than threshold-independent metrics like the Area Under the Precision-Recall Curve (AUPRC) and the Area Under the Receiver Operating Characteristic Curve (AUC). Their methodology uses SHAP (SHapley Additive exPlanations) for feature selection, which is applicable even in the absence of labeled data, in contrast to other studies that rely on labeled datasets for techniques like Random Undersampling and Principal Component Analysis (PCA).

The authors further differentiate their work from related studies by demonstrating that SHAP applies in three distinct label availability scenarios: no labels, one labeled class, and fully labeled data. They argue that their approach both validates the feature selection technique and provides a framework for assessing feature importance in unsupervised settings. Their findings support this claim, indicating that SHAP yields meaningful insights into feature importance and thereby improves model explainability and performance in credit card fraud detection. Overall, the authors assert that their study represents a significant advance in the field, because it addresses previously unexplored research questions and offers a robust methodology for feature analysis under varying label availability.