From point to probabilistic gradient boosting for claim frequency and severity prediction

Journal: European Actuarial Journal, Volume: 15, Issue: 3
DOI: https://doi.org/10.1007/s13385-025-00428-5
PMID: https://pubmed.ncbi.nlm.nih.gov/41179997
Publication Date: 2025-08-04
Author(s): Dominik Chevalier et al.
Primary Topic: Probability and Risk Models

Overview

The paper reviews advances in gradient boosting algorithms for decision trees and highlights their growing use in actuarial science, which is driven by their superior predictive performance over traditional generalized linear models. The authors introduce a unified notation to contrast the existing algorithms, including GBM, XGBoost, DART, LightGBM, CatBoost, EGBM, PGBM, XGBoostLSS, cyclic GBM (cyc-GBM), and NGBoost.

In a comprehensive numerical study, the authors evaluate these algorithms on five publicly available claim frequency and severity datasets that vary in size and contain numerous high-cardinality categorical variables. The study assesses computational efficiency, predictive performance, and model adequacy. LightGBM and XGBoostLSS are the most computationally efficient, while CatBoost improves predictive performance when high-cardinality categorical variables are present. Notably, the interpretable EGBM is competitive in predictive performance with the more opaque algorithms. The findings indicate that model adequacy and predictive accuracy can be achieved together, without trading one for the other.

Introduction

The introduction traces the evolution of predictive modeling in actuarial science from traditional generalized linear models (GLMs) to machine learning techniques, particularly gradient boosting machines (GBMs). GBMs combine weak learners such as decision trees and improve predictions iteratively through gradient descent in function space, and they have shown superior performance in various actuarial applications, including auto and health insurance. However, while GBMs improve predictive accuracy, they often lack interpretability and give no insight into the distribution of the response variable, which is essential for effective risk management.

To address these limitations, the authors emphasize the importance of probabilistic predictions, as advocated by Embrechts and Wüthrich, and present several probabilistic GBM algorithms, such as XGBoostLSS, cyc-GBM, NGBoost, and PGBM. These algorithms predict distributional parameters rather than a single point estimate, thereby bridging the gap between the data modeling and algorithmic modeling cultures. The paper reviews both point and probabilistic gradient boosted decision tree (GBDT) algorithms and compares their performance across several insurance datasets. The findings indicate that LightGBM excels in computational efficiency, while XGBoostLSS stands out among the probabilistic models for its adequate fit and efficiency. The paper aims to clarify the similarities and differences among these algorithms and to provide a structured analysis of their computational efficiency, predictive accuracy, and model adequacy.

Methods

In this section, the authors outline how they evaluate the algorithms on actuarial data, benchmarking them against a standard generalized linear model (GLM) and a generalized additive model for location, scale and shape (GAMLSS) that uses cubic splines for the main effects. The distributional assumptions for claim frequency and severity are stated in Section 4.1, with particular attention to the treatment of varying exposure-to-risk.
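The summary does not say how the paper handles varying exposure, so the following is only a sketch of one common actuarial convention, assumed here for illustration: claim counts are modeled as Poisson with mean equal to exposure times a claim rate. The function names and the toy data are hypothetical.

```python
import math


def poisson_nll(claims, exposure, rate):
    """Negative Poisson log-likelihood, dropping terms that do not depend on rate.

    Assumes N_i ~ Poisson(exposure_i * rate_i), i.e. a log-exposure offset.
    """
    total = 0.0
    for n, e, lam in zip(claims, exposure, rate):
        mu = e * lam  # expected claim count scales with time at risk
        total += mu - n * math.log(mu)
    return total


def constant_rate_mle(claims, exposure):
    """Intercept-only maximum likelihood estimate of the claim rate."""
    return sum(claims) / sum(exposure)


# Toy portfolio (made up): claim counts and exposure in policy-years.
claims = [0, 1, 0, 2, 0]
exposure = [1.0, 0.5, 0.25, 1.0, 0.75]

rate_hat = constant_rate_mle(claims, exposure)
print(rate_hat)
```

Under this assumption the intercept-only rate estimate is total claims divided by total exposure, so a policy observed for half a year carries half the weight of a full-year policy in the denominator.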

The evaluation framework is detailed in Section 4.2, while Appendix A describes the training and tuning strategies. Hyperparameters are tuned using an extrinsic metric computed on a separate validation set to improve generalization. The authors also discuss mitigation strategies, such as shrinkage, cross-validation, and regularization, that address the overfitting commonly associated with gradient boosted decision tree (GBDT) models.
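As a rough illustration of two of these ideas, the sketch below boosts single-split regression stumps with a shrinkage factor and picks the number of boosting rounds by the loss on a held-out validation set. It is not the paper's tuning procedure (that is described in its Appendix A); the squared error loss, the stump learner, the simulated data, and all names and default values are assumptions made for this example.

```python
import random


def fit_stump(x, residuals):
    """Best single-split regression stump on one feature, by squared error."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        if x[order[k - 1]] == x[order[k]]:
            continue  # cannot split between identical feature values
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, (x[order[k - 1]] + x[order[k]]) / 2, lm, rm)
    _, threshold, lm, rm = best
    return lambda v: lm if v <= threshold else rm


def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


def boost_with_validation(x_tr, y_tr, x_va, y_va,
                          shrinkage=0.1, max_rounds=200, patience=10):
    """Boost stumps on the training set; keep the round count with the best validation loss."""
    base = sum(y_tr) / len(y_tr)
    pred_tr = [base] * len(y_tr)
    pred_va = [base] * len(y_va)
    stumps = []
    best_loss, best_rounds = mse(y_va, pred_va), 0
    for m in range(1, max_rounds + 1):
        # Negative gradient of 0.5 * (y - f)^2 with respect to the prediction f.
        residuals = [a - b for a, b in zip(y_tr, pred_tr)]
        stump = fit_stump(x_tr, residuals)
        stumps.append(stump)
        # Shrinkage: add only a fraction of each new learner.
        pred_tr = [p + shrinkage * stump(v) for p, v in zip(pred_tr, x_tr)]
        pred_va = [p + shrinkage * stump(v) for p, v in zip(pred_va, x_va)]
        loss = mse(y_va, pred_va)
        if loss < best_loss:
            best_loss, best_rounds = loss, m
        elif m - best_rounds >= patience:
            break  # validation loss stopped improving
    return base, stumps[:best_rounds], best_rounds, best_loss


def make_data(n):
    """Simulated data: a step function in x plus Gaussian noise."""
    xs = [random.random() for _ in range(n)]
    ys = [(2.0 if v > 0.5 else 0.0) + random.gauss(0.0, 0.3) for v in xs]
    return xs, ys


random.seed(0)
x_tr, y_tr = make_data(80)
x_va, y_va = make_data(40)
baseline = mse(y_va, [sum(y_tr) / len(y_tr)] * len(y_va))
base, stumps, best_rounds, best_loss = boost_with_validation(x_tr, y_tr, x_va, y_va)
print(best_rounds, best_loss, baseline)
```

A smaller shrinkage value typically needs more rounds to reach the same fit, which is why the two are usually tuned together.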

Discussion

In this section, the authors discuss the transition from point predictions to probabilistic predictions in gradient boosted decision trees (GBDT). They define the response variable \( Y \) and the covariate vector \( x \), and describe how GBDT algorithms minimize the total of a specified loss function over a dataset \( D \). The authors attribute the versatility of GBDT to the free choice of loss function: squared error loss is common for regression, whereas the gamma and Poisson deviances are better suited to modeling claim severity and claim frequency, respectively. The section emphasizes that the choice of loss function corresponds to an implicit assumption about the conditional distribution of \( Y \) given \( x \), with the learned prediction function serving as an estimator of the conditional expectation \( E[Y | x] \).
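To make the two deviance losses concrete, here is a minimal sketch using their standard textbook definitions. The function names and the toy data are hypothetical and not taken from the paper. The example also checks numerically that, for a constant prediction, both deviances are smallest at the sample mean, which is consistent with reading the fitted function as an estimate of \( E[Y | x] \).

```python
import math


def poisson_deviance(y, mu):
    """Mean Poisson deviance; y are non-negative counts, mu are positive predicted means."""
    total = 0.0
    for yi, mi in zip(y, mu):
        # y * log(y / mu) is taken as 0 when y == 0 (its limiting value).
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mi))
    return total / len(y)


def gamma_deviance(y, mu):
    """Mean gamma deviance; y and mu must be strictly positive."""
    return sum(
        2.0 * ((yi - mi) / mi - math.log(yi / mi)) for yi, mi in zip(y, mu)
    ) / len(y)


# Toy data (made up): claim counts and claim amounts.
counts = [0, 1, 0, 2, 0, 3]
severities = [100.0, 250.0, 400.0, 1250.0]

count_mean = sum(counts) / len(counts)
severity_mean = sum(severities) / len(severities)

# Compare the deviance at the sample mean with nearby constant predictions.
for scale in (0.8, 1.0, 1.2):
    print(
        scale,
        poisson_deviance(counts, [scale * count_mean] * len(counts)),
        gamma_deviance(severities, [scale * severity_mean] * len(severities)),
    )
```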

The authors also survey advances in GBDT algorithms, covering both enhancements for point prediction and probabilistic GBDT methods such as XGBoostLSS, cyc-GBM, and NGBoost. These methods predict all parameters of an assumed distribution rather than only its mean, which enables a more complete probabilistic model. The discussion covers the computational enhancements introduced by XGBoost, which improve scalability and efficiency, and the interpretability challenges posed by complex GBDT models. To address these challenges, the authors consider explainable models such as the Explainable Gradient Boosting Machine (EGBM), which retains interpretability while drawing on the predictive power of GBDT. Overall, this section gives an overview of the evolution and capabilities of GBDT algorithms for both point and probabilistic prediction.
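The idea of predicting every parameter of an assumed distribution can be illustrated with a deliberately stripped-down sketch: plain gradient descent on the negative log-likelihood with respect to both parameters of a Normal distribution, using constant updates in place of trees. This is not the algorithm of XGBoostLSS, cyc-GBM, or NGBoost, each of which fits tree learners and uses its own update scheme. The Normal assumption, the function names, the learning rate, and the toy data are all choices made for this example.

```python
import math


def normal_nll_gradients(y, mu, log_sigma):
    """Gradients of the mean Normal negative log-likelihood w.r.t. mu and log(sigma)."""
    sigma2 = math.exp(2.0 * log_sigma)
    n = len(y)
    grad_mu = sum((mu - yi) / sigma2 for yi in y) / n
    grad_log_sigma = sum(1.0 - (yi - mu) ** 2 / sigma2 for yi in y) / n
    return grad_mu, grad_log_sigma


def fit_constant_distribution(y, learning_rate=0.1, rounds=2000):
    """Update both parameters in every round, each along its own negative gradient."""
    mu, log_sigma = 0.0, 0.0
    for _ in range(rounds):
        grad_mu, grad_log_sigma = normal_nll_gradients(y, mu, log_sigma)
        mu -= learning_rate * grad_mu
        # Working with log(sigma) keeps sigma positive without constraints.
        log_sigma -= learning_rate * grad_log_sigma
    return mu, math.exp(log_sigma)


# Toy responses (made up).
y = [1.0, 2.0, 3.0, 4.0, 5.0]
mu_hat, sigma_hat = fit_constant_distribution(y)
print(mu_hat, sigma_hat)
```

With constant updates the procedure converges to the maximum likelihood estimates, the sample mean and the (population) standard deviation. A probabilistic GBDT replaces each constant update with a tree fitted to the per-observation gradients, so that both parameters can vary with the covariates.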