A model for underdispersed count data

Journal: Statistical Papers, Volume: 67, Issue: 2
DOI: https://doi.org/10.1007/s00362-026-01810-5
Publication Date: 2026-02-18
Author(s): Rose Baker
Primary Topic: Statistical Methods and Bayesian Inference

Overview

The paper presents a new approach to modeling underdispersed count data, which are common in economic studies, using a mixture of a Poisson distribution and a new class of distributions termed 'Poisson-m'. A Poisson-m distribution is formed by summing and relabeling $m$ adjacent Poisson probabilities. The resulting 2- or 3-parameter model is flexible enough to handle underdispersion, moderate overdispersion, and bimodality. The distributions are computationally cheap: probabilities and moments are straightforward to calculate, random numbers are easy to generate, moments can be computed exactly, and a simple asymptotic approximation for the mean and variance remains usable down to low counts.
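The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's own code: it assumes the underlying Poisson variable $Y$ has mean $m\eta$ and that "summing and relabeling" means $Q_k = \sum_{j=0}^{m-1} P_{mk+j}$, so that a Poisson-m variate is $\lfloor Y/m \rfloor$. The function names are hypothetical.

```python
import math
import random

def poisson_pmf(j, lam):
    """Poisson probability P(Y = j) for mean lam > 0, computed in log space."""
    return math.exp(j * math.log(lam) - lam - math.lgamma(j + 1))

def poisson_m_pmf(k, eta, m):
    """Assumed Poisson-m pmf: the sum of the m adjacent Poisson(m*eta)
    probabilities with indices m*k, ..., m*k + m - 1, relabeled as k."""
    lam = m * eta
    return sum(poisson_pmf(m * k + j, lam) for j in range(m))

def poisson_m_sample(eta, m, rng):
    """Draw one variate as floor(Y / m) with Y ~ Poisson(m*eta)."""
    lam = m * eta
    u = rng.random()
    y, p = 0, math.exp(-lam)
    cdf = p
    # Inversion sampling: walk the Poisson cdf until it passes u.
    # The p > 0 guard stops the loop if rounding keeps cdf just below u.
    while u > cdf and p > 0.0:
        y += 1
        p *= lam / y
        cdf += p
    return y // m

# Illustrative values only (eta and m are not taken from the paper).
rng = random.Random(0)
print([poisson_m_sample(5.0, 4, rng) for _ in range(10)])
print([round(poisson_m_pmf(k, 5.0, 4), 4) for k in range(10)])
```

With $m = 1$ the sketch reduces to the ordinary Poisson distribution, which is a quick sanity check on the assumed relabeling.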

The proposed Poisson-m mixture models generally outperform the established benchmarks, namely the COM-Poisson and weighted Poisson distributions, in maximum-likelihood fits to data. The study recommends that the Poisson-4 distributions, with two or three parameters, be the first choice in practical applications. Although the analysis rests on only eight datasets, it provides a foundation for future research, which may include further evaluation of these models and the development of related models inspired by the proto-event concept. The exploration of mixture models remains a significant area for ongoing research.

Introduction

The introduction discusses the importance of discrete distributions in various fields, particularly economics, where they are used to model phenomena such as fertility, healthcare visits, job changes, and absenteeism. The Poisson distribution is the usual choice for count data, but it assumes equidispersion (variance equal to the mean). Real-world data often violate this assumption, showing either overdispersion (variance exceeding the mean) or underdispersion (variance less than the mean). The author proposes a new class of models for non-equidispersed data that combines the Poisson probability mass function (pmf) with an underdispersed pmf, thereby extending the applicability of Poisson-based models.

The introduction also reviews existing models for overdispersion, such as the negative binomial model, and highlights the challenges of modeling underdispersion. Several approaches, including Weibull and gamma count models, are mentioned, with a focus on their probabilistic foundations. The author introduces a model that uses an Erlang distribution for the intervals between events and contrasts it with the Conway-Maxwell-Poisson (COM-Poisson) distribution, a widely used but computationally intensive model. The new model aims to provide a more flexible and theoretically grounded alternative for fitting non-equidispersed count data, setting the stage for the later sections, which detail its implementation and its performance relative to existing benchmarks.
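To illustrate why Erlang-distributed intervals produce underdispersed counts, the following Python sketch simulates the idea. It is an illustration under assumptions, not the paper's procedure: proto-events are assumed to arrive as a Poisson process of rate $m\eta$, every $m$-th proto-event is counted as an event (so inter-event times are Erlang with shape $m$), and the process is assumed to start immediately after an event. The function name and parameter values are hypothetical.

```python
import random

def count_events(eta, m, rng, horizon=1.0):
    """Count events in [0, horizon] when each event requires m proto-events
    from a Poisson process of rate m*eta (Erlang inter-event times)."""
    rate = m * eta
    t, proto = 0.0, 0
    while True:
        t += rng.expovariate(rate)  # waiting time to the next proto-event
        if t > horizon:
            break
        proto += 1
    return proto // m  # every m-th proto-event completes one event

# Illustrative values only: eta = 5 events per unit time, m = 4.
rng = random.Random(1)
counts = [count_events(5.0, 4, rng) for _ in range(5000)]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
print("sample mean:", mean, "sample variance:", var)
```

Under these assumptions the sample variance comes out well below the sample mean, which is the underdispersion the model is designed to capture.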

Results

The results section presents asymptotic findings for the Poisson-m distribution. For large $\eta$, the mean is $\mu = \eta - \frac{m-1}{2m}$. The analysis shows that the probabilities can be approximated by a common probability density function (pdf) $f(x)$, which leads to a total shift in the mean of $-\frac{m-1}{2m}$. The second moment, $E(X^2)$, is adjusted similarly, giving the variance $\text{var}(X) = \frac{\eta}{m} + \frac{m^2 - 1}{12m^2}$. These asymptotic results remain accurate even for small means $\eta$. The skewness approaches $\frac{1}{\sqrt{m\eta}}$ and the excess kurtosis approaches $\frac{1}{m\eta}$, indicating that these distributions converge to normality more rapidly than the standard Poisson distribution.
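The asymptotic mean and variance can be compared with exact moments numerically. The Python sketch below does this under an assumption that is not spelled out in this summary: that a Poisson-m variate is $\lfloor Y/m \rfloor$ with $Y \sim \text{Poisson}(m\eta)$, a parametrization chosen because it reproduces the leading variance term $\eta/m$. Function names are hypothetical.

```python
import math

def poisson_m_moments(eta, m, tol=1e-15):
    """Exact mean and variance of floor(Y / m), Y ~ Poisson(m*eta),
    by direct summation over the Poisson pmf (assumed construction).
    Uses exp(-m*eta), so it is meant for moderate m*eta only."""
    lam = m * eta
    p = math.exp(-lam)  # P(Y = 0)
    y, total, s1, s2 = 0, 0.0, 0.0, 0.0
    # Sum past the mode, then stop once the terms become negligible.
    while y <= lam or p > tol:
        k = y // m
        total += p
        s1 += k * p
        s2 += k * k * p
        y += 1
        p *= lam / y
    mean = s1 / total
    return mean, s2 / total - mean * mean

def asymptotic_moments(eta, m):
    """Asymptotic mean and variance as stated in the text."""
    mean = eta - (m - 1) / (2 * m)
    var = eta / m + (m * m - 1) / (12 * m * m)
    return mean, var

# Illustrative parameter values, not taken from the paper.
for eta, m in [(1.0, 2), (5.0, 4)]:
    print(eta, m, poisson_m_moments(eta, m), asymptotic_moments(eta, m))
```

Under this assumption the two sets of moments agree closely even at small $\eta$, and the variance is smaller than the mean, as expected for an underdispersed distribution.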

The summary of results emphasizes the need for a goodness-of-fit measure beyond the log-likelihood and proposes the deviance as a more reliable alternative. The chi-squared goodness-of-fit measure is noted to be erratic because it requires binning of predicted events. In fits to seven datasets without covariates, the Poisson-4 model is reported to outperform consistently both the Poisson-2 model and the other models, achieving the best fit in all cases. Three-parameter models generally fit better still, particularly in the presence of bimodality, where the Poisson-2 model slightly outperforms the Poisson-4 model. Overall, the Poisson-4 model attains a higher likelihood than Baker's 2025 RZP02 model in five of the seven datasets analyzed.
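As a rough illustration of the deviance as a goodness-of-fit measure, the Python sketch below computes the standard likelihood-ratio deviance of a fitted pmf against observed frequencies, $D = 2\sum_k O_k \ln(O_k / E_k)$ with $E_k = n p_k$. The summary does not give the paper's exact definition, so this standard form is an assumption, and the counts used are made up for illustration.

```python
import math

def deviance(observed, probs):
    """Deviance of a fitted pmf against observed frequencies:
    2 * sum O_k * log(O_k / E_k), where E_k = n * p_k.
    Cells with O_k = 0 contribute nothing to the sum."""
    n = sum(observed)
    dev = 0.0
    for o, p in zip(observed, probs):
        if o > 0:
            dev += o * math.log(o / (n * p))
    return 2.0 * dev

# Hypothetical frequencies for counts 0, 1, 2 and two candidate pmfs.
observed = [10, 20, 30]
print(deviance(observed, [1 / 6, 1 / 3, 1 / 2]))  # pmf equal to observed proportions
print(deviance(observed, [1 / 3, 1 / 3, 1 / 3]))  # a poorer fit gives a larger value
```

Unlike the chi-squared statistic, this form needs no pooling of cells with small expected counts, since empty observed cells simply drop out of the sum.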

Discussion

The discussion covers the properties and applications of the new mixture model for underdispersed count data, which combines a Poisson distribution with a class of underdispersed distributions. The mixture is defined by the probability mass function (pmf) $R_k = (1 - \phi) P_k + \phi Q_k$, where $P_k$ is the Poisson pmf and $Q_k$ is the pmf of the underdispersed distribution. Expressions for the mean $\mu$ and variance $\sigma^2$ are derived in terms of the means and variances of the two component distributions. Depending on the value of $\phi$, the mixture can take a variety of shapes, including Poisson, peaked, overdispersed, bimodal, zero-inflated Poisson (ZIP), and zero-truncated Poisson (ZTP) distributions.
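The mixture pmf and its first two moments can be sketched in Python as follows. The moment formulas are the standard ones for a two-component mixture: mean $(1-\phi)\mu_P + \phi\mu_Q$ and variance $(1-\phi)\sigma_P^2 + \phi\sigma_Q^2 + \phi(1-\phi)(\mu_P - \mu_Q)^2$. Several details are assumptions rather than statements from the paper: the Poisson component is given its own mean `lam`, the underdispersed component is taken to be $\lfloor Y/m \rfloor$ with $Y \sim \text{Poisson}(m\eta)$, and all function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability P(Y = k) for mean lam > 0, in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_m_pmf(k, eta, m):
    """Assumed underdispersed component Q_k: sum of m adjacent
    Poisson(m*eta) probabilities, relabeled as k."""
    return sum(poisson_pmf(m * k + j, m * eta) for j in range(m))

def mixture_pmf(k, lam, eta, m, phi):
    """R_k = (1 - phi) * P_k + phi * Q_k."""
    return (1 - phi) * poisson_pmf(k, lam) + phi * poisson_m_pmf(k, eta, m)

def mixture_mean_var(mu_p, var_p, mu_q, var_q, phi):
    """Mean and variance of the mixture from the component moments."""
    mean = (1 - phi) * mu_p + phi * mu_q
    var = (1 - phi) * var_p + phi * var_q + phi * (1 - phi) * (mu_p - mu_q) ** 2
    return mean, var

# Illustrative parameters only: well-separated components can give bimodality.
lam, eta, m, phi = 2.0, 6.0, 4, 0.3
print([round(mixture_pmf(k, lam, eta, m, phi), 4) for k in range(12)])
```

The last term of the variance shows why the mixture can be overdispersed even though one component is underdispersed: it grows with the squared distance between the component means.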

The author also outlines a robust algorithm for computing probabilities and moments, emphasizing the computational efficiency of the approach. The model's applicability is demonstrated by fitting it to eight diverse datasets, including healthcare and economic data, where the mixture models outperform traditional benchmarks such as the COM-Poisson and weighted Poisson distributions. The findings suggest that the Poisson-4 distribution, with two or three parameters, is the preferred choice for modeling underdispersed count data. The study acknowledges that the number of datasets analyzed is a limitation and proposes future research to evaluate these models further and to extend them based on the proto-event concept.