Floating-Point Usage on GitHub: A Large-Scale Study of Statically Typed Languages

Journal: Proceedings of the ACM on Programming Languages, Volume: 10
DOI: https://doi.org/10.1145/3798203
Publication Date: 2026-04-10
Author(s): Andrea Gilot et al.
Primary Topic: Parallel Computing and Optimization Techniques

Overview

This paper reports a large-scale empirical study of floating-point arithmetic usage in public GitHub repositories, focusing on statically typed programming languages. The authors highlight the difficulty of reasoning about floating-point arithmetic and the need for analysis techniques that apply to real-world code. To address this, they apply advanced mining practices, including random sampling and filtering on intrinsic properties, and identify floating-point usage through keyword searches and code parsing.

The findings confirm that floating-point arithmetic is prevalent in real-world applications, in line with claims made in the existing literature. However, the study also reveals discrepancies between the benchmark code commonly used to evaluate automated reasoning techniques and the floating-point code found in practice. The authors provide a dataset of 10 million real-world floating-point functions, which can serve as a resource for developing and assessing future floating-point techniques so that they better reflect user expectations and practical applications.

Introduction

The introduction highlights the pervasive use of numerical software in fields such as embedded systems, scientific computing, and machine learning, which often rely on floating-point arithmetic for efficient computation. This reliance introduces rounding errors and special values such as infinity and Not-a-Number (NaN), which complicate reasoning about software correctness. The authors note a gap in the literature regarding the representativeness of existing floating-point benchmarks: evaluations typically use small, hand-picked benchmark sets rather than code that reflects real-world usage.
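The rounding errors and special values mentioned above are easy to reproduce. The following minimal sketch uses Python, whose `float` is an IEEE 754 double on common platforms; it is an illustration only and is not taken from the paper.

```python
import math

# Rounding error: 0.1 and 0.2 have no exact binary representation,
# so their sum is not the double closest to 0.3.
total = 0.1 + 0.2
print(total == 0.3)              # exact comparison fails
print(math.isclose(total, 0.3))  # comparison with a tolerance succeeds

# Special values: multiplication overflow yields infinity,
# and subtracting infinity from infinity yields NaN.
overflow = 1e308 * 10
not_a_number = overflow - overflow
print(math.isinf(overflow))
print(math.isnan(not_a_number))

# NaN compares unequal to every value, including itself.
print(not_a_number == not_a_number)
```

Behaviour like the failed exact comparison and the self-unequal NaN is what makes correctness arguments about numerical code harder than for integer code.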

To address this gap, the paper presents two main contributions. First, it introduces a large-scale mining methodology for detecting and analyzing floating-point usage in statically typed programming languages, employing automated techniques such as random sampling and project filtering. This methodology allows for a comprehensive examination of floating-point code across a wide range of languages recognized by GitHub. Second, the authors conduct a quantitative analysis of 447,209 randomly sampled GitHub projects, revealing that over 62% contain floating-point code, thus confirming its widespread use in practice. The findings indicate that the characteristics of real-world floating-point code differ significantly from those of existing benchmark suites, suggesting that current benchmarks may not adequately represent actual floating-point programming practices.

Methods

The methodology investigates the prevalence and types of code that operates on floating-point numbers within statically typed “real-world” code. The approach is inspired by the framework proposed by Maj et al. and adheres to the ACM SIGSOFT Empirical Standards for repository mining. The authors introduce a general-purpose framework named Scyros, implemented in Rust, which supports large-scale code studies with an emphasis on reproducibility and reusability.

Scyros allows for flexible data collection, enabling interruptions and resumptions without compromising the integrity of the results. It incorporates features such as random shuffling of inputs, fixed user-defined random seeds, and consistent repository tracking through commit hashes. Users can customize their analyses by selecting local repositories, specifying file extensions or programming languages, and adjusting keyword sets based on the desired code features. The framework supports parallel analyses through multiple configuration files and provides a command-line interface for executing independent pipeline steps, which generate CSV files for subsequent processing. The code is thoroughly documented, tested, and made available as open-source software under the Apache 2.0 license, with data analysis conducted in Python.
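To make the combination of seeded shuffling and interruptible collection concrete, here is a minimal sketch in Python. It is not Scyros code and does not reflect its actual interface: the function names `shuffled` and `process_all`, the `limit` parameter (used only to simulate an interruption), and the CSV log layout are all hypothetical.

```python
import csv
import io
import random


def shuffled(items, seed):
    """Return a shuffled copy; the same seed always yields the same order."""
    rng = random.Random(seed)  # private generator, independent of global state
    result = sorted(items)     # normalise the input order before shuffling
    rng.shuffle(result)
    return result


def process_all(repo_ids, seed, log, analyse, limit=None):
    """Process repositories in seeded order, skipping those already logged.

    `log` is a text stream that supports reading and writing. It holds one
    CSV row per finished repository, which lets a later run resume.
    """
    log.seek(0)
    done = {row[0] for row in csv.reader(log) if row}
    log.seek(0, io.SEEK_END)
    writer = csv.writer(log)
    processed = 0
    for repo_id in shuffled(repo_ids, seed):
        if repo_id in done:
            continue  # finished in an earlier run
        if limit is not None and processed >= limit:
            break     # hypothetical stand-in for an interruption
        writer.writerow([repo_id, analyse(repo_id)])
        processed += 1


ids = ["r3", "r1", "r2", "r5", "r4"]

# One uninterrupted run.
full = io.StringIO()
process_all(ids, 42, full, len)

# A run stopped after two repositories, then resumed.
part = io.StringIO()
process_all(ids, 42, part, len, limit=2)
process_all(ids, 42, part, len)

print(full.getvalue() == part.getvalue())
```

Because the order depends only on the seed and the set of inputs, the interrupted and uninterrupted runs write the same rows in the same order, which is the property that keeps results intact across resumptions.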

Discussion

The authors discuss the creation and implications of a newly released dataset of 10 million real-world floating-point functions, intended to improve the benchmarking of floating-point reasoning tools. The dataset is particularly relevant because most existing tools operate on a per-function basis, so it provides a foundation for realistic benchmarks that reflect practical floating-point usage. The authors also emphasize the limitations of previous benchmark suites, such as FPBench, which consist primarily of hand-picked programs that may not accurately represent real-world challenges. They argue that relying on such curated benchmarks can obscure significant issues encountered in actual numerical programming.

The authors further detail their methodology for mining GitHub repositories to compile this dataset, highlighting the challenge of ensuring that the selected projects are non-trivial and representative of real-world software. They describe a multi-step selection process that includes filtering for statically typed languages, downloading relevant files, and identifying floating-point usage through keyword searches. This approach aims to mitigate the biases of previous studies, which often relied on smaller, hand-picked samples. The resulting dataset supports the evaluation of floating-point reasoning tools and also serves as a resource for future research in the domain, inviting contributions from the broader academic community.
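The keyword-search step can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the `FLOAT_KEYWORDS` table and the function names are hypothetical, and the keyword lists actually used by the authors are not given in this summary.

```python
import re

# Hypothetical keyword sets, chosen for illustration only.
FLOAT_KEYWORDS = {
    "c": {"float", "double"},
    "java": {"float", "double", "Float", "Double"},
    "rust": {"f32", "f64"},
    "go": {"float32", "float64"},
}


def keyword_pattern(language):
    """Build a regex that matches any keyword as a whole word."""
    words = sorted(FLOAT_KEYWORDS[language], key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b(?:" + alternatives + r")\b")


def uses_floating_point(source, language):
    """Return True if the source text contains a floating-point keyword."""
    return keyword_pattern(language).search(source) is not None


print(uses_floating_point("fn area(r: f64) -> f64 { r * r }", "rust"))
print(uses_floating_point("int floaty = 1;", "c"))
```

A plain keyword search also matches keywords inside comments and string literals, so it over-approximates on its own. That is one reason to pair it with code parsing, as the overview describes.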