Enhancing classification accuracy in medical datasets using a hybrid distance and cluster refinement-based K-means clustering method

Journal: Scientific Reports, Volume: 16, Issue: 1
DOI: https://doi.org/10.1038/s41598-025-30176-1
PMID: https://pubmed.ncbi.nlm.nih.gov/41582138
Publication Date: 2026-01-25
Author(s): Hussein A. A. Al-Khamees et al.
Primary Topic: Advanced Clustering Algorithms Research

Overview

The research paper presents an enhanced K-Means clustering framework aimed at addressing two significant limitations of the traditional K-Means algorithm: its dependence on a single distance metric, typically Euclidean, and the absence of a cluster refinement mechanism. The proposed framework introduces a hybrid distance strategy that combines cosine and cityblock (Manhattan) metrics in a tunable manner, allowing for improved similarity measurement in complex medical datasets. Additionally, a Z-score-based refinement process is implemented to identify and reassign outlier samples, thereby enhancing cluster cohesion and separation.
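One plausible reading of a "tunable" combination of cosine and cityblock distances is a weighted sum of the two. The sketch below assumes that form. The mixing weight `alpha`, the function names `hybrid_distance` and `hybrid_kmeans`, and the mean-based centroid update are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def hybrid_distance(X, C, alpha=0.5):
    """Blend cosine and cityblock distances between rows of X and rows of C.

    alpha is a hypothetical mixing weight: 1.0 gives pure cosine distance,
    0.0 gives pure cityblock (Manhattan) distance.
    """
    X, C = np.asarray(X, dtype=float), np.asarray(C, dtype=float)
    # Cosine distance: 1 - (x . c) / (||x|| ||c||), guarding against zero norms.
    x_norm = np.linalg.norm(X, axis=1, keepdims=True)
    c_norm = np.linalg.norm(C, axis=1, keepdims=True)
    cosine = 1.0 - (X @ C.T) / np.maximum(x_norm * c_norm.T, 1e-12)
    # Cityblock distance: sum of absolute coordinate differences.
    cityblock = np.abs(X[:, None, :] - C[None, :, :]).sum(axis=2)
    return alpha * cosine + (1.0 - alpha) * cityblock

def hybrid_kmeans(X, k, alpha=0.5, n_iter=100, seed=0):
    """Lloyd-style loop that assigns each sample by the blended distance."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = hybrid_distance(X, centroids, alpha).argmin(axis=1)
        updated = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Final assignment against the centroids that are returned.
    labels = hybrid_distance(X, centroids, alpha).argmin(axis=1)
    return labels, centroids
```

Cityblock distance is unbounded while cosine distance lies in [0, 2], so the two terms can differ in scale. This sketch does not rescale them, which means a useful value of `alpha` depends on how the features were standardized.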

The proposed method is evaluated on two benchmark medical datasets, Breast Cancer Wisconsin (BCW) and Heart Disease, and shows substantial improvements in clustering performance. It achieves accuracies of 0.9825 and 0.9000, respectively, significantly surpassing both traditional Euclidean K-Means and cosine-based K-Means. The framework also yields higher homogeneity scores, indicating better cluster quality. Visual validation through t-SNE plots and box plots further supports the effectiveness of the refinement step, revealing clearer class separation and reduced variability. Overall, this work contributes to unsupervised learning in healthcare by offering a robust tool for medical data analysis and clinical decision-making.
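Cluster ids carry no inherent class meaning, so reporting accuracy for a clustering requires some mapping from clusters to classes. The summary does not say which mapping the paper uses. The sketch below assumes the best one-to-one relabeling, and computes homogeneity from its standard conditional-entropy definition. The function names are illustrative.

```python
from itertools import permutations
import numpy as np

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one relabeling of clusters to classes.

    Assumes there are no more clusters than classes.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    best = 0.0
    # Try every assignment of cluster ids to class ids (fine for small k).
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        mapped = np.array([mapping[c] for c in y_pred])
        best = max(best, float(np.mean(mapped == y_true)))
    return best

def homogeneity(y_true, y_pred):
    """Homogeneity = 1 - H(class | cluster) / H(class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    _, class_counts = np.unique(y_true, return_counts=True)
    p_class = class_counts / n
    h_class = -np.sum(p_class * np.log(p_class))
    if h_class == 0.0:  # a single class is trivially homogeneous
        return 1.0
    h_cond = 0.0
    for cluster in np.unique(y_pred):
        in_cluster = y_true[y_pred == cluster]
        _, counts = np.unique(in_cluster, return_counts=True)
        p = counts / len(in_cluster)
        # Weight each cluster's class entropy by its share of the samples.
        h_cond += (len(in_cluster) / n) * -np.sum(p * np.log(p))
    return 1.0 - h_cond / h_class
```

A homogeneity of 1.0 means every cluster contains samples of a single class, while 0.0 means the clusters say nothing about the class labels.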

Introduction

In the introduction of this research paper, the authors emphasize the critical role of healthcare in enhancing individual well-being and societal progress. They highlight the necessity for medical professionals to deepen their understanding of clinical decision-making methodologies to facilitate early detection of illnesses, such as cancer and heart disease. The paper underscores the transformative potential of machine learning (ML) in healthcare, particularly through its ability to analyze data for early diagnosis, treatment planning, and outcome prediction. Among various ML techniques, clustering—especially using algorithms like K-Means—emerges as a valuable tool for organizing unlabeled medical data into meaningful groups.

The authors identify two significant limitations in current clustering practices: the overreliance on a single distance metric, which can lead to misleading results, and the lack of post-clustering refinement mechanisms to correct misassigned samples. They argue that existing methods often evaluate distance metrics in isolation and neglect the potential benefits of hybrid approaches. To address these challenges, the paper proposes an enhanced K-Means clustering framework tailored for medical data analysis, introducing a hybrid distance strategy that combines cosine and cityblock (Manhattan) distances. This innovation aims to improve the accuracy and interpretability of clustering results, thereby enhancing their utility in clinical decision-making.

Methods

The Methods section outlines the proposed K-Means clustering methodology, detailing the datasets used, the clustering technique, the distance metrics, and the cluster refinement process. The methodology is structured into several phases: loading the medical datasets; preprocessing, which includes dropping ID columns, mapping class labels, scaling features with the Standard Scaler technique, and selecting features with the Chi-squared test; and a novel cluster refinement step that aims to improve cluster quality by reassigning data samples.
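The preprocessing phases can be sketched with NumPy alone. The example below is a minimal illustration under stated assumptions: the input is a plain dict of columns standing in for a data frame, the names `chi2_scores` and `preprocess` are hypothetical, and Chi-squared selection is applied before standardization because the Chi-squared statistic requires non-negative inputs. The order the paper actually uses is not stated in this summary.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature against the labels.

    Assumes every feature has a positive column total.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    onehot = (y[:, None] == classes[None, :]).astype(float)  # (n, n_classes)
    observed = onehot.T @ X                                   # per-class feature sums
    # Feature sums expected per class if feature and class were independent.
    expected = np.outer(onehot.mean(axis=0), X.sum(axis=0))
    return ((observed - expected) ** 2 / expected).sum(axis=0)

def preprocess(table, id_col, label_col, label_map, n_features):
    """table: dict mapping column name -> list of values."""
    # Map class labels (for example "B"/"M") to integers.
    y = np.array([label_map[v] for v in table[label_col]])
    # Drop the ID column and the label column from the feature set.
    names = [c for c in table if c not in (id_col, label_col)]
    X = np.column_stack([np.asarray(table[c], dtype=float) for c in names])
    # Keep the n_features columns with the highest Chi-squared score
    # (assumption: scored on the raw, non-negative values).
    top = np.sort(np.argsort(chi2_scores(X, y))[::-1][:n_features])
    X = X[:, top]
    # Standardize to zero mean and unit variance (population std),
    # assuming no selected feature is constant.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X, y, [names[i] for i in top]
```

For example, `preprocess(table, "id", "diagnosis", {"B": 0, "M": 1}, n_features=10)` would return the standardized feature matrix, the integer labels, and the names of the retained columns, where the column names and the feature count are placeholders.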

Cluster refinement is critical to the performance of the K-Means method and can be achieved through various strategies, including data reassignment, merging and splitting clusters, noise removal, parameter tuning, centroid update smoothing, and iterative re-clustering. The proposed K-Means method is evaluated against advanced clustering techniques, specifically deep clustering and spectral clustering, with results presented for both the Breast Cancer Wisconsin (BCW) and Heart Disease datasets. The methodology is shown in Figure 2, and the comparative results are shown in Figure 7, which places the performance of the proposed method alongside these state-of-the-art approaches.
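The data-reassignment strategy can be illustrated with a Z-score rule. The summary states only that outlier samples are identified by Z-score and reassigned; the details below are assumptions. In this sketch a sample is flagged when its distance to its own centroid is unusually large for that cluster, and it is moved to the cluster whose typical member distance it fits best. The name `refine_by_zscore` and the default `threshold` are hypothetical.

```python
import numpy as np

def refine_by_zscore(D, labels, threshold=2.0):
    """Reassign samples that are outliers within their own cluster.

    D[i, j] is the distance from sample i to centroid j (any metric),
    labels[i] is the current cluster of sample i.
    Assumes every cluster has at least one member.
    """
    D, labels = np.asarray(D, dtype=float), np.asarray(labels)
    n, k = D.shape
    # Mean and spread of member-to-centroid distances, per cluster.
    mu = np.array([D[labels == j, j].mean() for j in range(k)])
    sigma = np.array([D[labels == j, j].std() for j in range(k)])
    sigma = np.where(sigma > 0, sigma, 1.0)  # avoid division by zero
    # Z-score of every sample's distance against every cluster's distribution.
    Z = (D - mu) / sigma
    own_z = Z[np.arange(n), labels]
    outliers = own_z > threshold  # unusually far from their own centroid
    refined = labels.copy()
    # Move each outlier to the cluster where its distance is most typical.
    refined[outliers] = Z[outliers].argmin(axis=1)
    return refined, outliers
```

Under this rule a sample can move to a cluster whose centroid is farther away in raw distance, provided that cluster is more spread out. Whether the paper's refinement behaves this way is not stated in this summary.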

Results

The Results section reports the clustering performance of the proposed framework on the two benchmark datasets. On Breast Cancer Wisconsin (BCW) and Heart Disease, the method reaches accuracies of 0.9825 and 0.9000, respectively, exceeding both traditional Euclidean K-Means and cosine-based K-Means. Homogeneity scores improve as well, indicating that the resulting clusters align more closely with the true classes.

The t-SNE plots and box plots show clearer class separation and reduced variability once the refinement step is applied, which supports its contribution to cluster quality. The proposed method is also compared with deep clustering and spectral clustering on both datasets, with the comparison shown in Figure 7.

Discussion

The discussion section highlights the limitations and challenges of existing studies that apply K-Means clustering to medical data analysis, particularly for disease classification tasks such as breast cancer and heart disease. Many studies use Euclidean distance as the default metric and reach widely varying accuracy levels, and some report results that raise concerns about experimental validity and reproducibility. For instance, a weighted initialization technique is reported to have fallen short of standard K-Means, with accuracies of 96.2% and 69.1% cited for the two. Other studies attempted to enhance K-Means by integrating it with supervised models or by employing mixed distance metrics, but often failed to address critical issues such as class imbalance, which is prevalent in medical datasets. The paper emphasizes that, despite these innovative approaches, many methods did not consistently outperform standard K-Means, which indicates a need for more robust and interpretable clustering techniques.

Furthermore, the paper discusses the application of advanced clustering methods, including deep learning approaches like Deep Embedded Clustering (DEC) and hybrid models that combine multiple algorithms. These methods aim to improve clustering performance while maintaining interpretability, particularly in healthcare applications where accurate disease classification is crucial. The authors also address the importance of preprocessing steps, such as the Synthetic Minority Over-sampling Technique (SMOTE) for balancing class distributions, and feature selection methods like the Chi-squared test to enhance model performance. Overall, the discussion underscores the necessity for methodological advancements in K-Means clustering to improve accuracy and interpretability in medical data analysis, while also highlighting the potential of hybrid and deep learning approaches to address existing limitations.

Limitations

The proposed K-Means method for clustering medical datasets shows notable improvements in accuracy and homogeneity; however, it operates under specific assumptions and has inherent limitations. First, it presupposes that the true number of clusters, denoted $k$, is known. This is critical when the method is used to distinguish classes such as benign versus malignant in breast cancer data or presence versus absence in heart disease datasets. Second, the method assumes that medical data can be grouped effectively, relying on the premise that clusters are compact and separable within the feature space. While this assumption holds for many datasets, it may not be valid where natural groupings are significantly irregular, which could undermine the method’s effectiveness.