Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?

Journal: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
DOI: https://doi.org/10.18653/v1/2024.emnlp-main.1096
Publication Date: 2024-01-01
Author(s): Guillermo Marco et al.
Primary Topic: Digital Humanities and Scholarship

Overview

This research paper investigates the creative writing capabilities of large language models (LLMs), specifically GPT-4 Turbo, in comparison with a top human novelist, Patricio Pron. The study staged a contest in which each contender proposed thirty titles and then wrote a short synopsis for every title, both their own and their opponent’s. An evaluation rubric inspired by Boden’s creativity framework was used, and expert assessments were collected to gauge the quality of the texts. The findings show that GPT-4, despite its advanced capabilities, does not match the creative depth and originality of a world-class novelist. Notably, GPT-4 was rated as more creative when writing from titles provided by Pron, which suggests potential for effective human-machine collaboration.

The conclusions emphasize that while LLMs can produce coherent and fluent text, they often lack the nuanced originality and intent characteristic of top human writers. The study highlights the significant influence of prompts on creative output. It also finds that GPT-4 performs notably better in English than in Spanish, likely because of a bias in its training data. Furthermore, expert evaluators became increasingly adept at identifying AI-generated texts over time, which indicates a recognizable style in GPT-4’s outputs. Overall, the research underscores inherent limitations in current LLMs, which tend to generate content based on familiar patterns rather than innovative thinking, and so fall short of replicating the creative processes of elite human authors.

Introduction

The introduction of this research paper discusses the capabilities of Large Language Models (LLMs), particularly in the realm of creative writing, highlighting their growing influence on creative industries and the economy. The authors aim to investigate whether LLMs, specifically GPT-4 Turbo, can compete with top human authors, using a formal contest format against distinguished writer Patricio Pron. This study is framed around several research questions, including the comparative skills of LLMs and human authors, the impact of prompts on creativity, performance across different languages, the recognizability of LLM writing styles, and the measurement of creativity based on Boden’s framework.

The methodology involves a structured evaluation of 180 texts produced by GPT-4 and Pron, assessed by literary critics using a rubric based on creativity dimensions such as novelty and value. Key findings indicate that while GPT-4 demonstrates some strengths, it does not yet match the creative writing skills of top human authors, as evidenced by expert assessments favoring Pron. Additionally, the study finds that prompting GPT-4 with titles written by Pron improves its writing, that its performance in Spanish is inferior to its performance in English, and that its writing style becomes recognizable over time. Overall, the research contributes to the understanding of LLM capabilities in creative contexts and sets the stage for future inquiries into human-AI collaboration in creative writing.

Methods

In this section, the authors outline the experimental design for comparing the creative writing capabilities of the language model GPT-4 Turbo and the acclaimed novelist Patricio Pron. The experiments used GPT-4 Turbo, selected for its superior performance at the time, with the temperature set to 1, a value chosen to keep the output grammatically correct, particularly in Spanish. Competing models such as Claude 3 Opus and Gemini Ultra appeared later, but the researchers observed no significant advantage in them and retained their original setup. Each contender first proposed 30 movie titles and then wrote a 600-word synopsis for each of the resulting 60 titles. The prompts given to GPT-4 emphasized the need for creativity and literary value. The titles were initially proposed in Spanish and subsequently translated into English for comparative analysis.
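The design described above can be enumerated in a few lines, which makes the count of 180 texts easier to check. This is a minimal sketch, not the authors' code: it assumes the 180 texts arise from three writer/language conditions (Pron in Spanish, GPT-4 in Spanish, GPT-4 in English) applied to all 60 titles, and every name in it (`CONDITIONS`, `build_design`, the dictionary keys) is hypothetical.

```python
from itertools import product

TITLE_AUTHORS = ("Pron", "GPT-4")
TITLES_PER_AUTHOR = 30

# Assumption: each title is written up under three writer/language conditions.
# The summary states 180 texts in total; this breakdown is an illustration.
CONDITIONS = (("Pron", "es"), ("GPT-4", "es"), ("GPT-4", "en"))


def build_design():
    """Return one record per synopsis in the assumed experimental design."""
    titles = [
        (author, index)
        for author in TITLE_AUTHORS
        for index in range(1, TITLES_PER_AUTHOR + 1)
    ]
    return [
        {
            "title_author": title_author,
            "title_id": title_id,
            "writer": writer,
            "language": language,
        }
        for (title_author, title_id), (writer, language) in product(titles, CONDITIONS)
    ]


design = build_design()
print(len(design))  # 60 titles x 3 conditions = 180
```

Keeping `title_author` and `writer` as separate fields is what allows the later comparison between synopses written from a contender's own titles and those written from the opponent's titles.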

The evaluation rubric, crafted by experts in relevant fields, focuses on three dimensions: originality, attractiveness, and overall creativity. It builds on Margaret Boden’s definition of creativity as the ability to produce ideas that are new, surprising, and valuable. Originality is assessed through the novelty and uniqueness of the titles and synopses, while attractiveness concerns how appealing and engaging the content is as literature. The rubric is designed to measure how strongly these dimensions correlate with the evaluators’ perceptions of creativity, and so to test whether Boden’s conceptualization holds in the context of fiction writing.
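To make the correlation idea concrete, the sketch below computes a Pearson coefficient between each rubric dimension and the overall creativity score. It is an illustration only: the ratings are invented, the integer scale is arbitrary, and the summary does not say which correlation measure the authors used. Because rubric scores are ordinal, a rank-based coefficient such as Spearman's would be a reasonable alternative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented ratings for six texts; not data from the study.
ratings = {
    "originality": [3, 1, 2, 0, 3, 2],
    "attractiveness": [2, 1, 3, 1, 3, 1],
    "creativity": [3, 1, 2, 0, 3, 1],
}

correlations = {
    dimension: pearson(scores, ratings["creativity"])
    for dimension, scores in ratings.items()
    if dimension != "creativity"
}

for dimension, r in correlations.items():
    print(f"{dimension}: r = {r:.2f}")
```

A coefficient close to 1 for both dimensions would be consistent with Boden's view that novelty and value together account for perceived creativity.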

Results

In the Results and Discussion section, the authors present the primary findings of their study, drawn from the expert assessment annotations. Each subsection addresses one of the research questions posed at the outset of the study, using the detailed evaluations to support its answer.

Discussion

In the discussion section of the research paper, the authors contextualize their study within the broader landscape of recent advancements in large language model (LLM) technology and its implications for creative writing. They reference several key works that explore the intersection of machine learning and creativity, highlighting the challenges of evaluating computational creativity. Notably, while tools like Story Centaur and CoPoet have shown promise in assisting human writers, limitations remain in LLMs, particularly concerning narrative consistency and plot development. The authors emphasize their focus on autonomous LLM creative writing, contrasting their findings with previous studies that primarily involved human-AI collaboration.

The results of their study reveal significant differences in creative output between the LLM, specifically GPT-4, and a prestigious author, Patricio Pron. Evaluators consistently rated Pron’s work higher across various quality dimensions, including originality, style, and emotional depth, indicating that GPT-4, despite its ability to generate coherent text, falls short in creativity and stylistic nuance. The authors also investigate the influence of prompts on creative output, finding that titles provided by Pron led to improved scores for GPT-4, suggesting that human input can enhance LLM-generated texts. Furthermore, they explore language differences, revealing that GPT-4 performs better in English than in Spanish, particularly in conveying a unique authorial voice. Overall, the findings underscore the current limitations of LLMs in creative writing tasks and suggest that human-machine collaboration may yield more promising results than fully autonomous LLM writing.

Limitations

The limitations of this study highlight several critical factors that may affect the generalizability and applicability of its findings. First, the decision not to tune the prompts given to the human writer and to GPT-4 may have restricted the exploration of more effective prompt designs, which could influence the quality of the generated texts. Additionally, the focus on a single creative writing task, crafting short synopses for imaginary films, limits the scope of the findings, since creativity encompasses a wider array of tasks that were not assessed. This narrow focus raises questions about the applicability of the results to other forms of creative expression that may require different skills.

Moreover, the study’s linguistic and cultural limitations are significant; it only examined texts in English and Spanish, potentially overlooking the rich diversity of creativity influenced by various cultural contexts. The findings may not extend to other languages, particularly those with fewer online resources, where the gap between AI-generated and human-written texts could be more pronounced. Furthermore, the reliance on expert assessments alone leaves open the question of audience reception, which may differ from expert evaluations. Future research is encouraged to broaden the scope of creative tasks, incorporate audience feedback, and explore various AI models to enhance understanding of creativity in AI systems.