Discovery of antimicrobial peptides in the global microbiome with machine learning


Journal: Cell, Volume 187, Issue 14
DOI: https://doi.org/10.1016/j.cell.2024.05.013
PMID: https://pubmed.ncbi.nlm.nih.gov/38843834
Published: 2024-06-05


Highlights

Machine learning predicted nearly one million new antibiotics in the global microbiome
Of 100 peptides tested, 79 were active in vitro; 63 of these targeted pathogens
Some peptides may originate from longer sequences through genomic fragmentation
AMPSphere is an open-access resource to accelerate antibiotic discovery

Authors

Célio Dias Santos-Júnior, Marcelo D.T. Torres, Yiqian Duan, …, Jaime Huerta-Cepas, Cesar de la Fuente-Nunez, Luis Pedro Coelho

Correspondence

cfuente@upenn.edu (C.d.l.F.-N.), luispedro@big-data-biology.org (L.P.C.)

In brief

A machine-learning-based approach predicted nearly one million new antibiotics from the global microbiome. Of 100 peptides tested, 79 were active in vitro, and several showed efficacy comparable to that of a clinical antibiotic in a mouse infection model.


Célio Dias Santos-Júnior, Marcelo D.T. Torres, Yiqian Duan, Álvaro Rodríguez del Río, Thomas S.B. Schmidt, Hui Chong, Anthony Fullam, Michael Kuhn, Chengkai Zhu, Amy Houseman, Jelena Somborski, Anna Vines, Xing-Ming Zhao, Peer Bork, Jaime Huerta-Cepas, Cesar de la Fuente-Nunez, and Luis Pedro Coelho

Institute of Science and Technology for Brain-Inspired Intelligence (ISTBI), Fudan University, Shanghai 200433, China
Laboratory of Microbial Processes and Biodiversity (LMPB), Department of Hydrobiology, Universidade Federal de São Carlos (UFSCar), São Carlos, São Paulo 13565-905, Brazil
Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA
Department of Chemistry, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, PA, USA
Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM), Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Campus de Montegancedo-UPM, Pozuelo de Alarcón, 28223 Madrid, Spain
Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
APC Microbiome and School of Medicine, University College Cork, Cork, Ireland
Max Delbrück Centre for Molecular Medicine, Berlin, Germany
Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
Department of Neurology, Zhongshan Hospital, Fudan University, Shanghai, China
State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia

These authors contributed equally
Lead contact
*Correspondence: cfuente@upenn.edu (C.d.l.F.-N.), luispedro@big-data-biology.org (L.P.C.)
https://doi.org/10.1016/j.cell.2024.05.013

Summary

Novel antibiotics are urgently needed to combat the antibiotic-resistance crisis. We present a machine-learning-based approach to predict antimicrobial peptides (AMPs) within the global microbiome and leverage a vast dataset of 63,410 metagenomes and 87,920 prokaryotic genomes from environmental and host-associated habitats to create AMPSphere, a comprehensive catalog comprising 863,498 non-redundant peptides, few of which match existing databases. AMPSphere provides insights into the evolutionary origins of peptides, including by duplication or gene truncation of longer sequences, and we observed that AMP production varies by habitat. To validate our predictions, we synthesized and tested 100 AMPs against clinically relevant drug-resistant pathogens and human gut commensals both in vitro and in vivo. A total of 79 peptides were active, with 63 targeting pathogens. These active AMPs exhibited antibacterial activity by disrupting bacterial membranes. In conclusion, our approach identified nearly one million prokaryotic AMP sequences, an open-access resource for antibiotic discovery.

Introduction

Antibiotic-resistant infections are becoming increasingly difficult to treat with conventional therapies. Indeed, such infections currently kill 1.27 million people per year. Therefore, new methods for antibiotic discovery are urgently needed. Computational approaches have recently been developed to accelerate our ability to identify novel antibiotics, including antimicrobial peptides (AMPs). Recently, proteome mining approaches have also been developed to identify antimicrobials in extinct organisms in an effort to expand our known antimicrobial repertoire. AMPs, found in all domains of life, are short sequences (operationally defined here as 10-100 amino acid residues) capable of disturbing microbial growth. AMPs most commonly interfere with cell wall integrity and cause cell lysis. Natural AMPs can originate by proteolysis, by non-ribosomal synthesis, or, as we focus on in the present study, they can be encoded within the genome.

Bacteria live in a complex balance of competition and cooperation in natural habitats. AMPs play an important role in modulating these microbial interactions and can displace competitor strains, facilitating cooperation. For instance, pathogens such as Shigella spp., Staphylococcus spp., Vibrio cholerae, and Listeria spp. produce AMPs that eliminate competitors (sometimes from the same species), allowing them to occupy their niche.

AMPs hold promise as potential therapeutics and have already been used clinically as antiviral drugs (e.g., enfuvirtide and telaprevir). AMPs that exhibit immunomodulatory properties are currently undergoing clinical trials, as are peptides that may be used to address fungal and bacterial infections (e.g., pexiganan, LL-37, and PAC-113). Although most AMPs display broad-spectrum activity, some are active only against closely related members of the same species or genus. Such AMPs are more targeted agents than conventional broad-spectrum antibiotics. Moreover, in contrast to conventional antibiotics, the evolution of resistance to many AMPs occurs at low rates and is not related to cross-resistance to other classes of widely used antibiotics.

The application of metagenomic analyses to the study of AMPs has been limited by technical constraints, stemming primarily from the challenge of distinguishing genuine protein-coding sequences from false positives. Therefore, small open reading frames (smORFs) have historically been overlooked in (meta)genomic analyses. In recent years, significant progress has been made in metagenomic analyses of human-associated smORFs. These advances have incorporated machine learning (ML) techniques to identify smORFs encoding proteins that belong to specific functional categories. Notably, a recent study used predicted smORFs to discover approximately 2,000 AMPs from metagenomic samples of human gut microbiota. However, the human gut represents only a fraction of overall microbial diversity, which suggests that there is immense potential for discovering AMPs from prokaryotes in the variety of habitats across the globe.

In this study, we used ML to predict and catalog AMPs from the global microbiome as currently represented in public databases. By computationally exploring 63,410 publicly available metagenomes and 87,920 high-quality microbial genomes, we uncovered a vast array of AMP diversity. This resulted in AMPSphere, a collection of 863,498 non-redundant peptide sequences encompassing candidate AMPs (c_AMPs) derived from (meta)genomic data. Remarkably, the vast majority of these c_AMP sequences had not been described previously. Our analysis revealed that these c_AMPs were specific to particular habitats and were mostly not core genes in the pangenome.

Moreover, we synthesized 100 c_AMPs from AMPSphere and found that 79 were active, with 63 exhibiting antimicrobial activity in vitro against clinically significant ESKAPEE pathogens, which are recognized as public health concerns. These peptides were further compared with encrypted peptides (EPs), which are peptide sequences hidden within protein sequences and mined computationally, and they demonstrated the ability to target bacterial membranes and a propensity to adopt α-helical and β-structures. Notably, the leading candidates showed promising anti-infective activity in a preclinical animal model. Together, our work demonstrates the ability of ML approaches to identify functional AMPs from the global microbiome.

Results

AMPSphere comprises almost one million c_AMPs from several habitats

AMPSphere incorporates c_AMPs predicted with ML using Macrel, a pipeline that uses random forests to predict AMPs from large peptide datasets with an emphasis on precision over recall. It was applied to 63,410 globally distributed, publicly available metagenomes (Figure 1A; Table S1) and 87,920 high-quality bacterial and archaeal genomes.

Sequences present in a single sample were removed, except when they had a significant match (defined as amino acid identity

and E-value

) to a sequence in the dedicated AMP database Data Repository of Antimicrobial Peptides (DRAMP) version 3.0.
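The singleton filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `filter_singletons` and both input structures are hypothetical, and the set of peptides with a significant DRAMP match is assumed to have been computed beforehand with whatever identity and E-value cutoffs apply.

```python
def filter_singletons(peptide_samples, reference_matches):
    """Keep peptides seen in more than one sample, plus single-sample
    peptides that have a significant match in a reference AMP database.

    peptide_samples: dict mapping peptide sequence -> set of sample IDs
    reference_matches: set of peptide sequences with a significant match
                       (assumed precomputed; thresholds are not set here)
    """
    kept = set()
    for peptide, samples in peptide_samples.items():
        if len(samples) > 1 or peptide in reference_matches:
            kept.add(peptide)
    return kept


# Toy input (made-up sequences and sample IDs)
peptide_samples = {
    "KKLLKWLL": {"sample1", "sample2"},  # seen twice: kept
    "AAGGAAGG": {"sample1"},             # singleton, no match: removed
    "RRWWRRWW": {"sample3"},             # singleton rescued by a match
}
reference_matches = {"RRWWRRWW"}
kept = filter_singletons(peptide_samples, reference_matches)
```

Keeping the matching step outside the function leaves the filter independent of the search tool and of the specific cutoffs.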

This resulted in

genes

of the total predicted smORFs, encoding 863,498 non-redundant c_AMPs (on average

residues long; Figures 1A and S1). Similar to validated sequences with antimicrobial activity,

c_AMPs from AMPSphere carry a positive charge (

a high isoelectric point

amphiphilicity (hydrophobic moment,

), and the potential to bind to membranes or other proteins (Boman index,

As expected, the overall distributions of physicochemical properties of peptides from AMPSphere, DRAMP version 3.0, and the positive training set used in Macrel are more similar to each other than to the negative training set (which is assumed not to contain AMPs). However, c_AMPs from AMPSphere are on average longer (

residues) than those in DRAMP version 3.0 (

residues), and we observed differences in the distributions of other features (such as charge, aliphaticity, amphiphilicity, and isoelectric point; Figure S1).
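As a rough illustration of one property discussed above, the sketch below estimates net charge by counting basic and acidic residues. This is a deliberate simplification and an assumption of this example: it ignores histidine, terminal groups, and pH-dependent ionization, so it is not how Macrel or the authors compute charge. The function names are hypothetical.

```python
POSITIVE = set("KR")  # lysine, arginine
NEGATIVE = set("DE")  # aspartate, glutamate


def approx_net_charge(sequence):
    """Crude net charge: (#K + #R) - (#D + #E)."""
    seq = sequence.upper()
    positives = sum(1 for aa in seq if aa in POSITIVE)
    negatives = sum(1 for aa in seq if aa in NEGATIVE)
    return positives - negatives


def basic_fraction(sequence):
    """Fraction of residues that are K or R."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(1 for aa in seq if aa in POSITIVE) / len(seq)


# Toy peptide (made up): three basic and two acidic residues
charge = approx_net_charge("KKLLRDE")
```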

We next estimated the quality of the smORF predictions and detected

of the c_AMP sequences in independent, publicly available metaproteomes or metatranscriptomes (Figures 2 and S2A; see STAR Methods section "Quality control of c_AMPs") belonging to several habitats included in AMPSphere, such as the human gut, plants, and others (Table S6). We then subjected all c_AMPs to a set of in silico quality tests (see STAR Methods section "Quality control of c_AMPs"). A subset of c_AMPs (9.2%, or 80,213 c_AMPs) passed all of them, and this subset is hereafter designated as high quality. Testing with other AMP prediction systems (AMPScanner v2, the mature-peptides model of ampir, amPEPpy, APIN, AI4AMP, and AMPlify), we observed that

(849,703 peptides) of AMPSphere c_AMPs were also predicted as AMPs by at least one other AMP prediction system. Approximately

(132,440 of 863,498 peptides) of AMPSphere c_AMPs were predicted by all the methods used.
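The co-prediction tallies above amount to simple set bookkeeping. The sketch below assumes each external predictor's positive calls have already been collected into a set of peptide identifiers; the tool names and identifiers are placeholders, and no real predictor is invoked.

```python
def consensus_counts(candidates, predictions):
    """Count candidates called positive by at least one tool and by all tools.

    candidates: iterable of peptide IDs
    predictions: dict mapping tool name -> set of peptide IDs called positive
    """
    tool_hits = list(predictions.values())
    candidates = list(candidates)
    at_least_one = sum(1 for p in candidates if any(p in hits for hits in tool_hits))
    by_all = sum(1 for p in candidates if all(p in hits for hits in tool_hits))
    return at_least_one, by_all


# Toy data (hypothetical IDs and tool names)
candidates = ["pep1", "pep2", "pep3"]
predictions = {
    "tool_a": {"pep1", "pep2"},
    "tool_b": {"pep1"},
}
at_least_one, by_all = consensus_counts(candidates, predictions)
```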

Figure 1. AMPSphere comprises

non-redundant AMPs from thousands of metagenomes and high-quality microbial genomes
(A) To build AMPSphere, we first assembled 63,410 publicly available metagenomes from diverse habitats. A modified version of Prodigal, which can also predict smORFs (

), was used to predict genes on the resulting metagenomic contigs as well as on 87,920 microbial genomes from ProGenomes. Macrel was applied to the

predicted smORFs to obtain 863,498 non-redundant c_AMPs (see also Figure S1). The c_AMPs were then hierarchically clustered in a reduced amino acid alphabet using

, and

identity cutoffs. We observed 118,051 non-singleton clusters at

of identity, and 8,788 of them were considered families (

c_AMPs).
(B) Only

of c_AMPs have detectable homologs in other small-protein databases (SmProt 2, STsORFs), bioactive peptide databases (DRAMP version 3.0, starPepDB), and general protein datasets (GMGCv1; see also Figure S2B). Shown are the number of AMPSphere homologs in each database as well as the total. Also shown are the number of homologs that passed all of our quality tests regardless of experimental evidence of translation/transcription, and the percentage they represent of the identified homologs. Note that some peptides have homologs in multiple databases, so the total is not the sum of the individual databases.
(C) Rarefaction curves show how sampling affects the discovery of AMPs, with most habitats presenting steep sampling curves.
(D) Sharing of c_AMPs between habitats is limited. The width of the ribbons represents the proportion of c_AMPs shared with the habitats on the left. See also Figures S2C and S2D and Tables S1 and S2.

Only

of the identified c_AMPs (6,339 peptides) are homologous (operationally defined as amino acid identity

and E-value

) to experimentally validated AMP sequences in DRAMP version 3.0.

Moreover, most c_AMPs were also absent from protein databases not specific to AMPs (Figure 1B), such as the small-proteins database (SmProt2) or the Global Microbial Gene Catalog of canonical-length proteins (GMGCv1), suggesting that c_AMPs represent a region of

Figure 2. Quality control of AMPSphere candidates
(A) The number of AMPSphere candidates passing each proposed quality test is shown. The high-quality set consists of

of candidates without experimental evidence and

of candidates with evidence of translation or transcription. Also shown is the number of homologs present in the high-quality candidate set. Although the high-quality set shows some overlap with the homologs, most homologs are not in the high-quality set.
(B) The number of c_AMPs co-predicted by AMP prediction systems other than Macrel (AMPScanner v2, ampir with the mature-peptides model, amPEPpy, APIN with their proposed model, AI4AMP, and AMPlify). Only a small portion of AMPSphere (<2%) could not be predicted by any system other than Macrel.

peptide sequence space that is not present in these other databases. In total, we could only find 73,774 (

) c_AMPs with homologs in any of the databases we considered. High-quality c_AMPs were detected in public databases at a higher frequency than c_AMPs in general (2.5-fold,

; Figure 1B), with 23,012 of the 80,213 high-quality c_AMPs having a match in another database. However, it is noteworthy that

(4,843 of 6,339 peptides) of the c_AMPs with a homolog in DRAMP version 3.0 (and thus very likely to be functional) are not high-quality c_AMPs. Thus, while our quality tests select for reliable sequences, failing the tests is not sufficient reason to conclude that a sequence is inactive.

To put c_AMPs in an evolutionary context, we hierarchically clustered the peptides using a reduced amino acid alphabet of 8 letters. The three sequence clustering levels adopted identity cutoffs of

, and

(Figure S3). At the

identity level, we obtained 521,760 protein clusters, of which 405,547 were singletons, corresponding to

of all c_AMPs from AMPSphere. A total of 78,481 (

) of these singletons were detected in metatranscriptomes or metaproteomes from various sources, indicating that they were not artifacts. The large number of singletons suggests that most c_AMPs originated from processes other than diversification within families, which is the opposite of the supposed origin of full-length proteins, where singleton families are rare.
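To make the idea of clustering in a reduced alphabet concrete, the sketch below recodes peptides into 8 residue classes and compares them position by position. The grouping used here is an assumption chosen for illustration (the text does not list the classes), and the ungapped identity function stands in for a real alignment-based clustering tool.

```python
# Hypothetical 8-class grouping, assumed for illustration only
GROUPS = ["LVIMC", "AG", "ST", "P", "FYW", "EDNQ", "KR", "H"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(sequence):
    """Recode a peptide using one representative letter per residue class."""
    return "".join(REDUCE[aa] for aa in sequence.upper())


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two toy peptides that differ at every position in the full alphabet
pep_a, pep_b = "KRLV", "RKIL"
raw_identity = identity(pep_a, pep_b)
reduced_identity = identity(reduce_sequence(pep_a), reduce_sequence(pep_b))
```

In this toy case the two peptides share no residues in the 20-letter alphabet but become identical after reduction, which is why a reduced alphabet can group more distantly related sequences.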

The 8,788 clusters with

peptides obtained at

identity are hereafter called "families," as in Sberro et al. Among them, we considered 6,499 to be high-quality families because they contained evidence of translation or transcription or because

their sequences passed all in silico quality tests, regardless of whether experimental evidence was available (see STAR Methods section "AMP families"). These high-quality families span

of AMPSphere (133,309 peptides).

All c_AMPs predicted here can be accessed at https://ampsphere.big-data-biology.org/. Users can retrieve the peptide sequences, ORFs, and predicted biochemical properties of each c_AMP (e.g., molecular weight, isoelectric point, and net charge at pH 7.0). We also provide the distribution across geographical regions, habitats, and microbial species for each c_AMP.

c_AMPs are rare and habitat-specific

AMPSphere spans 72 different habitats, which were classified into eight high-level habitat groups, such as soil/plant

of c_AMPs in AMPSphere), aquatic (

), and human gut (

; Figure 1A; Table S2). Most habitats, except for the human gut, appear to be far from saturated in terms of discovered c_AMPs (Figure 1C). In fact, most c_AMPs are rare (the median number of detections is 99, or

of the dataset; when restricted to high-quality c_AMPs, the median number of detections is 81, or

of the dataset), with

being observed in fewer than 1% of samples (Figure S2). Only

of c_AMPs were detected in more than one high-level habitat group (hereafter "multi-habitat c_AMPs"); this proportion is 7.25-fold smaller than would be expected from a random assignment of habitats to samples (

; see STAR Methods section "Multi-habitat and rare c_AMPs"). Even within high-level habitat groups, c_AMPs overlap between habitats far less than expected by chance (2.4- to 192-fold less,

; see STAR Methods section "Testing c_AMP overlap across habitats"; Figure 1D).
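The comparison with a random assignment of habitats to samples can be sketched as a label-shuffling procedure. The code below is a minimal version under assumed data structures (a peptide-to-samples map and a sample-to-habitat map); the function names, number of permutations, and seed are illustrative choices and do not reproduce the authors' test.

```python
import random


def multi_habitat_count(peptide_samples, sample_habitat):
    """Number of peptides detected in more than one habitat."""
    return sum(
        1
        for samples in peptide_samples.values()
        if len({sample_habitat[s] for s in samples}) > 1
    )


def shuffled_expectation(peptide_samples, sample_habitat, n_perm=200, seed=0):
    """Mean multi-habitat count after shuffling habitat labels across samples."""
    rng = random.Random(seed)
    samples = list(sample_habitat)
    labels = [sample_habitat[s] for s in samples]
    total = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # random reassignment of habitats to samples
        total += multi_habitat_count(peptide_samples, dict(zip(samples, labels)))
    return total / n_perm


# Toy data (hypothetical samples and habitats)
sample_habitat = {"a": "gut", "b": "gut", "c": "soil", "d": "soil"}
peptide_samples = {"pep1": {"a", "b"}, "pep2": {"a", "c"}}
observed = multi_habitat_count(peptide_samples, sample_habitat)
expected = shuffled_expectation(peptide_samples, sample_habitat)
```

The ratio `expected / observed` would then play the role of the fold difference quoted in the text, with the caveat that a real analysis needs far more permutations and a proper significance estimate.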

Mutations in larger genes generate c_AMPs as independent genomic entities

Many AMPs are produced post-translationally through the fragmentation of larger proteins. For example, EPs are detected computationally as fragments within human and other proteins and have been shown to be highly active. EPs display diverse secondary structures and act on the bacterial membrane similarly to known natural AMPs, but they have different physicochemical features compared with known AMPs.

AMPSphere considered only peptides encoded by dedicated genes. However, we hypothesized that some of these peptides could have originated from larger proteins through fragmentation at the genomic level. To explore this, we aligned the c_AMPs from AMPSphere to the full-length proteins in GMGCv1 and observed that about

of them are homologous to a canonical-length protein (Figure 1B), with

of these hits sharing the start codon with the longer protein. This suggests early termination of full-length proteins as one mechanism for generating novel c_AMPs (Figures 3A and 3B).
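The early-termination mechanism can be illustrated with a toy translation. The sketch below builds the standard genetic code and translates a made-up open reading frame before and after a TGG > TGA change; the DNA strings are invented for illustration and are not taken from the study.

```python
from itertools import product

# Standard genetic code in TCAG order (third base varies fastest)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate from the first base until the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy ORF: a single G>A change turns the tryptophan codon TGG into stop TGA
wild_type = "ATGAAATGGCTGTAA"
mutant = "ATGAAATGACTGTAA"
full_length = translate(wild_type)
truncated = translate(mutant)
```

The truncated product shares its start with the full-length protein, which is the signature the alignment start-position analysis looks for.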

(AMP10.271_016)

Early termination

CD3:33	GCTATGGTATCTGTAAGTTTTTAGGTAAGAGTGGCTGGCAAGTAATCGTTGGTGC
F0106	GCTATGGTATCTGTAAGTTTTTAGGCAAGAGTGGCTGGCAAATAATCGTTGGTGC
F0697	GCTATGGTATCTGTAAGTTTTTAGGCAAGAGTGGCTGGCAAGTAATCGTTGGTGC
SAMN09837386	GCTATGGTATCTGTAAGTTTTTAGGTAAGAGTGGCTGACAAGTAATCGTTGGTGC
SAMN09837387	GCTATGGTATCTGTAAGTTTTTAGGTAAGAGTGGCTGACAAGTAATCGTTGGTGC
SAMN09837388	GCTATGGTATCTGTAAGTTTTTAGGTAAGAGTGGCTGACAAGTAATCGTTGGTGC
FDAARGOS_760	GCTATGGTATCTGTAAGTTTTTAGGCAAGAGTGGCTGGCAGGTAATCGTTGGTGC
FDAARGOS_306	GCTATGGTATCTGTAAGTTTTTAGGCAAGAGTGGCTGGCAGGTAATCGTTGGTGC
FDAARGOS_1566	GCTATGGTATCTGTAAGTTTTTAGGCAAGAGTGGCTGGCAGGTAATCGTTGGTGC
ATCC 25845	GCTATGGTATCTGTAAGTTTTTAGGCAAGAGTGGCTGGCAGGTAATCGTTGGTGC

P. jejuni; P. melaninogenica; mutation to stop codon; metagenomic contig; conserved region

Function unknown

Translation, ribosomal structure, and biogenesis

Figure 3. Mutations in genes encoding large proteins generate c_AMPs as independent genomic entities
(A) The distribution of positions (as a percentage of the length of the larger protein) at which AMP homologs start their alignment is shown. About 7% of c_AMPs are homologous to proteins from GMGCv1, with about one-quarter of the hits sharing the same start position as the larger protein.
(B) As an illustrative example of a c_AMP homologous to a full-length protein, AMP10.271_016 was recovered from three samples of human saliva from the same donor. AMP10.271_016 is predicted to be produced by Prevotella jejuni, sharing the start codon (highlighted) of an NAD(P)-dependent dehydrogenase gene (WP_089365220.1), whose transcription was stopped by a mutation (in red; TGG > TGA).
(C) Distribution of AMPs by OG class (left) and their enrichment compared with full-length proteins from GMGCv1 (right). OGs were classified into subgroups according to the number of

AMPs they were associated with. OGs of unknown function represent the largest (2,041 of 3,792 OGs) and most enriched (

) class containing homologs of c_AMPs in GMGCv1. Interestingly, when considered individually, unknown-function OGs had the lowest number of c_AMP hits. These results do not change when underrepresented OGs are excluded using different thresholds (e.g., at least 10, 20, or 100 homologs per OG). See also Table S3.

To investigate the function of the full-length proteins homologous to AMPs, we mapped the matching proteins from GMGCv1 to orthologous groups (OGs) from eggNOG 5.0. We identified 3,792 (out of 43,789) OGs significantly enriched (

, after multiple hypothesis correction using the Holm-Sidak method) among the hits from AMPSphere. Although OGs of unknown function comprise

of all identified OGs, when considered individually, these OGs are on average smaller than OGs in other categories. Thus, although each OG has a relatively small number of c_AMP hits, when compared with the background distribution of OGs in GMGCv1, OGs of unknown function were the most enriched among the c_AMP hits, with an average enrichment of 10,857-fold

; Figure 3C; Table S3).

c_AMP genes may arise after gene duplication events

We then asked whether c_AMPs would be found predominantly in specific genomic contexts. To investigate the functions of the genes neighboring c_AMPs, we mapped them against 169,484 genomes included in a recent study. A total of

of the 55,191

c_AMPs with more than two homologs in different genomes in the database showed phylogenetically conserved genomic context with genes of known function (see STAR Methods section "Genomic context conservation analysis"). This holds for the curated versions of the catalog:

of high-quality c_AMPs and

of high-quality c_AMPs with experimental evidence show conserved genomic neighbors. These conservation values are similar to those of

gene families with more than two homologs calculated de novo on the gene catalog

), indicating that the genomic location of c_AMPs is not random.

Despite being involved in similar processes, c_AMPs were generally depleted from conserved genomic contexts involving known systems of antibiotic synthesis and resistance, even when compared with small protein families (Figure 4). Instead, we found that c_AMPs are encoded in conserved genomic contexts with ribosomal genes

) at a higher frequency than other gene families (4.75%; Figure 4A; Table S4).

Most c_AMPs (2,201 out of 2,642) in a conserved context with ribosomal subunits are homologous to ribosomal proteins (Figure 4D), consistent with the observation that in some species, ribosomal proteins have antimicrobial properties.

Seventy-seven of the c_AMPs homologous to ribosomal proteins were also found to be homologous to a ribosomal gene in their immediate vicinity (up to one gene upstream or downstream). This phenomenon is not exclusive to ribosomal proteins: 1,951 c_AMPs could be annotated to the same KEGG Orthologous Group (KO) as some of their immediate neighbors and may have originated from gene duplication events. This shared annotation was interpreted in this context as evidence of a common evolutionary origin rather than as a functional prediction for the c_AMPs. These duplications may have arisen from recombination of flanking homologous sequences, which can occur during cell division.

Interestingly, 1,635 (

) of these c_AMPs are located upstream of the neighbor with the same KO annotation. Various transporters and transposases are the most common KOs assigned to c_AMPs and their neighbors (400 and 125 c_AMPs, respectively; see Table S5).

Most c_AMPs are members of the accessory genome

We observed that only a small portion (

) of c_AMP families present in ProGenomes2

are contained in

of genomes from the same species (Figure 5), here referred to as "core." This is consistent with previous work, in which AMP production was observed to be strain-specific. In contrast, a high proportion (about 68.8%) of full-length protein families are core in ProGenomes

species. There is a 1.9-fold greater chance (

) that a pair of genomes from the same species shares at least one c_AMP when they belong to the same strain (99.5%

ANI <99.99%).

One example of this strain-specific behavior is AMP10.018_194, the only c_AMP found in Mycoplasma pneumoniae genomes. M. pneumoniae strains are traditionally classified into two groups based on their P1 adhesin gene. Of the 76 M. pneumoniae genomes present in our study, 29 were classified as type 1, 29 were classified as type 2, and the remaining 18 were undetermined in this classification system (see STAR Methods section "Determination of accessory AMPs"). Twenty-six of the 29 type-2 genomes contain AMP10.018_194, as do 2 of the undetermined genomes, but none of the type-1 genomes contain this AMP.

Species with higher transmissibility have lower c_AMP density

We investigated the taxonomic composition of AMPSphere by annotating contigs with the Genome Taxonomy Database (GTDB) (see STAR Methods section "c_AMP density in microbial species"), which resulted in 570,187 c_AMPs being annotated to a genus or species. The genera contributing the most c_AMPs to AMPSphere were Prevotella

c_AMPs), Bradyrhizobium (11,846 c_AMPs), Pelagibacter (6,675 c_AMPs), Faecalibacterium (5,917 c_AMPs), and CAG-110 (5,254 c_AMPs; see Figure 5). This distribution reflects the fact that these genera are among those contributing the most assembled sequence in our dataset (all ranking above the

percentile among assembled genera). Therefore, we calculated the c_AMP density (

) by determining the number of c_AMP genes per megabase pair of assembled sequence. To avoid bias due to uneven sampling of habitats, we included all sequences predicted by Macrel in each sample, including singleton sequences that were subsequently removed and are not part of AMPSphere.
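The density defined above is a simple ratio, sketched below. The function name `camp_density` and the example numbers are hypothetical and chosen only to show the unit conversion from base pairs to megabase pairs.

```python
def camp_density(n_camp_genes, assembled_bp):
    """c_AMP genes per megabase pair (Mbp) of assembled sequence."""
    if assembled_bp <= 0:
        raise ValueError("assembled_bp must be positive")
    return n_camp_genes / (assembled_bp / 1_000_000)


# Toy example: 50 predicted c_AMP genes on 2 Mbp of assembled contigs
density = camp_density(50, 2_000_000)
```

Normalizing by assembled sequence length is what allows a heavily sampled genus and a sparsely sampled one to be compared on the same scale.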

To further explore the importance of AMP production in ecological processes, we investigated the role of AMPs in the mother-to-child transmissibility of bacterial species reported in a recently published paper

by correlating

for each bacterial species with the published microbial dissemination metrics. Human gut bacteria showed increased transmissibility at lower AMP densities

). Similarly, for bacterial species in the human oral microbiome, mother-to-offspring transmissibility is consistently inversely correlated with

for the first year (

). This indicates that human gut bacteria and bacterial species in the oral microbiome show increased transmissibility at lower

. Moreover, it highlights the potential effect of

on transmissibility in the gut and

Figure 4. The genomic context of c_AMPs shows a preference for neighborhoods containing ribosome-assembly proteins
(A) Compared with other proteins, c_AMPs in conserved genomic structures tend to be closer to genes related to the ribosomal machinery than protein families of different sizes (all proteins, and small proteins with

amino acids).
(B) The proportion of c_AMPs in a genomic context involving antibiotic resistance genes is lower than in other gene families.
(C) The proportion of c_AMPs in neighborhoods with genes related to antibiotic synthesis is very small (

).
(D) The conserved genomic context of the gene encoding AMP10.015_426 in different genomes is shown (the tree on the left depicts the phylogenetic relationship of its homologous genes). This c_AMP is homologous to the ribosomal protein rpsH and is found in the context of rpsH and other ribosomal protein genes. See also Table S4.

oral microbiome, suggesting a link between AMPs and the transmission success rates of microbial species.

Physicochemical features and secondary structure of AMPs

To investigate the properties and structure of the synthesized peptides, we first compared their amino acid composition with that of AMPs from available databases of experimentally validated sequences (DRAMP version 3.0, the Database of Antimicrobial Activity and Structure of Peptides [DBAASP], and the Antimicrobial Peptide Database [APD] version 3). Overall, the composition was similar, as expected, since the ML model of Macrel was trained using known AMPs.

Notably, AMPSphere sequences displayed a slightly higher abundance of aliphatic amino acid residues, particularly alanine and valine. However, these AMPSphere sequences differed consistently (Figure 6A) from EPs.

The similarity in amino acid composition between the identified c_AMPs and known AMPs suggests similar physicochemical properties and secondary structures, both of which are known to influence antimicrobial activity.

The c_AMPs displayed similar hydrophobicity, net charge, and amphiphilicity compared with AMPs from databases (Figure S1). Furthermore, they showed a slight tendency toward disordered conformations (Figure 6B) and had a lower net positive charge compared with EPs (Figure 6A).

To evaluate the structural and antimicrobial properties of c_AMPs from AMPSphere, we first filtered AMPSphere for peptides predicted to be suitable for in vitro assays owing to their solubility in aqueous solution and ease of chemical synthesis. We selected a set of 50 high-quality peptide sequences based on their prevalence and taxonomic diversity (see STAR Methods section "Peptide selection for synthesis and testing"). In addition, to provide an unbiased evaluation of the peptides we report here, we first excluded any peptides with a homolog in one of the published databases and then randomly selected 50 additional peptides from AMPSphere, including 25 peptides with AMP probabilities of at least 0.6 (as reported by Macrel

) and 25 peptides with lower probabilities (

Next, we conducted experimental evaluations of the secondary structure of the active c_AMPs using circular dichroism (Figures 6B and S4). Similar to AMPs documented in databases, the peptides derived from AMPSphere showed varied

Figure 5. AMP diversity in AMPSphere depends on taxonomy
(A) Shown are the percentages of AMPs (or AMP families) that are accessory (present in

of genomes from the same species), shell (

), or core (

).
(B) The distribution of the lowest taxonomic level at which c_AMPs were annotated. In detail (right), the top 10 genera with the most c_AMPs included in AMPSphere. Animal-associated genera (such as Prevotella, Faecalibacterium, and CAG-110) contribute the most c_AMPs, likely reflecting data sampling.
(C) Using the

per genus (calculated with c_AMPs in AMPSphere), we observed the distribution of c_AMPs per phylum, with Bacillota A being the densest (the number of samples used to build the plot is shown above each box).
(D) The taxonomy of species detected in AMPSphere is shown using the GTDB

reference tree. Gray bars show the

distribution with respect to the taxonomy, with black bars representing the confidence interval for

. Bacillota A, Actinomycetota, and Pseudomonadota are the densest in c_AMPs. As a reference, the median

for the presented phyla is indicated by a dashed purple line.

tendencies to adopt α-helical structures; in addition, some were unstructured or adopted β-antiparallel conformations in all the media analyzed. Notably, they also showed an unusually high content of β-antiparallel structures in both water and methanol/water mixtures (Figure 6B) despite the similarity of their amino acid composition to that of AMPs and EPs. We attribute these results to the slightly higher occurrence of alanine and valine residues, which are known to favor β-like structures with a preference for the β-antiparallel conformation.

Validation of c_AMPs as potent antimicrobials through in vitro assays

Next, we tested the 100 synthesized peptides against 11 clinically relevant pathogenic strains encompassing Acinetobacter baumannii, Escherichia coli (including one colistin-resistant strain), Klebsiella pneumoniae, Pseudomonas aeruginosa, Staphylococcus aureus (including one methicillin-resistant strain), vancomycin-resistant Enterococcus faecalis, and vancomycin-resistant Enterococcus faecium. Our initial screening showed that 63 AMPs (out of the 100 synthesized) completely eradicated the growth of at least one of the pathogens tested (Figure 6C). Remarkably, in some cases, the AMPs were active at concentrations as low as

, close to those of the peptide antibiotic polymyxin B and the antibiotic levofloxacin, both of which were used as positive controls in all experiments (Figure S4A). The Gram-negative bacteria A. baumannii and E. coli, as well as the vancomycin-resistant Gram-positive strains E. faecalis and E. faecium, showed higher susceptibility to the AMPs, with

, and 26 peptide hits, respectively. However, none of the tested AMPs affected methicillin-resistant Staphylococcus aureus (MRSA) (Figure 6C). We also synthesized and tested scrambled versions of five of the most active peptides from the high-quality set (i.e., actinomycin-1, enterococcin-1, lachnospirin-1, proteobactin-1, and synechococcin-1). All scrambled versions were inactive except for lachnospirin-1_scrambled, which showed modest activity against A. baumannii at

(a 16-fold higher concentration than the parent peptide lachnospirin-1; Figure S5A). These results underscore the importance of the specific sequence of these peptides for exerting their antimicrobial activity. To further explore the effect of sequence on

Figure 6. Amino acid composition, structure, antimicrobial activity, and mechanism of action of c_AMPs
(A) Amino acid frequency in c_AMPs from AMPSphere, AMPs from databases (DRAMP version 3, APD3, and DBAASP), and encrypted peptides (EPs) from human proteins.
(B) Heatmap with the percentage of secondary structure found for each peptide in three different solvents: water,

trifluoroethanol (TFE) in water, and

methanol (MeOH) in water. Secondary structure was calculated using the BeStSel server.
(C) The activity of c_AMPs was assessed against ESKAPEE pathogens and human gut commensal strains. Briefly,

were exposed to c_AMPs two-fold serially diluted from 64 to

in 96-well plates and incubated at

for 1 day. After the exposure period, the absorbance of each well was measured at 600 nm. Untreated solutions were used as controls, and the minimum concentrations for complete inhibition were presented as a heatmap of antimicrobial activities

) against 11 pathogenic and eight human gut commensal bacterial strains. All assays were performed in three independent replicates, and the heatmap shows the mode obtained within the two-fold dilution concentration range studied. Gram-positive

and Gram-negative

bacteria are indicated (top).

we evaluated the secondary structure propensity of the scrambled peptides using circular dichroism. We observed a decrease in the helical fraction for the sequences with higher helical content (enterococcin-1, lachnospirin-1, and synechocucin-1), whereas predominantly random-coil sequences such as actynomycin-1 and proteobactin-1, as well as their scrambled counterparts, showed similar secondary structure tendencies in all media analyzed (Figures S5B-S5E). These results indicate a lack of correlation between secondary structure and antimicrobial activity for the AMPs derived from AMPSphere.
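The scrambled controls discussed above keep a peptide's amino acid composition while permuting the order of its residues. The following is a minimal sketch of that idea, not the procedure used in the study: the text does not say how the scrambled sequences were generated, the `scramble` helper is hypothetical, and the parent sequence is an invented placeholder rather than an AMPSphere peptide.

```python
import random
from collections import Counter


def scramble(sequence: str, seed: int = 0, max_tries: int = 100) -> str:
    """Return a random permutation of `sequence` that differs from the input order."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    residues = list(sequence)
    for _ in range(max_tries):
        rng.shuffle(residues)
        candidate = "".join(residues)
        if candidate != sequence:
            return candidate
    # Reached only for sequences that cannot be reordered (e.g. a homopolymer).
    raise ValueError("could not produce a distinct scrambled sequence")


# Hypothetical placeholder peptide (assumption, not taken from the paper).
parent = "KKLLKWLKKLAGRIVFAM"
scrambled = scramble(parent)

# Same residues, same length, different order.
same_composition = Counter(scrambled) == Counter(parent)
```

Because composition, length, and net charge are unchanged, any loss of activity in such a control can be attributed to residue order rather than to bulk physicochemical properties.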

c_AMPs inhibit the growth of human gut commensals

We screened the AMPs against eight of the most relevant members of the human gut microbiota associated with human health. We tested commensals belonging to four phyla (Verrucomicrobiota, Bacteroidota, Actinomycetota, and Bacillota), namely: Akkermansia muciniphila, Bacteroides fragilis, Bacteroides thetaiotaomicron, Bacteroides uniformis, Phocaeicola vulgatus (formerly Bacteroides vulgatus), Collinsella aerofaciens, Clostridium scindens, and Parabacteroides distasonis.

Although known natural AMPs are commonly observed not to target microbiome strains, our study found that 58 of the synthesized AMPs (58%) inhibited at least one commensal strain at low concentrations (

). Although this concentration range was higher than that observed for the most active peptides against pathogens (1

), it still falls within the highly active range of AMPs based on previous studies (Figure 6C). Interestingly, all gut microbiome strains analyzed were susceptible to at least five c_AMPs, with strains of A. muciniphila, B. uniformis, P. vulgatus, C. aerofaciens, C. scindens, and P. distasonis showing the highest susceptibility. In total, 79 AMPs (out of 100 synthesized peptides) showed antimicrobial activity against pathogens and/or commensals. We also screened scrambled sequences of five highly active peptides from the high-quality set against the gut commensals. As with the results obtained against the pathogenic strains (Figure S5), only lachnospirin-1_scrambled showed modest activity, against C. scindens at

(Figure S5A).
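The MIC readout behind Figure 6C (2-fold serial dilutions, absorbance at 600 nm, and the mode of three replicates) can be illustrated with a short sketch. Several details here are assumptions made only for illustration: the OD600 cutoff of 0.1 used to call "no growth", the seven-step series running from 64 down to 1, the tie-breaking rule, and all absorbance readings, which are invented.

```python
import math
from statistics import multimode


def twofold_series(top, steps):
    """Concentrations starting at `top` and halving at each step."""
    return [top / 2**i for i in range(steps)]


def mic_from_od(concentrations, od600, cutoff=0.1):
    """Lowest concentration that inhibits growth, scanning down from the highest.

    Returns math.inf when even the highest concentration shows growth.
    """
    mic = math.inf
    for conc, od in sorted(zip(concentrations, od600), reverse=True):
        if od >= cutoff:  # growth detected: stop scanning
            break
        mic = conc
    return mic


def reported_mic(replicate_mics):
    """Mode of the replicate MICs; ties go to the highest (most conservative) value."""
    return max(multimode(replicate_mics))


concs = twofold_series(64, 7)  # assumed range: 64, 32, 16, 8, 4, 2, 1

# Invented OD600 readings for three replicates, ordered like `concs`.
replicates = [
    [0.04, 0.05, 0.05, 0.06, 0.45, 0.60, 0.62],
    [0.04, 0.04, 0.05, 0.30, 0.50, 0.61, 0.60],
    [0.05, 0.05, 0.04, 0.05, 0.48, 0.58, 0.63],
]
mics = [mic_from_od(concs, r) for r in replicates]
final_mic = reported_mic(mics)
```

Reporting the mode rather than the mean keeps the result on the tested dilution grid, which is why a heatmap of MICs shows only the discrete concentrations of the series.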

Permeabilization and depolarization of the bacterial membrane by c_AMPs from AMPSphere

To gain insights into the mechanism of action responsible for the antimicrobial activity observed for the AMPSphere-derived peptides (Figure 6C), we performed experiments to assess their ability to permeabilize and depolarize the outer and cytoplasmic membranes of bacteria at their minimum inhibitory concentrations (MICs). Specifically, we investigated the effects of all 39 peptides that were active against A. baumannii (Figures 6D and 6E) and of 6 peptides with antimicrobial activity against P. aeruginosa (Figures S6A and S6B). For comparison and as a control, we used polymyxin B, a peptide antibiotic known for its membrane permeabilization and depolarization properties.

To investigate whether the selected AMPs permeabilize the outer membrane of Gram-negative bacteria, we performed 1-(N-phenylamino)naphthalene (NPN) uptake assays. NPN is a lipophilic fluorophore whose fluorescence increases in the presence of the lipids found within bacterial outer membranes. NPN uptake therefore indicates membrane permeabilization and damage. Of the 39 peptides evaluated for activity against A. baumannii, 10 markedly increased outer membrane permeability, resulting in fluorescence levels at least

higher than that of polymyxin B (Figure 6D) after 45 min of exposure. For P. aeruginosa, four of the six peptides tested showed higher permeabilization than polymyxin B (Figure S6A).

To assess the potential membrane depolarization effect of the selected AMPs from AMPSphere, we used the fluorescent dye 3,3′-dipropylthiadicarbocyanine iodide (DiSC3-[5]). Among the peptides tested against A. baumannii, bogicin-1 (AMP10.364_543), ampspherin-2 (AMP10.615_023), and marinobacticin-1 (AMP10.321_460) showed higher depolarization of the cytoplasmic membrane than polymyxin B, and all peptides tested against P. aeruginosa showed higher depolarization of the cytoplasmic membrane than polymyxin B (Figure S6B). Interestingly, all AMPSphere peptides tested displayed a characteristic crescent-shaped depolarization pattern compared with polymyxin B, with lower levels of depolarization during the first 20 min of exposure followed by an increase in depolarization over time (Figures 6E and S6B). Taken together, these results suggest that the kinetics of cytoplasmic membrane depolarization are slower than those of outer membrane permeabilization, which occurs quickly upon interaction with bacterial cells.

Our results indicate that the tested AMPs from AMPSphere act primarily by permeabilizing the outer membrane rather than by depolarizing the cytoplasmic membrane, revealing a mechanism of action similar to that observed for classical AMPs and EPs from human proteins.

AMPs show anti-infective efficacy in a mouse model

Next, we tested the anti-infective efficacy of AMPSphere-derived AMPs in a murine skin abscess infection model (Figure 7A). Mice were infected with A. baumannii, a dangerous Gram-negative pathogen known to cause severe infections at various body sites, including the bloodstream, lungs, urinary tract, and wounds.

Ten lead AMPs from different sources displayed potent in vitro activity against A. baumannii: synechocucin-1 (AMP10.000_211,

) from Synechococcus sp. (coral-associated, marine microbiome); proteobactin-1 (AMP10.048_551,

) from Pseudomonadota (plant and soil microbiome); actynomycin-1 (AMP10.199_072,

) from Actinomyces (human mouth and saliva

Figure 7. Anti-infective activity of AMPs in a preclinical animal model
(A) Schematic of the skin abscess mouse model used to assess the anti-infective activity of the peptides against A. baumannii cells.
(B) Peptides were tested at their MIC in a single dose 2 h after the establishment of the infection. Each group consisted of three mice (n = 3), and the bacterial loads used to infect each mouse came from a different inoculum.
(C) To rule out toxic effects of the peptides, mouse weight was monitored throughout the experiment.
Statistical significance in (B) was determined using one-way ANOVA in which all groups were compared with the untreated control group; p values are shown for each group. Features on the violin plots represent the median and the upper and lower quartiles. Data in (C) are the mean ± the standard deviation. Figure created in BioRender.com.

microbiome); lachnospirin-1 (AMP10.015_742,

) from Lachnospira sp. (human gut microbiome); enterococcin-1 (AMP10.051_911,

) from Enterococcus faecalis (human gut microbiome); alphaprotecin-1 (AMP10.316_798,

) from Alphaproteobacteria (aquatic microbiome); oscillospirin-1 (AMP10.771_988,

) from Oscillospiraceae (pig gut microbiome); ampspherin-4 (AMP10.466_287,

) from an unknown source; methylocellin-1 (AMP10.446_

) from Methylocella sp. (soil microbiome); and reyranin-1 (AMP10.337_875,

) from Reyranella (plant and soil microbiome). A skin abscess infection was established with a bacterial load of

A. baumannii cells at

colony-forming units (CFUs)

on the wounded area of the dorsal epidermis (Figure 7A). A single dose of each peptide, at its MIC obtained in vitro (Figures 6C and S4A), was delivered to the infected area. Two days post-infection, synechocucin-1, actynomycin-1, and oscillospirin-1 showed bacteriostatic activity, preventing the proliferation of A. baumannii cells, whereas lachnospirin-1, enterococcin-1, ampspherin-4, and reyranin-1 showed bactericidal activity close to that of the antibiotic polymyxin B (at

), reducing the CFU counts by 3-4 orders of magnitude (Figure 7B). Four days post-infection, synechocucin-1, lachnospirin-1, enterococcin-1, and ampspherin-4 showed a bacteriostatic effect close to that of the antibiotic polymyxin B, reducing the CFU counts by 2-3 orders of magnitude compared with the untreated control (Figure S6C). These results highlight the anti-infective potential of the tested peptides from AMPSphere, as they were administered a single time, immediately after the establishment of the abscess. Mouse weight was monitored as a proxy for toxicity, and no significant changes were observed (Figures 7C and S6D), suggesting that the tested peptides were not toxic.
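The reductions quoted above in "orders of magnitude" are base-10 logarithms of the ratio between untreated and treated bacterial loads. A minimal worked example follows; the `log10_reduction` helper is hypothetical, comparing group medians is an assumption made here for illustration, and the CFU counts are invented rather than taken from Figure 7B.

```python
import math
from statistics import median


def log10_reduction(control_cfu, treated_cfu):
    """Orders of magnitude by which the median treated load lies below the median control load."""
    return math.log10(median(control_cfu) / median(treated_cfu))


# Invented CFU counts for three mice per group (illustrative only).
untreated = [2.0e8, 1.0e8, 5.0e7]
treated = [4.0e5, 1.0e5, 9.0e4]

reduction = log10_reduction(untreated, treated)  # median 1e8 versus median 1e5
```

With these invented numbers the medians differ by a factor of 1,000, that is, a three-order-of-magnitude reduction.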

DISCUSSION

Here, we used machine learning to identify nearly one million candidate natural antimicrobials in the global microbiome. Building on previous studies that focused specifically on the human gut microbiome, we cataloged AMPs from the global microbiome across 63,410 publicly available metagenomes as well as 87,920 high-quality microbial genomes from the ProGenomes2 database. This resulted in AMPSphere (https://ampsphere.big-data-biology.org/), a publicly available, open-access resource comprising 863,498 non-redundant peptides and 6,499 high-quality AMP families from 72 different habitats, including marine and soil environments and the human gut. Most c_AMPs (

) were previously unknown and lack detectable homologs in other databases, and roughly one in five had evidence of translation and/or transcription, as they could be detected in independent, publicly available sets of metatranscriptomes or metaproteomes.

We designed a set of tests to capture higher-quality predictions, but many peptides failed these tests despite evidence that they were active, including our own in vitro data and the presence of validated homologs in external databases. Low-prevalence peptides are less likely to pass the tests (RNAcode requires multiple variants), which is independent of their activity and is influenced by sampling biases.

Focusing on candidate AMPs that are directly encoded in the genome enabled in vitro and in vivo testing using chemical synthesis without post-translational modifications, but other processes also generate active peptides, such as encrypted peptides (EPs), which we used as a point of comparison. Notably, the amino acid composition and physicochemical properties of the validated AMPs from AMPSphere differ from those recently identified in EPs.

We explored two evolutionary mechanisms by which AMPs can be generated. First, mutations in genes encoding longer proteins can generate gene fragments via truncation. Among the protein clusters from GMGCv1 that are homologous to c_AMPs, we observed that most clusters were of unknown function (53.8%), similar to what was reported by Sberro et al. for small proteins from the human gut microbiome. The second mechanism is that a small protein gene can undergo duplication followed by mutation, which we observed in the case of ribosomal proteins. Ribosomal proteins can have antimicrobial activity, possibly due to their amyloidogenic properties. Other origins of AMPs may be horizontal gene transfer or ancestral non-coding sequences.

Nonetheless, the vast majority of the identified AMPs had no detectable homolog in other databases. The observed lack of homology may be due to limitations in our ability to robustly detect homologous relationships in small sequences, but it is also possible that small proteins, such as AMPs, are more likely to be generated de novo than longer proteins and may have evolved repeatedly in different clades. This may also explain the large fraction of c_AMPs in AMPSphere that do not cluster with any other sequences.

We observed that c_AMPs from AMPSphere were habitat-specific and were often accessory members of microbial pangenomes. Moreover, four of the five genera with the most c_AMPs in AMPSphere share a host-associated lifestyle, and three of these genera (Prevotella, Faecalibacterium, and CAG-110) are common in animal hosts (Figure 5).

Valles-Colomer et al., who recently analyzed a large collection of human-associated metagenomes, provide a species-specific index of transmissibility for several transmission scenarios that they study (e.g., from mother to infant). Hypothesizing that AMP production may be related to transmission, we correlated the species-specific c_AMP density

calculated in AMPSphere with the transmissibility scores. In both the human gut and oral microbiomes, species with a higher c_AMP density

are less transmissible, possibly because AMPs provide protection against strain replacement. Taken together, these results support the applicability of AMPSphere to the study of microbial ecology, as they point to a role for AMPs in determining the transmissibility and colonization capacity of microbes, which warrants further investigation and validation in future work.
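The comparison between per-species c_AMP density and transmissibility can be sketched as a rank correlation. Using Spearman's coefficient is an assumption made here for illustration (the statistic actually used is specified in the STAR Methods, not in this section), the helper functions are hypothetical, and the five density and transmissibility values are invented.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-species values (illustrative only).
amp_density = [0.2, 0.5, 0.9, 1.4, 2.0]
transmissibility = [0.8, 0.7, 0.5, 0.4, 0.1]

rho = spearman(amp_density, transmissibility)
```

A negative coefficient, as in this invented example, corresponds to the pattern described above: species with denser c_AMP repertoires tend to be less transmissible.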

Finally, we experimentally validated the predictions made by our machine learning model and found that 79 (out of 100) of the synthesized AMPs showed antimicrobial activity against either pathogens or commensals. Notably, four peptides (cagicin-1, cagicin-4, and enterococcin-1 against A. baumannii, and cagicin-1 and lachnospirin-1 against vancomycin-resistant E. faecium) presented MIC values as low as

, comparable to the MICs of some of the most potent peptides previously described in the literature.

We show that the tested AMPs from AMPSphere tended to target clinically relevant Gram-negative pathogens and displayed activity against vancomycin-resistant E. faecium. Although conventional AMPs do not target bacteria from the human gut microbiome, the tested AMPs from AMPSphere were effective against commensals, suggesting potential ecological implications of the peptides as protective agents for their producers and their ability to reshape microbiome communities.

When their in vivo activity was evaluated, three peptides showed anti-infective efficacy in a murine infection model, with lachnospirin-1 and enterococcin-1 being the most potent, reducing the bacterial load by up to three orders of magnitude. The active peptides included those derived from both human-associated and environmental microbiomes, which validates our approach of studying the global microbiome. Overall, our findings reveal a wide array of AMP sequences without matches in other databases, highlighting the potential of machine learning for discovering much-needed antimicrobials.

Limitations of the study

We focused on a specific class of AMPs, namely peptides encoded by their own genes and composed of up to 100 amino acids, which does not cover all active peptides. We explored the global microbiome as represented in public databases, and some habitats and regions of the globe have been explored far more than others. This uneven coverage also affects our quality estimates, as they depend on data availability. Nonetheless, we will continue to update the resource as new genomes and metagenomes become available. We present results based on finding homologs of our peptides, but matching small sequences against large databases has a higher error rate (particularly missed matches) than matching longer sequences. Our results on the transmissibility of microbial strains and AMP density were intended to demonstrate the value of AMPSphere as a resource, but a full validation of this relationship will be the focus of future work. Finally, we tested the peptides in vitro and in vivo against a panel of bacteria. Because we observed species- and even strain-specific responses, it is possible that peptides for which we observed no activity are active against strains not tested here.

STAR★METHODS

Detailed methods are provided in the online version of this paper and include the following:

Key resources table
Resource availability
Lead contact
Materials availability
Data and code availability
Experimental model and study participant details
Bacterial strains and growth conditions
Skin abscess infection mouse model
Method details
Selection of microbial (meta)genomes
Read trimming and assembly
smORF and AMP prediction
Clustering of AMP families
Quality control of c_AMPs
Sample-based c_AMP accumulation curves
Multi-habitat and rare c_AMPs
Testing c_AMP overlap across habitats
c_AMP density in microbial species
Transmissibility of c_AMPs and bacterial species
Determination of accessory AMPs
AMP annotation using different datasets
Genomic context conservation analysis
AMPSphere web resource
Peptide selection for synthesis and testing
Minimal inhibitory concentration (MIC) determination
Circular dichroism assays
Outer membrane permeabilization assays
Cytoplasmic membrane depolarization assays
Quantification and statistical analysis
Additional resources

SUPPLEMENTAL INFORMATION

Supplemental information can be found online at https://doi.org/10.1016/j.cell.2024.05.013.

ACKNOWLEDGMENTS

We thank Marija Dmitrijeva (University of Zurich) for helpful comments on a previous version of the manuscript. We thank Kaylyn Tousignant (Queensland University of Technology) for her assistance in editing the manuscript. We thank Georgina H. Joyce (Queensland University of Technology) for her assistance in designing the graphical abstract. We thank the members of the Coelho group and the de la Fuente Lab for fruitful discussions. C.F.-N. holds a Presidential Professorship at the University of Pennsylvania and acknowledges funding from the Procter & Gamble Company, United Therapeutics, a BBRF Young Investigator Grant, the Nemirovsky Prize, the Penn Health-Tech Accelerator Award, Defense Threat Reduction Agency grants HDTRA11810041 and HDTRA1-23-1-0001, and the Innovation Fund from the Perelman School of Medicine at the University of Pennsylvania. We thank Dr. Mark Goulian for kindly donating the strains Escherichia coli AIC221 (Escherichia coli MG1655 phnE_2:FRT [control strain for AIC222]) and Escherichia coli AIC222 (Escherichia coli MG1655 pmrA53 phnE_2:FRT [polymyxin resistant]). This work was funded in part by EMBL and by the following grants: National Natural Science Foundation of China grants T2225015 and 61932008 (L.P.C. and X.-M.Z.); Shanghai Science and Technology Commission Program grant 23JS1410100 (L.P.C. and X.-M.Z.); National Key R&D Program of China grants 2023YFF1204800 and 2020YFA0712403 (L.P.C. and X.-M.Z.); Shanghai Municipal Science and Technology Major Project grant 2018SHZDZX01 (L.P.C. and X.-M.Z.); Lingang Laboratory and National Key Laboratory of Human Factors Engineering joint grant LG-TKN-202203-01 (X.-M.Z.); Science and Technology Commission of Shanghai Municipality grant 22JC1410900 (L.P.C.); Australian Research Council grant FT230100724 (L.P.C.); the Langer Prize from the AIChE Foundation (C.F.-N.); National Institutes of Health grant R35GM138201 (C.F.-N.); Defense Threat Reduction Agency grant HDTRA1-21-1-0014 (C.F.-N.); PID2021-127210NB-I00, MCIN/AEI/10.13039/501100011033/FEDER, UE (J.H.-C.); "la Caixa" Foundation ID 100010434, fellowship code LCF/BQ/DI18/11660009 (A.R.d.R.); and the European Union's Horizon 2020 research and innovation program under Marie Skłodowska-Curie grant agreement 713673 (A.R.d.R.).

AUTHOR CONTRIBUTIONS

Conceptualization, C.D.S.-J., L.P.C., M.D.T.T., and C.F.-N.; data curation, C.D.S.-J., Y.D., T.S.B.S., M.K., A.F., L.P.C., M.D.T.T., and C.F.-N.; formal analysis, C.D.S.-J., L.P.C., and M.D.T.T.; funding acquisition, L.P.C., X.-M.Z., and C.F.-N.; investigation, C.D.S.-J., L.P.C., M.D.T.T., and C.F.-N.; methodology, C.D.S.-J., Y.D., J.H.-C., A.R.d.R., L.P.C., M.D.T.T., and C.F.-N.; project administration, L.P.C., M.K., X.-M.Z., P.B., and C.F.-N.; resources, L.P.C., X.-M.Z., and C.F.-N.; supervision, L.P.C. and C.F.-N.; visualization, C.D.S.-J., J.H.-C., J.S., A.V., A.H., C.Z., L.P.C., and M.D.T.T.; writing – original draft, C.D.S.-J., M.D.T.T., C.F.-N., and L.P.C.; writing – review & editing, C.D.S.-J., Y.D., J.H.-C., A.R.d.R., T.S.B.S., A.F., P.B., X.-M.Z., L.P.C., M.D.T.T., and C.F.-N.

DECLARATION OF INTERESTS

C.F.-N. provides consulting services to Invaio Sciences and is a member of the scientific advisory boards of Nowture S.L. and Phare Bio. The de la Fuente Lab has received research funding or in-kind donations from United Therapeutics, Strata Manufacturing PJSC, and Procter & Gamble, none of which were used in support of this work. An invention disclosure associated with this work has been filed.

Received: June 14, 2023
Revised: April 11, 2024
Accepted: May 6, 2024
Published: June 5, 2024

REFERENCES

de la Fuente-Nunez, C., Torres, M.D., Mojica, F.J., and Lu, T.K. (2017). Next-generation precision antimicrobials: towards personalized treatment of infectious diseases. Curr. Opin. Microbiol. 37, 95-102. https://doi.org/10.1016/j.mib.2017.05.014.
Antimicrobial Resistance Collaborators (2022). Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet 399, 629-655. https://doi.org/10.1016/S0140-6736(21)02724-0.
Stokes, J.M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N.M., MacNair, C.R., French, S., Carfrae, L.A., Bloom-Ackermann, Z., et al. (2020). A Deep Learning Approach to Antibiotic Discovery. Cell 180, 688-702.e13. https://doi.org/10.1016/j.cell.2020.01.021.
Torres, M.D.T., Melo, M.C.R., Flowers, L., Crescenzi, O., Notomista, E., and de la Fuente-Nunez, C. (2022). Mining for encrypted peptide antibiotics in the human proteome. Nat. Biomed. Eng. 6, 67-75. https://doi.org/10.1038/s41551-021-00801-1.
Porto, W.F., Irazazabal, L., Alves, E.S.F., Ribeiro, S.M., Matos, C.O., Pires, Á.S., Fensterseifer, I.C.M., Miranda, V.J., Haney, E.F., Humblot, V., et al. (2018). In silico optimization of a guava antimicrobial peptide enables combinatorial exploration for peptide design. Nat. Commun. 9, 1490. https://doi.org/10.1038/s41467-018-03746-3.
Ma, Y., Guo, Z., Xia, B., Zhang, Y., Liu, X., Yu, Y., Tang, N., Tong, X., Wang, M., Ye, X., et al. (2022). Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat. Biotechnol. 40, 921-931. https://doi.org/10.1038/s41587-022-01226-0.
Wong, F., de la Fuente-Nunez, C., and Collins, J.J. (2023). Leveraging artificial intelligence in the fight against infectious diseases. Science 381, 164-170. https://doi.org/10.1126/science.adh1114.
Cesaro, A., Bagheri, M., Torres, M., Wan, F., and de la Fuente-Nunez, C. (2023). Deep learning tools to accelerate antibiotic discovery. Expert Opin. Drug Discov. 18, 1245-1257. https://doi.org/10.1080/17460441.2023.2250721.
Torres, M.D.T., and de la Fuente-Nunez, C. (2019). Toward computer-made artificial antibiotics. Curr. Opin. Microbiol. 51, 30-38. https://doi.org/10.1016/j.mib.2019.03.004.
Maasch, J.R.M.A., Torres, M.D.T., Melo, M.C.R., and de la Fuente-Nunez, C. (2023). Molecular de-extinction of ancient antimicrobial peptides enabled by machine learning. Cell Host Microbe 31, 1260-1274.e6. https://doi.org/10.1016/j.chom.2023.07.001.
Besse, A., Vandervennet, M., Goulard, C., Peduzzi, J., Isaac, S., Rebuffat, S., and Carré-Mlouka, A. (2017). Halocin C8: an antimicrobial peptide distributed among four halophilic archaeal genera: Natrinema, Haloterrigena, Haloferax, and Halobacterium. Extremophiles 21, 623-638. https://doi.org/10.1007/s00792-017-0931-5.
Cotter, P.D., Ross, R.P., and Hill, C. (2013). Bacteriocins – a viable alternative to antibiotics? Nat. Rev. Microbiol. 11, 95-105. https://doi.org/10.1038/nrmicro2937.
Wang, S., Zheng, Z., Zou, H., Li, N., and Wu, M. (2019). Characterization of the secondary metabolite biosynthetic gene clusters in archaea. Comput. Biol. Chem. 78, 165-169. https://doi.org/10.1016/j.compbiolchem.2018.11.019.
Zasloff, M. (2019). Antimicrobial Peptides of Multicellular Organisms: My Perspective. In Antimicrobial Peptides: Basics for Clinical Application, K. Matsuzaki, ed. (Springer Singapore), pp. 3-6. https://doi.org/10.1007/978-981-13-3588-4_1.
Huang, K.-Y., Chang, T.-H., Jhong, J.-H., Chi, Y.-H., Li, W.-C., Chan, C.L., Robert Lai, K., and Lee, T.-Y. (2017). Identification of natural antimicrobial peptides from bacteria through metagenomic and metatranscriptomic analysis of high-throughput transcriptome data of Taiwanese oolong teas. BMC Syst. Biol. 11, 131. https://doi.org/10.1186/s12918-017-0503-4.
Torres, M.D.T., Sothiselvam, S., Lu, T.K., and de la Fuente-Nunez, C. (2019). Peptide Design Principles for Antimicrobial Applications. J. Mol. Biol. 431, 3547-3567. https://doi.org/10.1016/j.jmb.2018.12.015.
Pizzo, E., Cafaro, V., Di Donato, A., and Notomista, E. (2018). Cryptic Antimicrobial Peptides: Identification Methods and Current Knowledge of their Immunomodulatory Properties. Curr. Pharm. Des. 24, 1054-1066. https://doi.org/10.2174/1381612824666180327165012.
Nolan, E.M., and Walsh, C.T. (2009). How nature morphs peptide scaffolds into antibiotics. Chembiochem 10, 34-53. https://doi.org/10.1002/cbic.200800438.
Singh, N., and Abraham, J. (2014). Ribosomally synthesized peptides from natural sources. J. Antibiot. 67, 277-289. https://doi.org/10.1038/ja.2013.138.
García-Bayona, L., and Comstock, L.E. (2018). Bacterial antagonism in host-associated microbial communities. Science 361, eaat2456. https://doi.org/10.1126/science.aat2456.
Anderson, M.C., Vonaesch, P., Saffarian, A., Marteyn, B.S., and Sansonetti, P.J. (2017). Shigella sonnei encodes a functional T6SS used for interbacterial competition and niche occupancy. Cell Host Microbe 21, 769-776.e3. https://doi.org/10.1016/j.chom.2017.05.004.
Krismer, B., Weidenmaier, C., Zipperer, A., and Peschel, A. (2017). The commensal lifestyle of Staphylococcus aureus and its interactions with the nasal microbiota. Nat. Rev. Microbiol. 15, 675-687. https://doi.org/10.1038/nrmicro.2017.104.
Zhao, W., Caro, F., Robins, W., and Mekalanos, J.J. (2018). Antagonism toward the intestinal microbiota and its effect on Vibrio cholerae virulence. Science 359, 210-213. https://doi.org/10.1126/science.aap8775.
Quereda, J.J., Nahori, M.A., Meza-Torres, J., Sachse, M., Titos-Jiménez, P., Gomez-Laguna, J., Dussurget, O., Cossart, P., and Pizarro-Cerdá, J. (2017). Listeriolysin S is a streptolysin s-like virulence factor that targets exclusively prokaryotic cells in vivo. mBio 8, e00259-17. https://doi.org/ .
Quereda, J.J., Dussurget, O., Nahori, M.A., Ghozlane, A., Volant, S., Dillies, M.A., Regnault, B., Kennedy, S., Mondot, S., Villoing, B., et al. (2016). Bacteriocin from epidemic Listeria strains alters the host intestinal microbiota to favor infection. Proc. Natl. Acad. Sci. USA 113, 5706-5711. https://doi.org/10.1073/pnas.1523899113.
Gomes, B., Augusto, M.T., Felício, M.R., Hollmann, A., Franco, O.L., Gonçalves, S., and Santos, N.C. (2018). Designing improved active peptides for therapeutic approaches against infectious diseases. Biotechnol. Adv. 36, 415-429. https://doi.org/10.1016/j.biotechadv.2018.01.004.
Lesiuk, M., Paduszyńska, M., and Greber, K.E. (2022). Synthetic Antimicrobial Immunomodulatory Peptides: Ongoing Studies and Clinical Trials. Antibiotics (Basel) 11, 1062. https://doi.org/10.3390/antibiotics11081062.
Mahlapuu, M., Håkansson, J., Ringstad, L., and Björn, C. (2016). Antimicrobial Peptides: An Emerging Category of Therapeutic Agents. Front. Cell. Infect. Microbiol. 6, 235805.
Baquero, F., Lanza, V.F., Baquero, M.R., Del Campo, R., and Bravo-Vázquez, D.A. (2019). Microcins in Enterobacteriaceae: peptide antimicrobials in the eco-active intestinal chemosphere. Front. Microbiol. 10, 2261. https://doi.org/10.3389/fmicb.2019.02261.
Kim, S.G., Becattini, S., Moody, T.U., Shliaha, P.V., Littmann, E.R., Seok, R., Gjonbalaj, M., Eaton, V., Fontana, E., Amoretti, L., et al. (2019). Microbiota-derived lantibiotic restores resistance against vancomycin-resistant Enterococcus. Nature 572, 665-669. https://doi.org/10.1038/s41586-019-1501-z.
Nakatsuji, T., Hata, T.R., Tong, Y., Cheng, J.Y., Shafiq, F., Butcher, A.M., Salem, S.S., Brinton, S.L., Rudman Spergel, A.K., Johnson, K., et al. (2021). Development of a human skin commensal microbe for bacteriotherapy of atopic dermatitis and use in a phase 1 randomized clinical trial. Nat. Med. 27, 700-709. https://doi.org/10.1038/s41591-021-01256-2.
Spohn, R., Daruka, L., Lázár, V., Martins, A., Vidovics, F., Grézal, G., Méhi, O., Kintses, B., Számel, M., Jangir, P.K., et al. (2019). Integrated evolutionary analysis reveals antimicrobial peptides with limited resistance. Nat. Commun. 10, 4538. https://doi.org/10.1038/s41467-019-12364-6.
Cesaro, A., Torres, M.D.T., Gaglione, R., Dell’Olmo, E., Di Girolamo, R., Bosso, A., Pizzo, E., Haagsman, H.P., Veldhuizen, E.J.A., de la Fuente-Nunez, C., and Arciello, A. (2022). Synthetic Antibiotic Derived from Sequences Encrypted in a Protein from Human Plasma. ACS Nano 16, 1880-1895. https://doi.org/10.1021/acsnano.1c04496.
Hyatt, D., Chen, G.-L., LoCascio, P.F., Land, M.L., Larimer, F.W., and Hauser, L.J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf. 11, 119. https://doi.org/10.1186/1471-2105-11-119.
Ahrens, C.H., Wade, J.T., Champion, M.M., and Langer, J.D. (2022). A Practical Guide to Small Protein Discovery and Characterization Using Mass Spectrometry. J. Bacteriol. 204, e0035321. https://doi.org/10.1128/JB.00353-21.
Storz, G., Wolf, Y.I., and Ramamurthi, K.S. (2014). Small Proteins Can No Longer Be Ignored. Annu. Rev. Biochem. 83, 753-777. https://doi.org/10.1146/annurev-biochem-070611-102400.
Su, M., Ling, Y., Yu, J., Wu, J., and Xiao, J. (2013). Small proteins: untapped area of potential biological importance. Front. Genet. 4, 286. https://doi.org/10.3389/fgene.2013.00286.
Sberro, H., Fremin, B.J., Zlitni, S., Edfors, F., Greenfield, N., Snyder, M.P., Pavlopoulos, G.A., Kyrpides, N.C., and Bhatt, A.S. (2019). Large-Scale Analyses of Human Microbiomes Reveal Thousands of Small, Novel Genes. Cell 178, 1245-1259.e14. https://doi.org/10.1016/j.cell.2019.07.016.
Donia, M.S., Cimermancic, P., Schulze, C.J., Wieland Brown, L.C., Martin, J., Mitreva, M., Clardy, J., Linington, R.G., and Fischbach, M.A. (2014). A systematic analysis of biosynthetic gene clusters in the human microbiome reveals a common family of antibiotics. Cell 158, 1402-1414. https://doi.org/10.1016/j.cell.2014.08.032.
Fingerhut, L.C.H.W., Miller, D.J., Strugnell, J.M., Daly, N.L., and Cooke, I.R. (2021). ampir: an R package for fast genome-wide prediction of antimicrobial peptides. Bioinformatics 36, 5262-5263. https://doi.org/10.1093/bioinformatics/btaa653.
Sugimoto, Y., Camacho, F.R., Wang, S., Chankhamjon, P., Odabas, A., Biswas, A., Jeffrey, P.D., and Donia, M.S. (2019). A metagenomic strategy for harnessing the chemical repertoire of the human microbiome. Science 366, eaax9176. https://doi.org/10.1126/science.aax9176.
Santos-Júnior, C.D., Pan, S., Zhao, X.-M., and Coelho, L.P. (2020). Macrel: antimicrobial peptide screening in genomes and metagenomes. PeerJ 8, e10555. https://doi.org/10.7717/peerj.10555.
Mende, D.R., Letunic, I., Maistrenko, O.M., Schmidt, T.S.B., Milanese, A., Paoli, L., Hernández-Plaza, A., Orakov, A.N., Forslund, S.K., Sunagawa, S., et al. (2020). proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res. 48, D621-D625. https://doi.org/10.1093/nar/gkz1002.
Navidinia, M. (2016). The clinical importance of emerging ESKAPE pathogens in nosocomial infections. Archives of Advances in Biosciences 7, 43-57. https://doi.org/10.22037/jps.v7i3.12584.
Mulani, M.S., Kamble, E.E., Kumkar, S.N., Tawre, M.S., and Pardesi, K.R. (2019). Emerging Strategies to Combat ESKAPE Pathogens in the Era of Antimicrobial Resistance: A Review. Front. Microbiol. 10, 539. https://doi.org/10.3389/fmicb.2019.00539.
Shi, G., Kang, X., Dong, F., Liu, Y., Zhu, N., Hu, Y., Xu, H., Lao, X., and Zheng, H. (2022). DRAMP 3.0: an enhanced comprehensive data repository of antimicrobial peptides. Nucleic Acids Res. 50, D488-D496. https://doi.org/10.1093/nar/gkab651.
Zhang, L.-J., and Gallo, R.L. (2016). Antimicrobial peptides. Curr. Biol. 26, R14-R19. https://doi.org/10.1016/j.cub.2015.11.017.
Bhadra, P., Yan, J., Li, J., Fong, S., and Siu, S.W.I. (2018). AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci. Rep. 8, 1697. https://doi.org/10.1038/s41598-018-19752-w.
Hao, Y., Zhang, L., Niu, Y., Cai, T., Luo, J., He, S., Zhang, B., Zhang, D., Qin, Y., Yang, F., and Chen, R. (2018). SmProt: a database of small proteins encoded by annotated coding and non-coding RNA loci. Brief. Bioinform. 19, 636-643. https://doi.org/10.1093/bib/bbx005.
Venturini, E., Svensson, S.L., Maaß, S., Gelhausen, R., Eggenhofer, F., Li, L., Cain, A.K., Parkhill, J., Becher, D., Backofen, R., et al. (2020). A global data-driven census of Salmonella small proteins and their potential functions in bacterial virulence. microLife 1, uqaa002. https://doi.org/10.1093/femsml/uqaa002.
Aguilera-Mendoza, L., Marrero-Ponce, Y., Beltran, J.A., Tellez Ibarra, R., Guillen-Ramirez, H.A., and Brizuela, C.A. (2019). Graph-based data integration from bioactive peptide databases of pharmaceutical interest: toward an organized collection enabling visual network analysis. Bioinformatics 35, 4739-4747. https://doi.org/10.1093/bioinformatics/btz260.
Coelho, L.P., Alves, R., Del Río, Á.R., Myers, P.N., Cantalapiedra, C.P., Giner-Lamia, J., Schmidt, T.S., Mende, D.R., Orakov, A., Letunic, I., et al. (2022). Towards the biogeography of prokaryotic genes. Nature 601, 252-256. https://doi.org/10.1038/s41586-021-04233-4.
Veltri, D., Kamath, U., and Shehu, A. (2018). Deep learning improves antimicrobial peptide recognition. Bioinformatics 34, 2740-2747. https://doi.org/10.1093/bioinformatics/bty179.
Lawrence, T.J., Carper, D.L., Spangler, M.K., Carrell, A.A., Rush, T.A., Minter, S.J., Weston, D.J., and Labbé, J.L. (2021). amPEPpy 1.0: a portable and accurate antimicrobial peptide prediction tool. Bioinformatics 37, 2058-2060. https://doi.org/10.1093/bioinformatics/btaa917.
Su, X., Xu, J., Yin, Y., Quan, X., and Zhang, H. (2019). Antimicrobial peptide identification using multi-scale convolutional network. BMC Bioinf. 20, 730. https://doi.org/10.1186/s12859-019-3327-y.
Lin, T.-T., Yang, L.-Y., Lu, I.-H., Cheng, W.-C., Hsu, Z.-R., Chen, S.-H., and Lin, C.-Y. (2021). AI4AMP: an Antimicrobial Peptide Predictor Using Physicochemical Property-Based Encoding Method and Deep Learning. mSystems 6, e0029921. https://doi.org/10.1128/mSystems.00299-21.
Li, C., Sutherland, D., Hammond, S.A., Yang, C., Taho, F., Bergman, L., Houston, S., Warren, R.L., Wong, T., Hoang, L.M.N., et al. (2022). AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens. BMC Genom. 23, 77. https://doi.org/10.1186/s12864-022-08310-4.
Murphy, L.R., Wallqvist, A., and Levy, R.M. (2000). Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng. 13, 149-152. https://doi.org/10.1093/protein/13.3.149.
Heintz-Buschart, A., May, P., Laczny, C.C., Lebrun, L.A., Bellora, C., Krishna, A., Wampach, L., Schneider, J.G., Hogan, A., de Beaufort, C., and Wilmes, P. (2016). Integrated multi-omics of the human gut microbiome in a case study of familial type 1 diabetes. Nat. Microbiol. 2, 16180. https://doi.org/10.1038/nmicrobiol.2016.180.
Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S.K., Cook, H., Mende, D.R., Letunic, I., Rattei, T., Jensen, L.J., et al. (2019). eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309-D314. https://doi.org/10.1093/nar/gky1085.
Rodríguez del Río, Á., Giner-Lamia, J., Cantalapiedra, C.P., Botas, J., Deng, Z., Hernández-Plaza, A., Munar-Palmer, M., Santamaría-Hernando, S., Rodríguez-Herva, J.J., Ruscheweyh, H.-J., et al. (2023). Functional and evolutionary significance of unknown genes from uncultivated taxa. Nature, 1-3. https://doi.org/10.1038/s41586-023-06955-z.
Hurtado-Rios, J.J., Carrasco-Navarro, U., Almanza-Pérez, J.C., and Ponce-Alquicira, E. (2022). Ribosomes: The New Role of Ribosomal Proteins as Natural Antimicrobials. Int. J. Mol. Sci. 23, 9123. https://doi.org/10.3390/ijms23169123.
Shoja, V., and Zhang, L. (2006). A Roadmap of Tandemly Arrayed Genes in the Genomes of Human, Mouse, and Rat. Mol. Biol. Evol. 23, 2134-2141. https://doi.org/10.1093/molbev/msl085.
Sukhodolets, V.V. (2006). Unequal crossing-over in Escherichia coli. Russ. J. Genet. 42, 1285-1293. https://doi.org/10.1134/S102279540611010X.
Kim, M.K., Kang, T.H., Kim, J., Kim, H., and Yun, H.D. (2012). Evidence Showing Duplication and Recombination of cel Genes in Tandem from Hyperthermophilic Thermotoga sp. Appl. Biochem. Biotechnol. 168, 1834-1848. https://doi.org/10.1007/s12010-012-9901-7.
Blaustein, R.A., McFarland, A.G., Ben Maamar, S., Lopez, A., Castro-Wallace, S., and Hartmann, E.M. (2019). Pangenomic Approach To Understanding Microbial Adaptations within a Model Built Environment, the International Space Station, Relative to Human Hosts and Soil. mSystems 4, e00281-18. https://doi.org/10.1128/mSystems.00281-18.
Collins, F.W.J., Mesa-Pereira, B., O’Connor, P.M., Rea, M.C., Hill, C., and Ross, R.P. (2018). Reincarnation of Bacteriocins From the Lactobacillus Pangenomic Graveyard. Front. Microbiol. 9, 1298. https://doi.org/10.3389/fmicb.2018.01298.
Parks, D.H., Rinke, C., Chuvochina, M., Chaumeil, P.-A., Woodcroft, B.J., Evans, P.N., Hugenholtz, P., and Tyson, G.W. (2017). Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533-1542. https://doi.org/10.1038/s41564-017-0012-7.
Parks, D.H., Chuvochina, M., Chaumeil, P.-A., Rinke, C., Mussig, A.J., and Hugenholtz, P. (2020). A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotechnol. 38, 1079-1086. https://doi.org/10.1038/s41587-020-0501-8.
Simmons, W.L., Daubenspeck, J.M., Osborne, J.D., Balish, M.F., Waites, K.B., and Dybvig, K. (2013). Type 1 and type 2 strains of Mycoplasma pneumoniae form different biofilms. Microbiology (Read.) 159, 737-747. https://doi.org/10.1099/mic.0.064782-0.
Diaz, M.H., Desai, H.P., Morrison, S.S., Benitez, A.J., Wolff, B.J., Caravas, J., Read, T.D., Dean, D., and Winchell, J.M. (2017). Comprehensive bioinformatics analysis of Mycoplasma pneumoniae genomes to investigate underlying population structure and type-specific determinants. PLoS One 12, e0174701. https://doi.org/10.1371/journal.pone.0174701.
Valles-Colomer, M., Blanco-Míguez, A., Manghi, P., Asnicar, F., Dubois, L., Golzato, D., Armanini, F., Cumbo, F., Huang, K.D., Manara, S., et al. (2023). The person-to-person transmission landscape of the gut and oral microbiomes. Nature 614, 125-135. https://doi.org/10.1038/s41586-022-05620-1.
Pirtskhalava, M., Amstrong, A.A., Grigolava, M., Chubinidze, M., Alimbarashvili, E., Vishnepolsky, B., Gabrielian, A., Rosenthal, A., Hurt, D.E., and Tartakovsky, M. (2021). DBAASP v3: database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Res. 49, D288-D297. https://doi.org/10.1093/nar/gkaa991.
Wang, G., Li, X., and Wang, Z. (2016). APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res. 44, D1087-D1093. https://doi.org/10.1093/nar/gkv1278.
Micsonai, A., Moussong, É., Wien, F., Boros, E., Vadászi, H., Murvai, N., Lee, Y.-H., Molnár, T., Réfrégiers, M., Goto, Y., et al. (2022). BeStSel: webserver for secondary structure and fold prediction for protein CD spectroscopy. Nucleic Acids Res. 50, W90-W98. https://doi.org/10.1093/nar/gkac345.
Lifson, S., and Sander, C. (1979). Antiparallel and parallel β-strands differ in amino acid residue preferences. Nature 282, 109-111. https://doi.org/10.1038/282109a0.
Derrien, M., Collado, M.C., Ben-Amor, K., Salminen, S., and de Vos, W.M. (2008). The Mucin Degrader Akkermansia muciniphila Is an Abundant Resident of the Human Intestinal Tract. Appl. Environ. Microbiol. 74, 1646-1648. https://doi.org/10.1128/AEM.01226-07.
Earley, H., Lennon, G., Balfe, Á., Coffey, J.C., Winter, D.C., and O’Connell, P.R. (2019). The abundance of Akkermansia muciniphila and its relationship with sulphated colonic mucins in health and ulcerative colitis. Sci. Rep. 9, 15683. https://doi.org/10.1038/s41598-019-51878-3.
Daquigan, N., Seekatz, A.M., Greathouse, K.L., Young, V.B., and White, J.R. (2017). High-resolution profiling of the gut microbiome reveals the extent of Clostridium difficile burden. npj Biofilms Microbiomes 3, 35. https://doi.org/10.1038/s41522-017-0043-0.
Saenz, C., Fang, Q., Gnanasekaran, T., Trammell, S.A.J., Buijink, J.A., Pisano, P., Wierer, M., Moens, F., Lengger, B., Brejnrod, A., and Arumugam, M. (2023). Clostridium scindens secretome suppresses virulence gene expression of Clostridioides difficile in a bile acid-independent manner. Microbiol. Spectr. 11, e0393322. https://doi.org/10.1128/spectrum.03933-22.
Geerlings, S.Y., Kostopoulos, I., De Vos, W.M., and Belzer, C. (2018). Akkermansia muciniphila in the Human Gastrointestinal Tract: When, Where, and How? Microorganisms 6, 75. https://doi.org/10.3390/microorganisms6030075.
Cullen, T.W., Schofield, W.B., Barry, N.A., Putnam, E.E., Rundell, E.A., Trent, M.S., Degnan, P.H., Booth, C.J., Yu, H., and Goodman, A.L. (2015). Antimicrobial peptide resistance mediates resilience of prominent gut commensals during inflammation. Science 347, 170-175. https://doi.org/10.1126/science.1260580.
Torres, M.D.T., Pedron, C.N., Araújo, I., Silva, P.I., Silva, F.D., and Oliveira, V.X. (2017). Decoralin Analogs with Increased Resistance to Degradation and Lower Hemolytic Activity. ChemistrySelect 2, 18-23. https://doi.org/10.1002/slct.201601590.
Torres, M.D.T., Pedron, C.N., Higashikuni, Y., Kramer, R.M., Cardoso, M.H., Oshiro, K.G.N., Franco, O.L., Silva Junior, P.I., Silva, F.D., Oliveira Junior, V.X., et al. (2018). Structure-function-guided exploration of the antimicrobial peptide polybia-CP identifies activity determinants and generates synthetic therapeutic candidates. Commun. Biol. 1, 221. https://doi.org/10.1038/s42003-018-0224-2.
Silva, O.N., Torres, M.D.T., Cao, J., Alves, E.S.F., Rodrigues, L.V., Resende, J.M., Lião, L.M., Porto, W.F., Fensterseifer, I.C.M., Lu, T.K., et al. (2020). Repurposing a peptide toxin from wasp venom into anti-infectives with dual antimicrobial and immunomodulatory properties. Proc. Natl. Acad. Sci. USA 117, 26936-26945. https://doi.org/10.1073/pnas.2012379117.
Morris, F.C., Dexter, C., Kostoulias, X., Uddin, M.I., and Peleg, A.Y. (2019). The Mechanisms of Disease Caused by Acinetobacter baumannii. Front. Microbiol. 10, 1601.
Petruschke, H., Schori, C., Canzler, S., Riesbeck, S., Poehlein, A., Daniel, R., Frei, D., Segessemann, T., Zimmerman, J., Marinos, G., et al. (2021). Discovery of novel community-relevant small proteins in a simplified human intestinal microbiome. Microbiome 9, 55. https://doi.org/10.1186/s40168-020-00981-z.
Washietl, S., Findeiß, S., Müller, S.A., Kalkhof, S., von Bergen, M., Hofacker, I.L., Stadler, P.F., and Goldman, N. (2011). RNAcode: Robust discrimination of coding and noncoding regions in comparative sequence data. RNA 17, 578-594. https://doi.org/10.1261/rna.2536111.
Galzitskaya, O.V. (2021). Exploring Amyloidogenicity of Peptides From Ribosomal S1 Protein to Develop Novel AMPs. Front. Mol. Biosci. 8, 705069. https://doi.org/10.3389/fmolb.2021.705069.
Ochman, H., Lawrence, J.G., and Groisman, E.A. (2000). Lateral gene transfer and the nature of bacterial innovation. Nature 405, 299-304. https://doi.org/10.1038/35012500.
Zheng, D., and Gerstein, M.B. (2007). The ambiguous boundary between genes and pseudogenes: the dead rise up, or do they? Trends Genet. 23, 219-224. https://doi.org/10.1016/j.tig.2007.03.003.
Lazzaro, B.P., Zasloff, M., and Rolff, J. (2020). Antimicrobial peptides: Application informed by evolution. Science 368, eaau5480. https://doi.org/10.1126/science.aau5480.
Sun, S., Wang, H., Howard, A.G., Zhang, J., Su, C., Wang, Z., Du, S., Fodor, A.A., Gordon-Larsen, P., and Zhang, B. (2022). Loss of Novel Diversity in Human Gut Microbiota Associated with Ongoing Urbanization in China. mSystems 7, e0020022. https://doi.org/10.1128/msystems.00200-22.
Piquer-Esteban, S., Ruiz-Ruiz, S., Arnau, V., Diaz, W., and Moya, A. (2022). Exploring the universal healthy human gut microbiota around the World. Comput. Struct. Biotechnol. J. 20, 421-433. https://doi.org/10.1016/j.csbj.2021.12.035.
Dhakan, D.B., Maji, A., Sharma, A.K., Saxena, R., Pulikkan, J., Grace, T., Gomez, A., Scaria, J., Amato, K.R., and Sharma, V.K. (2019). The unique composition of Indian gut microbiome, gene catalogue, and associated fecal metabolome deciphered using multi-omics approaches. GigaScience 8, giz004. https://doi.org/10.1093/gigascience/giz004.
Coelho, L.P., Alves, R., Monteiro, P., Huerta-Cepas, J., Freitas, A.T., and Bork, P. (2019). NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language. Microbiome 7, 84. https://doi.org/10.1186/s40168-019-0684-8.
Coelho, L.P. (2017). Jug: Software for Parallel Reproducible Computation in Python. J. Open Res. Softw. 5, 30. https://doi.org/10.5334/jors.161.
Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152. https://doi.org/10.1093/bioinformatics/bts565.
Steinegger, M., and Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026-1028. https://doi.org/10.1038/nbt.3988.
Van Rossum, G. (2020). Python Release Python 3.8.2. Python.org. https://www.python.org/downloads/release/python-382/.
Hunter, J.D. (2007). Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 9, 90-95. https://doi.org/10.1109/MCSE.2007.55.
Harris, C.R., Millman, K.J., van der Walt, S.J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., et al. (2020). Array programming with NumPy. Nature 585, 357-362. https://doi.org/10.1038/s41586-020-2649-2.
McKinney, W. (2010). Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, pp. 56-61. https://doi.org/10.25080/Majora-92bf1922-00a.
Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., et al. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261-272. https://doi.org/10.1038/s41592-019-0686-2.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825-2830.
The scikit-bio development team (2020). scikit-bio: A Bioinformatics Library for Data Scientists, Students, and Developers. Version 0.5.5.
Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., and de Hoon, M.J.L. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422-1423. https://doi.org/10.1093/bioinformatics/btp163.
Cantalapiedra, C.P., Hernández-Plaza, A., Letunic, I., Bork, P., and Huerta-Cepas, J. (2021). eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol. Biol. Evol. 38, 5825-5829. https://doi.org/10.1093/molbev/msab293.
Eddy, S.R. (2011). Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195. https://doi.org/10.1371/journal.pcbi.1002195.
Price, M.N., Dehal, P.S., and Arkin, A.P. (2010). FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One 5, e9490. https://doi.org/10.1371/journal.pone.0009490.
Jain, C., Rodriguez-R, L.M., Phillippy, A.M., Konstantinidis, K.T., and Aluru, S. (2018). High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114. https://doi.org/10.1038/s41467-018-07641-9.
Li, D., Luo, R., Liu, C.M., Leung, C.M., Ting, H.F., Sadakane, K., Yamashita, H., and Lam, T.W. (2016). MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods 102, 3-11. https://doi.org/10.1016/j.ymeth.2016.02.020.
Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760. https://doi.org/10.1093/bioinformatics/btp324.
Seabold, S., and Perktold, J. (2010). Statsmodels: Econometric and Statistical Modeling with Python. In Proceedings of the 9th Python in Science Conference, pp. 92-96. https://doi.org/10.25080/Majora-92bf1922-011.
Milanese, A., Mende, D.R., Paoli, L., Salazar, G., Ruscheweyh, H.-J., Cuenca, M., Hingamp, P., Alves, R., Costea, P.I., Coelho, L.P., et al. (2019). Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014. https://doi.org/10.1038/s41467-019-08844-4.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R.; 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079. https://doi.org/10.1093/bioinformatics/btp352.
Quinlan, A.R., and Hall, I.M. (2010). BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842. https://doi.org/10.1093/bioinformatics/btq033.
Sievers, F., Wilm, A., Dineen, D., Gibson, T.J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., et al. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539. https://doi.org/10.1038/msb.2011.75.
Buchfink, B., Xie, C., and Huson, D.H. (2015). Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59-60. https://doi.org/10.1038/nmeth.3176.
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. (2009). BLAST+: architecture and applications. BMC Bioinf. 10, 421. https://doi.org/10.1186/1471-2105-10-421.
UniProt Consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480-D489. https://doi.org/10.1093/nar/gkaa1100.
Mistry, J., Chuguransky, S., Williams, L., Qureshi, M., Salazar, G.A., Sonnhammer, E.L.L., Tosatto, S.C.E., Paladin, L., Raj, S., Richardson, L.J., et al. (2021). Pfam: The protein families database in 2021. Nucleic Acids Res. 49, D412-D419. https://doi.org/10.1093/nar/gkaa913.
Eberhardt, R.Y., Haft, D.H., Punta, M., Martin, M., O’Donovan, C., and Bateman, A. (2012). AntiFam: a tool to help identify spurious ORFs in protein annotation. Database 2012, bas003. https://doi.org/10.1093/database/bas003.
NCBI Resource Coordinators (2015). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 43, D6-D17. https://doi.org/10.1093/nar/gku1130.
Alcock, B.P., Raphenya, A.R., Lau, T.T.Y., Tsang, K.K., Bouchard, M., Edalatmand, A., Huynh, W., Nguyen, A.-L.V., Cheng, A.A., Liu, S., et al. (2020). CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res. 48, D517-D525. https://doi.org/10.1093/nar/gkz935.
Kanehisa, M., and Sato, Y. (2020). KEGG Mapper for inferring cellular functions from protein sequences. Protein Sci. 29, 28-35. https://doi.org/10.1002/pro.3711.
Courtot, M., Cherubin, L., Faulconbridge, A., Vaughan, D., Green, M., Richardson, D., Harrison, P., Whetzel, P.L., Parkinson, H., and Burdett, T. (2019). BioSamples database: an updated sample metadata hub. Nucleic Acids Res. 47, D1172-D1178. https://doi.org/10.1093/nar/gky1061.
Harrison, P.W., Ahamed, A., Aslam, R., Alako, B.T.F., Burgin, J., Buso, N., Courtot, M., Fan, J., Gupta, D., Haseeb, M., et al. (2021). The European Nucleotide Archive in 2020. Nucleic Acids Res. 49, D82-D85. https://doi.org/10.1093/nar/gkaa1028.
Jones, P., Côté, R.G., Martens, L., Quinn, A.F., Taylor, C.F., Derache, W., Hermjakob, H., and Apweiler, R. (2006). PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res. 34, D659-D663. https://doi.org/10.1093/nar/gkj138.
Schmidt, T.S.B., Fullam, A., Ferretti, P., Orakov, A., Maistrenko, O.M., Ruscheweyh, H.-J., Letunic, I., Duan, Y., Van Rossum, T., Sunagawa, S., et al. (2024). SPIRE: a Searchable, Planetary-scale mIcrobiome REsource. Nucleic Acids Res. 52, D777-D783. https://doi.org/10.1093/nar/gkad943.
Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J., and Levy Karin, E. (2021). Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029-3031. https://doi.org/10.1093/bioinformatics/btab184.
Oren, A., Arahal, D.R., Rosselló-Móra, R., Sutcliffe, I.C., and Moore, E.R.B. (2021). Emendation of Rules 5b, 8, 15 and 22 of the International Code of Nomenclature of Prokaryotes to include the rank of phylum. Int. J. Syst. Evol. Microbiol. 71. https://doi.org/10.1099/ijsem.0.004851.
Oren, A., and Garrity, G.M. (2021). Valid publication of the names of forty-two phyla of prokaryotes. Int. J. Syst. Evol. Microbiol. 71. https://doi.org/10.1099/ijsem.0.005056.
Solis, A.D. (2015). Amino acid alphabet reduction preserves fold information contained in contact interactions in proteins. Proteins 83, 2198-2216. https://doi.org/10.1002/prot.24936.
Peterson, E.L., Kondev, J., Theriot, J.A., and Phillips, R. (2009). Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment. Bioinformatics 25, 1356-1362. https://doi.org/10.1093/bioinformatics/btp164.
Smith, T.F., and Waterman, M.S. (1981). Identification of Common Molecular Subsequences. J. Mol. Biol. 147, 195-197. https://doi.org/10.1016/0022-2836(81)90087-5.
Karlin, S., and Altschul, S.F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264-2268. https://doi.org/10.1073/pnas.87.6.2264.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.
Cena, J.A. de, Zhang, J., Deng, D., Damé-Teixeira, N., and Do, T. (2021). Low-Abundant Microorganisms: The Human Microbiome’s Dark Matter, a Scoping Review. Front. Cell. Infect. Microbiol. 11, 689197.
Mende, D.R., Sunagawa, S., Zeller, G., and Bork, P. (2013). Accurate and universal delineation of prokaryotic species. Nat. Methods 10, 881-884. https://doi.org/10.1038/nmeth.2575.
Sélem-Mojica, N., Aguilar, C., Gutiérrez-García, K., Martínez-Guerrero, C.E., and Barona-Gómez, F. (2019). EvoMining reveals the origin and fate of natural product biosynthetic enzymes. Microb. Genom. 5, e000260. https://doi.org/10.1099/mgen.0.000260.
Rodriguez-R, L.M., Conrad, R.E., Viver, T., Feistel, D.J., Lindner, B.G., Venter, S.N., Orellana, L.H., Amann, R., Rossello-Mora, R., and Konstantinidis, K.T. (2024). An ANI gap within bacterial species that advances the definitions of intra-species units. mBio 15, e02696-23. https://doi.org/10.1128/mbio.02696-23.
Finn, R.D., Coggill, P., Eberhardt, R.Y., Eddy, S.R., Mistry, J., Mitchell, A.L., Potter, S.C., Punta, M., Qureshi, M., Sangrador-Vegas, A., et al. (2016). The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279-D285. https://doi.org/10.1093/nar/gkv1344.
SolyPep: a fast generator of soluble peptides https://bioserv.rpbs.univ-paris-diderot.fr/services/SolyPep/
Ochoa, R., and Cossio, P. (2021). PepFun: Open Source Protocols for Peptide-Related Computational Analysis. Molecules 26, 1664. https://doi.org/10.3390/molecules26061664.
Kochendoerfer, G.G., and Kent, S.B. (1999). Chemical protein synthesis. Curr. Opin. Chem. Biol. 3, 665-671. https://doi.org/10.1016/s1367-5931(99)00024-1.
Sheppard, R. (2003). The fluorenylmethoxycarbonyl group in solid phase synthesis. J. Pept. Sci. 9, 545-552. https://doi.org/10.1002/psc.479.
Palomo, J.M. (2014). Solid-phase peptide synthesis: an overview focused on the preparation of biologically relevant peptides. RSC Adv. 4, 32658-32672. https://doi.org/10.1039/C4RA02458C.
Schmidt, T.S.B., Li, S.S., Maistrenko, O.M., Akanni, W., Coelho, L.P., Dolai, S., Fullam, A., Glazek, A.M., Hercog, R., Herrema, H., et al. (2022). Drivers and determinants of strain dynamics following fecal microbiota transplantation. Nat. Med. 28, 1902-1912. https://doi.org/10.1038/s41591-022-01913-0.
Wiegand, I., Hilpert, K., and Hancock, R.E.W. (2008). Agar and broth dilution methods to determine the minimal inhibitory concentration (MIC) of antimicrobial substances. Nat. Protoc. 3, 163-175. https://doi.org/10.1038/nprot.2007.521.
Santos-Júnior, C.D., Schmidt, T.S.B., Fullam, A., Duan, Y., Bork, P., Zhao, X.-M., and Coelho, L.P. (2021). AMPSphere: The Worldwide Survey of Prokaryotic Antimicrobial Peptides (Zenodo). https://doi.org/10.5281/zenodo.4606582.

STAR Methods

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Bacterial and virus strains
Acinetobacter baumannii	American Type Culture Collection	ATCC 19606
Escherichia coli	American Type Culture Collection	ATCC 11775
Escherichia coli	Escherichia coli MG1655 phnE_2::FRT	AIC221
Escherichia coli	Escherichia coli MG1655 pmrA53 phnE_2::FRT (polymyxin-resistant; colistin-resistant strain)	AIC222
Klebsiella pneumoniae	American Type Culture Collection	ATCC 13883
Pseudomonas aeruginosa	N/A	PAO1
Pseudomonas aeruginosa	N/A	PA14
Staphylococcus aureus	American Type Culture Collection	ATCC 12600
Staphylococcus aureus	American Type Culture Collection	ATCC BAA-1556 (methicillin-resistant strain)
Akkermansia muciniphila	American Type Culture Collection	ATCC BAA-635
Bacteroides fragilis	American Type Culture Collection	ATCC 25285
Bacteroides thetaiotaomicron	American Type Culture Collection	ATCC 29148
Bacteroides uniformis	American Type Culture Collection	ATCC 8492
Bacteroides vulgatus (Phocaeicola vulgatus)	American Type Culture Collection	ATCC 8482
Collinsella aerofaciens	American Type Culture Collection	ATCC 25986
Clostridium scindens	American Type Culture Collection	ATCC 35704
Parabacteroides distasonis	American Type Culture Collection	ATCC 8503
Chemicals, peptides, and recombinant proteins
Luria-Bertani broth	BD	244620
Tryptic soy broth	Sigma	T8907-1KG
Agar	Sigma	05039
MacConkey agar	RPI	M42560-500.0
Phosphate-buffered saline	Sigma	P3913-10PAK
Glucose	Sigma	G5767
1-(N-phenylamino)naphthalene	Sigma	104043
3,3′-Dipropylthiadicarbocyanine iodide	Sigma	43608
HEPES	Fisher	BP310-100
Potassium chloride (KCl)	Sigma	P3911
Deposited data
Code to generate AMPSphere	This study	https://doi.org/10.5281/zenodo.11055585
AMPSphere database	This study	https://zenodo.org/record/4606582
Experimental models: Organisms/strains
Mouse: CD-1	Charles River	18679700-022
Software and algorithms
NGLess 1.3.0	Coelho et al.	https://github.com/ngless-toolkit/ngless
Jug 2.1.1	Coelho	https://github.com/luispedro/jug
Prodigal 2.6.3	Hyatt et al.	https://github.com/hyattpd/Prodigal
Macrel v.1.0.0	Santos-Júnior et al.	https://github.com/BigDataBiology/macrel
CD-HIT 4.8.1	Fu et al.	https://github.com/weizhongli/cdhit
MMseqs2	Steinegger and Söding	https://github.com/soedinglab/MMseqs2
Python 3.8.2	Van Rossum	https://www.python.org/
Matplotlib 3.4.3	Hunter	https://matplotlib.org/
NumPy 1.21.2	Harris et al.	https://numpy.org/
pandas 1.3.2	McKinney	https://pandas.pydata.org/
Plotly 5.2.1	Plotly Technologies Inc., 2015	https://plot.ly
SciPy 1.7.1	Virtanen et al.	https://www.scipy.org
scikit-learn 0.24	Pedregosa et al.	https://scikit-learn.org/
scikit-bio 0.5.6	The scikit-bio development team	http://scikit-bio.org/
Biopython 1.7.9	Cock et al.	https://biopython.org/
eggNOG-mapper v2	Cantalapiedra et al.	https://github.com/eggnogdb/eggnog-mapper
HMMer 3.3+dfsg2-1	Eddy	http://hmmer.org/
FastTree 2.1	Price et al.	http://www.microbesonline.org/fasttree/
FastANI v.1.33	Jain et al.	https://github.com/ParBLiSS/FastANI
MEGAHIT 1.2.9	Li et al.	https://github.com/voutcn/megahit/
AMPlify	Li et al.	https://github.com/bcgsc/AMPlify
ampir	Fingerhut et al.	https://github.com/Legana/ampir
AMP Scanner v2	Veltri et al.	https://www.dveltri.com/ascan/v2/ascan.html
APIN	Su et al.	https://github.com/zhanglabNKU/APIN
amPEPpy 1.0	Lawrence et al.	https://github.com/tlawrence3/amPEPpy
AI4AMP	Lin et al.	https://github.com/LinTzuTang/AI4AMP_predictor
RNAcode 0.2-beta	Washietl et al.	https://github.com/ViennaRNA/RNAcode
BWA 0.7.17	Li et al.	https://github.com/lh3/bwa
statsmodels 0.14.0	Seabold and Perktold	https://www.statsmodels.org
mOTUs2	Milanese et al.	https://github.com/motu-tool/mOTUs
SAMtools 1.18	Li et al.	https://github.com/samtools/samtools
BEDTools v2.31.0	Quinlan and Hall	https://github.com/arq5x/bedtools2
Clustal Omega 1.2.2	Sievers et al.	http://clustal.org/omega/
DIAMOND v2.1.8	Buchfink et al.	https://github.com/bbuchfink/diamond
BLAST+ 2.13.0	Camacho et al.	https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html
Other
proGenomes2	Mende et al.	http://progenomes.embl.de/
DRAMP (Data Repository of Antimicrobial Peptides) 3.0	Shi et al.	http://dramp.cpu-bioinfor.org/
UniProtKB 2021_03	UniProt Consortium	https://www.uniprot.org/
eggNOG 5.0	Huerta-Cepas et al.	http://eggnog5.embl.de/
SmProt database v2.0	Hao et al.	http://bigdata.ibp.ac.cn/SmProt/index.html
starPep45k	Aguilera-Mendoza et al.	http://mobiosd-hub.com/starpep
Pfam 33.1	Mistry et al.	http://pfam.xfam.org/
AntiFam 7.0	Eberhardt et al.	https://www.ebi.ac.uk/research/bateman/software/antifam-tool-identify-spurious-proteins
GTDB 07-RS95	Parks et al.	https://gtdb.ecogenomic.org/
NCBI release 207	NCBI Resource Coordinators	https://ftp.ncbi.nih.gov/refseq/release/
Database of Antimicrobial Activity and Structure of Peptides (DBAASP)	Pirtskhalava et al.	https://dbaasp.org/home
Antimicrobial Peptide Database (APD3)	Wang et al.	https://aps.unmc.edu/
Salmonella Typhimurium small ORFs (STsORFs)	Venturini et al.	https://academic.oup.com/microlife/article/1/1/uqaa002/5928550#supplementary-data
Comprehensive Antibiotic Resistance Database (CARD)	Alcock et al.	https://card.mcmaster.ca/
Kyoto Encyclopedia of Genes and Genomes (KEGG) release 102	Kanehisa et al.	https://www.genome.jp/kegg/
BioSamples database	Courtot et al.	http://www.ebi.ac.uk/biosamples
European Nucleotide Archive (ENA)	Harrison et al.	https://www.ebi.ac.uk/ena
PRoteomics IDEntifications database (PRIDE)	Jones et al.	https://www.ebi.ac.uk/pride/

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to, and will be fulfilled by, the lead contact, Luis Pedro Coelho (luispedro@big-data-biology.org).

Materials availability

This study did not generate new unique reagents.

Data and code availability

Metagenome and genome data are publicly available in the European Nucleotide Archive (ENA) as of the date of publication. Their accession numbers are listed in Table S1. AMPSphere is available as a public online resource (https://ampsphere.big-data-biology.org/), and its files have been deposited at Zenodo and are publicly available as of the date of publication. DOIs are listed in the key resources table.
All original code has been deposited at Zenodo and is publicly available as of the date of publication. DOIs are listed in the key resources table.
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Experimental model and study participant details

Bacterial strains and growth conditions

The pathogenic strains Acinetobacter baumannii ATCC 19606, Escherichia coli ATCC 11775, Escherichia coli AIC221 [Escherichia coli MG1655 phnE_2::FRT (control strain for AIC222)], Escherichia coli AIC222 [Escherichia coli MG1655 pmrA53 phnE_2::FRT (polymyxin-resistant; colistin-resistant strain)], Klebsiella pneumoniae ATCC 13883, Pseudomonas aeruginosa PAO1, Pseudomonas aeruginosa PA14, Staphylococcus aureus ATCC 12600, Staphylococcus aureus ATCC BAA-1556 (methicillin-resistant strain), Enterococcus faecalis ATCC 700802 (vancomycin-resistant strain), and Enterococcus faecium ATCC 700221 (vancomycin-resistant strain) were grown from frozen stocks on Luria-Bertani (LB) agar plates and incubated overnight at [...]. After incubation, a single isolated colony was transferred to 6 mL of LB medium, and the cultures were incubated overnight (16 h) at [...]. The following day, inocula were prepared by diluting the overnight cultures [...] in 6 mL of the respective medium and incubating them at [...] until the bacteria reached logarithmic phase [...].

The gut commensal strains Akkermansia muciniphila ATCC BAA-635, Bacteroides fragilis ATCC 25285, Bacteroides thetaiotaomicron ATCC 29148, Bacteroides uniformis ATCC 8492, Bacteroides vulgatus ATCC 8482 (Phocaeicola vulgatus), Collinsella aerofaciens ATCC 25986, Clostridium scindens ATCC 35704, and Parabacteroides distasonis ATCC 8503 were grown from frozen stocks on brain heart infusion (BHI) agar plates supplemented with [...] vitamin [...], [...] hemin [...] diluted in 10 mL of 1 N NaOH, and [...] L-cysteine ([...]), and incubated overnight at [...]. Resazurin was used as an oxygen indicator. After the incubation period, a single isolated colony was transferred to 3 mL of BHI broth and incubated overnight at [...]. The following day, inocula were prepared by diluting the overnight bacterial cultures [...] in 3 mL of BHI broth and incubating them at [...] until the cells reached logarithmic phase [...].

Skin abscess infection mouse model

To assess the anti-infective efficacy of the peptides against A. baumannii ATCC 19606 in a skin abscess infection mouse model, the bacteria were grown in tryptic soy broth (TSB) until [...] reached 0.5. The cells were then washed twice with sterile PBS (pH 7.4) and resuspended to a final concentration of [...] colony-forming units (CFU) per [...]. Six-week-old female CD-1 mice were anesthetized with isoflurane and subjected to a superficial linear skin abrasion on their backs, in an area they could not reach with their mouths or limbs. An aliquot of [...] of the solution containing the bacterial load was applied to the abraded area. A single dose of each peptide, diluted in water at its minimum inhibitory concentration (MIC), was administered to the infected area 2 h after infection. The animals were euthanized two and four days post-infection, and the infected area was excised, homogenized for 20 min using a bead beater (25 Hz), and 10-fold serially diluted for CFU quantification on MacConkey agar plates, which facilitate the differentiation of A. baumannii colonies. The experimental groups consisted of 3 CD-1 mice per group ([...]), all female, and each mouse was infected with an inoculum from a different colony to ensure variability. The animals were housed individually to avoid cross-contamination. All mice were used three days after arrival from the commercial supplier. The skin abscess infection mouse model was approved by the University Laboratory Animal Resources (ULAR) of the University of Pennsylvania (protocol 806763).

Method details

Selection of microbial (meta)genomes

The selection of metagenomes and genomes for AMPSphere was similar to that adopted by Coelho et al. Publicly available metagenomes as of January 1, 2020, produced with Illumina instruments (excluding MiSeq, to ensure the consistency and reliability of the meta-analysis results) and containing at least 2 million reads with an average length of 75 bp, were downloaded from the European Nucleotide Archive (ENA). These samples met one or both of two criteria: (1) they were tagged with taxonomy ID 408169 (for metagenome) or one of its descendants in the taxonomic tree; and/or (2) they came from experiments in which the library source was listed as "METAGENOMIC". Samples were grouped by project, and all projects with at least 20 samples were included in the analysis. In addition, metagenomes deposited by the Integrated Microbial Genomes (IMG) system that were missing from ENA were included. Metadata were manually curated from the literature describing each sample and from the BioSamples database.

Habitat classification groups were created based on the similarity of habitat conditions, such as air, anthropogenic, aquatic, host-associated, pH: alkaline, sediment, terrestrial, and others. Sample origins and host species information were obtained using the NCBI taxonomy identifier. High-quality microbial genomes were selected from the ProGenomes2 database. The 63,410 publicly available metagenomes and 87,920 high-quality microbial genomes are listed in Table S1.

Reads trimming and assembly

Reads were processed with NGLess: positions with quality lower than 25 were trimmed, and reads shorter than 60 bp after trimming were discarded. Metagenomes derived from host-associated microbiomes were additionally filtered to remove reads mapping to the host genome, when available. The reads, comprising more than 14.7 trillion bases of sequenced DNA, were assembled with MEGAHIT 1.2.9. The taxonomy of the resulting contigs was inferred as described previously, using MMSeqs2 to map sequences against GTDB release […]. The mapped taxonomic lineages were manually reviewed to conform to the International Code of Nomenclature of Prokaryotes.

smORF and AMP prediction

Similarly to Sberro et al., we used a modified version of Prodigal to predict smORFs (33-303 bp) from the contigs. The redundant smORFs, most of which ([…]) originated from metagenomes, were then deduplicated to optimize the use of computational resources, yielding […] non-redundant smORFs. Macrel was run on the deduplicated smORFs to predict c_AMPs. Singleton sequences (those appearing in a single sample or genome) were excluded, except when they had a significant match (amino acid identity […] and E value […]) to a sequence from the Data Repository of Antimicrobial Peptides (DRAMP) version 3.0, found with the 'easy-search' method of MMSeqs2.

In total, AMPSphere comprises 863,498 non-redundant predicted c_AMPs encoded by […] redundant genes. AMP densities were estimated as the number of AMPs per assembled base pair in a sample or species.

AMP genes originating from ProGenomes were assigned the taxonomy of their genome of origin, whereas AMP genes from metagenomes were assigned the predicted taxonomy of the contig on which they were found. Insights into potential structural conformations were obtained with the secondary structure fraction function of the ProtParam module implemented in SeqUtils of Biopython. This function computes the fraction of amino acids that tend to adopt helix [VIYFWL], turn [NPGS], and sheet [EMAL] conformations.
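The residue-class fractions described above can be illustrated with a minimal pure-Python sketch. It does not call Biopython; it simply counts residues in the three sets quoted in the text, and the function name is a hypothetical choice made for this example.

```python
# Residue classes exactly as quoted in the text above.
HELIX = set("VIYFWL")
TURN = set("NPGS")
SHEET = set("EMAL")


def secondary_structure_fraction(peptide: str) -> tuple[float, float, float]:
    """Return the (helix, turn, sheet) residue fractions of `peptide`."""
    seq = peptide.upper()
    if not seq:
        raise ValueError("empty peptide")
    n = len(seq)
    helix = sum(aa in HELIX for aa in seq) / n
    turn = sum(aa in TURN for aa in seq) / n
    sheet = sum(aa in SHEET for aa in seq) / n
    return helix, turn, sheet
```

Because leucine appears in both the helix and the sheet set, the three fractions are not mutually exclusive and can sum to more than one.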

Clustering of AMP families

Clustering peptides by sequence identity is only feasible at high identities, because short matches of low or moderate identity can occur by chance. The goal is therefore to recover matches in which the essential features are conserved even when the individual amino acids are not identical.

We used a reduced amino acid alphabet of 8 letters: [LVIMC], [AG], [ST], [FYW], [EDNQ], [KR], [P], [H]. After alphabet reduction, c_AMPs were hierarchically clustered with CD-Hit using three sequential identity cutoffs ([…], […], and […]). A cluster was considered an AMP family when it contained at least 8 sequences. Representative sequences of peptide clusters were selected by length (taking the longest), with ties broken by alphabetical order.
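The alphabet reduction and the choice of cluster representative can be sketched as follows. The eight groups are the ones listed above; collapsing each group to its first letter and taking the alphabetically first sequence among equally long candidates are assumptions made for illustration, and the function names are hypothetical.

```python
# The eight residue groups from the text; each group is collapsed to its
# first letter (the representative letter is an arbitrary choice here).
GROUPS = ["LVIMC", "AG", "ST", "FYW", "EDNQ", "KR", "P", "H"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_alphabet(peptide: str) -> str:
    """Rewrite a peptide in the 8-letter reduced alphabet."""
    return "".join(REDUCE[aa] for aa in peptide.upper())


def pick_representative(cluster: list[str]) -> str:
    """Longest sequence wins; ties go to the alphabetically first one."""
    return min(cluster, key=lambda s: (-len(s), s))
```

Residues outside the 20 standard amino acids raise a `KeyError` in this sketch, which keeps unexpected characters from being silently merged into a group.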

To validate this clustering procedure, we used a sample of 3,000 sequences randomly drawn from AMPSphere, excluding cluster representatives. These sequences were aligned against the representative of their cluster using the Smith-Waterman algorithm with the BLOSUM62 cost matrix and gap opening and extension penalties of -10 and -0.5, respectively. The alignment score was then converted to an E value according to the model established by Karlin and Altschul, using values of the constants […] (0.313667) adjusted for searching with a short input sequence, as implemented in BLAST. An alignment was considered significant if its E value was lower than […]. We found that more than […] of the alignments produced at the first two levels ([…] identity) were significant, along with […] of those from the third level ([…] identity); see Figure S3.

Quality control of c_AMPs

The c_AMPs in AMPSphere were submitted to six other AMP prediction systems (AMPScanner v2, ampir with the mature peptides model, amPEPpy, APIN with their proposed model, AI4AMP, and AMPLify). c_AMP genes were also subjected to five different quality tests to reduce the likelihood that the observed peptides are artifacts or fragments of larger proteins. First, the peptides were searched against AntiFam v.7.0, which is designed to identify spurious ORFs, using HMMSearch with the option "--cut_ga". Fewer than […] of c_AMPs had any significant hit.

For each smORF, we searched for an in-frame stop codon upstream of its start codon. When no stop codon is found, we cannot rule out that the smORF is part of a larger gene that we cannot observe because of fragmented assembly. Most ([…]) c_AMPs are encoded by at least one non-terminal gene. However, the fact that a c_AMP is terminal does not imply that it is an artifact, because AMP genes are short enough to be recovered even on short contigs. For example, […] ([…]) of the homologs to DRAMP version 3.0 were found as terminal c_AMPs in AMPSphere.

RNAcode predicts protein-coding regions based on the evolutionary signatures typical of protein genes. This analysis depends on a set of homologous, non-identical genes. Therefore, AMP clusters containing at least three gene variants were aligned. Because a large proportion of AMPSphere candidates ([…] of 863,498) are not part of such a cluster, they could not be tested. Of the c_AMPs tested, […] of 403,588 ([…]) were considered genes with the evolutionary features of protein-coding sequences.

We then checked for evidence of transcription and/or translation using 221 publicly available metatranscriptome datasets, comprising human gut (142), peat (48), plant (13), and symbionts (17), and 109 publicly available metaproteome datasets from the PRIDE database, covering 37 habitats (Table S6). Using bwa v.0.7.17, reads from the metatranscriptomes were mapped against the non-redundant AMP genes and, using NGLess, we selected the genes with at least one read mapped across a minimum of two samples to increase our confidence. This approach is similar to the one adopted when predicting AMPs.

Using regular expressions implemented in Python, k-mers of all AMPSphere peptides (with a length of at least half the sequence length) were compared against the peptide sequences in the metaproteomic data. A perfect match between a k-mer and a metaproteomic peptide was considered additional evidence that the c_AMP is likely translated, as described by Ma et al. Briefly, the number of c_AMPs mapped against the set of metaproteome samples was computed, and c_AMPs with at least one match covering more than […] of the peptide were flagged as detected. c_AMPs with experimental evidence in metatranscriptomes and/or metaproteomes made up approximately 20% of AMPSphere.
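The k-mer comparison against metaproteomic peptides can be sketched as below. This is only an illustration: it uses exact set lookups rather than the regular expressions mentioned in the text, the function names are hypothetical, and rounding the half-length threshold up is an assumption.

```python
import math


def kmers_at_least_half(peptide: str):
    """Yield every substring at least half as long as the peptide."""
    n = len(peptide)
    k_min = math.ceil(n / 2)  # assumption: the half-length is rounded up
    for k in range(k_min, n + 1):
        for start in range(n - k + 1):
            yield peptide[start:start + k]


def has_metaproteome_support(peptide: str, observed_peptides: set[str]) -> bool:
    """True if any qualifying k-mer is identical to an observed peptide."""
    return any(kmer in observed_peptides for kmer in kmers_at_least_half(peptide))
```

For a 4-residue peptide this enumerates six substrings (three of length 2, two of length 3, one of length 4), so the cost grows roughly quadratically with peptide length, which is manageable for peptides of at most 100 residues.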

The mapping of c_AMPs was performed without considering the genomic context, which may lead to an overestimate of the candidates identified as potentially transcribed. For example, if a c_AMP is homologous to longer proteins, the presence of the longer gene may lead to a false-positive detection of the shorter c_AMP. We investigated this using Fisher's exact test to compare the proportion of AMPs homologous to the GMGCv1 database with experimental evidence of translation ([…] of 61,020 peptides, odds ratio […]) and/or transcription ([…] of 61,020 peptides, odds ratio […]). The results suggest that our approach tends to slightly overestimate the transcription and translation potential of candidates that have canonical-length homologs.

Because only a small number of metatranscriptomic or metaproteomic datasets were available, and given the above limitations in interpreting the mappings, we considered AMPs that pass all quality-control tests to be high quality, regardless of evidence of translation or transcription. We also separated those with experimental evidence of translation/transcription (17,115 c_AMPs, approximately […] of AMPSphere) from those without (63,098 c_AMPs, approximately 7%). For c_AMP families, we considered high quality those in which […] of c_AMPs pass all quality-control tests or those containing at least one c_AMP with experimental evidence of translation/transcription.

Sample-based c_AMP accumulation curves

To determine the saturation of c_AMP discovery, for each habitat or group of habitats, we computed sample-based accumulation curves by randomly sampling metagenomes in steps of 10 metagenomes. This process was repeated 32 times, and the average was taken.
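A sample-based accumulation curve of this kind can be sketched in a few lines. The sketch assumes that "randomly sampling in steps of 10" means shuffling the metagenomes and adding them 10 at a time; the function name, the input layout (one set of c_AMP identifiers per metagenome), and the seed are hypothetical.

```python
import random


def accumulation_curve(samples, step=10, repeats=32, seed=0):
    """Average number of distinct c_AMPs seen after 10, 20, ... samples.

    `samples` is a list of sets, one set of c_AMP identifiers per metagenome.
    """
    rng = random.Random(seed)
    sizes = list(range(step, len(samples) + 1, step))
    totals = [0] * len(sizes)
    for _ in range(repeats):
        order = samples[:]
        rng.shuffle(order)  # a fresh random order for every repetition
        seen = set()
        done = 0
        for i, size in enumerate(sizes):
            for sample in order[done:size]:
                seen |= sample
            done = size
            totals[i] += len(seen)
    return [(size, total / repeats) for size, total in zip(sizes, totals)]
```

A curve that flattens as more samples are added suggests that the habitat is close to saturation, whereas a curve that keeps rising suggests that further sampling would still uncover new c_AMPs.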

Multi-habitat and rare c_AMPs

We first counted the c_AMPs present in […] habitats ("multi-habitat AMPs"). To test the significance of this value, we adopted an approach similar to that described in Coelho et al.: the habitat labels of the samples were shuffled 100 times, and the resulting number of multi-habitat c_AMPs was counted each time. Shuffling the labels resulted in […] multi-habitat c_AMPs by chance for the high-level habitat groups, and in […] multi-habitat c_AMPs by chance when considering habitats individually within the high-level groups. The Shapiro-Wilk test was used to verify that the resulting data were normally distributed ([…] for specific habitats; […] for high-level habitats). In the original (non-shuffled) data, the high-level habitat groups presented 93,280 multi-habitat c_AMPs (136.21 standard deviations below the shuffled value), whereas the specific habitats presented 173,955 multi-habitat c_AMPs (117.1 standard deviations below the shuffled value).
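The label-shuffling test can be sketched as follows. The threshold of at least two habitats for a "multi-habitat" c_AMP, the input layout (dictionaries keyed by sample name), the use of the population standard deviation, and the function names are all assumptions made for this illustration.

```python
import math
import random
import statistics


def count_multi_habitat(sample_amps, sample_habitat, min_habitats=2):
    """Count c_AMPs observed in at least `min_habitats` distinct habitats."""
    habitats_per_amp = {}
    for sample, amps in sample_amps.items():
        for amp in amps:
            habitats_per_amp.setdefault(amp, set()).add(sample_habitat[sample])
    return sum(len(h) >= min_habitats for h in habitats_per_amp.values())


def shuffled_z_score(sample_amps, sample_habitat, n_shuffles=100, seed=0):
    """Z score of the observed count relative to label-shuffled counts."""
    rng = random.Random(seed)
    observed = count_multi_habitat(sample_amps, sample_habitat)
    samples = list(sample_habitat)
    labels = [sample_habitat[s] for s in samples]
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # break the link between samples and habitats
        null.append(count_multi_habitat(sample_amps, dict(zip(samples, labels))))
    mean = statistics.mean(null)
    sd = statistics.pstdev(null)
    if sd == 0:
        return 0.0 if observed == mean else math.copysign(math.inf, observed - mean)
    return (observed - mean) / sd
```

A strongly negative Z score, as reported above, means that far fewer c_AMPs span several habitats than would be expected if habitat labels carried no information.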

To determine the rarity of c_AMPs, we adapted the protocol previously established by Coelho et al., in which reads from the metagenomic samples were mapped against the non-redundant genes in AMPSphere using NGLess. We considered only uniquely mapped reads. From the mapping, we computed the c_AMPs detected per sample and the number of detections per c_AMP, considering as "rare c_AMPs" those detected less often than the average for all of AMPSphere (682 detections, or […] of all samples, as previously described for species). This approach was adopted to avoid the high computational cost of competitive mapping. We expect our approach to overestimate the prevalence of c_AMPs and, for that reason, to be a robust way of estimating c_AMP rarity.

Because the high-quality designation requires at least 3 gene variants for the RNAcode test, rare genes will not be high quality. Nevertheless, for robustness, we quantified this effect by computing the mean and median number of detections in high-quality c_AMPs only and in non-terminal c_AMPs only (a test that does not require a minimum number of genes). The mean number of detections was 682 for the complete set, 789 for high-quality c_AMPs, and 679 for non-terminal c_AMPs.

Testing c_AMP overlap across habitats

As done when testing the significance of the observed number of multi-habitat c_AMPs, the number of overlapping c_AMPs was computed for each pair of habitats. We shuffled the sample labels 1,000 times, counting the number of c_AMPs overlapping by chance for each pair of habitats. We then estimated the probability of the observed overlap using Chebyshev's inequality, which makes no assumptions about the data distribution, because we observed with the Shapiro-Wilk test that the shuffled counts were not normally distributed. Chebyshev's inequality is $P(|X - \mu| \geq k\sigma) \leq 1/k^{2}$, where $k$ represents the Z score computed from the mean and standard deviation estimated by the shuffling procedure. The p values were adjusted using the Holm-Sidak method implemented in multipletests from the statsmodels package, and those below 0.05 were considered significant.
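The distribution-free bound can be sketched as below. The function name is hypothetical, the population standard deviation is an assumption (the text does not say which estimator was used), and the multiple-testing correction is left out of this sketch.

```python
import statistics


def chebyshev_p_upper_bound(observed: float, shuffled: list[float]) -> float:
    """Chebyshev upper bound on P(|X - mean| >= |observed - mean|)."""
    mean = statistics.mean(shuffled)
    sd = statistics.pstdev(shuffled)
    if sd == 0:
        return 1.0 if observed == mean else 0.0
    k = abs(observed - mean) / sd  # the Z score of the observed overlap
    if k <= 1:
        return 1.0  # the bound 1/k^2 is uninformative for k <= 1
    return 1.0 / (k * k)
```

Because the bound holds for any distribution with finite variance, it is conservative: an observed overlap 4 standard deviations from the shuffled mean yields a bound of 0.0625, far larger than the corresponding normal-theory p value.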

c_AMP density in microbial species

c_AMP density was defined as $\rho = n_{AMP}/L$, where $n_{AMP}$ is the number of redundant c_AMP genes and $L$ is the number of assembled base pairs. We assume, as an approximation, that within a large assembled portion, the start positions of AMP genes are independent and uniformly random. We then computed the standard error of the sample proportion with the formula $\mathrm{STDerr} = \sqrt{\rho(1-\rho)/L}$. The standard error of the sample proportion was used to compute the margin of error at the […] confidence interval ([…]).

To gain insight into the contributions of different phyla, species, and genera to AMPSphere, we computed c_AMP density for these taxonomic levels using the c_AMPs included in AMPSphere, and summed all assembled base pairs of the contigs assigned to each taxonomic level in the samples used in AMPSphere.
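The density and its margin of error can be computed as in the sketch below, which treats each assembled base pair as an independent trial for the start of an AMP gene. The function name is hypothetical, and the default critical value z = 1.96 (a 95% interval) is only a placeholder assumption, since the confidence level is not recoverable from the text above.

```python
import math


def amp_density(n_amp_genes: int, assembled_bp: int, z: float = 1.96):
    """Return (density, standard error, margin of error) for one taxon or sample."""
    rho = n_amp_genes / assembled_bp
    # Standard error of a sample proportion with `assembled_bp` trials.
    std_err = math.sqrt(rho * (1.0 - rho) / assembled_bp)
    return rho, std_err, z * std_err
```

For example, 50 c_AMP genes in 1 Mbp of assembled sequence give a density of 5e-5 per base pair with a standard error of about 7.07e-6, so the relative uncertainty is large when few genes are observed.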

Genera, species, and phyla with a margin of error exceeding […] of the computed value were excluded, along with outliers according to Tukey's fences ([…]). We estimated the presence and abundance of species in each sample using mOTUs2. None of the genera with the highest […] (Algorimicrobium, TMED78, SFJ001, STGJ01, and CAG-462) were very common microbes.

c_AMPs and bacterial species transmissibility

We used the species taxonomy and transmissibility indices computed by Valles-Colomer et al. to assess the effect of AMPs on the transmission of bacterial species from mothers to children. Only species overlapping between AMPSphere and the datasets of Valles-Colomer et al. were used in this analysis, and their AMP densities were computed as described in the previous section ("c_AMP density in microbial species"), using all c_AMPs predicted from the metagenomes and genomes we obtained, including those not in AMPSphere, to avoid sampling bias. AMP density and the transmissibility coefficient were correlated using Spearman's method as implemented in the scipy package, for […]: follow-up of the children's microbiome after 1, 3, and up to 18 years, as well as cohabitation and intra-household data. The p values of the correlations were corrected using the Holm-Sidak method implemented in the multipletests function of the statsmodels package.

Determination of accessory AMPs

To examine the prevalence of c_AMPs across microbial genomes, core, shell, and accessory c_AMP clusters were determined using the subset of c_AMPs obtained from ProGenomes, because of their high-confidence taxonomic assignments and genomically defined species (specI). To increase confidence in our measures, only species with at least 10 genomes were used in this analysis. c_AMPs and AMP families present in fewer than […] of the genomes of a microbial species were classified as accessory. c_AMPs and families present in […] of the genomes in the cluster were classified as shell, and those present in […] of the genomes were classified as core.

To determine the propensity of AMPs to be shared between genomes belonging to the same strain, we first defined strains within species. For this, we used FastANI v.1.33 to cluster genomes of the same species in ProGenomes2. Genome clusters with ANI […] were considered clonal complexes, and only one representative of each clonal complex was kept for further analyses. Species with fewer than 10 genomes after this step were not considered in this analysis. Next, we inferred strains (99.5% […]) as in Rodriguez et al. We then counted the pairs of genomes of the same species that share an AMP, split by whether or not the pair belongs to the same strain, and tested the results with Fisher's exact test as implemented in the scipy package.

To determine the proportions of accessory, shell, and core full-length proteins in microbial genomes, we also extracted the predicted full-length proteins from the ENA database for each genome and hierarchically clustered them after alphabet reduction, in a manner similar to that described in the topic "AMP families". Clusters of full-length proteins with […] sequences were kept for each species. The prevalence of full-length protein families within a species was computed as described above, and the number of core families was compared with the number of core c_AMP families using a probability computed as the number of species with a proportion of core full-length protein families less than or equal to that observed for c_AMP families, divided by the total number of species evaluated.

To genotype the Mycoplasma pneumoniae genomes in ProGenomes2, we extracted the gene encoding the P1 adhesin by mapping the reference gene sequence NZ_LR214945.1:c568695-567307 against each genome with bwa v.0.7.17, and then extracted the sequences using SAMtools and BEDtools. The extracted gene sequences were aligned with Clustal Omega, and a phylogenetic tree was built from the aligned nucleotide sequences with FastTree, using the generalized time-reversible substitution model and a bootstrap procedure with 1,000 pseudoreplicates to determine node support. The tree was used to split and classify the genomes, taking the strain type from the reference genomes of Diaz et al.

AMP annotation using different datasets

To detect homologs to previously published proteins, we aligned the AMPSphere candidates against several databases: (i) the small protein sets in SmProt 2, (ii) the bioactive peptides database starPepDB, (iii) the small proteins from the global data-driven census of Salmonella, (iv) the Global Microbial Gene Catalog GMGCv1, and (v) the AMP database DRAMP version 3.0. To avoid assembly artifacts in the analysis, only c_AMPs that passed the terminal placement test (i.e., for which there was strong evidence that the ORF is indeed complete) were searched against GMGCv1.

AMPs were annotated using MMseqs with the 'easy-search' method, retaining hits with an E value of up to […]. Because Macrel removes the initial methionine from the peptides it outputs, hits starting at the second amino acid were treated as if they matched from the first.

We used the hypergeometric test implemented in the scipy package to model the association between c_AMPs and the background distribution of orthologous groups from GMGCv1. The number of redundant genes in GMGCv1 was counted for each orthologous group, along with the counts of orthologous groups among the top hits for AMPSphere. Enrichment was given as the proportion of hits in a given orthologous group divided by the proportion of that orthologous group among the redundant sequences in GMGCv1, and results were considered significant if […] after correction with the Holm-Sidak method implemented in multipletests from the statsmodels package.
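The enrichment ratio and the hypergeometric tail probability can be written out explicitly as in the sketch below. It uses only the standard library rather than scipy, the function and parameter names are hypothetical, and testing the upper tail (over-representation) is an assumption about how the test was applied.

```python
from math import comb


def enrichment(hits_in_og, total_hits, og_size, background_size):
    """Fold enrichment: share of hits in the OG over the OG's background share."""
    return (hits_in_og / total_hits) / (og_size / background_size)


def hypergeom_sf(hits_in_og, total_hits, og_size, background_size):
    """P(X >= hits_in_og) when drawing `total_hits` genes without replacement."""
    upper = min(total_hits, og_size)
    favorable = sum(
        comb(og_size, x) * comb(background_size - og_size, total_hits - x)
        for x in range(hits_in_og, upper + 1)
    )
    return favorable / comb(background_size, total_hits)
```

With a background of 10 genes of which 4 belong to an orthologous group, observing 2 of 3 hits in that group gives an enrichment of about 1.67 and an upper-tail probability of 1/3; the resulting p values would then be passed to a multiple-testing correction.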

When using a robust approach that filters orthologous groups by the number of c_AMP hits and of GMGCv1 hits associated with them, with a minimum of 10, 20, or even 100 proteins, the results were similar to those obtained with all the data, showing that the expansion of orthologous groups in AMPSphere did not affect the enrichment analysis.

To investigate the genomic entities generated after gene truncation, we screened c_AMP homologs using Blastn with default settings against the NCBI database, keeping only significant hits with a maximum E value of […]. As a case study, we selected AMP10.271_016, predicted to be produced by Prevotella jejuni, which shares its start codon with the gene encoding an NAD(P)-dependent dehydrogenase (WP_089365220.1). To verify the gene distribution and the potential mutations leading to the creation of the AMP, we used Biopython to align the codons of fragments from the metagenomic contigs assembled from samples SAMN09837386, SAMN09837387, and SAMN09837388, and of genome fragments from different strains of Prevotella jejuni, CD3:33 (CP023864.1:504836-504949), F0106 (CP072366.1:781389-781502), and F0697 (CP072364.1:1466323-1466436), and from the Prevotella melaninogenica strains FDAARGOS_760 (CP054010.1:157726-157839), FDAARGOS_306 (CP022041.2:943522-943635), FDAARGOS_1566 (CP085943.1:1102942-1103055), and ATCC 25845 (CP002123.1:409656-409769), comparing the segments encoding the AMP and the original full-length protein.

Genomic context conservation analysis

To gain insight into the gene synteny around AMP genes, we mapped the 863,498 AMP sequences against a collection of 169,632 reference genomes, metagenome-assembled genomes (MAGs), and single-amplified genomes (SAGs) curated elsewhere, with DIAMOND in "blastp" mode, as reported previously. Hits with identity […] (amino acid) and query and target coverage […] were considered significant. The target coverage threshold avoids hits to larger homologs whose function may be unrelated. This yielded 107,308 AMPs with homologs in at least one genome. We built gene families from the hits of each AMP detected in prokaryotic genomes and computed a conservation score based on the functional annotation of the neighboring genes in a window of three genes upstream and downstream. The vertical conservation score at each position within the window of each c_AMP was computed as the number of genes with a given functional annotation (orthologous group, Kyoto Encyclopedia of Genes and Genomes [KEGG] pathway, KEGG orthology, KEGG module, PFAM 33.1, and CARD; annotation and database details were described previously) divided by the number of genes in the family. AMPs with more than one hit and a vertical conservation score […] for any functional term were considered to have conserved genomic contexts. Figure 4 shows the genomic context conservation for different KEGG pathways.
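The vertical conservation score at one window position can be sketched as follows. The input layout (one dictionary per family member, mapping a relative position to the set of functional terms of that neighbor), the function name, and the example identifiers in the usage note are hypothetical.

```python
from collections import Counter


def vertical_conservation(neighborhoods, position):
    """Score each functional term at one window position.

    `neighborhoods` holds one dict per gene in the family, mapping a relative
    position (-3..-1, +1..+3) to the set of functional terms of that neighbor.
    """
    counts = Counter()
    for hood in neighborhoods:
        counts.update(hood.get(position, set()))
    family_size = len(neighborhoods)
    # Fraction of family members whose neighbor at `position` carries the term.
    return {term: n / family_size for term, n in counts.items()}
```

For a family of four hits in which two have a neighbor annotated with the same KEGG orthology term at position +1, that term scores 0.5; the score would then be compared with the conservation threshold, which is not recoverable from the text above.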

To test whether the proportion of AMPs with conserved genomic neighbors is similar to that of other gene families within the 169,632 genomes curated by del Rio et al., we computed genomic context conservation for […] gene families computed de novo with MMSeqs (using a minimum amino acid identity of […], coverage of the shorter sequence of at least […], and a maximum E value of […]). The c_AMPs were also annotated with EggNOG-mapper v2. Their KO annotations were compared with those of their immediate neighbors ([…] positions) to identify neighborhoods performing the same function. It was possible to annotate […] of the 107,308 ([…]) c_AMPs with hits to the tested genomes using the EggNOG5 database. Of these, […] were assigned to translation-related functions (category J), 14.4% belong to proteins of unknown function (S), and […] were assigned to replication, recombination, and repair (L).

AMPSphere web resource

AMPSphere is available at https://ampsphere.big-data-biology.org/. The implementation is based on Python and Vue JavaScript. The database was built with SQLite, and SQLAlchemy was used to map the database to Python objects. The internal and external APIs were built with FastAPI, with Gunicorn serving them. On the front end, Vue 3 was used as the base framework and the layout was built with Quasar. Plotly was used to create interactive plots, and Axios to render content seamlessly. LogoJS (https://logojs.wenglab.org/app/) was used to create sequence logos for AMP families, and the helical wheel app (https://github.com/clemlab/helicalwheel) was used to create AMP helical wheels.

Selection of peptides for synthesis and testing

We selected two sets of peptides: (i) 50 peptides chosen because they were particularly likely to be active and were interesting in other ways (as described below), and (ii) 50 peptides chosen at random after applying technical exclusions.

For the first set, only high-quality c_AMPs (see the topic "Quality control of c_AMPs") were considered for synthesis. They were subsequently filtered according to six solubility criteria and three synthesis criteria, as in PepFun. We estimated solubility using the criteria implemented in PepFun, noting that […] peptides ([…]) passed at least half of the solubility criteria evaluated. The subset homologous to peptides in DRAMP version 3.0 had a slightly lower rate, with […] passing half of the tests. We then evaluated the peptides for ease of synthesis; however, only […] of AMPSphere passed at least 2 of the 3 criteria for chemical synthesis.

Peptides that passed at least six of the above criteria were filtered by AMP activity prediction using six methods in addition to Macrel: AMPScanner v2, the mature peptides model of ampir, amPEPpy, APIN with their proposed model, AI4AMP, and AMPLify. The peptides predicted to be AMPs by all methods were filtered by length, discarding sequences longer than 40 amino acid residues, because conventional solid-phase peptide synthesis using the Fmoc strategy has lower yields and many recoupling reactions for longer sequences. Only one peptide was kept from each family or cluster, namely the one with the highest number of observed smORFs. After this procedure, we obtained 364 candidate AMPs belonging to 166 families and 198 clusters with fewer than 8 c_AMPs. Of these, 30 candidates were homologous to sequences from the databases used in the annotation (e.g., SmProt). To assemble the list of 50 high-probability candidates: (i) we selected the 34 most prevalent peptides; (ii) we randomly selected 14 c_AMPs ([…] of our set) with homologs in GMGCv1 and one matching SmProt; and (iii) we included one peptide found in MAGs assembled from fecal samples used to investigate fecal matter transplantation. We also included scrambled sequences generated from five of the most active peptide sequences to verify the efficacy of randomly generated sequences.

To build the set of randomly selected peptides, we first selected the c_AMPs that were not homologous to any of the other databases tested and that passed the synthesis criteria above (a total of 768,061 of 863,498 peptides). We then divided this set into two subsets: (i) those assigned a probability […] by Macrel (271,555 c_AMPs) and (ii) those in the range […] (496,506 c_AMPs; note that all c_AMPs in AMPSphere have a Macrel-assigned probability […]). We randomly sampled 25 peptides from each subset.

Minimal inhibitory concentration (MIC) determination

The 100 AMPs were tested for antimicrobial activity using the broth microdilution method. MIC values were taken as the peptide concentration that killed […] of the cells after 24 h of incubation at […]. First, peptides diluted in water were added to untreated flat-bottom polystyrene microtiter plates in dilutions ranging from 64 to […], and the peptides were then exposed to an inoculum of […] cells in LB or BHI broth, for pathogens and gut commensals, respectively. After incubation, the absorbance of each well, representing each of the conditions, was measured with a spectrophotometer at 600 nm. The assays were performed in three biological replicates to ensure statistical reliability.

Circular dichroism assays

Circular dichroism experiments were performed with a J1500 circular dichroism spectropolarimeter (Jasco) at the Biological Chemistry Resource Center (BCRC) of the University of Pennsylvania. Experiments were carried out at […]. Circular dichroism spectra were obtained by averaging three accumulations, using a quartz cuvette with an optical path length of 1.0 mm. Spectra were recorded over the 260 to 190 nm wavelength range at a scan rate of […] and a bandwidth of 0.5 nm. The peptides were tested at a concentration of […]. Measurements were made in water, in a mixture of water and trifluoroethanol (TFE) at a ratio of […], and in a mixture of water and methanol at a ratio of […]. Baseline measurements were recorded before each measurement. To reduce background effects, a Fourier transform filter was applied. Helical fraction values were computed with the single-spectrum analysis tool available on the BeStSel server.

Outer membrane permeabilization assays

Membrane permeability was analyzed with the 1-(N-phenylamino)naphthalene (NPN) uptake assay. NPN shows weak fluorescence in the extracellular environment but strong fluorescence in contact with lipids of the bacterial outer membrane. NPN will therefore show increased fluorescence when the integrity of the outer membrane is compromised. A. baumannii ATCC 19606 and Pseudomonas aeruginosa PAO1 were grown until cell counts reached an […] of 0.4, followed by centrifugation ([…] at […] for 3 min), washing, and resuspension in buffer ([…] HEPES, […] glucose, pH 7.4). Next, […] of NPN solution (working concentration of […]) was added to […] of bacterial solution in a white flat-bottom 96-well plate. Fluorescence was monitored at […]. Peptide solutions in water ([…] of solution at their MIC values) were added to each well, and fluorescence was monitored as a function of time until no further increase in fluorescence was observed (30 min). Relative fluorescence was computed using a non-linear fit. The positive control (the antibiotic polymyxin B) was used as the baseline. The following equation was applied to reflect the percentage difference between the baseline (polymyxin B) and the sample: […]

Cytoplasmic membrane depolarization assays

The ability of the peptides to depolarize the cytoplasmic membrane was assessed by measuring the fluorescence of the membrane potential-sensitive dye 3,3′-dipropylthiadicarbocyanine iodide [DiSC3-(5)]. This potentiometric fluorophore fluoresces when released from the cytoplasmic membrane in response to an imbalance in its transmembrane potential. A. baumannii ATCC 19606 and Pseudomonas aeruginosa PAO1 cells were grown with shaking at […] until they reached mid-log phase ([…]). The cells were then centrifuged and washed twice with washing buffer ([…] glucose, […] HEPES, pH 7.2) and resuspended to an […] of 0.05 in […] glucose […]. An aliquot of […] of bacterial cells was added to a black flat-bottom 96-well plate and incubated with […] of DiSC3-(5) for 15 min until the fluorescence stabilized, indicating incorporation of the dye into the cytoplasmic membrane. Membrane depolarization was monitored by observing the change in the fluorescence emission intensity of the dye ([…]) after addition of the peptides ([…]). Relative fluorescence was computed using a non-linear fit. The positive control (the antibiotic polymyxin B) was used as the baseline. We estimated the percentage difference between the baseline (polymyxin B) and the sample using the same mathematical approach as in "Outer membrane permeabilization assays".

Quantification and statistical analysis

Graphs of the experimental results were generated, and statistical tests were performed, in GraphPad Prism version 9.5.1 (GraphPad Software, San Diego, CA, USA).

Additional resources

AMPSphere is freely available for download at Zenodo and as a web server (https://ampsphere.big-data-biology.org/).

Supplemental figures

Figure S1. General physicochemical features of c_AMPs in AMPSphere and validated antimicrobial peptide databases, related to Figure 1
Shown are density curves; the arbitrary density units are not shown, as all curves were independently normalized so that the area under the curve is one. For each dataset and feature, the top […] and bottom […] of values were considered outliers and are not shown in the plot. The proportions of residues with small side chains ([…]) per AMP, along with the proportions of basic residues ([…]), are also shown for c_AMPs. The distributions of each feature were compared between datasets using the Mann-Whitney test with Holm-Sidak multiple-testing correction. Almost all differences are statistically significant (adjusted p value […]). The exceptions are: the aliphatic index did not differ between peptides from DRAMP version […] and those in the positive training set used in Macrel (p […]); AMPSphere peptides did not differ from the positive training set used in Macrel in the fraction of aromatic ([…] 0.58), non-polar ([…]), polar ([…]), and acidic ([…]) residues; and the instability index ([…]) and hydrophobicity ([…]) of AMPSphere peptides were also not different from the positive training set used in Macrel.

Figure S2. c_AMP quality and habitat distribution, related to Figures 1 and 2
(A) The quality assessment of AMPSphere showed that most peptides passed at least one test. The RNAcode test depends on gene diversity, which is very low for AMPSphere, resulting in a low rate of positives among our candidates.
(B) Homologs of c_AMPs in validated bioactive peptide databases also showed a higher average quality for these datasets.
(C) The limited overlap of c_AMPs between habitats supports the use of habitat groups for greater resolution. Note that the habitat group with the highest pairwise overlap comprises human body sites and samples from the guts of humans and non-human mammals. Only habitats with at least 100 samples are shown.
(D) We observed a large proportion of rare genes in AMPSphere from the different habitat groups.

Figure S3. Validation of the clustering of families, related to STAR Methods section "Clustering of AMP families"
To validate the clustering procedure using the reduced amino acid alphabet, samples of 1,000 peptides were randomly drawn from AMPSphere (excluding representative sequences) and aligned against the representatives of their clusters. Three different clustering levels (I, II, and III) were tested. E values were computed for each alignment and plotted against the corresponding alignment identity. The average proportion of significant alignments is shown above each plot.

Figure S4. Antimicrobial activity of polymyxin B and levofloxacin and circular dichroism spectra of c_AMPs, related to STAR Methods section "Circular dichroism assays"
(A) MIC values of polymyxin B, a peptide antibiotic, and levofloxacin against all strains tested. Polymyxin B and levofloxacin were used as positive controls in all antimicrobial assays.
(B-D) The secondary structure tendency of c_AMPs was analyzed in three different solvents: (B) water, (C) a trifluoroethanol (TFE) and water mixture ([…]), and (D) a methanol (MeOH) and water mixture ([…]). Experiments were performed at […], and the circular dichroism spectra shown are the average of three accumulations obtained with a quartz cuvette with an optical path length of 1.0 mm, ranging from 260 to 190 nm at a rate of […] and a bandwidth of 0.5 nm. All peptides were tested at a concentration of […]. The respective baselines were recorded prior to measurement. A Fourier transform filter was applied to reduce background effects.

Figure S6. Mechanism of action of AMPSphere peptides and anti-infective activity of c_AMPs in a preclinical animal model, related to Figures 6 and 7
(A) Fluorescence values relative to polymyxin B (PMB, positive control) of the fluorescent probe 1-(N-phenylamino)naphthalene (NPN), indicating outer membrane permeabilization of […] Pseudomonas aeruginosa PAO1 cells.
(B) Fluorescence values relative to PMB (positive control) of 3,3′-dipropylthiadicarbocyanine iodide (DiSC3-[5]), a hydrophobic fluorescent probe used to indicate cytoplasmic membrane depolarization of […] Pseudomonas aeruginosa PAO1 cells.
(C) Bacterial counts four days post-infection; c_AMPs were tested at their MIC in a single dose one hour after the establishment of the infection. Each group consisted of three mice ([…]), and the bacterial loads used to infect each mouse were derived from a different inoculum.
Mouse weight throughout the experiment (mean […] standard deviation).
Statistical significance in (C) was determined using one-way ANOVA, in which all groups were compared with the untreated control group; […] values are shown for each of the groups. Features on the violin plots represent the median and the upper and lower quartiles. Figure created in BioRender.com.
(D) Fluorescence values relative to polymyxin B (PMB, positive control) of the fluorescent probe 1-(N-phenylamino)naphthalene (NPN), indicating outer membrane permeabilization of A. baumannii ATCC 19606 cells.
Fluorescence values relative to PMB (positive control) of 3,3′-dipropylthiadicarbocyanine iodide, a hydrophobic fluorescent probe used to indicate cytoplasmic membrane depolarization of […]. Depolarization of the cytoplasmic membrane occurred slowly compared with permeabilization of the outer membrane and took approximately 20 min to stabilize.

Figure S5. Antimicrobial activity and secondary structure of scrambled versions of some of the lead c_AMPs, related to Figures 6 and 7
(A) MIC values of scrambled versions of five lead c_AMPs from AMPSphere, tested against the same 11 pathogenic strains and eight gut commensal strains used to assess the activity of the c_AMPs.
(B-D) The secondary structure tendency of the scrambled peptides was analyzed in three different solvents: (B) water, (C) a TFE and water mixture ([…]), and (D) a MeOH and water mixture ([…]). Experiments were performed under the same conditions used for the AMPs. A Fourier transform filter was applied to reduce background effects.
(E) Heatmap with the percentage of secondary structure found for each peptide in three different solvents: water, TFE in water, and MeOH in water. Secondary structure was computed with the BeStSel server.

Journal: Cell, Volume: 187, Issue: 14
DOI: https://doi.org/10.1016/j.cell.2024.05.013
PMID: https://pubmed.ncbi.nlm.nih.gov/38843834
Publication Date: 2024-06-05

Discovery of antimicrobial peptides in the global microbiome with machine learning

Graphical abstract

Highlights

Machine learning predicts nearly 1 million new antibiotics in the global microbiome
Out of 100 tested peptides, 79 were active in vitro; 63 of these targeted pathogens
Some peptides may originate from longer sequences through genomic fragmentation
The AMPSphere is an open-access resource to accelerate antibiotic discovery

Authors

Célio Dias Santos-Júnior, Marcelo D.T. Torres, Yiqian Duan, …, Jaime Huerta-Cepas, Cesar de la Fuente-Nunez, Luis Pedro Coelho

Correspondence

cfuente@upenn.edu (C.d.l.F.-N.), luispedro@big-data-biology.org (L.P.C.)

In brief

A machine-learning-based approach predicts nearly one million new antibiotics from the global microbiome, with 79 out of 100 tested peptides being active in vitro and several showing efficacy comparable to a clinical antibiotic in a mouse preclinical model of infection.

Discovery of antimicrobial peptides in the global microbiome with machine learning

Célio Dias Santos-Júnior, Marcelo D.T. Torres, Yiqian Duan, Álvaro Rodríguez del Río, Thomas S.B. Schmidt, Hui Chong, Anthony Fullam, Michael Kuhn, Chengkai Zhu, Amy Houseman, Jelena Somborski, Anna Vines, Xing-Ming Zhao, Peer Bork, Jaime Huerta-Cepas, Cesar de la Fuente-Nunez, and Luis Pedro Coelho

Institute of Science and Technology for Brain-Inspired Intelligence – ISTBI, Fudan University, Shanghai 200433, China
Laboratory of Microbial Processes & Biodiversity – LMPB, Department of Hydrobiology, Universidade Federal de São Carlos – UFSCar, São Carlos, São Paulo 13565-905, Brazil
Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA
Department of Chemistry, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, PA, USA
Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) – Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Campus de Montegancedo-UPM, Pozuelo de Alarcón, 28223 Madrid, Spain
Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
APC Microbiome & School of Medicine, University College Cork, Cork, Ireland
Max Delbrück Centre for Molecular Medicine, Berlin, Germany
Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
Department of Neurology, Zhongshan Hospital, Fudan University, Shanghai, China
State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology, Translational Research Institute, Woolloongabba, QLD, Australia

These authors contributed equally
Lead contact
*Correspondence: cfuente@upenn.edu (C.d.l.F.-N.), luispedro@big-data-biology.org (L.P.C.)
https://doi.org/10.1016/j.cell.2024.05.013

Abstract

SUMMARY Novel antibiotics are urgently needed to combat the antibiotic-resistance crisis. We present a machine-learning-based approach to predict antimicrobial peptides (AMPs) within the global microbiome and leverage a vast dataset of 63,410 metagenomes and 87,920 prokaryotic genomes from environmental and host-associated habitats to create the AMPSphere, a comprehensive catalog comprising 863,498 nonredundant peptides, few of which match existing databases. AMPSphere provides insights into the evolutionary origins of peptides, including by duplication or gene truncation of longer sequences, and we observed that AMP production varies by habitat. To validate our predictions, we synthesized and tested 100 AMPs against clinically relevant drug-resistant pathogens and human gut commensals both in vitro and in vivo. A total of 79 peptides were active, with 63 targeting pathogens. These active AMPs exhibited antibacterial activity by disrupting bacterial membranes. In conclusion, our approach identified nearly one million prokaryotic AMP sequences, an open-access resource for antibiotic discovery.

INTRODUCTION

Antibiotic-resistant infections are becoming increasingly difficult to treat with conventional therapies.

Indeed, such infections currently kill 1.27 million people per year.

Therefore, there is an urgent need for novel methods for antibiotic discovery.

Computational approaches have recently been developed to accelerate our ability to identify novel antibiotics, including antimicrobial peptides (AMPs).

Recently, proteome mining approaches have even been developed to identify antimicrobial agents in extinct organisms in an attempt to further expand our repertoire of known antimicrobials.

AMPs, found in all domains of life, are short sequences (operationally defined here as 10-100 amino acid residues) capable of disturbing microbial growth. AMPs most commonly interfere with cell wall integrity and cause cell lysis. Natural AMPs can originate by proteolysis, by non-ribosomal synthesis, or, as we focus on in the present study, they can be encoded within the genome.

Bacteria live in an intricate balance of antagonism and mutualism in natural habitats. AMPs play an important role in modulating such microbial interactions and can displace competitor strains, facilitating cooperation.

For instance, pathogens such as Shigella spp., Staphylococcus spp., Vibrio cholerae, and Listeria spp. produce AMPs that eliminate competitors (sometimes from the same species), allowing them to occupy their niche.

AMPs hold promise as potential therapeutics and have already been used clinically as antiviral drugs (e.g., enfuvirtide and telaprevir). AMPs that exhibit immunomodulatory properties are currently undergoing clinical trials, as are peptides that may be used to address yeast and bacterial infections (e.g., pexiganan, LL-37, and PAC-113). Although most AMPs display broad-spectrum activity, some are only active against closely related members of the same species or genus.

Such AMPs are more targeted agents than conventional broad-spectrum antibiotics.

Furthermore, contrary to conventional antibiotics, the evolution of resistance to many AMPs occurs at low rates and is not related to cross-resistance to other classes of widely used antibiotics.

The application of metagenomic analyses to the study of AMPs has been limited due to technical constraints, primarily stemming from the challenge of distinguishing genuine protein-coding sequences from false positives.

Therefore, the significance of small open reading frames (smORFs) has been historically overlooked in (meta)genomic analyses.

In recent years, significant progress has been made in metagenomic analyses of human-associated smORFs.

These advancements have incorporated machine learning (ML) techniques to identify smORFs encoding proteins belonging to specific functional categories.

Notably, a recent study used predicted smORFs to uncover approximately 2,000 AMPs from metagenomic samples of human gut microbiomes.

Nevertheless, it is important to note that the human gut represents only a fraction of the overall microbial diversity, suggesting that there remains an immense potential for the discovery of AMPs from prokaryotes in the diverse range of habitats across the globe.

In this study, we employed ML to predict and catalog AMPs from the global microbiome as currently represented in public databases. By computationally exploring 63,410 publicly available metagenomes and 87,920 high-quality microbial genomes, we uncovered a vast array of AMP diversity. This resulted in the creation of the AMPSphere, a collection of 863,498 nonredundant peptide sequences, encompassing candidate AMPs (c_AMPs) derived from (meta)genomic data. Remarkably, the majority of these c_AMP sequences had not been previously described. Our analysis revealed that these c_AMPs were specific to particular habitats and were predominantly not core genes in the pangenome.

Moreover, we synthesized 100 c_AMPs from AMPSphere and found that 79 were active, with 63 exhibiting antimicrobial activity in vitro against clinically significant ESKAPEE pathogens, which are recognized as public health concerns. These peptides were further compared to encrypted peptides (EPs), which are peptide sequences hidden in protein sequences and mined computationally, and demonstrated their ability to target bacterial membranes and their propensity to adopt α-helical and β-structures. Notably, the leading candidates displayed promising anti-infective activity in a preclinical animal model. Together, our work demonstrates the ability of ML approaches to identify functional AMPs from the global microbiome.

RESULTS

AMPSphere comprises almost 1 million c_AMPs from several habitats

AMPSphere incorporates c_AMPs predicted with ML using Macrel, a pipeline that uses random forests to predict AMPs from large peptide datasets with an emphasis on precision over recall. It was applied to 63,410 globally distributed publicly available metagenomes (Figure 1A; Table S1) and 87,920 high-quality bacterial and archaeal genomes. Sequences present in a single sample were removed, except when they had a significant match (defined by thresholds on amino acid identity and E-value) to a sequence in the AMP-dedicated database Data Repository of Antimicrobial Peptides (DRAMP) version 3.0.

This resulted in a set of smORF genes coding for 863,498 non-redundant c_AMPs (Figures 1A and S1). Similar to validated sequences with antimicrobial activity, c_AMPs from AMPSphere present a positive charge, a high isoelectric point, amphiphilicity (as measured by the hydrophobic moment), and a potential to bind to membranes or other proteins (as measured by the Boman index). As expected, in general, the distributions of physicochemical properties of peptides from AMPSphere, DRAMP version 3.0, and the positive training dataset used in Macrel are more similar to each other than to the negative training set (assumed to not be AMPs). Nonetheless, c_AMPs from AMPSphere are on average longer than those in DRAMP version 3.0, and we observed differences in the distribution of other features (e.g., charge, aliphaticity, amphipathicity, and isoelectric point; Figure S1).
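As a rough illustration of one of the physicochemical features mentioned above, the following sketch estimates the net charge of a linear peptide at a given pH with the Henderson-Hasselbalch equation. It is not the feature code used by Macrel or AMPSphere; the pKa values and the function name `net_charge` are assumptions chosen for illustration, and published pKa scales differ slightly.

```python
# Illustrative side-chain and terminal pKa values (an assumption; other
# published scales give slightly different numbers).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PKA_N_TERMINUS = 9.0
PKA_C_TERMINUS = 2.0


def net_charge(sequence: str, ph: float = 7.0) -> float:
    """Estimate the net charge of a linear peptide at the given pH."""
    sequence = sequence.upper()
    # Protonated fraction of the N-terminal amine contributes a positive charge.
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERMINUS))
    # Deprotonated fraction of the C-terminal carboxyl contributes a negative charge.
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERMINUS - ph))
    for residue in sequence:
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


if __name__ == "__main__":
    # Hypothetical toy sequences, not peptides from AMPSphere.
    for peptide in ("KKLLKKLLKK", "DDEEDDEE", "GGGG"):
        print(peptide, round(net_charge(peptide), 2))
```

Lysine- and arginine-rich sequences come out positive under this estimate, which is the property the paragraph above attributes to c_AMPs.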

We subsequently estimated the quality of the smORF predictions and detected a subset of the c_AMP sequences in independent publicly available metaproteomes or metatranscriptomes (Figures 2 and S2A; see STAR Methods section “Quality control of c_AMPs”) belonging to several habitats included in the AMPSphere, such as the human gut, plants, and others (Table S6). We then subjected all c_AMPs to a bundle of in silico quality tests (see STAR Methods section “Quality control of c_AMPs”). A subset of c_AMPs (9.2% or 80,213 c_AMPs) passed all of them, and this subset is hereafter designated as high-quality. Testing with other AMP prediction systems (AMPScanner v2, the model for mature peptides in ampir, amPEPpy, APIN, AI4AMP, and AMPLify), we observed that 849,703 of the AMPSphere c_AMPs were also predicted as AMPs by at least one other AMP prediction system, and 132,440 out of 863,498 peptides were co-predicted by all methods used.

Figure 1. AMPSphere comprises 863,498 non-redundant c_AMPs from thousands of metagenomes and high-quality microbial genomes
(A) To build the AMPSphere, we first assembled 63,410 publicly available metagenomes from diverse habitats. A modified version of Prodigal, which can also predict smORFs, was used to predict genes on the resulting metagenomic contigs as well as on 87,920 microbial genomes from ProGenomes2. Macrel was applied to the predicted smORFs to obtain 863,498 non-redundant c_AMPs (see also Figure S1). c_AMPs were then hierarchically clustered in a reduced amino acid alphabet using three identity cutoffs. We observed 118,051 non-singleton clusters, and 8,788 of them were considered families.
(B) Only a minority of c_AMPs have detectable homologs in other small protein databases (SmProt 2, STsORFs), bioactive peptide databases (DRAMP version 3.0, starPepDB), and general protein datasets (GMGCv1; see also Figure S2B). Also shown is the number of homologs in the AMPSphere in each database as well as the total. The number of homologs passing all of our quality tests regardless of their experimental evidence of translation/transcription is also shown along with the percentage it represents in the homologs identified. Note that some peptides have homologs in multiple databases and thus the total count is not the sum of the individual databases.
(C) Rarefaction curves showing how AMP discovery is impacted by sampling, with most of the habitats presenting steep sampling curves.
(D) Sharing of c_AMPs between habitats is limited. The width of ribbons represents the proportion of the shared c_AMPs in the habitat on the left. See also Figures S2C and S2D and Tables S1 and S2.

Figure 2. Quality control of AMPSphere candidates
(A) The number of AMPSphere candidates passing each of the tests proposed for quality is shown. The high-quality set is composed of candidates without experimental evidence and candidates with evidence of their translation or transcription; the number of homologs found in the high-quality set of AMP candidates is also shown. Although the high-quality set displays some overlap with the homologs, most of the homologs are not found in the high-quality set.
(B) The number of AMP candidates co-predicted by AMP prediction systems beyond Macrel (AMPScanner v2, ampir with the model for mature peptides, amPEPpy, APIN with their proposed model, AI4AMP, and AMPLify). Only a small portion of AMPSphere (<2%) cannot be co-predicted by any system other than Macrel.

Only 6,339 of the identified c_AMPs are homologous (operationally defined by thresholds on amino acid identity and E-value) to experimentally validated AMP sequences in DRAMP version 3.0. Moreover, most c_AMPs were also absent from protein databases not specific to AMPs (Figure 1B), such as the Small Proteins database (SmProt2) or the Global Microbiome Gene Catalog of canonical-length proteins (GMGCv1), suggesting that c_AMPs represent a region of peptide sequence space that is not present in these other databases. In total, we could find only 73,774 c_AMPs with homologs in any of the databases we considered. High-quality c_AMPs were detected in public databases at a higher frequency than general c_AMPs (2.5-fold; Figure 1B), with 23,012 out of the 80,213 high-quality c_AMPs having a match in another database. However, it is notable that 4,843 out of the 6,339 c_AMPs that have a homolog in DRAMP version 3.0 (and, therefore, are highly likely to be functional) are not high-quality c_AMPs. Thus, while our quality tests do enrich for validated sequences, a failure to pass the tests is not a sufficient reason to conclude that the sequence is not active.
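Several of the counts quoted in this section can be turned into fractions of the catalog with simple arithmetic. The sketch below recomputes them from the numbers stated in the text; the helper name `percent` and the dictionary labels are illustrative, and the rounding is ours rather than the authors'.

```python
TOTAL_C_AMPS = 863_498  # non-redundant c_AMPs in AMPSphere, as stated in the text


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# (count, denominator) pairs taken from the counts quoted in the text.
counts = {
    "homologous to DRAMP v3.0": (6_339, TOTAL_C_AMPS),
    "homolog in any database considered": (73_774, TOTAL_C_AMPS),
    "co-predicted by at least one other system": (849_703, TOTAL_C_AMPS),
    "co-predicted by all systems": (132_440, TOTAL_C_AMPS),
    "DRAMP homologs that are not high-quality": (4_843, 6_339),
}

if __name__ == "__main__":
    for label, (part, whole) in counts.items():
        print(f"{label}: {part:,} / {whole:,} = {percent(part, whole):.2f}%")
```

The complement of the third entry is below 2%, in line with the statement in the Figure 2B caption that only a small portion of AMPSphere cannot be co-predicted by any system other than Macrel.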

To put c_AMPs in an evolutionary context, we hierarchically clustered peptides using a reduced amino acid alphabet of 8 letters at three sequence identity cutoffs (Figure S3). At one of these identity levels, we obtained 521,760 protein clusters, of which 405,547 were singletons. A total of 78,481 of these singletons were detected in metatranscriptomes or metaproteomes from various sources, indicating that they were not artifacts. The large number of singletons suggests that most c_AMPs originated from processes other than diversification within families, which is the opposite of the hypothesized origin of full-length proteins, in which singleton families are rare.
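To make the idea of clustering in a reduced alphabet concrete, the sketch below maps the 20 amino acids onto 8 groups and compares two peptides before and after the mapping. The grouping shown (LVIMC, AG, ST, P, FYW, EDNQ, KR, H) is one commonly used 8-letter scheme and is an assumption here, since the text does not list the groups; the ungapped identity function is a simplification of what an alignment-based clustering tool would compute.

```python
# One commonly used 8-letter grouping (an assumption; the text does not
# specify which groups were used). Each group is represented by its first letter.
REDUCED_GROUPS = ["LVIMC", "AG", "ST", "P", "FYW", "EDNQ", "KR", "H"]
TO_REDUCED = {aa: group[0] for group in REDUCED_GROUPS for aa in group}


def reduce_alphabet(sequence: str) -> str:
    """Rewrite a peptide in the reduced alphabet."""
    return "".join(TO_REDUCED[aa] for aa in sequence.upper())


def ungapped_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Hypothetical toy peptides that differ at every position but only by
    # substitutions within the same physicochemical group.
    first, second = "KLAKE", "RIGRD"
    print(ungapped_identity(first, second))
    print(ungapped_identity(reduce_alphabet(first), reduce_alphabet(second)))
```

Sequences that look unrelated residue by residue can become identical once conservative substitutions are collapsed, which is why a reduced alphabet groups more distant relatives into the same cluster.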

The 8,788 clusters that met the minimum size requirement are hereafter named “families,” as in Sberro et al. Among them, we considered 6,499 as high-quality families because they contained evidence of translation or transcription or because a sufficient share of their sequences pass all in silico quality tests, regardless of whether experimental evidence is available (see STAR Methods section “AMP families”). These high-quality families span 133,309 peptides of the AMPSphere.

All the c_AMPs predicted here can be accessed at https://ampsphere.big-data-biology.org/. Users can retrieve the peptide sequences, ORFs, and predicted biochemical properties of each c_AMP (e.g., molecular weight, isoelectric point, and net charge at pH 7.0). We also provide the distribution across geographical regions, habitats, and microbial species for each c_AMP.

c_AMPs are rare and habitat-specific

The AMPSphere spans 72 different habitats, which were classified into eight high-level habitat groups, e.g., soil/plant, aquatic, and human gut (Figure 1A; Table S2). Most of the habitats, except for the human gut, appear to be far from saturated in terms of discovered c_AMPs (Figure 1C). In fact, most c_AMPs are rare (the median number of detections is 99; when restricted to high-quality c_AMPs, it is 81), with most being observed in <1% of samples (Figure S2). Only a minority of c_AMPs were detected in more than one high-level habitat group (henceforth termed “multi-habitat c_AMPs”); this fraction is 7.25-fold smaller than would be expected by a random assignment of habitats to samples (see STAR Methods section “Multi-habitat and rare c_AMPs”). Even within high-level habitat groups, c_AMPs overlap between habitats much less frequently than expected by chance (2.4-192-fold less; see STAR Methods section “Testing c_AMPs overlap across habitats”; Figure 1D).
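The comparison against "a random assignment of habitats to samples" can be pictured as a label-shuffling experiment. The sketch below is one way such a null expectation could be estimated and is not the authors' procedure (which is described in the STAR Methods); the function names, the toy data, and the number of permutations are all assumptions.

```python
import random


def multi_habitat_fraction(detections, habitat_of):
    """Fraction of peptides detected in more than one habitat group.

    detections: dict mapping peptide id -> set of sample ids
    habitat_of: dict mapping sample id -> habitat group label
    """
    multi = sum(
        1
        for samples in detections.values()
        if len({habitat_of[s] for s in samples}) > 1
    )
    return multi / len(detections)


def shuffled_expectation(detections, habitat_of, n_permutations=1000, seed=0):
    """Mean multi-habitat fraction after randomly reassigning habitat labels."""
    rng = random.Random(seed)
    samples = sorted(habitat_of)
    labels = [habitat_of[s] for s in samples]
    total = 0.0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # keeps the number of samples per habitat fixed
        total += multi_habitat_fraction(detections, dict(zip(samples, labels)))
    return total / n_permutations


if __name__ == "__main__":
    # Hypothetical toy data: four samples, two habitat groups, three peptides.
    habitat_of = {"s1": "gut", "s2": "gut", "s3": "soil", "s4": "soil"}
    detections = {"p1": {"s1", "s2"}, "p2": {"s3", "s4"}, "p3": {"s1", "s3"}}
    observed = multi_habitat_fraction(detections, habitat_of)
    expected = shuffled_expectation(detections, habitat_of)
    print(f"observed {observed:.2f}, expected under shuffling {expected:.2f}")
```

In the toy data the observed fraction is lower than the shuffled expectation, which is the direction of the effect reported above for the real catalog.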

Mutations in larger genes generate c_AMPs as independent genomic entities

Many AMPs are generated post-translationally by the fragmentation of larger proteins.

For example, EPs are computationally detected fragments from protein sequences within the human proteome and other proteomes that have been shown to be highly active.

EPs present diverse secondary structures and act on the membrane of bacterial cells similarly to known natural AMPs but have different physicochemical features compared to known AMPs.

AMPSphere only considered peptides encoded by dedicated genes. Nonetheless, we hypothesized that some of these have originated from larger proteins by fragmentation at the genomic level. To explore this, we aligned the AMPSphere c_AMPs to the full-length proteins in GMGCv1 and observed that about 7% of them are homologous to a canonical-length protein (Figure 1B), with approximately one-fourth of these hits sharing the start codon with the longer protein. This suggests early termination of full-length proteins as one mechanism for generating novel c_AMPs (Figures 3A and 3B).
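The early-termination mechanism can be illustrated with a toy open reading frame in which a single substitution turns a tryptophan codon (TGG) into a stop codon (TGA), the same kind of change reported for AMP10.271_016 in Figure 3B. The sequence below is hypothetical and is not the Prevotella locus; the stop codon set assumes the standard bacterial genetic code.

```python
# Stop codons under the standard bacterial genetic code (an assumption;
# some taxa, such as Mycoplasma, read TGA as tryptophan instead).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_index(orf: str):
    """Return the codon index of the first in-frame stop codon, or None."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        if orf[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def substitute(orf: str, position: int, base: str) -> str:
    """Return a copy of the ORF with one nucleotide replaced (0-based position)."""
    return orf[:position] + base + orf[position + 1:]


if __name__ == "__main__":
    # Hypothetical toy ORF: ATG GCT TGG AAA GGC TAA
    full_length = "ATGGCTTGGAAAGGCTAA"
    truncated = substitute(full_length, 8, "A")  # third codon TGG becomes TGA
    print(first_stop_index(full_length), first_stop_index(truncated))
```

Both sequences share the same start codon, but the mutated one terminates after two codons, leaving a much shorter product encoded as an independent gene.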

[Figure 3B alignment panel: early termination in AMP10.271_016. Rows: CD3:33, F0106, F0697, SAMN09837386, SAMN09837387, SAMN09837388, FDAARGOS_760, FDAARGOS_306, FDAARGOS_1566, ATCC 25845. Legend: P. jejuni, P. melaninogenica, mutation to stop codon, metagenomic contig, conserved region. Other panel labels: unknown function; translation, ribosomal structure, and biogenesis.]

Figure 3. Mutations in genes encoding large proteins generate c_AMPs as independent genomic entities
(A) The distribution of positions (as a percentage of the length of the larger protein) from which the AMP homologs start their alignment is shown. About 7% of c_AMPs are homologous to proteins from GMGCv1, with approximately one-fourth of the hits having the same start position as the larger protein.
(B) As an illustrative example of an AMP homologous to a full-length protein, AMP10.271_016 was recovered from three samples of human saliva from the same donor. AMP10.271_016 is predicted to be produced by Prevotella jejuni, sharing the start codon (bolded) of an NAD(P)-dependent dehydrogenase gene (WP_089365220.1), the transcription of which was stopped by a mutation (in red; TGG > TGA).
(C) The distribution of AMPs per OG class (left) and their enrichment in comparison to full-length proteins from GMGCv1 (right). OGs were classified into subgroups according to the number of AMPs they were affiliated with. The OGs of unknown function represent the largest (2,041 out of 3,792 OGs) and most enriched class with homologs to c_AMPs in GMGCv1. Interestingly, when considered individually, the number of c_AMP hits to unknown OGs was the lowest. These results do not change when underrepresented OGs are excluded by using different thresholds (e.g., at least 10, 20, or 100 homologs per OG). See also Table S3.

To investigate the function of the full-length proteins homologous to AMPs, we mapped the matching proteins from GMGCv1 to orthologous groups (OGs) from eggNOG 5.0. We identified 3,792 (out of 43,789) OGs significantly enriched (after multiple hypothesis corrections with the Holm-Sidak method) among the hits from AMPSphere. Although OGs of unknown function comprise the largest share (2,041 out of 3,792) of all identified OGs, when considered individually, these OGs are on average smaller than OGs in other categories. Thus, despite each OG having a relatively small number of c_AMP hits, when compared to the background distribution of the OGs in GMGCv1, OGs of unknown function were the most enriched among the c_AMP hits, with an average enrichment of 10,857-fold (Figure 3C; Table S3).
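For readers unfamiliar with the Holm-Sidak correction mentioned above, the following is a minimal sketch of the step-down procedure that produces adjusted p values. It covers only the multiple-testing correction, not the enrichment test itself, and the function name `holm_sidak` and the example p values are illustrative.

```python
def holm_sidak(p_values):
    """Return Holm-Sidak step-down adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value (k = rank + 1) is corrected for m - k + 1 tests.
        candidate = 1.0 - (1.0 - p_values[idx]) ** (m - rank)
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = min(1.0, running_max)
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p values for three tests.
    raw = [0.01, 0.04, 0.03]
    print([round(p, 6) for p in holm_sidak(raw)])
```

A hypothesis is then called significant when its adjusted p value falls below the chosen significance level.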

c_AMP genes may arise after gene duplication events

We next raised the question of whether c_AMPs would be predominantly present in specific genomic contexts. To investigate the functions of the neighboring genes of the c_AMPs, we mapped them against 169,484 genomes included in a recent study. A share of the 55,191 c_AMPs with more than two homologs in different genomes in the database showed phylogenetically conserved genomic context with genes of known function (see STAR Methods section “Genomic context conservation analysis”). This holds true for curated versions of the catalog: high-quality c_AMPs and high-quality c_AMPs with experimental evidence also show conserved genomic neighbors. These conservation values are similar to those of gene families with more than two homologs calculated de novo on the gene catalog, indicating that the genomic location of c_AMPs is not random.

Despite being involved in similar processes, c_AMPs were generally depleted from conserved genomic contexts involving known systems of antibiotic synthesis and resistance, even when compared to small protein families (Figure 4). Instead, we found that c_AMPs are encoded in conserved genomic contexts with ribosomal genes at a higher frequency than other gene families (4.75% for the latter; Figure 4A; Table S4).

Most of the c_AMPs (2,201 out of 2,642) in a conserved context with ribosomal subunits are homologous to ribosomal proteins (Figure 4D), congruent with the observation that in some species, ribosomal proteins have antimicrobial properties.

Seventy-seven c_AMPs homologous to ribosomal proteins were also homologous to a ribosomal gene in their immediate vicinity (up to 1 gene up/downstream). This phenomenon is not exclusive to ribosomal proteins: 1,951 c_AMPs can be annotated to the same KEGG Orthologous Group (KO) as some of their immediate neighbors and may have originated from gene duplication events. This shared annotation was interpreted in this context as evidence for a common evolutionary origin and not as a functional prediction for the c_AMPs. These duplications may have arisen by recombination of flanking homologous sequences, which can happen during cell division. Interestingly, 1,635 of these c_AMPs are located upstream of the neighbor with the same KO annotation. Different permeases and transposases are the most common KOs assigned to c_AMPs and their neighbors (400 and 125 c_AMPs, respectively; see Table S5).

Most c_AMPs are members of the accessory pangenome

We observed that only a small portion of c_AMP families present in ProGenomes2 are “core,” that is, contained in at least a threshold fraction of the genomes from the same species (Figure 5). This is consistent with previous work, in which AMP production was observed to be strain-specific. In contrast, a high proportion (circa 68.8%) of full-length protein families are core in ProGenomes2 species. There is a 1.9-fold greater chance that a pair of genomes from the same species share at least one c_AMP when they belong to the same strain (ANI between 99.5% and 99.99%).

One example of this strain-specific behavior is AMP10.018_194, the only c_AMP found in Mycoplasma pneumoniae genomes. M. pneumoniae strains are traditionally classified into two groups based on their P1 adhesin gene. Of the 76 M. pneumoniae genomes present in our study, 29 were classified as type-1, 29 were classified as type-2, and the remaining 18 were undetermined in this classification system (see STAR Methods section “Determination of accessory AMPs”). Twenty-six of the 29 type-2 genomes contain AMP10.018_194, as did 2 undetermined type genomes, but none of the type-1 genomes contain this AMP.
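The M. pneumoniae example can be expressed as a prevalence calculation. The sketch below uses the genome counts quoted above; the `classify` helper and its cutoff values are placeholders for illustration only, since the prevalence thresholds that separate core, shell, and accessory families are not reproduced in this text.

```python
def prevalence(genomes_with_family: int, total_genomes: int) -> float:
    """Fraction of genomes of a species (or subtype) that carry a given c_AMP."""
    return genomes_with_family / total_genomes


def classify(prev: float, shell_cutoff: float, core_cutoff: float) -> str:
    """Label a family by prevalence; the cutoffs are supplied by the caller."""
    if prev >= core_cutoff:
        return "core"
    if prev >= shell_cutoff:
        return "shell"
    return "accessory"


if __name__ == "__main__":
    # Counts for AMP10.018_194 taken from the text.
    overall = prevalence(26 + 2, 76)   # all M. pneumoniae genomes
    type_2 = prevalence(26, 29)        # type-2 genomes only
    type_1 = prevalence(0, 29)         # type-1 genomes only
    print(f"overall {overall:.3f}, type-2 {type_2:.3f}, type-1 {type_1:.3f}")
    # Placeholder cutoffs, chosen only to demonstrate the helper.
    print(classify(overall, shell_cutoff=0.5, core_cutoff=0.95))
```

The contrast between the type-2 and type-1 prevalences is what makes this c_AMP a clear case of strain-specific, accessory content at the species level.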

More transmissible species have lower c_AMP density

We investigated the taxonomic composition of AMPSphere by annotating contigs with the Genome Taxonomy Database (GTDB) taxonomy (see STAR Methods section “c_AMP density in microbial species”), which resulted in 570,187 c_AMPs being annotated to a genus or species. The genera contributing the most c_AMPs to AMPSphere were Prevotella, Bradyrhizobium (11,846 c_AMPs), Pelagibacter (6,675 c_AMPs), Faecalibacterium (5,917 c_AMPs), and CAG-110 (5,254 c_AMPs; see Figure 5). This distribution reflects the fact that these genera are among those that contribute the most assembled sequences in our dataset (all occupying high percentiles among the assembled genera). Therefore, we calculated the c_AMP density by determining the number of c_AMP genes per megabase pair of assembled sequence. To avoid bias due to the unequal sampling of habitats, we included all the sequences predicted by Macrel in each sample, including singleton sequences that were subsequently removed and are not part of AMPSphere.
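The density defined above is a simple normalization of gene counts by assembled sequence length. A minimal sketch follows; the function name `c_amp_density` and the example numbers are hypothetical.

```python
def c_amp_density(n_c_amp_genes: int, assembled_bp: int) -> float:
    """Number of c_AMP genes per megabase pair (Mbp) of assembled sequence."""
    if assembled_bp <= 0:
        raise ValueError("assembled_bp must be positive")
    return n_c_amp_genes / (assembled_bp / 1_000_000)


if __name__ == "__main__":
    # Hypothetical taxon: 12 predicted c_AMP genes over 4.8 Mbp of assembly,
    # i.e. 2.5 c_AMP genes per Mbp.
    print(c_amp_density(12, 4_800_000))
```

Normalizing in this way keeps taxa with large amounts of assembled sequence from appearing c_AMP-rich simply because they were sampled more deeply.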

Figure 4. The genome context of c_AMPs shows a preference for neighborhoods containing ribosome assembly proteins
(A) Compared to other proteins, c_AMPs in conserved genomic architectures tend to be closer to ribosomal-machinery-related genes than families of proteins with different sizes (families of all lengths and families of small proteins).
(B) The proportion of c_AMPs in a genome context involving antibiotic resistance genes is lower than in other gene families.
(C) The proportion of c_AMPs in neighborhoods with antibiotic-synthesis-related genes is very small.
(D) The conserved genomic context of the gene encoding AMP10.015_426 is shown in different genomes (the tree on the left depicts the phylogenetic relationship of the genes homologous to it). This c_AMP is homologous to the ribosomal protein rpsH and is found in the context of rpsH and other ribosomal protein genes. See also Table S4.

To further explore the importance of AMP production in ecological processes, we investigated the role of AMPs in the mother-to-child transmissibility of bacterial species, using data from a recently published study, by correlating the c_AMP density of each bacterial species with the published measures of microbial transmission. Human gut bacteria showed increased transmissibility at lower AMP densities. Similarly, in human oral microbiome bacterial species, transmissibility from mother to offspring is consistently inversely correlated with their c_AMP density for the first year. This suggests that human gut bacteria and oral microbiome bacterial species show increased transmissibility at lower c_AMP density. Moreover, it highlights the potential influence of c_AMP density on the transmissibility of gut and oral microbiota, suggesting a link between AMPs and the transmission success rates of microbial species.

Physicochemical features and secondary structure of AMPs

To investigate the properties and structure of the synthesized peptides, we first compared their amino acid composition to AMPs from available databases of experimentally verified sequences (DRAMP version 3.0, Database of Antimicrobial Activity and Structure of Peptides [DBAASP], and Antimicrobial Peptides Database [APD] version 3). Overall, the composition was similar, as was expected, given that Macrel’s ML model was trained using known AMPs. Notably, AMPSphere sequences displayed a slightly higher abundance of aliphatic amino acid residues, specifically alanine and valine. However, these AMPSphere sequences consistently differed (Figure 6A) from EPs.

The resemblances in amino acid composition between the identified c_AMPs and known AMPs suggested similar physicochemical characteristics and secondary structures, both of which are recognized for their influence on antimicrobial activity. The c_AMPs exhibited comparable hydrophobicity, net charge, and amphiphilicity to AMPs sourced from databases (Figure S1). Furthermore, they displayed a slight propensity for disordered conformations (Figure 6B) and had a lower net positive charge compared to other EPs (Figure 6A).

To evaluate the structural and antimicrobial properties of c_AMPs from AMPSphere, we first filtered the AMPSphere for peptides that were predicted as suitable for in vitro assays due to their solubility in aqueous solution and ease of chemical synthesis. We chose a set of high-quality AMPs with 50 peptide sequences based on their prevalence and taxonomic diversity (see STAR Methods section “Peptide selection for synthesis and testing”). Additionally, to provide an unbiased evaluation of the peptides we report here, we first excluded any peptides with a homolog in one of the published databases and then randomly selected 50 additional peptides from the AMPSphere, including 25 peptides with AMP probabilities of at least 0.6 (as reported by Macrel

) and 25 peptides with lower probabilities (

Subsequently, we conducted experimental assessments of the secondary structure of the active c_AMPs using circular dichroism (Figures 6B and S4). Similar to AMPs documented in databases, peptides derived from AMPSphere exhibited different

Figure 5. AMP variation in AMPSphere database is taxonomy-dependent
(A) Shown are the fractions of AMPs (or AMP families) that are accessory (present in

of genomes from same species), shell (

), or core (

).
(B) Distribution of the lowest taxonomic level at which c_AMPs were annotated. In detail (right) are the top 10 genera with the highest numbers of c_AMPs included in AMPSphere. Animal-associated genera (e.g., Prevotella, Faecalibacterium, and CAG-110) contribute the most c_AMPs, possibly reflecting data sampling.
(C) Using the c_AMP density per genus (calculated with c_AMPs in AMPSphere), we observed the distribution of c_AMPs per phylum, with Bacillota A as the densest (the number of samples used to build the graph is shown above each box).
(D) Taxonomy of the detected taxa in AMPSphere is shown using the GTDB reference tree. The gray bars show

distribution with respect to taxonomy, with black bars representing the confidence interval of

. Bacillota A, Actinomycetota, and Pseudomonadota are the densest phyla in c_AMPs. As a reference, the median of

for the presented genera is indicated by a magenta dashed line.

propensities for adopting α-helical structures; also, some of them were unstructured or adopted β-antiparallel conformations in all media analyzed. Notably, they also displayed an unusually high content of β-antiparallel structures in both water and methanol/water mixtures (Figure 6B) despite their amino acid composition similarities to AMPs and EPs. We attribute these findings to the slightly elevated occurrence of alanine and valine residues, which are known to favor β-like structures with a preference for β-antiparallel conformation.

Validation of c_AMPs as potent antimicrobials through in vitro assays

Next, we tested the 100 synthesized peptides against 11 clinically relevant pathogenic strains encompassing Acinetobacter baumannii, Escherichia coli (including one colistin-resistant strain), Klebsiella pneumoniae, Pseudomonas aeruginosa, Staphylococcus aureus (including one methicillin-resistant strain), vancomycin-resistant Enterococcus faecalis, and vancomycin-resistant Enterococcus faecium. Our initial screening revealed that 63 AMPs (out of 100 synthesized) completely inhibited the growth of at least one of the pathogens tested (Figure 6C). Remarkably, in some cases, the AMPs were active at concentrations as low as

, close to the peptide antibiotic polymyxin B and the antibiotic levofloxacin that were used as positive controls in all experiments (Figure S4A). The Gram-negative bacteria A. baumannii and E. coli, as well as the Gram-positive vancomycin-resistant strains E. faecalis and E. faecium, displayed higher susceptibility to the AMPs, with

, and 26 peptide hits, respectively. However, none of the tested AMPs affected methicillin-resistant S. aureus (MRSA) (Figure 6C). We also synthesized and tested the scrambled versions of five of the most active peptides from the high-quality group for antimicrobial activity (i.e., actinomycin-1, enterococcin-1, lachnospirin-1, proteobacticin-1, and synechocucin-1). All scrambled versions were inactive except for lachnospirin-1_scrambled, which presented modest activity against A. baumannii at

(16 times higher concentration compared to its parent peptide lachnospirin-1; Figure S5A). These results underscore the importance of the specific sequence of these peptides for their antimicrobial activity. To further explore the influence of sequence on

Figure 6. Amino acid composition, structure, antimicrobial activity, and mechanism of action of c_AMPs
(A) Amino acid frequency in c_AMPs from AMPSphere, AMPs from databases (DRAMP version 3, APD3, and DBAASP), and encrypted peptides (EPs) from the human proteome.
(B) Heatmap with the percentage of secondary structure found for each peptide in three different solvents: water,

trifluoroethanol (TFE) in water, and

methanol (MeOH) in water. Secondary structure was calculated using BeStSel server.

was exposed to c_AMPs 2-fold serially diluted ranging from 64 to

in 96-well plates and incubated at

for one day. After the exposure period, the absorbance of each well was measured at 600 nm. Untreated solutions were used as controls, and minimal concentration values for complete inhibition were presented as a heatmap of antimicrobial activities (

) against 11 pathogenic and eight human gut commensal bacterial strains. All the assays were performed in three independent replicates, and the heatmap shows the mode obtained within the 2-fold dilution concentration range studied. Gram-positive

and Gram-negative

bacteria are indicated as such (top).

structure, we assessed the secondary structure tendency of the scrambled peptides using circular dichroism. We noticed a decrease in helical fraction for sequences with higher helical content (enterococcin-1, lachnospirin-1, and synechocucin-1), while the predominantly random-coil sequences actinomycin-1 and proteobacticin-1, as well as their scrambled counterparts, showed similar secondary structure in all media analyzed (Figures S5B-S5E). These results suggest a lack of correlation between secondary structure and antimicrobial activity of the AMPs derived from AMPSphere.
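The MIC readout summarized in Figure 6C (2-fold serial dilutions, with the mode of three replicates reported) can be sketched as follows. This is an illustrative outline rather than the authors' analysis code: growth calls are taken as given booleans, the lower end of the dilution series is an assumed value, and resolving ties toward the higher concentration is an assumption made here.

```python
from statistics import multimode


def twofold_series(top, steps):
    """Concentrations of a 2-fold serial dilution, highest first."""
    return [top / 2 ** i for i in range(steps)]


def mic_from_growth(concentrations, grew):
    """Lowest concentration with no growth, scanning down from the top.

    Returns None when the culture grows even at the highest concentration.
    """
    mic = None
    for conc, growth in sorted(zip(concentrations, grew), reverse=True):
        if growth:
            break
        mic = conc
    return mic


def replicate_mode(mics):
    """Mode of replicate MICs; ties go to the higher value (an assumption)."""
    observed = [m for m in mics if m is not None]
    if not observed:
        return None
    return max(multimode(observed))


# Hypothetical plate: 64 down to 1 (the lower bound is assumed here).
series = twofold_series(64, 7)
replicates = [
    [False, False, False, False, True, True, True],
    [False, False, False, False, True, True, True],
    [False, False, False, True, True, True, True],
]
reported = replicate_mode([mic_from_growth(series, r) for r in replicates])
print(reported)
```

On this scale, a 16-fold difference in MIC, such as the one between lachnospirin-1 and its scrambled version, corresponds to four 2-fold dilution steps.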

The growth of human gut commensals is impaired by c_AMPs

We screened the AMPs against eight of the most relevant members of the human gut microbiota associated with human health.

We tested commensal bacteria belonging to four phyla (Verrucomicrobiota, Bacteroidota, Actinomycetota, and Bacillota), i.e., Akkermansia muciniphila, Bacteroides fragilis, Bacteroides thetaiotaomicron, Bacteroides uniformis, Phocaeicola vulgatus (formerly Bacteroides vulgatus), Collinsella aerofaciens, Clostridium scindens, and Parabacteroides distasonis.

While it is commonly observed that known natural AMPs do not target microbiome strains, our study found that 58 of the synthesized AMPs (58%) demonstrated inhibitory effects on at least one commensal strain at low concentrations (

). Although this concentration range was higher than that observed for the most active peptides against pathogens (1

), it still falls within the highly active range of AMPs based on previous studies (Figure 6C). Interestingly, all the analyzed gut microbiome strains were susceptible to at least four c_AMPs, with strains of A. muciniphila, B. uniformis, P. vulgatus, C. aerofaciens, C. scindens, and P. distasonis exhibiting the highest susceptibility. In total, 79 AMPs (out of 100 synthesized peptides) demonstrated antimicrobial activity against pathogens and/or commensals. We also screened scrambled sequences of five of the highly active peptides from the high-quality group against gut commensals. Similarly to the results obtained against pathogenic strains (Figure S5), only lachnospirin-1_scrambled was modestly active against C. scindens at

(Figure S5A).

Permeabilization and depolarization of the bacterial membrane by c_AMPs from AMPSphere

To gain insights into the mechanism of action responsible for the antimicrobial activity observed in the peptides derived from AMPSphere (Figure 6C), we conducted experiments to assess their ability to permeabilize and depolarize the outer and cytoplasmic membranes of bacteria at their minimum inhibitory concentrations (MICs). Specifically, we investigated the effects of all 39 peptides that showed activity against A. baumannii (Figures 6D and 6E) and 6 peptides with antimicrobial activity on P. aeruginosa (Figures S6A and S6B). For comparison and as a control, we used polymyxin B, a peptide antibiotic known for its membrane permeabilization and depolarization properties.

To investigate the potential permeabilization of the outer membranes of Gram-negative bacteria by the selected AMPs, we conducted 1-(N-phenylamino)naphthalene (NPN) uptake assays. NPN is a lipophilic fluorophore that exhibits increased fluorescence in the presence of lipids found within bacterial outer membranes. The uptake of NPN indicates membrane permeabilization and damage. Among the 39 peptides evaluated for activity against A. baumannii, 10 peptides caused significant permeabilization of the outer membrane, resulting in fluorescence levels at least

higher than that of polymyxin B (Figure 6D) after 45 min of exposure. In the case of P. aeruginosa cells, four out of the six tested peptides showed higher permeabilization than polymyxin B (Figure S6A).

To evaluate the potential membrane depolarization effect of the selected AMPs from AMPSphere, we utilized the fluorescent dye 3,3′-dipropylthiadicarbocyanine iodide (DiSC3-[5]). Among the peptides tested against A. baumannii, bogicin-1 (AMP10.364_543), ampspherin-2 (AMP10.615_023), and marinobacticin-1 (AMP10.321_460) exhibited greater cytoplasmic membrane depolarization than polymyxin B, and among the ones tested against P. aeruginosa, all peptides tested exhibited greater cytoplasmic membrane depolarization than polymyxin B (Figure 6B). Interestingly, all the tested AMPSphere peptides displayed a characteristic crescent-shaped depolarization pattern compared to polymyxin B, with lower levels of depolarization observed during the first 20 min of exposure followed by an increase in depolarization over time (Figures 6E and S6B). Taken together, these results indicate that the kinetics of cytoplasmic membrane depolarization are slower compared to the kinetics of outer membrane permeabilization, which occurs rapidly upon interaction with the bacterial cells.

Our findings indicate that the tested AMPs from AMPSphere primarily exert their effects by permeabilizing the outer membrane rather than depolarizing the cytoplasmic membrane, revealing a similar mechanism of action to that observed for classical AMPs and EPs from the human proteome.

AMPs exhibit anti-infective efficacy in a mouse model

Next, we tested the anti-infective efficacy of AMPSphere-derived peptides in a skin abscess murine infection model (Figure 7A). Mice were subjected to infection with A. baumannii, a dangerous Gram-negative pathogen known for causing severe infections in various body sites including the bloodstream, lungs, urinary tract, and wounds.

Ten lead AMPs from different sources displayed potent in vitro activity against A. baumannii: synechocucin-1 (AMP10.000_211,

) from Synechococcus sp. (coral-associated, marine microbiome); proteobacticin-1 (AMP10.048_551,

) from Pseudomonadota (plant and soil microbiome); actynomycin-1 (AMP10.199_072,

) from Actinomyces (human mouth and saliva

Figure 7. Anti-infective activity of AMPs in preclinical animal model
(A) Schematic of the skin abscess mouse model used to assess the anti-infective activity of the peptides against A. baumannii cells.
(B) Peptides were tested at their MIC in a single dose 2 h after the establishment of the infection. Each group consisted of three mice (n = 3), and the bacterial loads used to infect each mouse were derived from a different inoculum.
(C) To rule out toxic effects of the peptides, mouse weight was monitored throughout the experiment.
Statistical significance in (B) was determined using one-way ANOVA where all groups were compared to the untreated control group; p values are shown for each of the groups. Features on the violin plots represent median and upper and lower quartiles. Data in (C) are the mean ± the standard deviation. Figure created in BioRender.com.

microbiome); lachnospirin-1 (AMP10.015_742,

) from Lachnospira sp. (human gut microbiome); enterococcin-1 (AMP10.051_911,

) from Enterococcus faecalis (human gut microbiome); alphaprotecin-1 (AMP10.316_798,

) from Alphaproteobacteria (aquatic microbiome); oscillospirin (AMP10.771_988,

) from Oscillospiraceae (pig gut microbiome); ampspherin-4 (AMP10.466_287,

) from an unknown source; methylocellin-1 (AMP10.446_

) from Methylocella sp. (soil microbiome); and reyranin-1 (AMP10.337_875,

) from Reyranella (plant and soil microbiome). The skin abscess infection was established with a bacterial load of

of A. baumannii cells at

colony-forming units (CFUs)

onto the wounded area of the dorsal epidermis (Figure 7A). A single dose of each peptide at their respective MIC value obtained in vitro (Figures 6C and S4A) was administered to the infected area. Two days postinfection, synechocucin-1, actynomycin-1, and oscillosporin-1 presented bacteriostatic activity, inhibiting the proliferation of A. baumannii cells, whereas lachnospirin-1, enterococcin-1, ampspherin-4, and reyranin-1 presented bactericidal activity close to that of the antibiotic polymyxin B (at

), reducing the CFU counts by 3-4 orders of magnitude (Figure 7B). Four days post-infection, synechocucin-1, lachnospirin-1, enterococcin-1, and ampspherin-4 presented a bacteriostatic effect close to that of the antibiotic polymyxin B, reducing the CFU counts by 2-3 orders of magnitude compared to the untreated control (Figure S6C). These results highlight the anti-infective potential of the tested peptides from AMPSphere, as they were administered as a single dose immediately after the establishment of the abscess. Mouse weight was monitored as a proxy for toxicity, and no significant changes were observed (Figures 7C and S6D), suggesting that the peptides tested were not toxic.
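Reductions expressed in orders of magnitude correspond to a log10 ratio of CFU counts between untreated and treated groups. The short sketch below shows that arithmetic; the CFU values are hypothetical placeholders rather than measurements from this study, and the function name is illustrative.

```python
import math


def log10_reduction(control_cfu, treated_cfu):
    """Orders of magnitude by which treatment lowers the bacterial load."""
    if control_cfu <= 0 or treated_cfu <= 0:
        raise ValueError("CFU counts must be positive")
    return math.log10(control_cfu / treated_cfu)


# Hypothetical bacterial loads, for illustration only.
untreated = 1e8
treated = 1e5
print(round(log10_reduction(untreated, treated), 1))
```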

DISCUSSION

Here, we used ML to identify nearly a million candidate AMPs in the global microbiome. Building on previous studies that focused specifically on the human gut microbiome, we cataloged AMPs from the global microbiome across 63,410 publicly available metagenomes as well as 87,920 high-quality microbial genomes from the ProGenomes2 database. This led to the creation of AMPSphere (https://ampsphere.big-data-biology.org/), an open-access and publicly available resource encompassing 863,498 non-redundant peptides and 6,499 high-quality AMP families from 72 different habitats, including marine and soil environments and the human gut. Most of the c_AMPs (

) were previously unknown and lacked detectable homologs in other databases, and about one in five had evidence of translation and/or transcription, as they could be detected in independent publicly available sets of metatranscriptomes or metaproteomes.

We designed a set of tests to capture higher-quality predictions, but many peptides failed these tests despite evidence that they were active, including our own in vitro data and the existence of validated homologs in external databases. Low-prevalence peptides will be less likely to pass the tests (RNAcode requires multiple variants), which is independent of their activity and influenced by sampling biases.

Focusing on candidate AMPs that are directly encoded in the genome enabled in vitro and in vivo testing using chemical synthesis without post-translational modifications, but there are other processes that generate active peptides, such as encrypted peptides (EPs), which we used as a comparison point. Notably, the amino acid composition and physicochemical characteristics of the validated AMPs from AMPSphere differed from those of recently identified EPs.

Two evolutionary mechanisms by which AMPs may be generated were explored. First, mutations in genes encoding longer proteins could generate gene fragments via truncation. Among the enriched ortholog groups of proteins from GMGCv1 homologous to c_AMPs, we observed that a majority of groups had unknown function (53.8%), similar to what was reported by Sberro et al. for small proteins from the human gut microbiome. The second mechanism is that a small protein gene could undergo a duplication followed by mutation, which we observed in the case of ribosomal proteins. Ribosomal proteins can harbor antimicrobial activity, possibly due to their amyloidogenic properties.

Other origins of AMPs may be horizontal gene transfer or ancestral non-coding sequences. Nonetheless, the majority of identified AMPs did not have a detectable homolog in other databases. The lack of observed homology may be due to limitations in our ability to robustly detect these homology relationships in small sequences, but there is also the possibility that small proteins, such as AMPs, may be more likely to be generated de novo compared to longer proteins and may have repeatedly evolved in various taxa. This may also be an explanation for the large fraction of c_AMPs in the AMPSphere that do not cluster with any other sequences.

We observed that c_AMPs from AMPSphere were habitat-specific and mostly accessory members of microbial pangenomes. Furthermore, four out of the five genera with the most c_AMPs present in AMPSphere share a host-associated lifestyle, and three of these (Prevotella, Faecalibacterium, and CAG-110) are common in animal hosts (Figure 5).

Valles-Colomer et al., who recently analyzed a large collection of human-associated metagenomes, provide a species-specific index of transmissibility for the several transmission scenarios they study (e.g., mother to infant). Hypothesizing that AMP production may be related to transmission, we correlated the species-specific c_AMP density calculated in AMPSphere with transmission scores. In both the human gut and oral microbiomes, species with higher c_AMP density are less transmissible, possibly because AMPs confer protection against strain replacement. Taken together, these results validate the applicability of AMPSphere in the study of microbial ecology, as they suggest a role for AMPs in determining the transmissibility and colonization ability of microbes, which warrants further investigation and validation in future work.
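A rank-based correlation is one simple way to relate per-species AMP density to a transmissibility score. The sketch below implements Spearman's coefficient with the standard library only; the choice of Spearman's coefficient is an assumption (this section does not name the statistic used), and the input values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-species values, for illustration only.
amp_density = [0.2, 0.5, 0.9, 1.4]
transmissibility = [0.8, 0.6, 0.5, 0.1]
print(spearman(amp_density, transmissibility))
```

A negative coefficient in such an analysis would match the reported trend that species with higher AMP density are less transmissible.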

Finally, we experimentally validated predictions made by our ML model and found that 79 (out of 100) synthesized AMPs displayed antimicrobial activity against either pathogens or commensals. Notably, four peptides (cagicin-1, cagicin-4, and enterococcin-1 against A. baumannii and cagicin-1 and lachnospirin-1 against vancomycin-resistant E. faecium) presented MIC values as low as

, comparable to the MICs of some of the most potent peptides previously described in the literature.

We show that the tested AMPs from AMPSphere tended to target clinically relevant Gram-negative pathogens and showed activity against vancomycin-resistant E. faecium. Although conventional AMPs do not target bacteria from the human gut microbiome, tested AMPs from AMPSphere showed efficacy against commensal bacteria, suggesting potential ecological implications of peptides as protective agents for their producing organisms and their ability to reconfigure microbiome communities.

When assessing their activity in vivo, three peptides exhibited anti-infective efficacy in a murine infection model, with lachnospirin-1 and enterococcin-1 being the most potent, resulting in a reduction of bacterial load by up to three orders of magnitude. The active peptides included those derived from both human-associated and environmental microbiota, validating our approach of investigating the global microbiome. Overall, our findings unveil a wide array of AMP sequences without matches in other databases, highlighting the potential of machine learning in the discovery of much-needed antimicrobials.

Limitations of the study

We focused on a particular category of AMPs, namely peptides encoded by their own genes and composed of up to 100 amino acids, which does not cover all active peptides. We explored the global microbiome as represented in public databases, and certain habitats and areas of the globe have been significantly more explored than others. This uneven coverage also impacts our quality estimates, as they depend on data availability. We will, however, continue to update the resource as newer genomes and metagenomes are made available. We report results based on finding homologs to our peptides, but matching small sequences to large databases has a higher rate of errors (particularly missed matches) than is the case for longer sequences. Our results on the transmissibility of microbial strains and AMP density were intended to demonstrate the value of AMPSphere as a resource, but a full validation of this link will be the focus of future work. Finally, we tested peptides in vitro and in vivo against a panel of bacteria. Given that we observed species- and even strain-specific responses, it is possible that peptides for which we did not observe any activity would have been active against strains not tested here.

STAR*METHODS

Detailed methods are provided in the online version of this paper and include the following:

KEY RESOURCES TABLE
RESOURCE AVAILABILITY
Lead contact
Materials availability
Data and code availability
EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS
Bacterial strains and growth conditions
Skin abscess infection mouse model
METHOD DETAILS
Selection of microbial (meta)genomes
Reads trimming and assembly
smORF and AMP prediction
Clustering of AMP families
Quality control of c_AMPs
Sample-based c_AMPs accumulation curves
Multi-habitat and rare c_AMPs

Testing c_AMPs overlap across habitats
c_AMP density in microbial species
c_AMPs and bacterial species transmissibility
Determination of accessory AMPs
Annotation of AMPs using different datasets
Genomic context conservation analysis
AMPSphere web resource
Peptide selection for synthesis and testing
Minimal inhibitory concentration (MIC) determination
Circular dichroism assays
Outer membrane permeabilization assays
Cytoplasmic membrane depolarization assays

QUANTIFICATION AND STATISTICAL ANALYSIS
ADDITIONAL RESOURCES

SUPPLEMENTAL INFORMATION

Supplemental information can be found online at https://doi.org/10.1016/j.cell.2024.05.013.

ACKNOWLEDGMENTS

We thank Marija Dmitrijeva (University of Zurich) for her helpful comments on a previous version of the manuscript. We thank Kaylyn Tousignant (Queensland University of Technology) for her help editing the manuscript. We thank Georgina H. Joyce (Queensland University of Technology) for her help designing the graphical abstract. We thank members of the Coelho group and the de la Fuente Lab for insightful discussions. C.F.-N. holds a Presidential Professorship at the University of Pennsylvania and acknowledges funding from the Procter & Gamble Company, United Therapeutics, a BBRF Young Investigator Grant, the Nemirovsky Prize, the Penn Health-Tech Accelerator Award, Defense Threat Reduction Agency grants HDTRA11810041 and HDTRA1-23-1-0001, and the Dean’s Innovation Fund from the Perelman School of Medicine at the University of Pennsylvania. We thank Dr. Mark Goulian for kindly donating the strains Escherichia coli AIC221 (Escherichia coli MG1655 phnE_2:FRT [control strain for AIC222]) and Escherichia coli AIC222 (Escherichia coli MG1655 pmrA53 phnE_2:FRT [polymyxin-resistant]). This work was partly funded by the EMBL and the following grants: National Natural Science Foundation of China grants T2225015 and 61932008 (L.P.C. and X.-M.Z.); Shanghai Science and Technology Commission Program grant 23JS1410100 (L.P.C. and X.-M.Z.); National Key R&D Program of China grants 2023YFF1204800 and 2020YFA0712403 (L.P.C. and X.-M.Z.); Shanghai Municipal Science and Technology Major Project grant 2018SHZDZX01 (L.P.C. and X.-M.Z.); Lingang Laboratory and National Key Laboratory of Human Factors Engineering Joint Grant LG-TKN-202203-01 (X.-M.Z.); The Science and Technology Commission of Shanghai Municipality grant 22JC1410900 (L.P.C.); Australian Research Council grant FT230100724 (L.P.C.); the Langer Prize from the AIChE Foundation (C.F.-N.); National Institutes of Health grant R35GM138201 (C.F.-N.); Defense Threat Reduction Agency grant HDTRA1-21-1-0014 (C.F.-N.); PID2021-127210NB-I00, MCIN/AEI/10.13039/501100011033/FEDER, UE (J.H.-C.); ‘la Caixa’ Foundation ID 100010434, fellowship code LCF/BQ/DI18/11660009 (A.R.d.R.); and the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement 713673 (A.R.d.R.).

AUTHOR CONTRIBUTIONS

Conceptualization, C.D.S.-J., L.P.C., M.D.T.T., and C.F.-N.; Data curation, C.D.S.-J., Y.D., T.S.B.S., M.K., A.F., L.P.C., M.D.T.T., and C.F.-N.; Formal analysis, C.D.S.-J., L.P.C., and M.D.T.T.; Funding acquisition, L.P.C., X.-M.Z., and C.F.-N.; Investigation, C.D.S.-J., L.P.C., M.D.T.T., and C.F.-N.; Methodology, C.D.S.-J., Y.D., J.H.-C., A.R.d.R., L.P.C., M.D.T.T., and C.F.-N.; Project administration, L.P.C., M.K., X.-M.Z., P.B., and C.F.-N.; Resources, L.P.C., X.-M.Z., and C.F.-N.; Supervision, L.P.C. and C.F.-N.; Visualization, C.D.S.-J., J.H.-C., J.S., A.V., A.H., C.Z., L.P.C., and M.D.T.T.; Writing – original draft, C.D.S.-J., M.D.T.T., C.F.-N., and L.P.C.; Writing – review & editing, C.D.S.-J., Y.D., J.H.-C., A.R.d.R., T.S.B.S., A.F., P.B., X.-M.Z., L.P.C., M.D.T.T., and C.F.-N.

DECLARATION OF INTERESTS

C.F.-N. provides consulting services to Invaio Sciences and is a member of the Scientific Advisory Boards of Nowture S.L. and Phare Bio. The de la Fuente Lab has received research funding or in-kind donations from United Therapeutics, Strata Manufacturing PJSC, and Procter & Gamble, none of which were used in support of this work. An invention disclosure associated with this work has been submitted.

Received: June 14, 2023
Revised: April 11, 2024
Accepted: May 6, 2024
Published: June 5, 2024

REFERENCES

de la Fuente-Nunez, C., Torres, M.D., Mojica, F.J., and Lu, T.K. (2017). Next-generation precision antimicrobials: towards personalized treatment of infectious diseases. Curr. Opin. Microbiol. 37, 95-102. https://doi.org/10.1016/j.mib.2017.05.014.
Antimicrobial Resistance Collaborators (2022). Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet 399, 629-655. https://doi.org/10.1016/S0140-6736(21)02724-0.
Stokes, J.M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N.M., MacNair, C.R., French, S., Carfrae, L.A., Bloom-Ackermann, Z., et al. (2020). A Deep Learning Approach to Antibiotic Discovery. Cell 180, 688-702.e13. https://doi.org/10.1016/j.cell.2020.01.021.
Torres, M.D.T., Melo, M.C.R., Flowers, L., Crescenzi, O., Notomista, E., and de la Fuente-Nunez, C. (2022). Mining for encrypted peptide antibiotics in the human proteome. Nat. Biomed. Eng. 6, 67-75. https://doi.org/10.1038/s41551-021-00801-1.
Porto, W.F., Irazazabal, L., Alves, E.S.F., Ribeiro, S.M., Matos, C.O., Pires, Á.S., Fensterseifer, I.C.M., Miranda, V.J., Haney, E.F., Humblot, V., et al. (2018). In silico optimization of a guava antimicrobial peptide enables combinatorial exploration for peptide design. Nat. Commun. 9, 1490. https://doi.org/10.1038/s41467-018-03746-3.
Ma, Y., Guo, Z., Xia, B., Zhang, Y., Liu, X., Yu, Y., Tang, N., Tong, X., Wang, M., Ye, X., et al. (2022). Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat. Biotechnol. 40, 921-931. https://doi.org/10.1038/s41587-022-01226-0.
Wong, F., de la Fuente-Nunez, C., and Collins, J.J. (2023). Leveraging artificial intelligence in the fight against infectious diseases. Science 381, 164-170. https://doi.org/10.1126/science.adh1114.
Cesaro, A., Bagheri, M., Torres, M., Wan, F., and de la Fuente-Nunez, C. (2023). Deep learning tools to accelerate antibiotic discovery. Expert Opin. Drug Discov. 18, 1245-1257. https://doi.org/10.1080/17460441.2023.2250721.
Torres, M.D.T., and de la Fuente-Nunez, C. (2019). Toward computer-made artificial antibiotics. Curr. Opin. Microbiol. 51, 30-38. https://doi.org/10.1016/j.mib.2019.03.004.
Maasch, J.R.M.A., Torres, M.D.T., Melo, M.C.R., and de la Fuente-Nunez, C. (2023). Molecular de-extinction of ancient antimicrobial peptides enabled by machine learning. Cell Host Microbe 31, 1260-1274.e6. https://doi.org/10.1016/j.chom.2023.07.001.
Besse, A., Vandervennet, M., Goulard, C., Peduzzi, J., Isaac, S., Rebuffat, S., and Carré-Mlouka, A. (2017). Halocin C8: an antimicrobial peptide distributed among four halophilic archaeal genera: Natrinema, Haloterrigena, Haloferax, and Halobacterium. Extremophiles 21, 623-638. https://doi.org/10.1007/s00792-017-0931-5.
Cotter, P.D., Ross, R.P., and Hill, C. (2013). Bacteriocins – a viable alternative to antibiotics? Nat. Rev. Microbiol. 11, 95-105. https://doi.org/10.1038/nrmicro2937.
Wang, S., Zheng, Z., Zou, H., Li, N., and Wu, M. (2019). Characterization of the secondary metabolite biosynthetic gene clusters in archaea. Comput. Biol. Chem. 78, 165-169. https://doi.org/10.1016/j.compbiolchem.2018.11.019.
Zasloff, M. (2019). Antimicrobial Peptides of Multicellular Organisms: My Perspective. In Antimicrobial Peptides: Basics for Clinical Application, K. Matsuzaki, ed. (Springer Singapore), pp. 3-6. https://doi.org/10.1007/978-981-13-3588-4_1.
Huang, K.-Y., Chang, T.-H., Jhong, J.-H., Chi, Y.-H., Li, W.-C., Chan, C.L., Robert Lai, K., and Lee, T.-Y. (2017). Identification of natural antimicrobial peptides from bacteria through metagenomic and metatranscriptomic analysis of high-throughput transcriptome data of Taiwanese oolong teas. BMC Syst. Biol. 11, 131. https://doi.org/10.1186/s12918-017-0503-4.
Torres, M.D.T., Sothiselvam, S., Lu, T.K., and de la Fuente-Nunez, C. (2019). Peptide Design Principles for Antimicrobial Applications. J. Mol. Biol. 431, 3547-3567. https://doi.org/10.1016/j.jmb.2018.12.015.
Pizzo, E., Cafaro, V., Di Donato, A., and Notomista, E. (2018). Cryptic Antimicrobial Peptides: Identification Methods and Current Knowledge of their Immunomodulatory Properties. Curr. Pharm. Des. 24, 1054-1066. https://doi.org/10.2174/1381612824666180327165012.
Nolan, E.M., and Walsh, C.T. (2009). How nature morphs peptide scaffolds into antibiotics. Chembiochem 10, 34-53. https://doi.org/10.1002/cbic.200800438.
Singh, N., and Abraham, J. (2014). Ribosomally synthesized peptides from natural sources. J. Antibiot. 67, 277-289. https://doi.org/10.1038/ja.2013.138.
García-Bayona, L., and Comstock, L.E. (2018). Bacterial antagonism in host-associated microbial communities. Science 361, eaat2456. https://doi.org/10.1126/science.aat2456.
Anderson, M.C., Vonaesch, P., Saffarian, A., Marteyn, B.S., and Sansonetti, P.J. (2017). Shigella sonnei encodes a functional T6SS used for interbacterial competition and niche occupancy. Cell Host Microbe 21, 769-776.e3. https://doi.org/10.1016/j.chom.2017.05.004.
Krismer, B., Weidenmaier, C., Zipperer, A., and Peschel, A. (2017). The commensal lifestyle of Staphylococcus aureus and its interactions with the nasal microbiota. Nat. Rev. Microbiol. 15, 675-687. https://doi.org/10.1038/nrmicro.2017.104.
Zhao, W., Caro, F., Robins, W., and Mekalanos, J.J. (2018). Antagonism toward the intestinal microbiota and its effect on Vibrio cholerae virulence. Science 359, 210-213. https://doi.org/10.1126/science.aap8775.
Quereda, J.J., Nahori, M.A., Meza-Torres, J., Sachse, M., Titos-Jiménez, P., Gomez-Laguna, J., Dussurget, O., Cossart, P., and Pizarro-Cerdá, J. (2017). Listeriolysin S is a streptolysin s-like virulence factor that targets exclusively prokaryotic cells in vivo. mBio 8, e00259-17. https://doi.org/ .
Quereda, J.J., Dussurget, O., Nahori, M.A., Ghozlane, A., Volant, S., Dillies, M.A., Regnault, B., Kennedy, S., Mondot, S., Villoing, B., et al. (2016). Bacteriocin from epidemic Listeria strains alters the host intestinal microbiota to favor infection. Proc. Natl. Acad. Sci. USA 113, 5706-5711. https://doi.org/10.1073/pnas.1523899113.
Gomes, B., Augusto, M.T., Felício, M.R., Hollmann, A., Franco, O.L., Gonçalves, S., and Santos, N.C. (2018). Designing improved active peptides for therapeutic approaches against infectious diseases. Biotechnol. Adv. 36, 415-429. https://doi.org/10.1016/j.biotechadv.2018.01.004.
Lesiuk, M., Paduszyńska, M., and Greber, K.E. (2022). Synthetic Antimicrobial Immunomodulatory Peptides: Ongoing Studies and Clinical Trials. Antibiotics (Basel) 11, 1062. https://doi.org/10.3390/antibiotics11081062.
Mahlapuu, M., Håkansson, J., Ringstad, L., and Björn, C. (2016). Antimicrobial Peptides: An Emerging Category of Therapeutic Agents. Front. Cell. Infect. Microbiol. 6, 235805.
Baquero, F., Lanza, V.F., Baquero, M.R., Del Campo, R., and Bravo-Vázquez, D.A. (2019). Microcins in Enterobacteriaceae: peptide antimicrobials in the eco-active intestinal chemosphere. Front. Microbiol. 10, 2261. https://doi.org/10.3389/fmicb.2019.02261.
Kim, S.G., Becattini, S., Moody, T.U., Shliaha, P.V., Littmann, E.R., Seok, R., Gjonbalaj, M., Eaton, V., Fontana, E., Amoretti, L., et al. (2019). Microbiota-derived lantibiotic restores resistance against vancomycin-resistant Enterococcus. Nature 572, 665-669. https://doi.org/10.1038/s41586-019-1501-z.
Nakatsuji, T., Hata, T.R., Tong, Y., Cheng, J.Y., Shafiq, F., Butcher, A.M., Salem, S.S., Brinton, S.L., Rudman Spergel, A.K., Johnson, K., et al. (2021). Development of a human skin commensal microbe for bacteriotherapy of atopic dermatitis and use in a phase 1 randomized clinical trial. Nat. Med. 27, 700-709. https://doi.org/10.1038/s41591-021-01256-2.
Spohn, R., Daruka, L., Lázár, V., Martins, A., Vidovics, F., Grézal, G., Méhi, O., Kintses, B., Számel, M., Jangir, P.K., et al. (2019). Integrated evolutionary analysis reveals antimicrobial peptides with limited resistance. Nat. Commun. 10, 4538. https://doi.org/10.1038/s41467-019-12364-6.
Cesaro, A., Torres, M.D.T., Gaglione, R., Dell’Olmo, E., Di Girolamo, R., Bosso, A., Pizzo, E., Haagsman, H.P., Veldhuizen, E.J.A., de la Fuente-Nunez, C., and Arciello, A. (2022). Synthetic Antibiotic Derived from Sequences Encrypted in a Protein from Human Plasma. ACS Nano 16, 1880-1895. https://doi.org/10.1021/acsnano.1c04496.
Hyatt, D., Chen, G.-L., LoCascio, P.F., Land, M.L., Larimer, F.W., and Hauser, L.J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf. 11, 119. https://doi.org/10.1186/1471-2105-11-119.
Ahrens, C.H., Wade, J.T., Champion, M.M., and Langer, J.D. (2022). A Practical Guide to Small Protein Discovery and Characterization Using Mass Spectrometry. J. Bacteriol. 204, e0035321. https://doi.org/10.1128/JB.00353-21.
Storz, G., Wolf, Y.I., and Ramamurthi, K.S. (2014). Small Proteins Can No Longer Be Ignored. Annu. Rev. Biochem. 83, 753-777. https://doi.org/10.1146/annurev-biochem-070611-102400.
Su, M., Ling, Y., Yu, J., Wu, J., and Xiao, J. (2013). Small proteins: untapped area of potential biological importance. Front. Genet. 4, 286. https://doi.org/10.3389/fgene.2013.00286.
Sberro, H., Fremin, B.J., Zlitni, S., Edfors, F., Greenfield, N., Snyder, M.P., Pavlopoulos, G.A., Kyrpides, N.C., and Bhatt, A.S. (2019). Large-Scale Analyses of Human Microbiomes Reveal Thousands of Small, Novel Genes. Cell 178, 1245-1259.e14. https://doi.org/10.1016/j.cell.2019.07.016.
Donia, M.S., Cimermancic, P., Schulze, C.J., Wieland Brown, L.C., Martin, J., Mitreva, M., Clardy, J., Linington, R.G., and Fischbach, M.A. (2014). A systematic analysis of biosynthetic gene clusters in the human microbiome reveals a common family of antibiotics. Cell 158, 1402-1414. https://doi.org/10.1016/j.cell.2014.08.032.
Fingerhut, L.C.H.W., Miller, D.J., Strugnell, J.M., Daly, N.L., and Cooke, I.R. (2021). ampir: an R package for fast genome-wide prediction of antimicrobial peptides. Bioinformatics 36, 5262-5263. https://doi.org/10.1093/bioinformatics/btaa653.
Sugimoto, Y., Camacho, F.R., Wang, S., Chankhamjon, P., Odabas, A., Biswas, A., Jeffrey, P.D., and Donia, M.S. (2019). A metagenomic strategy for harnessing the chemical repertoire of the human microbiome. Science 366, eaax9176. https://doi.org/10.1126/science.aax9176.
Santos-Júnior, C.D., Pan, S., Zhao, X.-M., and Coelho, L.P. (2020). Macrel: antimicrobial peptide screening in genomes and metagenomes. PeerJ 8, e10555. https://doi.org/10.7717/peerj.10555.
Mende, D.R., Letunic, I., Maistrenko, O.M., Schmidt, T.S.B., Milanese, A., Paoli, L., Hernández-Plaza, A., Orakov, A.N., Forslund, S.K., Sunagawa, S., et al. (2020). proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res. 48, D621-D625. https://doi.org/10.1093/nar/gkz1002.
Navidinia, M. (2016). The clinical importance of emerging ESKAPE pathogens in nosocomial infections. Archives of Advances in Biosciences 7, 43-57. https://doi.org/10.22037/jps.v7i3.12584.
Mulani, M.S., Kamble, E.E., Kumkar, S.N., Tawre, M.S., and Pardesi, K.R. (2019). Emerging Strategies to Combat ESKAPE Pathogens in the Era of Antimicrobial Resistance: A Review. Front. Microbiol. 10, 539. https://doi.org/10.3389/fmicb.2019.00539.
Shi, G., Kang, X., Dong, F., Liu, Y., Zhu, N., Hu, Y., Xu, H., Lao, X., and Zheng, H. (2022). DRAMP 3.0: an enhanced comprehensive data repository of antimicrobial peptides. Nucleic Acids Res. 50, D488-D496. https://doi.org/10.1093/nar/gkab651.
Zhang, L.-J., and Gallo, R.L. (2016). Antimicrobial peptides. Curr. Biol. 26, R14-R19. https://doi.org/10.1016/j.cub.2015.11.017.
Bhadra, P., Yan, J., Li, J., Fong, S., and Siu, S.W.I. (2018). AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci. Rep. 8, 1697. https://doi.org/10.1038/s41598-018-19752-w.
Hao, Y., Zhang, L., Niu, Y., Cai, T., Luo, J., He, S., Zhang, B., Zhang, D., Qin, Y., Yang, F., and Chen, R. (2018). SmProt: a database of small proteins encoded by annotated coding and non-coding RNA loci. Brief. Bioinform. 19, 636-643. https://doi.org/10.1093/bib/bbx005.
Venturini, E., Svensson, S.L., Maaß, S., Gelhausen, R., Eggenhofer, F., Li, L., Cain, A.K., Parkhill, J., Becher, D., Backofen, R., et al. (2020). A global data-driven census of Salmonella small proteins and their potential functions in bacterial virulence. microLife 1, uqaa002. https://doi.org/10.1093/femsml/uqaa002.
Aguilera-Mendoza, L., Marrero-Ponce, Y., Beltran, J.A., Tellez Ibarra, R., Guillen-Ramirez, H.A., and Brizuela, C.A. (2019). Graph-based data integration from bioactive peptide databases of pharmaceutical interest: toward an organized collection enabling visual network analysis. Bioinformatics 35, 4739-4747. https://doi.org/10.1093/bioinformatics/btz260.
Coelho, L.P., Alves, R., Del Río, Á.R., Myers, P.N., Cantalapiedra, C.P., Giner-Lamia, J., Schmidt, T.S., Mende, D.R., Orakov, A., Letunic, I., et al. (2022). Towards the biogeography of prokaryotic genes. Nature 601, 252-256. https://doi.org/10.1038/s41586-021-04233-4.
Veltri, D., Kamath, U., and Shehu, A. (2018). Deep learning improves antimicrobial peptide recognition. Bioinformatics 34, 2740-2747. https://doi.org/10.1093/bioinformatics/bty179.
Lawrence, T.J., Carper, D.L., Spangler, M.K., Carrell, A.A., Rush, T.A., Minter, S.J., Weston, D.J., and Labbé, J.L. (2021). amPEPpy 1.0: a portable and accurate antimicrobial peptide prediction tool. Bioinformatics 37, 2058-2060. https://doi.org/10.1093/bioinformatics/btaa917.
Su, X., Xu, J., Yin, Y., Quan, X., and Zhang, H. (2019). Antimicrobial peptide identification using multi-scale convolutional network. BMC Bioinf. 20, 730. https://doi.org/10.1186/s12859-019-3327-y.
Lin, T.-T., Yang, L.-Y., Lu, I.-H., Cheng, W.-C., Hsu, Z.-R., Chen, S.-H., and Lin, C.-Y. (2021). AI4AMP: an Antimicrobial Peptide Predictor Using Physicochemical Property-Based Encoding Method and Deep Learning. mSystems 6, e0029921. https://doi.org/10.1128/mSystems.00299-21.
Li, C., Sutherland, D., Hammond, S.A., Yang, C., Taho, F., Bergman, L., Houston, S., Warren, R.L., Wong, T., Hoang, L.M.N., et al. (2022). AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens. BMC Genom. 23, 77. https://doi.org/10.1186/s12864-022-08310-4.
Murphy, L.R., Wallqvist, A., and Levy, R.M. (2000). Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng. 13, 149-152. https://doi.org/10.1093/protein/13.3.149.
Heintz-Buschart, A., May, P., Laczny, C.C., Lebrun, L.A., Bellora, C., Krishna, A., Wampach, L., Schneider, J.G., Hogan, A., de Beaufort, C., and Wilmes, P. (2016). Integrated multi-omics of the human gut microbiome in a case study of familial type 1 diabetes. Nat. Microbiol. 2, 16180. https://doi.org/10.1038/nmicrobiol.2016.180.
Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S.K., Cook, H., Mende, D.R., Letunic, I., Rattei, T., Jensen, L.J., et al. (2019). eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309-D314. https://doi.org/10.1093/nar/gky1085.
Rodríguez del Río, Á., Giner-Lamia, J., Cantalapiedra, C.P., Botas, J., Deng, Z., Hernández-Plaza, A., Munar-Palmer, M., Santamaría-Hernando, S., Rodríguez-Herva, J.J., Ruscheweyh, H.-J., et al. (2023). Functional and evolutionary significance of unknown genes from uncultivated taxa. Nature, 1-3. https://doi.org/10.1038/s41586-023-06955-z.
Hurtado-Rios, J.J., Carrasco-Navarro, U., Almanza-Pérez, J.C., and Ponce-Alquicira, E. (2022). Ribosomes: The New Role of Ribosomal Proteins as Natural Antimicrobials. Int. J. Mol. Sci. 23, 9123. https://doi.org/10.3390/ijms23169123.
Shoja, V., and Zhang, L. (2006). A Roadmap of Tandemly Arrayed Genes in the Genomes of Human, Mouse, and Rat. Mol. Biol. Evol. 23, 2134-2141. https://doi.org/10.1093/molbev/msl085.
Sukhodolets, V.V. (2006). Unequal crossing-over in Escherichia coli. Russ. J. Genet. 42, 1285-1293. https://doi.org/10.1134/S102279540611010X.
Kim, M.K., Kang, T.H., Kim, J., Kim, H., and Yun, H.D. (2012). Evidence Showing Duplication and Recombination of cel Genes in Tandem from Hyperthermophilic Thermotoga sp. Appl. Biochem. Biotechnol. 168, 1834-1848. https://doi.org/10.1007/s12010-012-9901-7.
Blaustein, R.A., McFarland, A.G., Ben Maamar, S., Lopez, A., Castro-Wallace, S., and Hartmann, E.M. (2019). Pangenomic Approach To Understanding Microbial Adaptations within a Model Built Environment, the International Space Station, Relative to Human Hosts and Soil. mSystems 4, e00281-18. https://doi.org/10.1128/mSystems.00281-18.
Collins, F.W.J., Mesa-Pereira, B., O’Connor, P.M., Rea, M.C., Hill, C., and Ross, R.P. (2018). Reincarnation of Bacteriocins From the Lactobacillus Pangenomic Graveyard. Front. Microbiol. 9, 1298. https://doi.org/10.3389/fmicb.2018.01298.
Parks, D.H., Rinke, C., Chuvochina, M., Chaumeil, P.-A., Woodcroft, B.J., Evans, P.N., Hugenholtz, P., and Tyson, G.W. (2017). Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533-1542. https://doi.org/10.1038/s41564-017-0012-7.
Parks, D.H., Chuvochina, M., Chaumeil, P.-A., Rinke, C., Mussig, A.J., and Hugenholtz, P. (2020). A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotechnol. 38, 1079-1086. https://doi.org/10.1038/s41587-020-0501-8.
Simmons, W.L., Daubenspeck, J.M., Osborne, J.D., Balish, M.F., Waites, K.B., and Dybvig, K. (2013). Type 1 and type 2 strains of Mycoplasma pneumoniae form different biofilms. Microbiology (Read.) 159, 737-747. https://doi.org/10.1099/mic.0.064782-0.
Diaz, M.H., Desai, H.P., Morrison, S.S., Benitez, A.J., Wolff, B.J., Caravas, J., Read, T.D., Dean, D., and Winchell, J.M. (2017). Comprehensive bioinformatics analysis of Mycoplasma pneumoniae genomes to investigate underlying population structure and type-specific determinants. PLoS One 12, e0174701. https://doi.org/10.1371/journal.pone.0174701.
Valles-Colomer, M., Blanco-Míguez, A., Manghi, P., Asnicar, F., Dubois, L., Golzato, D., Armanini, F., Cumbo, F., Huang, K.D., Manara, S., et al. (2023). The person-to-person transmission landscape of the gut and oral microbiomes. Nature 614, 125-135. https://doi.org/10.1038/s41586-022-05620-1.
Pirtskhalava, M., Amstrong, A.A., Grigolava, M., Chubinidze, M., Alimbarashvili, E., Vishnepolsky, B., Gabrielian, A., Rosenthal, A., Hurt, D.E., and Tartakovsky, M. (2021). DBAASP v3: database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Res. 49, D288-D297. https://doi.org/10.1093/nar/gkaa991.
Wang, G., Li, X., and Wang, Z. (2016). APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res. 44, D1087-D1093. https://doi.org/10.1093/nar/gkv1278.
Micsonai, A., Moussong, É., Wien, F., Boros, E., Vadászi, H., Murvai, N., Lee, Y.-H., Molnár, T., Réfrégiers, M., Goto, Y., et al. (2022). BeStSel: webserver for secondary structure and fold prediction for protein CD spectroscopy. Nucleic Acids Res. 50, W90-W98. https://doi.org/10.1093/nar/gkac345.
Lifson, S., and Sander, C. (1979). Antiparallel and parallel β-strands differ in amino acid residue preferences. Nature 282, 109-111. https://doi.org/10.1038/282109a0.
Derrien, M., Collado, M.C., Ben-Amor, K., Salminen, S., and de Vos, W.M. (2008). The Mucin Degrader Akkermansia muciniphila Is an Abundant Resident of the Human Intestinal Tract. Appl. Environ. Microbiol. 74, 1646-1648. https://doi.org/10.1128/AEM.01226-07.
Earley, H., Lennon, G., Balfe, Á., Coffey, J.C., Winter, D.C., and O’Connell, P.R. (2019). The abundance of Akkermansia muciniphila and its relationship with sulphated colonic mucins in health and ulcerative colitis. Sci. Rep. 9, 15683. https://doi.org/10.1038/s41598-019-51878-3.
Daquigan, N., Seekatz, A.M., Greathouse, K.L., Young, V.B., and White, J.R. (2017). High-resolution profiling of the gut microbiome reveals the extent of Clostridium difficile burden. npj Biofilms Microbiomes 3, 35. https://doi.org/10.1038/s41522-017-0043-0.
Saenz, C., Fang, Q., Gnanasekaran, T., Trammell, S.A.J., Buijink, J.A., Pisano, P., Wierer, M., Moens, F., Lengger, B., Brejnrod, A., and Arumugam, M. (2023). Clostridium scindens secretome suppresses virulence gene expression of Clostridioides difficile in a bile acid-independent manner. Microbiol. Spectr. 11, e0393322. https://doi.org/10.1128/spectrum.03933-22.
Geerlings, S.Y., Kostopoulos, I., De Vos, W.M., and Belzer, C. (2018). Akkermansia muciniphila in the Human Gastrointestinal Tract: When, Where, and How? Microorganisms 6, 75. https://doi.org/10.3390/microorganisms6030075.
Cullen, T.W., Schofield, W.B., Barry, N.A., Putnam, E.E., Rundell, E.A., Trent, M.S., Degnan, P.H., Booth, C.J., Yu, H., and Goodman, A.L. (2015). Antimicrobial peptide resistance mediates resilience of prominent gut commensals during inflammation. Science 347, 170-175. https://doi.org/10.1126/science.1260580.
Torres, M.D.T., Pedron, C.N., Araújo, I., Silva, P.I., Silva, F.D., and Oliveira, V.X. (2017). Decoralin Analogs with Increased Resistance to Degradation and Lower Hemolytic Activity. ChemistrySelect 2, 18-23. https://doi.org/10.1002/slct.201601590.
Torres, M.D.T., Pedron, C.N., Higashikuni, Y., Kramer, R.M., Cardoso, M.H., Oshiro, K.G.N., Franco, O.L., Silva Junior, P.I., Silva, F.D., Oliveira Junior, V.X., et al. (2018). Structure-function-guided exploration of the antimicrobial peptide polybia-CP identifies activity determinants and generates synthetic therapeutic candidates. Commun. Biol. 1, 221. https://doi.org/10.1038/s42003-018-0224-2.
Silva, O.N., Torres, M.D.T., Cao, J., Alves, E.S.F., Rodrigues, L.V., Resende, J.M., Lião, L.M., Porto, W.F., Fensterseifer, I.C.M., Lu, T.K., et al. (2020). Repurposing a peptide toxin from wasp venom into antiinfectives with dual antimicrobial and immunomodulatory properties. Proc. Natl. Acad. Sci. USA 117, 26936-26945. https://doi.org/10.1073/pnas.2012379117.
Morris, F.C., Dexter, C., Kostoulias, X., Uddin, M.I., and Peleg, A.Y. (2019). The Mechanisms of Disease Caused by Acinetobacter baumannii. Front. Microbiol. 10, 1601.
Petruschke, H., Schori, C., Canzler, S., Riesbeck, S., Poehlein, A., Daniel, R., Frei, D., Segessemann, T., Zimmerman, J., Marinos, G., et al. (2021). Discovery of novel community-relevant small proteins in a simplified human intestinal microbiome. Microbiome 9, 55. https://doi.org/10.1186/s40168-020-00981-z.
Washietl, S., Findeiß, S., Müller, S.A., Kalkhof, S., von Bergen, M., Hofacker, I.L., Stadler, P.F., and Goldman, N. (2011). RNAcode: Robust discrimination of coding and noncoding regions in comparative sequence data. RNA 17, 578-594. https://doi.org/10.1261/rna.2536111.
Galzitskaya, O.V. (2021). Exploring Amyloidogenicity of Peptides From Ribosomal S1 Protein to Develop Novel AMPs. Front. Mol. Biosci. 8, 705069. https://doi.org/10.3389/fmolb.2021.705069.
Ochman, H., Lawrence, J.G., and Groisman, E.A. (2000). Lateral gene transfer and the nature of bacterial innovation. Nature 405, 299-304. https://doi.org/10.1038/35012500.
Zheng, D., and Gerstein, M.B. (2007). The ambiguous boundary between genes and pseudogenes: the dead rise up, or do they? Trends Genet. 23, 219-224. https://doi.org/10.1016/j.tig.2007.03.003.
Lazzaro, B.P., Zasloff, M., and Rolff, J. (2020). Antimicrobial peptides: Application informed by evolution. Science 368, eaau5480. https://doi.org/10.1126/science.aau5480.
Sun, S., Wang, H., Howard, A.G., Zhang, J., Su, C., Wang, Z., Du, S., Fodor, A.A., Gordon-Larsen, P., and Zhang, B. (2022). Loss of Novel Diversity in Human Gut Microbiota Associated with Ongoing Urbanization in China. mSystems 7, e0020022. https://doi.org/10.1128/msystems.00200-22.
Piquer-Esteban, S., Ruiz-Ruiz, S., Arnau, V., Diaz, W., and Moya, A. (2022). Exploring the universal healthy human gut microbiota around the World. Comput. Struct. Biotechnol. J. 20, 421-433. https://doi.org/10.1016/j.csbj.2021.12.035.
Dhakan, D.B., Maji, A., Sharma, A.K., Saxena, R., Pulikkan, J., Grace, T., Gomez, A., Scaria, J., Amato, K.R., and Sharma, V.K. (2019). The unique composition of Indian gut microbiome, gene catalogue, and associated fecal metabolome deciphered using multi-omics approaches. GigaScience 8, giz004. https://doi.org/10.1093/gigascience/giz004.
Coelho, L.P., Alves, R., Monteiro, P., Huerta-Cepas, J., Freitas, A.T., and Bork, P. (2019). NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language. Microbiome 7, 84. https://doi.org/10.1186/s40168-019-0684-8.
Coelho, L.P. (2017). Jug: Software for Parallel Reproducible Computation in Python. J. Open Res. Softw. 5, 30. https://doi.org/10.5334/jors.161.
Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152. https://doi.org/10.1093/bioinformatics/bts565.
Steinegger, M., and Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026-1028. https://doi.org/10.1038/nbt.3988.
Van Rossum, G. (2020). Python Release Python 3.8.2. Python.org. https://www.python.org/downloads/release/python-382/.
Hunter, J.D. (2007). Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 9, 90-95. https://doi.org/10.1109/MCSE.2007.55.
Harris, C.R., Millman, K.J., van der Walt, S.J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., et al. (2020). Array programming with NumPy. Nature 585, 357-362. https://doi.org/10.1038/s41586-020-2649-2.
McKinney, W. (2010). Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, pp. 56-61. https://doi.org/10.25080/Majora-92bf1922-00a.
Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., et al. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261-272. https://doi.org/10.1038/s41592-019-0686-2.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825-2830.
The scikit-bio development team (2020). scikit-bio: A Bioinformatics Library for Data Scientists, Students, and Developers. Version 0.5.5.
Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., and de Hoon, M.J.L. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422-1423. https://doi.org/10.1093/bioinformatics/btp163.
Cantalapiedra, C.P., Hernández-Plaza, A., Letunic, I., Bork, P., and Huerta-Cepas, J. (2021). eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol. Biol. Evol. 38, 5825-5829. https://doi.org/10.1093/molbev/msab293.
Eddy, S.R. (2011). Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195. https://doi.org/10.1371/journal.pcbi.1002195.
Price, M.N., Dehal, P.S., and Arkin, A.P. (2010). FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One 5, e9490. https://doi.org/10.1371/journal.pone.0009490.
Jain, C., Rodriguez-R, L.M., Phillippy, A.M., Konstantinidis, K.T., and Aluru, S. (2018). High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114. https://doi.org/10.1038/s41467-018-07641-9.
Li, D., Luo, R., Liu, C.M., Leung, C.M., Ting, H.F., Sadakane, K., Yamashita, H., and Lam, T.W. (2016). MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods 102, 3-11. https://doi.org/10.1016/j.ymeth.2016.02.020.
Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760. https://doi.org/10.1093/bioinformatics/btp324.
Seabold, S., and Perktold, J. (2010). Statsmodels: Econometric and Statistical Modeling with Python. In Proceedings of the 9th Python in Science Conference, pp. 92-96. https://doi.org/10.25080/Majora-92bf1922-011.
Milanese, A., Mende, D.R., Paoli, L., Salazar, G., Ruscheweyh, H.-J., Cuenca, M., Hingamp, P., Alves, R., Costea, P.I., Coelho, L.P., et al. (2019). Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014. https://doi.org/10.1038/s41467-019-08844-4.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R.; 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079. https://doi.org/10.1093/bioinformatics/btp352.
Quinlan, A.R., and Hall, I.M. (2010). BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842. https://doi.org/10.1093/bioinformatics/btq033.
Sievers, F., Wilm, A., Dineen, D., Gibson, T.J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., et al. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539. https://doi.org/10.1038/msb.2011.75.
Buchfink, B., Xie, C., and Huson, D.H. (2015). Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59-60. https://doi.org/10.1038/nmeth.3176.
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. (2009). BLAST+: architecture and applications. BMC Bioinf. 10, 421. https://doi.org/10.1186/1471-2105-10-421.
UniProt Consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480-D489. https://doi.org/10.1093/nar/gkaa1100.
Mistry, J., Chuguransky, S., Williams, L., Qureshi, M., Salazar, G.A., Sonnhammer, E.L.L., Tosatto, S.C.E., Paladin, L., Raj, S., Richardson, L.J., et al. (2021). Pfam: The protein families database in 2021. Nucleic Acids Res. 49, D412-D419. https://doi.org/10.1093/nar/gkaa913.
Eberhardt, R.Y., Haft, D.H., Punta, M., Martin, M., O’Donovan, C., and Bateman, A. (2012). AntiFam: a tool to help identify spurious ORFs in protein annotation. Database 2012, bas003. https://doi.org/10.1093/database/bas003.
NCBI Resource Coordinators (2015). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 43, D6-D17. https://doi.org/10.1093/nar/gku1130.
Alcock, B.P., Raphenya, A.R., Lau, T.T.Y., Tsang, K.K., Bouchard, M., Edalatmand, A., Huynh, W., Nguyen, A.-L.V., Cheng, A.A., Liu, S., et al. (2020). CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res. 48, D517-D525. https://doi.org/10.1093/nar/gkz935.
Kanehisa, M., and Sato, Y. (2020). KEGG Mapper for inferring cellular functions from protein sequences. Protein Sci. 29, 28-35. https://doi.org/10.1002/pro.3711.
Courtot, M., Cherubin, L., Faulconbridge, A., Vaughan, D., Green, M., Richardson, D., Harrison, P., Whetzel, P.L., Parkinson, H., and Burdett, T. (2019). BioSamples database: an updated sample metadata hub. Nucleic Acids Res. 47, D1172-D1178. https://doi.org/10.1093/nar/gky1061.
Harrison, P.W., Ahamed, A., Aslam, R., Alako, B.T.F., Burgin, J., Buso, N., Courtot, M., Fan, J., Gupta, D., Haseeb, M., et al. (2021). The European Nucleotide Archive in 2020. Nucleic Acids Res. 49, D82-D85. https://doi.org/10.1093/nar/gkaa1028.
Jones, P., Côté, R.G., Martens, L., Quinn, A.F., Taylor, C.F., Derache, W., Hermjakob, H., and Apweiler, R. (2006). PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res. 34, D659-D663. https://doi.org/10.1093/nar/gkj138.
Schmidt, T.S.B., Fullam, A., Ferretti, P., Orakov, A., Maistrenko, O.M., Ruscheweyh, H.-J., Letunic, I., Duan, Y., Van Rossum, T., Sunagawa, S., et al. (2024). SPIRE: a Searchable, Planetary-scale mIcrobiome REsource. Nucleic Acids Res. 52, D777-D783. https://doi.org/10.1093/nar/gkad943.
Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J., and Levy Karin, E. (2021). Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029-3031. https://doi.org/10.1093/bioinformatics/btab184.
Oren, A., Arahal, D.R., Rosselló-Móra, R., Sutcliffe, I.C., and Moore, E.R.B. (2021). Emendation of Rules 5b, 8, 15 and 22 of the International Code of Nomenclature of Prokaryotes to include the rank of phylum. Int. J. Syst. Evol. Microbiol. 71. https://doi.org/10.1099/ijsem.0.004851.
Oren, A., and Garrity, G.M. (2021). Valid publication of the names of forty-two phyla of prokaryotes. Int. J. Syst. Evol. Microbiol. 71. https://doi.org/10.1099/ijsem.0.005056.
Solis, A.D. (2015). Amino acid alphabet reduction preserves fold information contained in contact interactions in proteins. Proteins 83, 2198-2216. https://doi.org/10.1002/prot.24936.
Peterson, E.L., Kondev, J., Theriot, J.A., and Phillips, R. (2009). Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment. Bioinformatics 25, 1356-1362. https://doi.org/10.1093/bioinformatics/btp164.
Smith, T.F., and Waterman, M.S. (1981). Identification of Common Molecular Subsequences. J. Mol. Biol. 147, 195-197. https://doi.org/10.1016/0022-2836(81)90087-5.
Karlin, S., and Altschul, S.F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264-2268. https://doi.org/10.1073/pnas.87.6.2264.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.
Cena, J.A. de, Zhang, J., Deng, D., Damé-Teixeira, N., and Do, T. (2021). Low-Abundant Microorganisms: The Human Microbiome’s Dark Matter, a Scoping Review. Front. Cell. Infect. Microbiol. 11, 689197.
Mende, D.R., Sunagawa, S., Zeller, G., and Bork, P. (2013). Accurate and universal delineation of prokaryotic species. Nat. Methods 10, 881-884. https://doi.org/10.1038/nmeth.2575.
Sélem-Mojica, N., Aguilar, C., Gutiérrez-García, K., Martínez-Guerrero, C.E., and Barona-Gómez, F. (2019). EvoMining reveals the origin and fate of natural product biosynthetic enzymes. Microb. Genom. 5, e000260. https://doi.org/10.1099/mgen.0.000260.
Rodriguez-R, L.M., Conrad, R.E., Viver, T., Feistel, D.J., Lindner, B.G., Venter, S.N., Orellana, L.H., Amann, R., Rossello-Mora, R., and Konstantinidis, K.T. (2024). An ANI gap within bacterial species that advances the definitions of intra-species units. mBio 15, e02696-23. https://doi.org/10.1128/mbio.02696-23.
Finn, R.D., Coggill, P., Eberhardt, R.Y., Eddy, S.R., Mistry, J., Mitchell, A.L., Potter, S.C., Punta, M., Qureshi, M., Sangrador-Vegas, A., et al. (2016). The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279-D285. https://doi.org/10.1093/nar/gkv1344.
SolyPep: a fast generator of soluble peptides https://bioserv.rpbs.univ-paris-diderot.fr/services/SolyPep/
Ochoa, R., and Cossio, P. (2021). PepFun: Open Source Protocols for Peptide-Related Computational Analysis. Molecules 26, 1664. https://doi.org/10.3390/molecules26061664.
Kochendoerfer, G.G., and Kent, S.B. (1999). Chemical protein synthesis. Curr. Opin. Chem. Biol. 3, 665-671. https://doi.org/10.1016/s1367-5931(99)00024-1.
Sheppard, R. (2003). The fluorenylmethoxycarbonyl group in solid phase synthesis. J. Pept. Sci. 9, 545-552. https://doi.org/10.1002/psc.479.
Palomo, J.M. (2014). Solid-phase peptide synthesis: an overview focused on the preparation of biologically relevant peptides. RSC Adv. 4, 32658-32672. https://doi.org/10.1039/C4RA02458C.
Schmidt, T.S.B., Li, S.S., Maistrenko, O.M., Akanni, W., Coelho, L.P., Dolai, S., Fullam, A., Glazek, A.M., Hercog, R., Herrema, H., et al. (2022). Drivers and determinants of strain dynamics following fecal microbiota transplantation. Nat. Med. 28, 1902-1912. https://doi.org/10.1038/s41591-022-01913-0.
Wiegand, I., Hilpert, K., and Hancock, R.E.W. (2008). Agar and broth dilution methods to determine the minimal inhibitory concentration (MIC) of antimicrobial substances. Nat. Protoc. 3, 163-175. https://doi.org/10.1038/nprot.2007.521.
Santos-Júnior, C.D., Schmidt, T.S.B., Fullam, A., Duan, Y., Bork, P., Zhao, X.-M., and Coelho, L.P. (2021). AMPSphere: The Worldwide Survey of Prokaryotic Antimicrobial Peptides (Zenodo). https://doi.org/10.5281/zenodo.4606582.

STAR*METHODS

KEY RESOURCES TABLE

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Bacterial and virus strains
Acinetobacter baumannii	American Type Culture Collection	ATCC 19606
Escherichia coli	American Type Culture Collection	ATCC 11775
Escherichia coli	Escherichia coli MG1655 phnE_2:FRT	AIC221
Escherichia coli	Escherichia coli MG1655 pmrA53 phnE_2:FRT (polymyxin-resistant; colistin-resistant strain)	AIC222
Klebsiella pneumoniae	American Type Culture Collection	ATCC 13883
Pseudomonas aeruginosa	N/A	PAO1
Pseudomonas aeruginosa	N/A	PA14
Staphylococcus aureus	American Type Culture Collection	ATCC 12600
Staphylococcus aureus	American Type Culture Collection	ATCC BAA-1556 (methicillin-resistant strain)
Akkermansia muciniphila	American Type Culture Collection	ATCC BAA-635
Bacteroides fragilis	American Type Culture Collection	ATCC 25285
Bacteroides thetaiotaomicron	American Type Culture Collection	ATCC 29148
Bacteroides uniformis	American Type Culture Collection	ATCC 8492
Bacteroides vulgatus (Phocaeicola vulgatus)	American Type Culture Collection	ATCC 8482
Collinsella aerofaciens	American Type Culture Collection	ATCC 25986
Clostridium scindens	American Type Culture Collection	ATCC 35704
Parabacteroides distasonis	American Type Culture Collection	ATCC 8503
Chemicals, peptides, and recombinant proteins
Luria-Bertani broth	BD	244620
Tryptic soy broth	Sigma	T8907-1KG
Agar	Sigma	05039
MacConkey agar	RPI	M42560-500.0
Phosphate buffer saline	Sigma	P3913-10PAK
Glucose	Sigma	G5767
1-(N-phenylamino)naphthalene	Sigma	104043
3,3′-dipropylthiadicarbocyanine iodide	Sigma	43608
HEPES	Fisher	BP310-100
Potassium chloride (KCl)	Sigma	P3911
Deposited data
Code for generation of AMPSphere	This study	https://doi.org/10.5281/zenodo.11055585
AMPSphere database	This study	https://zenodo.org/record/4606582
Experimental models: Organisms/strains
Mouse: CD-1	Charles River	18679700-022
Software and algorithms
NGLess 1.3.0	Coelho et al.	https://github.com/ngless-toolkit/ngless
JUG 2.1.1	Coelho	https://github.com/luispedro/jug
Prodigal 2.6.3	Hyatt et al.	https://github.com/hyattpd/Prodigal
Macrel v.1.0.0	Santos-Júnior et al.	https://github.com/BigDataBiology/macrel
CDHit 4.8.1	Fu et al.	https://github.com/weizhongli/cdhit
MMseqs2	Steinegger and Söding	https://github.com/soedinglab/MMseqs2
python 3.8.2	Van Rossum	https://www.python.org/
matplotlib 3.4.3	Hunter	https://matplotlib.org/
numpy 1.21.2	Harris et al.	https://numpy.org/
pandas 1.3.2	McKinney	https://pandas.pydata.org/
plotly 5.2.1	Plotly Technologies Inc, 2015	https://plot.ly
scipy 1.7.1	Virtanen et al.	https://www.scipy.org
scikit-learn 0.24	Pedregosa et al.	https://scikit-learn.org/
scikit-bio 0.5.6	The scikit-bio development team	http://scikit-bio.org/
BioPython 1.7.9	Cock et al.	https://biopython.org/
eggnog-mapper v2	Cantalapiedra et al.	https://github.com/eggnogdb/eggnog-mapper
HMMer 3.3+dfsg2-1	Eddy	http://hmmer.org/
FastTree 2.1	Price et al.	http://www.microbesonline.org/fasttree/
FastANI v.1.33	Jain et al.	https://github.com/ParBLiSS/FastANI
Megahit 1.2.9	Li et al.	https://github.com/voutcn/megahit/
AMPlify	Li et al.	https://github.com/bcgsc/AMPlify
Ampir	Fingerhut et al.	https://github.com/Legana/ampir
AMPScanner v2	Veltri et al.	https://www.dveltri.com/ascan/v2/ascan.html
APIN	Su et al.	https://github.com/zhanglabNKU/APIN
amPEPpy 1.0	Lawrence et al.	https://github.com/tlawrence3/amPEPpy
AI4AMP	Lin et al.	https://github.com/LinTzuTang/AI4AMP_predictor
RNAcode 0.2-beta	Washietl et al.	https://github.com/ViennaRNA/RNAcode
Bwa v.0.7.17	Li et al.	https://github.com/lh3/bwa
Statsmodels 0.14.0	Seabold and Perktold	https://www.statsmodels.org
mOTUs2	Milanese et al.	https://github.com/motu-tool/mOTUs
SAMtools 1.18	Li et al.	https://github.com/samtools/samtools
BEDtools v2.31.0	Quinlan and Hall	https://github.com/arq5x/bedtools2
Clustal Omega 1.2.2	Sievers et al.	http://clustal.org/omega/
Diamond v2.1.8	Buchfink et al.	https://github.com/bbuchfink/diamond
Blast+ 2.13.0	Camacho et al.	https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html
Other
ProGenomes2	Mende et al.	http://progenomes.embl.de/
DRAMP – Data repository of antimicrobial peptides 3.0	Shi et al.	http://dramp.cpu-bioinfor.org/
UniprotKB 2021_03	The UniProt Consortium	https://www.uniprot.org/
Eggnog v.5.0	Huerta-Cepas et al.	http://eggnog5.embl.de/
SmProt database v.2.0	Hao et al.	http://bigdata.ibp.ac.cn/SmProt/index.html
StarPep45k	Aguilera-Mendoza et al.	http://mobiosd-hub.com/starpep
PFAM 33.1	Mistry et al.	http://pfam.xfam.org/
AntiFAM v.7.0	Eberhardt et al.	https://www.ebi.ac.uk/research/bateman/software/antifam-tool-identify-spurious-proteins
GTDB 07-RS95	Parks et al.	https://gtdb.ecogenomic.org/
NCBI release 207	NCBI Resource Coordinators	https://ftp.ncbi.nih.gov/refseq/release/
Database of Antimicrobial Activity and Structure of Peptides – DBAASP	Pirtskhalava et al.	https://dbaasp.org/home
Antimicrobial peptides database – APD3	Wang and Wang	https://aps.unmc.edu/
Salmonella Typhimurium small ORFs – STsORFs	Venturini et al.	https://academic.oup.com/microlife/article/1/1/uqaa002/5928550#supplementary-data
CARD – Comprehensive Antibiotic Resistance Database	Alcock et al.	https://card.mcmaster.ca/
Kyoto Encyclopedia of Genes and Genomes (KEGG) release 102	Kanehisa et al.	https://www.genome.jp/kegg/
Biosamples database	Courtot et al.	http://www.ebi.ac.uk/biosamples
European Nucleotide Archive – ENA	Harrison et al.	https://www.ebi.ac.uk/ena
Proteomics Identification Database – PRIDE	Jones et al.	https://www.ebi.ac.uk/pride/

RESOURCE AVAILABILITY

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact Luis Pedro Coelho (luispedro@big-data-biology.org).

Materials availability

This study did not generate new unique reagents.

Data and code availability

Metagenome and genome data are publicly available at the European Nucleotide Archive (ENA) as of the date of publication. Their accession numbers are listed in Table S1. AMPSphere is available as a public online resource (https://ampsphere.big-data-biology.org/), and its files have been deposited in Zenodo and are publicly available as of the date of publication. DOIs are listed in the key resources table.
All original code has been deposited at Zenodo and is publicly available as of the date of publication. DOIs are listed in the key resources table.
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS

Bacterial strains and growth conditions

The pathogenic strains Acinetobacter baumannii ATCC 19606, Escherichia coli ATCC 11775, Escherichia coli AIC221 [Escherichia coli MG1655 phnE_2FRT (control strain for AIC222)], Escherichia coli AIC222 [Escherichia coli MG1655 pmrA53 phnE_2FRT (polymyxin-resistant; colistin-resistant strain)], Klebsiella pneumoniae ATCC 13883, Pseudomonas aeruginosa PAO1, Pseudomonas aeruginosa PA14, Staphylococcus aureus ATCC 12600, Staphylococcus aureus ATCC BAA-1556 (methicillin-resistant strain), Enterococcus faecalis ATCC 700802 (vancomycin-resistant strain), and Enterococcus faecium ATCC 700221 (vancomycin-resistant strain) were plated from frozen stocks on Luria-Bertani (LB) agar plates and incubated overnight at […]. After incubation, one isolated colony was transferred to 6 mL of medium (LB), and cultures were incubated overnight (16 h) at […]. The following day, inocula were prepared by diluting the overnight cultures […] in 6 mL of the respective media and incubating them at […] until bacteria reached logarithmic phase ([…]).

The gut commensal strains Akkermansia muciniphila ATCC BAA-635, Bacteroides fragilis ATCC 25285, Bacteroides thetaiotaomicron ATCC 29148, Bacteroides uniformis ATCC 8492, Bacteroides vulgatus ATCC 8482 (Phocaeicola vulgatus), Collinsella aerofaciens ATCC 25986, Clostridium scindens ATCC 35704, and Parabacteroides distasonis ATCC 8503 were plated from frozen stocks on brain heart infusion (BHI) agar plates enriched with […] vitamin […] hemin ([…], diluted with 10 mL of 1 N sodium hydroxide), and L-cysteine ([…]), and incubated overnight at […]. Resazurin was used as an oxygen indicator. After the incubation period, a single isolated colony was transferred to 3 mL of BHI broth and incubated overnight at […]. The next day, inocula were prepared by diluting the bacterial overnight cultures […] in 3 mL of BHI broth and incubating them at […] until cells reached the logarithmic phase ([…]).

Skin abscess infection mouse model

To assess the anti-infective efficacy of the peptides against A. baumannii ATCC 19606 in a skin abscess infection mouse model, the bacteria were cultured in tryptic soy broth (TSB) medium until an optical density at 600 nm (OD600) of 0.5 was reached. Next, the cells were washed twice with sterile PBS (pH 7.4) and suspended to a final concentration of […] colony-forming units (CFU) per […]. Six-week-old female CD-1 mice, after being anesthetized with isoflurane, were subjected to a superficial linear skin abrasion on their backs in an area that they could not touch with their mouth or limbs. An aliquot of […] containing the bacterial load was then administered over the abraded area. A single dose of the peptides diluted in water at their MIC value was administered to the infected area 2 h after the infection. The animals were euthanized two and four days post-infection, and the infected area was excised, homogenized for 20 min using a bead beater (25 Hz), and 10-fold serially diluted for CFU quantification on MacConkey agar plates for easy differentiation of A. baumannii colonies. The experimental groups consisted of 3 CD-1 mice per group (n = 3), all female, and each mouse was infected with an inoculum from a different colony to ensure variability. The animals were single-caged to avoid cross-contamination. All the mice were used three days after arrival from the commercial provider. The skin abscess infection mouse model was approved by the University Laboratory Animal Resources (ULAR) from the University of Pennsylvania (Protocol 806763).

METHOD DETAILS

Selection of microbial (meta)genomes

Selection of metagenomes and genomes to compose the AMPSphere was similar to that adopted by Coelho et al.

Public metagenomes available on 1 January 2020 produced with Illumina instruments (except for MiSeq, to ensure the consistency and reliability of the meta-analysis findings), with at least 2 million reads and, on average, 75 bp long, were downloaded from the European Nucleotide Archive (ENA). These samples met two criteria: (1) they were tagged with taxonomy ID 408169 (for metagenome) or were a descendant of it in the taxonomic tree; and/or (2) they came from experiments with the library source listed as “METAGENOMIC”. Samples were grouped by project and all projects with at least 20 samples were included for analysis. Additionally, metagenomes deposited by the Integrated Microbial Genomes System (IMG) missing from ENA were also included. Metadata was manually curated from each sample’s describing literature and Biosamples database.

For habitat classification, groups were created based on the similarity of habitat conditions, such as air, anthropogenic, aquatic, host-associated, ph:alkaline, sediment, terrestrial, and others. The sample origins and information related to host species were obtained using the NCBI taxonomic identification number. High-quality microbial genomes were selected from the ProGenomes2 database.

The resulting 63,410 publicly available metagenomes and 87,920 high-quality microbial genomes are listed in Table S1.

Reads trimming and assembly

Reads were processed using NGLess, trimming positions with quality lower than 25 and discarding reads shorter than 60 bp post-trimming. Metagenomes obtained from a host-associated microbiome were additionally filtered to remove reads mapping to the host genome, when available. Reads totaling more than 14.7 trillion base pairs of sequenced DNA were assembled with MEGAHIT 1.2.9, and the taxonomy of the […] contigs generated was inferred as previously described, using MMseqs2 to map the sequences against the GTDB release 07-RS95. Mapped taxonomy lineages were then manually curated to conform to the International Code of Nomenclature of Prokaryotes.

smORF and AMP prediction

Analogously to Sberro et al., we used a modified version of Prodigal to predict smORFs (33-303 bp) from contigs. The […] redundant smORFs, most of which ([…]) originated in metagenomes, were then de-duplicated to optimize the computational resource usage, yielding […] non-redundant smORFs. Macrel was run on the de-duplicated smORFs to predict c_AMPs. Singleton sequences (those appearing in a single sample or genome) were eliminated, except when they had a significant match (amino acid identity […] and E-value […]) to a sequence from the Data Repository of Antimicrobial Peptides (DRAMP) version 3.0 using the ‘easy-search’ method from MMseqs2.

In total, AMPSphere encompassed 863,498 non-redundant predicted c_AMPs encoded by […] redundant genes. AMP densities were estimated as the number of AMPs per assembled base pairs in a sample or a species.

AMP genes originating from ProGenomes2 had the taxonomy of the original genome assigned to them, whereas AMP genes from metagenomes were assigned the taxonomy predicted for the contig where they were found. Insights about potential structural conformations were obtained using the function secondary_structure_fraction from the ProtParam module implemented in SeqUtils in Biopython. This function calculates the fraction of amino acids that tend to assume helix [VIYFWL], turn [NPGS], and sheet [EMAL] conformations.
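
As an illustration of what this calculation amounts to, the following is a minimal, dependency-free sketch that re-implements the fractions from the residue sets quoted above. It is not the Biopython function itself; the function name is reused here only for readability, and the handling of empty input is an assumption.

```python
# Residue sets as quoted in the text (note that L counts for both helix and sheet).
HELIX = set("VIYFWL")
TURN = set("NPGS")
SHEET = set("EMAL")


def secondary_structure_fraction(sequence):
    """Return (helix, turn, sheet) fractions for a peptide sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")  # assumption: reject empty input
    n = len(seq)
    helix = sum(aa in HELIX for aa in seq) / n
    turn = sum(aa in TURN for aa in seq) / n
    sheet = sum(aa in SHEET for aa in seq) / n
    return helix, turn, sheet
```

For the toy peptide "VLNP", half of the residues fall in the helix set, half in the turn set, and one quarter in the sheet set.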

Clustering of AMP families

Clustering peptides by sequence identity is only possible at high identities, as short low-/medium-identity matches are possible by chance. Therefore, aiming to recover matches where basic features are preserved even if individual amino acids are not identical, we used a reduced amino acid alphabet of 8 letters: [LVIMC], [AG], [ST], [FYW], [EDNQ], [KR], [P], [H]. c_AMPs were hierarchically clustered after alphabet reduction using three sequential identity cutoffs ([…], […], and […]) with CD-Hit. A cluster was considered an AMP family when it consisted of at least 8 sequences. Representative sequences of peptide clusters were selected according to their length (taking the longest), with ties being broken by their alphabetical order.
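
The alphabet reduction and the choice of representatives can be sketched as follows. This is a hedged illustration, not the code used to build AMPSphere: the choice of the first residue of each group as its symbol, the function names, and the reading of "alphabetical order" as "first in ascending order" are all assumptions.

```python
# The eight groups listed in the text; each residue maps to its group's first letter
# (the symbol chosen per group is an assumption made for this sketch).
GROUPS = ["LVIMC", "AG", "ST", "FYW", "EDNQ", "KR", "P", "H"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_alphabet(peptide):
    """Rewrite a peptide in the 8-letter alphabet (non-standard residues raise KeyError)."""
    return "".join(REDUCE[aa] for aa in peptide.upper())


def pick_representative(cluster):
    """Pick the longest sequence; break ties by ascending alphabetical order."""
    return min(cluster, key=lambda seq: (-len(seq), seq))
```

For example, `reduce_alphabet("MKTFEH")` yields `"LKSFEH"`, so two peptides differing only by M/L or T/S substitutions become identical before clustering.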

To validate this clustering procedure, we used a sample of 3,000 sequences randomly sampled from AMPSphere, excluding cluster representatives. These sequences were aligned against the representative sequence of their cluster using the Smith-Waterman algorithm with the BLOSUM62 cost matrix, and gap open and extension penalties of -10 and -0.5, respectively. The alignment score was then converted to an E-value according to the model by Karlin and Altschul, which uses the constants […] and […] (0.313667), adjusted to search for a short input sequence as implemented in the BLAST algorithm. Alignments were considered significant if their E-value was less than […]. We found that more than […] of alignments produced in the first two levels ([…] and […] of identity) were significant, along with […] of those from the third level ([…] of identity); see Figure S3.
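
For orientation, the standard Karlin-Altschul relation between an alignment score S and its E-value is E = K·m·n·exp(−λS), where m and n are the lengths of the two sequence spaces being compared. The sketch below implements only that textbook formula; the constants are passed in as parameters because their assignment is not fully recoverable from the text, and the function and argument names are hypothetical.

```python
import math


def karlin_altschul_evalue(score, query_len, target_len, k, lam):
    """E-value from the Karlin-Altschul formula: E = K * m * n * exp(-lambda * S).

    `k` and `lam` are the statistical constants of the scoring system; they are
    parameters here rather than hard-coded values (an assumption of this sketch).
    """
    return k * query_len * target_len * math.exp(-lam * score)
```

Higher scores give exponentially smaller E-values, and for a fixed score the E-value grows with the product of the two lengths, which is why short peptides need comparatively high scores to reach significance.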

Quality control of c_AMPs

The c_AMPs in AMPSphere were submitted to another six AMP prediction systems (AMPScanner v2, ampir with the model for mature peptides, amPEPpy, APIN with their proposed model, AI4AMP, and AMPlify).

The genes of c_AMPs were subjected to five different quality tests to reduce the likelihood that the observed peptides were artifacts or fragments of larger proteins. Initially, the peptides were searched against AntiFam v.7.0, which was designed to identify commonly recurring spuriously predicted ORFs, using HMMSearch with the option “--cut_ga”. Fewer than […] of c_AMPs had any significant hits.

For each smORF, we searched for an in-frame stop codon upstream of its start codon. When no stop codon is found, we cannot rule out the possibility that the smORF is part of a larger gene which we cannot observe due to fragmented assembly. Most ([…]) of the c_AMPs are encoded by at least one gene that is not terminally placed. However, the fact that a c_AMP is terminal does not imply that the given c_AMP is an artifact, since the AMP genes are short enough to be recovered even in short contigs. For example, […] ([…]) of homologs to DRAMP version 3.0 were found as terminal c_AMPs in AMPSphere.
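
The terminal-placement test described above can be illustrated with a small sketch. It assumes a forward-strand ORF, 0-based coordinates, and the three standard bacterial stop codons; the function name is hypothetical and this is not the pipeline's actual implementation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard stop codons


def has_upstream_inframe_stop(contig, start):
    """True if an in-frame stop codon lies upstream of the ORF starting at `start`.

    `start` is the 0-based index of the first base of the start codon on the
    forward strand. Returning False means the contig edge was reached first,
    i.e. the smORF would be treated as terminally placed.
    """
    pos = start - 3
    while pos >= 0:
        if contig[pos:pos + 3].upper() in STOP_CODONS:
            return True
        pos -= 3
    return False
```

Walking upstream codon by codon keeps the reading frame fixed, so a stop codon that appears out of frame (for example `TAA` at offset 1 relative to the frame) is correctly ignored.
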
The RNAcode program predicts protein-coding regions based on evolutionary signatures typical for protein genes. This analysis depends on a set of homologous and non-identical genes. Therefore, AMP clusters containing at least three gene variants were aligned. Given that an extensive portion of the AMPSphere candidates ([…] out of 863,498) is not part of such a cluster, they could not be tested. Of the tested c_AMPs, […] out of 403,588 ([…]) were considered genes with evolutionary traits of protein-coding sequences.

We then checked for evidence of transcription and/or translation using 221 publicly available metatranscriptomes, comprising human gut (142), peat (48), plant (13), and symbionts (17); and 109 publicly available metaproteomes from the PRIDE database comprising 37 habitats (Table S6). Using bwa v.0.7.17, reads from the metatranscriptomes were mapped against non-redundant AMP genes, and, using NGLess, we selected genes with at least one read mapped across a minimum of two samples to increase our confidence. This approach is similar to that adopted when predicting AMPs.

Using regular expressions implemented in Python, k-mers of all AMPSphere peptides (with length equal to at least half the length of the sequence) were compared to peptide sequences in metaproteomics data. A perfect match between a k-mer and a metaproteomic peptide was considered additional evidence that this c_AMP is likely to be translated, as described by Ma et al. Briefly, the number of c_AMP peptides mapped against the set of metaproteomic samples was counted, and those c_AMP peptides with at least one match covering more than […] of the peptide were marked as detected. c_AMPs with experimental evidence in metatranscriptomes and/or metaproteomes accounted for circa 20% of the AMPSphere.
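
One way to read the k-mer matching step is that an observed metaproteomic peptide counts as evidence when it is identical to a substring of the c_AMP that spans at least half of the c_AMP. The sketch below follows that reading, using plain substring tests in place of regular expressions; the function name and the `min_fraction` parameter are hypothetical, and the additional coverage threshold mentioned in the text is not modeled.

```python
import math


def is_detected(c_amp, metaproteome_peptides, min_fraction=0.5):
    """True if an observed peptide exactly matches a long-enough k-mer of `c_amp`.

    A k-mer is "long enough" when k >= min_fraction * len(c_amp); the default of
    0.5 mirrors "at least half the length of the sequence".
    """
    min_k = max(1, math.ceil(len(c_amp) * min_fraction))
    return any(
        len(observed) >= min_k and observed in c_amp
        for observed in metaproteome_peptides
    )
```

Testing `observed in c_amp` is equivalent to enumerating every k-mer of the c_AMP and comparing it with the observed peptide, but avoids materializing the k-mer set.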

The mapping of c_AMPs was performed without considering genomic context, which may have led to an overestimation of candidates being identified as potentially transcribed. For example, if they are homologous to longer proteins, the presence of the longer gene may lead to a false positive detection of the shorter c_AMP. We investigated this using Fisher’s Exact Test to compare the percentage of AMP homologs to the GMGCv1 database with experimental evidence of translation ([…] out of 61,020 peptides, Odds Ratio […]) and/or transcription ([…] out of 61,020 peptides, Odds Ratio […]). The results suggest that our approach tends to slightly overestimate the potential transcription and translation of candidates with canonical-length homologs.

Given that only a small number of transcriptomic or proteomic datasets were available and the aforementioned limitations in interpreting the mappings, we considered AMPs passing all quality-control tests to be high-quality, regardless of evidence of translation or transcription. We further separated those with experimental evidence of translation/transcription (17,115 c_AMPs, circa 2% of AMPSphere) and those without it (63,098 c_AMPs, circa 7%). For c_AMP families, we considered high-quality those where […] of their c_AMPs pass all quality control tests or those with at least one c_AMP possessing experimental evidence of translation/transcription.

Sample-based c_AMPs accumulation curves

To determine the saturation of c_AMP discovery, for each habitat or group of habitats, we computed sample-based accumulation curves by randomly sampling metagenomes in steps of 10 metagenomes. This procedure was repeated 32 times, and the average was taken.
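
A sample-based accumulation curve of this kind can be sketched as below. The step size of 10 and the 32 repetitions come from the text; sampling without replacement via shuffling, the seed, and the representation of each metagenome as a set of c_AMP identifiers are assumptions of this illustration.

```python
import random


def accumulation_curve(samples, step=10, repeats=32, seed=0):
    """Average number of distinct c_AMPs seen after every `step` samples.

    `samples` is a list of sets, one set of c_AMP identifiers per metagenome.
    Samples beyond the last full step are ignored in this sketch.
    """
    rng = random.Random(seed)
    n_points = len(samples) // step
    totals = [0.0] * n_points
    for _ in range(repeats):
        order = list(samples)
        rng.shuffle(order)  # random order = sampling without replacement
        seen = set()
        for i, sample in enumerate(order, start=1):
            seen |= sample
            if i % step == 0:
                totals[i // step - 1] += len(seen)
    return [total / repeats for total in totals]
```

A curve that flattens as more metagenomes are added suggests that c_AMP discovery in that habitat is approaching saturation.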

Multi-habitat and rare c_AMPs

We first counted c_AMPs present in […] habitats (“multi-habitat AMPs”). To then test the significance of this value, we opted for a similar approach to that described in Coelho et al.: habitat labels for each sample were shuffled 100 times and the number of resulting multi-habitat c_AMPs was counted. Shuffling labels resulted in […] multi-habitat c_AMPs by chance for high-level habitat groups, and in […] multi-habitat c_AMPs by chance when looking at the habitats individually inside the high-level groups. The Shapiro-Wilk test was used to check that the resulting data distribution is normal ([…], for specific habitats; […] for high-level habitats). In the original (non-shuffled) data, high-level habitat groups presented 93,280 multi-habitat c_AMPs (136.21 standard deviations below shuffled value), while specific habitats presented 173,955 multi-habitat c_AMPs (117.1 standard deviations below shuffled value).
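
The label-shuffling procedure can be sketched as follows. The 100 shuffles come from the text, while the threshold of two habitats for "multi-habitat", the seed, and all function names are assumptions of this illustration.

```python
import random
import statistics


def count_multi_habitat(sample_amps, habitats, min_habitats=2):
    """Count c_AMPs observed in at least `min_habitats` distinct habitats."""
    seen = {}
    for amps, habitat in zip(sample_amps, habitats):
        for amp in amps:
            seen.setdefault(amp, set()).add(habitat)
    return sum(len(found) >= min_habitats for found in seen.values())


def shuffled_null(sample_amps, habitats, n_shuffles=100, seed=0):
    """Return (observed count, mean and standard deviation under shuffled labels)."""
    rng = random.Random(seed)
    observed = count_multi_habitat(sample_amps, habitats)
    labels = list(habitats)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # break the sample-to-habitat association
        null.append(count_multi_habitat(sample_amps, labels))
    return observed, statistics.mean(null), statistics.stdev(null)
```

The distance between the observed count and the shuffled mean, expressed in shuffled standard deviations, is the quantity reported in the text; it is undefined when the shuffled standard deviation is zero, which callers would need to check.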

To determine the rarity of c_AMPs, we adapted the protocol previously established by Coelho et al., in which the non-redundant genes in AMPSphere were mapped against the reads of metagenome samples using NGLess. We considered only uniquely mapped reads. From the mapping, we computed the c_AMPs detected per sample and the number of detections per c_AMP, considering “rare” c_AMPs as those detected less often than the average of the entire AMPSphere (682 detections or […] of all samples, as previously described for species). This approach was adopted to overcome the high computational costs of a competitive mapping procedure. We expect that our approach overestimates how prevalent c_AMPs are, and because of that, it is a robust way to estimate the rarity of c_AMPs.

As the high-quality designation requires at least 3 gene variants for the RNAcode test to be performed, the rarest genes will not be high-quality. However, for robustness, we quantified this effect by computing the mean and median number of detections in only the high-quality c_AMPs and only non-terminal c_AMPs (a test which does not require a minimum number of genes). The mean number of detections is 682 for the full collection, 789 for high-quality c_AMPs, and 679 for non-terminal ones.

Testing c_AMPs overlap across habitats

As was done when testing the significance of the number of multi-habitat c_AMPs observed, the number of overlapping c_AMPs was computed for each pair of habitats. We shuffled the sample labels 1,000 times, counting the number of randomly overlapping c_AMPs for each pair of habitats. Because the Shapiro-Wilk test showed that the shuffled counts do not follow a normal distribution, we estimated the probability of observing the overlap with Chebyshev’s inequality, which does not rely on any assumption regarding the distribution of the data. Chebyshev’s inequality is $P(|X - \mu| \geq k\sigma) \leq 1/k^{2}$, where $k$ stands for the $Z$-score computed from the average and standard deviation estimated by the shuffling procedure. The p-values were adjusted using the Holm-Sidak method implemented in multipletests from the statsmodels package, and those below 0.05 were considered significant.
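
The two statistical steps can be sketched without external dependencies. The first function turns an observed overlap into the Chebyshev upper bound 1/k²; the second is a plain re-implementation of the Holm-Sidak step-down adjustment, offered as a stand-in for the statsmodels function named in the text rather than a copy of it. Function names are hypothetical.

```python
def chebyshev_p(observed, shuffled_mean, shuffled_sd):
    """Upper bound on P(|X - mean| >= k * sd) from Chebyshev's inequality."""
    k = abs(observed - shuffled_mean) / shuffled_sd
    return 1.0 if k <= 1 else 1.0 / k ** 2


def holm_sidak(pvalues):
    """Holm-Sidak step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Sidak correction with a shrinking number of remaining hypotheses.
        adj = 1.0 - (1.0 - pvalues[idx]) ** (m - rank)
        running_max = max(running_max, adj)  # enforce monotonicity
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

Because Chebyshev's bound holds for any distribution with finite variance, it is conservative: an overlap five standard deviations from the shuffled mean yields a bound of 0.04, far larger than the corresponding normal tail probability.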

c_AMP density in microbial species

The c_AMP density was defined as $\rho_{AMP} = n / L$, where $n$ is the number of c_AMP redundant genes and $L$ is the assembled base pairs. We assume, as an approximation, that in a large assembled segment, the start positions of AMP genes are independent and uniformly random. Then, we calculated the standard sample proportion error with the formula: $\mathrm{STDerr} = \sqrt{\rho_{AMP}(1 - \rho_{AMP}) / L}$. The standard sample proportion error was used to calculate the margin of error at a […] confidence interval ([…]).

To gain insights about the contributions of different phyla, species, and genera to the AMPSphere, we calculated the c_AMP density for these taxonomy levels using the c_AMPs included within AMPSphere, summing all assembled base pairs for contigs assigned to each taxonomy level in the samples used in AMPSphere. The c_AMP densities of genera, phyla, and species with a margin of error greater than […] of the calculated value were eliminated, along with outliers according to Tukey’s fences ([…]). We estimated species’ presence and abundance in each sample using mOTUs2. None of the genera with the highest c_AMP density (Algorimicrobium, TMED78, SFJ001, STGJ01, and CAG-462) were highly prevalent microbes.
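
A minimal sketch of the density calculation and its margin of error follows, treating the density as a sample proportion over assembled base pairs as described in this section. The critical value `z` is left as a parameter because the confidence level is not recoverable from the text; function names are hypothetical.

```python
import math


def amp_density(n_genes, assembled_bp):
    """c_AMP density: redundant c_AMP genes per assembled base pair."""
    return n_genes / assembled_bp


def margin_of_error(n_genes, assembled_bp, z):
    """Margin of error z * sqrt(p * (1 - p) / L) for the density p.

    `z` is the critical value for the chosen confidence level and must be
    supplied by the caller (an assumption of this sketch).
    """
    rho = amp_density(n_genes, assembled_bp)
    std_err = math.sqrt(rho * (1.0 - rho) / assembled_bp)
    return z * std_err
```

Taxa assembled from few base pairs get wide margins relative to their density, which is the rationale for discarding estimates whose margin of error exceeds a set fraction of the calculated value.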

c_AMPs and bacterial species transmissibility

We used the species taxonomy and transmissibility indices calculated by Valles-Colomer et al. to demonstrate the effect of AMPs on the transmission of bacterial species from mother to children. Only those species overlapping AMPSphere and the datasets from Valles-Colomer et al. were used for this analysis, and their AMP densities were calculated as described in the previous section (c_AMP density in microbial species), using all the predicted c_AMPs from metagenomes and genomes we obtained, also including those not in AMPSphere, to avoid sampling bias. The AMP density and the coefficient of transmissibility were correlated using Spearman’s method implemented in the scipy package for each setting: following children’s microbiome after 1, 3, and up to 18 years, as well as cohabitation and intra-datasets. The p-values of correlations were corrected using the Holm-Sidak method implemented in the multipletests function from the statsmodels package.

Determination of accessory AMPs

To uncover the prevalence of c_AMPs through the microbial pangenomes, core, shell, and accessory c_AMP clusters were determined using the subset of c_AMPs obtained from ProGenomes2 because of their high-confidence assigned taxonomies and genomically defined species (specI). To increase confidence in our measures, only species containing at least 10 genomes were used in this analysis. c_AMPs and AMP families present in fewer than […] of the genomes from a microbial species were classified as accessory. c_AMPs and families present in […] of the genomes in the cluster were classified as shell, and those present in […] of the genomes were classified as core genes.
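
The three-way classification can be sketched as a simple prevalence threshold rule. The prevalence cutoffs are not recoverable from the text, so they are required parameters here; the function name, the treatment of boundary values, and the return value for under-sampled species are assumptions.

```python
def classify_prevalence(n_with_amp, n_genomes, shell_min, core_min, min_genomes=10):
    """Classify a c_AMP (or family) within one species by its genome prevalence.

    `shell_min` and `core_min` are prevalence cutoffs in [0, 1] supplied by the
    caller. Species with fewer than `min_genomes` genomes are skipped (None).
    """
    if n_genomes < min_genomes:
        return None
    prevalence = n_with_amp / n_genomes
    if prevalence >= core_min:
        return "core"
    if prevalence >= shell_min:
        return "shell"
    return "accessory"
```

With purely illustrative cutoffs of 0.5 and 0.95, a c_AMP found in 12 of 20 genomes of a species would be labeled shell, and one found in a single genome would be labeled accessory.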

To determine the propensity of AMPs being shared between genomes belonging to the same strain, we first defined strains within species. For this, we used FastANI v.1.33 to cluster genomes from the same species in ProGenomes2. Genome groups with ANI […] were considered clonal complexes and only a single representative of each clonal complex was kept for further analyses. Species that had fewer than 10 genomes after this step were not considered further in this analysis. Next, we inferred strains (99.5% […]) as in Rodriguez et al. We then counted the pairs of genomes from the same species sharing AMPs, stratified by whether the pair originates from the same strain or not, and tested the results with Fisher’s Exact Test implemented in the scipy package.

To determine the proportions of accessory, shell, and core full-length proteins in the microbial pangenomes, we also extracted the predicted full-length proteins from the ENA database for each genome and hierarchically clustered them after alphabet reduction in a similar fashion to that described in the topic “AMP families”. Full-length protein clusters with […] sequences for each species were kept. The prevalence of full-length protein families within a species was computed as above, and the number of core families was compared to the number of c_AMP core families using the probability, calculated as the number of species with a proportion of core full-length protein families less than or equal to that observed for c_AMPs divided by the total number of assessed species.

To determine the genotype of Mycoplasma pneumoniae genomes in ProGenomes2, we extracted the gene coding for P1 adhesin by mapping the reference gene sequence NZ_LR214945.1:c568695-567307 against each genome with bwa v.0.7.17, and later extracted the sequences using SAMtools and BEDtools. The extracted gene sequences were aligned using Clustal Omega, and a phylogenetic tree was built using the aligned nucleotide sequences and FastTree with the restricted time-reversible substitution model and a bootstrapping procedure with 1,000 pseudo-replicates to determine node support. The tree was used to segregate and classify genomes, taking the strain type of reference genomes from Diaz et al.

Annotation of AMPs using different datasets

To detect homologs to previously published proteins, we aligned AMPSphere candidates against several databases: (i) the small protein sets in SmProt 2, (ii) the bioactive peptides database starPepDB, (iii) the small proteins from the global data-driven census of Salmonella, (iv) the global microbial gene catalog GMGCv1, and (v) the AMP database DRAMP version 3.0. To strictly avoid any artifacts of assembly for the analysis, only c_AMPs which passed the terminal placement test (i.e., for which there was strong evidence that the ORF is indeed complete) were searched against the GMGCv1.

The AMPs were annotated using MMseqs2 with the ‘easy-search’ method, retaining hits with an E-value up to […]. As Macrel removes the starting methionine from the peptides it outputs, hits starting at the second amino acid were treated as if they matched the first one.

We used the hypergeometric test implemented in the scipy package to model the association between c_AMPs and the background distribution of ortholog groups from GMGCv1. The number of genes that were redundant in GMGCv1 for each ortholog group was computed along with the counts for ortholog groups in the top hits to AMPSphere. The enrichment was given as the proportion of hits present in a given ortholog group divided by the proportion of that ortholog group among the redundant sequences in GMGCv1, and results were considered significant if […] after correction with the Holm-Sidak method implemented in multipletests from the statsmodels package.
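
The enrichment ratio and the hypergeometric tail probability behind it can be sketched with the standard library alone. This is a stand-in for the scipy routine named in the text, not a reproduction of the authors' script; function and argument names are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for a hypergeometric draw without replacement."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total


def enrichment(hits_in_group, total_hits, group_size, total_genes):
    """Proportion of hits in an ortholog group over the group's background share."""
    return (hits_in_group / total_hits) / (group_size / total_genes)
```

Here `population` would be the number of redundant background genes, `successes` the number belonging to the ortholog group, `draws` the number of c_AMP top hits, and `k` the number of those hits falling in the group.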

When using a robust approach that filters the ortholog groups by the number of c_AMP hits and GMGCv1 hits associated with them, using a minimum of 10, 20, or even 100 proteins, the results remained similar to those obtained with all data, showing that the extension of the ortholog groups in AMPSphere did not affect the enrichment analysis.

To check for genomic entities generated after gene truncation, we screened for c_AMP homologs using the default settings for Blastn against the NCBI database, keeping only significant hits with a maximum E-value of […]. As a case study, we selected AMP10.271_016, predicted to be produced by Prevotella jejuni, which shares the start codon with the gene coding for a NAD(P)-dependent dehydrogenase (WP_089365220.1). To verify the gene disposition and putative mutations leading to the AMP creation, we used Biopython to codon-align the fragments from metagenomic contigs assembled from samples SAMN09837386, SAMN09837387, and SAMN09837388, and genomic fragments of different strains of Prevotella jejuni CD3:33 (CP023864.1:504836-504949), F0106 (CP072366.1:781389-781502), F0697 (CP072364.1:1466323-1466436), and from Prevotella melaninogenica strains FDAARGOS_760 (CP054010.1:157726-157839), FDAARGOS_306 (CP022041.2:943522-943635), FDAARGOS_1566 (CP085943.1:1102942-1103055), and ATCC 25845 (CP002123.1:409656-409769), and compared the segments coding for the AMP and the original full-length protein.

Genomic context conservation analysis

To gain insights into the gene synteny involving AMP genes, we mapped the 863,498 AMP sequences against a collection of 169,632 reference genomes, metagenome-assembled genomes (MAGs), and single amplified genomes (SAGs) curated elsewhere with DIAMOND in “blastp” mode, as previously reported. Hits with identity […] (amino acid) and query and target coverage […] were considered significant. The target coverage threshold avoids hits to larger homologs whose function may be unrelated. This yielded 107,308 AMPs with homologs in at least one genome. We built gene families from the hits of each AMP detected in the prokaryotic genomes and calculated a conservation score based on the functional annotation of the neighboring genes in a window of three genes up and downstream. The vertical conservation score at each position within the window of each c_AMP was calculated as the number of genes with a given functional annotation (ortholog group, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, KEGG orthology, KEGG module, PFAM 33.1, and CARD; details of annotation and annotated database described previously) divided by the number of genes in the family. AMPs with more than two hits and a vertical conservation score […] with any functional term were considered to have conserved genomic contexts. Figure 4 shows genomic context conservation of different KEGG pathways.
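
The vertical conservation score at one window position can be sketched as below: for each functional term, the number of family members whose neighbor at that position carries the term, divided by the family size. The input layout (one set of terms per family member) and the function name are assumptions of this illustration.

```python
from collections import Counter


def vertical_conservation(neighbor_annotations):
    """Score each functional term at one window position of a gene family.

    `neighbor_annotations` holds one collection of functional terms per family
    member, describing that member's neighbor at the position of interest.
    Returns {term: fraction of family members whose neighbor has the term}.
    """
    n_members = len(neighbor_annotations)
    counts = Counter(
        term for terms in neighbor_annotations for term in set(terms)
    )
    return {term: count / n_members for term, count in counts.items()}
```

Family members whose neighbor lacks any annotation still count in the denominator, so sparsely annotated neighborhoods lower the score rather than being ignored.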

To test whether the fraction of AMPs with conserved genomic neighbors is similar to that of other gene families within the 169,632 genomes curated by del Río et al., we calculated genomic context conservation on […] gene families calculated de novo with MMseqs2 (using a minimal amino acid identity of […], coverage of the shorter sequence of at least […], and maximum E-value of […]). The c_AMPs were also annotated using EggNOG-mapper v2. Their KO annotations were compared to those of the immediate neighbors ([…] positions) to identify neighborhoods with the same function. It was possible to annotate […] out of 107,308 ([…]) of c_AMPs with hits to the genomes tested using the EggNOG5 database. Of these, […] were assigned to translation-related functions (class J), 14.4% belong to proteins of unknown function (S), and […] were assigned to replication, recombination, and repair (L).

AMPSphere web resource

AMPSphere is found at the address https://ampsphere.big-data-biology.org/. The implementation is based on Python and Vue Javascript. The database was built with sqlite, and SQLalchemy was used to map the database to Python objects. Internal and external APIs were built using FastAPI and Gunicorn to serve them. On the front end, Vue 3 was used as the backbone and Quasar was used to build the layout. Plotly was used to generate interactive visualization plots, and Axios to render content seamlessly. LogoJS (https://logojs.wenglab.org/app/) was used to generate sequence logos for AMP families, while the helical wheel app (https://github.com/clemlab/helicalwheel) was used to generate AMP helical wheels.

Peptide selection for synthesis and testing

We selected two groups of peptides: (i) 50 peptides that were selected as being particularly likely to be active and that were otherwise interesting (as described below), and (ii) 50 peptides selected randomly after applying technical exclusions.

For the first group, only high-quality (see the topic “quality control of c_AMPs”) c_AMPs were considered for synthesis. They were further filtered according to six criteria for solubility

and three criteria for synthesis, as in PepFun.

We estimated the solubility using the criteria implemented in PepFun,

observing that

peptides

passed at least half of the solubility criteria evaluated. The subset that is homologous to peptides in DRAMP

version 3.0 had a slightly lower rate,

passed half the tests. We then assessed the peptides regarding their ease of synthesis; however, only

from AMPSphere passed at least 2 out of the 3 criteria established for chemical synthesis.

A peptide approved for at least six of the above-mentioned criteria was then filtered by predicting AMP activity with six methods in addition to Macrel

: AMPScanner v2,

the mature peptides model in ampir,

amPEPpy,

APIN

– with their proposed model, AI4AMP,

and AMPLify.

Peptides predicted to be AMPs by all methods were filtered by length, discarding sequences longer than 40 amino acid residues, for which conventional solid-phase peptide synthesis using Fmoc strategy has lower yields and many recoupling reactions.

Only one peptide was kept from each family or cluster, namely the one with the highest number of observed smORFs. After this process, we obtained 364 candidate AMPs, belonging to 166 families and 198 clusters with <8 c_AMPs. Of these, 30 candidates were homologous to sequences from the databases used in annotation (e.g., SmProt

). To compose the list of 50 high-likelihood candidates: (i) we selected 34 of the most prevalent peptides; (ii) we randomly selected 14 c_AMPs (

of our set) with homologs to the GMGCv1

and one that matched SmProt

; and (iii) we included one peptide that was found in the MAGs binned from stool samples used to investigate fecal transplantations.
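The filtering steps described above (criteria count, agreement of all predictors, length cutoff, and one representative per family or cluster) can be sketched as a small pipeline. The `Candidate` fields and function name are hypothetical; the predictors themselves are not run here, and their outputs are assumed to be available as booleans.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    sequence: str
    family: str              # family or cluster identifier
    n_smorfs: int            # number of observed smORFs
    criteria_passed: int     # solubility and synthesis criteria passed
    predictions: List[bool]  # one vote per AMP predictor (Macrel plus six others)


def select_candidates(cands: List[Candidate], min_criteria: int = 6,
                      max_length: int = 40) -> List[Candidate]:
    """Apply the filters in order and keep the most frequently observed peptide per family."""
    best: Dict[str, Candidate] = {}
    for c in cands:
        if c.criteria_passed < min_criteria:
            continue  # fails the solubility/synthesis criteria
        if not c.predictions or not all(c.predictions):
            continue  # not predicted to be an AMP by every method
        if len(c.sequence) > max_length:
            continue  # too long for efficient Fmoc solid-phase synthesis
        current = best.get(c.family)
        if current is None or c.n_smorfs > current.n_smorfs:
            best[c.family] = c
    return list(best.values())
```

The later hand-picking of the 50 high-likelihood candidates (prevalence, homology to GMGCv1 and SmProt, presence in the fecal transplantation MAGs) is not covered by this sketch.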

We also included scrambled sequences made using five of the most active peptide sequences to verify the potency of randomly generated sequences.
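A scrambled control keeps the amino acid composition of a peptide while randomizing residue order. The sketch below shows one simple way to do this with a uniform shuffle; the function name, the retry loop, and the example sequence in the note are assumptions, and the text does not state how the authors generated their scrambled sequences.

```python
import random
from typing import Optional


def scramble(sequence: str, seed: Optional[int] = None, max_tries: int = 100) -> str:
    """Return a random permutation of `sequence`, avoiding the original order when possible."""
    rng = random.Random(seed)
    residues = list(sequence)
    if len(set(residues)) < 2:
        return sequence  # every permutation is identical to the input
    candidate = sequence
    for _ in range(max_tries):
        rng.shuffle(residues)
        candidate = "".join(residues)
        if candidate != sequence:
            break
    return candidate
```

Calling `scramble("KKLLAGFW", seed=1)` returns a sequence with the same residues in a different order; passing a seed makes the result reproducible.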

To build the group of randomly selected peptides, we first selected c_AMPs that are not homologous to any other databases tested and that passed the abovementioned synthesis criteria (total of 768,061 out of 863,498 peptides). We further divided this group into subgroups: (i) those with Macrel-assigned probability

( 271,555 c_AMPs) and (ii) those in the range

( 496,506 c_AMPs; note that all c_AMPs in AMPSphere have a Macrel-assigned probability

). We randomly sampled 25 peptides from each group.
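The stratified random selection can be sketched as below. The `cutoff` separating the two subgroups is a required parameter because the probability ranges are not given in this text, and whether the boundary is inclusive is an assumption. The dictionary layout and function name are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def sample_by_probability(probs: Dict[str, float], cutoff: float,
                          n_per_group: int = 25,
                          seed: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Split peptides at `cutoff` and draw `n_per_group` from each side without replacement."""
    rng = random.Random(seed)
    # Sorting first makes the draw reproducible for a given seed.
    high = sorted(pid for pid, p in probs.items() if p > cutoff)
    low = sorted(pid for pid, p in probs.items() if p <= cutoff)  # boundary handling is assumed
    # random.sample raises ValueError if a group holds fewer than n_per_group peptides.
    return rng.sample(high, n_per_group), rng.sample(low, n_per_group)
```

Sampling each stratum separately guarantees equal representation of both probability ranges, which a single draw from the pooled set would not.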

Minimal inhibitory concentration (MIC) determination

The 100 AMPs were tested for antimicrobial activity using the broth microdilution method.

MIC values were considered as the concentration of the peptides that killed

of cells after 24 h of incubation at

. First, peptides diluted in water were added to untreated flat-bottom polystyrene microtiter 96-well plates in 2-fold dilutions ranging from 64 to

, and then peptides were exposed to an inoculum of

cells in LB or BHI broth, for pathogens and gut commensals, respectively. After the incubation time, the absorbance of each well was measured at 600 nm using a spectrophotometer. The assays were conducted in three biological replicates to ensure statistical reliability.
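A two-fold dilution series and an MIC readout from endpoint absorbance can be sketched as follows. The number of dilution steps and the `growth_cutoff` used to call a well growth-free are assumptions supplied by the caller; the text defines the MIC by the fraction of cells killed, so mapping it to an absorbance cutoff is a simplification.

```python
from typing import Dict, List, Optional


def twofold_series(top: float, n_steps: int) -> List[float]:
    """Concentrations obtained by halving `top` repeatedly, highest first."""
    return [top / 2 ** i for i in range(n_steps)]


def mic_from_od(od_by_conc: Dict[float, float], growth_cutoff: float) -> Optional[float]:
    """Lowest concentration for which it and every higher concentration read below the cutoff.

    Returns None when even the highest concentration shows growth.
    """
    mic = None
    for conc in sorted(od_by_conc, reverse=True):
        if od_by_conc[conc] >= growth_cutoff:
            break  # growth detected; lower concentrations are not inhibitory
        mic = conc
    return mic
```

In practice the readout would be applied to each of the three biological replicates before reporting a value.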

Circular dichroism assays

Circular dichroism experiments were conducted using a J1500 circular dichroism spectropolarimeter (Jasco) at the Biological Chemistry Resource Center (BCRC) of the University of Pennsylvania. The experiments were carried out at a temperature of

. Circular dichroism spectra were obtained by averaging three accumulations using a quartz cuvette with an optical path length of 1.0 mm. The spectra were recorded in the wavelength range from 260 to 190 nm at a scanning rate of

with a bandwidth of 0.5 nm. The peptides were tested at a concentration of

. Measurements were performed in water, a mixture of water and trifluoroethanol (TFE) in a ratio of

, and a mixture of water and methanol in a ratio of

. Baseline measurements were recorded prior to each measurement. To minimize background effects, a Fourier transform filter was applied. The helical fraction values were calculated using the single spectra analysis tool available on the BeStSel server.

Outer membrane permeabilization assays

Membrane permeability was analyzed using the 1-(N-phenylamino)naphthalene (NPN) uptake assay. NPN demonstrates weak fluorescence in an extracellular environment but displays strong fluorescence when in contact with lipids from the bacterial outer membrane. Thus, NPN will show increased fluorescence when the integrity of the outer membrane is compromised. A. baumannii ATCC 19606 and P. aeruginosa PAO1 were cultured until cell numbers reached an

of 0.4, followed by centrifugation (

for 3 min), washing, and resuspension in buffer (

HEPES,

glucose, pH 7.4). Subsequently,

of NPN solution (working concentration of

) was added to

of bacterial solution in a white flat bottom 96-well plate. The fluorescence was monitored at

and

. The peptide solutions in water (

solution at their MIC values) were introduced into each well, and fluorescence was monitored as a function of time until no further increase in fluorescence was observed (30 min). The relative fluorescence was calculated using a non-linear fit. The positive control (antibiotic polymyxin B) was used as baseline. The following equation was applied to reflect % of difference between the baseline (polymyxin B) and the sample:
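One common form of such a percentage difference, given here as an assumption rather than as the authors' exact expression, is 100 × (F_sample − F_PMB) / F_PMB, where F denotes fluorescence. A minimal sketch with hypothetical names:

```python
def percent_difference(sample: float, baseline: float) -> float:
    """Percentage difference of `sample` relative to `baseline` (assumed form)."""
    if baseline == 0:
        raise ValueError("baseline fluorescence must be non-zero")
    return 100.0 * (sample - baseline) / baseline
```

Under this form, a positive value means the peptide produced more fluorescence than polymyxin B, and a negative value means less.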

Cytoplasmic membrane depolarization assays

The ability of the peptides to depolarize the cytoplasmic membrane was assessed by measuring the fluorescence of the membrane potential-sensitive dye 3,3′-dipropylthiadicarbocyanine iodide [DiSC3-(5)]. This potentiometric fluorophore fluoresces upon release from the interior of the cytoplasmic membrane in response to an imbalance of its transmembrane potential. A. baumannii ATCC 19606 and P. aeruginosa PAO1 cells were grown with agitation at

until they reached mid-log phase (

). The cells were then centrifuged and washed twice with washing buffer (

glucose,

HEPES, pH 7.2) and resuspended to an

of 0.05 in

glucose,

. An aliquot of

of bacterial cells was added to a black flat bottom 96-well plate and incubated with

-(5) for 15 min until the fluorescence stabilized, indicating the incorporation of the dye into the cytoplasmic membrane. The membrane depolarization was monitored by observing the change in the fluorescence emission intensity of the dye (

), after the addition of the peptides (

solution at their MIC values). The relative fluorescence was calculated using a non-linear fit. The positive control (antibiotic polymyxin B) was used as baseline. We estimated the % of difference between the baseline (polymyxin B) and the sample using the same mathematical approach as in the “Outer membrane permeabilization assays”.

QUANTIFICATION AND STATISTICAL ANALYSIS

Graphs for the experimental results were created and statistical tests were conducted in GraphPad Prism v.9.5.1 (GraphPad Software, San Diego, California, USA).

ADDITIONAL RESOURCES

AMPSphere is freely available for download in Zenodo

and as a web server (https://ampsphere.big-data-biology.org/).

Supplemental figures

Figure S1. General physical-chemical features of c_AMPs in AMPSphere and validated databases of antimicrobial peptides, related to Figure 1
Shown are density curves; the arbitrary density units are not shown, as all curves are independently normalized so the area under the curve is one. For each dataset and feature, the top

and bottom

of values were considered outliers and are not shown in the plot. Proportions of residues with small side chains

per

AMP along with the proportions of basic residues

per c_AMP were also shown. The distributions of each feature were compared among the datasets using the Mann-Whitney test with multiple hypothesis testing corrected using Holm-Sidak. Almost all differences are significant (adjusted

value

). The exceptions are: aliphatic index did not differ between the peptides from DRAMP version

and the ones present in the positive training set used in Macrel

( p

); AMPSphere peptides did not differ from the positive training set used in Macrel

in the fraction of aromatic (

0.58 ), non-polar (

), polar (

), and acidic (

) residues; the instability index (

) and the hydrophobicity (

) of AMPSphere peptides also were not different from the positive training set used in Macrel.
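The Holm-Sidak step-down correction named in this caption can be sketched in plain Python as follows. This is an illustrative implementation of the standard procedure, not the authors' code; the p values themselves would come from pairwise Mann-Whitney tests, which are not reproduced here.

```python
from typing import List, Sequence


def holm_sidak(pvalues: Sequence[float]) -> List[float]:
    """Holm-Sidak adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Sidak adjustment over the (m - rank) hypotheses still under test.
        adj = 1.0 - (1.0 - pvalues[idx]) ** (m - rank)
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, adj)
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

A comparison is then called significant when its adjusted p value falls below the chosen significance level.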

Figure S2. c_AMP quality and habitat distribution, related to Figures 1 and 2
(A) Quality assessment of AMPSphere revealed that most of the peptides passed at least one of the tests. The RNAcode test depends on gene diversity, which is very low for AMPSphere; this led to a low rate of positives among our candidates.
(B) c_AMPs homologous to sequences in databases of validated bioactive peptides also showed a higher average quality.
(C) The limited overlap of c_AMPs among habitats argues in favor of using habitat groups to gain resolution. Note that the group of habitats with the highest paired overlaps belongs to human body sites and samples from human guts and non-human mammalian guts. Only habitats with at least 100 samples were shown.
(D) We observed a large proportion of rare genes in AMPSphere from different habitat groups.

Figure S3. Clustering validation of families, related to STAR Methods section “Clustering of AMP families”
To validate the clustering procedure using a reduced amino acid alphabet, samples of 1,000 peptides were randomly drawn from AMPSphere (excluding representative sequences) and aligned against their cluster representatives. Three different levels (I, II, and III) of clustering were tested. The E-values were computed per alignment and plotted against the corresponding alignment identity. The averaged proportion of significant alignments is shown in each graph above.

Figure S4. Antimicrobial activity of polymyxin B and levofloxacin and circular dichroism spectra of the c_AMPs, related to STAR Methods section “Circular dichroism assays”
(A) Minimal inhibitory concentration values for polymyxin B, a peptide antibiotic, and levofloxacin against all the strains tested. Polymyxin B and levofloxacin were used as positive controls in all antimicrobial assays.
(B-D) The c_AMPs’ secondary structural tendency was analyzed using three different solvents: (B) water, (C) trifluoroethanol (TFE) and water mixture (

), and (D) methanol (MeOH) and water mixture (

). The experiments were carried out at

, and the circular dichroism spectra shown are an average of three accumulations obtained using a quartz cuvette with an optical path length of 1.0 mm, ranging from 260 to 190 nm at a rate of

and a bandwidth of 0.5 nm. All peptides were tested at a concentration of

, with respective baselines recorded prior to measurement. A Fourier transform filter was applied to minimize background effects.

Figure S6. Mechanism of action of AMPSphere peptides and anti-infective activity of c_AMPs in a preclinical animal model, related to Figures 6 and 7
(A) Fluorescence values relative to polymyxin B (PMB, positive control) of the fluorescent probe 1-(N-phenylamino)naphthalene (NPN) that indicate outer membrane permeabilization of P. aeruginosa PAO1 cells.
(B) Fluorescence values relative to PMB (positive control) of 3,3′-dipropylthiadicarbocyanine iodide (DiSC3-[5]), a hydrophobic fluorescent probe used to indicate cytoplasmic membrane depolarization of P. aeruginosa PAO1 cells.
(C) Bacterial counts four days post-infection; the c_AMPs were tested at their MIC in a single dose 1 h after the establishment of the infection. Each group consisted of three mice (n = 3), and the bacterial loads used to infect each mouse were derived from a different inoculum.
(D) Mouse weight throughout the experiment (mean ± the standard deviation).
Statistical significance in (C) was determined using one-way ANOVA where all groups were compared to the untreated control group; p values are shown for each of the groups. Features on the violin plots represent median and upper and lower quartiles. Figure created in BioRender.com.

(D) Fluorescence values relative to polymyxin B (PMB, positive control) of the fluorescent probe 1-(N-phenylamino)naphthalene (NPN) that indicate outer membrane permeabilization of A. baumannii ATCC 19606 cells.
(E) Fluorescence values relative to PMB (positive control) of 3,3′-dipropylthiadicarbocyanine iodide (DiSC3-[5]), a hydrophobic fluorescent probe used to indicate cytoplasmic membrane depolarization of A. baumannii ATCC 19606 cells. Depolarization of the cytoplasmic membrane occurred with slow kinetics compared to the permeabilization of the outer membrane and took approximately 20 min to stabilize.
Figure S5. Antimicrobial activity and secondary structure of scrambled versions of some of the lead c_AMPs, related to Figures 6 and 7
(A) MIC values of the scrambled versions of five of the lead c_AMPs from AMPSphere tested against the same 11 pathogenic strains and eight gut commensal strains used to assess the activity of the c_AMPs.
(B-D) The scrambled peptides’ secondary structural tendency was analyzed using three different solvents: (B) water, (C) TFE and water mixture ( ), and (D) MeOH and water mixture ( ). The experiments were carried out in the same conditions as the ones used for the AMPs. A Fourier transform filter was applied to minimize background effects.
(E) Heatmap with the percentage of secondary structure found for each peptide in three different solvents: water, TFE in water, and MeOH in water. Secondary structure was calculated using the BeStSel server.