ColabFold: making protein folding accessible to all
Abstract
ColabFold offers accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold’s 40-60-fold faster search and optimized model utilization enables prediction of close to
structures per day on a server with one graphics processing unit. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at https://github.com/sokrypton/ColabFold and its novel environmental databases are available at https://colabfold.mmseqs.com.
Predicting the three-dimensional (3D) structure of a protein from its sequence alone remains an unsolved problem. However, by exploiting the information in multiple sequence alignments (MSAs) of related proteins as the raw input features for end-to-end training, AlphaFold2 (ref.
) was able to predict the 3D atomic coordinates of folded protein structures at a median global distance test total score (GDT_TS) of
in the latest round of the protein folding competition by the international community, CASP14 (Critical Assessment of protein Structure Prediction, round 14) (ref.
). The accuracy of many of the predicted structures was within the error margin of experimental structure determination methods. Many ideas of AlphaFold2 were independently reproduced and implemented in RoseTTAFold (ref.
). In addition to predictions for single chains, RoseTTAFold and, later, AlphaFold were also shown to generalize to protein complexes. Evans et al. have since released AlphaFold-multimer, a refined version of AlphaFold2 for the prediction of protein complexes. Thus, two highly accurate open-source prediction methods for single chains and one for protein complexes are now publicly available.
To leverage the power of these methods, researchers require powerful computing capabilities. First, to build diverse MSAs, large collections of protein sequences from public reference and environmental databases are searched using the most sensitive homology detection methods,
and HHblits
, both of which use profile hidden Markov models (HMMs). These environmental databases contain billions of proteins extracted from metagenomic and transcriptomic experiments, which often complement databases dominated by isolated genomes. Due to their large size, searches can take up to hours for a single protein while requiring more than 2 TB of storage space alone. Second, to execute the deep neural networks, graphics processing units (GPUs) with a large amount of GPU RAM (random access memory) are required even for relatively common
residues. For these, however, the MSA generation dominates the overall run time.
developed an AlphaFold2 Jupyter Notebook for Google Colaboratory (referred to as AlphaFold-Colab), for which the input MSA is built by searching with HMMer against the UniProt Reference Clusters (UniRef90) and an eightfold-reduced environmental database. This results in less accurate predictions while still requiring long search times.
-fold faster MMseqs2 (Many-against-Many sequence searching) (refs.
), and speeds up batch predictions by
-fold by avoiding recompilation and adding an early stop criterion. We show that ColabFold outperforms AlphaFold-Colab and matches AlphaFold2 on CASP14 targets and also matches AlphaFold-multimer on the ClusPro dataset in prediction quality.

a command line interface (a) that send FASTA input sequence(s) to an MMseqs2 server (b) searching two databases, UniRef100 and a database of environmental sequences, with three profile-search iterations each. The second database is searched using a sequence profile generated from the UniRef100 search as input. The server generates two MSAs in A3M format containing all detected sequences. c, For predictions of single structures (i) we filter both A3Ms using a diversity-aware filter and return this to be provided as the MSA input feature to the AlphaFold2 models. For predictions of complexes (ii) we pair the top hits within the same species to resolve the inter-chain contacts and additionally add two unpaired MSAs (same as i) to guide the structure prediction. Single chain predictions are ranked by pLDDT and complexes by predicted TM-score. d, To help researchers judge the prediction quality we visualize MSA depth and diversity and show the AlphaFold2 confidence measures (pLDDT and PAE).
) sufficiently diverse sequences is enough to produce high-quality predictions (see fig. 5a in ref.
).
, phage catalogs
and an updated version of MetaClust
. We refer to this database as ColabFoldDB (Methods 2.3.2). In Supplementary Fig. 2 we show that ColabFoldDB, in comparison with BFD/MGnify, produces more diverse MSAs for domains in the protein families database (Pfam) with <30 members.
and 0.62 for ColabFold-AlphaFold2-BFD/MGnify, ColabFold-AlphaFold2-ColabFoldDB, AlphaFold2, AlphaFold-Colab and ColabFold-RoseTTAFold-BFD/MGnify, respectively. Over all CASP14 targets (excluding AlphaFold-Colab because it cannot be used as a standalone) the TM-scores are 0.887,
and 0.754 for the respective methods. The prediction of target T1084 can be improved from a TM-score of 0.457 to 0.872 by ColabFold if MMseqs2’s compositional filter is disabled (Supplementary Fig. 3). Supplementary Table 1 lists the additional targets for which ColabFold differed significantly from AlphaFold2.
it could often successfully model complexes. Shortly afterwards, Baek found that increasing the model’s internal parameter, residue-index (the method that was used in RoseTTAFold), could also be done in AlphaFold2.
). We implemented a similar pairing procedure (Methods 2.4.2) and show the prediction capabilities of ColabFold for complexes in Fig. 2c. ColabFold achieves the highest accuracy in the prediction of complexes on the ClusPro dataset with the AlphaFold-multimer model; however, some targets performed better using the residue-index mode.
.

. The yellow dashed line represents an extrapolation on the basis of the 50 AlphaFold2 predictions.
on one Nvidia Titan RTX (Fig. 2d), while sacrificing little or no prediction accuracy (Methods 2.7.4). The average pLDDTs of AlphaFold2 and ColabFold Stop
were 89.75 and 88.78 in a subsampled set of 50 proteins.
Online content
Published online: 30 May 2022
References
- Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583-589 (2021).
- Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP): round XIV. Proteins 89, 1607-1617 (2021).
- Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871-876 (2021).
- Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. Preprint at bioRxiv https://doi.org/10.1101/2021.10.04.463034 (2021).
- UniProt Consortium UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506-D515 (2019).
- Mitchell, A. L. et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res. 48, D570-D578 (2020).
- Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
- Steinegger, M. et al. HH-suite 3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019).
- Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590-596 (2021).
- Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026-1028 (2017).
- Mirdita, M., Steinegger, M. & Söding, J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics 35, 2856-2858 (2019).
- Kozakov, D. et al. The ClusPro web server for protein-protein docking. Nat. Protoc. 12, 255-278 (2017).
- Levy Karin, E., Mirdita, M. & Söding, J. MetaEuk-sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics. Microbiome 8, 48 (2020).
- Delmont, T. O. et al. Functional repertoire convergence of distantly related eukaryotic plankton lineages abundant in the sunlit ocean. Cell Genomics 2, 100123 (2022).
- Alexander, H. et al. Eukaryotic genomes from a global metagenomic dataset illuminate trophic modes and biogeography of ocean plankton. Preprint at bioRxiv https://doi.org/10.1101/2021.07.25.453713 (2021).
- Nayfach, S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 6, 960-970 (2021).
- Camarillo-Guerrero, L. F., Almeida, A., Rangel-Pineros, G., Finn, R. D. & Lawley, T. D. Massive expansion of human gut bacteriophage diversity. Cell 184, 1098-1109 (2021).
- Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nat. Commun. 9, 2542 (2018).
- Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412-D419 (2021).
- Moriwaki, Y. AlphaFold2 can also predict heterocomplexes. all you have to do is input the two sequences you want to predict and connect them with a long linker. Twitter https://twitter.com/Ag_smith/status/1417063635000598528 (2021).
- Baek, M. Adding a big enough number for ‘residue_index’ feature is enough to model hetero-complex using AlphaFold. Twitter https://twitter.com/minkbaek/status/1417538291709071362 (2021).
- Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein-protein interactions using AlphaFold2. Nat. Commun. 13, 1265 (2022).
- Mosalaganti, S. et al. Artificial intelligence reveals nuclear pore complexity. Preprint at bioRxiv https://doi.org/10.1101/2021.10.26.465776 (2021).
© The Author(s) 2022
Methods
. The first is AlphaFold2_mmseqs2 for basic use, which supports protein structure prediction using MSAs generated by MMseqs2 (version edb822), custom MSA upload, use of template information, relaxing of the predicted structures using Amber force fields, and prediction of complexes. The second, AlphaFold2_advanced, for advanced users, additionally supports MSA generation using HMMer (same as AlphaFold-Colab), the sampling of diverse structures by iterating through a series of random seeds (num_samples), and control of AlphaFold2 model internal parameters, such as changing the number of recycles (max_recycle), the number of ensembles (num_ensemble), and the is_training option. The use of the is_training option enables dropout during inference. This activates the stochastic part of the model and can result in different predictions. Thus, by iterating through different seeds, one can sample different structure predictions from the uncertainty of the model or the ambiguity of co-evolution constraints derived from the input MSA. The third main type of Jupyter notebook is AlphaFold2_batch, for batch prediction of multiple sequences or MSAs. The batch notebook saves time by avoiding recompilation of the AlphaFold2 models (section 2.5.2) for each individual input sequence. The fourth type is RoseTTAFold, for basic use of RoseTTAFold, which supports protein structure prediction using MSAs generated by MMseqs2 or custom MSAs, and sidechain prediction using SCWRL4 (ref.
). The RoseTTAFold notebook also has an option to use a slower but more accurate PyRosetta folding protocol for structure prediction, using constraints predicted by RoseTTAFold’s neural network.
. It searches the sequence(s) with three iterations against the consensus sequences of the UniRef30, a clustered version of the UniRef100 (ref.
). We accept hits with an E-value lower than 0.1. For each hit we realign its respective UniRef100 cluster members using the profile generated by the last iterative search, filter them (Methods 2.2.2) and add these to the MSA. This expanding search results in a speed-up of
-fold given that only 29.3 million cluster consensus sequences are searched instead of all 277.5 million UniRef100 sequences. Additionally, it has the advantage of being more sensitive given that the cluster consensus sequences are used. We use the UniRef30 sequence profile to perform an iterative search against the BFD/MGnify or ColabFoldDB using the same parameters, filters and expansion strategy.
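As a rough check on the size of the reduced search space, the two database sizes quoted above can be compared directly. This is a minimal sketch; the constant and function names are illustrative, and the ratio is only a proxy for run time, because the realignment of cluster members and the profile iterations also cost time.

```python
# Database sizes quoted in the text (numbers of sequences).
UNIREF30_CONSENSUS = 29.3e6   # cluster consensus sequences that are searched
UNIREF100_TOTAL = 277.5e6     # all UniRef100 sequences


def search_space_reduction(searched: float, total: float) -> float:
    """Return how many times smaller the searched set is than the full set."""
    return total / searched


reduction = search_space_reduction(UNIREF30_CONSENSUS, UNIREF100_TOTAL)
print(f"search space is {reduction:.1f}x smaller")
```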
implemented in MMseqs2 in multiple stages. In the first stage, during UniRef cluster expansion, we filter each individual UniRef30 cluster before adding the cluster members to the MSA, such that no cluster pair has a higher maximum sequence identity than 95% (--max-seq-id 0.95). In the second stage, after realignment we enable only the --qsc 0.8 threshold and disable all other thresholds (--qid 0 --diff 0 --max-seq-id 1.0). Additionally, the qsc filtering is used only if at least 100 hits are found (--filter-min-enable 100). In the last stage, during MSA construction we filter again with the following parameters: --filter-min-enable 1000 --diff 3000 --qid
--qsc 0 --max-seq-id 0.95. Here, we extended the HHblits filtering algorithm to filter within a given sequence identity bucket such that it cannot eliminate redundancy across filter buckets. Our filter keeps the 3,000 most diverse sequences in the identity buckets [0.0-0.2], (0.2-0.4], (0.4-0.6], (0.6-0.8] and (0.8-1.0]. In buckets containing fewer than 1,000 hits we disable the filtering.
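The bucketed last stage can be sketched as follows. This is a simplified illustration, not the MMseqs2 implementation: the names `bucket_index`, `filter_per_bucket` and `take_first` are hypothetical, the lowest bucket is assumed to start at an identity of 0, and the diversity selection itself is replaced by a plain truncation stand-in, so that only the bucketing and the per-bucket enable threshold are shown.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Hit = Tuple[str, float]  # (sequence name, sequence identity to the query in [0, 1])

# Upper edges of the identity buckets; each bucket includes its upper edge.
BUCKET_EDGES = (0.2, 0.4, 0.6, 0.8, 1.0)


def bucket_index(identity: float) -> int:
    """Map a sequence identity to the index of its identity bucket."""
    if identity < 0.0:
        raise ValueError(f"identity out of range: {identity}")
    for index, upper in enumerate(BUCKET_EDGES):
        if identity <= upper:
            return index
    raise ValueError(f"identity out of range: {identity}")


def take_first(hits: Sequence[Hit], limit: int) -> List[Hit]:
    """Stand-in selector (assumption): truncates instead of ranking by diversity."""
    return list(hits[:limit])


def filter_per_bucket(
    hits: Sequence[Hit],
    keep: int = 3000,        # sequences kept per bucket (--diff 3000)
    min_enable: int = 1000,  # buckets smaller than this are not filtered
    select: Callable[[Sequence[Hit], int], List[Hit]] = take_first,
) -> List[Hit]:
    """Filter each identity bucket independently of the others."""
    buckets: Dict[int, List[Hit]] = defaultdict(list)
    for hit in hits:
        buckets[bucket_index(hit[1])].append(hit)

    kept: List[Hit] = []
    for index in sorted(buckets):
        members = buckets[index]
        if len(members) < min_enable:
            kept.extend(members)  # too few hits: filtering disabled
        else:
            kept.extend(select(members, keep))
    return kept
```

Because each bucket is processed on its own, a densely populated high-identity bucket cannot crowd out the few remote homologs in the low-identity buckets, which is the property the paragraph above describes.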
improved it to also support fast single-against-many searches. This type of search requires the database to be indexed and stored in memory. mmseqs createindex indexes the sequences and stores all time-consuming-to-compute data structures used for MMseqs2 searches to disk. We load the index into the operating system’s cache using vmtouch (https://github.com/hoytech/vmtouch) to enable calls to the different MMseqs2 modules to become nearly overhead free. We extended the index to store, in addition to the already present cluster consensus sequences, all member sequences and the pairwise alignments of the cluster representatives to the cluster members. With these resident in cache, we eliminate the overhead of the remaining module calls.
RAM for headers and sequences alone.
billion proteins organized in 64 million clusters. MGnify (2019_05) contains ~300 million environmental proteins. We merged both databases by searching the MGnify sequences against the BFD cluster-representative sequences using MMseqs2. Each MGnify sequence with a sequence identity of
and a local alignment that covers at least
of its length is assigned to the respective BFD cluster. All unassigned sequences are clustered at
sequence identity and
coverage (--min-seq-id 0.3 -c 0.3 --cov-mode 1 -s 3) and merged with the BFD clusters, resulting in 182 million clusters. To reduce the size of the database, we filtered each cluster, keeping only the 10 most diverse sequences using mmseqs filterresult --diff 10. This reduced the total number of sequences from 2.5 billion to 513 million, thus requiring only 84 GB RAM for headers and sequences.
, MetaEuk (eukaryotes)
, TOPAZ (eukaryotes)
, MGV (DNA viruses)
, GPD (bacteriophages)
and an updated version of MetaClust
against the BFD/MGnify centroids using MMseqs2 and assigned each sequence to the respective cluster if they have a
sequence identity at a
sequence overlap (-c 0.9 --cov-mode 1 --min-seq-id 0.3). All remaining sequences were clustered using MMseqs2 cluster -c 0.9 --cov-mode 1 --min-seq-id 0.3 and appended to the database. We remove redundancy per cluster by keeping the 10 most diverse sequences using mmseqs filterresult --diff 10. The final database consists of
million representative sequences and
members (see the Data Availability section for the input files). We provide the MMseqs2 search workflow used in the server (Methods 2.2.1) as a standalone script (colabfold_search).
) to find the 20 top ranked templates. To save time, we use MMseqs2 (ref.
) to search against the PDB70 cluster representatives as a prefiltering step to find candidate templates. This search is also done as part of the MMseqs2 API call on our server. Only the top 20 target templates according to E-value are then aligned by HHsearch. The accepted templates are given to AlphaFold2 as input features. This alignment step is done in the ColabFold client and therefore it requires the subset of the PDB70 containing the respective HMMs. The PDB70 subset and the PDB mmCIF files are fetched from our server. For benchmarking, no templates are given to ColabFold.
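The prefiltering step amounts to keeping the candidates with the smallest E-values. A minimal sketch, assuming the search hits are available as hypothetical `(template_id, e_value)` tuples (this is not the server's actual output format):

```python
import heapq


def top_templates(hits, limit=20):
    """Return the `limit` hits with the smallest E-value, best first.

    hits: iterable of (template_id, e_value) tuples (hypothetical format).
    """
    return heapq.nsmallest(limit, hits, key=lambda hit: hit[1])
```

Only the hits returned by such a selection would then be passed on to the more expensive HHsearch alignment.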
. Here, we show the steps that we took for ColabFold to produce accurate protein complex predictions.
and the other is based on the manipulation of residue index in the original AlphaFold2 model. Baek et al.
show that RoseTTAFold is able to model complexes despite being trained only on single chains. This is done by providing a paired alignment and modifying the residue index. The residue index is used as an input to the models to compute positional embedding. In AlphaFold2 we find the same to be true, although surprisingly the paired alignment is often not needed (Fig. 2c). AlphaFold2 uses relative positional encoding with a cap at
, meaning that any pair of residues separated by 32 or more are given the same relative
positional encoding. By offsetting the residue index between two proteins to be
, AlphaFold2 treats them as separate polypeptide chains. ColabFold integrates this for modeling complexes.
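The effect of the residue index offset can be illustrated with a short sketch. The function names are hypothetical, and the gap of 100 used in the usage note is an arbitrary illustrative value larger than the cap, not a value taken from the text; only the cap of 32 follows from the description above.

```python
def offset_residue_index(chain_lengths, chain_gap):
    """Build one residue index for several chains, separated by chain_gap."""
    index = []
    start = 0
    for length in chain_lengths:
        index.extend(range(start, start + length))
        start += length + chain_gap  # jump in the index marks a chain break
    return index


def clipped_relative_position(residue_index, cap=32):
    """Pairwise index differences, clipped to the interval [-cap, cap]."""
    return [[max(-cap, min(cap, j - i)) for j in residue_index]
            for i in residue_index]
```

For two chains of lengths 3 and 2, `offset_residue_index([3, 2], 100)` gives `[0, 1, 2, 103, 104]`. After clipping, every pair of residues from different chains sits at the cap, so such pairs are indistinguishable from residues that are far apart in sequence, whereas a contiguous index would make the last residue of the first chain and the first residue of the second chain look like direct neighbors.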
to pair sequences according to their taxonomic identifier. For the pairing we search each distinct sequence of a complex against the UniRef100 using the same procedure as described in section 2.2.1. We return only hits that cover all complex proteins within one species and pair only the best hit (smallest E-value) with an alignment that covers the query to at least
. The pairing is implemented in the new MMseqs2 module pairaln.
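The pairing rule can be sketched as follows. This is an illustrative reimplementation and not the pairaln module: the hit dictionaries with keys `id`, `taxid`, `evalue` and `qcov` are a hypothetical format, and the coverage threshold is left as a parameter.

```python
def pair_by_taxon(hits_per_chain, min_query_cov):
    """Pair the best hit per taxon across all chains of a complex.

    hits_per_chain: one list per chain of dicts with the hypothetical keys
    'id', 'taxid', 'evalue' and 'qcov' (query coverage between 0 and 1).
    Returns one tuple of hit ids per taxon that is present in every chain.
    """
    best_per_chain = []
    for hits in hits_per_chain:
        best = {}
        for hit in hits:
            if hit["qcov"] < min_query_cov:
                continue  # alignment does not cover enough of the query
            current = best.get(hit["taxid"])
            if current is None or hit["evalue"] < current["evalue"]:
                best[hit["taxid"]] = hit  # keep the smallest E-value per taxon
        best_per_chain.append(best)
    if not best_per_chain:
        return []
    shared = set(best_per_chain[0])
    for best in best_per_chain[1:]:
        shared &= set(best)  # a taxon must cover all chains of the complex
    return [tuple(best[taxid]["id"] for best in best_per_chain)
            for taxid in sorted(shared)]
```

Each returned tuple corresponds to one row of a paired alignment, with one sequence per chain from the same taxon.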
to pair sequences based on their distances in the genome as predicted from the UniProt accession numbers.
. The taxonomic labels are extracted from the lowest common ancestor field ('common taxon ID') of each UniRef100 sequence from the uniref100.xml (2021_03) file.
to optimize the model for specific MSA or template input sizes. When no templates are provided, we compile once and, during inference, replace the weights from the other models, using the configuration of model 5. This saves 7 min of compile time. When templates are enabled, model 1 is compiled and weights from model 2 are used, model 3 is compiled and weights from models 4 and 5 are used. This saves 5 min of compile time. If the user changes the sequence or settings without changing the length or number of sequences in the MSA, the compiled models are reused without triggering recompilation.
(by default). All sequences that lie within the query length and an additional
margin are not required to be recompiled, resulting in a large speed-up for short proteins.
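The recompilation saving can be sketched as a simple planning step. The function name is hypothetical, processing the queries in order of increasing length is an assumption, and the margin is a parameter; the 20% used in the usage note is an arbitrary illustrative value.

```python
def plan_compilations(lengths, margin_percent):
    """Decide for each query whether a new compilation would be needed.

    Queries are processed by increasing length (an assumption of this sketch).
    A model compiled for a padded length is reused for every query that fits.
    Returns a list of (length, padded_length, recompiled) tuples.
    """
    plan = []
    padded = 0
    for length in sorted(lengths):
        recompiled = length > padded
        if recompiled:
            # Pad by the margin, rounded up, using integer arithmetic only.
            padded = length + (length * margin_percent + 99) // 100
        plan.append((length, padded, recompiled))
    return plan
```

With `plan_compilations([50, 52, 55, 56, 100], 20)` only the queries of length 50 and 100 trigger a compilation; the three queries in between fit into the padded length of 60.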
.
compatible module for displaying the 3D ribbon diagram of a protein structure or complex. The ribbon can be colored by residue index (N to C terminus) or by a predicted confidence metric (such as pLDDT). For complexes, each protein can be colored by chain ID. Instead of a 3D renderer, we use a technique based on 2D line plotting. The lines that make up the ribbon are plotted in the order in which they appear along the z-axis, and we shade the lines according to their position along the z-axis. This creates the illusion of a 3D rendered graphic. The advantage over a 3D renderer is that the images are very lightweight, can be used in animations and can be saved as vector graphics for lossless inclusion in documents. Given that the 2D renderer is not interactive, we additionally included a 3D visualization option using py3Dmol
in the ColabFold notebooks.
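The depth ordering and shading behind the 2D ribbon can be sketched without any plotting library. The function name is hypothetical, and two conventions are assumptions of this sketch: a larger z means closer to the viewer, and a shade of 1.0 is the brightest.

```python
def ribbon_draw_plan(coords):
    """Order backbone segments from far to near and assign a depth shade.

    coords: list of (x, y, z) points along the backbone.
    Returns a list of (segment_index, shade), where segment i joins point i
    and point i + 1 and shade runs from 0.0 (farthest) to 1.0 (nearest).
    """
    depths = [(coords[i][2] + coords[i + 1][2]) / 2
              for i in range(len(coords) - 1)]
    if not depths:
        return []
    z_min, z_max = min(depths), max(depths)
    span = (z_max - z_min) or 1.0  # avoid division by zero for flat input
    order = sorted(range(len(depths)), key=lambda i: depths[i])
    return [(i, (depths[i] - z_min) / span) for i in order]
```

Drawing the segments in the returned order with any 2D line plotting tool lets nearer segments overdraw farther ones, which is what produces the 3D impression.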
) targets. ColabFold-AlphaFold2 (commit 2b49880) used UniRef30 (2021_03) (ref.
) and the BFD/MGnify or ColabFoldDB. ColabFold-RoseTTAFold (commit ae2b519) was executed with papermill (https://github.com/nteract/papermill) using the PyRosetta protocol
. ColabFold-RoseTTAFold-BFD/MGnify and ColabFold-AlphaFold2-BFD/MGnify used the same MSAs. AlphaFold-Colab used the UniRef90 (2021_03), MGnify (2019_05) and the small BFD. AlphaFold2 used the full_dbs preset and default databases downloaded with the download_all_data.sh script. The 65 targets contain 91 domains, among these are 20 free-modeling targets with 28 domains. We compared the predictions against the experimental structures using TMalign (downloaded on 24 February 2021) (ref.
).
core Intel E5-2680v4 central processing units (CPUs) and 768 GB RAM. Each generated MSA was processed by a single CPU core. Run times were computed from server logs.
core Intel E5-2680v4 CPUs and 768 GB RAM system. The AlphaFold2 databases were stored on a software-RAID5 as implemented in Linux (mdadm) composed of six Samsung 970 EVO Plus 1 TB NVMe drives. Run times for AlphaFold2 were taken from the features entry of the timings.json file. For a fair comparison, AlphaFold2 was modified to allow HMMer and HHblits to access one CPU core.
core Intel Gold 6242 CPUs with 192 GB RAM and 4 x Nvidia Quadro RTX5000 GPUs. Only one GPU was used in each run.
targets to their native structures using DockQ (commit 3735c16) (ref.
). We used colabfold_batch (commit 45ad0e9) with BFD/MGnify in residue index manipulation and AlphaFold-multimer mode to predict structures. We use MSA pairing as described in section 2.4.2 and also add unpaired sequences. Models are ranked by predicted interface TM-score as returned by AlphaFold-multimer. The DockQ AlphaFold-multimer reference numbers were provided by R. Evans.
have an average pLDDT of 90.68, 90.22 and 89.33, respectively, for 50 randomly sampled proteins. These are the same proteins that were used to extrapolate the run time of AlphaFold2. Over all predictions, the pLDDTs for the M. jannaschii proteome downloaded from the AlphaFoldDB, ColabFold default and ColabFold Stop
are 89.75, 89.38 and 88.77, respectively.
Data availability
Code availability
References
- Kluyver, T. et al. Jupyter Notebooks: a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas (eds Loizides, F. & Schmidt, B.) 87-90 (IOS Press, 2016).
- Eastman, P. et al. OpenMM7: rapid development of high performance algorithms for molecular dynamics. PLoS Comput. Biol. 13, e1005659 (2017).
- Gal, Y. & Ghahramani, Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. Preprint at https://doi.org/10.48550/arxiv.1506.02142 (2016).
- Krivov, G. G., Shapovalov, M. V. & Dunbrack Jr, R. L. Improved prediction of protein side-chain conformations with SCWRL4. Proteins 77, 778-795 (2009).
- Chaudhury, S., Lyskov, S. & Gray, J. J. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689-691 (2010).
- Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926-932 (2015).
- Federhen, S. The NCBI Taxonomy database. Nucleic Acids Res. 40, D136-D143 (2012).
- Bradbury, J. et al. JAX: composable transformations of Python+NumPy programs. Github https://github.com/google/jax (2018).
- Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90-95 (2007).
- Rego, N. & Koes, D. 3Dmol.js: molecular visualization with WebGL. Bioinformatics 31, 1322-1324 (2015).
- Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170-D176 (2017).
- Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302-2309 (2005).
- Basu, S. & Wallner, B. DockQ: a quality measure for protein-protein docking models. PLoS One 11, e0161879 (2016).
Acknowledgements
Author contributions
Funding
Competing interests
Additional information
Correspondence and requests for materials should be addressed to Milot Mirdita, Sergey Ovchinnikov or Martin Steinegger.
Peer review information Nature Methods thanks David Jones and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh in collaboration with the Nature Methods team. Peer reviewer reports are available.
Reprints and permissions information is available at www.nature.com/reprints.
nature research
Corresponding author(s): Steinegger
Reporting Summary
Statistics
n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
X A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
X The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.
X
A description of all covariates tested
X A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.
Software and code
Policy information about availability of computer code
Data collection
Data analysis
, ggplot/3.3.5, cowplot/1.1.1, lubridate/1.7.10. ColabFold generated plots were made using matplotlib/3.1.3. TM-score analysis was done with TMalign/2021/02/24 and DockQ/3735c16.
Data was generated with the following software: MMseqs2 (GitHub commit edb822), ColabFold (GitHub commit 45ad0e9), RoseTTAFold (GitHub commit fcf9125), HHblits v3.3.0 and AlphaFold2 v2.1.1.
Data
Policy information about availability of data
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability
Code Availability
A locally installable version is available at github.com/YoshitakaMo/localcolabfold.
The ColabFold development version shown in this manuscript is available at github.com/konstin/ColabFold.
The ColabFold server components are free open-source software (GPLv3) and available at github.com/soedinglab/mmseqs2-app.
MMseqs2 is free open-source software (GPLv3) and available at mmseqs.com.
Data Availability
MSAs and structures produced during benchmarking:
wwwuser.gwdg.de/
Input databases used for building ColabFold databases:
UniRef30: uniclust.mmseqs.com
BFD: bfd.mmseqs.com
MGnify: ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2019_05
PDB70: wwwuser.gwdg.de/
MetaEuk: wwwuser.gwdg.de/
SMAG: www.genoscope.cns.fr/tara/localdata/data/SMAGs-v1/SMAGs_v1_concat.faa.tar.gz
TOPAZ: osf.io/gm564
MGV: portal.nersc.gov/MGV/MGV_v1.0_2021_07_08/mgv_proteins.faa
GPD: ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/gut_phage_database/GPD_proteome.faa
Further datasets used for benchmarking ColabFold:
PFAM (Pfam-A.seed.gz & Pfam-A.full.gz): ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0
M. jannaschii proteome: uniprot.org/proteomes/UP000000805 ftp.ebi.ac.uk/pub/databases/alphafold/v1/UP000000805_243232_METJA_v1.tar
Field-specific reporting
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf
Life sciences study design
Sample size | ColabFold was evaluated on all CASP14 targets for single-chain predictions. For complex predictions, ColabFold was evaluated on the publicly available ClusPro dataset. We do not compute sample size since previously published standard benchmark sets are used. |
Data exclusions | No targets were excluded. |
Replication | Not applicable. ColabFold is exclusively a computational method. The computational method is deterministic (it gives the same result on every run) when run on the same computer setup, so replicates are not needed: the result would be identical for each replicate. |
Randomization | Not applicable. We are not comparing across groups. |
Blinding | Not applicable. We are not comparing across groups. |
Reporting for specific materials, systems and methods
Materials & experimental systems | Methods | ||
n/a | Involved in the study | n/a | Involved in the study |
 | Antibodies | | ChIP-seq |
Quantitative and Computational Biology, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany.
School of Biological Sciences, Seoul National University, Seoul, South Korea.
Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan.
Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Tokyo, Japan.
Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA.
JHDSF Program, Harvard University, Cambridge, MA, USA.
FAS Division of Science, Harvard University, Cambridge, MA, USA.
Artificial Intelligence Institute, Seoul National University, Seoul, South Korea.
Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea.
These authors contributed equally: Milot Mirdita, Sergey Ovchinnikov and Martin Steinegger.
e-mail: milot.mirdita@mpinat.mpg.de; so@fas.harvard.edu; martin.steinegger@snu.ac.kr
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.