unexpected poor performance of e2efold #5

@kad-ecoli

I have tested e2efold on a set of 361 PDB chains, where secondary structures for RNAs shorter than 600 nucleotides are predicted by e2efold_productive/e2efold_productive_short.py, while those longer than 600 nucleotides are predicted by e2efold_productive/e2efold_productive_long.py.

To my surprise, when evaluated against the DSSR-assigned canonical base pairs of this dataset, the *.ct files predicted by e2efold have very low average F1 and MCC of 0.2400 and 0.2401, respectively. These are substantially worse than the state-of-the-art methods listed in Table 2 of the e2efold paper (https://openreview.net/pdf?id=S1eALyrYDH). The following is my benchmark result, ranked in ascending order of F1 score.

| Method | F1 | MCC | Predicted base pairs per RNA |
| --- | --- | --- | --- |
| e2efold | 0.2400 | 0.2401 | 18.2133 |
| mfold | 0.6275 | 0.6285 | 32.4903 |
| RNAstructure (ProbablePair) | 0.6443 | 0.6475 | 29.4238 |
| CONTRAfold | 0.6617 | 0.6642 | 32.5845 |
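For reference, here is a minimal sketch of how a per-chain F1 and MCC can be computed from two sets of base pairs. It is an illustration only, not necessarily the exact script behind the numbers above. The function name `pair_metrics` is hypothetical, and the sketch assumes that true negatives are all remaining unordered position pairs in a chain of the given length.

```python
import math

def pair_metrics(predicted, native, length):
    """Return (F1, MCC) for predicted vs. native base pairs of one chain.

    predicted, native: iterables of (i, j) position pairs (order-insensitive).
    length: number of nucleotides; assumed to define the pool of possible
    pairs, length * (length - 1) / 2, for counting true negatives.
    """
    pred = {tuple(sorted(p)) for p in predicted}
    nat = {tuple(sorted(p)) for p in native}

    tp = len(pred & nat)   # pairs present in both
    fp = len(pred - nat)   # predicted but not native
    fn = len(nat - pred)   # native but missed
    tn = length * (length - 1) // 2 - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

# Toy example: 10-nt chain, two of three predicted pairs are correct.
f1, mcc = pair_metrics({(1, 10), (2, 9), (3, 8)},
                       {(1, 10), (2, 9), (4, 7)}, 10)
```

Because the number of possible pairs grows quadratically with length while true pairs grow roughly linearly, TN dominates and MCC ends up close to F1, which is consistent with the near-identical F1 and MCC columns in the table.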

I have attached the predicted ct files below. I have also included the 4 sequences listed under e2efold_productive/*_seqs/*seq, and I confirmed that my run generates ct files identical to those in the GitHub repository.
e2e.zip

Could you check whether I ran the e2efold program incorrectly, which might explain such low performance? In particular, could you check why e2efold predicts on average only 18.2133 base pairs per RNA chain, while the native structures contain on average 28.6648 canonical base pairs? Thank you.
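In case it helps with reproducing the base pair counts, below is a rough sketch of how pairs can be counted from a ct file. It assumes the common connectivity-table layout, where each nucleotide record has at least five whitespace-separated fields (index, base, previous index, next index, partner index, with 0 meaning unpaired). The helper name `read_ct_pairs` and the toy `EXAMPLE_CT` string are made up for illustration. Lines that do not look like nucleotide records, such as a header, are skipped.

```python
def read_ct_pairs(ct_text):
    """Return (number of nucleotide records, set of (i, j) pairs with i < j)."""
    pairs = set()
    n_bases = 0
    for line in ct_text.splitlines():
        fields = line.split()
        # A nucleotide record needs >= 5 fields and a one-letter base symbol.
        if len(fields) < 5 or len(fields[1]) != 1 or not fields[1].isalpha():
            continue
        try:
            i, j = int(fields[0]), int(fields[4])
        except ValueError:
            continue  # not a nucleotide record
        n_bases += 1
        if j > i:  # each pair is listed twice; keep only the i < j copy
            pairs.add((i, j))
    return n_bases, pairs

# Toy 6-nt hairpin (hypothetical data): 1 pairs with 6, 2 pairs with 5.
EXAMPLE_CT = """6 example
1 G 0 2 6 1
2 C 1 3 5 2
3 A 2 4 0 3
4 A 3 5 0 4
5 G 4 6 2 5
6 C 5 0 1 6
"""
n_bases, pairs = read_ct_pairs(EXAMPLE_CT)
```

Averaging `len(pairs)` over all chains gives the "predicted base pairs per RNA" figure. Applying the same counting to the native structures keeps the two averages comparable.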
