Treffer: Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

Title:

Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

Authors:

Libbrecht MW; Department of Genome Sciences, University of Washington, Seattle, Washington., Bilmes JA; Department of Electrical Engineering, University of Washington, Seattle, Washington., Noble WS; Department of Genome Sciences, University of Washington, Seattle, Washington.; Department of Computer Science and Engineering, University of Washington, Seattle, Washington.

Source:

Proteins [Proteins] 2018 Apr; Vol. 86 (4), pp. 454-466. Date of Electronic Publication: 2018 Feb 01.

Publication Type:

Journal Article; Research Support, N.I.H., Extramural

Language:

English

Journal Info:

Publisher: Wiley-Liss Country of Publication: United States NLM ID: 8700181 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1097-0134 (Electronic) Linking ISSN: 08873585 NLM ISO Abbreviation: Proteins Subsets: MEDLINE

Imprint Name(s):

Publication: New York, NY : Wiley-Liss
Original Publication: New York : Alan R. Liss, c1986-

MeSH Terms:

Algorithms*, Proteins/*chemistry , Sequence Analysis, Protein/*methods, Cluster Analysis ; Proteomics/methods

References:

Genome Biol. 2016 Nov 15;17 (1):229. (PMID: 27846892)
BMC Bioinformatics. 2013 Aug 15;14:248. (PMID: 23945046)
Bioinformatics. 1998;14(9):783-8. (PMID: 9918948)
Bioinformation. 2010 Nov 27;5(6):234-9. (PMID: 21364823)
Bioinformatics. 2001 Mar;17(3):282-3. (PMID: 11294794)
Nature. 2012 Jun 13;486(7402):207-14. (PMID: 22699609)
Nucleic Acids Res. 2006 Mar 17;34(5):1571-80. (PMID: 16547200)
Protein Sci. 1992 Mar;1(3):409-17. (PMID: 1304348)
Bioinformatics. 2006 Jul 1;22(13):1658-9. (PMID: 16731699)
Science. 2007 Feb 16;315(5814):972-6. (PMID: 17218491)
Comput Appl Biosci. 1992 Oct;8(5):461-6. (PMID: 1422879)
Trends Genet. 2000 Jun;16(6):276-7. (PMID: 10827456)
J Mol Biol. 1990 Oct 5;215(3):403-10. (PMID: 2231712)
Bioinformatics. 2003 Aug 12;19(12 ):1589-91. (PMID: 12912846)
Nucleic Acids Res. 2002 Apr 1;30(7):1575-84. (PMID: 11917018)
Proc Natl Acad Sci U S A. 2004 Apr 27;101(17):6559-63. (PMID: 15087500)
Bioinformatics. 1998 Jun;14(5):423-9. (PMID: 9682055)
Cytometry A. 2007 Sep;71(9):724-36. (PMID: 17654650)
J Mol Biol. 1995 Apr 7;247(4):536-40. (PMID: 7723011)
Nucleic Acids Res. 2000 Jan 1;28(1):254-6. (PMID: 10592239)
Bioinformatics. 2000 May;16(5):451-7. (PMID: 10871267)
Bioinformatics. 2010 Oct 1;26(19):2460-1. (PMID: 20709691)
PLoS One. 2013;8(2):e55484. (PMID: 23393584)
Nucleic Acids Res. 1997 Sep 1;25(17):3389-402. (PMID: 9254694)

Grant Information:

R01 CA180777 United States CA NCI NIH HHS; U41 HG007000 United States HG NHGRI NIH HHS

Contributed Indexing:

Keywords: discrete optimization; diversity; protein sequence analysis; redundancy; representative subsets; submodular maximization

Substance Nomenclature:

0 (Proteins)

Entry Date(s):

Date Created: 20180119 Date Completed: 20190204 Latest Revision: 20190401

Update Code:

20250114

PubMed Central ID:

PMC5835207

DOI:

10.1002/prot.25461

PMID:

29345009

Database:

MEDLINE

Weitere Informationen

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
(© 2018 Wiley Periodicals, Inc.)

Treffer: Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

Weitere Informationen

Links

Zusatz-Funktionen