Treffer: Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

Title:
Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.
Authors:
Libbrecht MW; Department of Genome Sciences, University of Washington, Seattle, Washington., Bilmes JA; Department of Electrical Engineering, University of Washington, Seattle, Washington., Noble WS; Department of Genome Sciences, University of Washington, Seattle, Washington.; Department of Computer Science and Engineering, University of Washington, Seattle, Washington.
Source:
Proteins [Proteins] 2018 Apr; Vol. 86 (4), pp. 454-466. Date of Electronic Publication: 2018 Feb 01.
Publication Type:
Journal Article; Research Support, N.I.H., Extramural
Language:
English
Journal Info:
Publisher: Wiley-Liss Country of Publication: United States NLM ID: 8700181 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1097-0134 (Electronic) Linking ISSN: 08873585 NLM ISO Abbreviation: Proteins Subsets: MEDLINE
Imprint Name(s):
Publication: New York, NY : Wiley-Liss
Original Publication: New York : Alan R. Liss, c1986-
References:
Genome Biol. 2016 Nov 15;17 (1):229. (PMID: 27846892)
BMC Bioinformatics. 2013 Aug 15;14:248. (PMID: 23945046)
Bioinformatics. 1998;14(9):783-8. (PMID: 9918948)
Bioinformation. 2010 Nov 27;5(6):234-9. (PMID: 21364823)
Bioinformatics. 2001 Mar;17(3):282-3. (PMID: 11294794)
Nature. 2012 Jun 13;486(7402):207-14. (PMID: 22699609)
Nucleic Acids Res. 2006 Mar 17;34(5):1571-80. (PMID: 16547200)
Protein Sci. 1992 Mar;1(3):409-17. (PMID: 1304348)
Bioinformatics. 2006 Jul 1;22(13):1658-9. (PMID: 16731699)
Science. 2007 Feb 16;315(5814):972-6. (PMID: 17218491)
Comput Appl Biosci. 1992 Oct;8(5):461-6. (PMID: 1422879)
Trends Genet. 2000 Jun;16(6):276-7. (PMID: 10827456)
J Mol Biol. 1990 Oct 5;215(3):403-10. (PMID: 2231712)
Bioinformatics. 2003 Aug 12;19(12 ):1589-91. (PMID: 12912846)
Nucleic Acids Res. 2002 Apr 1;30(7):1575-84. (PMID: 11917018)
Proc Natl Acad Sci U S A. 2004 Apr 27;101(17):6559-63. (PMID: 15087500)
Bioinformatics. 1998 Jun;14(5):423-9. (PMID: 9682055)
Cytometry A. 2007 Sep;71(9):724-36. (PMID: 17654650)
J Mol Biol. 1995 Apr 7;247(4):536-40. (PMID: 7723011)
Nucleic Acids Res. 2000 Jan 1;28(1):254-6. (PMID: 10592239)
Bioinformatics. 2000 May;16(5):451-7. (PMID: 10871267)
Bioinformatics. 2010 Oct 1;26(19):2460-1. (PMID: 20709691)
PLoS One. 2013;8(2):e55484. (PMID: 23393584)
Nucleic Acids Res. 1997 Sep 1;25(17):3389-402. (PMID: 9254694)
Grant Information:
R01 CA180777 United States CA NCI NIH HHS; U41 HG007000 United States HG NHGRI NIH HHS
Contributed Indexing:
Keywords: discrete optimization; diversity; protein sequence analysis; redundancy; representative subsets; submodular maximization
Substance Nomenclature:
0 (Proteins)
Entry Date(s):
Date Created: 20180119 Date Completed: 20190204 Latest Revision: 20190401
Update Code:
20250114
PubMed Central ID:
PMC5835207
DOI:
10.1002/prot.25461
PMID:
29345009
Database:
MEDLINE

Weitere Informationen

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
(© 2018 Wiley Periodicals, Inc.)