PXD038367

PXD038367 is an original dataset announced via ProteomeXchange.

Dataset Summary

Title	Training data for generalized peakgroup scoring models
Description	The statistical validation of peptide and protein identifications in mass spectrometry proteomics is a critical step in the analytical workflow. This is particularly important in discovery experiments to ensure only confident identifications are accumulated for downstream analysis and biomarker consideration. However, the inherent nature of discovery proteomics experiments leads to scenarios where the search space inflates substantially due to the increased number of potential proteins that are being queried in each sample. In these cases, issues arise when machine learning algorithms that are trained on an experiment-specific basis cannot accurately distinguish between correct and incorrect identifications and struggle to accurately control the false discovery rate. Here, we propose an alternative validation algorithm trained on a curated external data set of 2.8 million extracted peakgroups that leverages advanced machine learning techniques to create a generalizable peakgroup scoring (GPS) method for data independent acquisition (DIA) mass spectrometry. By breaking the reliance on the experimental data at hand and instead training on a curated external dataset, GPS can confidently control the false discovery rate while increasing the number of identifications and providing more accurate quantification in different search space scenarios. To first test the performance of GPS in a standard experimental environment and to provide a benchmark against other methods, a novel spike-in data set with known varying concentrations was analyzed. When compared to existing methods, GPS increased the number of identifications by 5-18% and was able to provide more accurate quantification by increasing the number of ratio-validated identifications by 24-74%.
To evaluate GPS in a larger search space, a novel data set of 141 blood plasma samples from patients developing acute kidney injury after sepsis was searched with a human tissue spectral library (10000+ proteins). Using GPS, we were able to provide a 207-377% increase in the number of candidate differentially abundant proteins compared to the existing methods while maintaining competitive numbers of global identifications. Finally, using an optimized human tissue library and workflow, we were able to identify 1205 proteins from the 141 plasma samples and increase the number of candidate differentially abundant proteins by 70.87%. With the addition of machine-learning-aided differential expression, we were able to identify potential new biomarkers for stratifying subphenotypes of acute kidney injury in sepsis. These findings suggest that by using a generalized model such as GPS in tandem with a massive-scale spectral library it is possible to expand the boundaries of discovery experiments in DIA proteomics. GPS is open source and freely available on GitHub at https://github.com/InfectionMedicineProteomics/gps
HostingRepository	PRIDE
AnnounceDate	2024-10-22
AnnouncementXML	Submission_2024-10-22_05:57:02.291.xml
DigitalObjectIdentifier
ReviewLevel	Peer-reviewed dataset
DatasetOrigin	Original dataset
RepositorySupport	Unsupported dataset by repository
PrimarySubmitter	Aaron Scott
SpeciesList	scientific name: Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast); NCBI TaxID: 559292;
ModificationList	iodoacetamide derivatized residue
Instrument	Q Exactive HF-X

Dataset History

Revision	Datetime	Status	ChangeLog Entry
0	2022-11-25 07:10:02	ID requested
1	2023-07-20 12:47:31	announced
2	2023-11-14 09:04:05	announced	2023-11-14: Updated project metadata.
3	2024-10-22 05:57:07	announced	2024-10-22: Updated project metadata.

Publication List

Scott AM, Karlsson C, Mohanty T, Hartman E, Vaara ST, Linder A, Malmström J, Malmström L. Generalized precursor prediction boosts identification rates and accuracy in mass spectrometry based proteomics. Commun Biol, 6(1):628 (2023) [pubmed]
DOI: 10.1038/s42003-023-04977-x

Keyword List

submitter keyword: machine learning, software, dia, proteome

Contact List

Lars Malmström
contact affiliation	Lund University, Faculty of Medicine, Department of Clinical Sciences Lund, Division of Infection Medicine, Lund, Sweden
contact email	lars.malmstrom@med.lu.se
lab head
Aaron Scott
contact affiliation	Faculty of Medicine, Lund University
contact email	aaron.scott@med.lu.se
dataset submitter

Full Dataset Link List

Dataset FTP location
NOTE: Most web browsers have now discontinued native support for FTP access within the browser window. But you can usually install another FTP app (we recommend FileZilla) and configure your browser to launch the external application when you click on this FTP link. Or otherwise, launch an app that supports FTP (like FileZilla) and use this address: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2023/07/PXD038367
PRIDE project URI
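
As an alternative to a graphical FTP client, the dataset directory can also be listed from a script. The following is a minimal sketch using only the Python standard library. The helper names `split_ftp_url` and `list_dataset_files` are hypothetical, and the sketch assumes the server accepts anonymous logins and that the address given above is still valid.

```python
from ftplib import FTP
from urllib.parse import urlparse

# FTP address quoted in the dataset record above.
DATASET_URL = "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2023/07/PXD038367"


def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory path)."""
    parts = urlparse(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError(f"not an FTP URL: {url!r}")
    return parts.hostname, parts.path


def list_dataset_files(url=DATASET_URL, timeout=30):
    """Return the entry names in the dataset directory (needs network access)."""
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()    # anonymous login; assumed to be accepted by the server
        ftp.cwd(path)  # change into the dataset directory
        return ftp.nlst()
```

Calling `list_dataset_files()` on a machine with network access should return the names of the entries in the dataset directory; individual files could then be fetched with `FTP.retrbinary`.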
