
h1. MS-GF\+

h5. {color:#800000}[(Download MS-GF+)|http://proteomics.ucsd.edu/Software/MSGFPlus.html#Downloads]{color}

{color:#800000}[(How to migrate from MS-GFDB to MS-GF\+)|CCMStools:MS-GFDB]{color}

{color:#800000}[ChangeLog|CCMStools:MS-GF+ ChangeLog]{color}


{code}
Usage: java -Xmx3500M -jar MSGFPlus.jar
-s SpectrumFile (*.mzML, *.mzXML, *.mgf, *.ms2, *.pkl or *_dta.txt)
Spectra should be centroided. Profile spectra will be ignored.
-d DatabaseFile (*.fasta or *.fa)
[-o OutputFile (*.mzid)] (Default: SpectrumFileName.mzid)
[-t PrecursorMassTolerance] (e.g. 2.5Da, 20ppm or 0.5Da,2.5Da, Default: 20ppm)
Use comma to set asymmetric values. E.g. "-t 0.5Da,2.5Da" will set 0.5Da to the minus (expMass<theoMass) and 2.5Da to plus (expMass>theoMass)
[-ti IsotopeErrorRange] (Range of allowed isotope peak errors, Default:0,1)
Takes into account of the error introduced by chooosing a non-monoisotopic peak for fragmentation.
On Windows, put the range inside "" (e.g. "0,1").
The combination of -t and -ti determins the precursor mass tolerance.
E.g. "-t 20ppm -ti -1,2" tests abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2.
[-thread NumThreads] (Number of concurrent threads to be executed, Default: Number of available cores)
[-tda 0/1] (0: don't search decoy database (Default), 1: search decoy database)
[-m FragmentMethodID] (0: As written in the spectrum or CID if no info (Default), 1: CID, 2: ETD, 3: HCD)
[-inst InstrumentID] (0: Low-res LCQ/LTQ (Default), 1: High-res LTQ, 2: TOF, 3: Q-Exactive)
[-e EnzymeID] (0: unspecific cleavage, 1: Trypsin (Default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: glutamyl endopeptidase, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: no cleavage)
[-protocol ProtocolID] (0: NoProtocol (Default), 1: Phosphorylation, 2: iTRAQ, 3: iTRAQPhospho)
[-ntt 0/1/2] (Number of Tolerable Termini, Default: 2)
E.g. For trypsin, 0: non-tryptic, 1: semi-tryptic, 2: fully-tryptic peptides only.
[-mod ModificationFileName] (Modification file, Default: standard amino acids with fixed C+57)
[-minLength MinPepLength] (Minimum peptide length to consider, Default: 6)
[-maxLength MaxPepLength] (Maximum peptide length to consider, Default: 40)
[-minCharge MinCharge] (Minimum precursor charge to consider if charges are not specified in the spectrum file, Default: 2)
[-maxCharge MaxCharge] (Maximum precursor charge to consider if charges are not specified in the spectrum file, Default: 3)
[-n NumMatchesPerSpec] (Number of matches per spectrum to be reported, Default: 1)
[-addFeatures 0/1] (0: output basic scores only (Default), 1: output additional features)
Example (high-precision): java -Xmx3500M -jar MSGFPlus.jar -s test.mzXML -d IPI_human_3.79.fasta -t 20ppm -ti "-1,2" -ntt 0 -tda 1 -o testMSGFPlus.mzid
Example (low-precision): java -Xmx3500M -jar MSGFPlus.jar -s test.mzXML -d IPI_human_3.79.fasta -t 0.5Da,2.5Da -ntt 0 -tda 1 -o testMSGFPlus.mzid
{code}
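
The high-precision example above can also be assembled from a script. The following is a minimal sketch in Python; the helper name build_msgfplus_command and its keyword arguments are hypothetical, and only flags listed in the usage text are used. Passing the resulting list to subprocess.run (without a shell) should avoid having to quote the isotope range manually.

```python
def build_msgfplus_command(spectrum, database, output=None, tolerance="20ppm",
                           isotope_range="0,1", ntt=2, tda=0,
                           jar="MSGFPlus.jar", max_heap="3500M"):
    """Return an argument list for an MS-GF+ search (hypothetical helper)."""
    cmd = ["java", f"-Xmx{max_heap}", "-jar", jar,
           "-s", spectrum, "-d", database,
           "-t", tolerance, "-ti", isotope_range,
           "-ntt", str(ntt), "-tda", str(tda)]
    if output is not None:
        # -o is optional; MS-GF+ derives a default name when it is omitted.
        cmd += ["-o", output]
    return cmd


cmd = build_msgfplus_command("test.mzXML", "IPI_human_3.79.fasta",
                             output="testMSGFPlus.mzid",
                             isotope_range="-1,2", ntt=0, tda=1)
print(" ".join(cmd))
```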

h5. Parameters:

* *\-s SpectrumFile* (\*.mzML, \*.mzXML, \*.mgf, \*.ms2, \*.pkl or \*_dta.txt) - Required
** Spectrum file name. Currently, MS-GF\+ supports the following file formats: mzML, mzXML, mgf, ms2, pkl and \_dta.txt.
** {color:#0000ff}We recommend using mzML whenever possible.{color}

* *\-d DatabaseFile* (\*.fasta or \*.fa) - Required
** Path to the protein database file. If the database file does not have auxiliary index files (\*.canno, \*.cnlcp, \*.csarr, and \*.cseq), MS-GF\+ will create them.
** When the "-tda 1" option is used, the database specified here must contain only target protein sequences.

{note}
If multiple MS-GF\+ processes access the same database file, it is strongly recommended to index the database prior to the database search by running BuildSA (see below).
{note}

* *\-o OutputFile* (*.mzid)
** Filename where the output (mzIdentML 1.1 format) will be written.
** File extension must be "mzid" (case sensitive).
** By default, the output file name will be "\[SpectrumFileName\].mzid".
** E.g. for the input spectrum file "test.mzML", the output will be written to "test.mzid" if this parameter is not specified.

* *\-t ParentMassTolerance* (Default: 20ppm)
** Parent (precursor) mass tolerance in Da or ppm. There must be no space between the number and the unit, e.g. 2.5Da or 20ppm.
** To set asymmetric tolerances, use a comma to separate the left (experimental mass < theoretical mass) and right (experimental mass > theoretical mass) tolerances, e.g. 0.5Da,2.5Da.
** A tight tolerance is recommended over a loose one (e.g. for Orbitrap data, 10 or 20ppm usually identifies more spectra than 50ppm).
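
To make the \-t syntax concrete, here is a small Python sketch that splits a tolerance string into its left and right parts. The function name parse_tolerance is hypothetical, and the sketch assumes the units are written exactly as "Da" or "ppm"; it is not how MS-GF\+ itself parses the value.

```python
import re

# A number immediately followed by its unit, with no space in between.
_TOLERANCE = re.compile(r"^(\d+(?:\.\d+)?)(Da|ppm)$")


def parse_tolerance(text):
    """Return ((left_value, unit), (right_value, unit)) for a -t value."""
    parts = text.split(",")
    if len(parts) not in (1, 2):
        raise ValueError(f"expected one or two tolerances, got: {text!r}")
    parsed = []
    for part in parts:
        match = _TOLERANCE.match(part)
        if match is None:
            raise ValueError(f"invalid tolerance: {part!r}")
        parsed.append((float(match.group(1)), match.group(2)))
    # A single value is symmetric: it applies to both sides.
    return parsed[0], parsed[-1]


print(parse_tolerance("20ppm"))
print(parse_tolerance("0.5Da,2.5Da"))
```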

* *\-ti IsotopeErrorRange*&nbsp;(Default: 0,1)
** Accounts for the error introduced by choosing a non-monoisotopic peak for fragmentation.
** If the parent mass tolerance is equal to or larger than 0.5Da or 500ppm, this parameter will be ignored.
** The combination of \-t and \-ti determines the precursor mass tolerance.
** E.g. "-t 20ppm \-ti \-1,2" tests abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2.
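
The combined \-t / \-ti test can be written out as a short Python sketch. It assumes the ppm error is taken relative to the calculated mass, which the text above does not state, and it does not implement the rule that \-ti is ignored for tolerances of 0.5Da or 500ppm and above. The function name is hypothetical.

```python
ISOTOPE_SPACING_DA = 1.00335  # mass offset per isotope step, from the example above


def within_precursor_tolerance(exp_mass, calc_mass, tol_ppm, iso_min, iso_max):
    """True if some isotope error n in [iso_min, iso_max] explains the mass difference."""
    for n in range(iso_min, iso_max + 1):
        error_da = abs(exp_mass - calc_mass - n * ISOTOPE_SPACING_DA)
        # Assumption: ppm is computed relative to the calculated mass.
        if error_da / calc_mass * 1e6 < tol_ppm:
            return True
    return False


# The precursor was picked one isotope peak too high, plus a 0.005 Da error.
print(within_precursor_tolerance(1001.00835, 1000.0, 20.0, -1, 2))
# Without isotope correction, the same pair is rejected.
print(within_precursor_tolerance(1001.00835, 1000.0, 20.0, 0, 0))
```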


* *\-thread NumOfThreads* (Number of concurrent threads to be executed, Default: Number of available cores)
** Number of concurrent threads to be executed together.
** Default value is the number of available logical cores (e.g. 8 for quad-core processor with hyper-threading support).

* *\-tda 0/1* (0: don't search decoy database (default), 1: search decoy database to compute FDR)
** Indicates whether to search the decoy database or not.
** If 0, the decoy database is not searched.
** If 1, FDRs are computed based on the target-decoy approach (i.e. reversed database is appended to the target database and MS-GF\+ searches the combined database)
*** FDR(t) = #(DecoyPSMs with score equal to or above t) / #(TargetPSMs with score equal to or above t).
*** PSM: Peptide-Spectrum Match
*** \-log(SpecProb) is used as the score to compute FDR.

{note}
If \-tda 1 is specified, MS-GF\+ automatically creates a combined target/reversed database file (DBFileName.revConcat.fasta). Thus, when specifying "-d" parameter, DatabaseFile must contain only target proteins.
{note}
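
The FDR(t) formula given for "-tda 1" can be illustrated with a few lines of Python. This is only a sketch of the formula: the scores are assumed to be \-log(SpecProb) values where higher is better, the function name is hypothetical, and returning 0.0 when no target PSM passes is a convention chosen here.

```python
def fdr_at_threshold(target_scores, decoy_scores, t):
    """FDR(t) = #(decoy PSMs with score >= t) / #(target PSMs with score >= t)."""
    n_target = sum(1 for s in target_scores if s >= t)
    n_decoy = sum(1 for s in decoy_scores if s >= t)
    if n_target == 0:
        return 0.0  # convention of this sketch: nothing is reported at this threshold
    return n_decoy / n_target


# Made-up scores for illustration only.
targets = [12.0, 10.5, 9.0, 8.0]
decoys = [8.5, 6.0]
print(fdr_at_threshold(targets, decoys, 8.0))
print(fdr_at_threshold(targets, decoys, 9.0))
```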

* *\-m FragmentationMethodID* (0: as written in the spectrum or CID if no info (Default), 1: CID, 2: ETD, 3: HCD, 4: Merge spectra from the same precursor)
** Fragmentation method identifier (used to determine the scoring model).
** If the identifier is 0 and fragmentation method is written in the spectrum file (e.g. mzML files), MS-GF\+ will recognize the fragmentation method and use a relevant scoring model.
** If the identifier is 0 and there is no fragmentation method information in the spectrum (e.g. mgf files), CID model will be used by default.
** If the identifier is non-zero and the spectrum has fragmentation method information, only the spectra that match with the identifier will be processed.
** If the identifier is non-zero and the spectrum has no fragmentation method information, MS-GF\+ will process all spectra assuming the specified fragmentation method.
** If the identifier is 4, MS/MS spectra from the same precursor ion (e.g. CID/ETD pairs, CID/HCD/ETD triplets) will be merged and the "merged" spectrum will be used for searching instead of individual spectra. See Kim et al., MCP 2010 for details.
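
The rules for identifiers 0 to 3 can be summarized as a small decision function. This Python sketch only restates the bullets above; the function name and the use of None for "spectrum is skipped" are illustrative choices, and the merge mode (identifier 4) is left out.

```python
METHODS = {1: "CID", 2: "ETD", 3: "HCD"}


def scoring_method(method_id, spectrum_method=None):
    """Return the fragmentation model used for a spectrum, or None if it is skipped.

    spectrum_method is the method written in the spectrum file, or None if absent.
    """
    if method_id not in (0, 1, 2, 3):
        raise ValueError("this sketch covers -m 0 to 3 only")
    if method_id == 0:
        # Use the method from the file, falling back to CID when none is given.
        return spectrum_method if spectrum_method is not None else "CID"
    requested = METHODS[method_id]
    if spectrum_method is None:
        # No information in the file: assume the requested method.
        return requested
    # Information present: only matching spectra are processed.
    return requested if spectrum_method == requested else None


print(scoring_method(0, "ETD"), scoring_method(0), scoring_method(3, "CID"))
```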

* *\-inst InstrumentID* (0: Low-res LCQ/LTQ (Default for CID and ETD), 1: High-res LTQ (Default for HCD), 2: TOF, 3: Q-Exactive)
** Identifier of the instrument used to generate the MS/MS spectra (used to determine the scoring model).
** For "hybrid" spectra with high-precision MS1 and low-precision MS2, use 0.
** For usual low-precision instruments (e.g. Thermo LTQ), use 0.
** If MS/MS fragment ion peaks are of high-precision (e.g. tolerance = 10ppm), use 2.
** For TOF instruments, use 2.
** For Q-Exactive HCD spectra, use 3.
** For other HCD spectra, use 1.

* *\-e EnzymeID* (Default: 1)
** Enzyme identifier. Trypsin (1) will be used by default.
** 0: unspecific cleavage, 1: Trypsin (default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: glutamyl endopeptidase (Glu-C), 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: no cleavage
** Use 9 for peptidomics studies

* *\-protocol ProtocolID* (Default: 0)
** Protocol identifier. Protocols are used to enable scoring parameters for enriched and/or labeled samples.
** 0: No protocol (Default)
** 1: Phosphorylation: for phosphopeptide enriched samples
** 2: iTRAQ: for iTRAQ-labeled samples
** 3: iTRAQPhospho: for phosphopeptide enriched and iTRAQ-labeled samples

* *\-ntt 0/1/2* (Number of tolerable (tryptic) termini, Default: 2)
** This parameter is used to apply the enzyme cleavage specificity rule when searching the database.
** Specifies the minimum number of termini matching the enzyme specificity rule.
*** For example, for trypsin, K.ACDEFGHR.C (NTT=2), G.ACDEFGHR.C (NTT=1), K.ACDEFGHI.C (NTT=1) and G.ACDEFGHI.C (NTT=0).
*** '-ntt 2' will search for fully tryptic peptides only.
** By default, \-ntt 2 will be used. Using \-ntt 1 (or 0) will make the search significantly slower.
{note}
The meaning and the default value of this parameter changed after version 8442.
{note}
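
The NTT values in the trypsin example can be reproduced with a short Python sketch. It assumes a simplified trypsin rule (cleavage after K or R, no proline exception) and treats "-" as a protein terminus that always counts as matching; both are assumptions of this sketch, not a description of the MS-GF\+ implementation. Peptides annotated with modifications are not handled.

```python
TRYPSIN_RESIDUES = {"K", "R"}


def count_tryptic_termini(annotated):
    """Count enzyme-consistent termini of a peptide written as pre.PEPTIDE.post."""
    pre, peptide, post = annotated.split(".")
    # N-terminus: the residue before the peptide must be a cleavage site.
    n_term = pre in TRYPSIN_RESIDUES or pre == "-"
    # C-terminus: the last residue of the peptide must be a cleavage site.
    c_term = peptide[-1] in TRYPSIN_RESIDUES or post == "-"
    return int(n_term) + int(c_term)


def passes_ntt(annotated, ntt):
    """True if the peptide has at least ntt matching termini."""
    return count_tryptic_termini(annotated) >= ntt


for example in ("K.ACDEFGHR.C", "G.ACDEFGHR.C", "K.ACDEFGHI.C", "G.ACDEFGHI.C"):
    print(example, count_tryptic_termini(example))
```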

* *\-mod ModificationFile* (Default: standard amino acids with fixed C+57)
** Modification file name. ModificationFile contains the modifications to be considered in the search.
** If the \-mod option is not specified, standard amino acids with fixed carbamidomethylation of C will be used.
** [Download|^Mods.txt] an example modification file.

* *\-minLength MinPepLength* (Default: 6)
** Minimum length of the peptide to be considered.

* *\-maxLength MaxPepLength* (Default: 40)
** Maximum length of the peptide to be considered.

* *\-minCharge MinPrecursorCharge* (Default: 2)
** Minimum precursor charge to consider. This parameter is used only for spectra with no charge.

* *\-maxCharge MaxPrecursorCharge* (Default: 3)
** Maximum precursor charge to consider. This parameter is used only for spectra with no charge.

* *\-n NumMatchesPerSpec* (Default: 1)
** Number of peptide matches per spectrum to report.
** Expected false discovery rates (EFDRs) will be reported only when this value is 1.
* *\-addFeatures 0/1*
** If 0, only basic scores are reported.
** If 1, the following features are reported
*** MS2IonCurrent: Summed intensity of all product ions
*** ExplainedIonCurrentRatio: Summed intensity of all matched product ions (e.g. b, b-H2O, y, etc.) divided by MS2IonCurrent
*** NTermIonCurrentRatio: Summed intensity of all matched prefix ions (e.g. b, b-H2O, etc.) divided by MS2IonCurrent
*** CTermIonCurrentRatio: Summed intensity of all matched suffix ions (e.g. y, y-H2O, etc.) divided by MS2IonCurrent
* *\-showQValue 0/1*
** If 0, QValue and PepQValue are not reported.
** If 1, QValue (PSM-level Q-value) and PepQValue (peptide-level Q-value) are reported (Default).
** This parameter is ignored when "-tda 0".

h5. MS-GF\+ output

MS-GF\+ outputs results as an mzIdentML (version 1.1) file. See [http://www.psidev.info/mzidentml/] for details on the mzIdentML format. For every PSM, MS-GF\+ reports the following scores:


* *MS-GF:RawScore*: MS-GF\+ raw score of the peptide-spectrum match&nbsp;
* *MS-GF:DeNovoScore*: the score of the optimal scoring peptide for the spectrum (not necessarily in the database) (MS-GF:RawScore <= MS-GF:DeNovoScore)
* *MS-GF:SpecEValue*: spectral E-value (spectrum level E-value) of the peptide-spectrum match - the lower the better
* *MS-GF:EValue*: database level E-value (expected number of peptides in a random database having equal or better scores than the PSM score) - the lower the better
* *MS-GF:QValue*
** PSM-level Q-value estimated using the target-decoy approach.
** MS-GF:QValue is computed solely based on MS-GF:SpecEValue.
* *MS-GF:PepQValue*
** Peptide-level Q-value estimated using the target-decoy approach.
** Reported only if "-tda 1" is specified.
** If multiple spectra are matched to the same peptide, only the best scoring PSM (lowest SpecProb) is retained. MS-GF:PepQValue is then calculated among the retained PSMs as #(DecoyPSMs with score above s) / #(TargetPSMs with score above s), which approximates the Q-value of the set of unique peptides. In the MS-GF\+ output, the same PepQValue is given to all PSMs sharing the peptide, so even a low-quality PSM may get a low PepQValue if it has a high-quality "sibling" PSM sharing the peptide. For this reason, PepQValue should not be used to count the number of identified PSMs.
* {color:#3366ff}Using MzIDToTsv, one can convert MS-GF\+ output (\*.mzid) into the tsv format.{color}
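
The peptide-level filtering described for MS-GF:PepQValue can be sketched in Python as follows. The sketch keeps the best PSM per peptide and then applies the decoy/target ratio as written above; it counts PSMs with an equal or better SpecProb, applies no further smoothing, and uses made-up input values. The function name and the tuple layout are hypothetical.

```python
def peptide_level_ratio(psms):
    """psms: iterable of (peptide, spec_prob, is_decoy). Returns {peptide: ratio}."""
    # Step 1: keep only the best PSM (lowest SpecProb) for each peptide.
    best = {}
    for peptide, spec_prob, is_decoy in psms:
        if peptide not in best or spec_prob < best[peptide][0]:
            best[peptide] = (spec_prob, is_decoy)
    # Step 2: for each retained PSM, count retained decoys and targets scoring at least as well.
    result = {}
    for peptide, (p, _) in best.items():
        decoys = sum(1 for q, d in best.values() if d and q <= p)
        targets = sum(1 for q, d in best.values() if not d and q <= p)
        result[peptide] = decoys / targets if targets else 0.0
    return result


psms = [
    ("PEPA", 1e-12, False),
    ("PEPA", 1e-6, False),   # low-quality "sibling" of the PSM above
    ("PEPB", 1e-9, False),
    ("DECOY1", 1e-8, True),
    ("PEPC", 1e-7, False),
]
ratios = peptide_level_ratio(psms)
print(ratios["PEPA"], ratios["PEPC"])
```

Both PEPA PSMs would receive the value computed from the 1e-12 match, which is why the low-quality sibling ends up with a low PepQValue.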

h5. MS-GF\+ output example

[MzIdentML format|^test.mzid]&nbsp;

[TSV format|^test_Unrolled.tsv]&nbsp; (converted using MzIDToTsv)
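
For readers who want to read the scores directly from the mzid file, the following Python sketch collects all parameters whose name starts with "MS-GF:" from each SpectrumIdentificationItem element. It ignores XML namespaces and accepts both cvParam and userParam children, since this page does not say which is used for each score. The embedded sample is a hand-written fragment for illustration, not real MS-GF\+ output.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def read_psm_scores(mzid_text):
    """Return a list of (peptide_ref, {score_name: value_string}) tuples."""
    root = ET.fromstring(mzid_text)
    rows = []
    for elem in root.iter():
        if local_name(elem.tag) != "SpectrumIdentificationItem":
            continue
        scores = {}
        for child in elem:
            if local_name(child.tag) in ("cvParam", "userParam"):
                name = child.get("name", "")
                if name.startswith("MS-GF:"):
                    scores[name] = child.get("value")
        rows.append((elem.get("peptide_ref"), scores))
    return rows


# Hand-written fragment, heavily simplified compared to a real mzIdentML file.
SAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <SpectrumIdentificationResult spectrumID="index=0">
    <SpectrumIdentificationItem peptide_ref="Pep1" rank="1">
      <cvParam name="MS-GF:RawScore" value="100"/>
      <cvParam name="MS-GF:SpecEValue" value="1.0E-15"/>
    </SpectrumIdentificationItem>
  </SpectrumIdentificationResult>
</MzIdentML>"""

rows = read_psm_scores(SAMPLE)
print(rows)
```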

h1. MzIDToTsv

Converts MS-GF\+ output (*.mzid) into the tsv format (*.tsv)

{code}
Usage: java -Xmx3500M -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv
-i MzIDFile (MS-GF+ output file (*.mzid))
[-o TSVFile] (TSV output file (*.tsv) (Default: MzIDFileName.tsv))
[-showQValue 0/1] (0: do not show Q-values, 1: show Q-values (Default))
[-showDecoy 0/1] (0: do not show decoy PSMs (Default), 1: show decoy PSMs)
[-unroll 0/1] (0: merge shared peptides (Default), 1: unroll shared peptides)
{code}

{color:#003366}{*}Parameters:*{color}
* *\-i MzIDFile*
** Path to the MS-GF\+ result file (*.mzid)
* *\-o TSVFile*
** Path to the tsv output file (*.tsv)
** If not specified, for input MyFile.mzid, the output will be MyFile.tsv.
* *\-showQValue 0/1*
** If 0, QValue and PepQValue are not reported.
** If 1, QValue and PepQValue are reported (Default).
* *\-showDecoy 0/1*
** If 0, decoy PSMs will not be reported (Default).
** If 1, decoy PSMs will be reported.
* *\-unroll 0/1*
** This parameter controls the output format for shared peptides (peptides matched to multiple proteins).
** When "-unroll 0" (Default), a PSM matched to a shared peptide will be printed as a single line.
*** Peptide column does not show neighboring amino acids (e.g. IGAYLFVDMAHVAGLIAAGVYPNPVPHAHVVTSTTHK).
*** Protein column shows all proteins in a single line.
*** Example: MyProtein(pre=K,post=T);MyProteinIsoform(pre=K,post=T)
*** [Download example file|^test.tsv]
** When "-unroll 1", a PSM matched to a shared peptide will be printed in multiple lines.
*** Peptide column shows neighboring amino acids (e.g. K.IGAYLFVDMAHVAGLIAAGVYPNPVPHAHVVTSTTHK.T).
*** Different peptide-protein matches are printed in different lines.
*** [Download example file|^test_Unrolled.tsv]
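
The relationship between the two layouts can be illustrated with a Python sketch that expands a merged Protein column into one row per protein. It assumes the merged entries follow the pattern shown in the example above, and that an unrolled row carries the bare protein name in its Protein column; the latter is an assumption, as is the function name. Protein names containing ";" are not handled.

```python
import re

# One merged entry, e.g. "MyProtein(pre=K,post=T)".
_PROTEIN_ENTRY = re.compile(r"^(.+)\(pre=(.),post=(.)\)$")


def unroll(peptide, protein_column):
    """Expand a merged row into (annotated_peptide, protein) pairs."""
    rows = []
    for entry in protein_column.split(";"):
        match = _PROTEIN_ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unexpected protein entry: {entry!r}")
        protein, pre, post = match.groups()
        # Re-attach the neighboring amino acids around the peptide.
        rows.append((f"{pre}.{peptide}.{post}", protein))
    return rows


merged = "MyProtein(pre=K,post=T);MyProteinIsoform(pre=K,post=T)"
unrolled = unroll("IGAYLFVDMAHVAGLIAAGVYPNPVPHAHVVTSTTHK", merged)
for row in unrolled:
    print(row)
```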


h1. BuildSA

{color:#000000}Index a protein database for fast searching.&nbsp;{color}
{code}
Usage: java -Xmx3500M -cp MSGFPlus.jar edu.ucsd.msjava.msdbsearch.BuildSA 
-d DatabaseFile (*.fasta or *.fa)
[-tda 0/1/2] (0: target only, 1: target-decoy database only, 2: both)
{code}

{color:#003366}{*}Parameters:*{color}
* *\-d DbPath*
** Name of a protein database (*.fasta or \*.fa)&nbsp;
** The database file name must end with ".fasta" or ".fa".
* *\-tda 0/1/2*
** If 0, only "DatabaseFile" will be indexed.
** If 1, a new database file (\*.revConcat.fasta) will be generated by appending reversed proteins. This forward-reverse database will be indexed.
** If 2, both the original database and the forward-reverse database file will be indexed.

BuildSA creates a suffix array of the protein database. For an input database file DBFileName.fasta, BuildSA will generate 4 auxiliary files (DBFileName.canno, DBFileName.cnlcp, DBFileName.csarr, DBFileName.cseq). It needs to be executed only once per database file.
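
Before launching several searches against the same database, a script can check whether the four index files are already present. The following Python sketch does this by replacing the database extension with each index suffix, matching the file names listed above; the function name and the injectable exists argument (used to make the example runnable without real files) are hypothetical.

```python
from pathlib import Path

INDEX_SUFFIXES = (".canno", ".cnlcp", ".csarr", ".cseq")


def missing_index_files(database, exists=Path.exists):
    """Return the index files BuildSA would still need to create for a database."""
    db = Path(database)
    # DBFileName.fasta -> DBFileName.canno, DBFileName.cnlcp, ...
    candidates = [db.with_suffix(suffix) for suffix in INDEX_SUFFIXES]
    return [path for path in candidates if not exists(path)]


# Pretend that only two of the four index files are on disk.
on_disk = {".canno", ".cseq"}
missing = missing_index_files("DBFileName.fasta",
                              exists=lambda path: path.suffix in on_disk)
print([path.name for path in missing])
```

If the returned list is not empty, running BuildSA once before starting the searches would be the safer choice.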