MS-GF+

(Download MS-GF+)

(How to migrate from MS-GFDB to MS-GF+)

ChangeLog

Usage: java -Xmx3500M -jar MSGFPlus.jar
	-s SpectrumFile (*.mzML, *.mzXML, *.mgf, *.ms2, *.pkl or *_dta.txt)
	   Spectra should be centroided. Profile spectra will be ignored.
	-d DatabaseFile (*.fasta or *.fa)
	[-o OutputFile (*.mzid)] (Default: SpectrumFileName.mzid)
	[-t PrecursorMassTolerance] (e.g. 2.5Da, 20ppm or 0.5Da,2.5Da, Default: 20ppm)
	   Use comma to set asymmetric values. E.g. "-t 0.5Da,2.5Da" will set 0.5Da to the minus (expMass<theoMass) and 2.5Da to plus (expMass>theoMass)
	[-ti IsotopeErrorRange] (Range of allowed isotope peak errors, Default:0,1)
	   Accounts for the error introduced by choosing a non-monoisotopic peak for fragmentation.
	   On Windows, put the range inside "" (e.g. "0,1").
	   The combination of -t and -ti determines the precursor mass tolerance.
	   E.g. "-t 20ppm -ti -1,2" tests abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2.
	[-thread NumThreads] (Number of concurrent threads to be executed, Default: Number of available cores)
	[-tda 0/1] (0: don't search decoy database (Default), 1: search decoy database)
	[-m FragmentMethodID] (0: As written in the spectrum or CID if no info (Default), 1: CID, 2: ETD, 3: HCD)
	[-inst InstrumentID] (0: Low-res LCQ/LTQ (Default), 1: High-res LTQ, 2: TOF, 3: Q-Exactive)
	[-e EnzymeID] (0: unspecific cleavage, 1: Trypsin (Default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: glutamyl endopeptidase, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: no cleavage)
	[-protocol ProtocolID] (0: NoProtocol (Default), 1: Phosphorylation, 2: iTRAQ, 3: iTRAQPhospho)
	[-ntt 0/1/2] (Number of Tolerable Termini, Default: 2)
	   E.g. For trypsin, 0: non-tryptic, 1: semi-tryptic, 2: fully-tryptic peptides only.
	[-mod ModificationFileName] (Modification file, Default: standard amino acids with fixed C+57)
	[-minLength MinPepLength] (Minimum peptide length to consider, Default: 6)
	[-maxLength MaxPepLength] (Maximum peptide length to consider, Default: 40)
	[-minCharge MinCharge] (Minimum precursor charge to consider if charges are not specified in the spectrum file, Default: 2)
	[-maxCharge MaxCharge] (Maximum precursor charge to consider if charges are not specified in the spectrum file, Default: 3)
	[-n NumMatchesPerSpec] (Number of matches per spectrum to be reported, Default: 1)
	[-addFeatures 0/1] (0: output basic scores only (Default), 1: output additional features)
Example (high-precision): java -Xmx3500M -jar MSGFPlus.jar -s test.mzXML -d IPI_human_3.79.fasta -t 20ppm -ti "-1,2" -ntt 0 -tda 1 -o testMSGFPlus.mzid
Example (low-precision): java -Xmx3500M -jar MSGFPlus.jar -s test.mzXML -d IPI_human_3.79.fasta -t 0.5Da,2.5Da -ntt 0 -tda 1 -o testMSGFPlus.mzid
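
The way -t and -ti combine in the usage text above can be sketched in Python. This is only an illustration, not the MS-GF+ implementation: the function name within_tolerance is hypothetical, only a symmetric ppm tolerance is handled, and the ppm window is assumed to be taken relative to the calculated (theoretical) mass.

```python
ISOTOPE_SHIFT_DA = 1.00335  # mass shift per isotope error, as quoted in the usage text


def within_tolerance(exp_mass, calc_mass, tol_ppm=20.0, isotope_range=(0, 1)):
    """Return the first isotope error n that satisfies the tolerance, or None.

    Tests abs(exp - calc - n * 1.00335 Da) < tol_ppm for every n in the
    inclusive isotope_range, mirroring "-t 20ppm -ti 0,1".
    """
    low, high = isotope_range
    for n in range(low, high + 1):
        error_da = exp_mass - calc_mass - n * ISOTOPE_SHIFT_DA
        # Assumption: ppm is measured relative to the calculated mass.
        if abs(error_da) < tol_ppm * 1e-6 * calc_mass:
            return n
    return None
```

For example, within_tolerance(2001.0034, 2000.0) accepts the match with n = 1, while the same call with isotope_range=(0, 0) rejects it.
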
Parameters:
  • -s SpectrumFile (*.mzML, *.mzXML, *.mgf, *.ms2, *.pkl or *_dta.txt) - Required
    • Spectrum file name. Currently, MS-GF+ supports the following file formats: mzML, mzXML, mgf, ms2, pkl and _dta.txt.
    • We recommend using mzML whenever possible.
  • -d DatabaseFile (*.fasta or *.fa) - Required
    • Path to the protein database file. If the database file does not have auxiliary index files (*.canno, *.cnlcp, *.csarr, and *.cseq), MS-GF+ will create them.
    • When "-tda 1" option is used, the database specified here must contain only target protein sequences.
If multiple MS-GF+ processes access the same database file, it is strongly recommended to index the database prior to the database search by running BuildSA (see below).
  • -o OutputFile (*.mzid)
    • Filename where the output (mzIdentML 1.1 format) will be written.
    • File extension must be "mzid" (case sensitive).
    • By default, the output file name will be "[SpectrumFileName].mzid".
    • E.g. for the input spectrum file "test.mzML", the output will be written to "test.mzid" if this parameter is not specified.
  • -t ParentMassTolerance (Default: 20ppm)
    • Parent mass tolerance in Da or ppm. There must be no space between the number and the unit. E.g. 2.5Da, 20ppm
    • To set asymmetric tolerances, use a comma to separate the left (experimental mass < theoretical mass) and right (experimental mass > theoretical mass) tolerances. E.g. 0.5Da,2.5Da
    • It is recommended to use a tight tolerance rather than a loose tolerance (e.g. for Orbitrap data, 10 or 20ppm usually identifies more spectra than 50ppm).
  • -ti IsotopeErrorRange (Default: 0,1)
    • Accounts for the error introduced by choosing a non-monoisotopic peak for fragmentation.
    • If the parent mass tolerance is equal to or larger than 0.5Da or 500ppm, this parameter will be ignored.
    • The combination of -t and -ti determines the precursor mass tolerance.
    • E.g. "-t 20ppm -ti -1,2" tests abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2.
  • -thread NumOfThreads (Number of concurrent threads to be executed, Default: Number of available cores)
    • Number of concurrent threads to be executed together.
    • Default value is the number of available logical cores (e.g. 8 for quad-core processor with hyper-threading support).
  • -tda 0/1 (0: don't search decoy database (default), 1: search decoy database to compute FDR)
    • Indicates whether to search the decoy database or not.
    • If 0, the decoy database is not searched.
    • If 1, FDRs are computed based on the target-decoy approach (i.e. reversed database is appended to the target database and MS-GF+ searches the combined database)
      • FDR(t) = #(DecoyPSMs with score equal or above t) / #(TargetPSMs with score equal or above t).
      • PSM: Peptide-Spectrum Match
      • -log(SpecProb) is used as the score to compute FDR.
If -tda 1 is specified, MS-GF+ automatically creates a combined target/reversed database file (DBFileName.revConcat.fasta). Thus, when specifying "-d" parameter, DatabaseFile must contain only target proteins.
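
The FDR formula given above for "-tda 1" can be sketched in Python as follows. This is a minimal illustration, not MS-GF+ code: the function name and the (spec_prob, is_decoy) input layout are hypothetical, and the logarithm base (10 here) is an assumption that only rescales the threshold t.

```python
import math


def fdr_at_threshold(psms, t):
    """Estimate FDR(t) from an iterable of (spec_prob, is_decoy) pairs.

    The score of a PSM is -log(SpecProb); FDR(t) is the number of decoy PSMs
    with score >= t divided by the number of target PSMs with score >= t.
    """
    decoys = 0
    targets = 0
    for spec_prob, is_decoy in psms:
        score = -math.log10(spec_prob)  # assumption: base-10 logarithm
        if score >= t:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    # Assumption: report 0.0 when no target PSM passes the threshold.
    return decoys / targets if targets else 0.0
```

With three target PSMs and one decoy PSM scoring at or above t, the estimate is 1/3.
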
  • -m FragmentationMethodID (0: as written in the spectrum or CID if no info (Default), 1: CID, 2: ETD, 3: HCD, 4: Merge spectra from the same precursor)
    • Fragmentation method identifier (used to determine the scoring model).
    • If the identifier is 0 and fragmentation method is written in the spectrum file (e.g. mzML files), MS-GF+ will recognize the fragmentation method and use a relevant scoring model.
    • If the identifier is 0 and there is no fragmentation method information in the spectrum (e.g. mgf files), CID model will be used by default.
    • If the identifier is non-zero and the spectrum has fragmentation method information, only the spectra that match with the identifier will be processed.
    • If the identifier is non-zero and the spectrum has no fragmentation method information, MS-GF+ will process all spectra assuming the specified fragmentation method.
    • If the identifier is 4, MS/MS spectra from the same precursor ion (e.g. CID/ETD pairs, CID/HCD/ETD triplets) will be merged and the "merged" spectrum will be used for searching instead of individual spectra. See Kim et al., MCP 2010 for details.
  • -inst InstrumentID (0: Low-res LCQ/LTQ (Default for CID and ETD), 1: High-res LTQ (Default for HCD), 2: TOF, 3: Q-Exactive)
    • Identifier of the instrument to generate MS/MS spectra (used to determine the scoring model).
    • For "hybrid" spectra with high-precision MS1 and low-precision MS2, use 0.
    • For usual low-precision instruments (e.g. Thermo LTQ), use 0.
    • If MS/MS fragment ion peaks are of high-precision (e.g. tolerance = 10ppm), use 2.
    • For TOF instruments, use 2.
    • For Q-Exactive HCD spectra, use 3.
    • For other HCD spectra, use 1.
  • -e EnzymeID (Default: 1)
    • Enzyme identifier. Trypsin (1) will be used by default.
    • 0: unspecific cleavage, 1: Trypsin (default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: glutamyl endopeptidase (Glu-C), 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: no cleavage
    • Use 9 for peptidomics studies
  • -protocol ProtocolID (Default: 0)
    • Protocol identifier. Protocols are used to enable scoring parameters for enriched and/or labeled samples.
    • 0: No protocol (Default)
    • 1: Phosphorylation: for phosphopeptide enriched samples
    • 2: iTRAQ: for iTRAQ-labeled samples
    • 3: iTRAQPhospho: for phosphopeptide enriched and iTRAQ-labeled samples
  • -ntt 0/1/2 (Number of tolerable (tryptic) termini, Default: 2)
    • This parameter is used to apply the enzyme cleavage specificity rule when searching the database.
    • Specifies the minimum number of termini matching the enzyme specificity rule.
      • For example, for trypsin, K.ACDEFGHR.C (NTT=2), G.ACDEFGHR.C (NTT=1), K.ACDEFGHI.C (NTT=1) and G.ACDEFGHI.C (NTT=0).
      • '-ntt 2' will search for fully tryptic peptides only.
    • By default, -ntt 2 will be used. Using -ntt 1 (or 0) will make the search significantly slower.
      (The meaning and the default value of this parameter changed after version 8442.)
  • -mod ModificationFile (Default: standard amino acids with fixed C+57)
    • Modification file name. ModificationFile contains the modifications to be considered in the search.
    • If the -mod option is not specified, standard amino acids with fixed carbamidomethylation of C will be used.
    • Download an example modification file.
  • -minLength MinPepLength (Default: 6)
    • Minimum length of the peptide to be considered.
  • -maxLength MaxPepLength (Default: 40)
    • Maximum length of the peptide to be considered.
  • -minCharge MinPrecursorCharge (Default: 2)
    • Minimum precursor charge to consider. This parameter is used only for spectra with no charge.
  • -maxCharge MaxPrecursorCharge (Default: 3)
    • Maximum precursor charge to consider. This parameter is used only for spectra with no charge.
  • -n NumMatchesPerSpec (Default: 1)
    • Number of peptide matches per spectrum to report.
    • Expected false discovery rates (EFDRs) will be reported only when this value is 1.
  • -addFeatures 0/1
    • If 0, only basic scores are reported.
    • If 1, the following features are reported
      • MS2IonCurrent: Summed intensity of all product ions
      • ExplainedIonCurrentRatio: Summed intensity of all matched product ions (e.g. b, b-H2O, y, etc.) divided by MS2IonCurrent
      • NTermIonCurrentRatio: Summed intensity of all matched prefix ions (e.g. b, b-H2O, etc.) divided by MS2IonCurrent
      • CTermIonCurrentRatio: Summed intensity of all matched suffix ions (e.g. y, y-H2O, etc.) divided by MS2IonCurrent
  • -showQValue 0/1
    • If 0, QValue and PepQValue are not reported.
    • If 1, QValue (PSM-level Q-value) and PepQValue (peptide-level Q-value) are reported (Default).
    • This parameter is ignored when "-tda 0".
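
The termini count used by the -ntt parameter above can be sketched for trypsin in Python. This is an illustration under simplifying assumptions, not the MS-GF+ rule set: the function name is hypothetical, peptides are assumed to be written as pre.PEPTIDE.post, cleavage is assumed to occur after K or R only, and neither protein termini nor the proline exception is handled.

```python
TRYPSIN_RESIDUES = {"K", "R"}  # assumption: trypsin cleaves after K or R


def count_tryptic_termini(annotated_peptide):
    """Count termini (0, 1 or 2) consistent with tryptic cleavage.

    annotated_peptide is written as "pre.PEPTIDE.post", e.g. "K.ACDEFGHR.C".
    """
    pre, peptide, _post = annotated_peptide.split(".")
    ntt = 0
    # N-terminus: the residue preceding the peptide must be a cleavage site.
    if pre in TRYPSIN_RESIDUES:
        ntt += 1
    # C-terminus: the last residue of the peptide must be a cleavage site.
    if peptide[-1] in TRYPSIN_RESIDUES:
        ntt += 1
    return ntt
```

A peptide would pass "-ntt 1" when this count is at least 1, and "-ntt 2" only when both termini match.
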
MS-GF+ output

MS-GF+ outputs results as an mzIdentML (version 1.1) file. See http://www.psidev.info/mzidentml/ for details on the mzIdentML format. For every PSM, MS-GF+ reports the following scores:

  • MS-GF:RawScore: MS-GF+ raw score of the peptide-spectrum match 
  • MS-GF:DeNovoScore: the score of the optimal scoring peptide for the spectrum (not necessarily in the database) (MS-GF:RawScore <= MS-GF:DeNovoScore)
  • MS-GF:SpecEValue: spectral E-value (spectrum level E-value) of the peptide-spectrum match - the lower the better
  • MS-GF:EValue: database level E-value (expected number of peptides in a random database having equal or better scores than the PSM score) - the lower the better
  • MS-GF:QValue
    • PSM-level Q-value estimated using the target-decoy approach.
    • MS-GF:QValue is computed solely based on MS-GF:SpecEValue.
  • MS-GF:PepQValue
    • Peptide-level Q-value estimated using the target-decoy approach.
    • Reported only if "-tda 1" is specified.
    • If multiple spectra are matched to the same peptide, only the best scoring PSM (lowest SpecProb) is retained. After that, MS-GF:PepQValue is calculated as #DecoyPSMs>s / #TargetPSMs>s among the retained PSMs. This approximates the Q-value of the set of unique peptides. In the MS-GF+ output, the same PepQValue value is given to all PSMs sharing the peptide. So, even a low-quality PSM may get a low PepQValue (if it has a high-quality "sibling" PSM sharing the peptide). Note that this should not be used to count the number of identified PSMs.
  • MS-GF+ output (*.mzid) can be converted into the tsv format using MzIDToTsv.
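
The peptide-level procedure described for MS-GF:PepQValue above can be sketched in Python. This is a rough illustration, not the MS-GF+ implementation: the function name and the (peptide, spec_prob, is_decoy) input layout are hypothetical, ties are not treated specially, and the final monotonicity step is the conventional q-value definition rather than something the text above confirms.

```python
def peptide_q_values(psms):
    """Map each peptide to a peptide-level q-value.

    psms is a list of (peptide, spec_prob, is_decoy) tuples.
    """
    # Step 1: keep only the best-scoring PSM (lowest SpecProb) per peptide.
    best = {}
    for peptide, spec_prob, is_decoy in psms:
        if peptide not in best or spec_prob < best[peptide][0]:
            best[peptide] = (spec_prob, is_decoy)

    # Step 2: walk the retained PSMs from best to worst and record the
    # decoy/target ratio at each position.
    ranked = sorted(best.items(), key=lambda item: item[1][0])
    ratios = []
    targets = 0
    decoys = 0
    for _peptide, (_spec_prob, is_decoy) in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        ratios.append(decoys / max(targets, 1))

    # Step 3 (assumption): make the values monotone from worst to best,
    # as is conventional for q-values.
    q_values = {}
    running_min = float("inf")
    for (peptide, _), ratio in zip(reversed(ranked), reversed(ratios)):
        running_min = min(running_min, ratio)
        q_values[peptide] = running_min
    return q_values
```

Every PSM sharing a peptide would then be assigned that peptide's value, which is why a low-quality PSM can carry a low PepQValue.
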
MS-GF+ output example

MzIdentML format 

TSV format (converted using MzIDToTsv)

MzIDToTsv

Converts MS-GF+ output (.mzid) into the tsv format (.tsv)

Usage: java -Xmx3500M -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv
	-i MzIDFile (MS-GF+ output file (*.mzid))
	[-o TSVFile] (TSV output file (*.tsv) (Default: MzIDFileName.tsv))
	[-showQValue 0/1] (0: do not show Q-values, 1: show Q-values (Default))
	[-showDecoy 0/1] (0: do not show decoy PSMs (Default), 1: show decoy PSMs)
	[-unroll 0/1] (0: merge shared peptides (Default), 1: unroll shared peptides)

Parameters:

  • -i MzIDFile
    • Path to the MS-GF+ result file (*.mzid)
  • -o TSVFile
    • Path to the tsv output file (*.tsv)
    • If not specified, for input MyFile.mzid, the output will be MyFile.tsv.
  • -showQValue 0/1
    • If 0, QValue and PepQValue are not reported.
    • If 1, QValue and PepQValue are reported (Default).
  • -showDecoy 0/1
    • If 0, decoy PSMs will not be reported (Default).
    • If 1, decoy PSMs will be reported.
  • -unroll 0/1
    • This parameter controls the output format for shared peptides (peptides matched to multiple proteins).
    • When "-unroll 0" (Default), a PSM matched to a shared peptide will be printed as a single line.
      • Peptide column does not show neighboring amino acids (e.g. IGAYLFVDMAHVAGLIAAGVYPNPVPHAHVVTSTTHK).
      • Protein column shows all proteins in a single line.
      • Example: MyProtein(pre=K,post=T);MyProteinIsoform(pre=K,post=T)
      • Download example file
    • When "-unroll 1", a PSM matched to a shared peptide will be printed in multiple lines.
      • Peptide column shows neighboring amino acids (e.g. K.IGAYLFVDMAHVAGLIAAGVYPNPVPHAHVVTSTTHK.T).
      • Different peptide-protein matches are printed in different lines.
      • Download example file
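
The relationship between the two -unroll layouts described above can be sketched in Python: a merged Protein column is expanded into one row per protein. This is an illustration only; the function name is hypothetical, and the parser assumes that entries are separated by ";" and follow the Name(pre=X,post=Y) pattern shown in the example above.

```python
import re

# One entry of the merged Protein column, e.g. "MyProtein(pre=K,post=T)".
_ENTRY = re.compile(r"^(?P<protein>.+)\(pre=(?P<pre>[^,]*),post=(?P<post>[^)]*)\)$")


def unroll(peptide, protein_column):
    """Expand a merged row into (annotated_peptide, protein) pairs."""
    rows = []
    for entry in protein_column.split(";"):
        match = _ENTRY.match(entry.strip())
        if match is None:
            raise ValueError("unrecognized protein entry: " + entry)
        # Re-attach the neighboring amino acids, as in the "-unroll 1" layout.
        annotated = match.group("pre") + "." + peptide + "." + match.group("post")
        rows.append((annotated, match.group("protein")))
    return rows
```

Applied to the example above, the peptide IGAYLFVDMAHVAGLIAAGVYPNPVPHAHVVTSTTHK with the Protein column MyProtein(pre=K,post=T);MyProteinIsoform(pre=K,post=T) yields two rows, one per protein.
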

BuildSA

Index a protein database for fast searching. 

Usage: java -Xmx3500M -cp MSGFPlus.jar edu.ucsd.msjava.msdbsearch.BuildSA 
	-d DatabaseFile (*.fasta or *.fa)
	[-tda 0/1/2] (0: target only, 1: target-decoy database only, 2: both)

Parameters:

  • -d DatabaseFile
    • Name of a protein database (*.fasta or *.fa)
    • Database file name must end with ".fasta" or ".fa".
  • -tda 0/1/2
    • If 0, only "DatabaseFile" will be indexed.
    • If 1, a new database file (*.revConcat.fasta) will be generated by appending reversed proteins. This forward-reverse database will be indexed.
    • If 2, both the original database and the forward-reverse database file will be indexed.

BuildSA creates a suffix array of the protein database. For an input database file DBFileName.fasta, BuildSA will generate 4 auxiliary files (DBFileName.canno, DBFileName.cnlcp, DBFileName.csarr, DBFileName.cseq). It needs to be executed only once per database file.
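
Before starting several MS-GF+ processes against the same database, it can be useful to check whether the four auxiliary files already exist. The Python sketch below is a hypothetical helper, not part of MS-GF+; it assumes the index files sit next to the database and share its base name, as in the DBFileName example above.

```python
from pathlib import Path

# Extensions of the auxiliary index files listed above.
INDEX_SUFFIXES = (".canno", ".cnlcp", ".csarr", ".cseq")


def missing_index_files(database_path):
    """Return the expected index files that do not exist yet."""
    database = Path(database_path)
    # with_suffix replaces only the final extension (.fasta or .fa).
    expected = [database.with_suffix(suffix) for suffix in INDEX_SUFFIXES]
    return [path for path in expected if not path.exists()]
```

An empty result suggests the database has already been indexed; otherwise BuildSA can be run once before launching the searches.
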
