Added by Jeremy Carver, last edited by Jeremy Carver on Mar 26, 2015

Converting Tab-Separated (TSV) Result Files to mzTab

mzTab is the primary result file format officially supported by the MassIVE repository for parsing, summarizing and visualizing spectrum identifications within MassIVE datasets. Therefore, we provide a convenient conversion workflow for users to generate valid mzTab for dataset submissions from most "tab-separated value" (TSV) search engine result file formats.

Running the TSV Conversion Workflow

To run the TSV to mzTab conversion workflow, first navigate to the official MassIVE repository website at http://massive.ucsd.edu/. You will then need to log in to your registered MassIVE account.

Once logged in, click on "Submit Data" to load the MassIVE workflow selection page. There are a number of useful workflows available on this page, but to convert TSV files to mzTab, be sure that the "Convert TSV files to mzTab" workflow is highlighted in the "Workflow" drop-down menu near the top of the page.

With this workflow selected, you will see an input form with all the fields you need to tell MassIVE how to convert your TSV file(s) to mzTab.

Input Files

The first section of the form is for specifying which files you'd like to use in the conversion workflow.

File Categories

There are two categories of files that can be selected:

  • Tab-Separated Result Files - The TSV spectrum identification files that you would like to convert to mzTab.
    • You can convert as many TSV files at a time as you'd like, but each run of the workflow will operate using the same set of parameters. So, if you have files with different TSV formats (e.g. type and number of columns, etc.), then you should run this workflow on them separately because their parameters will differ.
  • Peak List Files - Optional. These are not used in the actual conversion to mzTab format, but if you have the peak list files that your TSV files refer to, then you can provide them here to browse the converted mzTab files with visualized spectrum identifications. MassIVE cannot do this for you unless it has access to the original spectra.
    • For this to work, the filenames of the peak list files you provide here must exactly match the filenames that are listed in your TSV file.
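
Before uploading, it can help to check on your own machine that every filename referenced in a TSV file has a matching peak list file. The following is only a minimal sketch of such a check, not part of the MassIVE workflow: the function name `missing_peak_lists` and the sample data are hypothetical, and the comparison is assumed to be exact and case-sensitive.

```python
import csv
import io

def missing_peak_lists(tsv_text, filename_column, provided_filenames):
    """Return the filenames referenced in the TSV that were not provided."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    referenced = {row[filename_column] for row in reader}
    # Assumption: names are compared exactly, so "demo.ms2" != "Demo.MS2".
    return sorted(referenced - set(provided_filenames))

# Hypothetical TSV content with a header line.
tsv = "file\tscan\tsequence\ndemo.ms2\t8\tLAESITIEQGK\nother.ms2\t9\tPEPTIDEK\n"
print(missing_peak_lists(tsv, "file", ["demo.ms2"]))  # ['other.ms2']
```

An empty list means every referenced peak list file was found among the names you supplied.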

TSV File Header Line

Most tab-separated result formats include a header line, i.e. the first line of the file specifying the name of each tab-separated "column" in the file. For example, see this sample BiblioSpec .ssl file:

file		scan	charge	sequence
demo.ms2	00008	3	VGAGAPVYLAAVLEYLAAEVLELAGNAAR
demo.ms2	01806	2	LAESITIEQGK
demo.ms2	02572	2	ELAEDGC[+57.0]SGVEVR

Because most TSV formats include this header line, the workflow's "Header Line" checkbox is checked by default. If this is the case with your input TSV files, then just leave it as is.

However, if your file does not contain a header line, then you should uncheck this box. In this case, the meanings of the tab-separated columns are simply assumed based on their expected number and order, and therefore you will refer to each column by numerical index rather than name.

See below for more information on how to specify the file's important columns in each of these cases.

Uploading Input Files

Before you can assign any input files to the TSV conversion workflow, you must first upload them to your MassIVE account. See here for help with this.

Assigning Input Files

To select files for either of the listed categories, just click on one of the "Select Input Files" buttons to launch a file selector popup window (you may need to instruct your web browser to allow popups from this site). This window will present you with a view of your MassIVE user account files, as well as an interface with which to assign these files to either category.

To assign any of your uploaded files, simply click individual files or whole folders in the left-hand folder view, and then click on the appropriate file button (in the middle, with the green arrows) to assign them to the relevant category. Files and folders that have been added so far can be seen in the right-hand "Selected Files" view.

When you are done selecting files, you can either click on "Finish Selection" or simply close the popup window, and the file selections will be noted in blue on the main input form. You can always mouse over this blue text to verify exactly which files have been assigned to that file category.

TSV File Columns

The second section of the form is for informing the converter about all the important columns in your input TSV files.

Required Columns

There are a few columns that should be present in any tab-separated spectrum identification result file. Without the essential information stored in these columns, MassIVE cannot produce a meaningful mzTab output file. Therefore, in order to run the TSV conversion workflow, these columns, at a minimum, must be specified here.

  • Peak List Filename - Spectrum identifications are only useful if they can be linked back to source spectra, which are stored in peak list files. Each row of your TSV file should indicate which file the identified spectrum came from.
  • Spectrum Identifier (index or scan number) - To find a matched spectrum within the referenced peak list file, we must know its unique identifier within the file.
    • Most peak list files identify their spectra with scan numbers, so this option is the default selection here. However, if your file identifies spectra by index, then you can click that option instead.
  • Peptide ID - The actual (modified) peptide string that was matched to this spectrum.
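
To see how the filename and spectrum identifier end up in the converted file, here is a rough sketch of building the mzTab "spectra_ref" value for one row. It assumes the "scan=" and "index=" nativeID prefixes and normalizes the identifier to an integer (so leading zeros are dropped); the function name `spectra_ref` is hypothetical, and the real converter may behave differently.

```python
def spectra_ref(ms_run, identifier, id_type="scan"):
    """Build a spectra_ref string such as 'ms_run[1]:scan=9273'."""
    if id_type not in ("scan", "index"):
        raise ValueError("id_type must be 'scan' or 'index'")
    # Assumption: identifiers are written as plain integers, so "00008" becomes 8.
    return f"ms_run[{ms_run}]:{id_type}={int(identifier)}"

print(spectra_ref(1, "9273"))         # ms_run[1]:scan=9273
print(spectra_ref(1, "50", "index"))  # ms_run[1]:index=50
```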

Optional Columns

These columns are not required to produce meaningful mzTab output, but they are associated with known mzTab columns, so specifying them here can add additional useful information to your converted mzTab file.

  • Protein Accession - The accession, or unique ID string, of the matched protein, if any.
  • Charge - The charge state of this spectrum identification.

Remaining Columns

Any columns from your input TSV files that are not specified in the workflow's input form will be transferred to the output mzTab file as "optional" columns, as described in section 5.12.2 ("Adding optional columns"), pages 18-19, of the official mzTab format specification.
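
As a rough illustration of what such a transfer involves, the sketch below derives an optional column header from a TSV column name. It assumes the "opt_global_" prefix and the restricted character set for optional column names described in the mzTab specification; the function name `optional_column_name` and the underscore substitution are assumptions, and the headers actually written by the converter may differ.

```python
import re

def optional_column_name(tsv_column):
    """Derive an mzTab-style optional column header from a TSV column name."""
    # Assumption: any character outside A-Z, a-z, 0-9, '_', '-', '[', ']', ':'
    # is replaced with an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_\-\[\]:]", "_", tsv_column.strip())
    return "opt_global_" + cleaned

print(optional_column_name("e-value"))           # opt_global_e-value
print(optional_column_name("delta mass (ppm)"))  # opt_global_delta_mass__ppm_
```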

Column Entry Example

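For the following input TSV file (with header line), you would enter the corresponding column names into the form: "file" for Peak List Filename, "scan" for Spectrum Identifier, "sequence" for Peptide ID and, optionally, "charge" for Charge: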

file		scan	charge	sequence
demo.ms2	00008	3	VGAGAPVYLAAVLEYLAAEVLELAGNAAR
demo.ms2	01806	2	LAESITIEQGK
demo.ms2	02572	2	ELAEDGC[+57.0]SGVEVR

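However, if the same TSV file did not have a header line, then you would instead enter the (0-based) numerical index of each relevant column: 0 for Peak List Filename, 1 for Spectrum Identifier, 3 for Peptide ID and, optionally, 2 for Charge: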

demo.ms2	00008	3	VGAGAPVYLAAVLEYLAAEVLELAGNAAR
demo.ms2	01806	2	LAESITIEQGK
demo.ms2	02572	2	ELAEDGC[+57.0]SGVEVR
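
The two ways of referring to columns can be sketched as follows. This is only an illustration of the idea, not the converter's own code: the names `resolve_column` and `extract_fields` and the field keys are hypothetical.

```python
def resolve_column(header, entry):
    """Turn a form entry into a 0-based column position."""
    if header is not None:
        return header.index(entry)  # by name; raises ValueError if absent
    return int(entry)               # by 0-based numerical index

def extract_fields(tsv_text, has_header, entries):
    """Read the requested fields from every data row of a TSV string."""
    rows = [line.split("\t") for line in tsv_text.splitlines() if line.strip()]
    header = rows.pop(0) if has_header else None
    positions = {field: resolve_column(header, entry) for field, entry in entries.items()}
    return [{field: row[pos] for field, pos in positions.items()} for row in rows]

with_header = "file\tscan\tcharge\tsequence\ndemo.ms2\t01806\t2\tLAESITIEQGK\n"
without_header = "demo.ms2\t01806\t2\tLAESITIEQGK\n"

by_name = extract_fields(with_header, True,
                         {"filename": "file", "spectrum_id": "scan", "peptide": "sequence"})
by_index = extract_fields(without_header, False,
                          {"filename": "0", "spectrum_id": "1", "peptide": "3"})
print(by_name == by_index)  # True
```

Either way, the same three values are read from each row; only the way you point at the columns changes.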

Post-Translational Modifications

The third section of the form is for informing the converter about the post-translational modifications that occur in your input TSV files.

Most spectrum identification searches include one or more post-translational modifications. It is expected that your TSV files will simply encode these modifications directly into the "Peptide ID" string value of each row (i.e. the modified peptide sequence that was identified for that spectrum). For example:

file		scan	sequence
spec.mzXML	9273	VIEQPITSETAM[+15.995]K

However, in the mzTab file that is produced from this input, these modifications are treated much more rigorously - they must be catalogued, labeled and declared in accordance with sections 5.8 ("Reporting modifications and amino acid substitutions", pages 15-16) and 6.2.24-29 ("fixed_mod[1-n]" and "variable_mod[1-n]" columns, pages 26-28) of the official mzTab format specification.

For example, in mzTab, the oxidation row from the TSV example above would look something like this:

...
MTD	ms_run[1]-format	[PSI-MS, MS:1000566, ISB mzXML format, ]
MTD	ms_run[1]-location	file://spec.mzXML
MTD	ms_run[1]-id_format	[PSI-MS, MS:1000776, scan number only nativeID format, ]
...
MTD	variable_mod[1]		[UNIMOD, UNIMOD:35, Oxidation, ]
MTD	variable_mod[1]-site	M
...
PSH	sequence	...	modifications	...	spectra_ref		...
PSM	VIEQPITSETAMK	...	12-UNIMOD:35	...	ms_run[1]:scan=9273	...

Because mzTab keeps such careful track of every modification in the search results, the converter needs to know about all the mods that it will find in your TSV file, to properly fill out the related sections of the mzTab file.

Known Variable Modifications

For this converter, a "variable" modification is defined as any one that is explicitly written into the peptide sequence in your TSV file, such as the oxidation from the example above - even if it occurs universally for all amino acids of a certain type.

If the mass offset appears in the peptide string, then it should go under "Variable Modifications".

For each such modification that you know about and expect to find in your file, you should enter a reference here. To do this, start by simply typing the first few characters of the modification name into the "Variable Modifications" text box, and then click the correct entry when it appears.

Then, enter into the small text box the EXACT modification pattern that you see in your file's peptide sequence, including the amino acid(s).

Now, whenever the converter finds this pattern in any peptide sequence in your file, it will catalogue and declare the correct mod reference, and then assign it to the correct position for that identification.
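
To make this concrete, here is a minimal sketch of how a single declared pattern could be stripped from a peptide string and turned into an mzTab-style "modifications" value. It is an illustration under simple assumptions (one literal pattern, one modified residue per pattern), not the converter's actual implementation; the function name `apply_variable_mod` is hypothetical.

```python
def apply_variable_mod(peptide, pattern, residue, accession):
    """Strip every literal `pattern` from `peptide` and record where it occurred.

    Returns the unmodified sequence and a comma-separated list of
    'position-accession' entries (1-based positions).
    """
    bare, mods, i = [], [], 0
    while i < len(peptide):
        if peptide.startswith(pattern, i):
            bare.append(residue)                     # keep only the amino acid
            mods.append(f"{len(bare)}-{accession}")  # position of that residue
            i += len(pattern)
        else:
            bare.append(peptide[i])
            i += 1
    return "".join(bare), ",".join(mods)

print(apply_variable_mod("VIEQPITSETAM[+15.995]K", "M[+15.995]", "M", "UNIMOD:35"))
# ('VIEQPITSETAMK', '12-UNIMOD:35')
```

The result matches the "sequence" and "modifications" values shown in the mzTab example above.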

Amino Acid Patterns

When entering your modification pattern in the text box, you are not limited to a single amino acid. You can enter multiple amino acids to match any one of them, or you can enter "*" to match anything (e.g. N-terminal modifications). For example:

Selected modification        | Expected sites | Value entered in text box | Example matching peptide string
UNIMOD:35 (Oxidation)        | M              | M+16                      | SAM+16PLE
UNIMOD:21 (Phosphorylation)  | S, T or Y      | STY[79.97]                | PEPT[79.97]IDE
UNIMOD:5 (Carbamylation)     | N-terminal     | (*,+43.01)                | (P,+43.01)EPTIDE
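
One way to picture how these patterns behave is to translate each one into a regular expression. The sketch below is an assumption about the matching rules described here (the first run of uppercase letters, or "*", names the allowed amino acids; everything else is literal text), and the function name `pattern_to_regex` is hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Translate a form pattern such as 'STY[79.97]' into a compiled regex."""
    match = re.search(r"[A-Z]+|\*", pattern)
    if match is None:
        raise ValueError("pattern must contain at least one amino acid or '*'")
    # '*' means any amino acid; several letters mean any one of them.
    residues = "[A-Z]" if match.group() == "*" else "[" + match.group() + "]"
    prefix = re.escape(pattern[:match.start()])  # literal text before the residue
    suffix = re.escape(pattern[match.end():])    # literal text after the residue
    return re.compile(prefix + "(" + residues + ")" + suffix)

examples = [("M+16", "SAM+16PLE"),
            ("STY[79.97]", "PEPT[79.97]IDE"),
            ("(*,+43.01)", "(P,+43.01)EPTIDE")]
for pattern, peptide in examples:
    found = pattern_to_regex(pattern).search(peptide)
    print(pattern, "->", found.group(1))
```

For the three rows of the table, the captured amino acids are M, T and P respectively.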

Modification Pattern Precision

However, when entering the modification's numerical mass offset in the text box, you MUST enter the EXACT numerical text that occurs in your TSV file in order to match the modification correctly.

For example, if your TSV file reports oxidations using the string "M[+15.995]" (as in the example above), and you enter something like "M[+16]", or "M[+15.994915]" (the actual UNIMOD precision for oxidation), or any other similar but slightly different value, then the converter will not match the modification and your conversion will fail. In other words, the converter only performs literal string matches when looking for your declared mods.

Why is this? Well, the converter could certainly be made a little smarter to try to match mods within some arbitrary level of precision, but we decided not to do this because there are so many mods in UNIMOD that have mass shift values that are extremely close to one another, and the resulting mzTab file would technically be incorrect if the converter were ever to guess wrong between two very similar but different mods. So, in the interests of guaranteeing correctness, the converter will not try to make any guesses; it will only do exactly what it's told and nothing more. Consequently, please be careful to verify the exact mass string that is used in your TSV file, and enter it correctly here.
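
The ambiguity can be shown with two well-known UNIMOD entries whose monoisotopic masses differ by less than 0.04 Da. The sketch below is purely illustrative: the function name `candidates_within` and the tolerance values are hypothetical, and the converter performs no such tolerance matching.

```python
# Monoisotopic mass shifts of two UNIMOD modifications that are easy to confuse.
UNIMOD_MASSES = {
    "UNIMOD:1 (Acetyl)": 42.010565,
    "UNIMOD:37 (Trimethyl)": 42.046950,
}

def candidates_within(observed, tolerance):
    """List every modification whose mass lies within `tolerance` of `observed`."""
    return sorted(name for name, mass in UNIMOD_MASSES.items()
                  if abs(mass - observed) <= tolerance)

# A rounded value such as 42.03 is compatible with both mods at 0.02 Da,
# so any automatic choice between them would be a guess.
print(candidates_within(42.03, 0.02))
# ['UNIMOD:1 (Acetyl)', 'UNIMOD:37 (Trimethyl)']
```

With literal string matching, by contrast, only the modification you declared for that exact text can ever be assigned.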

Blind (Unknown) Modifications

Sometimes your TSV file contains modifications that you can't anticipate in advance, such as when running a blind modification search. For example, say your file contains a lot of lines like these, where all of the mods were discovered during the search:

file		index	sequence
spec.mzXML	50	(V,-1)KSPELQAEAK
spec.mzXML	864	(L,-2)VGFIDDAVKK
spec.mzXML	2200	KGYTQ(Q,+1)LAFR

In this case, it's not really feasible to look through your entire file, find every new modification that was discovered, and then enter each one separately, since there could be hundreds or thousands of slightly different blind modifications that were found in the search. Instead, it would be convenient to enter only one reference for a generic "unknown" modification, and then let the converter assign this reference to any modification it finds that matches the pattern you specify.

To do this, first enter "unknown" into the "Variable Modifications" text box, and then click the correct entry when it appears ("MS:1001460").

Then, enter into the small text box the blind modification pattern that you see in your file's peptide sequence, using the character "*" to represent any amino acid, and "#" to represent any number. Here, "#" will match any numerical text, including decimal points (".") and leading sign characters ("+" or "-"). So, for the example above, the pattern would be "(*,#)".

Now, whenever the converter finds any mod matching this pattern in any peptide sequence in your file, it will assign the correct "unknown" mod reference ("MS:1001460") to the correct position for that identification.
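
The wildcard behavior can be sketched with a regular expression. This is only an approximation of the rules described above, assuming the pattern contains exactly one "*"; the names `blind_pattern_to_regex` and `strip_blind_mods`, and the exact form of the output string, are assumptions rather than the converter's actual output.

```python
import re

# '#' stands for numerical text with an optional sign and decimal part.
NUMBER = r"[+-]?\d+(?:\.\d+)?"

def blind_pattern_to_regex(pattern):
    """Translate a blind pattern such as '(*,#)' into a compiled regex."""
    pieces = []
    for ch in pattern:
        if ch == "*":
            pieces.append("([A-Z])")  # capture the modified amino acid
        elif ch == "#":
            pieces.append(NUMBER)
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces))

def strip_blind_mods(peptide, pattern, accession="MS:1001460"):
    """Return the bare sequence and 'position-accession' entries for each match."""
    bare, mods, pos = "", [], 0
    for found in blind_pattern_to_regex(pattern).finditer(peptide):
        bare += peptide[pos:found.start()] + found.group(1)
        mods.append(f"{len(bare)}-{accession}")
        pos = found.end()
    return bare + peptide[pos:], ",".join(mods)

print(strip_blind_mods("(V,-1)KSPELQAEAK", "(*,#)"))  # ('VKSPELQAEAK', '1-MS:1001460')
print(strip_blind_mods("KGYTQ(Q,+1)LAFR", "(*,#)"))   # ('KGYTQQLAFR', '6-MS:1001460')
```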

Unknown Modification Matching Priority

If any known modifications are entered along with unknown ones, then the known modifications will be matched first. The "unknown" pattern will only be applied if a found modification does not match any entered known mod references.
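
This ordering can be sketched as a simple two-step lookup. The example assumes, hypothetically, that you declared "(Q,+1)" as deamidation (UNIMOD:7) alongside the "(*,#)" unknown pattern; the names `compile_pattern` and `assign_accession` are illustrative only, and for brevity the sketch handles single-residue patterns.

```python
import re

NUMBER = r"[+-]?\d+(?:\.\d+)?"

def compile_pattern(pattern):
    """Translate a pattern into a regex: '*' is any residue, '#' any number."""
    pieces = []
    for ch in pattern:
        if ch == "*":
            pieces.append("[A-Z]")
        elif ch == "#":
            pieces.append(NUMBER)
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces))

def assign_accession(mod_text, known, unknown_pattern=None):
    """Try every known mod first; fall back to the unknown pattern only if none match."""
    for pattern, accession in known:
        if compile_pattern(pattern).fullmatch(mod_text):
            return accession
    if unknown_pattern is not None and compile_pattern(unknown_pattern).fullmatch(mod_text):
        return "MS:1001460"
    return None  # nothing declared matches this text

known = [("(Q,+1)", "UNIMOD:7")]  # hypothetical user declaration
print(assign_accession("(Q,+1)", known, "(*,#)"))  # UNIMOD:7
print(assign_accession("(L,-2)", known, "(*,#)"))  # MS:1001460
```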

Fixed Modifications

For this converter, a "fixed" modification is defined as any one that occurs universally for all amino acids of a certain type, but which is NOT written into the peptide sequence in your TSV file - it's simply assumed for all occurrences of that amino acid.

If the mass offset does NOT appear in the peptide string, but is simply assumed for all amino acids of a certain type, then it should go under "Fixed Modifications".

When entering references for fixed modifications, use the same procedure as for variable modifications above, except keep in mind that it's only "fixed" if the mass shift is missing from the peptide string. Therefore, the pattern entered in the small text box should consist only of the relevant amino acids, since that's all that will be found in the peptide string to reflect this modification.

Now, whenever the converter finds any occurrence of this amino acid in your file, it will catalogue and declare the correct mod reference, and then assign it to the correct position for that identification.
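
As a small illustration, the sketch below assigns a fixed modification to every occurrence of the declared amino acid(s) in an unannotated sequence. It assumes, as an example, carbamidomethylation of cysteine (UNIMOD:4) declared with the pattern "C"; the function name `fixed_mod_positions` is hypothetical and this is not the converter's own code.

```python
def fixed_mod_positions(sequence, residues, accession):
    """Return 'position-accession' entries for every residue listed in `residues`."""
    return ",".join(f"{position}-{accession}"
                    for position, amino_acid in enumerate(sequence, start=1)
                    if amino_acid in residues)

# The mass shift is not written in the peptide string; it is implied for every C.
print(fixed_mod_positions("ELAEDGCSGVEVR", "C", "UNIMOD:4"))  # 7-UNIMOD:4
```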