Added by Jeremy Carver, last edited by Jeremy Carver on Oct 30, 2014

To submit a dataset to the MassIVE repository, you must complete two steps: first upload your data files to ProteoSAFe, and then run the MassIVE dataset submission workflow on those files.

Uploading Dataset Files

The first step in the submission process is to upload your dataset files to your ProteoSAFe account. MassIVE dataset submission is only available to registered users of ProteoSAFe - anonymous submissions are not supported. For instructions on how to register an account in ProteoSAFe, see here.

Once you have a registered ProteoSAFe account, we strongly recommend that you use FTP, rather than the ProteoSAFe web interface, to upload your MassIVE dataset files to your account. The web-based upload interface is optimized for quick uploads of small files, whereas FTP is considerably more stable and robust for the much larger files typically associated with MassIVE datasets.

For instructions on how to use FTP to upload files to your ProteoSAFe account, see here. Please be sure to connect to the MassIVE FTP server (massive.ucsd.edu) when doing this!

One of the main benefits of using FTP to upload your dataset files is that it allows you to freely organize the files into a folder structure of your choosing, before assigning them to the various submission categories of the MassIVE workflow. By pre-arranging your files in this manner, you will find it much easier to assign them when setting up your submission workflow. See below for a more detailed explanation.
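
As a rough illustration of uploading a pre-arranged folder structure over FTP, the following sketch uses Python's standard ftplib module. The function names (plan_uploads, ensure_remote_dirs, upload_tree), the local folder name, and the placeholder credentials are assumptions made for this example. Only the server name massive.ucsd.edu comes from the text above. Any ordinary FTP client achieves the same result.

```python
import os
from ftplib import FTP, error_perm


def plan_uploads(local_root):
    """Return (local_path, remote_path) pairs for every file under local_root.

    Remote paths use forward slashes and are relative to the FTP login
    directory, so the local folder structure is reproduced on the server.
    """
    pairs = []
    for dirpath, dirnames, filenames in os.walk(local_root):
        dirnames.sort()  # makes the traversal order deterministic
        for name in sorted(filenames):
            local_path = os.path.join(dirpath, name)
            relative = os.path.relpath(local_path, local_root)
            pairs.append((local_path, relative.replace(os.sep, "/")))
    return pairs


def ensure_remote_dirs(ftp, remote_path):
    """Create each parent folder of remote_path, skipping ones that exist."""
    current = ""
    for part in remote_path.split("/")[:-1]:
        current = part if not current else current + "/" + part
        try:
            ftp.mkd(current)
        except error_perm:
            pass  # most servers answer with a permanent error if it exists


def upload_tree(ftp, local_root):
    """Upload every file under local_root through an open FTP session."""
    uploaded = []
    for local_path, remote_path in plan_uploads(local_root):
        ensure_remote_dirs(ftp, remote_path)
        with open(local_path, "rb") as handle:
            ftp.storbinary("STOR " + remote_path, handle)
        uploaded.append(remote_path)
    return uploaded


# Example usage (not executed here; the credentials are placeholders):
# with FTP("massive.ucsd.edu") as ftp:
#     ftp.login("your_proteosafe_username", "your_password")
#     upload_tree(ftp, "my_dataset")
```

Because upload_tree only needs an object with mkd and storbinary methods, it can be tried against a stand-in object before connecting to the real server.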

Submission File Categories

MassIVE dataset files are organized into the following categories when making a submission:

  • License Files - Specifying how and under what conditions the dataset files may be downloaded and used. Multiple license files may be uploaded, if appropriate.
  • Raw Spectrum Files - Raw mass spectrum files in a non-standard or instrument-specific format, such as Thermo .RAW files, AB Sciex .WIFF files, etc.
  • Peak List Files - Processed mass spectrum files in a standardized format, such as mzXML, mzML, or MGF files.
  • Search Engine Files - The output of any search engine or data analysis tools or pipelines that were used to analyze this dataset, unless provided in a standardized format recognized by the "Result Files" category (see below).
  • Result Files - Spectrum identifications in a standardized format. The following formats are recognized by MassIVE as valid for this category: mzIdentML and mzTab.
  • Sequence Databases - Any protein or other sequence databases that were searched against in the analysis of this dataset, if applicable (usually .fasta format).
  • Spectral Libraries - Any spectral library files that were searched against in the analysis of this dataset, or that were generated using the spectrum files provided in this dataset, if applicable.
  • Quantification Results - Any data and metadata generated by the analysis software used for performing exclusively the quantification analysis of peptides and proteins.
  • Gel Images - Any gel image files generated, in the event that two-dimensional gel electrophoresis has been used as a separation method.
  • Methods and Protocols - Any open-format files containing explanations or discussions of the experimental procedures used to obtain or analyze this dataset.

NOTE: Currently, none of the above file categories requires its assigned files to conform to any particular format. However, file format detection and automatic conversion support are planned for the MassIVE dataset submission workflow, via the integrated ProteoWizard suite of file format tools.

A recommended procedure for simplifying the MassIVE submission process is to use FTP to upload and organize your files into separate folders, each named for one of the above categories. This way, when you are ready to make your submission, all you need to do is select each folder and assign it to its proper category, without having to fumble through a large number of individual file assignments. See below for more information on assigning files in the submission workflow.
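
One way to prepare such per-category folders is to group files by extension before uploading. The sketch below is only an illustration: the folder names and the extension-to-category mapping are assumptions based on the example formats listed above (for instance, .mzid is assumed for mzIdentML files), and MassIVE itself does not require any particular folder naming.

```python
from pathlib import PurePosixPath

# Assumed extension-to-folder mapping; adjust it to match your own files.
CATEGORY_BY_EXTENSION = {
    ".raw": "raw_spectrum_files",
    ".wiff": "raw_spectrum_files",
    ".mzxml": "peak_list_files",
    ".mzml": "peak_list_files",
    ".mgf": "peak_list_files",
    ".mzid": "result_files",
    ".mztab": "result_files",
    ".fasta": "sequence_databases",
}


def category_folder(filename, default="unsorted"):
    """Return the folder for a file, based on its extension (case-insensitive)."""
    suffix = PurePosixPath(filename).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, default)


def group_by_category(filenames):
    """Group file names into {folder_name: [file names]}, keeping input order."""
    groups = {}
    for name in filenames:
        groups.setdefault(category_folder(name), []).append(name)
    return groups
```

Files that land in the "unsorted" group (search engine output, gel images, protocol documents, and so on) still need to be sorted by hand, since their extensions vary too widely to map reliably.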

Submitting a Dataset

MassIVE dataset submission is implemented as a workflow in the ProteoSAFe system. To submit a dataset, first navigate to the official MassIVE repository ProteoSAFe site - http://massive.ucsd.edu/ProteoSAFe/. Once there, you can launch the MassIVE workflow by locating it in the "Workflow" drop-down menu near the top of the page. However, remember that only registered users can see or use the MassIVE workflow, so be sure to log in first!

You will be presented with an input form including all file and metadata fields relevant to your submission. The file categories seen in this form are explained above. The only data requirement is to submit at least one file in either the "Raw Spectrum Files" or "Peak List Files" categories - each of the remaining file types you see is optional, allowing you to customize your dataset to include as few or as many types of files as you see fit.

The other requirement for submission is to provide a license for your dataset. If you do not have your own license file, simply leave the "Standard License" checkbox checked, and your dataset will be submitted under the default Creative Commons CC0 1.0 Universal license.
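
The two submission requirements described above (at least one spectrum file, plus some form of license) can be expressed as a small pre-flight check. This is a minimal sketch and not the validation that ProteoSAFe actually performs. The function name check_submission and the dictionary layout are assumptions made for illustration.

```python
def check_submission(files_by_category, standard_license=True):
    """Return a list of problems; an empty list means the basic rules are met.

    files_by_category maps a category name, as shown on the input form, to a
    list of selected files. This dictionary shape is an assumption of the
    sketch, not a ProteoSAFe data structure.
    """
    problems = []

    # Rule 1: at least one raw spectrum file or peak list file is required.
    spectra = (files_by_category.get("Raw Spectrum Files", [])
               + files_by_category.get("Peak List Files", []))
    if not spectra:
        problems.append("Select at least one raw spectrum or peak list file.")

    # Rule 2: either keep the standard license or supply a license file.
    if not standard_license and not files_by_category.get("License Files"):
        problems.append("Provide a license file or keep the standard license.")

    return problems
```

For example, check_submission({"Peak List Files": ["run.mzML"]}) returns an empty list, because the standard license is assumed to remain checked.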

To select files for any of the listed categories, just click on one of the "Select Input Files" buttons to launch the ProteoSAFe file selector popup window (you may need to allow popups from this site in your web browser). This window shows your ProteoSAFe user account files, along with an interface for assigning these files to the various dataset categories.

To add any of your uploaded files to the dataset, simply click on them (individual files, or whole folders) in the left-hand folder view, and then click on the appropriate file button (in the middle, with the green arrows) to assign them to the relevant category. Files and folders that have been added so far can be seen in the right-hand "Selected Files" view.

When you are done selecting files, you can either click on "Finish Selection" or simply close the popup window, and the file selections will be noted in blue on the main input form.

Finally, the "Dataset Password" field is optional; any dataset submitted without a password will be freely visible to any user who knows its URL. However, all newly submitted datasets, regardless of password settings, are considered "private" until explicitly released for public viewing by the submitting user. This means that even if no password was provided, a private dataset will not appear in any public searches from ProteoSAFe. See below for more information on dataset privacy.

When you are ready to submit your dataset, click on the "Submit" button at the bottom of the form, and your dataset will be sent to the ProteoSAFe server for validation. If any problems are found with your input, you will be notified. Otherwise, you will see the job status page, which will be periodically updated with the current status of your dataset submission workflow. As with all ProteoSAFe workflows, blue boxes represent activities that have not yet started, orange boxes represent currently running activities, and green boxes represent activities that have completed successfully.

The dataset submission workflow includes steps to record the details of the dataset in a database, validate the dataset files, and then securely copy them to the MassIVE repository. If there is a problem in any step of the workflow, the job will fail and the relevant error messages will be displayed on the status page.

Viewing a Dataset

Once a dataset has been successfully submitted, its status page will update to a dataset details view. This page will show the dataset's relevant metadata, as well as provide a link to browse and download the actual dataset files themselves.

This page can be viewed by anyone who knows its URL. However, if the dataset is still private, it will not be seen in public searches. In this way, private datasets can still be shared with relevant users (such as publication reviewers), while still maintaining reasonable privacy from the general public. See below for more information about dataset privacy.

To view a dataset's files, simply click on the "FTP Download" link. Please note that when clicking this link, some web browsers may then ask you to enter a username, even though it is encoded directly into the URL. In this case, simply type the MassIVE dataset ID (the "MSVXXXXXXXXX" part just before the "@" symbol in the URL), and you should be provided access. If the dataset is both private and password-protected, then you will need to enter the password; otherwise, if you are prompted for a password, then just enter the anonymous default password "a" to browse any non-password protected dataset. See Dataset Privacy below for more details.
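
The login details described above can also be read directly out of the FTP link. The following sketch assumes the link has the form ftp://<dataset ID>@<host>/. The function name ftp_login_details and the dataset ID in the usage note are hypothetical.

```python
from urllib.parse import urlparse

# Default password for datasets that are not password-protected (see text).
DEFAULT_PUBLIC_PASSWORD = "a"


def ftp_login_details(ftp_url, dataset_password=None):
    """Return (host, username, password) for a dataset FTP link.

    The username is the dataset ID found just before the "@" in the URL.
    If no dataset password is supplied, the default public password is used.
    """
    parsed = urlparse(ftp_url)
    if parsed.scheme != "ftp" or not parsed.username:
        raise ValueError("expected an FTP URL of the form ftp://<dataset ID>@<host>/")
    password = dataset_password or DEFAULT_PUBLIC_PASSWORD
    return parsed.hostname, parsed.username, password
```

Calling ftp_login_details("ftp://MSV000012345@massive.ucsd.edu/") with this made-up ID yields the host, the ID as the username, and "a" as the password.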

Once you are able to connect to the dataset's FTP directory, then the browser will bring up a listing of the dataset's top-level directory, which can then be navigated just like any directory listing.

Alternatively, you can simply copy the FTP URL and then paste it into your preferred FTP client program, which will then (after any relevant password authentication) provide the same access to the dataset's files.
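
For scripted access, the same browsing and downloading can be sketched with Python's standard ftplib module in place of a graphical FTP client. The helper names below are assumptions for illustration. The host, dataset ID, and password must be taken from the dataset's own FTP link and privacy settings.

```python
from ftplib import FTP


def open_session(host, dataset_id, password="a"):
    """Connect and log in, using the dataset ID as the FTP username."""
    ftp = FTP(host)
    ftp.login(dataset_id, password)
    return ftp


def list_top_level(ftp):
    """Return the sorted names in the current directory of an open session."""
    return sorted(ftp.nlst())


def download_file(ftp, remote_name, local_path):
    """Download one remote file in binary mode and save it to local_path."""
    with open(local_path, "wb") as handle:
        ftp.retrbinary("RETR " + remote_name, handle.write)
```

A typical session would call open_session once, inspect the result of list_top_level, and then call download_file for each file of interest before closing the connection with ftp.quit().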

Dataset Privacy

Whenever a new dataset is submitted to the MassIVE repository, it is by default considered "private". This means that, although technically anyone can view its basic details page (assuming that they know the correct URL to that page), a private dataset will never show up in any ProteoSAFe search. Therefore, it is essentially impossible to find or view a private dataset without knowing its status page URL.

Furthermore, private datasets can be password protected by entering a non-empty value in the "Dataset Password" field on the initial submission input form. This password must then be provided in order to connect to the ProteoSAFe FTP server and view or download the dataset's files, even if the user is able to find and load the dataset's status page.

IMPORTANT NOTE: Due to implementation requirements for controlled FTP access, even non-password protected datasets are technically given a password. However, all such datasets (including public ones) have the same default public password, which is simply the string "a". Try entering this if you want to view a public dataset and you are being challenged by the server for password authentication.

Once a dataset is submitted, its owner (the ProteoSAFe user who launched the submission workflow) can make it public at any time by clicking on the "Make Public" link near the top of the status page. Once a dataset is made public, its password is removed (or, more specifically, its password is set to the public password "a"), and it then becomes visible in public dataset searches from ProteoSAFe.
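
The privacy rules in this section can be summarized as a small state model. The class below is purely illustrative and is not how ProteoSAFe is implemented. The class name, field names, and the dataset ID in the usage note are assumptions.

```python
from dataclasses import dataclass

PUBLIC_PASSWORD = "a"  # shared default for non-password-protected datasets


@dataclass
class Dataset:
    dataset_id: str
    password: str = PUBLIC_PASSWORD  # used when no password was entered
    is_public: bool = False          # new datasets start out private

    def make_public(self):
        """Release the dataset: reset its password and mark it public."""
        self.password = PUBLIC_PASSWORD
        self.is_public = True

    def appears_in_search(self):
        """Only public datasets show up in ProteoSAFe dataset searches."""
        return self.is_public

    def ftp_access_allowed(self, supplied_password):
        """FTP access requires the dataset's current password."""
        return supplied_password == self.password
```

For instance, a dataset created as Dataset("MSV000012345", password="secret") rejects the password "a" and stays out of searches until make_public() is called, after which "a" is accepted and the dataset becomes searchable.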

Searching Datasets

To search the public datasets available in the MassIVE repository, click on the "MassIVE Datasets" link in the top menu of ProteoSAFe.

You will be presented with a list of all current public datasets in the repository. This list is sorted by default in increasing order of upload date (earliest first). You can page through the list using the controls at the top, and you can also sort and filter by any column using the controls in the column headers. To filter, simply enter your search term in the box at the top of the relevant column, and then click "Filter"; multiple filters can be combined in this manner. Alternatively, you can select specific datasets by checking the boxes next to them, then check the "checked only" box, and then hit "Filter", to reduce the list to only those rows that you want to see.
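
Combining several column filters amounts to keeping only the rows that satisfy every filter at once. The sketch below illustrates that logic on a list of dictionaries. The column names are hypothetical, and case-insensitive substring matching is an assumption, since the exact matching rules of the ProteoSAFe table are not described here.

```python
def filter_rows(rows, filters, sort_key="upload_date"):
    """Keep rows matching every filter, sorted by sort_key (earliest first).

    rows is a list of dictionaries, one per dataset. filters maps a column
    name to a search term. A row matches when each term occurs somewhere in
    the corresponding column value, ignoring case.
    """
    def matches(row):
        return all(term.lower() in str(row.get(column, "")).lower()
                   for column, term in filters.items())

    return sorted((row for row in rows if matches(row)),
                  key=lambda row: row[sort_key])
```

With dates stored as ISO strings such as "2014-01-05", the default sort reproduces the earliest-first ordering described above.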

For datasets imported from Tranche, the Tranche hash is also displayed, allowing you to filter by that column in order to search for a specific Tranche dataset.

When you are ready to view the details of a particular dataset, just click on its green "MassIVE ID" link to load the status page for that dataset.