The public MassIVE datasets are complete datasets that are published at massive.ucsd.edu, optionally alongside paper publications. The MassIVE repository provides a location for researchers to access datasets that have been made available by others. However, the MassIVE repository will not function merely as a file server and a data graveyard.
These datasets remain alive long after publication. At GNPS, users can browse datasets, download datasets, and comment on datasets. Comments can be accompanied by new data or new analysis that enriches the MassIVE dataset. Additionally, users can subscribe to specific datasets: while the underlying data might not change, the understanding of the data will. To continuously learn more about each dataset, each one is searched against the ever-growing public reference annotated spectral libraries, and new identifications are reported to subscribers. Beyond new identifications within a dataset, subscribers are also made aware of other datasets that exhibit similarities to the subscribed dataset. This connects users through their shared interest in similar datasets.
Public MassIVE datasets specific to GNPS can be found here. By default, the list is filtered to GNPS datasets under the Title column. To view all MassIVE datasets and not just GNPS datasets, users can remove the GNPS filter under Title and click the Filter button.
These MassIVE datasets are available to download in their raw form by clicking on the MassIVE ID and then the FTP link.
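For scripted downloads, the same FTP location can be reached from code. The following is a minimal sketch, not an official client. It assumes that a dataset is served anonymously at `ftp://massive.ucsd.edu/<MassIVE ID>/` and that IDs consist of `MSV` followed by nine digits. Both are assumptions, and the FTP link shown on the dataset page remains the authoritative address. The helper names and the ID `MSV000000000` are placeholders.

```python
import re
from ftplib import FTP
from typing import List

# Assumption: the FTP server shares the repository host name.
MASSIVE_FTP_HOST = "massive.ucsd.edu"


def massive_ftp_url(dataset_id: str) -> str:
    """Build the assumed FTP URL for a MassIVE dataset ID."""
    # Assumption: IDs look like "MSV" followed by nine digits.
    if not re.fullmatch(r"MSV\d{9}", dataset_id):
        raise ValueError(f"not a MassIVE ID: {dataset_id!r}")
    return f"ftp://{MASSIVE_FTP_HOST}/{dataset_id}/"


def list_dataset_files(dataset_id: str, timeout: float = 30.0) -> List[str]:
    """List the top-level entries of a dataset over anonymous FTP."""
    massive_ftp_url(dataset_id)  # validates the ID format
    with FTP(MASSIVE_FTP_HOST, timeout=timeout) as ftp:
        ftp.login()  # anonymous login
        return ftp.nlst(dataset_id)
```

Calling `list_dataset_files` with a real ID copied from the dataset table requires network access. If the server uses a different directory layout, use the path from the FTP link on the dataset page instead.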
Clicking the dataset link brings users to a dataset page, where they can see complete information about that particular dataset.
Beyond downloading the data, users can also use the tools available at GNPS to compute on the converted mzXML versions of the MassIVE data.
To import a MassIVE dataset to compute on, click the Import Dataset to Analyze button. To go directly to molecular networking with the dataset imported, click Import and Analyze Dataset with Networking Now.
NOTE: MassIVE datasets will not be available to compute on for up to 48 hours after submission, as background conversion and processing must occur first.
In this view, all comments are displayed in a table, and users can click View to see a particular comment's attachments.
To contribute comments, users can click the Comment on Dataset link.
To add metadata to a dataset, click Update/Add Metadata. This redirects users to a massive.ucsd.edu page where they can update the appropriate metadata. To add publications associated with the dataset, users can click the “Add Publication” link.
Beyond individual users being able to compute on MassIVE data, GNPS periodically computes on all MassIVE data. Thus the information associated with each MassIVE dataset is constantly changing. Users can subscribe to a dataset to stay aware of changes in what is known about it. A subscription signs the user up to receive continuous identification digest emails about changes in identifications on that dataset. To subscribe or unsubscribe, users can toggle the Subscribe/Unsubscribe button.
With GNPS continually computing and identifying new spectra in MassIVE datasets, these results need to be presented to users in an easily viewable format. For each dataset, all previously run continuous identification jobs are listed, and users can view the results, including at a glance how many identifications were made on a specific day.
Users can click the “View” link for a continuous ID job on a dataset to reach a status page. From there, users can click the “All Identifications (Beta)” link to browse all identifications. Some new features there are still experimental but will soon be moved to the “All Identifications” link. The organization of the identification results is described in the Dereplication Documentation, as it is very similar to the results of the dereplication workflow. There is, however, one key feature that is present in continuous identification: identification ratings. Users can browse the results and rate the accuracy of each identification. The scale is as follows:
| Rating | Meaning |
|---|---|
| 4 stars | correct match as context is right (i.e., molecule is known/expected to be in the sample) |
| 3 stars | compound class match: at least part of the structure makes sense to match |
| 2 stars | cannot tell: might be correct from the spectrum match and context but there is not enough information to tell |
| 1 star | incorrect: molecule does not make sense in this context |
| No stars | No Rating |
Users can add their own rating, add a comment explaining that rating, and view the average rating of the identification.
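As an illustration of how an average rating on this scale could be computed, here is a minimal sketch. It is not the GNPS implementation. It assumes that "No stars" is recorded as 0 and is excluded from the average, which the documentation does not state. The function name is hypothetical.

```python
from typing import Iterable, Optional

# Star values taken from the rating scale above; 0 stands for "No Rating".
RATING_MEANINGS = {
    4: "correct match",
    3: "compound class match",
    2: "cannot tell",
    1: "incorrect",
}


def average_rating(ratings: Iterable[int]) -> Optional[float]:
    """Average the star ratings, skipping unrated (0) entries.

    Returns None when nobody has rated the identification.
    Assumption: unrated entries do not count toward the average.
    """
    rated = [r for r in ratings if r != 0]
    for r in rated:
        if r not in RATING_MEANINGS:
            raise ValueError(f"rating must be 0-4 stars, got {r!r}")
    if not rated:
        return None
    return sum(rated) / len(rated)
```

For example, `average_rating([4, 3, 0, 1])` averages only the three rated entries, and `average_rating([0, 0])` returns `None`.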
The example continuous identification email informs subscribers that the dataset of interest has more IDs in the most recent round of continuous identification. It first lists the title of the dataset, then the changes in identification counts, and finally direct links to explore the data. From the email, users can go directly to the search results to view the new, different, and deleted identifications, or go straight to the dataset page itself.
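The new, different, and deleted categories can be understood as a comparison between two rounds of identification. The sketch below shows one way such a comparison could work and does not describe the actual GNPS pipeline. It assumes each round is a mapping from a spectrum identifier to the identified compound name. "New" means the spectrum was identified only in the current round, "deleted" means only in the previous round, and "different" means the identified compound changed. The identifiers and compound names are hypothetical.

```python
from typing import Dict, List


def diff_identifications(
    previous: Dict[str, str], current: Dict[str, str]
) -> Dict[str, List[str]]:
    """Classify spectra by how their identification changed between rounds."""
    new = sorted(s for s in current if s not in previous)
    deleted = sorted(s for s in previous if s not in current)
    different = sorted(
        s for s in current if s in previous and current[s] != previous[s]
    )
    return {"new": new, "different": different, "deleted": deleted}


# Hypothetical rounds: spectrum identifier -> identified compound.
previous_round = {"scan1": "compound A", "scan2": "compound B", "scan3": "compound C"}
current_round = {"scan1": "compound A", "scan2": "compound X", "scan4": "compound D"}

changes = diff_identifications(previous_round, current_round)
counts = {kind: len(spectra) for kind, spectra in changes.items()}
```

Here `counts` holds the per-category totals, matching the kind of identification counts the digest email reports.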
Users can find MassIVE datasets related to the current one. Currently, the relatedness of two datasets is determined by the number of identified compounds they share.
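The shared-compound measure can be sketched in a few lines. This is an illustration of the idea and not the GNPS implementation. It assumes each dataset is represented by the set of compound names identified in it, and that datasets sharing no compounds are omitted from the results. The function name, dataset IDs, and compound names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def rank_related(
    target: str, identifications: Dict[str, Set[str]]
) -> List[Tuple[str, int]]:
    """Rank other datasets by the number of compounds shared with `target`."""
    target_compounds = identifications[target]
    shared = [
        (other, len(target_compounds & compounds))
        for other, compounds in identifications.items()
        if other != target
    ]
    # Keep only datasets with at least one shared compound,
    # most shared first, ties broken by dataset ID.
    related = [(other, n) for other, n in shared if n > 0]
    return sorted(related, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical datasets and their identified compounds.
identified = {
    "dataset_A": {"caffeine", "quinine", "nicotine"},
    "dataset_B": {"caffeine", "quinine"},
    "dataset_C": {"nicotine"},
    "dataset_D": {"penicillin"},
}

related = rank_related("dataset_A", identified)
```

A raw shared count favors datasets with many identifications. Normalizing by dataset size would be a different measure from the one described here.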
Please submit new MassIVE datasets here.