Frequently Asked Questions
The EMERGE database is a Neo4j-powered database designed to store nearly all the data generated by the EMERGE project. This includes text-based data, such as temperature, pH, or other measurements, and it covers both temporal and spatial data. If data cannot be stored natively in the graph database itself, links referring to the data files can be stored instead.
First things first: the "database" encompasses two components: (1) the Neo4j graph database, which stores text-based information and serves as the basic storage structure, and (2) the web server, which is built using a number of different tools and provides front-end access to the data contained within and/or linked from the Neo4j database. These two components come together to form the EMERGE-DB.
Data are imported into the Neo4j graph database using a set of open-source Python scripts, which are available on our Bitbucket page.
According to this "recent" article, the amount of data a Neo4j graph database can hold is effectively unlimited. Actual capacity is therefore limited by hardware, meaning whatever resources we can purchase. We currently have three virtual machines (VMs), each with hundreds of GB of RAM and terabytes of disk space. Each VM serves a different EMERGE-DB component, but they will eventually be combined for improved access speeds and capacity. Data is duplicated at various data processing stages, from the raw data that can be downloaded from the Data Downloads page to the data uploaded to the Neo4j database.
Nearly any type of text-like data can be stored natively in the database. Images and other non-text data can be stored as links to the actual data locations on the web server, so users can still access/download the data and see it in the context of other data. This includes coring information, omics data (meta -genomics, -transcriptomics, -proteomics), terrestrial geochemistry, vegetation "ground cover," and satellite and drone imagery (both stored as links).
This data can be temporal, at sampling rates anywhere from 1 sample/second to 10,000 samples/second. The effective limitation is filtering/aggregating the data so that it is human-comprehensible. For example, let's assume that autochambers take 1 measurement every minute for 5 years straight; that's ~2.628 million measurements. When you query the database, you don't necessarily want to grab all 2+ million nodes, so instead [depending on the observed sample frequencies] the minute measurements are combined into hours or days, reducing the number of nodes to 43,800 (hours) or 1,825 (days). Users can then sort through and filter much larger spans of data more quickly, while still being able to retrieve all the underlying data associated with each aggregated node.
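To make the aggregation idea concrete, here is a minimal sketch in plain Python. It is not the project's import code: the function name `aggregate`, the `(timestamp, value)` sample layout, and the choice of the mean as the summary statistic are all assumptions made for illustration. Each aggregate keeps a list of its member samples, mirroring the idea that the original measurements remain retrievable.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Sanity check of the arithmetic above: one sample per minute for 5 years.
MINUTES_IN_5_YEARS = 5 * 365 * 24 * 60   # 2,628,000
HOURS_IN_5_YEARS = 5 * 365 * 24          # 43,800
DAYS_IN_5_YEARS = 5 * 365                # 1,825


def aggregate(samples, level="hour"):
    """Group (timestamp, value) samples into hourly or daily bins.

    Returns {bin_start: {"mean", "count", "members"}} so that every
    aggregate still points back at the raw samples it summarizes.
    """
    if level not in ("hour", "day"):
        raise ValueError("level must be 'hour' or 'day'")
    bins = defaultdict(list)
    for ts, value in samples:
        if level == "hour":
            key = ts.replace(minute=0, second=0, microsecond=0)
        else:
            key = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        bins[key].append((ts, value))
    return {
        key: {
            "mean": mean(v for _, v in members),
            "count": len(members),
            "members": members,
        }
        for key, members in bins.items()
    }


# Synthetic data: two days of one-per-minute readings (values are made up).
start = datetime(2020, 1, 1)
samples = [(start + timedelta(minutes=i), float(i % 60)) for i in range(2 * 24 * 60)]
hourly = aggregate(samples, "hour")
daily = aggregate(samples, "day")
print(len(samples), len(hourly), len(daily))
```

In a graph database the `members` list would presumably be modeled as relationships from an aggregate node to its raw measurement nodes, but that mapping is an assumption and not something this page specifies.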
For those who aren't aware, site synonyms are "unique" terms that each lab uses that all mean the same thing. One great example is the use of "Sphagnum" and "Bog." One describes the dominant vegetation, the other the habitat classification. Over the years some labs have switched terms while others have been consistent, but one thing remains the same: there's no universal naming scheme for each "plot of land."
To address this issue, a unified naming scheme was developed with input from members of multiple labs. Whenever data is imported into the database, the site names or their acronyms/abbreviations are passed through a naming dictionary that matches them against all known variations. The dictionary also contains habitat types, vegetation, and other abbreviations, which are then associated with the imported data. This allows the original source providers to search for their own data using their own nomenclature, while simultaneously allowing other members to use "their" naming scheme to find the same samples.
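The naming-dictionary lookup can be sketched roughly as follows. This is an illustration only: the dictionary layout, the function names `build_lookup` and `resolve_site`, and the abbreviation "Sph" are hypothetical, and only the "Sphagnum"/"Bog" pairing comes from the text above.

```python
# Hypothetical naming dictionary: one canonical entry per site, with every
# known lab-specific variant listed as an alias.
NAMING_DICTIONARY = {
    "Bog": {
        "habitat": "Bog",
        "vegetation": "Sphagnum",
        "aliases": ["Bog", "Sphagnum", "Sph"],  # "Sph" is a made-up abbreviation
    },
}


def build_lookup(naming_dictionary):
    """Flatten the dictionary into {normalized alias: canonical site name}."""
    lookup = {}
    for canonical, entry in naming_dictionary.items():
        for alias in entry["aliases"]:
            lookup[alias.strip().lower()] = canonical
    return lookup


def resolve_site(raw_name, lookup, naming_dictionary):
    """Map a lab-specific site name to its canonical record, or None if unknown.

    The original name is kept alongside the canonical one so that data can
    still be found under the provider's own nomenclature.
    """
    canonical = lookup.get(raw_name.strip().lower())
    if canonical is None:
        return None
    entry = naming_dictionary[canonical]
    return {
        "site": canonical,
        "habitat": entry["habitat"],
        "vegetation": entry["vegetation"],
        "source_name": raw_name,
    }


lookup = build_lookup(NAMING_DICTIONARY)
print(resolve_site("  sphagnum ", lookup, NAMING_DICTIONARY))
print(resolve_site("Unknown plot", lookup, NAMING_DICTIONARY))
```

Returning `None` for unmatched names, rather than guessing, leaves room for an import step to flag unknown variants for review; whether the real import scripts behave this way is not stated here.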
The EMERGE-DB is low-maintenance (see other question) with full functionality intended to survive through the end of the project, and dataset access in perpetuity. As such, references to the website will be maintained, including links from publications to datasets hosted on the legacy IsoGenieDB website. In addition, we are exploring the possibility of auto-creation of DOIs (alongside a DOI-generating service) for published data, with each DOI linked to a collection of datapoints in the database.
The differences between the private and public sides of the site are significant, though superficially the two look similar. On the legacy IsoGenieDB website (and to some degree on this EMERGE-DB website), the website pages are separate (see graphic to the right), so there are no pages that link between the public and private sides (though technically speaking the two websites are running on the same hardware). This requires twice the coding, but it effectively isolates the two websites. For the EMERGE-DB, we are working on integrating the public and private sides of the website into a single framework that is easier to update, but still keeps private data isolated from the public side of the website.
The major difference between the sites is the data content: only data designated for public release appears on the publications, data downloads, and other pages of the public site. In the Neo4j graph database, only nodes with the "Public" label are accessible from the public querying page.
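One way such a label restriction could be enforced is to have the public page build every query through a helper that always includes the "Public" label. The sketch below is an assumption about how that might look, not the site's actual code: the function `public_query`, the `Core` label, and the `site` property are hypothetical. Because Cypher parameters cannot stand in for labels or property names, those are validated against a strict pattern, while values are passed as `$parameters`.

```python
import re

# Labels and property names are interpolated into the query text, so they
# are restricted to plain identifiers to avoid injection.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def public_query(label, filters):
    """Build a Cypher query that can only match nodes labeled "Public".

    Returns (query_text, parameters). Values travel as parameters; only
    validated identifiers are placed in the query text itself.
    """
    if not _IDENTIFIER.match(label):
        raise ValueError(f"invalid label: {label!r}")
    for key in filters:
        if not _IDENTIFIER.match(key):
            raise ValueError(f"invalid property name: {key!r}")
    query = f"MATCH (n:Public:{label})"
    conditions = " AND ".join(f"n.{key} = ${key}" for key in sorted(filters))
    if conditions:
        query += f" WHERE {conditions}"
    query += " RETURN n"
    return query, dict(filters)


# "Core" and "site" are illustrative names, not the real schema.
query, params = public_query("Core", {"site": "Bog"})
print(query)
print(params)
```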
While we would like to eventually expand the database to include non-EMERGE project data, we're instead focused on creating a fully comprehensive database for all our members and aren't currently accepting non-EMERGE data.
Querying generally means using a query language to directly query the underlying Neo4j graph database. As one can imagine, sophisticated querying involves not only running a query, but also parsing the initial results and/or retrieving additional information based on them. Instead of forcing users to learn a new query language, in which even the most advanced syntax can only return "limited" amounts of data, we've taken a hybrid approach. Some data is pre-returned to the website for quick filtering. At other times, Python is used in the background to fetch results, translating plain-language search terms into the query language used by the Neo4j database. This is especially useful during iterative querying, where prior results affect future queries. For simple queries, i.e. those that involve 1 or 2 data types, nearly all data can be filtered quickly via the querying interface.
The major limitation of this method is that there's no easy way of foreseeing what data users will frequently request, or how data should be returned so that it is most useful for the end user. If you find yourself wanting to run the same query repeatedly while varying only a few search parameters, that can easily be automated. If you want to query/filter data based on numerous parameters, iteratively retrieve data, get subsets, or run GPS localization and other geospatial analyses, you'll want to check out the "network analytic queries" for more info.
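As a rough sketch of what automating a repeated query could look like, the example below runs one parameterized query template over every combination of a few search parameters. Everything named here is an assumption for illustration: the query text, the `Sample` label, the `site` and `year` properties, and the helper names. The database call is injected as a plain function so the sketch runs without a database; in practice it would wrap whatever driver call is actually in use.

```python
from itertools import product

# Hypothetical query template; label and property names are illustrative only.
QUERY = (
    "MATCH (s:Sample) "
    "WHERE s.site = $site AND s.year = $year "
    "RETURN s"
)


def parameter_sets(**options):
    """Yield one parameter dict per combination of the supplied option lists."""
    names = list(options)
    for combo in product(*(options[name] for name in names)):
        yield dict(zip(names, combo))


def run_all(run_query, query, **options):
    """Run `query` once per parameter combination and collect the results."""
    return [(params, run_query(query, params)) for params in parameter_sets(**options)]


def fake_run_query(query, params):
    # Stand-in for a real database call; it only echoes the parameters.
    return f"{params['site']}-{params['year']}"


# Illustrative values; substitute real site names and years.
results = run_all(fake_run_query, QUERY, site=["Bog", "Fen"], year=[2019, 2020])
for params, rows in results:
    print(params, rows)
```

Keeping the query text fixed and varying only the parameters also means the same template can be reviewed once and reused safely.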
The EMERGE-DB is spatially-aware in the sense that it contains standardized GPS coordinate information that can be acted upon by other tools/software. This can be seen through the map interface, where GPS information is pulled from the database alongside other site/core-specific information and rendered on a coordinate system.
In a way, this means that querying based on GPS information (through the "querying" page) is limited to text-based matches. The map interface is another matter. Since GPS information can be retrieved, the sophistication of the map queries is limited only by current coding skills and/or the available "plugins" designed to work with the Google-based mapping software that powers the map interface. For example, overlaying images is simply a plugin that can be installed into the map, as are distance calculations between points on the map and drawing a shape to select all the points within it. Even adding "walking directions" is a plugin plus a little coding to get it working with our data.
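For readers who want to work with the coordinates outside the map, here is a minimal sketch of the kind of distance calculation and spatial selection mentioned above. It is not how the map plugins work internally: it uses the standard haversine great-circle formula, and the point names, coordinates, and the `within_radius` helper are made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # min() guards against tiny floating-point overshoot above 1.0
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def within_radius(points, center, radius_km):
    """Return the points lying within radius_km of center, a (lat, lon) pair."""
    lat0, lon0 = center
    return [
        p for p in points
        if haversine_km(lat0, lon0, p["lat"], p["lon"]) <= radius_km
    ]


# Made-up coordinates, not real sampling locations.
points = [
    {"name": "core_A", "lat": 10.000, "lon": 20.000},
    {"name": "core_B", "lat": 10.001, "lon": 20.001},
    {"name": "core_C", "lat": 10.100, "lon": 20.100},
]
nearby = within_radius(points, (10.000, 20.000), 0.5)
print([p["name"] for p in nearby])
```

A circular selection is the simplest case; selecting points inside an arbitrary drawn shape would need a point-in-polygon test instead.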
The ultimate goal is a fully embellished, feature-rich map interface that combines filtering of site characteristics and summarizing data based on any geospatial selections.