Don't Treat In-Situ and Tabular Data Like Gridded Data
(Or, why THREDDS doesn't handle in-situ or tabular data well and ERDDAP does.)

THIS WEB PAGE HAS BOB SIMONS' PERSONAL OPINIONS,
which do not necessarily reflect any position of the U.S. Government,
the National Oceanic and Atmospheric Administration, or the
Environmental Research Division.

[Created: 2013-06-26, Last edited: 2013-07-02]

THREDDS

THREDDS is a data server for multidimensional gridded data. Both THREDDS and ERDDAP support the part of the OPeNDAP DAP protocol that deals with serving gridded datasets (including the "projection" constraints, e.g.,
var[startIndex:stride:stopIndex][startIndex:stride:stopIndex]
for specifying subsets of a gridded dataset). This system works very well for the multidimensional gridded datasets (for example, satellite datasets and model data) for which it was designed.
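As a minimal sketch of how a projection constraint selects data, the following builds the constraint text and lists the index values it covers. It assumes the stop index is inclusive, which matches the [0:10:30] example later on this page. The variable name "sst" and both function names are hypothetical.

```python
def projection_indices(start, stride, stop):
    """Index values selected by [start:stride:stop] (stop is inclusive)."""
    return list(range(start, stop + 1, stride))


def projection_constraint(var, *dims):
    """Build var[start:stride:stop]... with one bracket group per dimension."""
    return var + "".join(f"[{a}:{s}:{b}]" for a, s, b in dims)


print(projection_indices(0, 10, 30))  # [0, 10, 20, 30]
# "sst" is a hypothetical two-dimensional variable name.
print(projection_constraint("sst", (0, 1, 9), (0, 10, 30)))  # sst[0:1:9][0:10:30]
```

Every selected index is determined by three numbers per dimension, which is why the scheme suits evenly spaced subsets of a grid.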

THREDDS and In-Situ Datasets

Since THREDDS has been so successful with gridded datasets and is already installed at many sites, some people would like to try to use it to serve in-situ data. This has been aided by relatively recent changes to the CF metadata standard that specify approved ways of storing in-situ datasets as multidimensional arrays. If in-situ datasets are stored in files conforming to this standard, THREDDS supports two ways to request a subset of these datasets:
  1. Make a standard OPeNDAP gridded data ("projection") request.
    Technically, this works. A user can make a request for a subset of a dataset. THREDDS will return the requested subset.

    In practice, this is very cumbersome and often impractical, because getting a typical subset of the data this way takes a series of requests:

    1. Since the CF standard defines ~5 different, valid, data structures for each feature type, you have to request and download the dataset's Dataset Descriptor Structure (.dds) and figure out which data structure is being used (i.e., how the data is stored).
    2. Then, you have to request and download all the outer/feature data (information about each feature, e.g., each trajectory) to see what is available.
    3. Then, you have to figure out which features you want, for example, by looking for features where certain conditions are true (e.g., where owner=NDBC). This is cumbersome. If the dataset is large (e.g., a million features), this becomes impractical. And for many common requests (e.g., to find data within a latitude/longitude bounding box), the information to identify the desired features is not in the outer/feature data, so you must download the entire dataset and subset it yourself (if the client software you are using supports a way to do that). For all datasets, downloading the entire dataset is an unfortunate waste, since you may want just a small subset of the data. For huge datasets, this is impractical because downloading the entire dataset would take a very long time. And isn't this type of subsetting what we want the server to support?
    4. Then you have to send a separate request for each feature (e.g., trajectory) that you want. They are separate requests because the desired features will almost always be scattered throughout the dataset: feature #1, #17, #22, #38, #122, ... The OPeNDAP projection constraints which work so well for getting evenly spaced subsets of gridded data (e.g., [0:10:30] to get #0, #10, #20, and #30) offer no way to request data for a scattered set of index values. And the problem is made harder by the several different possible data structures that CF allows for in-situ data and by the need to use index numbers in the requests, not the units and terminology of the data. If the number of desired features is large, this becomes difficult or impractical.
       
  2. Make a request to THREDDS' new, experimental, non-standard CdmrFeature Protocol.

    This approach is designed for and works well for two common types of subset requests:

    1. Requests for specific features (which assumes the desired featureID's are already known),
    2. Requests for data within a specified latitude, longitude, and/or time bounding box.

    Unfortunately, this approach offers no help for any other type of subset request. Notably, you can't query based on other attributes, e.g., the owner of the device (e.g., owner=NDBC). To do that, you would have to download and look at the outer table's data and then figure out which features you want, and then generate and send the request for the desired features. There is no way to form one simple request (e.g., with WHERE owner=NDBC) and then get the response with all matching data. If there are a large number of features in the dataset, just finding the featureID's of the desired features becomes very difficult or impractical.

    Another problem with this approach is: it isn't based on a standard. It is specific to THREDDS. There is no client software (other than from Unidata) that supports generating these requests and dealing with the response data.
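To illustrate the scattered-feature problem from step 4 of the first approach, here is a minimal sketch that turns a list of wanted feature index numbers into projection constraints. It assumes the simplest case, where the feature is the outer dimension of the variable, and it merges only runs of consecutive index numbers. The variable name "temperature" and the function names are hypothetical.

```python
def index_runs(indices):
    """Group feature index numbers into contiguous (start, stop) runs."""
    runs = []
    for i in sorted(set(indices)):
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)  # extend the current run
        else:
            runs.append((i, i))  # start a new run
    return runs


def projection_requests(var, indices):
    """One projection constraint per contiguous run of feature indices."""
    return [f"{var}[{a}:1:{b}]" for a, b in index_runs(indices)]


# Neighbouring features collapse into a single request...
print(projection_requests("temperature", [3, 4, 5, 17]))
# ['temperature[3:1:5]', 'temperature[17:1:17]']

# ...but scattered features need one request each.
print(len(projection_requests("temperature", [1, 17, 22, 38, 122])))  # 5
```

With scattered features, the number of requests grows with the number of features wanted, which is the impractical case described above.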

THREDDS and Tabular Data

There is an incredibly large amount of non-in-situ, non-geospatial, tabular data (database-like tables) in the scientific world and in the larger world. Most of it resides in relational databases. THREDDS has no special provisions for handling tabular data. Yes, you can store a table of data as gridded data by creating a set of variables which each have the same, single dimension (e.g., "time" or "row"). But the only way of requesting a subset of the dataset is via a projection constraint (e.g., var[startIndex:stride:stopIndex]). This approach doesn't express the subset in the domain's terms (e.g., owner=NDBC), which is how the user is thinking. And there is no system for requesting a set of rows scattered throughout the table. It is totally inappropriate for tabular data. Users attempting this will find it extremely cumbersome and impractical.
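The contrast between the two ways of asking can be sketched with a small in-memory table. This is only an illustration: the table name "obs", its columns, and its values are made up, and Python's built-in sqlite3 module stands in for a relational database.

```python
import sqlite3

# Hypothetical table: (owner, temperature), one tuple per row.
rows = [
    ("NDBC", 10.5), ("OTHER", 11.0), ("NDBC", 9.8),
    ("OTHER", 12.1), ("OTHER", 11.7), ("NDBC", 10.1),
]

# Request in the domain's terms: one query, no row numbers needed.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (owner TEXT, temperature REAL)")
con.executemany("INSERT INTO obs VALUES (?, ?)", rows)
by_query = [t for (t,) in con.execute(
    "SELECT temperature FROM obs WHERE owner = 'NDBC' ORDER BY rowid")]

# Index-based request: the client first needs the whole owner column,
# then must find the matching row numbers and ask for each row by index.
owner_column = [owner for owner, _ in rows]
matching = [i for i, o in enumerate(owner_column) if o == "NDBC"]
constraints = [f"temperature[{i}:1:{i}]" for i in matching]
by_index = [rows[i][1] for i in matching]

print(matching)              # [0, 2, 5]
print(by_query == by_index)  # True
```

Both routes return the same values, but the index-based route needs the full owner column and three separate constraints to get them.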

 


ERDDAP

ERDDAP is a data server that treats gridded data and tabular data differently.

ERDDAP's system for in-situ data is simple and flexible.

ERDDAP offers a simple and flexible way to work with in-situ data (and with tabular data in general) by using a less-used part of the OPeNDAP standard: sequences. In addition to its support for gridded data, OPeNDAP supports sequence (tabular) data and a different system (very much like SQL) for making OPeNDAP sequence data (selection) requests. For example, a SQL query like:
SELECT featureID, longitude, latitude, time FROM tableName WHERE owner=NDBC AND longitude>20 AND longitude<40 ...
becomes an ERDDAP RESTful OPeNDAP sequence query URL:
serverURL/erddap/tabledap/tableName?featureID,longitude,latitude,time&owner=NDBC&longitude>20&longitude<40 ...
(You can see why OPeNDAP calls these selection constraints: they closely match SQL SELECT statements.)
This approach works well because:
  • It lets the user specify what s/he wants in one simple query.
  • A query is easy to generate (especially with a web form).
  • The query is phrased in the domain's terms (e.g., owner=NDBC, not indices).
  • This system supports more flexible queries: any variable can be queried, not just featureID, time, latitude, longitude.
  • It is a more general solution because it supports all tabular (database-like) data, even non-geographic data, not just CF CDM in-situ data.
  • It uses the widely liked and used OPeNDAP standard. It is unfortunate that THREDDS just supports OPeNDAP gridded data and grid ("projection") queries, and doesn't support OPeNDAP sequence data and sequence/selection queries.
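The translation from a SQL-like request to a selection query URL described above can be sketched as a small helper. This is a hedged illustration, not ERDDAP code: the function name and its parameters are hypothetical, and "serverURL" and "tableName" are the placeholders from the example above. A real server may require more (for example, a response file type suffix or quoting of string values); those details are not covered by the text above and are omitted here.

```python
from urllib.parse import quote


def build_selection_url(base_url, table_name, variables, constraints):
    """Build a selection query URL in the shape shown in the example above.

    variables   -- names of the variables to return (the SQL SELECT list)
    constraints -- (variable, operator, value) tuples (the SQL WHERE clause)
    """
    parts = [",".join(variables)]
    for name, op, value in constraints:
        # Percent-encode characters such as < and > so they survive in a URL.
        parts.append(f"{name}{quote(op, safe='=')}{quote(str(value), safe='')}")
    return f"{base_url}/erddap/tabledap/{table_name}?" + "&".join(parts)


url = build_selection_url(
    "serverURL",
    "tableName",
    ["featureID", "longitude", "latitude", "time"],
    [("owner", "=", "NDBC"), ("longitude", ">", 20), ("longitude", "<", 40)],
)
print(url)
# serverURL/erddap/tabledap/tableName?featureID,longitude,latitude,time&owner=NDBC&longitude%3E20&longitude%3C40
```

Because any variable can appear in a constraint tuple, the same helper covers queries on owner, on longitude, or on any other column.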

ERDDAP uses the same simple and flexible system for tabular data.

ERDDAP handles tabular data the same way that relational database programs handle tabular data.
  • There is an incredibly large amount of non-in-situ, non-geospatial, tabular data (database-like tables) in the scientific world and in the larger world. Most of it resides in relational databases.
  • We would be crazy to ignore the incredible success of the world of relational databases and the incredible amount of data that fits well in a database-like table.
  • It is important that there be an easy way (with reusable software) to make tabular data safely and easily accessible to users on the web.
  • It is essential that users be able to specify, in a way that is appropriate for tabular data, the subset of the data they would like to download. The query must be simple to create and be in the domain's terminology. The OPeNDAP sequence data structure and OPeNDAP selection constraints are great for this because they directly parallel SQL queries, the universally accepted method for querying data in relational databases. These queries are already familiar to many, many people. And for people who aren't familiar, they are easy to learn.
  • As shown above, handling tabular data like gridded data doesn't work well. What is true for in-situ data is also true for tabular data in general. Users must be able to constrain any variable in the dataset, not just a few predefined concepts (e.g., featureID, time, latitude, longitude).
  • There are other systems for making tabular data available via the web. Although no system seems to be widely used, perhaps the two best systems are Google Query Language and Yahoo! Query Language, which are very similar and are both basically direct conversions of SQL to a RESTful format. Unfortunately, as with relational databases, they do not support global or variable (field) metadata.

    Metadata is so important in science (remember the Mars Climate Orbiter) that we should not accept a solution that doesn't support metadata. ERDDAP supports the CF metadata model, with key-value pairs for global and variable (field) metadata. The lack of support for metadata in Google Query Language and Yahoo! Query Language makes them unsuitable for scientific work.
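A minimal sketch of what key-value metadata at the global and variable level makes possible: the dataset below is entirely made up, and the dictionary layout and function name are hypothetical. The attribute names "units" and "standard_name" follow common CF usage.

```python
# Hypothetical dataset: global attributes plus per-variable attributes.
dataset = {
    "global_attributes": {"title": "Example buoy data", "institution": "NDBC"},
    "variables": {
        "temperature": {
            "attributes": {
                "units": "degree_C",
                "standard_name": "sea_water_temperature",
            },
            "values": [10.5, 9.8],
        },
    },
}


def read_values(ds, variable, expected_units):
    """Return a variable's values only if its units are what the caller expects."""
    actual = ds["variables"][variable]["attributes"].get("units")
    if actual != expected_units:
        raise ValueError(
            f"{variable}: expected units {expected_units!r}, found {actual!r}")
    return ds["variables"][variable]["values"]


print(read_values(dataset, "temperature", "degree_C"))  # [10.5, 9.8]
```

Without the units attribute, a client has no way to make this check and can only guess what the numbers mean.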

 


Contact

Remember: THIS WEB PAGE HAS BOB SIMONS' PERSONAL OPINIONS,
which do not necessarily reflect any position of the U.S. Government,
the National Oceanic and Atmospheric Administration, or the
Environmental Research Division.

Questions, comments, suggestions? Please send an email to bob dot simons at noaa dot gov .
If you think this opinion is incorrect, please send me an email and tell me why.