NOAA ERDDAP
Easier access to scientific data
   
Brought to you by NOAA NMFS SWFSC ERD    
 

Working with the datasets.xml File

[This web page will only be of interest to ERDDAP administrators.]

After you have followed the ERDDAP installation instructions, you must edit the datasets.xml file in tomcat/content/erddap/ to describe the datasets that your ERDDAP installation will serve.


Introduction

Some Assembly Required
Setting up a dataset in ERDDAP isn't just a matter of pointing to the dataset's directory or URL. You have to write a chunk of XML for datasets.xml which describes the dataset.

If you buy into these ideas and expend the effort to create the XML for datasets.xml, you get all the advantages of ERDDAP.

Making the datasets.xml takes considerable effort for the first few datasets, but it gets easier. After the first dataset, you can often re-use a lot of your work for the next dataset. Fortunately, there are two Tools to help you create the XML for each dataset in datasets.xml.
If you get stuck, please send an email with the details to bob dot simons at noaa dot gov.
Or, you can join the ERDDAP Google Group / Mailing List and post your question there.

Data Provider Form
When a data provider comes to you hoping to add some data to your ERDDAP, it can be difficult and time consuming to collect all of the metadata (information about the dataset) needed to add the dataset into ERDDAP. Many data sources (for example, .csv files, Excel files, databases) have no internal metadata. So ERDDAP has a Data Provider Form which gathers metadata from the data provider and gives the data provider some other guidance, including extensive guidance for Data In Databases. The information submitted is converted into the datasets.xml format and then emailed to the ERDDAP administrator (you) and written (appended) to bigParentDirectory/logs/dataProviderForm.log . Thus, the form semi-automates the process of getting a dataset into ERDDAP, but the ERDDAP administrator still has to complete the datasets.xml chunk and deal with getting the data file(s) from the provider or connecting to the database.

The submission of actual data files from external sources is a huge security risk, so ERDDAP does not deal with that. You have to figure out a solution that works for you and the data provider, for example, email (for small files), pull from the cloud (for example, DropBox or Google Drive), an sftp site (with passwords), or sneakerNet (a USB thumb drive or external hard drive). You should probably only accept files from people you know. You will need to scan the files for viruses and take other security precautions.

There isn't a link in ERDDAP to the Data Provider Form (for example, on the ERDDAP home page). Instead, when someone tells you they want to have their data served by your ERDDAP, you can send them an email saying something like:
Yes, we can get your data into ERDDAP. To get started, please fill out the form at http://yourUrl/erddap/dataProviderForm.html .
After you finish, I'll contact you to work out the final details.

If you just want to look at the form (without filling it out), you can see the form on ERD's ERDDAP: Introduction, Part 1, Part 2, Part 3, and Part 4. These links on the ERD ERDDAP send information to me, not you, so don't submit information with them unless you actually want to add data to the ERD ERDDAP.

If you want to remove the Data Provider Form from your ERDDAP, put
<dataProviderFormActive>false</dataProviderFormActive>
in your setup.xml file.

The impetus for this was NOAA's 2014 Public Access to Research Results (PARR) directive, which requires that all NOAA environmental data funded through taxpayer dollars be made available via a data service (not just files) within 12 months of creation. So there is increased interest in using ERDDAP to make datasets available via a service ASAP. We needed a more efficient way to deal with a large number of data providers.

Feedback/Suggestions? This form is new, so please email bob dot simons at noaa dot gov if you have any feedback or suggestions for improving this.

Tools
There are two command line programs which are tools to help you create the XML for each dataset that you want your ERDDAP to serve. Once you have set up ERDDAP and run it (at least one time), you can find and use these programs in the tomcat/webapps/erddap/WEB-INF directory. There are Linux/Unix shell scripts (with the extension .sh) and Windows scripts (with the extension .bat) for each program. [On Linux, run these tools as the same user (tomcat?) that will run Tomcat.] When you run each program, it will ask you questions. For each question, type a response, then press Enter. Or press ^C to exit a program at any time.

Program won't run?

The tools print various diagnostic messages:

The two tools are a big help, but you still must read all of these instructions on this page carefully and make important decisions yourself.

The basic structure of the datasets.xml file is:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<erddapDatasets>
  <convertToPublicSourceUrl /> <!-- 0 or more -->
  <requestBlacklist>...</requestBlacklist> <!-- 0 or 1 -->
  <subscriptionEmailBlacklist>...</subscriptionEmailBlacklist> <!-- 0 or 1 -->
  <user username="..." password="..." roles="..." /> <!-- 0 or more -->
  <dataset>...</dataset> <!-- 1 or more -->
</erddapDatasets>
It is possible that other encodings will be allowed in the future, but for now, only ISO-8859-1 is recommended.
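
As a rough illustration of the structure above, the following Python sketch parses a small datasets.xml-like document with the standard library and summarizes its top-level elements. The sample content, the type and datasetID attribute values, and the function name summarize are assumptions made for this example; they are not taken from a real installation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, minimal datasets.xml content following the basic structure above.
# The dataset type and datasetID values are placeholders for illustration only.
SAMPLE = """<?xml version="1.0" encoding="ISO-8859-1" ?>
<erddapDatasets>
  <requestBlacklist>...</requestBlacklist>
  <dataset type="EDDGridFromNcFiles" datasetID="exampleGrid">...</dataset>
  <dataset type="EDDTableFromAsciiFiles" datasetID="exampleTable">...</dataset>
</erddapDatasets>
"""

def summarize(xml_bytes):
    """Return (tag counts, list of (type, datasetID)) for the top-level elements."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "erddapDatasets":
        raise ValueError("root element must be <erddapDatasets>")
    counts = Counter(child.tag for child in root)
    datasets = [(d.get("type"), d.get("datasetID")) for d in root.findall("dataset")]
    return counts, datasets

# Encode to bytes so the parser honors the declared ISO-8859-1 encoding.
counts, datasets = summarize(SAMPLE.encode("iso-8859-1"))
print(counts["dataset"], datasets)
```

A check like this only confirms that the file is well-formed XML with the expected root element. It does not validate the content of any dataset definition.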
 

Notes

Working with the datasets.xml file is a non-trivial project. Please read all of these notes carefully. After you pick a dataset type, please read the detailed description of it carefully.
 

List of Dataset Types

If you need help choosing the right dataset type, see Choosing the Dataset Type.

The types of datasets fall into two categories. (Why?)


 

Detailed Descriptions of Dataset Types

EDDGridFromDap handles grid variables from DAP servers.

EDDGridFromEDDTable lets you convert an EDDTable tabular dataset into an EDDGrid gridded dataset. Remember that ERDDAP treats datasets as either gridded datasets (subclasses of EDDGrid) or tabular datasets (subclasses of EDDTable).

EDDGridFromErddap handles gridded data from a remote ERDDAP server.
EDDTableFromErddap handles tabular data from a remote ERDDAP server.

EDDGridFromEtopo just serves the ETOPO1 Global 1-Minute Gridded Elevation Data Set (Ice Surface, grid registered, binary, 2byte int: etopo1_ice_g_i2.zip) which is distributed with ERDDAP.

EDDGridFromFiles is the superclass of all EDDGridFrom...Files classes. You can't use EDDGridFromFiles directly. Instead, use a subclass of EDDGridFromFiles to handle the specific file type:

Currently, no other file types are supported. But it is usually relatively easy to add support for other file types. Contact us if you have a request. Or, if your data is in an old file format that you would like to move away from, we recommend converting the files to be NetCDF v3 .nc files. NetCDF is a widely supported, binary format, allows fast random access to the data, and is already supported by ERDDAP.

Details - The following information applies to all of the subclasses of EDDGridFromFiles.

EDDGridFromAudioFiles and EDDTableFromAudioFiles aggregate data from a collection of local audio files. (These first appeared in ERDDAP v1.82.) The difference is that EDDGridFromAudioFiles treats the data as a multidimensional dataset (usually with 2 dimensions: [file startTime] and [elapsedTime within a file]), whereas EDDTableFromAudioFiles treats the data as tabular data (usually with columns for the file startTime, the elapsedTime within the file, and the data from the audio channels). EDDGridFromAudioFiles requires that all files have the same number of samples, so if that is not true, you must use EDDTableFromAudioFiles. Otherwise, the choice of which EDD type to use is entirely yours. One advantage of EDDTableFromAudioFiles: you can add other variables with other information, e.g., stationID, stationType. In both cases, the lack of a unified time variable makes it more difficult to work with the data from these EDD types, but there was no good way to set up a unified time variable.

See these classes' superclasses, EDDGridFromFiles and EDDTableFromFiles, for general information on how these classes work and how to use them.

We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. Since audio files have no metadata other than information related to the encoding of the sound data, you will have to edit the output from GenerateDatasetsXml to provide essential information (e.g., title, summary, creator_name, institution, history).

Details:

EDDGridFromMergeIRFiles aggregates data from local, MergeIR files, which are from the Tropical Rainfall Measuring Mission (TRMM), which is a joint mission between NASA and the Japan Aerospace Exploration Agency (JAXA). MergeIR files can be downloaded from NASA.

EDDGridFromMergeIRFiles.java was written and contributed to the ERDDAP project by Jonathan Lafite and Philippe Makowski of R.Tech Engineering (license: copyrighted open source).

EDDGridFromMergeIRFiles is a little unusual:

See this class' superclass, EDDGridFromFiles, for general information on how this class works and how to use this class.

We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it.
 

EDDGridFromNcFiles aggregates data from local, gridded, GRIB .grb and .grb2 files, HDF (v4 or v5) .hdf files, .ncml files, and NetCDF (v3 or v4) .nc files. This may work with other file types (for example, BUFR), but we haven't tested it. Please send us some sample files.

EDDGridFromNcFilesUnpacked is a variant of EDDGridFromNcFiles which aggregates data from local, gridded NetCDF (v3 or v4) .nc and related files. The difference is that this class unpacks each data file before EDDGridFromFiles looks at the files:

The big advantage of this class is that it provides a way to deal with different values of scale_factor, add_offset, _FillValue, missing_value, or time units in different files in a collection. Otherwise, you would have to use a tool like NcML or NCO to modify each file to remove the differences so that the files could be handled by EDDGridFromNcFiles. For this class to work properly, the files must follow the CF standards for the related attributes.
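
To make the idea of unpacking concrete, here is a minimal Python sketch of the CF packing relationship (unpacked = packed * scale_factor + add_offset, with fill values treated as missing). It is an illustration of the arithmetic only, not ERDDAP's implementation; the function name unpack and all sample values are assumptions for this example.

```python
import math

def unpack(packed_values, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Apply CF-style unpacking: unpacked = packed * scale_factor + add_offset.

    Values equal to fill_value become NaN.
    """
    result = []
    for v in packed_values:
        if fill_value is not None and v == fill_value:
            result.append(math.nan)
        else:
            result.append(v * scale_factor + add_offset)
    return result

# Two hypothetical files that pack the same temperatures differently.
file_a = unpack([100, 250, -9999], scale_factor=0.1, add_offset=0.0, fill_value=-9999)
file_b = unpack([0, 1500, -1], scale_factor=0.01, add_offset=10.0, fill_value=-1)
print(file_a, file_b)
```

Once both files are unpacked, their values are directly comparable even though the packed integers and attributes differ, which is why unpacking first lets such files be aggregated.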

EDDGridLonPM180 modifies the longitude values of a child (enclosed) EDDGrid dataset that has some longitude values greater than 180 (for example, 0 to 360) so that they are in the range -180 to 180 (Longitude Plus or Minus 180, hence the name).
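
The coordinate mapping described here can be sketched in a few lines of Python. This shows only how the longitude values themselves change; a real dataset must also reorder the data along the longitude axis to match, and the function name to_pm180 is an assumption for this example.

```python
def to_pm180(longitudes):
    """Map longitudes in 0..360 to -180..180 and return them in ascending order.

    Values greater than 180 are shifted by -360; values of 180 or less are unchanged.
    """
    shifted = [lon - 360 if lon > 180 else lon for lon in longitudes]
    return sorted(shifted)

print(to_pm180([0, 90, 180, 270, 359]))
```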

EDDGridSideBySide aggregates two or more EDDGrid datasets (the children) side by side.

EDDGridAggregateExistingDimension aggregates two or more EDDGrid datasets each of which has a different range of values for the first dimension, but identical values for the other dimensions.

EDDGridCopy makes and maintains a local copy of another EDDGrid's data and serves data from the local copy.

EDDTableFromCassandra handles data from one Cassandra table.

EDDTableFromDapSequence handles variables within 1- and 2-level sequences from DAP servers such as DAPPER.

EDDTableFromDatabase handles data from one relational database table or view.

EDDTableFromEDDGrid lets you create an EDDTable dataset from any EDDGrid dataset.
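
Conceptually, presenting gridded data as a table means that each grid cell becomes one row holding the axis values and the data value. The Python sketch below illustrates that idea for a small 2-D grid. It is not how ERDDAP implements the conversion, and the function name grid_to_rows and the sample values are assumptions for this example.

```python
from itertools import product

def grid_to_rows(axes, values):
    """Flatten a 2-D grid into tabular rows of (axis0, axis1, value).

    axes:   (axis0_values, axis1_values)
    values: nested list indexed as values[i][j]
    """
    axis0, axis1 = axes
    return [
        (axis0[i], axis1[j], values[i][j])
        for i, j in product(range(len(axis0)), range(len(axis1)))
    ]

# Hypothetical 2 x 3 grid of sea surface temperature on (latitude, longitude).
lat = [10.0, 20.0]
lon = [-120.0, -110.0, -100.0]
sst = [[25.1, 25.4, 25.9],
       [22.0, 22.3, 22.8]]

for row in grid_to_rows((lat, lon), sst):
    print(row)
```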

EDDTableFromFileNames creates a dataset from information about a group of files in the server's file system, including a URL for each file so that users can download the files via ERDDAP's "files" system. Unlike all of the EDDTableFromFiles subclasses, this dataset type does not serve data from within the files.

EDDTableFromFiles is the superclass of all EDDTableFrom...Files classes. You can't use EDDTableFromFiles directly. Instead, use a subclass of EDDTableFromFiles to handle the specific file type:

Currently, no other file types are supported. But it is usually relatively easy to add support for other file types. Contact us if you have a request. Or, if your data is in an old file format that you would like to move away from, we recommend converting the files to be NetCDF v3 .nc files (and especially .nc files with the CF Discrete Sampling Geometries (DSG) Contiguous Ragged Array data structure -- ERDDAP can extract data from them very quickly). NetCDF is a widely supported, binary format, allows fast random access to the data, and is already supported by ERDDAP.

Details - The following information applies to all of the subclasses of EDDTableFromFiles.

EDDTableFromAsciiService is essentially a screen scraper. It is intended to deal with data sources which have a simple web service for requesting data (often an HTML form on a web page) and which can return the data in some structured ASCII format (for example, a comma-separated-value or columnar ASCII text format, often with other information before and/or after the data).

EDDTableFromAsciiService is the superclass of all EDDTableFromAsciiService... classes. You can't use EDDTableFromAsciiService directly. Instead, use a subclass of EDDTableFromAsciiService to handle specific types of services:

Currently, no other service types are supported. But it is usually relatively easy to support other services if they work in a similar way. Contact us if you have a request.

Details - The following information applies to all of the subclasses of EDDTableFromAsciiService.

EDDTableFromAsciiServiceNOS makes EDDTable datasets from the ASCII text data services offered by NOAA's National Ocean Service (NOS). For information on how this class works and how to use this class, see this class's superclass EDDTableFromAsciiService. It is unlikely that anyone other than Bob Simons will need to use this subclass.

Since the data within the response from a NOS service uses a columnar ASCII text format, data variables other than latitude and longitude must have a special attribute which specifies which characters of each data line contain that variable's data, for example,
<att name="responseSubstring">17, 25</att>
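
The Python sketch below shows one way a pair of character positions like this could select a field from a fixed-width line. It assumes that the first number is a 0-based inclusive start and the second is an exclusive end; this section does not state the indexing rule, so verify it against your data. The sample line and the function names are hypothetical.

```python
def parse_response_substring(att_text):
    """Parse an attribute value such as "17, 25" into (start, end) integers."""
    start_text, end_text = att_text.split(",")
    return int(start_text), int(end_text)

def extract_field(line, att_text):
    """Return the stripped characters of `line` selected by the attribute value.

    Assumption: start is 0-based and inclusive, end is exclusive.
    """
    start, end = parse_response_substring(att_text)
    return line[start:end].strip()

# Hypothetical fixed-width data line (not a real NOS response):
# 16 characters of timestamp, 1 space, an 8-character field, a 7-character field.
line = "2010-01-01 00:00" + " " + "12.345".rjust(8) + "67.8".rjust(7)
print(extract_field(line, "17, 25"))
```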
 

EDDTableFromAllDatasets is a higher-level dataset which has information about all of the other datasets which are currently loaded in your ERDDAP. Unlike other types of datasets, there is no specification for the allDatasets dataset in datasets.xml. ERDDAP automatically creates one EDDTableFromAllDatasets dataset (with datasetID=allDatasets). Thus, an allDatasets dataset will be created in each ERDDAP installation and will work the same way in each ERDDAP installation.

The allDatasets dataset is a tabular dataset. It has a row of information for each dataset. It has columns with information about each dataset, e.g., datasetID, accessible, institution, title, minLongitude, maxLongitude, minLatitude, maxLatitude, minTime, maxTime, etc. Because allDatasets is a tabular dataset, you can query it the same way you can query any other tabular dataset in ERDDAP, and you can specify the file type for the response. This lets users search for datasets of interest in very powerful ways.
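
As an illustration, the Python sketch below builds a request URL that asks allDatasets for a few columns, restricted by one constraint. The URL pattern (tabledap/allDatasets plus a file type extension, then a comma-separated column list and percent-encoded constraints) reflects our understanding of ERDDAP's tabular request syntax, which this section does not spell out. The base URL is a placeholder, and the function name and the "NOAA" constraint value are assumptions for this example.

```python
from urllib.parse import quote

def all_datasets_url(base_url, columns, constraints=(), file_type=".csv"):
    """Build a tabledap request URL for the allDatasets dataset.

    columns:     column names to return, e.g. ["datasetID", "title"]
    constraints: strings such as 'institution="NOAA"'; each is percent-encoded
    """
    query = ",".join(columns)
    for constraint in constraints:
        query += "&" + quote(constraint, safe="=")
    return f"{base_url}/tabledap/allDatasets{file_type}?{query}"

# "http://yourUrl/erddap" is a placeholder for your own ERDDAP base URL.
url = all_datasets_url(
    "http://yourUrl/erddap",
    ["datasetID", "title", "minTime", "maxTime"],
    ['institution="NOAA"'],
)
print(url)
```

Changing file_type (for example, to ".json") would request the same rows in a different response format.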
 

EDDTableFromAsciiFiles aggregates data from comma-, tab-, semicolon-, or space-separated tabular ASCII data files.

EDDTableFromAwsXmlFiles aggregates data from a set of Automatic Weather Station (AWS) XML data files. Some background information is at WeatherBug_Rest_XML_API.

EDDTableFromColumnarAsciiFiles aggregates data from tabular ASCII data files with fixed-width columns.

EDDTableFromHyraxFiles aggregates data files with several variables, each with one or more shared dimensions (for example, time, altitude (or depth), latitude, longitude), and served by a Hyrax OPeNDAP server.

EDDTableFromMultidimNcFiles aggregates data from NetCDF (v3 or v4) .nc (or .ncml) files with several variables, each with one or more shared dimensions. The files may have character variables with or without an additional dimension (for example, STRING14). See this class' superclass, EDDTableFromFiles, for information on how this class works and how to use this class.

EDDTableFromNcFiles aggregates data from NetCDF (v3 or v4) .nc (or .ncml) files with several variables, each with one shared dimension (for example, time) or more than one shared dimension (for example, time, altitude (or depth), latitude, longitude). The files must have the same dimension names. A given file may have multiple values for each of the dimensions and the values may be different in different files. The files may have character variables with an additional dimension (for example, STRING14). See this class' superclass, EDDTableFromFiles, for information on how this class works and how to use this class.

EDDTableFromNcCFFiles aggregates data from NetCDF (v3 or v4) .nc (or .ncml) files which use one of the file formats specified by the CF Discrete Sampling Geometries (DSG) conventions. See this class' superclass, EDDTableFromFiles, for information on how this class works and how to use this class.

For files using one of the multidimensional CF DSG variants, use EDDTableFromMultidimNcFiles instead.

The CF DSG conventions define dozens of file formats and include numerous minor variations. This class deals with all of the variations we are aware of, but we may have missed one (or more). So if this class can't read data from your CF DSG files, please email bob.simons at noaa.gov and include a sample file.
Or, you can join the ERDDAP Google Group / Mailing List and post your question there.

We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it.
 

EDDTableFromNccsvFiles aggregates data from NCCSV ASCII .csv files. See this class' superclass, EDDTableFromFiles, for information on how this class works and how to use this class.

EDDTableFromNOS handles data from a NOAA NOS source, which uses SOAP+XML for requests and responses. It is very specific to NOAA NOS's XML. See the sample EDDTableFromNOS dataset in datasets2.xml.
 

EDDTableFromOBIS handles data from an Ocean Biogeographic Information System (OBIS) server.

EDDTableFromSOS handles data from a Sensor Observation Service (SWE/SOS) server.

EDDTableFromThreddsFiles aggregates data files with several variables, each with one or more shared dimensions (for example, time, altitude (or depth), latitude, longitude), and served by a THREDDS OPeNDAP server.

EDDTableFromWFSFiles makes a local copy of all of the data from an ArcGIS MapServer WFS server so the data can then be re-served quickly to ERDDAP users.

EDDTableAggregateRows can make an EDDTable dataset from a group of "child" EDDTable datasets.

EDDTableCopy can make a local copy of many types of EDDTable datasets and then re-serve the data quickly from the local copy.


Details

Here are detailed descriptions of common tags and attributes.
 

Contact

Questions, comments, suggestions? Please send an email to bob dot simons at noaa dot gov and include the ERDDAP URL directly related to your question or comment.

Or, you can join the ERDDAP Google Group / Mailing List by visiting https://groups.google.com/forum/#!forum/erddap and clicking on "Apply for membership". Once you are a member, you can post your question there or search to see if the question has already been asked and answered.
 


ERDDAP, Version 1.82