NOAA   ERDDAP
Easier access to scientific data

Brought to you by NOAA NMFS SWFSC ERD    
 

Working with the datasets.xml File

[This web page will only be of interest to ERDDAP administrators.]

After you have followed the ERDDAP installation instructions, you must edit the datasets.xml file in tomcat/content/erddap/ to describe the datasets that your ERDDAP installation will serve.

Table of Contents


 

Introduction

Some Assembly Required
Setting up a dataset in ERDDAP isn't just a matter of pointing to the dataset's directory or URL. You have to write a chunk of XML for datasets.xml which describes the dataset.

  • For gridded datasets, in order to make the dataset conform to ERDDAP's data structure for gridded data, you have to identify a subset of the dataset's variables which share the same dimensions. (Why? How?)
  • The dataset's current metadata is imported automatically, but if you want to modify that metadata or add other metadata, you have to specify it in datasets.xml. ERDDAP also needs other metadata, including global attributes (such as infoUrl, institution, sourceUrl, summary, and title) and variable attributes (such as long_name and units). Just as the metadata already in the dataset describes it, the metadata that ERDDAP requests adds further descriptive information. The additional metadata is a good addition to your dataset and helps ERDDAP do a better job of presenting your data to users who aren't familiar with it.
  • ERDDAP needs you to do special things with the longitude, latitude, altitude (or depth), and time variables.
If you buy into these ideas and expend the effort to create the XML for datasets.xml, you get all the advantages of ERDDAP, including:
  • Full text search for datasets
  • Search for datasets by category
  • Data Access Forms (datasetID.html) so users can request subsets of the data in lots of different file formats
  • Forms to request graphs and maps (datasetID.graph)
  • Web Map Service (WMS) for gridded datasets
  • RESTful access to your data
Making the datasets.xml takes considerable effort for the first few datasets, but it gets easier. After the first dataset, you can often re-use a lot of your work for the next dataset. Fortunately, there are two Tools to help you create the XML for each dataset in datasets.xml.
If you get stuck, please send an email with the details to bob dot simons at noaa dot gov.
Or, you can join the ERDDAP Google Group / Mailing List and post your question there.

Data Provider Form
When a data provider comes to you hoping to add some data to your ERDDAP, it can be difficult and time consuming to collect all of the metadata (information about the dataset) needed to add the dataset into ERDDAP. Many data sources (for example, .csv files, Excel files, databases) have no internal metadata. So ERDDAP has a Data Provider Form which gathers metadata from the data provider and gives the data provider some other guidance, including extensive guidance for Data In Databases. The information submitted is converted into the datasets.xml format and then emailed to the ERDDAP administrator (you) and written (appended) to bigParentDirectory/logs/dataProviderForm.log . Thus, the form semi-automates the process of getting a dataset into ERDDAP, but the ERDDAP administrator still has to complete the datasets.xml chunk and deal with getting the data file(s) from the provider or connecting to the database.

The submission of actual data files from external sources is a huge security risk, so ERDDAP does not deal with that. You have to figure out a solution that works for you and the data provider, for example, email (for small files), pull from the cloud (for example, DropBox or Google Drive), an sftp site (with passwords), or sneakerNet (a USB thumb drive or external hard drive). You should probably only accept files from people you know. You will need to scan the files for viruses and take other security precautions.

There isn't a link in ERDDAP to the Data Provider Form (for example, on the ERDDAP home page). Instead, when someone tells you they want to have their data served by your ERDDAP, you can send them an email saying something like:

Yes, we can get your data into ERDDAP. To get started, please fill out the form at
http://yourUrl/erddap/dataProviderForm.html 
After you finish, I'll contact you to work out the final details.
If you just want to look at the form (without filling it out), you can see the form on ERD's ERDDAP: Introduction, Part 1, Part 2, Part 3, and Part 4. These links on the ERD ERDDAP send information to me, not you, so don't submit information with them unless you actually want to add data to the ERD ERDDAP.

If you want to remove the Data Provider Form from your ERDDAP, put
<dataProviderFormActive>false</dataProviderFormActive>
in your setup.xml file.

The impetus for this was NOAA's 2014 Public Access to Research Results (PARR) directive (external link), which requires that all NOAA environmental data funded through taxpayer dollars be made available via a data service (not just files) within 12 months of creation. So there is increased interest in using ERDDAP to make datasets available via a service ASAP. We needed a more efficient way to deal with a large number of data providers.

Feedback/Suggestions? This form is new, so please email bob dot simons at noaa dot gov if you have any feedback or suggestions for improving this.

Tools
There are two command line programs which are tools to help you create the XML for each dataset that you want your ERDDAP to serve. Once you have set up ERDDAP and run it (at least one time), you can find and use these programs in the tomcat/webapps/erddap/WEB-INF directory. There are Linux/Unix shell scripts (with the extension .sh) and Windows scripts (with the extension .bat) for each program. [On Linux, run these tools as the same user (tomcat?) that will run Tomcat.] When you run each program, it will ask you questions. For each question, type a response, then press Enter. Or press ^C to exit a program at any time.

Program won't run?

  • If you get an unknown program (or similar) error message, the problem is probably that the operating system couldn't find Java. You need to figure out where Java is on your computer, then edit the java reference in the .bat or .sh file that you are trying to use.
  • If you get a jar file not found or class not found error message, then Java couldn't find one of the classes listed in the .bat or .sh file you are trying to use. The solution is to figure out where that .jar file is, and edit the java reference to it in the .bat or .sh file.
  • If you are using a version of Java that is too old for a program, the program won't run and you will see an error message like
    Exception in thread "main" java.lang.UnsupportedClassVersionError:
    some/class/name: Unsupported major.minor version someNumber

    The solution is to update to the most recent version of Java and make sure the .sh or .bat file for the program is using it.

The tools print various diagnostic messages:

  • The word "ERROR" is used when something went so wrong that the procedure failed to complete. Although it is annoying to get an error, the error forces you to deal with the problem.
  • The word "WARNING" is used when something went wrong, but the procedure was able to complete. These are pretty rare.
  • Anything else is just an informative message. You can add -verbose to the GenerateDatasetsXml or DasDds command line to get additional informative messages, which sometimes helps solve problems.

The two tools are a big help, but you still must read all of these instructions on this page carefully and make important decisions yourself.

  • GenerateDatasetsXml is a command line program that can generate a rough draft of the dataset XML for almost any type of dataset.

    We STRONGLY RECOMMEND that you use GenerateDatasetsXml instead of creating chunks of datasets.xml by hand because:

    • GenerateDatasetsXml works in seconds. Doing this by hand is at least an hour's work, even when you know what you're doing.
    • GenerateDatasetsXml does a better job. Doing this by hand requires extensive knowledge of how ERDDAP works. It is unlikely that you will do a better job by hand. (Bob Simons always uses GenerateDatasetsXml for the first draft, and he wrote ERDDAP.)
    • GenerateDatasetsXml always generates a valid chunk of datasets.xml. Any chunk of datasets.xml that you write will probably have at least a few errors that prevent ERDDAP from loading the dataset. It often takes people hours to diagnose these problems. Don't waste your time. Let GenerateDatasetsXml do the hard work. Then you can refine the .xml by hand if you want.

    When you use the GenerateDatasetsXml program:

    • GenerateDatasetsXml first asks you to specify the EDDType (Erd Dap Dataset Type) of the dataset. See the List of Dataset Types (in this document) to figure out which type is appropriate for the dataset you are working on. In addition to the regular EDDTypes, there are also a few Special/Pseudo Dataset Types (e.g., one which crawls a THREDDS catalog to generate a chunk of datasets.xml for each of the datasets in the catalog).
    • GenerateDatasetsXml then asks you a series of questions specific to that EDDType. The questions gather the information needed for ERDDAP to access the dataset's source. To understand what ERDDAP is asking for, see the documentation for the EDDType that you specified by clicking on the same dataset type in the List of Dataset Types.
    • Often, one of your answers won't be what GenerateDatasetsXml needs. You can then try again, with revised answers to the questions, until GenerateDatasetsXml can successfully find and understand the source data.
    • If you answer the questions correctly (or reasonably correctly), GenerateDatasetsXml will connect to the dataset's source and gather basic information (for example, variable names and metadata).
      For datasets that are from local NetCDF .nc and related files, GenerateDatasetsXml will often print the ncdump-like structure of the file after it first reads the file. This may give you information to answer the questions better on a subsequent loop through GenerateDatasetsXml.
    • GenerateDatasetsXml will generate a rough draft of the dataset XML for that dataset.
    • Diagnostic information and the rough draft of the dataset XML will be written to bigParentDirectory/logs/GenerateDatasetsXml.log .
    • The rough draft of the dataset XML will be written to bigParentDirectory/logs/GenerateDatasetsXml.out .
    • "0 files" Error Message
      If you run GenerateDatasetsXml or DasDds, or if you try to load an EDDGridFrom...Files or EDDTableFrom...Files dataset in ERDDAP, and you get a "0 files" error message indicating that ERDDAP found 0 matching files in the directory (when you think that there are matching files in that directory):
      • Check that you have specified the full name of the directory. And if you specified the sample file name, make sure you specified the file's full name, including the full directory name.
      • Check that the files really are in that directory.
      • Check the spelling of the directory name.
      • Check the fileNameRegex. It's really, really easy to make mistakes with regexes. For test purposes, try the regex .* which should match all file names.
      • Check that the user who is running the program (e.g., user=tomcat (?) for Tomcat/ERDDAP) has 'read' permission for those files.
      • In some operating systems (for example, SE Linux) and depending on system settings, the user who ran the program must have 'read' permission for the whole chain of directories leading to the directory that has the files.
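      When chasing down a "0 files" error, these are the two tags to double-check in the dataset's chunk of datasets.xml (the directory name here is hypothetical):

```xml
<!-- Hypothetical example: the directory and regex for an EDD...From...Files dataset. -->
<fileDir>/data/myDataset/</fileDir>       <!-- the full directory name -->
<fileNameRegex>.*\.nc</fileNameRegex>     <!-- for testing, try .* to match all file names -->
```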
    • If you have problems that you can't solve, send an email to Bob with as much information as possible. Similarly, if it seems like the appropriate EDDType for a given dataset doesn't work with that dataset, or if there is no appropriate EDDType, please send an email to Bob with the details (and a sample file if relevant).
    • You can then use DasDds (see below) to repeatedly test the XML for that dataset to ensure that the resulting dataset appears as you want it to in ERDDAP.
    • Feel free to make small changes by hand, for example, supply a better infoUrl, summary, or title.
    • Scripting: As an alternative to answering the questions interactively at the keyboard and looping to generate additional datasets, you can provide command line arguments to answer all of the questions to generate one dataset. GenerateDatasetsXml will process those parameters, write the output to the output file, and exit the program. To set this up, first use the program in interactive mode and write down your answers. Then generate the command line (usually in a script) with all of the arguments. This should be useful for datasets that change frequently in a way that necessitates re-running GenerateDatasetsXml (notably EDDGridFromThreddsCatalog).
    • GenerateDatasetsXml supports a -idatasetsXmlName#tagName command line parameter which inserts the output into the specified datasets.xml file (the default is tomcat/content/erddap/datasets.xml). GenerateDatasetsXml looks for two lines in datasetsXmlName:
      <!-- Begin GenerateDatasetsXml #tagName someDatetime -->
      and
      <!-- End GenerateDatasetsXml #tagName someDatetime -->
      and replaces everything in between those lines with the new content, and changes the someDatetime.
      • The -i switch is only processed (and changes to datasets.xml are only made) if you run GenerateDatasetsXml with command line arguments which specify all the answers to all of the questions for one loop of the program. (See 'Scripting' above.) (The thinking is: This parameter is for use with scripts. If you use the program in interactive mode (typing info on the keyboard), you are likely to generate some incorrect chunks of XML before you generate the one you want.)
      • If the Begin and End lines are not found, then those lines and the new content are inserted right before </erddapDatasets>.
      • There is also a -I (capital i) switch for testing purposes which works the same as -i, but creates a file called datasets.xmlDateTime and doesn't make changes to datasets.xml.
      • Don't run GenerateDatasetsXml with -i in two processes at once. There is a chance only one set of changes will be kept. There may be serious trouble (for example, corrupted files).
    If you use "GenerateDatasetsXml -verbose", it will print more diagnostic messages than usual.
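    As a sketch, the marker lines that -i looks for in datasets.xml look like this (the tagName, datetime, and dataset details here are illustrative):

```xml
<!-- Begin GenerateDatasetsXml #myTag 2015-01-01T00:00:00 -->
<dataset type="EDDGridFromDap" datasetID="exampleDatasetID" active="true">
  ...
</dataset>
<!-- End GenerateDatasetsXml #myTag 2015-01-01T00:00:00 -->
```

    Everything between the Begin and End lines is replaced with the newly generated content.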

    DISCLAIMER: The chunk of datasets.xml made by GenerateDatasetsXml isn't perfect. YOU MUST READ AND EDIT THE XML BEFORE USING IT IN A PUBLIC ERDDAP. GenerateDatasetsXml relies on a lot of rules-of-thumb which aren't always correct. YOU ARE RESPONSIBLE FOR ENSURING THE CORRECTNESS OF THE XML THAT YOU ADD TO ERDDAP'S datasets.xml FILE.

    Special/Pseudo Dataset Types
    In general, the EDDType options in GenerateDatasetsXml match the EDD types described in this document (see the List of Dataset Types) and generate one datasets.xml chunk to create one dataset from one specific data source. There are a few exceptions:

    • EDDGridFromErddap
      This EDDType generates all of the datasets.xml chunks needed to make EDDGridFromErddap datasets from all of the EDDGrid datasets in a remote ERDDAP. You will have the option of keeping the original datasetID's (which may duplicate some datasetID's already in your ERDDAP) or generating new names which will be unique (but usually aren't as human-readable).
       
    • EDDTableFromErddap
      This EDDType generates all of the datasets.xml chunks needed to make EDDTableFromErddap datasets from all of the EDDTable datasets in a remote ERDDAP. You will have the option of keeping the original datasetID's (which may duplicate some datasetID's already in your ERDDAP) or generating new names which will be unique (but usually aren't as human-readable).
       
    • EDDGridFromThreddsCatalog
      This EDDType generates all of the datasets.xml chunks needed for all of the EDDGridFromDap datasets that it can find by crawling recursively through a THREDDS (sub) catalog. There are many forms of THREDDS catalog URLs. This option REQUIRES a THREDDS .xml URL with /catalog/ in it, for example,
      http://oceanwatch.pfeg.noaa.gov/thredds/catalog/catalog.xml or
      http://oceanwatch.pfeg.noaa.gov/thredds/catalog/Satellite/aggregsatMH/chla/catalog.xml
      (note that the comparable .html catalog is at
      http://oceanwatch.pfeg.noaa.gov/thredds/Satellite/aggregsatMH/chla/catalog.html ).
      If you have problems with EDDGridFromThreddsCatalog:
      • Make sure the URL you are using is valid, includes /catalog/, and ends with /catalog.xml .
      • If possible, use a public IP address (for example, http://oceanwatch.pfeg.noaa.gov) in the URL, not a local numeric IP address (for example, http://12.34.56.78). If the THREDDS is only accessible via the local numeric IP address, you can use <convertToPublicSourceUrl> so ERDDAP users see the public address, even though ERDDAP gets data from the local numeric address.
      • If you have problems that you can't solve, send an email to Bob with as much information as possible.
         
    • EDDGridLonPM180FromErddapCatalog
      This EDDType generates the datasets.xml to make EDDGridLonPM180 datasets from all of the EDDGrid datasets in an ERDDAP that have any longitude values greater than 180.
      • If possible, use a public IP address (for example, http://oceanwatch.pfeg.noaa.gov) in the URL, not a local numeric IP address (for example, http://12.34.56.78). If the ERDDAP is only accessible via the local numeric IP address, you can use <convertToPublicSourceUrl> so ERDDAP users see the public address, even though ERDDAP gets data from the local numeric address.
         
    • EDDsFromFiles
      Given a start directory, this traverses the directory and all subdirectories and tries to create a dataset for each group of data files that it finds.
      • This assumes that when a dataset is found, the dataset includes all subdirectories.
      • If a dataset is found, similar sibling directories will be treated as separate datasets (for example, directories for the 1990's, the 2000's, and the 2010's will generate separate datasets). They should be easy to combine by hand -- just change the first dataset's <fileDir> to the parent directory and delete all the subsequent sibling datasets.
      • This will only try to generate a chunk of datasets.xml for the most common type of file extension in a directory (not counting .md5, which is ignored). So, given a directory with 10 .nc files and 5 .txt files, a dataset will be generated for the .nc files only.
      • This assumes that all files in a directory with the same extension belong in the same dataset. If a directory has some .nc files with SST data and some .nc files with chlorophyll data, just one sample .nc file will be read (SST? chlorophyll?) and just one dataset will be created for that type of file. That dataset will probably fail to load because of complications from trying to load two types of files into the same dataset.
      • If there are fewer than 4 files with the most common extension in a directory, this assumes that they aren't data files and just skips the directory.
      • If there are 4 or more files in a directory, but this can't successfully generate a chunk of datasets.xml for the files (for example, an unsupported file type), this will generate an EDDTableFromFileNames dataset for the files.
      • At the end of the diagnostics that this writes to the log file, just before the datasets.xml chunks, this will print a table with a summary of information gathered by traversing all the subdirectories. The table will list every subdirectory and indicate the most common type of file extension, the total number of files, and which type of dataset was created for these files (if any). If you are faced with a complex, deeply nested file structure, consider running GenerateDatasetsXml with EDDType=EDDsFromFiles just to generate this information.
      • This option may not do a great job of guessing the best EDDType for a given group of data files, but it is quick, easy, and worth a try. If the source files are suitable, it works well and is a good first step in generating the datasets.xml for a file system with lots of subdirectories, each with data files from different datasets.
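    The combine-by-hand tip above (merging sibling directory datasets) can be sketched as follows (the directory names are hypothetical):

```xml
<!-- Hypothetical example: GenerateDatasetsXml made separate datasets for
     /data/sst/1990s/, /data/sst/2000s/, and /data/sst/2010s/ .
     To combine them, point the first dataset at the parent directory
     and delete the sibling datasets' chunks: -->
<fileDir>/data/sst/</fileDir>
<recursive>true</recursive>
```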
         
  • DasDds is a command line program that you can use after you have created a first attempt at the XML for a new dataset in datasets.xml. With DasDds, you can repeatedly test and refine the XML. When you use the DasDds program:
    1. DasDds asks you for the datasetID for the dataset you are working on.
    2. DasDds tries to create the dataset with that datasetID.
      • DasDds always prints lots of diagnostic messages.
        If you use "DasDds -verbose", DasDds will print more diagnostic messages than usual.
      • For safety, DasDds always deletes all of the cached dataset information (files) for the dataset before trying to create the dataset. So for aggregated datasets, you might want to adjust the fileNameRegex temporarily to limit the number of files the data constructor finds.
      • If the dataset fails to load (for whatever reason), DasDds will stop and show you the error message for the first error it finds.
        Don't try to guess what the problem might be. Read the ERROR message carefully.
        If necessary, read the preceding diagnostic messages to find more clues and information, too.
      • Make a change to the dataset's XML to try to solve THAT problem
        and let DasDds try to create the dataset again.
      • If you repeatedly solve each problem, you will eventually solve all the problems
        and the dataset will load.
    3. All DasDds output (diagnostics and results) is written to the screen and to bigParentDirectory/logs/DasDds.log .
    4. If DasDds can create the dataset, DasDds will then show you the .das and .dds for the dataset on your screen and write them to bigParentDirectory/logs/DasDds.out .
    5. Often, you will want to make some small change to the dataset's XML to clean up the dataset's metadata and rerun DasDds.

The basic structure of the datasets.xml file is:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<erddapDatasets>
  <convertToPublicSourceUrl /> <!-- 0 or more -->
  <requestBlacklist>...</requestBlacklist> <!-- 0 or 1 -->
  <subscriptionEmailBlacklist>...</subscriptionEmailBlacklist> <!-- 0 or 1 -->
  <user username="..." password="..." roles="..." /> <!-- 0 or more -->
  <dataset>...</dataset> <!-- 1 or more -->
</erddapDatasets>
It is possible that other encodings will be allowed in the future, but for now, only ISO-8859-1 is recommended.
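Within <erddapDatasets>, each <dataset> element follows the same general pattern. As a minimal hypothetical sketch (the type, datasetID, file names, and attribute values here are illustrative; GenerateDatasetsXml produces the real details, including other required tags):

```xml
<dataset type="EDDTableFromNcFiles" datasetID="myDatasetID" active="true">
  <fileDir>/data/myDataset/</fileDir>
  <fileNameRegex>.*\.nc</fileNameRegex>
  <addAttributes>
    <att name="title">My Dataset Title</att>
    <att name="summary">A paragraph describing the dataset.</att>
  </addAttributes>
  <dataVariable>
    <sourceName>sst</sourceName>
    <destinationName>sea_surface_temperature</destinationName>
  </dataVariable>
</dataset>
```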
 

Notes

Working with the datasets.xml file is a non-trivial project. Please read all of these notes carefully. After you pick a dataset type, please read the detailed description of it carefully.
  • Use Ctrl-F To Find Things On This Web Page
    All of the information about working with datasets.xml is on this one, very long, .html web page, not several .html pages as some people prefer. The advantage of one .html web page is that you can use Ctrl-F (Command-F on a Mac) in your web browser to search for text (for example, time_precision) within this web page.

    Alternatively, at the top of this document, there is a Table of Contents.

  • Choosing the Dataset Type
    In most cases, there is just one ERDDAP dataset type that is appropriate for a given data source. In a few cases (e.g., .nc files), there are a few possibilities, but usually one of them is definitely best. The first and biggest decision you must make is: is it appropriate to treat the dataset as a group of multidimensional arrays (if so, see the EDDGrid dataset types) or as a database-like table of data (if so, see the EDDTable dataset types)?
     
  • Serving the Data As Is
    Usually, there is no need to modify the data source (e.g., convert the files to some other file type) so that ERDDAP can serve it. One of the assumptions of ERDDAP is that the data source will be used as is. Usually this works fine. Some exceptions are:
    • Relational Databases and Cassandra - ERDDAP can serve data directly from relational databases and Cassandra. But for security, load balancing, and performance issues, you may choose to set up another database with the same data or save the data to NetCDF v3 .nc files and have ERDDAP serve the data from the new data source. See EDDTableFromDatabase and EDDTableFromCassandra.
    • Unsupported Data Sources - ERDDAP can support a large number of types of data sources, but the world is filled with 1000's (millions?) of different data sources (notably, data file structures). If ERDDAP doesn't support your data source:
      • If the data source is NetCDF .nc files, you can use NcML to modify the data files on-the-fly, or use NCO to permanently modify the data files.
      • You can write the data to a data source type that ERDDAP supports. NetCDF-3 .nc files are a good, general recommendation because they are binary files that ERDDAP can read very quickly. For tabular data, consider storing the data in a collection of .nc files that use the CF Discrete Sampling Geometries (DSG) (external link) Contiguous Ragged Array data structures, which ERDDAP can handle with EDDTableFromNcCFFiles. If they are logically organized (each with data for a chunk of space and time), ERDDAP can extract data from them very quickly.
      • You can request that support for that data source be added to ERDDAP by emailing bob.simons at noaa.gov.
      • You can add support for that data source by writing the code to handle it yourself. See the ERDDAP Programmer's Guide.
    • Speed - ERDDAP can read data from some data sources much faster than others. For example, reading NetCDF v3 .nc files is fast and reading ASCII files is slower. And if there is a large (>1000) or huge (>10,000) number of source data files, ERDDAP will respond to some data requests slowly. Usually, the difference isn't noticeable to humans. However, if you think ERDDAP is slow for a given dataset, you may choose to solve the problem by writing the data to a more efficient setup (usually: a few, well-structured, NetCDF v3 .nc files). For tabular data, see this advice.
       
  • Hint
    It is often easier to generate the XML for a dataset by making a copy of a working dataset description in datasets.xml and then modifying it.
     
  • Encoding Special Characters
    Since datasets.xml is an XML file, you MUST encode (external link) "&", "<", and ">" in any content as "&amp;", "&lt;", and "&gt;".
    Wrong: <title>Time & Tides</title>
    Right:   <title>Time &amp; Tides</title>
     
  • XML doesn't tolerate syntax errors.
    After you edit the datasets.xml file, it is a good idea to verify that the result is well-formed XML (external link) by pasting the XML text into an XML checker like RUWF (external link).
     
  • Other Ways To Diagnose Problems With Datasets
    In addition to the two main Tools,
    • log.txt is a log file with all of ERDDAP's diagnostic messages.
    • The Daily Report has more information than the status page, including a list of datasets that didn't load and the exceptions (errors) they generated.
    • The Status Page is a quick way to check ERDDAP's status from any web browser. It includes a list of datasets that didn't load (although not the related exceptions) and taskThread statistics (showing the progress of EDDGridCopy and EDDTableCopy datasets).
    • If you get stuck, please send an email with the details to bob dot simons at noaa dot gov.
      Or, you can join the ERDDAP Google Group / Mailing List and post your question there.
       
  • The longitude, latitude, altitude (or depth), and time (LLAT) variable destinationNames are special.
    • In general:
      • LLAT variables are made known to ERDDAP if the axis variable's (for EDDGrid datasets) or data variable's (for EDDTable datasets) destinationName is "longitude", "latitude", "altitude", "depth", or "time".
      • We strongly encourage you to use these standard names for these variables whenever possible. None of them is required. If you don't use these special variable names, ERDDAP won't recognize their significance. For example, LLAT variables are treated specially by Make A Graph (datasetID.graph): if the X Axis variable is "longitude" and the Y Axis variable is "latitude", you will get a map (using a standard projection, and with a land mask, political boundaries, etc.) instead of a graph.
      • ERDDAP will automatically add lots of metadata to LLAT variables (for example, "ioos_category", "units", and several standards-related attributes like "_CoordinateAxisType").
      • ERDDAP will automatically, on-the-fly, add lots of global metadata related to the LLAT values of the selected data subset (for example, "geospatial_lon_min").
      • Clients that support these metadata standards will be able to take advantage of the added metadata to position the data in time and space.
      • Clients will find it easier to generate queries that include LLAT variables because the variable's names are the same in all relevant datasets.
    • For the "longitude" variable and the "latitude" variable:
      • Use the destinationNames "longitude" and "latitude" only if the units are degrees_east and degrees_north, respectively. If your data doesn't fit these requirements, use different variable names (for example, x, y, lonRadians, latRadians).
      • If you have longitude and latitude data expressed in different units and thus with different destinationNames, for example, lonRadians and latRadians, Make A Graph (datasetID.graph) will make graphs (for example, time series) instead of maps.
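      As a hypothetical sketch, a source variable named "lat" with suitable units can be served as ERDDAP's special "latitude" variable like this (the sourceName is illustrative):

```xml
<dataVariable>
  <sourceName>lat</sourceName>
  <destinationName>latitude</destinationName>
  <addAttributes>
    <att name="units">degrees_north</att>
  </addAttributes>
</dataVariable>
```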
    • For the "altitude" variable and the "depth" variable:
      • Use the destinationName "altitude" to identify the data's distance above sea level (positive="up" values). Optionally, you may use "altitude" for distances below sea level if the values are negative below the sea, or if you use, for example,
        <att name="scale_factor" type="int">-1</att> to convert depth values into altitude values.
      • Use the destinationName "depth" to identify the data's distance below sea level (positive="down" values).
      • A dataset may not have both "altitude" and "depth" variables.
      • For these variable names, the units must be "m", "meter", or "meters". If the units are different (for example, fathoms), you can use
        <att name="scale_factor">someValue</att> and <att name="units">meters</att> to convert the units to meters.
      • If your data doesn't fit these requirements, use a different destinationName (for example, aboveGround, distanceToBottom).
      • If you know the vertical datum (external link), please specify it in the metadata.
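      For example, a unit conversion for a "depth" variable might look like this hypothetical sketch (the sourceName is illustrative; 1 fathom = 1.8288 meters):

```xml
<dataVariable>
  <sourceName>bottom_depth</sourceName>
  <destinationName>depth</destinationName>
  <addAttributes>
    <att name="scale_factor" type="double">1.8288</att>  <!-- convert fathoms to meters -->
    <att name="units">meters</att>
  </addAttributes>
</dataVariable>
```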
    • For the "time" variable:
      • Use the destinationName "time" only for variables that include the entire date+time (or date, if that is all there is). If, for example, there are separate columns for date and timeOfDay, don't use the variable name "time".
      • See units for more information about the units attribute for time and timeStamp variables.
      • The time variable and related timeStamp variables are unique in that they always convert data values from the source's time format (whatever it is) into a numeric value (seconds since 1970-01-01T00:00:00Z) or a String value (ISO 8601:2004(E) format), depending on the situation.
      • When a user requests time data, they can request it by specifying the time as a numeric value (seconds since 1970-01-01T00:00:00Z) or a String value (ISO 8601:2004(E) format).
      • ERDDAP has a utility to Convert a Numeric Time to/from a String Time.
      • See How ERDDAP Deals with Time.
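      As a hypothetical sketch, a String date+time column can be served as ERDDAP's special "time" variable by specifying its format in the units attribute (the sourceName and format here are illustrative; numeric source times instead use units like "seconds since 1970-01-01T00:00:00Z"):

```xml
<dataVariable>
  <sourceName>obs_date</sourceName>
  <destinationName>time</destinationName>
  <addAttributes>
    <att name="units">yyyy-MM-dd'T'HH:mm:ss'Z'</att>  <!-- format of the source String values -->
  </addAttributes>
</dataVariable>
```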
         
  • Why just two basic data structures?
    • Since it is difficult for human clients and computer clients to deal with a complex set of possible dataset structures, ERDDAP uses just two basic data structures: a grid data structure (for gridded data) and a table data structure (for tabular data).
    • Certainly, not all data can be expressed in these structures, but much of it can. Tables, in particular, are very flexible data structures (look at the success of relational database programs).
    • This makes data queries easier to construct.
    • This makes data responses have a simple structure, which makes it easier to serve the data in a wider variety of standard file types (which often just support simple data structures). This is the main reason that we set up ERDDAP this way.
    • This, in turn, makes it very easy for us (or anyone) to write client software which works with all ERDDAP datasets.
    • This makes it easier to compare data from different sources.
    • We are very aware that if you are used to working with data in other data structures you may initially think that this approach is simplistic or insufficient. But all data structures have tradeoffs. None is perfect. Even the do-it-all structures have their downsides: working with them is complex and the files can only be written or read with special software libraries. If you accept ERDDAP's approach enough to try to work with it, you may find that it has its advantages (notably the support for multiple file types that can hold the data responses). The ERDDAP slide show (particularly the data structures slide) talks a lot about these issues.
    • And even if this approach sounds odd to you, most ERDDAP clients will never notice -- they will simply see that all of the datasets have a nice simple structure and they will be thankful that they can get data from a wide variety of sources returned in a wide variety of file formats.
       
  • What if the grid variables in the source dataset DON'T share the same axis variables?
    In EDDGrid datasets, all data variables MUST use (share) all of the axis variables. So if a source dataset has some variables with one set of dimensions, and other variables with a different set of dimensions, you will have to make two datasets in ERDDAP. For example, you might make one ERDDAP dataset entitled "Some Title (at surface)" to hold variables that just use [time][latitude][longitude] dimensions and make another ERDDAP dataset entitled "Some Title (at depths)" to hold the variables that use [time][altitude][latitude][longitude]. Or perhaps you can change the data source to add a dimension with a single value (for example, altitude=0) to make the variables consistent.

    ERDDAP doesn't handle more complicated datasets (for example, swath data and models that use a mesh of triangles) well. You can serve these datasets in ERDDAP by creating two or more datasets in ERDDAP (so that all data variables in each new dataset share the same set of axis variables). For some datasets, you might consider making a regular gridded version of the dataset and offering that in addition to the original data. Some client software can only deal with a regular grid, so by doing this, you reach additional clients.
     

  • Projected Gridded Data
    Modelers (and others) often work with gridded data on various non-cylindrical projections (for example, conic, polar stereographic). Some end users want the projected data so there is no loss of information. For those clients, ERDDAP can serve the data, as is, if the ERDDAP administrator breaks the original dataset into a few datasets, with each part including variables which share the same axis variables. Yes, that seems odd to the people involved and it is different from most OPeNDAP servers. But ERDDAP emphasizes making the data available in many formats. That is possible because ERDDAP uses/requires a more uniform data structure. Although it is a little awkward (i.e., different than expected), ERDDAP can distribute the projected data.

    [Yes, ERDDAP could have looser requirements for the data structure, but keep the requirements for the output formats. But that would lead to confusion among many users, particularly newbies, since many seemingly valid requests for data with different structures would be invalid because the data wouldn't fit into the file type. We keep coming back to the current system's design.]

    Some end users want lat lon geographic data (plate carree) for ease-of-use in different situations. For that, we encourage the ERDDAP administrator to re-project the data onto a geographic (plate carree) projection and serve that form of the data as a different dataset. Then both types of users are happy.

  • NcML .ncml Files
    NcML files let you specify on-the-fly changes to one or more original source NetCDF (v3 or v4) .nc, .grib, .bufr, or .hdf (v4 or v5) files, and then have ERDDAP treat the .ncml files as the source files. ERDDAP datasets will accept .ncml files whenever .nc files are expected. The NcML files MUST have the extension .ncml. See the Unidata NcML documentation (external link). NcML is useful because you can do some things with it (for example, making different changes to different files in a collection, including adding a dimension with a specific value to a file), that you can't do with ERDDAP's datasets.xml.
    • Note that changes to an .ncml file's lastModified time will cause the file to be reloaded whenever the dataset is reloaded, but changes to the underlying .nc data files won't be directly noticed.
    • Hint: NcML is *very* sensitive to the order of some items in the NcML file. Think of NcML as specifying a series of instructions in the specified order, with the intention of changing the source files (the state at the start/top of the NcML file) into the destination files (the state at the end/bottom of the NcML file).
       

    An alternative to NcML is the NetCDF Operators (NCO). The big difference is that NcML is a system for making changes on-the-fly (so the source files aren't altered), whereas NCO can be used to make changes to (or new versions of) the files. Both NCO and NcML are very, very flexible and allow you to make almost any change you can think of to the files. For both, it can be challenging to figure out exactly how to do what you want to do -- check the web for similar examples. Both are useful tools for preparing netCDF and HDF files for use with ERDDAP, notably, to make changes beyond what ERDDAP's manipulation system can do.

    Example #1: Adding a Dimension
    Here's an .ncml file that adds a new outer dimension (time, with 1 value: 1041379200) to the pic variable in the file named A2003001.L3m_DAY_PIC_pic_4km.nc:

    <netcdf xmlns='http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2'>
      <variable name='time' type='int' shape='time' />
      <aggregation dimName='time' type='joinNew'>
        <variableAgg name='pic'/>
        <netcdf location='A2003001.L3m_DAY_PIC_pic_4km.nc' coordValue='1041379200'/>
      </aggregation>
    </netcdf>
    
  • NetCDF Operators (NCO)
    "The netCDF Operators (NCO) comprise a dozen standalone, command-line programs that take netCDF [v3 or v4], HDF [v4 or v5], [.grib, .bufr,] and/or DAP files as input, then operate (e.g., derive new data, compute statistics, print, hyperslab, manipulate metadata) and output the results to screen or files in text, binary, or netCDF formats. NCO aids analysis of gridded scientific data. The shell-command style of NCO allows users to manipulate and analyze files interactively, or with expressive scripts that avoid some overhead of higher-level programming environments." (from the NCO (external link) homepage).

    An alternative to NCO is NcML. The big difference is that NcML is a system for making changes on-the-fly (so the source files aren't altered), whereas NCO can be used to make changes to (or new versions of) the files. Both NCO and NcML are very, very flexible and allow you to make almost any change you can think of to the files. For both, it can be challenging to figure out exactly how to do what you want to do -- check the web for similar examples. Both are useful tools for preparing netCDF and HDF files for use with ERDDAP, notably, to make changes beyond what ERDDAP's manipulation system can do.

    For example, you can use NCO to make the units of the time variable consistent in a group of files where they weren't consistent originally. Or, you can use NCO to apply scale_factor and add_offset in a group of files where scale_factor and add_offset have different values in different files.
    (Or, you can now deal with those problems in ERDDAP via EDDGridFromNcFilesUnpacked, which is a variant of EDDGridFromNcFiles that unpacks packed data and standardizes time values at a low level in order to deal with a collection of files that have different scale_factor and add_offset values, or different time units.)

    NCO is Free and Open Source Software which uses the GPL 3.0 (external link) license.

    Example #1: Making Units Consistent
    EDDGridFromFiles and EDDTableFromFiles insist that the units be identical in all of the files. If some of the files are trivially (not functionally) different from others (e.g., time units of
    "seconds since 1970-01-01 00:00:00 UTC" vs.
    "seconds since 1970-01-01T00:00:00Z"), you could use NCO's ncatted (external link) to change the units in all of the files to be identical with
    nco/ncatted -a units,time,o,c,'seconds since 1970-01-01T00:00:00Z' *.nc

  • No <include> Option
    All of the setup information for all of the datasets must be in one file: datasets.xml. The biggest advantage is that if you want to make the same or similar changes to multiple datasets, you can do it quickly and easily, without opening and editing numerous files.

    Some people have asked for datasets.xml to support references to external files which have chunks of XML which define one or more datasets, for example,
      <include dataset1.xml/>
    ERDDAP doesn't support that. It isn't a standard feature of XML, so adding support for it would cause some problems.

    Fortunately, there is a work-around you can use, right now.

    1. Make several sub-files, for example, start.xml, datasets1.xml, datasets2.xml, ... end.xml
      Use whatever names you want for the files.
    2. Write a Linux script or DOS batch (.bat) file to concatenate the files into one file.
      Linux example:  cat start.xml datasets1.xml datasets2.xml end.xml > datasets.xml
      DOS example: type start.xml datasets1.xml datasets2.xml end.xml > datasets.xml
    3. Then, whenever you make a change to one of the sub-files, rerun the script to regenerate the complete datasets.xml file.
       
  • Limits to the Size of a Dataset
    You'll see many references to "2 billion" below. More accurately, that is a reference to 2,147,483,647 (2^31-1), which is the maximum value of a 32-bit signed integer. In some computer languages, for example Java (which ERDDAP is written in), that is the maximum size of many data structures (for example, the length of an array).

    For String values (for example, for variable names, attribute names, String attribute values, and String data values), the maximum number of characters per String in ERDDAP is ~2 billion. But in almost all cases, there will be small or large problems if a String exceeds a reasonable size (e.g., 80 characters for variable names and attribute names, and 255 characters for most String attribute values and data values). For example, web pages which display long variable names will be awkwardly wide and long variable names will be truncated if they exceed the limit of the response file type.

    For gridded datasets:

    • The maximum number of axisVariables is ~2 billion.
      The maximum number of dataVariables is ~2 billion.
      But if a dataset has >100 variables, it will be cumbersome for users to use.
      And if a dataset has >1 million variables, your server will need a lot of physical memory and there will be other problems.
    • The maximum size of each dimension (axisVariable) is ~2 billion values.
    • I think the maximum total number of cells (the product of all dimension sizes) is unlimited, but it may be ~9e18.

    For tabular datasets:

    • The maximum number of dataVariables is ~2 billion.
      But if a dataset has >100 variables, it will be cumbersome for users to use.
      And if a dataset has >1 million variables, your server will need a lot of physical memory and there will be other problems.
    • The maximum number of sources (for example, files) that can be aggregated is ~2 billion.
    • In some cases, the maximum number of rows from an individual source (for example, a file, but not a database) is ~2 billion rows.
    • I don't think there are other limits.

    For both gridded and tabular datasets, there are some internal limits on the size of the subset that can be requested by a user in a single request (often related to >2 billion of something or ~9e18 of something), but it is far more likely that a user will hit the file-type-specific limits.

    • NetCDF version 3 .nc files are limited to 2GB. (If this is really a problem for someone, let me know: I could add support for the NetCDF version 3 .nc 64-bit extension or NetCDF version 4, which would increase the limit significantly, but not infinitely.)
    • Browsers crash after only ~500MB of data, so ERDDAP limits the response to .htmlTable requests to ~400MB of data.
    • Many data analysis programs have similar limits (for example, the maximum size of a dimension is often ~2 billion values), so there is no reason to work hard to get around the file-type-specific limits.
    • The file-type-specific limits are useful in that they prevent naive requests for truly huge amounts of data (for example, "give me all of this dataset" when the dataset has 20TB of data), which would take weeks or months to download. The longer the download, the more likely it will fail for a variety of reasons.
    • The file-type-specific limits are useful in that they force the user to deal with reasonably-sized subsets (for example, dealing with a large gridded dataset via files with data from one time point each).
       
  • Switch to ACDD-1.3
    We (notably GenerateDatasetsXml) currently recommend ACDD version 1.3, which was ratified in early 2015 and which is referred to as "ACDD-1.3" in the global Conventions attribute. Prior to ERDDAP version 1.62 (released in June 2015), ERDDAP used/recommended the original, version 1.0, of the NetCDF Attribute Convention for Dataset Discovery which was referred to as "Unidata Dataset Discovery v1.0" in the global Conventions and Metadata_Conventions attributes.

    If your datasets use earlier versions of ACDD, we RECOMMEND that you switch to ACDD-1.3. It isn't hard. ACDD-1.3 is highly backward compatible with version 1.0. To switch, for all datasets (except EDDGridFromErddap and EDDTableFromErddap datasets):

    1. Remove the newly deprecated global Metadata_Conventions attribute by adding
      <att name="Metadata_Conventions">null</att>
      to the dataset's global <addAttributes> (or by changing an existing Metadata_Conventions attribute there to null).
       
    2. If the dataset has a Conventions attribute in the global <addAttributes>, change all "Unidata Dataset Discovery v1.0" references to "ACDD-1.3".
      If the dataset doesn't have a Conventions attribute in the global <addAttributes>, then add one that refers to ACDD-1.3. For example,
      <att name="Conventions">COARDS, CF-1.6, ACDD-1.3</att>
       
    3. If the dataset has a global standard_name_vocabulary attribute, please change the format of the value to, for example,
      <att name="standard_name_vocabulary">CF Standard Name Table v29</att>
      If the reference is to an older version of the CF Standard Name Table, it is probably a good idea to switch to v29, the current version (as we write this), since new standard names are added to that table with subsequent versions, but old standard names are rarely deprecated and never removed.
       
    4. Although ACDD-1.0 included global attributes for creator_name, creator_email, and creator_url, GenerateDatasetsXml didn't automatically add them until around ERDDAP v1.50. This is important information:
      • creator_name lets users know/cite the creator of the dataset.
      • creator_email tells users the preferred email address for contacting the creator of the dataset, for example if they have questions about the dataset.
      • creator_url gives users a way to find out more about the creator.
      • ERDDAP uses all of this information when generating FGDC and ISO 19115-2/19139 metadata documents for each dataset. Those documents are often used by external search services.
      Please add these attributes to the dataset's global <addAttributes>.
      <att name="creator_name">NOAA NMFS SWFSC ERD</att>
      <att name="creator_email">erd.data@noaa.gov</att>
      <att name="creator_url">http://www.pfeg.noaa.gov</att>

       
    That's it. I hope that wasn't too hard.

 

List of Dataset Types

If you need help choosing the right dataset type, see Choosing the Dataset Type.

The types of datasets fall into two categories. (Why?)

  • EDDGrid datasets handle gridded data.
    • In EDDGrid datasets, data variables are multi-dimensional arrays of data.
    • There MUST be an axis variable for each dimension. Axis variables MUST be specified in the order that the data variables use them.
    • In EDDGrid datasets, all data variables MUST use (share) all of the axis variables.
      (Why? What if they don't?)
    • Sorted Dimension Values - In all EDDGrid datasets, each dimension MUST be in sorted order (ascending or descending). Each can be irregularly spaced. There can be no ties. This is a requirement of the CF metadata standard (external link). If any dimension's values aren't in sorted order, the dataset won't be loaded and ERDDAP will identify the first unsorted value in the log file, bigParentDirectory/logs/log.txt .

      A few subclasses have additional restrictions (notably, EDDGridAggregateExistingDimension requires that the outer (leftmost) dimension be ascending).

      Unsorted dimension values almost always indicate a problem with the source dataset. This most commonly occurs when a misnamed or inappropriate file is included in the aggregation, which leads to an unsorted time dimension. To solve this problem, see the error message in the ERDDAP log.txt file to find the offending time value. Then look in the source files to find the corresponding file (or one before or one after) that doesn't belong in the aggregation.

    • See the more complete description of the EDDGrid data model.
    • The EDDGrid dataset types are:
      • EDDGridFromDap handles gridded data from DAP servers.
      • EDDGridFromEDDTable lets you convert a tabular dataset into a gridded dataset.
      • EDDGridFromErddap handles gridded data from a remote ERDDAP.
      • EDDGridFromEtopo just handles the built-in ETOPO topography data.
      • EDDGridFromFiles is the superclass of all EDDGridFrom...Files classes.
      • EDDGridFromMergeIRFiles aggregates data from a group of local MergeIR .gz files.
      • EDDGridFromNcFiles aggregates data from a group of local NetCDF (v3 or v4) .nc files.
      • EDDGridFromNcFilesUnpacked is a variant of EDDGridFromNcFiles which also aggregates data from a group of local NetCDF (v3 or v4) .nc files, which ERDDAP unpacks at a low level.
      • EDDGridLonPM180 modifies the longitude values of a child EDDGrid so that they are in the range -180 to 180.
      • EDDGridSideBySide aggregates two or more EDDGrid datasets side by side.
      • EDDGridAggregateExistingDimension aggregates two or more EDDGrid datasets, each of which has a different range of values for the first dimension, but identical values for the other dimensions.
      • EDDGridCopy makes a local copy of any EDDGrid's data and serves data from the local copy.
         
  • EDDTable datasets handle tabular data.
    • Tabular data can be represented as a database-like table with rows and columns. Each column (a data variable) has a name, a set of attributes, and stores just one type of data. Each row has an observation (or group of related values). The data source may have the data in a different data structure, a more complicated data structure, and/or multiple data files, but ERDDAP needs to be able to flatten the source data into a database-like table in order to present the data as a tabular dataset to users of ERDDAP.
    • See the more complete description of the EDDTable data model.
    • The EDDTable dataset types are:

 

Detailed Descriptions of Dataset Types

EDDGridFromDap handles grid variables from DAP (external link) servers.

  • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can gather the information you need to tweak that or create your own XML for an EDDGridFromDap dataset by looking at the source dataset's DDS and DAS files in your browser (by adding .das and .dds to the sourceUrl, for example, http://thredds1.pfeg.noaa.gov/thredds/dodsC/satellite/BA/ssta/5day.dds (external link)).
  • EDDGridFromDap can get data from any multi-dimensional variable from a DAP data server. (Previously, EDDGridFromDap was limited to variables designated as "grid"s, but that is no longer a requirement.)
  • Sorted Dimension Values - The values for each dimension MUST be in sorted order (ascending or descending). The values can be irregularly spaced. There can be no ties. This is a requirement of the CF metadata standard (external link). If any dimension's values aren't in sorted order, the dataset won't be loaded and ERDDAP will identify the first unsorted value in the log file, bigParentDirectory/logs/log.txt .

    Unsorted dimension values almost always indicate a problem with the source dataset. This most commonly occurs when a misnamed or inappropriate file is included in the aggregation, which leads to an unsorted time dimension. To solve this problem, see the error message in the ERDDAP log.txt file to find the offending time value. Then look in the source files to find the corresponding file (or one before or one after) that doesn't belong in the aggregation.

  • The skeleton XML for an EDDGridFromDap dataset is:
    <dataset type="EDDGridFromDap" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <accessibleViaWMS>...</accessibleViaWMS> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 --> 
      <updateEveryNMillis>...</updateEveryNMillis> <!-- 0 or 1. For EDDGridFromDap, 
        this gets the remote .dds and then gets the new leftmost dimension values. -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <addAttributes>...</addAttributes> <!-- 0 or 1 -->
      <axisVariable>...</axisVariable> <!-- 1 or more -->
      <dataVariable>...</dataVariable> <!-- 1 or more -->
    </dataset>
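
    A minimal, hypothetical instance of that skeleton might look like the following. The datasetID, sourceUrl, attribute values, and variable names here are purely illustrative, not a working dataset:

    ```xml
    <dataset type="EDDGridFromDap" datasetID="myGriddedDataset" active="true">
        <sourceUrl>http://someServer/thredds/dodsC/someDataset</sourceUrl>
        <reloadEveryNMinutes>10080</reloadEveryNMinutes>
        <addAttributes>
            <att name="infoUrl">http://someServer/someDatasetInfo.html</att>
            <att name="institution">Some Institution</att>
            <att name="summary">A paragraph describing the dataset.</att>
            <att name="title">Some Title</att>
        </addAttributes>
        <axisVariable>
            <sourceName>time</sourceName>
            <destinationName>time</destinationName>
        </axisVariable>
        <axisVariable>
            <sourceName>lat</sourceName>
            <destinationName>latitude</destinationName>
        </axisVariable>
        <axisVariable>
            <sourceName>lon</sourceName>
            <destinationName>longitude</destinationName>
        </axisVariable>
        <dataVariable>
            <sourceName>sst</sourceName>
            <destinationName>sea_surface_temperature</destinationName>
        </dataVariable>
    </dataset>
    ```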
    
     

EDDGridFromEDDTable lets you convert an EDDTable tabular dataset into an EDDGrid gridded dataset. Remember that ERDDAP treats datasets as either gridded datasets (subclasses of EDDGrid) or tabular datasets (subclasses of EDDTable).

  • Normally, if you have gridded data, you just set up an EDDGrid dataset directly. Sometimes that isn't possible, for example, when the data is stored in a relational database that ERDDAP can only access via EDDTableFromDatabase. The EDDGridFromEDDTable class lets you remedy that situation.
     
  • Obviously, the data in the underlying EDDTable dataset must be (basically) gridded data, but in a tabular form. For example, the EDDTable dataset may have CTD data: measurements of eastward and northward current, at several depths, at several times. Since the depths are the same at each time point, EDDGridFromEDDTable can create a gridded dataset with a time and a depth dimension which accesses the data via the underlying EDDTable dataset.
     
  • GenerateDatasetsXml - We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can gather the information you need to improve the rough draft.
     
  • Source Attributes - As with all other types of datasets, EDDGridFromEDDTable has the idea that there are global sourceAttributes and global addAttributes (specified in datasets.xml), which are combined to make the global combinedAttributes, which are what users see. For global sourceAttributes, EDDGridFromEDDTable uses the global combinedAttributes of the underlying EDDTable dataset. (If you think about it for a minute, it makes sense.)

    Similarly, for each axisVariable's and dataVariable's addAttributes, EDDGridFromEDDTable uses the variable's combinedAttributes from the underlying EDDTable dataset as the EDDGridFromEDDTable variable's sourceAttributes. (If you think about it for a minute, it makes sense.)

    As a consequence, if the EDDTable has good metadata, the EDDGridFromEDDTable often needs very little addAttributes metadata -- just a few tweaks here and there.

  • dataVariables vs. axisVariables - The underlying EDDTable has only dataVariables. An EDDGridFromEDDTable dataset will have some axisVariables (created from some of the EDDTable dataVariables) and some dataVariables (created from the remaining EDDTable dataVariables). GenerateDatasetsXml will make a guess as to which EDDTable dataVariables should become EDDGridFromEDDTable axisVariables, but it is just a guess. You need to modify the output of GenerateDatasetsXml to specify which dataVariables will become axisVariables, and in which order.
     
  • axisValues - There is nothing about the underlying EDDTable to tell EDDGridFromEDDTable the possible values of the axisVariables in the gridded version of the dataset, so you MUST provide that information for each axisVariable via one of these attributes:
    • axisValues - lets you specify a list of values. For example,
      <att name="axisValues" type="doubleList">2, 2.5, 3, 3.5, 4</att>
      Note the use of a list data type. Also, the type of list (for example, double), MUST match the dataType of the variable in the EDDTable and EDDGridFromEDDTable datasets.
    • axisValuesStartStrideStop - lets you specify a sequence of regularly spaced values by specifying the start, stride, and stop values. Here is an example that is equivalent to the axisValues example above:
      <att name="axisValuesStartStrideStop" type="doubleList">2, 0.5, 4</att>
      Again, note the use of a list data type. Also, the type of list (for example, double), MUST match the dataType of the variable in the EDDTable and EDDGridFromEDDTable datasets.
       
    Updates - Just as there is no way for EDDGridFromEDDTable to determine the axisValues from the EDDTable initially, there is also no reliable way for EDDGridFromEDDTable to determine from the EDDTable when the axisValues have changed (notably, when there are new values for the time variable). Currently, the only solution is to change the axisValues attribute in datasets.xml and reload the dataset. For example, you could write a script to
    1. Search datasets.xml for
      datasetID="theDatasetID"
      so you are working with the correct dataset.
    2. Search datasets.xml for the next occurrence of
      <sourceName>theVariablesSourceName</sourceName>
      so you are working with the correct variable.
    3. Search datasets.xml for the next occurrence of
      <att name="axisValuesStartStrideStop" type="doubleList">
      so you know the start position of the tag.
    4. Search datasets.xml for the next occurrence of
      </att>
      so you know the end position of the axis values.
    5. Replace the old start, stride, stop values with the new values.
    6. Contact the flag URL for the dataset to tell ERDDAP to reload the dataset.
    This isn't ideal, but it works.
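    The steps above can be sketched in Python. The datasetID ("myDatasetID"), sourceName ("time"), and values here are hypothetical placeholders, and the string searching is deliberately simplistic; a production script should be more careful:

    ```python
    # Sketch of steps 1-5 above: update the axisValuesStartStrideStop values
    # for one variable in a datasets.xml text. Step 6 (contacting the flag URL)
    # is not shown.
    def update_axis_values(xml_text, dataset_id, source_name, new_values):
        """Replace the axisValuesStartStrideStop values for the given variable."""
        # 1. Find the dataset chunk via its datasetID.
        ds_start = xml_text.index('datasetID="%s"' % dataset_id)
        # 2. Find the variable via its sourceName, searching forward.
        var_start = xml_text.index('<sourceName>%s</sourceName>' % source_name, ds_start)
        # 3. Find the start of the axisValuesStartStrideStop attribute tag.
        att_start = xml_text.index('<att name="axisValuesStartStrideStop"', var_start)
        val_start = xml_text.index('>', att_start) + 1
        # 4. Find the end of the axis values (the closing </att> tag).
        val_end = xml_text.index('</att>', val_start)
        # 5. Replace the old start, stride, stop values with the new values.
        return xml_text[:val_start] + new_values + xml_text[val_end:]

    sample = ('<dataset type="EDDGridFromEDDTable" datasetID="myDatasetID">\n'
              '  <axisVariable>\n'
              '    <sourceName>time</sourceName>\n'
              '    <att name="axisValuesStartStrideStop" type="doubleList">2, 0.5, 4</att>\n'
              '  </axisVariable>\n'
              '</dataset>\n')
    updated = update_axis_values(sample, 'myDatasetID', 'time', '2, 0.5, 6')
    print('2, 0.5, 6' in updated)  # True
    ```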
     
  • precision - When EDDGridFromEDDTable responds to a user's request for data, it moves a row of data from the EDDTable response table into the EDDGrid response grid. To do this, it has to figure out if the "axis" values on a given row in the table match some combination of axis values in the grid. For integer data types, it is easy to determine if two values are equal. But for floats and doubles, this brings up the horrible problem of floating point numbers not matching exactly (external link) (for example, 0.2 vs 0.199999999999996). To (try to) deal with this, EDDGridFromEDDTable lets you specify a precision attribute for any of the axisVariables, which specifies the total number of decimal digits which must be identical.
    • For example, <att name="precision" type="int">5</att>
    • For different types of data variables, there are different default precision values. The defaults are usually appropriate. If they aren't, you need to specify different values.
    • For axisVariables that are time or timeStamp variables, the default is full precision (an exact match).
    • For axisVariables that are floats, the default precision is 5.
    • For axisVariables that are doubles, the default precision is 9.
    • For axisVariables that have integer data types, EDDGridFromEDDTable ignores the precision attribute and always uses full precision (an exact match).
       
    • WARNING! When doing the conversion of a chunk of tabular data into a chunk of gridded data, if EDDGridFromEDDTable can't match an EDDTable "axis" value to one of the expected EDDGridFromEDDTable axis values, EDDGridFromEDDTable silently (no error) throws away the data from that row of the table. For example, there may be other data (not on the grid) in the EDDTable dataset. (And if stride > 1, it isn't obvious to EDDGridFromEDDTable which axis values are desired and which ones are to be skipped because of the stride.) So, if the precision values are too high, the user will see missing values in the data response when valid data values actually exist.

      Conversely, if the precision values are set too low, EDDTable "axis" values that shouldn't match EDDGridFromEDDTable axis values will (erroneously) match.

      These potential problems are horrible, because the user gets the wrong data (or missing values) when they should get the right data (or at least an error message).
      This is not a flaw in EDDGridFromEDDTable. EDDGridFromEDDTable can't solve this problem. The problem is inherent in the conversion of tabular data into gridded data (unless other assumptions can be made, but they can't be made here).
      It is up to you, the ERDDAP administrator, to test your EDDGridFromEDDTable thoroughly to ensure that the precision values are set to avoid these potential problems.
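    The float-matching idea described above can be sketched as follows. This is a simplified illustration of matching to a given number of significant digits, not ERDDAP's exact algorithm:

    ```python
    # Simplified illustration: two floating-point "axis" values match if they
    # are equal when rounded to `precision` significant digits.
    def axis_values_match(a, b, precision=5):
        """Return True if a and b agree to `precision` significant digits."""
        return float('%.*g' % (precision, a)) == float('%.*g' % (precision, b))

    print(axis_values_match(0.2, 0.199999999999996))  # True at precision 5
    print(axis_values_match(0.2, 0.21))               # False
    ```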

  • gapThreshold - This is a very unusual type of dataset. Since the types of queries that can be made to (handled by) an EDDGrid dataset (related to the ranges and strides of the axisVariables) are very different from the types of queries that can be made to (handled by) an EDDTable dataset (just related to the ranges of some variables), the performance of EDDGridFromEDDTable datasets will vary greatly depending on the exact request which is made and the speed of the underlying EDDTable dataset. For requests that have a stride value > 1, EDDGridFromEDDTable may ask the underlying EDDTable for a relatively big chunk of data (as if stride=1) and then sift through the results, keeping the data from some rows and throwing away the data from others. If it has to sift through a lot of data to get the data it needs, the request will take longer to fill.

    If EDDGridFromEDDTable can tell that there will be large gaps (with rows of unwanted data) between the rows with desired data, EDDGridFromEDDTable may choose to make several subrequests to the underlying EDDTable instead of one big request, thereby skipping the unwanted rows of data in the large gaps. The sensitivity for this decision is controlled by the gapThreshold value as specified in the <gapThreshold> tag (default=1000 rows of source data). Setting gapThreshold to a smaller number will lead to the dataset making (generally) more subrequests. Setting gapThreshold to a larger number will lead to the dataset making (generally) fewer subrequests.

    If gapThreshold is set too small, EDDGridFromEDDTable will operate more slowly because the overhead of multiple requests will be greater than the time saved by getting some excess data. If gapThreshold is set too big, EDDGridFromEDDTable will operate more slowly because so much excess data will be retrieved from the EDDTable, only to be discarded. (As Goldilocks discovered, the middle is "just right".) The overhead for different types of EDDTable datasets varies greatly, so the only way to know the actual best setting for your dataset is via experimentation. But you won't go too far wrong by sticking with the default.

    A simple example: imagine an EDDGridFromEDDTable dataset with just one axisVariable (time, with a size of 100000), one dataVariable (temperature), and the default gapThreshold of 1000.

    • If a user requests temperature[0:100:5000], the stride is 100, so the gap size is 99, which is less than the gapThreshold. So EDDGridFromEDDTable will make just one request to the EDDTable for all of the data needed for the request (equivalent to temperature[0:5000]) and throw away all the rows of data it doesn't need.
    • If a user requests temperature[0:2500:5000], the stride is 2500, so the gap size is 2499, which is greater than the gapThreshold. So EDDGridFromEDDTable will make separate requests to the EDDTable which are equivalent to temperature[0], temperature[2500], temperature[5000].
    Calculation of the gap size is more complicated when there are multiple axes.
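The single-axis decision can be sketched in a few lines of Python. This is a hypothetical simplification for illustration, not ERDDAP's actual code: with one axis, there are stride - 1 unwanted rows between consecutive desired rows, and that gap is compared to gapThreshold.

```python
# Hypothetical sketch (not ERDDAP's implementation) of the
# single-axis gapThreshold decision: with one axis, consecutive
# desired rows are `stride` apart, leaving stride - 1 unwanted
# rows between them.

def use_subrequests(stride, gap_threshold=1000):
    """Return True if the gap of unwanted rows exceeds gapThreshold,
    i.e., one subrequest per desired axis value would be made."""
    gap = stride - 1
    return gap > gap_threshold

# temperature[0:100:5000]: gap = 99 -> one big request
print(use_subrequests(100))   # False
# temperature[0:2500:5000]: gap = 2499 -> separate subrequests
print(use_subrequests(2500))  # True
```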

    For each user request, EDDGridFromEDDTable prints diagnostic messages related to this in the log.txt file.

    • If <logLevel> in setup.xml is set to info, this prints a message like
      * nOuterAxes=1 of 4 nOuterRequests=22
      If nOuterAxes=0, gapThreshold wasn't exceeded and only one request will be made to EDDTable.
      If nOuterAxes>0, gapThreshold was exceeded and nOuterRequests will be made to EDDTable, corresponding to each requested combination of the leftmost nOuterAxes. For example, if the dataset has 4 axisVariables and dataVariables like eastward[time][latitude][longitude][depth], the leftmost 1 axis is the time axis.
    • If <logLevel> in setup.xml is set to all, additional information is written to the log.txt file.
       
  • The skeleton XML for an EDDGridFromEDDTable dataset is:
    <dataset type="EDDGridFromEDDTable" datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <accessibleViaWMS>...</accessibleViaWMS> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 --> 
      <updateEveryNMillis>...</updateEveryNMillis> <!-- 0 or 1. For EDDGridFromEDDTable, 
        this only works if the underlying EDDTable supports updateEveryNMillis. -->
      <gapThreshold>...</gapThreshold> <!-- 0 or 1. The default is 1000. -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <addAttributes>...</addAttributes> <!-- 0 or 1 -->
      <axisVariable>...</axisVariable> <!-- 1 or more -->
      <dataVariable>...</dataVariable> <!-- 1 or more -->
      <dataset>...</dataset> <!-- The underlying source EDDTable dataset. -->
    </dataset>
    
     

EDDGridFromErddap handles gridded data from a remote ERDDAP server.
EDDTableFromErddap handles tabular data from a remote ERDDAP server.

  • EDDGridFromErddap and EDDTableFromErddap behave differently from all other types of datasets in ERDDAP.
    • Like other types of datasets, these datasets get information about the dataset from the source and keep it in memory.
    • Like other types of datasets, when ERDDAP searches for datasets, displays the Data Access Form (datasetID.html), or displays the Make A Graph form (datasetID.graph), ERDDAP uses the information about the dataset which is in memory.
    • Unlike other types of datasets, when ERDDAP receives a request for data or images from these datasets, ERDDAP redirects (external link) the request to the remote ERDDAP server. The result is:
      • This is very efficient (in CPU, memory, and bandwidth) because, without the redirect:
        1. The composite ERDDAP has to send the request to the other ERDDAP (which takes time).
        2. The other ERDDAP has to get the data, reformat it, and transmit the data to the composite ERDDAP.
        3. The composite ERDDAP has to receive the data (using bandwidth), reformat it (using CPU and memory), and transmit the data to the user (using bandwidth).
        By redirecting the request and allowing the other ERDDAP to send the response directly to the user, the composite ERDDAP spends essentially no CPU time, memory, or bandwidth on the request.
      • The redirect is transparent to the user regardless of the client software (a browser or any other software or command line tool).
  • Normally, when an EDDGridFromErddap or EDDTableFromErddap dataset is (re)loaded on your ERDDAP, it tries to add a subscription to the remote dataset via the remote ERDDAP's email/URL subscription system. That way, whenever the remote dataset changes, the remote ERDDAP contacts the setDatasetFlag URL on your ERDDAP so that the local dataset is reloaded ASAP and always mimics the remote dataset. So, the first time this happens, you should get an email requesting that you validate the subscription. However, if the local ERDDAP can't send an email, or if the remote ERDDAP's email/URL subscription system isn't active, you should email the remote ERDDAP administrator and request that s/he manually add <onChange>...</onChange> tags to all of the relevant datasets to call your datasets' setDatasetFlag URLs. See your ERDDAP daily report for a list of setDatasetFlag URLs, but just send the ones for EDDGridFromErddap and EDDTableFromErddap datasets to the remote ERDDAP administrator.
  • EDDGridFromErddap and EDDTableFromErddap are the basis for grids/clusters/federations of ERDDAPs, which efficiently distribute the CPU usage (mostly for making maps), memory usage, dataset storage, and bandwidth usage of a large data center.
  • EDDGridFromErddap and EDDTableFromErddap can't be used with remote datasets that require logging in (because they use <accessibleTo>).
  • For security reasons, EDDGridFromErddap and EDDTableFromErddap don't support the <accessibleTo> tag. See ERDDAP's security system for restricting access to some datasets to some users.
  • You can use the GenerateDatasetsXml program to make the datasets.xml chunk for this type of dataset. But these datasets are simple enough that you can easily write the chunk by hand.
  • The skeleton XML for an EDDGridFromErddap dataset is very simple, because the intent is just to mimic the remote dataset which is already suitable for use in ERDDAP:
    <dataset type="EDDGridFromErddap" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <updateEveryNMillis>...</updateEveryNMillis> <!-- 0 or 1.  For EDDGridFromErddap, 
        this gets the remote .dds and then gets the new leftmost dimension values. -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
    </dataset>
    
  • The skeleton XML for an EDDTableFromErddap dataset is very simple, because the intent is just to mimic the remote dataset, which is already suitable for use in ERDDAP:
    <dataset type="EDDTableFromErddap" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
    </dataset>
    
     

EDDGridFromEtopo just serves the ETOPO1 Global 1-Minute Gridded Elevation Data Set (external link) (Ice Surface, grid registered, binary, 2-byte int: etopo1_ice_g_i2.zip) which is distributed with ERDDAP.

  • Only two datasetIDs are supported for EDDGridFromEtopo, so that you can access the data with longitude values -180 to 180, or with longitude values 0 to 360.
  • There are never any sub tags, since the data is already described within ERDDAP.
  • So the two options for EDDGridFromEtopo datasets are (literally):
      <!-- etopo180 serves the data from longitude -180 to 180 -->
      <dataset type="EDDGridFromEtopo" datasetID="etopo180" /> 
      <!-- etopo360 serves the data from longitude 0 to 360 -->
      <dataset type="EDDGridFromEtopo" datasetID="etopo360" /> 
    

EDDGridFromFiles is the superclass of all EDDGridFrom...Files classes. You can't use EDDGridFromFiles directly. Instead, use a subclass of EDDGridFromFiles to handle the specific file type:

Currently, no other file types are supported. But it is usually relatively easy to add support for other file types. Contact us if you have a request. Or, if your data is in an old file format that you would like to move away from, we recommend converting the files to NetCDF v3 .nc files. NetCDF is a widely supported binary format that allows fast random access to the data and is already supported by ERDDAP.

Details - The following information applies to all of the subclasses of EDDGridFromFiles.

  • Aggregation of an Existing Dimension -
    All variations of EDDGridFromFiles can aggregate data from local files, where each file has 1 (or more) different values for the leftmost dimension, usually [time], which will be aggregated. For example, the dimensions might be [time][altitude][latitude][longitude], and the files might have the data for one (or a few) time value(s) per file. The resulting dataset appears as if all of the files' data had been combined. The big advantages of aggregation are:
    • The aggregated dataset can be much larger than a single file can conveniently be (~2GB).
    • For near-real-time data, it is easy to add a new file with the latest chunk of data. You don't have to rewrite the entire dataset.
    The requirements for aggregation are:
    • The local files needn't have the same dataVariables (as defined in the dataset's datasets.xml). The dataset will have the dataVariables defined in datasets.xml. If a given file doesn't have a given dataVariable, ERDDAP will add missing values as needed.
    • All of the dataVariables MUST use the same axisVariables/dimensions (as defined in the dataset's datasets.xml). The files will be aggregated based on the first (left-most) dimension, sorted in ascending order.
    • Each file MAY have data for one or more values of the first dimension, but there can't be any overlap between files. If a file has more than one value for the first dimension, the values MUST be sorted in ascending order, with no ties.
    • All files MUST have exactly the same values for all of the other dimensions.
    • All files MUST have exactly the same units metadata for all axisVariables and dataVariables. If this is a problem, you may be able to use NcML or NCO to fix the problem.
  • Aggregation via File Names or Global Metadata -
    All variations of EDDGridFromFiles can also aggregate a group of files by adding a new leftmost dimension, usually time, based on a value derived from each file name or from the value of a global attribute that is in each file. For example, the file name might include the time value for the data in the file. ERDDAP would then create a new time dimension.

    Unlike the similar feature in THREDDS, ERDDAP always creates an axisVariable with numeric values (as required by CF), never String values (which are not allowed by CF). Also, ERDDAP will sort the files in the aggregation based on the numeric axisVariable value which is assigned to each file, so that the axis variable will always have sorted values as required by CF. The THREDDS approach of doing a lexicographic sort based on the file names leads to aggregations where the axis values aren't sorted (which is not allowed by CF) when the file names sort differently than the derived axisVariable values.

    To set up one of these aggregations in ERDDAP, you will define a new leftmost (first) axisVariable with a special, pseudo <sourceName>, which tells ERDDAP where and how to find the value for the new dimension from each file.
    The format for the pseudo sourceName which gets the value from a file name is
    ***fileName,dataType,extractRegex,captureGroupNumber
    The format for the pseudo sourceName which gets the value from a global attribute is
    ***global:attributeName,dataType,extractRegex,captureGroupNumber
    The descriptions of the parts you need to provide are:

    • attributeName - the name of the global attribute which is in each file and which contains the dimension value.
    • dataType - This specifies the dataType that will be used to store the values. The allowed dataTypes are double (64-bit floating point), float (32-bit floating point), long (64-bit signed integer, discouraged), int (32-bit signed integer), short (16-bit signed integer), byte (8-bit signed integer), and char (essentially: 16-bit unsigned integer, strongly discouraged). This list is slightly different from the standard list of dataTypes that ERDDAP supports.

      There is an additional pseudo dataType, timeFormat=stringTimeFormat, which tells ERDDAP that the value is a String timestamp in the format specified by the stringTimeFormat, which is a Joda DateTimeFormat (external link). In most cases, the stringTimeFormat you need will be a variation of one of these two formats:

      • yyyyMMddHHmmss - which is the compact version of the ISO 8601 date time format. You will often need a shortened version of this, e.g., yyyyMMdd.
      • yyyyDDD - which is the year plus the zero-padded day of the year (e.g., 001 = Jan 1, 365 = Dec 31 in a non-leap year; this is sometimes called the Julian date).
    • extractRegex - This is the regular expression (external link) (tutorial (external link)) which includes a capture group (in parentheses) which describes how to extract the value from the file name or global attribute value. For example, given a file name like S19980011998031.L3b_MO_CHL.nc, capture group #1, "\d{7}", in the regular expression S(\d{7})\d{7}\.L3b.* will capture the first 7 digits after 'S': 1998001.
    • captureGroupNumber - This is the number of the capture group (within a pair of parentheses) in the regular expression which contains the information of interest. It is usually 1, the first capture group. Sometimes you need to use capture groups for other purposes in the regex, so then the important capture group number will be 2 (the second capture group) or 3 (the third), etc.
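It is easy to test an extractRegex and captureGroupNumber before putting them in datasets.xml by trying them in any language with standard regular expressions. For example, in Python, using the file name and regex from the example above:

```python
import re

# The example file name and extractRegex from the text above.
file_name = "S19980011998031.L3b_MO_CHL.nc"
regex = r"S(\d{7})\d{7}\.L3b.*"

match = re.match(regex, file_name)
# Capture group #1 holds the first 7 digits after 'S'.
print(match.group(1))  # 1998001
```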

    A full example of an axisVariable which makes an aggregated dataset with a new time axis which gets the time values from the file name of each file is

        <axisVariable>
            <sourceName>***fileName,timeFormat=yyyyDDD,S(\d{7})\.L3m.*,1</sourceName>
            <destinationName>time</destinationName>
        </axisVariable>
    When you use the "timeFormat=" pseudo dataType, ERDDAP will add 2 attributes to the axisVariable so that they appear to be coming from the source:
    <att name="standard_name">time</att>
    <att name="units">seconds since 1970-01-01T00:00:00Z</att>

    So in this case, ERDDAP will create a new axis named "time" with double values (seconds since 1970-01-01T00:00:00Z) by extracting the 7 digits after 'S' and before ".L3m" in the file name and interpreting those as time values formatted as yyyyDDD.
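As an offline sanity check of a timeFormat value (this is not ERDDAP code), Python's strptime can parse the same string; %Y%j corresponds to yyyyDDD:

```python
from datetime import datetime, timezone

# "1998001" in yyyyDDD format = day 001 of 1998 = 1998-01-01.
dt = datetime.strptime("1998001", "%Y%j").replace(tzinfo=timezone.utc)

# Convert to the CF units "seconds since 1970-01-01T00:00:00Z".
seconds = dt.timestamp()
print(int(seconds))  # 883612800
```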

    You can override the default base time (1970-01-01T00:00:00Z) by adding an addAttribute which specifies a different units attribute with a different base time. A common situation is: there are group of data files, each with a 1 day composite of a satellite dataset, where you want the time value to be noon of the day mentioned in the file name (the centered time of each day) and want the variable's long_name to be "Centered Time". An example which does this is:

        <axisVariable>
            <sourceName>***fileName,timeFormat=yyyyDDD,S(\d{7})\.L3m.*,1</sourceName>
            <destinationName>time</destinationName>
            <addAttributes>
                <att name="long_name">Centered Time</att>
                <att name="units">seconds since 1970-01-01T12:00:00Z</att>
            </addAttributes>
        </axisVariable>
    Note the hours=12 in the base time, which shifts each time value 12 hours later than it would be with the original base time of 1970-01-01T00:00:00Z.
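To see the effect of the changed base time, here is a small Python illustration (not ERDDAP code): the same stored numeric value lands at midnight with the default base time and at noon with the overridden base time.

```python
from datetime import datetime, timedelta, timezone

# A numeric time value for day "1998001" (yyyyDDD), in seconds.
value = 883612800

# Interpreted with the default base time, this is midnight:
base_midnight = datetime(1970, 1, 1, 0, 0, tzinfo=timezone.utc)
print(base_midnight + timedelta(seconds=value))  # 1998-01-01 00:00:00+00:00

# Interpreted with the overridden base time, it is the centered (noon) time:
base_noon = datetime(1970, 1, 1, 12, 0, tzinfo=timezone.utc)
print(base_noon + timedelta(seconds=value))  # 1998-01-01 12:00:00+00:00
```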

    A full example of an axisVariable which makes an aggregated dataset with a new "run" axis (with int values) which gets the run values from the "runID" global attribute in each file (with values like "r17_global", where 17 is the run number) is

        <axisVariable> 
            <sourceName>***global:runID,int,(r|s)(\d+)_global,2</sourceName>
            <destinationName>run</destinationName>
            <addAttributes>
                <att name="ioos_category">Other</att>
                <att name="units">count</att>
            </addAttributes>
        </axisVariable>
    Note the use of the capture group number 2 to capture the digits which occur after 'r' or 's', and before "_global". This example also shows how to add additional attributes (e.g., ioos_category and units) to the axis variable.
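Again, you can test this regex and capture group number outside of ERDDAP; for example, in Python:

```python
import re

# The example attribute value and extractRegex from the text above.
run_id = "r17_global"
regex = r"(r|s)(\d+)_global"

match = re.match(regex, run_id)
# Capture group #1 is the 'r' or 's'; group #2 holds the run number.
print(match.group(2))  # 17
```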
     
  • Sorted Dimension Values - The values for each dimension MUST be in sorted order (ascending or descending, except for the first (left-most) dimension which must be ascending). The values can be irregularly spaced. There can't be any ties. This is a requirement of the CF metadata standard (external link). If any dimension's values aren't in sorted order, the dataset won't be loaded and ERDDAP will identify the first unsorted value in the log file, bigParentDirectory/logs/log.txt .

    Unsorted dimension values almost always indicate a problem with the source dataset. This most commonly occurs when a misnamed or inappropriate file is included in the aggregation, which leads to an unsorted time dimension. To solve this problem, see the error message in the ERDDAP log.txt file to find the offending time value. Then look in the source files to find the corresponding file (or one before or one after) that doesn't belong in the aggregation.
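For an ascending axis (such as time), you can check a list of values offline before ERDDAP does. This hypothetical helper reports the first tied or out-of-order value, similar in spirit to what ERDDAP writes to log.txt:

```python
def first_unsorted(values):
    """Return (index, value) of the first value that is not strictly
    greater than its predecessor, or None if strictly ascending."""
    for i in range(1, len(values)):
        if values[i] <= values[i - 1]:
            return i, values[i]
    return None

print(first_unsorted([0.0, 1.0, 2.0, 3.0]))  # None
print(first_unsorted([0.0, 1.0, 1.0, 3.0]))  # (2, 1.0) -- a tie
print(first_unsorted([0.0, 1.0, 0.5, 3.0]))  # (2, 0.5) -- unsorted
```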

  • Directories - The files MAY be in one directory, or in a directory and its subdirectories (recursively). If there are a large number of files (for example, >1,000), the operating system (and thus EDDGridFromFiles) will operate much more efficiently if you store the files in a series of subdirectories (one per year, or one per month for datasets with very frequent files), so that there are never a huge number of files in a given directory.
     
  • Remote Directories and HTTP Range Requests -
    (AKA Byte Serving, Byte Range Requests, Accept-Ranges http header)
    EDDGridFromNcFiles, EDDTableFromMultidimNcFiles, EDDTableFromNcFiles, and EDDTableFromNcCFFiles can sometimes serve data from .nc files on remote servers accessed via HTTP, if the server supports Byte Serving (external link) via HTTP range requests (the HTTP mechanism for byte serving). This is possible because netcdf-java (which ERDDAP uses to read .nc files) supports reading data from remote .nc files via HTTP range requests.

    Don't do this! It is horribly slow. It is not supported.
    Because each range request goes over the internet, serving data in remote files via range requests is very slow and inefficient. Note that requests for data from a file on a local, standard SATA hard drive have a latency of about 3 ms for the first request and <1 ms thereafter, and the bandwidth ranges from about 0.3 GBytes/s for a single drive to 3 GBytes/s for a RAID. For requests over the internet to a remote server (not in the same room), the latency varies greatly but is often 500 - 2000 ms for the first and subsequent requests, and the available bandwidth is often very limiting (since it is shared with other users at an institution). When reading data from a file, netcdf-java often makes hundreds or thousands of requests for a range of bytes. So it is no surprise that accessing data via range requests is much, much slower than accessing data in local files. Range requests should be a last-resort system for serving data. Because of this, this is an unsupported feature in ERDDAP: it works for your setup or it doesn't, but IMHO it isn't worth spending your or my time on.

    Alternatives
    Please use an alternative:

    • Install ERDDAP (or THREDDS) on a server that has direct access to the files. You can then use EDDGridFromErddap or EDDTableFromErddap, (or EDDGridFromDap) to re-serve the data on your main ERDDAP.
    • Or, make a local copy of the files. Big hard drives are cheap! Even if you can't store the entire dataset locally indefinitely, perhaps you can do it temporarily.
    • Or, request that ERDDAP make a local caching system. It would grab whole files, as needed, store and read them locally, and delete the Least Recently Used files when space is limited. This was planned as a solution for working with AWS S3 (see below) (which, as expected, performs badly when reading files via range requests -- not its fault) but never implemented.

    How it works.
    Normally, when netcdf-java reads data from a local file, it sends a series of requests to the operating system, each for a specific range of bytes from the file in order to get a specific piece of information. With range requests, netcdf-java can read the remote file by sending a series of requests to the server (over the internet), each for a specific range of bytes.

    .grib and .bufr
    Netcdf-java can read data from local .grib and .bufr files, but I don't know whether netcdf-java (and thus ERDDAP) supports reading data from remote .grib and .bufr files via range requests. The problem is: when netcdf-java reads local .grib and .bufr files, it makes a local index file in the same directory. Given that this wouldn't be possible with remote files, this probably doesn't work. I've never tried it.

    Amazon S3 (external link)
    Amazon S3 is a storage system that is part of Amazon Web Services (AWS) (external link) (AKA Amazon's cloud services). Instead of a hierarchical file system, S3 has "buckets", each of which has a name and a "file" (a blob of bytes in the bucket). If the bucket names are file-like (e.g., "dirName1/dirName2/bucketName"), they can be accessed via a URL (http://ownersName.s3.amazonaws.com/dirName1/dirName2/bucketName). ERDDAP uses the AWS SDK for Java (external link) (which is included in the erddap.war file) to read the bucket names as if they were file names and to access data in the buckets via range requests. For this to work, the user running ERDDAP must have up-to-date, valid AWS credentials (external link) which are authorized to access those buckets. This works but, as discussed above, very slowly. If you are running ERDDAP on an AWS EC2 instance (a virtual computer that you can rent by the hour from Amazon), you can improve performance by using instance types (external link) that support "Enhanced Networking". For access to S3 to work better, ERDDAP should cache the files locally as needed instead of reading the remote S3 data via byte ranges. Currently, ERDDAP doesn't support that.

    For This To Work ...
    The remote server must support HTTP range requests. It is usually Apache that handles this. By default, Apache supports range requests, but this feature may have been disabled. You can use curl (external link) to test if a file can be accessed by range requests, via
    curl --head fileUrl
    If a file is accessible via byte range requests (Accept-Ranges), the response will include the line
    Accept-Ranges: bytes
    If you are the administrator for the server and the server doesn't support range requests, you can enable them -- search the web for advice on how to do this in Apache.
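If you prefer to script the check, you can look for the Accept-Ranges line in the HEAD response headers. This sketch assumes you have already captured the header text (for example, the output of curl --head fileUrl):

```python
def supports_byte_ranges(head_response_text):
    """Return True if the HTTP HEAD response headers include
    'Accept-Ranges: bytes' (header names are case-insensitive)."""
    for line in head_response_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "accept-ranges":
            return "bytes" in value.lower()
    return False

headers = "HTTP/1.1 200 OK\r\nAccept-Ranges: bytes\r\nContent-Length: 1024\r\n"
print(supports_byte_ranges(headers))  # True
```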

    To Set This Up in ERDDAP ...
    Enter the URL+baseDirectory in the <fileDir> tag for the dataset. ERDDAP will notice that the fileDir starts with http://, https://, or ftp://, and will try to read the remote directory information. It may fail because ERDDAP can't yet read directory information from all types of remote directories. (If so, send us the URL.) You may need to use the rarely used <pathRegex> tag to specify a regular expression which limits which paths (which subdirectories) will be included in the dataset.

    Accessing ERDDAP's "files" via byte range requests
    Flipping this around: ERDDAP has a /files/ system to make source files accessible, and you can (in theory) think of a dataset in ERDDAP as a giant .nc file by appending ".nc" to the base OPeNDAP URL for a given dataset (e.g., http://myserver.org/erddap/griddap/datasetID.nc, optionally with a ?query after that to specify a subset). So it is reasonable to ask whether you can use netcdf-java, Ferret, or some other NetCDF client software to read data via HTTP range requests from ERDDAP. The answer is no: ERDDAP does not support that. For the files in the /files/ system, that is mostly because the /files/ system is a virtual file system created by ERDDAP which doesn't support range requests. And trying to access datasetID.nc (for the whole dataset or a ?query subset) as a file via range requests doesn't work partly because it is a virtual file.
    Instead, if you want to do this, download the source file(s) from the /files/ system or the .nc?query subset file to your computer and use netcdf-java, Ferret, or some other NetCDF client software to read the (now) local file(s). That will be vastly more efficient because there will be just one big request over the internet to the server, instead of hundreds or thousands of tiny requests.

  • Cached File Information - When an EDDGridFromFiles dataset is first loaded, EDDGridFromFiles reads information from all of the relevant files and creates tables (one row for each file) with information about each valid file and each "bad" (different or invalid) file.
    • The tables are also stored on disk, as NetCDF v3 .nc files in bigParentDirectory/dataset/last2CharsOfDatasetID/datasetID/ in files named:
        dirTable.nc (which holds a list of unique directory names),
        fileTable.nc (which holds the table with each valid file's information),
        badFiles.json (which holds the table with each bad file's information).
    • To speed up access to an EDDGridFromFiles dataset (but at the expense of using more memory), you can use
      <fileTableInMemory>true</fileTableInMemory>
      to tell ERDDAP to keep a copy of the file information tables in memory.
    • The copy of the file information tables on disk is also useful when ERDDAP is shut down and restarted: it saves EDDGridFromFiles from having to re-read all of the data files.
    • When a dataset is reloaded, ERDDAP only needs to read the data in new files and files that have changed.
    • If a file has a different structure from the other files (for example, different data type for one of the variables, different value for the "units" attribute), ERDDAP adds the file to the list of "bad" files. Information about the problem with the file will be written to the bigParentDirectory/logs/log.txt file.
    • You shouldn't ever need to delete or work with these files. One exception is: if you are still making changes to a dataset's datasets.xml setup, you may want to delete these files to force ERDDAP to reread all of the files since the files will be read/interpreted differently. If you ever do need to delete these files, you can do it when ERDDAP is running. (Then set a flag to reload the dataset ASAP.) However, ERDDAP usually notices that the datasets.xml information doesn't match the fileTable information and deletes the file tables automatically.
    • If you want to encourage ERDDAP to update the stored dataset information (for example, if you just added, removed, or changed some files to the dataset's data directory), use the flag system to force ERDDAP to update the cached file information.
  • Handling Requests - When a client's request for data is processed, EDDGridFromFiles can quickly look in the table with the valid file information to see which files have the requested data.
  • Updating the Cached File Information - Whenever the dataset is reloaded, the cached file information is updated.
    • The dataset is reloaded periodically as determined by the <reloadEveryNMinutes> in the dataset's information in datasets.xml.
    • The dataset is reloaded as soon as possible whenever ERDDAP detects that you have added, removed, touch'd (external link) (to change the file's lastModified time), or changed a datafile.
    • The dataset is reloaded as soon as possible if you use the flag system.
    When the dataset is reloaded, ERDDAP compares the currently available files to the cached file information tables. New files are read and added to the valid files table. Files that no longer exist are dropped from the valid files table. Files where the file timestamp has changed are read and their information is updated. The new tables replace the old tables in memory and on disk.
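The reconciliation just described can be sketched with dictionaries mapping file name to last-modified time. This is a hypothetical simplification of ERDDAP's fileTable logic, for illustration only:

```python
def reconcile(cached, current):
    """cached / current: dicts of {file_name: last_modified_time}.
    Returns which files must be (re)read and which dropped."""
    to_read = [f for f, mtime in current.items()
               if f not in cached or cached[f] != mtime]   # new or changed
    to_drop = [f for f in cached if f not in current]      # no longer exist
    return sorted(to_read), sorted(to_drop)

cached  = {"a.nc": 100, "b.nc": 200, "c.nc": 300}
current = {"a.nc": 100, "b.nc": 250, "d.nc": 400}  # b changed, c gone, d new
print(reconcile(cached, current))  # (['b.nc', 'd.nc'], ['c.nc'])
```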
  • Bad Files - The table of bad files and the reasons the files were declared bad (corrupted file, missing variables, etc.) is emailed to the emailEverythingTo email address (probably you) every time the dataset is reloaded. You should replace or repair these files as soon as possible.
  • FTP Trouble/Advice - If you FTP new data files to the ERDDAP server while ERDDAP is running, there is the chance that ERDDAP will be reloading the dataset during the FTP process. It happens more often than you might think! If it happens, the file will appear to be valid (it has a valid name), but the file isn't yet valid. If ERDDAP tries to read data from that invalid file, the resulting error will cause the file to be added to the table of invalid files. This is not good. To avoid this problem, use a temporary file name when FTP'ing the file, for example, ABC2005.nc_TEMP . Then, the fileNameRegex test (see below) will indicate that this is not a relevant file. After the FTP process is complete, rename the file to the correct name. The renaming process will cause the file to become relevant in an instant.
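The reason the temporary name works is simply that the fileNameRegex doesn't match it until the rename. For example, with a fileNameRegex of .*\.nc:

```python
import re

file_name_regex = r".*\.nc"

# During the FTP transfer, the temporary name doesn't match ...
print(bool(re.fullmatch(file_name_regex, "ABC2005.nc_TEMP")))  # False
# ... so ERDDAP ignores the incomplete file. After the rename, it matches:
print(bool(re.fullmatch(file_name_regex, "ABC2005.nc")))       # True
```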
  • "0 files" Error Message - If you run GenerateDatasetsXml or DasDds, or if you try to load an EDDGridFrom...Files dataset in ERDDAP, and you get a "0 files" error message indicating that ERDDAP found 0 matching files in the directory (when you think that there are matching files in that directory):
    • Check that the files really are in that directory.
    • Check the spelling of the directory name.
    • Check the fileNameRegex. It's really, really easy to make mistakes with regexes. For test purposes, try the regex .* which should match all file names.
    • Check that the user who is running the program (e.g., user=tomcat (?) for Tomcat/ERDDAP) has 'read' permission for those files.
    • In some operating systems (for example, SE Linux) and depending on system settings, the user who ran the program must have 'read' permission for the whole chain of directories leading to the directory that has the files.
       
  • The skeleton XML for all EDDGridFromFiles subclasses is:
    <dataset type="EDDGridFrom...Files" datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <accessibleViaWMS>...</accessibleViaWMS> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <updateEveryNMillis>...</updateEveryNMillis> <!-- 0 or 1. For EDDGridFromFiles subclasses, 
        this uses Java's WatchDirectory system to notice new/deleted/changed files, 
        so it should be fast and efficient. -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <fileDir>...</fileDir> <!-- The directory (absolute) with the data files. -->
      <recursive>true|false</recursive> <!-- 0 or 1. Indicates if subdirectories
        of fileDir have data files, too. -->
      <pathRegex>...</pathRegex>  <!-- 0 or 1. Only directory names which 
        match the pathRegex (default=".*") will be accepted. -->
      <fileNameRegex>...</fileNameRegex> <!-- 0 or 1. A regular expression (external link) 
        (tutorial (external link)) describing valid data file names, 
        for example, ".*\.nc" for all .nc files. -->
      <accessibleViaFiles>true|false(default)</accessibleViaFiles> <!-- 0 or 1 -->
      <metadataFrom>...</metadataFrom> <!-- The file to get 
        metadata from ("first" or "last" (the default) based on file's 
        lastModifiedTime). -->
      <fileTableInMemory>...</fileTableInMemory> <!-- 0 or 1 (true or false (the default)) -->
      <addAttributes>...</addAttributes> <!-- 0 or 1 -->
      <axisVariable>...</axisVariable> <!-- 1 or more -->
      <dataVariable>...</dataVariable> <!-- 1 or more -->
    </dataset>
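As a concrete illustration of the skeleton above, here is a minimal, hypothetical chunk for an EDDGridFromNcFiles dataset (the datasetID, directory, attribute, and variable names are invented; GenerateDatasetsXml would fill in the real values for your files):

```xml
<dataset type="EDDGridFromNcFiles" datasetID="mySstDaily" active="true">
  <reloadEveryNMinutes>10080</reloadEveryNMinutes>
  <fileDir>/data/sst/</fileDir>          <!-- absolute directory with the data files -->
  <recursive>true</recursive>            <!-- also look in subdirectories -->
  <fileNameRegex>.*\.nc</fileNameRegex>  <!-- match all .nc files; try .* when testing -->
  <metadataFrom>last</metadataFrom>      <!-- get metadata from the most recent file -->
  <addAttributes>
    <att name="title">SST, Daily (hypothetical example)</att>
  </addAttributes>
  <axisVariable><sourceName>time</sourceName></axisVariable>
  <axisVariable><sourceName>latitude</sourceName></axisVariable>
  <axisVariable><sourceName>longitude</sourceName></axisVariable>
  <dataVariable><sourceName>sst</sourceName></dataVariable>
</dataset>
```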
    
     

EDDGridFromMergeIRFiles aggregates data from local, MergeIR (external link) files, which are from the Tropical Rainfall Measuring Mission (TRMM) (external link), which is a joint mission between NASA and the Japan Aerospace Exploration Agency (JAXA). MergeIR files can be downloaded from NASA (external link).

EDDGridFromMergeIRFiles.java was written and contributed to the ERDDAP project by Jonathan Lafite and Philippe Makowski of R.Tech Engineering (external link) (license: copyrighted open source).

EDDGridFromMergeIRFiles is a little unusual:

  • EDDGridFromMergeIRFiles supports compressed or uncompressed source data files, in any combination, in the same dataset. This allows you, for example, to compress older files that are rarely accessed, while leaving new, frequently accessed files uncompressed. Or, you can change the type of compression from the original .Z to, for example, .gz.
  • If you have compressed and uncompressed versions of the same data files in the same directory, please make sure the <fileNameRegex> for your dataset matches the file names that you want it to match and doesn't match file names that you don't want it to match.
  • Uncompressed source data files must have no file extension (i.e., no "." in the file name).
  • Compressed source data files must have a file extension, but ERDDAP determines the type of compression by inspecting the contents of the file, not by looking at the file's file extension (for example, ".Z"). The supported compression types include "gz", "bzip2", "xz", "lzma", "snappy-raw", "snappy-framed", "pack200", and "z". When ERDDAP reads compressed files, it decompresses on-the-fly, without writing to a temporary file.
  • All source data files must use the original file naming system: i.e., merg_YYYYMMDDHH_4km-pixel (where YYYYMMDDHH indicates the time associated with the data in the file), plus a file extension if the file is compressed.
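Given the naming rules above, a <fileNameRegex> that accepts both uncompressed files and .gz-compressed copies might look like this (a sketch; adjust the extension to whatever compression you actually use):

```xml
<fileNameRegex>merg_[0-9]{10}_4km-pixel(\.gz)?</fileNameRegex>
```

This matches names like merg_2015010100_4km-pixel and merg_2015010100_4km-pixel.gz, but not leftover files such as merg_2015010100_4km-pixel.nc_TEMP.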

See this class' superclass, EDDGridFromFiles, for general information on how this class works and how to use this class.

We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it.
 

EDDGridFromNcFiles aggregates data from local, gridded, GRIB .grb and .grb2 (external link) files, HDF (v4 or v5) .hdf (external link) files, .ncml files, and NetCDF (v3 or v4) .nc (external link) files. It may work with other file types (for example, BUFR); we just haven't tested it -- please send us some sample files.

  • Note that for GRIB files, ERDDAP will make a .gbx index file the first time it reads each GRIB file. So the GRIB files must be in a directory where the "user" that ran Tomcat has read+write permission.
  • See this class' superclass, EDDGridFromFiles, for information on how this class works and how to use this class.
  • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it.

    The first thing GenerateDatasetsXml does for this type of dataset after you answer the questions is print the ncdump-like structure of the sample file. So if you enter a few goofy answers for the first loop through GenerateDatasetsXml, at least you'll be able to see if ERDDAP can read the file and see what dimensions and variables are in the file. Then you can give better answers for the second loop through GenerateDatasetsXml.

EDDGridFromNcFilesUnpacked is a variant of EDDGridFromNcFiles which aggregates data from local, gridded NetCDF (v3 or v4) .nc and related files. The difference is that this class unpacks each data file before EDDGridFromFiles looks at the files:

  • It unpacks variables that are packed with scale_factor and/or add_offset.
  • It promotes integer variables that have "_Unsigned=true" attributes to the next larger integer data type so that the values appear as the unsigned values. For example, an _Unsigned=true byte (8 bit) variable becomes a signed short (16 bit) variable.
  • It converts _FillValue and missing_value values to NaN (or MAX_VALUE for integer data types).
  • It converts time and timestamp values to "seconds since 1970-01-01T00:00:00Z".
The big advantage of this class is that it provides a way to deal with different values of scale_factor, add_offset, _FillValue, missing_value, or time units in different files in a collection. Otherwise, you would have to use a tool like NcML or NCO to modify each file to remove the differences so that the files could be handled by EDDGridFromNcFiles. For this class to work properly, the files must follow the CF standards for the related attributes.
  • If you try to make an EDDGridFromNcFilesUnpacked dataset from a group of files with which you previously tried and failed to make an EDDGridFromNcFiles dataset, cd to
    bigParentDirectory/dataset/last2Letters/datasetID/
    (where last2Letters is the last 2 letters of the datasetID)
    and delete all of the files in that directory.
  • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it.

    The first thing GenerateDatasetsXml does for this type of dataset after you answer the questions is print the ncdump-like structure of the sample file before it is unpacked. So if you enter a few goofy answers for the first loop through GenerateDatasetsXml, at least you'll be able to see if ERDDAP can read the file and see what dimensions and variables are in the file. Then you can give better answers for the second loop through GenerateDatasetsXml.

EDDGridLonPM180 modifies the longitude values of a child (enclosed) EDDGrid dataset that has some longitude values greater than 180 (for example, 0 to 360) so that they are in the range -180 to 180 (Longitude Plus or Minus 180, hence the name).

  • This provides a way to make datasets that have longitude values greater than 180 compatible with OGC services (for example, the WMS server in ERDDAP), since all OGC services require longitude values within -180 to 180.
  • Working near a discontinuity causes problems, regardless of whether the discontinuity is at longitude 0 or at longitude 180. This dataset type lets you avoid those problems for everyone, by offering two versions of the same dataset:
    one with longitude values in the range 0 to 360 ("Pacificentric"?),
    one with longitude values in the range -180 to 180 ("Atlanticentric"?).
  • For child datasets with all longitude values greater than 180, all of the new longitude values are simply 360 degrees lower. For example, a dataset with longitude values of 180 to 240 would become a dataset with longitude values of -180 to -120.
  • For child datasets that have longitude values for the entire globe (roughly 0 to 360), the new longitude values will be rearranged to be (roughly) -180 to 180:
    The original 0 to almost 180 values are unchanged.
    The original 180 to 360 values are converted to -180 to 0 and shifted to the beginning of the longitude array.
  • For child datasets that span 180 but don't cover the globe, ERDDAP inserts missing values as needed to make a dataset which covers the globe. For example, a child dataset with longitude values of 140 to 200 would become a dataset with longitude values of -180 to 180.
    The child values of 180 to 200 would become -180 to -160.
    New longitude values would be inserted from -160 to 140. The corresponding data values will be _FillValues.
    The child values of 140 to almost 180 would be unchanged.
    The insertion of missing values may seem odd, but it avoids several problems that result from having longitude values that jump suddenly (e.g., from -160 to 140).
  • In GenerateDatasetsXml, there is a special "dataset type", EDDGridLonPM180FromErddapCatalog, that lets you generate the datasets.xml for EDDGridLonPM180 datasets from each of the EDDGrid datasets in an ERDDAP that have any longitude values greater than 180. This facilitates offering two versions of these datasets:
    the original, with longitude values in the range 0 to 360,
    and the new dataset, with longitude values in the range -180 to 180.

    The child dataset within each EDDGridLonPM180 dataset will be an EDDGridFromErddap dataset which points to the original dataset.
    The new dataset's datasetID will be the name of the original dataset plus "_LonPM180".
    For example,

    <dataset type="EDDGridLonPM180" datasetID="erdMBsstdmday_LonPM180" active="true">
        <dataset type="EDDGridFromErddap" datasetID="erdMBsstdmday_LonPM180Child">
            <!-- SST, Aqua MODIS, NPP, 0.025 degrees, Pacific Ocean, Daytime (Monthly Composite)
                 minLon=120.0 maxLon=320.0 -->
            <sourceUrl>https://coastwatch.pfeg.noaa.gov/erddap/griddap/erdMBsstdmday</sourceUrl>
        </dataset>
    </dataset> 
    Put the EDDGridLonPM180 dataset below the original dataset in datasets.xml. That avoids some possible problems.

    Alternatively, you can replace the EDDGridFromErddap child dataset with the original dataset's datasets.xml. Then, there will be only one version of the dataset: the one with longitude values within -180 to 180. We discourage this because there are times when each version of the dataset is more convenient.

  • If you offer two versions of a dataset, for example, one with longitude 0 to 360 and one with longitude -180 to 180:
    • You can use the optional <accessibleViaWMS>false</accessibleViaWMS> with the 0-360 dataset to forcibly disable the WMS service for that dataset. Then, only the LonPM180 version of the dataset will be accessible via WMS.
    • There are a couple of ways to keep the LonPM180 dataset up-to-date with changes to the underlying dataset:
      • If the child dataset is a EDDGridFromErddap dataset that references a dataset in the same ERDDAP, the LonPM180 dataset will try to directly subscribe to the underlying dataset so that it is always up-to-date. Direct subscriptions don't generate emails asking you to validate the subscription - validation should be done automatically.
      • If the child dataset is not an EDDGridFromErddap dataset that is on the same ERDDAP, the LonPM180 dataset will try to use the regular subscription system to subscribe to the underlying dataset. If you have the subscription system in your ERDDAP turned on, you should get emails asking you to validate the subscription. Please do so.
      • If you have the subscription system in your ERDDAP turned off, the LonPM180 dataset may sometimes have outdated metadata until the LonPM180 dataset is reloaded. So if the subscription system is turned off, you should set the <reloadEveryNMinutes> setting of the LonPM180 dataset to a smaller number, so that it is more likely to catch changes to the child dataset sooner.
  • The skeleton XML for an EDDGridLonPM180 dataset is:
    <dataset type="EDDGridLonPM180" datasetID="..." active="..." >
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 --> 
      <updateEveryNMillis>...</updateEveryNMillis> <!-- 0 or 1. For EDDGridFromDap, 
        this gets the remote .dds and then gets the new leftmost dimension values. -->
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <accessibleViaWMS>...</accessibleViaWMS> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <dataset>...</dataset> <!-- The child dataset. -->
    </dataset>
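If you keep both versions of a dataset, the <accessibleViaWMS>false</accessibleViaWMS> tag goes in the original 0-360 dataset's chunk, not in the EDDGridLonPM180 dataset. A sketch (the dataset type and details are abbreviated and assumed; adapt them to your original dataset's actual chunk):

```xml
<dataset type="EDDGridFromDap" datasetID="erdMBsstdmday" active="true">
  <accessibleViaWMS>false</accessibleViaWMS> <!-- only the _LonPM180 version is served via WMS -->
  ...
</dataset>
```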
    
     

EDDGridSideBySide aggregates two or more EDDGrid datasets (the children) side by side.

  • The resulting dataset has all of the variables from all of the child datasets.
  • The parent dataset and all of the child datasets MUST have different datasetIDs. If any names in a family are exactly the same, the dataset will fail to load (with the error message that the values of the aggregated axis are not in sorted order).
  • All children MUST have the same source values for axisVariables[1+] (for example, latitude, longitude). The precision of the testing is determined by the matchAxisNDigits.
  • The children may have different source values for axisVariables[0] (for example, time), but they are usually largely the same.
  • The parent dataset will appear to have all of the axisVariables[0] source values from all of the children.
  • For example, this lets you combine a source dataset with a vector's u-component and another source dataset with a vector's v-component, so the combined data can be served.
  • Children created by this method are held privately. They are not separately accessible datasets (for example, by client data requests or by flag files).
  • The global metadata and settings for the parent come from the global metadata and settings for the first child.
  • If there is an exception while creating the first child, the parent will not be created.
  • If there is an exception while creating other children, this sends an email to emailEverythingTo (as specified in setup.xml) and continues with the other children.
  • The skeleton XML for an EDDGridSideBySide dataset is:
    <dataset type="EDDGridSideBySide" datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <accessibleViaWMS>...</accessibleViaWMS> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <dataset>...</dataset> <!-- 2 or more -->
    </dataset>
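To make the u- and v-component example above concrete, a hypothetical EDDGridSideBySide chunk might look like this (all datasetIDs and URLs are invented; each child is a complete EDDGrid dataset description):

```xml
<dataset type="EDDGridSideBySide" datasetID="myWind" active="true">
  <dataset type="EDDGridFromErddap" datasetID="myWindU">
    <sourceUrl>https://someServer/erddap/griddap/windU</sourceUrl> <!-- hypothetical -->
  </dataset>
  <dataset type="EDDGridFromErddap" datasetID="myWindV">
    <sourceUrl>https://someServer/erddap/griddap/windV</sourceUrl> <!-- hypothetical -->
  </dataset>
</dataset>
```

Remember that the parent and both children need distinct datasetIDs, and the children must have identical source values for axisVariables[1+].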
    
     

EDDGridAggregateExistingDimension aggregates two or more EDDGrid datasets each of which has a different range of values for the first dimension, but identical values for the other dimensions.

  • For example, one child dataset might have 366 values (for 2004) for the time dimension and another child might have 365 values (for 2005) for the time dimension.
  • All the values for all of the other dimensions (for example, latitude, longitude) MUST be identical for all of the children. The precision of the test is determined by matchAxisNDigits.
  • Sorted Dimension Values - The values for each dimension MUST be in sorted order (ascending or descending). The values can be irregularly spaced. There can be no ties. This is a requirement of the CF metadata standard (external link). If any dimension's values aren't in sorted order, the dataset won't be loaded and ERDDAP will identify the first unsorted value in the log file, bigParentDirectory/logs/log.txt .

    Unsorted dimension values almost always indicate a problem with the source dataset. This most commonly occurs when a misnamed or inappropriate file is included in the aggregation, which leads to an unsorted time dimension. To solve this problem, see the error message in the ERDDAP log.txt file to find the offending time value. Then look in the source files to find the corresponding file (or one before or one after) that doesn't belong in the aggregation.

  • The parent dataset and the child datasets MUST have different datasetIDs. If any names in a family are exactly the same, the dataset will fail to load (with the error message that the values of the aggregated axis are not in sorted order).
  • Currently, the child dataset MUST be an EDDGridFromDap dataset and MUST have the lowest values of the aggregated dimension (usually the oldest time values). All of the other children MUST be almost identical datasets (differing just in the values for the first dimension) and are specified by just their sourceUrl.
  • The aggregate dataset gets its metadata from the first child.
  • The GenerateDatasetsXml program can make a rough draft of the datasets.xml for an EDDGridAggregateExistingDimension based on a set of files served by a Hyrax or THREDDS server. For example, use this input for the program (the "/1988" in the URL makes the example run faster):
      EDDType? EDDGridAggregateExistingDimension
      Server type (hyrax, thredds, or dodsindex)? hyrax
      Parent URL (for example, for hyrax, ending in "contents.html";
        for thredds, ending in "catalog.xml")
      ? http://dods.jpl.nasa.gov/opendap/ocean_wind/ccmp/L3.5a/data/
        flk/1988/contents.html
      File name regex (for example, ".*\.nc")? month.*flk\.nc\.gz
      ReloadEveryNMinutes (for example, 10080)? 10080

    You can use the resulting <sourceUrl> tags, or delete them and uncomment the <sourceUrls> tag (so that new files are noticed each time the dataset is reloaded).
  • The skeleton XML for an EDDGridAggregateExistingDimension dataset is:
    <dataset type="EDDGridAggregateExistingDimension" datasetID="..." 
        active="..." >
      <dataset>...</dataset> <!-- This is a regular EDDGridFromDap 
        dataset description child with the lowest values for the aggregated dimensions. -->
      <sourceUrl>...</sourceUrl> <!-- 0 or many; the sourceUrls for 
        other children.  These children must be listed in order of ascending values 
        for the aggregated dimension. -->
      <sourceUrls serverType="..." regex="..." recursive="true" pathRegex=".*"
        >http://someServer/someDirectory/someSubdirectory/catalog.xml</sourceUrls> 
        <!-- 0 or 1. This specifies how to find the other children, instead 
        of using separate sourceUrl tags for each child.  The advantage of this
        is: new children will be detected each time the dataset is reloaded. 
        The serverType must be "thredds", "hyrax", or "dodsindex".  
        An example of a regular expression (external link) (regex) 
        (tutorial (external link)) is .*\.nc 
        recursive can be "true" or "false".  
        Only directory names which match the pathRegex (default=".*") will be accepted. 
        An example of a thredds catalogUrl is
        http://thredds1.pfeg.noaa.gov/thredds/catalog/Satellite/aggregsatMH/chla/catalog.xml (external link)
        An example of a hyrax catalogUrl is
        http://podaac-opendap.jpl.nasa.gov/opendap/allData/ccmp/L3.5a/monthly/flk/1988/contents.html (external link)
        An example of a dodsindex URL is
        http://www.marine.csiro.au/dods/nph-dods/dods-data/bl/BRAN2.1/bodas/ (external link) 
        (Note the "DODS Index of /..." at the top of the page.)
        When these children are sorted by file name, they must be in order of
        ascending values for the aggregated dimension. -->
      <matchAxisNDigits>...</matchAxisNDigits> <!-- 0 or 1 -->
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <accessibleViaWMS>...</accessibleViaWMS> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
    </dataset>
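For instance, a hypothetical aggregation along time that uses a <sourceUrls> tag to find the other children (server, URLs, and regex are all invented) could look like:

```xml
<dataset type="EDDGridAggregateExistingDimension" datasetID="myAggregate" active="true">
  <dataset type="EDDGridFromDap" datasetID="myAggregateFirstChild">
    <!-- the child with the lowest (oldest) values for the aggregated dimension -->
    <sourceUrl>https://someServer/opendap/data/1988/file.nc</sourceUrl> <!-- hypothetical -->
    ...
  </dataset>
  <sourceUrls serverType="hyrax" regex="month.*flk\.nc\.gz" recursive="true" pathRegex=".*"
    >https://someServer/opendap/data/contents.html</sourceUrls> <!-- hypothetical -->
</dataset>
```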
    
     

EDDGridCopy makes and maintains a local copy of another EDDGrid's data and serves data from the local copy.

  • EDDGridCopy (and for tabular data, EDDTableCopy) is a very easy-to-use and very effective
    solution to some of the biggest problems with serving data from a remote data source:
    • Accessing data from a remote data source can be slow.
      • It may be slow because it is inherently slow (for example, an inefficient type of server),
      • because it is overwhelmed by too many requests,
      • or because your server or the remote server is bandwidth limited.
    • The remote dataset is sometimes unavailable (again, for a variety of reasons).
    • Relying on one source for the data doesn't scale well (for example, when many users and many ERDDAPs utilize it).
       
  • How It Works - EDDGridCopy solves these problems by automatically making and maintaining a local copy of the data and serving data from the local copy. ERDDAP can serve data from the local copy very, very quickly. Making a local copy also relieves the burden on the remote server, and the local copy is a backup of the original, which is useful in case something happens to the original.

    There is nothing new about making a local copy of a dataset. What is new here is that this class makes it *easy* to create and *maintain* a local copy of data from a *variety* of types of remote data sources and *add metadata* while copying the data.

  • Chunks of Data - EDDGridCopy makes the local copy of the data by requesting chunks of data from the remote <dataset> . There will be a chunk for each value of the leftmost axis variable. Note that EDDGridCopy doesn't rely on the remote dataset's index numbers for the axis -- those may change.

    WARNING: If the size of a chunk of data is so big (> 2GB) that it causes problems, EDDGridCopy can't be used. (Sorry, we hope to have a solution for this problem in the future.)

  • Local Files - Each chunk of data is stored in a separate NetCDF file in a subdirectory of bigParentDirectory/copy/datasetID/ (as specified in setup.xml). File names created from axis values are modified to make them file-name-safe (for example, hyphens are replaced by "x2D") -- this doesn't affect the actual data.
     
  • New Data - Each time EDDGridCopy is reloaded, it checks the remote <dataset> to see what chunks are available. If the file for a chunk of data doesn't already exist, a request to get the chunk is added to a queue. ERDDAP's taskThread processes all the queued requests for chunks of data, one-by-one. You can see statistics for the taskThread's activity on the Status Page and in the Daily Report. (Yes, ERDDAP could assign multiple tasks to this process, but that would use up lots of the remote data source's bandwidth, memory, and CPU time, and lots of the local ERDDAP's bandwidth, memory, and CPU time, neither of which is a good idea.)

    NOTE: The very first time an EDDGridCopy is loaded, (if all goes well) lots of requests for chunks of data will be added to the taskThread's queue, but no local data files will have been created. So the constructor will fail but taskThread will continue to work and create local files. If all goes well, the taskThread will make some local data files and the next attempt to reload the dataset (in ~15 minutes) will succeed, but initially with a very limited amount of data.

    WARNING: If the remote dataset is large and/or the remote server is slow (that's the problem, isn't it?!), it will take a long time to make a complete local copy. In some cases, the time needed will be unacceptable. For example, transmitting 1 TB of data over a T1 line (about 0.19 MB/s) takes at least 60 days, under optimal conditions. Plus, it uses lots of bandwidth, memory, and CPU time on the remote and local computers. The solution is to mail a hard drive to the administrator of the remote dataset so that s/he can make a copy of the dataset and mail the hard drive back to you. Use that data as a starting point and EDDGridCopy will add data to it. (That is one way that Amazon's EC2 Cloud Service (external link) handles the problem, even though their system has lots of bandwidth.)

    WARNING: If a given value for the leftmost axis variable disappears from the remote dataset, EDDGridCopy does NOT delete the local copied file. If you want to, you can delete it yourself.

  • Recommended use -
    1. Create the <dataset> entry (the native type, not EDDGridCopy) for the remote data source.
      Get it working correctly, including all of the desired metadata.
    2. If it is too slow, add XML code to wrap it in an EDDGridCopy dataset.
      • Use a different datasetID (perhaps by changing the old datasetID slightly).
      • Copy the <accessibleTo>, <reloadEveryNMinutes> and <onChange> from the remote EDDGrid's XML to the EDDGridCopy's XML. (Their values for EDDGridCopy matter; their values for the inner dataset become irrelevant.)
    3. ERDDAP will make and maintain a local copy of the data.
       
  • WARNING: EDDGridCopy assumes that the data values for each chunk don't ever change. If/when they do, you need to manually delete the chunk files in bigParentDirectory/copy/datasetID/ which changed and flag the dataset to be reloaded so that the deleted chunks will be replaced. If you have an email subscription to the dataset, you will get two emails: one when the dataset first reloads and starts to copy the data, and another when the dataset loads again (automatically) and detects the new local data files.
     
  • All axis values must be equal.
    For each of the axes except the leftmost, all of the values must be equal for all children. The precision of the test is determined by matchAxisNDigits.
  • Settings, Metadata, Variables - EDDGridCopy uses settings, metadata, and variables from the enclosed source dataset.
  • Change Metadata - If you need to change any addAttributes or change the order of the variables associated with the source dataset:
    1. Change the addAttributes for the source dataset in datasets.xml, as needed.
    2. Delete one of the copied files.
    3. Set a flag to reload the dataset immediately. If you do use a flag and you have an email subscription to the dataset, you will get two emails: one when the dataset first reloads and starts to copy the data, and another when the dataset loads again (automatically) and detects the new local data files.
    4. The deleted file will be regenerated with the new metadata. If the source dataset is ever unavailable, the EDDGridCopy dataset will get metadata from the regenerated file, since it is the youngest file.
       
  • Skeleton XML - The skeleton XML for an EDDGridCopy dataset is:
    <dataset type="EDDGridCopy" datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <accessibleViaFiles>true|false(default)</accessibleViaFiles> <!-- 0 or 1 -->
      <accessibleViaWMS>...</accessibleViaWMS> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <fileTableInMemory>...</fileTableInMemory> <!-- 0 or 1 (true or false (the default)) -->
      <dataset>...</dataset> <!-- 1 -->
    </dataset>
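Following the recommended steps above, wrapping a working remote dataset in EDDGridCopy is mostly a matter of nesting its chunk inside an EDDGridCopy chunk and moving a few tags to the outer dataset. A sketch (datasetIDs and URL are invented):

```xml
<dataset type="EDDGridCopy" datasetID="erdExampleCopy" active="true">
  <accessibleTo>...</accessibleTo>                  <!-- moved from the inner dataset -->
  <reloadEveryNMinutes>10080</reloadEveryNMinutes>  <!-- moved from the inner dataset -->
  <onChange>...</onChange>                          <!-- moved from the inner dataset -->
  <dataset type="EDDGridFromDap" datasetID="erdExample">
    <!-- the original, working dataset description goes here;
         its accessibleTo/reloadEveryNMinutes/onChange values are now irrelevant -->
    <sourceUrl>https://someServer/opendap/erdExample</sourceUrl> <!-- hypothetical -->
    ...
  </dataset>
</dataset>
```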
    
     

EDDTableFromCassandra handles data from one Cassandra table.

  • One Table - Cassandra doesn't support "joins" in the way that relational databases do. One ERDDAP EDDTableFromCassandra dataset maps to one (perhaps a subset of one) Cassandra table.
     
  • datasets.xml
    • ERDDAP comes with the Cassandra Java driver, so you don't need to install it separately.
    • Carefully read all of this document's information about EDDTableFromCassandra. Some of the details are very important.
    • The Cassandra Java driver is intended to work with Apache Cassandra (1.2+) and DataStax Enterprise (3.1+). If you are using Apache Cassandra 1.2.x, you must edit the cassandra.yaml file for each node to set start_native_transport: true, then restart each node.
    • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it (especially <partitionKeySourceNames>). You can gather most of the information you need to create the XML for an EDDTableFromCassandra dataset by contacting the Cassandra administrator and by searching the web.

      GenerateDatasetsXml has two special options for EDDTableFromCassandra:

      1. If you enter "!!!LIST!!!" (without the quotes) for the keyspace, the program will display a list of keyspaces.
      2. If you enter a specific keyspace and then enter "!!!LIST!!!" (without the quotes) for the tablename, the program will display a list of tables in that keyspace and their columns.
    • Case-insensitive Keyspace and Table Names -
      Cassandra treats keyspace and table names in a case-insensitive way. Because of this, you MUST NEVER use a reserved word (but with a different case) as a Cassandra keyspace or table name.
    • Case-insensitive Column Names -
      By default, Cassandra treats column names in a case-insensitive way. If you use one of Cassandra's reserved words as a column name (please don't!), you MUST use
      <columnNameQuotes>"</columnNameQuotes>
      in datasets.xml for this dataset so that Cassandra and ERDDAP will treat the column names in a case-sensitive way. This will likely be a massive headache for you, because it is hard to determine the case-sensitive versions of the column names -- Cassandra almost always displays the column names as all lower-case, regardless of the true case.
    • Work closely with the Cassandra administrator, who may have relevant experience. If the dataset fails to load, read the error message carefully to find out why.
       
  • <connectionProperty>
    Cassandra has connection properties which can be specified in datasets.xml. Many of these affect the performance of the Cassandra-ERDDAP connection. Unfortunately, Cassandra properties must be set programmatically in Java, so ERDDAP must have code for each property it supports. Currently, ERDDAP supports these properties:
    (The defaults shown are what we see. Your system's defaults may be different.)
    • General Options
      <connectionProperty name="compression">none|LZ4|snappy</connectionProperty> (case-insensitive, default=none)
      (General compression advice: use 'none' if the connection between Cassandra and ERDDAP is local/fast and use 'LZ4' if the connection is remote/slow.)
      <connectionProperty name="credentials">username/password</connectionProperty> (that's a literal '/')
      <connectionProperty name="metrics">true|false</connectionProperty> (default=true)
      <connectionProperty name="port">anInteger</connectionProperty> (default for native binary protocol=9042)
      <connectionProperty name="ssl">true|false</connectionProperty> (default=false)
      (My quick attempt to use ssl failed. If you succeed, please tell me how you did it.)
    • Query Options
      <connectionProperty name="consistencyLevel">all|any|each_quorum|local_one|local_quorum|local_serial|one|quorum|serial|three|two</connectionProperty> (case-insensitive, default=ONE)
      <connectionProperty name="fetchSize">anInteger</connectionProperty> (default=5000)
      (Do not set fetchSize to a smaller value.)
      <connectionProperty name="serialConsistencyLevel">all|any|each_quorum|local_one|local_quorum|local_serial|one|quorum|serial|three|two</connectionProperty> (case-insensitive, default=SERIAL)
    • Socket Options
      <connectionProperty name="connectTimeoutMillis">anInteger</connectionProperty> (default=5000)
      (Do not set connectTimeoutMillis to a smaller value.)
      <connectionProperty name="keepAlive">true|false</connectionProperty>
      <connectionProperty name="readTimeoutMillis">anInteger</connectionProperty>
      (Cassandra's default readTimeoutMillis is 12000, but ERDDAP changes the default to 120000. If Cassandra is throwing readTimeouts, increasing this may not help, because Cassandra sometimes throws them before this time. The problem is more likely that you are storing too much data per partitionKey combination.)
      <connectionProperty name="receiveBufferSize">anInteger</connectionProperty>
      (It is unclear what the default receiveBufferSize is. Don't set this to a small value.)
      <connectionProperty name="soLinger">anInteger</connectionProperty>
      <connectionProperty name="tcpNoDelay">true|false</connectionProperty> (default=null)

    If you need to be able to set other connection properties, please send an email with the details to
    bob dot simons at noaa dot gov.
    Or, you can join the ERDDAP Google Group / Mailing List and post your question there.

    For a given startup of Tomcat, connectionProperties are only used the first time a dataset is created for a given Cassandra URL. All reloads of that dataset and all subsequent datasets that share the same URL will use those original connectionProperties.
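    As an illustration, several connectionProperty tags might appear together in a dataset's chunk of datasets.xml like this (the datasetID, IP address, and values here are hypothetical; use whatever your Cassandra setup requires):

```xml
<dataset type="EDDTableFromCassandra" datasetID="myCassandraData" active="true">
  <ipAddress>127.0.0.1</ipAddress>
  <connectionProperty name="compression">LZ4</connectionProperty>
  <connectionProperty name="credentials">erddapUser/myPassword</connectionProperty>
  <connectionProperty name="readTimeoutMillis">120000</connectionProperty>
  ...
</dataset>
```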

  • CQL - The Cassandra Query Language (CQL) is superficially like SQL, the query language used by traditional databases. Because OPeNDAP's tabular data requests were designed to mimic SQL tabular data requests, it is possible for ERDDAP to convert tabular data requests into CQL Bound/PreparedStatements. ERDDAP logs the statement in log.txt as
      statement as text: theStatementAsText

    Note that the version of the statement you see will be a text representation of the statement and will only have "?" where constraint values will be placed.
     
    Not so simple - Unfortunately, CQL has many restrictions on which columns can be queried with which types of constraints (for example, partition key columns can be constrained with = and IN). So ERDDAP sends some constraints to Cassandra and then applies all of the constraints after the data is received from Cassandra. To help ERDDAP deal efficiently with Cassandra, you need to specify <partitionKeySourceNames>, <clusterColumnSourceNames>, and <indexColumnSourceNames> in datasets.xml for this dataset. These are the most important ways to help ERDDAP work efficiently with Cassandra. If you don't tell ERDDAP this information, the dataset will be painfully slow in ERDDAP and use tons of Cassandra resources.
     
  • <partitionKeySourceNames> - Because partition keys play a central role in Cassandra tables, ERDDAP needs to know their sourceNames and, if relevant, other information about how to work with them.
    • You MUST specify a comma-separated list of partition key source column names in datasets.xml via <partitionKeySourceNames>.
      Simple example,
      <partitionKeySourceNames>station, deviceid</partitionKeySourceNames>
      More complex example,
      <partitionKeySourceNames>deviceid=1007, date/sampletime/1970-01-01</partitionKeySourceNames>
    • TimeStamp Partition Keys - If one of the partition key columns is a timestamp column that has a coarser version of another timestamp column, specify this via
      partitionKeySourceName/otherColumnSourceName/time_precision
      where time_precision is one of the time_precision strings used elsewhere in ERDDAP.
      The trailing Z in the time_precision string is the default, so it doesn't matter if the time_precision string ends in Z or not.
      For example, ERDDAP will interpret date/sampletime/1970-01-01 as "Constraints for date can be constructed from constraints on sampletime by using this time_precision." The actual conversion of constraints is more complex, but that is the overview.
      Use this whenever it is relevant. It enables ERDDAP to work efficiently with Cassandra. If this relationship between columns exists in a Cassandra table and you don't tell ERDDAP, the dataset will be painfully slow in ERDDAP and use tons of Cassandra resources.
    • Single Value Partition Keys - If you want an ERDDAP dataset to work with only one value of one partition key, specify partitionKeySourceName=value.
      Don't use quotes for a numeric column, for example, deviceid=1007
      Do use quotes for a string column, for example, stationid="Point Pinos"
    • Dataset Default Sort Order - The order of the partition key <dataVariable>'s in datasets.xml determines the default sort order of the results from Cassandra. Of course, users can request a different sort order for a given set of results by appending &orderBy("comma-separated list of variables") to the end of their query.
    • By default, Cassandra and ERDDAP treat column names in a case-insensitive way. But if you set columnNameQuotes to ", ERDDAP will treat Cassandra column names in a case-sensitive way.
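    As an informal illustration of the timestamp relationship described above (this is a sketch of the concept, not ERDDAP's actual code), a coarser-precision partition key like date holds each sampletime value floored to the precision named by the time_precision string. A tiny Python sketch handling just two precisions:

```python
from datetime import datetime, timezone

def coarsen(t, time_precision):
    """Illustrative sketch only: floor a timestamp to the coarser
    precision named by an ERDDAP time_precision string (trailing Z optional)."""
    p = time_precision.rstrip("Z")
    if p == "1970-01-01":                    # day precision
        return t.replace(hour=0, minute=0, second=0, microsecond=0)
    if p == "1970-01-01T00":                 # hour precision
        return t.replace(minute=0, second=0, microsecond=0)
    raise ValueError("precision not handled in this sketch")

sampletime = datetime(2014, 11, 2, 13, 45, 30, tzinfo=timezone.utc)
print(coarsen(sampletime, "1970-01-01"))  # 2014-11-02 00:00:00+00:00
```

    This is why ERDDAP can construct constraints on date from constraints on sampletime: every sampletime maps deterministically to one date value.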
       
  • <clusterColumnSourceNames> - Cassandra accepts SQL-like constraints on cluster columns, which are the columns that form the second part of the primary key (after the partition key(s)). So, it is essential that you identify these columns via <clusterColumnSourceNames>. This enables ERDDAP to work efficiently with Cassandra. If there are cluster columns and you don't tell ERDDAP, the dataset will be painfully slow in ERDDAP and use tons of Cassandra resources.
    • For example, <clusterColumnSourceNames>myClusterColumn1, myClusterColumn2</clusterColumnSourceNames>
    • If a Cassandra table has no cluster columns, either don't specify <clusterColumnSourceNames>, or specify it with no value.
    • By default, Cassandra and ERDDAP treat column names in a case-insensitive way. But if you set columnNameQuotes to ", ERDDAP will treat Cassandra column names in a case-sensitive way.
       
  • <indexColumnSourceNames> - Cassandra accepts '=' constraints on secondary index columns, which are the columns that you have explicitly created indexes for via
    CREATE INDEX indexName ON keyspace.tableName (columnName);
    (Yes, the parentheses are required.)
    So, it is very useful if you identify these columns via <indexColumnSourceNames>. This enables ERDDAP to work efficiently with Cassandra. If there are index columns and you don't tell ERDDAP, some queries will be needlessly, painfully slow in ERDDAP and use tons of Cassandra resources.
    • For example, <indexColumnSourceNames>myIndexColumn1, myIndexColumn2</indexColumnSourceNames>
    • If a Cassandra table has no index columns, either don't specify <indexColumnSourceNames>, or specify it with no value.
    • WARNING: Cassandra indexes aren't like database indexes. Cassandra indexes only help with '=' constraints. And they are only recommended (external link) for columns that have far fewer distinct values than total values.
    • By default, Cassandra and ERDDAP treat column names in a case-insensitive way. But if you set columnNameQuotes to ", ERDDAP will treat Cassandra column names in a case-sensitive way.
       
  • <maxRequestFraction> - When ERDDAP (re)loads a dataset, ERDDAP gets from Cassandra the list of distinct combinations of the partition keys. For a huge dataset, the number of combinations will be huge. If you want to prevent users from requesting most or all of the dataset (or making a request that asks ERDDAP to download most or all of the data in order to further filter it), you can tell ERDDAP to allow only requests that reduce the number of combinations sufficiently, via <maxRequestFraction>, which is a floating-point number between 1e-10 (which means that a request can't need more than 1 combination in a billion) and 1 (the default, which means that a request can be for the entire dataset).
    For example, if a dataset has 10000 distinct combinations of the partition keys and maxRequestFraction is set to 0.1,
    then requests which need data from 1001 or more combinations will generate an error message,
    but requests which need data from 1000 or fewer combinations will be allowed.

    Generally, the larger the dataset, the lower you should set <maxRequestFraction>. So you might set it to 1 for a small dataset, 0.1 for a medium-sized dataset, 0.01 for a large dataset, and 0.0001 for a huge dataset.

    This approach is far from perfect. It will lead to some reasonable requests being rejected and some too-big requests being allowed. But it is a difficult problem, and this solution is much better than nothing.
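    The check described above amounts to a simple fraction comparison. A hedged sketch (a hypothetical helper, not ERDDAP's code), using the example numbers from above:

```python
def request_allowed(n_combos_needed, n_combos_total, max_request_fraction=1.0):
    # Allow the request only if the fraction of distinct partition-key
    # combinations it needs is <= maxRequestFraction.
    return n_combos_needed / n_combos_total <= max_request_fraction

# 10000 distinct combinations, maxRequestFraction=0.1:
print(request_allowed(1000, 10000, 0.1))  # True  (allowed)
print(request_allowed(1001, 10000, 0.1))  # False (error message)
```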

  • subsetVariables - As with other EDDTable datasets, you can specify a comma-separated list of <dataVariable> destinationNames in a global attribute called "subsetVariables" to identify variables which have a limited number of values. The dataset will then have a .subset web page and show lists of distinct values for those variables in drop-down lists on many web pages.

    Including just partition key variables and static columns in the list is STRONGLY ENCOURAGED. Cassandra will be able to generate the list of distinct combinations very quickly and easily each time the dataset is reloaded. One exception is timestamp partition keys that are coarse versions of some other timestamp column -- it is probably best to leave these out of the list of subsetVariables since there are a large number of values and they aren't very useful to users.

    If you include non-partition-key, non-static variables in the list, it will probably be very computationally expensive for Cassandra each time the dataset is reloaded, because ERDDAP has to look through every row of the dataset to generate the information. In fact, the query is likely to fail. So, except for very small datasets, this is STRONGLY DISCOURAGED.
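    For example, if the partition key columns were station and deviceid (hypothetical names), the global attribute in <addAttributes> might look like:

```xml
<addAttributes>
  <att name="subsetVariables">station, deviceid</att>
</addAttributes>
```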

  • Cassandra DataTypes - Because there is some ambiguity about which Cassandra data types map to which ERDDAP data types, you need to specify a <dataType> tag for each <dataVariable> to tell ERDDAP which dataType to use. The standard ERDDAP dataTypes (and the most common corresponding Cassandra data types) are:
    • boolean (boolean), which ERDDAP then stores as bytes
    • byte (int, if the range is +/-127)
    • short (int, if the range is +/-32767)
    • int (int, counter?, varint?)
    • long (bigint, counter?, varint?)
    • float (float)
    • double (double, decimal (with possible loss of precision), timestamp)
    • char (ascii or text, if they never have more than 1 character)
    • String (ascii, text, varchar, inet, uuid, timeuuid, blob, map, set, list?)

    Cassandra's timestamp is a special case: use ERDDAP's double dataType.

    If you specify a String dataType in ERDDAP for a Cassandra map, set or list, the map, set or list on each Cassandra row will be converted to a single string on a single row in the ERDDAP table. ERDDAP has an alternative system for lists; see below.

    typeLists - ERDDAP's <dataType> tag for Cassandra dataVariables can include the regular ERDDAP dataTypes (see above) plus several special dataTypes that can be used for Cassandra list columns: booleanList, byteList, shortList, intList, longList, floatList, doubleList, charList, StringList. When one of these list columns is in the results being passed to ERDDAP, each row of source data will be expanded to list.size() rows of data in ERDDAP; simple dataTypes (for example, int) in that source data row will be duplicated list.size() times. If the results contain more than one list variable, all lists on a given row of data MUST have the same size and MUST be "parallel" lists, or ERDDAP will generate an error message. For example, for current measurements from an ADCP,
      depth[0], uCurrent[0], vCurrent[0], and zCurrent[0] are all related, and
      depth[1], uCurrent[1], vCurrent[1], and zCurrent[1] are all related, ...
    Alternatively, if you don't want ERDDAP to expand a list into multiple rows in the ERDDAP table, specify String as the dataVariable's dataType so the entire list will be represented as one String on one row in ERDDAP.
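    The row expansion described above can be sketched as follows (illustrative only, not ERDDAP's code; the column names are invented):

```python
def expand_row(row, list_cols):
    """Expand one source row with parallel list columns into
    list.size() rows; scalar values are duplicated on each row."""
    sizes = {len(row[c]) for c in list_cols}
    if len(sizes) != 1:
        raise ValueError("all lists on a row must be parallel (same size)")
    n = sizes.pop()
    return [{c: (row[c][i] if c in list_cols else row[c]) for c in row}
            for i in range(n)]

adcp = {"station": "A1", "depth": [0, 5, 10], "uCurrent": [0.1, 0.2, 0.3]}
rows = expand_row(adcp, ["depth", "uCurrent"])
print(len(rows))  # 3
print(rows[1])    # {'station': 'A1', 'depth': 5, 'uCurrent': 0.2}
```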

  • Cassandra TimeStamp Data - Cassandra's timestamp data is always aware of time zones. If you enter timestamp data without specifying a timezone, Cassandra assumes the timestamp uses the local time zone.

    ERDDAP supports timestamp data and always presents it in the Zulu/GMT time zone. So if you enter timestamp data into Cassandra using a time zone other than Zulu/GMT, remember that you need to do all ERDDAP queries for timestamp data using the Zulu/GMT time zone. And don't be surprised when the timestamp values that come out of ERDDAP are shifted by several hours because of the time zone switch from local to Zulu/GMT time.

    • In ERDDAP's datasets.xml, in the <dataVariable> tag for a timestamp variable, set
        <dataType>double</dataType>
      and in <addAttributes> set
        <att name="units">seconds since 1970-01-01T00:00:00Z</att> .
    • Suggestion: If the data is a time range, it is useful to have the timestamp values refer to the center of the implied time range (for example, noon). For example, if a user has data for 2010-03-26T13:00Z from another dataset and they want the closest data from this Cassandra dataset that has data for each day, then the data for 2010-03-26T12:00Z (representing Cassandra data for that date) is obviously the best (as opposed to the midnight before or after, where it is less obvious which is best).
    • ERDDAP has a utility to Convert a Numeric Time to/from a String Time.
    • See How ERDDAP Deals with Time.
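    As a small sketch of the "seconds since 1970-01-01T00:00:00Z" units used above, converting between a numeric time and the Zulu string form (the specific value here is just an example):

```python
from datetime import datetime, timezone

# numeric time -> ISO 8601 Zulu string
t = 1269604800.0
iso = datetime.fromtimestamp(t, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(iso)  # 2010-03-26T12:00:00Z

# ISO 8601 Zulu string -> numeric time
back = datetime.strptime(iso, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc).timestamp()
print(back)  # 1269604800.0
```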
       
  • Integer nulls - Cassandra supports nulls in Cassandra int (ERDDAP int) and bigint (ERDDAP long) columns, but ERDDAP doesn't support true nulls for any integer data type.
    By default, Cassandra integer nulls will be converted in ERDDAP to 2147483647 for int columns, or 9223372036854775807 for long columns. These will appear as "NaN" in some types of text output files (for example, .csv), "" in other types of text output files (for example, .htmlTable), and the specific number (2147483647 for missing int values) in other types of files (for example, binary files like .nc and .mat). A user can search for rows of data with this type of missing value by referring to "NaN", e.g., "&windSpeed=NaN".

    If you use some other integer value to indicate missing values in your Cassandra table, please identify that value in datasets.xml:
    <att name="missing_value" type="int">-999</att>

    For Cassandra floating point columns, nulls get converted to NaNs in ERDDAP. For Cassandra data types that are converted to Strings in ERDDAP, nulls get converted to empty Strings. That shouldn't be a problem.
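    The default null conversion described above can be sketched like this (illustrative only; the helper name is invented):

```python
INT_MISSING = 2147483647             # ERDDAP's missing value for int columns
LONG_MISSING = 9223372036854775807   # ERDDAP's missing value for long columns

def cassandra_int_to_erddap(v, missing=INT_MISSING):
    # A Cassandra null in an integer column becomes ERDDAP's missing value.
    return missing if v is None else v

print(cassandra_int_to_erddap(None))  # 2147483647
print(cassandra_int_to_erddap(42))    # 42
```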

  • "WARNING: Re-preparing already prepared query" in tomcat/logs/catalina.out (or some other Tomcat log file)
    Cassandra documentation says there is trouble if the same query is made into a PreparedStatement twice (or more). (See this bug report (external link).) To avoid making Cassandra mad, ERDDAP caches all PreparedStatements so it can reuse them. That cache is lost if/when Tomcat/ERDDAP is restarted, but I think that is okay because the PreparedStatements are associated with a given session (between Java and Cassandra), which is also lost. So, you may see these messages. I know of no other solution. Fortunately, it is a warning, not an error (although Cassandra threatens that it may lead to performance problems).

    Cassandra claims that PreparedStatements are good forever, so ERDDAP's cached PreparedStatements should never become out-of-date/invalid. If that isn't true, and you get errors about certain PreparedStatements being out-of-date/invalid, then you need to restart ERDDAP to clear ERDDAP's cache of PreparedStatements.

  • Security
    See Securing Cassandra

    When working with Cassandra, you need to do things as safely and securely as possible to avoid allowing a malicious user to damage your Cassandra or gain access to data they shouldn't have access to. ERDDAP tries to do things in a secure way, too.

    • We encourage you to set up ERDDAP to connect to Cassandra as a Cassandra user that only has access to the relevant table(s) and only has READ privileges.
    • We encourage you to set up the connection from ERDDAP to Cassandra so that it
      • always uses SSL,
      • only allows connections from one IP address (or one block of addresses) and from the one ERDDAP user, and
      • only transfers passwords in their MD5 hashed form.
    • [KNOWN PROBLEM] The connectionProperties (including the password!) are stored as plain text in datasets.xml. Only the administrator should have READ and WRITE access to this file! No other users of the computer should have READ or WRITE access to this file! We haven't found a way to allow the administrator to enter the Cassandra password during ERDDAP's startup in Tomcat (which occurs without user input), so the password must be accessible in a file.
    • When in ERDDAP, the password and other connection properties are stored in "private" Java variables.
    • Requests from clients are parsed and checked for validity before generating the CQL requests for Cassandra.
    • Requests to Cassandra are made with CQL Bound/PreparedStatements, to prevent CQL injection. In any case, Cassandra is inherently less susceptible to CQL injection than traditional databases are to SQL injection.
       
  • Speed - Cassandra can be fast or slow. There are some things you can do to make it fast:
    • In General -
      The nature of CQL is that queries are declarative (external link). They just specify what the user wants. They don't include a specification or hints for how the query is to be handled or optimized. So there is no way for ERDDAP to generate the query in such a way that it helps Cassandra optimize the query (or in any way specifies how the query is to be handled). In general, it is up to the Cassandra administrator to set things up (for example, indexes) to optimize for certain types of queries.
       
    • Specifying the timestamp columns that are related to coarser-precision timestamp partition keys via <partitionKeySourceNames> is the most important way to help ERDDAP work efficiently with Cassandra. If this relationship exists in a Cassandra table and you don't tell ERDDAP, the dataset will be painfully slow in ERDDAP and use tons of Cassandra resources.
       
    • Specifying the cluster columns via <clusterColumnSourceNames> is the second most important way to help ERDDAP work efficiently with Cassandra. If there are cluster columns and you don't tell ERDDAP, a large subset of the possible queries for data will be needlessly, painfully slow in ERDDAP and use tons of Cassandra resources.
       
    • Make Indexes (external link) for Commonly Constrained Variables -
      You can speed a few queries by creating indexes for Cassandra columns that are often constrained with "=" constraints.

      Cassandra can't make indexes for list, set, or map columns.

    • Specifying the index columns via <indexColumnSourceNames> is an important way to help ERDDAP work efficiently with Cassandra. If there are index columns and you don't tell ERDDAP, some queries for data will be needlessly, painfully slow in ERDDAP and use tons of Cassandra resources.
       
    • "Cassandra stats" Diagnostic Messages - For every ERDDAP user query to a Cassandra dataset, ERDDAP will print a line in the log file, bigParentDirectory/logs/log.txt, with some statistics related to the query, for example,
      * Cassandra stats: partitionKeyTable: 2/10000=2e-4 < 0.1 nCassRows=1200 nErddapRows=12000 nRowsToUser=7405
      Using the numbers in the example above, this means:
      • When ERDDAP last (re)loaded this dataset, Cassandra told ERDDAP that there were 10000 distinct combinations of the partition keys. ERDDAP cached all of the distinct combinations in a file.
      • Due to the user's constraints, ERDDAP identified 2 combinations out of the 10000 that might have the desired data. So, ERDDAP will make 2 calls to Cassandra, one for each combination of the partition keys. (That's what Cassandra requires.) Clearly, it is trouble if a large dataset has a large number of combinations of the partition keys and a given request doesn't drastically reduce that. You can require that each request reduce the key space by setting <maxRequestFraction>. Here, 2/10000=2e-4, which is less than the maxRequestFraction (0.1), so the request was allowed.
      • After applying the constraints on the partition keys, cluster columns, and index columns which were sent by ERDDAP, Cassandra returned 1200 rows of data to ERDDAP in the ResultSet.
      • The ResultSet must have had dataType=sometypeList columns (with an average of 10 items per list), because ERDDAP expanded the 1200 rows from Cassandra into 12000 rows in ERDDAP.
      • ERDDAP always applies all of the user's constraints to the data from Cassandra. In this case, constraints which Cassandra had not handled reduced the number of rows to 7405. That is the number of rows sent to the user.
      The most important use of these diagnostic messages is to make sure that ERDDAP is doing what you think it is doing. If it isn't (for example, is it not reducing the number of distinct combinations as expected?), then you can use the information to try to figure out what's going wrong.
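      If you want to monitor these statistics, the pieces of the example line above can be pulled out with a short script (a sketch; the regular expression assumes the exact format shown above):

```python
import re

line = ("* Cassandra stats: partitionKeyTable: 2/10000=2e-4 < 0.1 "
        "nCassRows=1200 nErddapRows=12000 nRowsToUser=7405")

m = re.search(r"partitionKeyTable: (\d+)/(\d+)=\S+ < (\S+) "
              r"nCassRows=(\d+) nErddapRows=(\d+) nRowsToUser=(\d+)", line)
needed, total = int(m.group(1)), int(m.group(2))
print(needed / total)   # 0.0002 -- fraction of combinations the request needed
print(int(m.group(6)))  # 7405   -- rows sent to the user
```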
       
    • Research and experiment to find and set better <connectionProperty>'s.
       
    • Check the speed of the network connection between Cassandra and ERDDAP. If the connection is slow, see if you can improve it. The best situation is when ERDDAP is running on a server attached to the same (fast) switch as the server running the Cassandra node to which you are connecting.
       
    • Please be patient. Read the information here and in the Cassandra documentation carefully. Experiment. Check your work. If the Cassandra-ERDDAP connection is still slower than you expect, please email your Cassandra table's schema and your ERDDAP chunk of datasets.xml to bob dot simons at noaa dot gov.
      Or, you can join the ERDDAP Google Group / Mailing List and post your question there.
       
    • If all else fails,
      consider storing the data in a collection of NetCDF v3 .nc files (especially .nc files that use the CF Discrete Sampling Geometries (DSG) (external link) Contiguous Ragged Array data structures and so can be handled with ERDDAP's EDDTableFromNcCFFiles). If they are logically organized (each with data for a chunk of space and time), ERDDAP can extract data from them very quickly.
       
  • The skeleton XML for an EDDTableFromCassandra dataset is:
    <dataset type="EDDTableFromCassandra" datasetID="..." active="..." >
      <ipAddress>...</ipAddress>
        <!-- The Cassandra URL without the port number, for example, 127.0.0.1   REQUIRED. -->
      <connectionProperty name="name">value</connectionProperty>
        <!-- The names (for example, "readTimeoutMillis") and values of the 
          Cassandra properties that ERDDAP needs to change.  0 or more. --> 
      <keyspace>...</keyspace> <!-- The name of the keyspace that has the table.  REQUIRED. -->
      <tableName>...</tableName> <!-- The name of the table, default = "".  REQUIRED. -->
      <partitionKeySourceNames>...</partitionKeySourceNames> <!-- REQUIRED. -->
      <clusterColumnSourceNames>...</clusterColumnSourceNames> <!-- OPTIONAL. -->
      <indexColumnSourceNames>...</indexColumnSourceNames> <!-- OPTIONAL. -->
      <maxRequestFraction>...</maxRequestFraction>
        <!-- OPTIONAL double between 1e-10 and 1 (the default). -->
      <columnNameQuotes>...</columnNameQuotes> <!-- OPTIONAL. Options: [nothing] (the default) or ". -->
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <addAttributes>...</addAttributes> <!-- 0 or 1 -->
      <dataVariable>...</dataVariable> <!-- 1 or more.
         Each dataVariable MUST include a <dataType> tag. See Cassandra DataTypes.
         For Cassandra timestamp columns, set dataType=double and 
         units=seconds since 1970-01-01T00:00:00Z -->
    </dataset>
    
     

EDDTableFromDapSequence handles variables within 1- and 2-level sequences from DAP (external link) servers such as DAPPER (external link).

  • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it. You can gather the information you need by looking at the source dataset's DDS and DAS files in your browser (by adding .das and .dds to the sourceUrl, for example, http://dapper.pmel.noaa.gov/dapper/epic/tao_time_series.cdp.dds (external link)).

  • A variable is in a DAP sequence if the .dds response indicates that the data structure holding the variable is a "sequence" (case insensitive).
  • In some cases, you will see a sequence within a sequence, a 2-level sequence -- EDDTableFromDapSequence handles these, too.
  • The skeleton XML for an EDDTableFromDapSequence dataset is:
    <dataset type="EDDTableFromDapSequence" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <outerSequenceName>...</outerSequenceName>
        <!-- The name of the outer sequence for DAP sequence data. 
        This tag is REQUIRED. -->
      <innerSequenceName>...</innerSequenceName>
        <!-- The name of the inner sequence for DAP sequence data. 
        This tag is OPTIONAL; use it if the DAP data is a two level 
        sequence. -->
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <sourceCanConstrainStringEQNE>true|false</sourceCanConstrainStringEQNE>
      <sourceCanConstrainStringGTLT>true|false</sourceCanConstrainStringGTLT>
      <sourceCanConstrainStringRegex>...</sourceCanConstrainStringRegex>
      <skipDapperSpacerRows>...</skipDapperSpacerRows>
        <!-- skipDapperSpacerRows specifies whether the dataset 
        will skip the last row of each innerSequence other than the 
        last innerSequence (because Dapper servers put NaNs in the 
        row to act as a spacer).  This tag is OPTIONAL. The default 
        is false.  It is recommended that you set this to true for 
        all Dapper sources and false for all other data sources. -->
      <addAttributes>...</addAttributes> <!-- 0 or 1 -->
      <dataVariable>...</dataVariable> <!-- 1 or more -->
    </dataset>
    
     

EDDTableFromDatabase handles data from one database table or view (external link).

  • datasets.xml - It is difficult to create the correct datasets.xml information needed for ERDDAP to establish a connection to the database. Be patient. Be methodical.
    • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it.

      GenerateDatasetsXml has three special options for EDDTableFromDatabase:

      1. If you enter "!!!LIST!!!" (without the quotes) for the catalog name, the program will display a list of the catalog names.
      2. If you enter "!!!LIST!!!" (without the quotes) for the schema name, the program will display a list of the schema names.
      3. If you enter "!!!LIST!!!" (without the quotes) for the tablename, the program will display a list of tables and their columns.
      The first "!!!LIST!!!" entry that you make is the one that will be used.
    • Carefully read all of this document's information about EDDTableFromDatabase.
    • You can gather most of the information you need to create the XML for an EDDTableFromDatabase dataset by contacting the database administrator and by searching the web.
    • Although databases often treat column names and table names in a case-insensitive way, they are case-sensitive in ERDDAP. So if an error message from the database says that a column name is unknown (for example, "Unknown identifier='column_name'") even though you know it exists, try using all capitals, for example, COLUMN_NAME, which is often the true, case-sensitive version of the column name.
    • Work closely with the database administrator, who may have relevant experience. If the dataset fails to load, read the error message carefully to find out why.
       
  • JDBC Driver and <driverName> - You must get the appropriate JDBC 3 or JDBC 4 driver .jar file for your database and
    put it in tomcat/webapps/erddap/WEB-INF/lib after you install ERDDAP. Then, in your datasets.xml for this dataset, you must specify the <driverName> for this driver, which is (unfortunately) different from the file name. Search on the web for the JDBC driver for your database and the driverName that Java needs to use it. Unfortunately, JDBC is sometimes the source of trouble. In its role as intermediary between ERDDAP and the database, it sometimes makes subtle changes to the standard/generic database SQL request that ERDDAP creates, thereby causing problems (for example, related to upper/lower case identifiers and related to date/time timezones). Please be patient, read the information here carefully, check your work, and email bob dot simons at noaa dot gov if you have problems that you can't resolve.
    Or, you can join the ERDDAP Google Group / Mailing List and post your question there.
     
  • <connectionProperty> - In the datasets.xml for your dataset, you must define several connectionProperty tags to tell ERDDAP how to connect to your database (for example, to specify the user name, password, ssl connection, and fetch size). These are different for every situation and are a little hard to figure out. Search the web for examples of using a JDBC driver to connect to your database. The <connectionProperty> names (for example, "user", "password", and "ssl"), and some of the connectionProperty values can be found by searching the web for "JDBC connection properties databaseType" (for example, Oracle, MySQL, Amazon RDS, MariaDB, PostgreSQL).
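For example, a connection to PostgreSQL commonly uses the "user", "password", and "ssl" properties. The values below are placeholders; check your JDBC driver's documentation for the property names it actually supports:

```xml
<connectionProperty name="user">myReadOnlyUser</connectionProperty>
<connectionProperty name="password">myPassword</connectionProperty>
<connectionProperty name="ssl">true</connectionProperty>
```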
     
  • One Table - If the data you want to serve is in two or more tables (and needs a JOIN to extract data), you need to make one new table (or view) with the denormalized/JOINed/flattened information. Making the denormalized table for ERDDAP is a good opportunity to make a few changes that ERDDAP needs, in a way that doesn't affect your original tables:
    • Change the date and timestamp fields/columns to use a database data type that corresponds to the JDBC type "timestamp with time zone".
      Timestamps without time zone information don't work correctly in ERDDAP.
    • Make indexes for the columns that users often search.
    • Be very aware of the case of the field/column names (for example, use all lower case) when you type them.
    • Don't use reserved words for the table and for the field/column names.
    Contact your database administrator for help doing this.
     
  • Views - EDDTableFromDatabase is limited to getting data from one table, but that shouldn't be a problem. If a table of interest has foreign keys which link to other tables, you can ask the database administrator to create a VIEW (external link). Views "can join and simplify multiple tables into a single virtual table" (Wikipedia). Views are good because:
    • They simplify queries (since the queries don't have to specify the JOINs, etc.).
    • They are efficient (since the database just has to set it up once).
    • They increase abstraction (since the underlying table can be changed without having to change how the VIEW appears to the client).
    Unfortunately, accessing data from a view is usually much slower than accessing data from a denormalized table. So, we generally recommend making a denormalized table, unless the data changes very frequently.
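To make the trade-off concrete, here is a small sketch using Python's sqlite3 module, purely for illustration (the table and column names are made up, and your database's SQL will differ in details): a VIEW that JOINs a station table to a measurement table, and a denormalized table generated from that view.

```python
# Sketch: denormalizing two related tables, using SQLite for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE station (station_id INTEGER PRIMARY KEY, name TEXT, lat REAL, lon REAL)")
cur.execute("CREATE TABLE measurement (station_id INTEGER, obs_time TEXT, temperature REAL)")
cur.execute("INSERT INTO station VALUES (1, 'buoy46088', 48.33, -123.17)")
cur.execute("INSERT INTO measurement VALUES (1, '2010-03-26T12:00:00Z', 9.5)")

# Option 1: a VIEW that JOINs the tables; every query re-does the JOIN.
cur.execute("""CREATE VIEW obs_view AS
    SELECT s.name, s.lat, s.lon, m.obs_time, m.temperature
    FROM measurement m JOIN station s ON m.station_id = s.station_id""")

# Option 2: a denormalized table built from the view; queries need no JOIN,
# but the table must be regenerated when the source tables change.
cur.execute("CREATE TABLE obs_flat AS SELECT * FROM obs_view")

print(cur.execute("SELECT name, temperature FROM obs_flat").fetchall())  # [('buoy46088', 9.5)]
```

Regenerating the flat table periodically (for example, from a nightly job) keeps ERDDAP's queries JOIN-free while the original tables stay normalized.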
     
  • Quotes for Field/Column Names; Case Sensitivity - By default, EDDTableFromDatabase puts ANSI-SQL-standard double quotes around field/column names in SELECT statements in case you have used a reserved word as a field/column name, or a special character in a field/column name. The double quotes also thwart certain types of SQL injection attacks. You can tell ERDDAP to use ", ', or no quotes via <columnNameQuotes> in datasets.xml for this dataset.

    For many databases, using any type of quotes causes the database to work with field/column names in a case-sensitive way (instead of the database's default case-insensitive way). Databases often display field/column names as all upper-case, when in reality the case-sensitive form is different. In ERDDAP, please always treat database column names as case-sensitive.

    • For MariaDB, you need to run the database with --sql-mode=ANSI_QUOTES (external link) .
    • For MySQL and Amazon RDS, you need to run the database with --sql-mode=ANSI_QUOTES (external link) .
    • Oracle supports ANSI-SQL-standard double quotes by default.
    • PostgreSQL supports ANSI-SQL-standard double quotes by default.

    Don't use a reserved word for the name of a database, catalog, schema, or table. ERDDAP doesn't put quotes around those names.

    If possible, use all lower-case for database, catalog, schema, table names and field names when creating the database table (or view) and when referring to the field/column names in datasets.xml in ERDDAP. Otherwise, you may get an error message saying the database, catalog, schema, table, and/or field wasn't found. If you do get that error message, try using the case-sensitive version, the all upper-case, and the all lower-case versions of the names in ERDDAP. One of them may work. If not, you need to change the name of database, catalog, schema, and/or table to all lower-case.

  • Database <dataType> Tags - Because there is some ambiguity about which database data types map to which ERDDAP data types, you need to specify a <dataType> tag for each <dataVariable> to tell ERDDAP which dataType to use. Part of the problem is that different databases use different definitions for the various data types -- so always try to match the definitions, not just the names. The standard ERDDAP dataTypes (and the most common corresponding SQL data types) are:
    • boolean (boolean), which ERDDAP then stores as bytes
    • byte (tinyint, if the range is +/-127)
    • short (smallint, if the range is +/-32767)
    • int (integer, numeric?)
    • long (bigint, numeric?)
    • float (float?, real, decimal?, numeric?)
    • double (float?, double precision, decimal (with possible loss of precision), numeric?, date, timestamp)
    • char (character, if it never has more than 1 character)
    • string (character, varchar, character varying, binary, varbinary, interval, array, multiset, xml, and any other database data type that doesn't fit cleanly with any other ERDDAP data type)
    Date and timestamp are special cases: use ERDDAP's double dataType.
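As a quick sanity check when writing the <dataType> tags, the list above can be restated as a lookup table. This is a sketch only -- the "?" entries above show that the right choice depends on how your particular database defines each type:

```python
# Common SQL type names -> ERDDAP <dataType>, per the list above.
# Treat this as a starting point, not a rule: verify against your
# database's definition of each type.
SQL_TO_ERDDAP = {
    "boolean": "boolean",          # ERDDAP then stores these as bytes
    "tinyint": "byte",             # if the range is +/-127
    "smallint": "short",           # if the range is +/-32767
    "integer": "int",
    "bigint": "long",
    "real": "float",
    "double precision": "double",
    "date": "double",              # with units "seconds since 1970-01-01T00:00:00Z"
    "timestamp": "double",         # same units
    "varchar": "string",
}
print(SQL_TO_ERDDAP["timestamp"])  # double
```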
     
  • Database Date Time Data - Some database date time columns have no explicit time zone. Such columns are trouble for ERDDAP. Databases support the concept of a date (with or without a time) without a time zone, as an approximate range of time. But Java (and thus ERDDAP) only deals with instantaneous date+times with a time zone. So you may know that the date time data is based on a local time zone (with or without daylight saving) or the GMT/Zulu time zone, but Java (and ERDDAP) doesn't. We originally thought we could work around this problem (e.g., by specifying a time zone for the column), but the database+JDBC+Java interactions made this an unreliable solution.
    • So, ERDDAP requires that you store all date and date time data in the database table with a database data type that corresponds to the JDBC type "timestamp with time zone" (ideally, that uses the GMT/Zulu time zone).
    • In ERDDAP's datasets.xml, in the <dataVariable> tag for a timestamp variable, set
        <dataType>double</dataType>
      and in <addAttributes> set
        <att name="units">seconds since 1970-01-01T00:00:00Z</att> .
    • Suggestion: If the data is a time range, it is useful to have the timestamp values refer to the center of the implied time range (for example, noon). For example, if a user has data for 2010-03-26T13:00Z from another dataset and they want the closest data from a database dataset that has data for each day, then the database data for 2010-03-26T12:00Z (representing data for that date) is obviously the best (as opposed to the midnight before or after, where it is less obvious which is best).
    • ERDDAP has a utility to Convert a Numeric Time to/from a String Time.
    • See How ERDDAP Deals with Time.
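Because the units are "seconds since 1970-01-01T00:00:00Z", the numeric values are ordinary Unix epoch seconds, so the conversion is simple arithmetic. A small Python sketch:

```python
# Convert between ERDDAP's numeric times ("seconds since
# 1970-01-01T00:00:00Z") and ISO 8601 Zulu strings.
from datetime import datetime, timezone

def to_iso(seconds_since_epoch):
    """Numeric ERDDAP time -> ISO 8601 Zulu string."""
    dt = datetime.fromtimestamp(seconds_since_epoch, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

def to_seconds(iso_string):
    """ISO 8601 Zulu string -> seconds since 1970-01-01T00:00:00Z."""
    dt = datetime.strptime(iso_string, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return dt.timestamp()

print(to_iso(1269604800.0))               # 2010-03-26T12:00:00Z
print(to_seconds("2010-03-26T12:00:00Z")) # 1269604800.0
```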
       
  • Integer nulls - Databases support nulls in integer (int, smallint, tinyint) columns, but ERDDAP doesn't.
    By default (at least on PostgreSQL), database nulls will be converted in ERDDAP to 2147483647 for int columns or 32767 for short columns. If you use those defaults, please identify those missing_values for the dataset's users in ERDDAP with
    <att name="missing_value" type="int">2147483647</att>
    or
    <att name="missing_value" type="short">32767</att>

    If the integer nulls in your database cause error messages in ERDDAP (they don't in PostgreSQL), you need to convert them to some actual number (for example, -32000) and identify that value in datasets.xml:
    <att name="missing_value" type="int">-32000</att>

    For database floating point columns, nulls get converted to NaNs in ERDDAP.
    For database data types that are converted to Strings in ERDDAP, nulls get converted to empty Strings.
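A minimal sketch of that convention (the sentinel values are the defaults noted above; the function itself is hypothetical, just to show the substitution):

```python
# Database NULLs in integer columns become a sentinel value, which is
# then declared as the missing_value attribute in datasets.xml.
INT_MISSING = 2147483647    # matches <att name="missing_value" type="int">
SHORT_MISSING = 32767       # matches <att name="missing_value" type="short">

def nulls_to_sentinel(values, sentinel):
    """Replace Python None (a database NULL) with the declared sentinel."""
    return [sentinel if v is None else v for v in values]

print(nulls_to_sentinel([5, None, 17], INT_MISSING))  # [5, 2147483647, 17]
```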

  • Security - When working with databases, you need to do things as safely and securely as possible to avoid allowing a malicious user to damage your database or gain access to data they shouldn't have access to. ERDDAP tries to do things in a secure way, too.
    • Consider replicating, on a different computer, the database and database tables with the data that you want ERDDAP to serve. (Yes, for commercial databases like Oracle, this involves additional licensing fees. But for open source databases, like PostgreSQL, MySQL, Amazon RDS, and MariaDB, this costs nothing.) This gives you a high level of security and also prevents ERDDAP requests from slowing down the original database.
    • We encourage you to set up ERDDAP to connect to the database as a database user that only has access to the relevant database(s) and only has READ privileges.
    • We encourage you to set up the connection from ERDDAP to the database so that it
      • always uses SSL,
      • only allows connections from one IP address (or one block of addresses) and from the one ERDDAP user, and
      • only transfers passwords in their MD5 hashed form.
    • [KNOWN PROBLEM]The connectionProperties (including the password!) are stored as plain text in datasets.xml. Only the administrator should have READ and WRITE access to this file! No other users of the computer should have READ or WRITE access to this file! We haven't found a way to allow the administrator to enter the database password during ERDDAP's startup in Tomcat (which occurs without user input), so the password must be accessible in a file.
    • When in ERDDAP, the password and other connection properties are stored in "private" Java variables.
    • Requests from clients are parsed and checked for validity before generating the SQL requests for the database.
    • Requests to the database are made with SQL PreparedStatements, to prevent SQL injection.
    • Requests to the database are submitted with executeQuery (not executeStatement) to limit requests to be read-only (so attempted SQL injection to alter the database will fail for this reason, too).
       
  • SQL - Because OPeNDAP's tabular data requests were designed to mimic SQL tabular data requests, it is easy for ERDDAP to convert tabular data requests into simple SQL PreparedStatements. For example, the ERDDAP request
      time,temperature&time>=2008-01-01T00:00:00Z&time<=2008-02-01T00:00:00Z

    will be converted into the SQL PreparedStatement
      SELECT "time", "temperature" FROM tableName
      WHERE "time" >= 2008-01-01T00:00:00Z AND "time" <= 2008-02-01T00:00:00Z

    ERDDAP requests with &distinct() and/or &orderBy(variables) will add DISTINCT and/or ORDER BY variables to the SQL prepared statement. In general, this will greatly slow down the response from the database.
    ERDDAP logs the PreparedStatement in log.txt as
      statement=thePreparedStatement

    Note that this will be a text representation of the PreparedStatement, which may be slightly different from the actual PreparedStatement. For example, in the PreparedStatement, times are encoded in a special way. But in the text representation, they appear as ISO 8601 date times.
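Here is the same conversion in miniature, using Python's sqlite3 instead of JDBC (the table and values are made up, and the times here are converted to epoch seconds purely for illustration; as noted above, ERDDAP encodes times in its own way). The key point is that the user's constraint values are bound to "?" placeholders rather than pasted into the SQL string, which is the PreparedStatement property that blocks SQL injection:

```python
# Miniature request-to-prepared-statement conversion with bound parameters.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE obs ("time" REAL, "temperature" REAL)')
con.execute("INSERT INTO obs VALUES (1199145600.0, 10.1)")  # 2008-01-01T00:00:00Z
con.execute("INSERT INTO obs VALUES (1204329600.0, 11.3)")  # 2008-03-01T00:00:00Z (outside range)

# ERDDAP request: time,temperature&time>=2008-01-01T00:00:00Z&time<=2008-02-01T00:00:00Z
# The ISO times become numeric values bound to the "?" placeholders.
sql = 'SELECT "time", "temperature" FROM obs WHERE "time" >= ? AND "time" <= ?'
rows = con.execute(sql, (1199145600.0, 1201824000.0)).fetchall()
print(rows)  # [(1199145600.0, 10.1)]
```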
     
  • Speed - Databases can be slow. There are some things you can do:
    • In General -
      The nature of SQL is that queries are declarative (external link). They just specify what the user wants. They don't include a specification or hints for how the query is to be handled or optimized. So there is no way for ERDDAP to generate the query in such a way that it helps the database optimize the query (or in any way specifies how the query is to be handled). In general, it is up to the database administrator to set things up (for example, indexes) to optimize for certain types of queries.
    • Set the Fetch Size -
      Databases return the data to ERDDAP in chunks. By default, different databases return a different number of rows in the chunks. Often this number is very small and so very inefficient. For example, the default for Oracle is 10! Read the JDBC documentation for your database's JDBC driver to find the connection property to set in order to increase this, and add this to the dataset's description in datasets.xml. For example,
      For MySQL and Amazon RDS, use
      <connectionProperty name="defaultFetchSize">10000</connectionProperty>
      For MariaDB, there is currently no way to change the fetch size. But it is a requested feature, so search the web to see if this has been implemented.
      For Oracle, use
      <connectionProperty name="defaultRowPrefetch">10000</connectionProperty>
      For PostgreSQL, use
      <connectionProperty name="defaultFetchSize">10000</connectionProperty>
      but feel free to change the number. Note that setting the number too big will
      cause ERDDAP to use lots of memory and be more likely to run out of memory.
    • ConnectionProperties -
      Each database has other connection properties which can be specified in datasets.xml. Many of these will affect the performance of the database to ERDDAP connection. Please read the documentation for your database's JDBC driver to see the options. If you find connection properties that are useful, please send an email with the details to bob dot simons at noaa dot gov.
    • Make a Table -
      You will probably get faster responses if you periodically (every day? whenever there is new data?) generate an actual table (similarly to how you generated the VIEW) and tell ERDDAP to get data from the table instead of the VIEW. Since any request to the table can then be fulfilled without JOINing another table, the response will be much faster.
    • Vacuum the Table -
      MySQL and Amazon RDS will respond much faster if you use OPTIMIZE TABLE (external link).
      MariaDB will respond much faster if you use OPTIMIZE TABLE (external link).
      PostgreSQL will respond much faster if you VACUUM (external link) the table.
      Oracle doesn't have or need an analogous command.
    • Make Indexes (external link) for Commonly Constrained Variables -
      You can speed up many/most queries by creating indexes in the database for the variables (which databases call "columns") that are often constrained in the user's query. In general, these are the same variables specified by <subsetVariables> and/or the latitude, longitude, and time variables.
    • Use Connection Pooling -
      Normally, ERDDAP makes a separate connection to the database for each request. This is the most reliable approach. The faster alternative is to use a DataSource which supports connection pooling. To set it up, specify (for example)
      <dataSourceName>java:comp/env/jdbc/postgres/erddap</dataSourceName>
      right next to <sourceUrl>, <driverName>, and <connectionProperty>.
      And in tomcat/conf/context.xml, define a resource with the same information, for example,
      <Resource
      name="jdbc/postgres/erddap" auth="Container" type="javax.sql.DataSource"
      driverClassName="org.postgresql.Driver"
      url="jdbc:postgresql://somehost:5432/myDatabaseName"
      username="myUsername" password="myPassword"
      initialSize="0" maxActive="8" minIdle="0" maxIdle="0" maxWait="-1"/>

      General information about using a DataSource is at https://docs.oracle.com/javase/tutorial/jdbc/basics/sqldatasources.html (external link).
      See Tomcat DataSource information (external link) and Tomcat DataSource examples (external link) or search the web for examples of using DataSources with other application servers.
    • If all else fails,
      consider storing the data in a collection of NetCDF v3 .nc files (especially .nc files that use the CF Discrete Sampling Geometries (DSG) (external link) Contiguous Ragged Array data structures and so can be handled with ERDDAP's EDDTableFromNcCFFiles). If they are logically organized (each with data for a chunk of space and time), ERDDAP can extract data from them very quickly.
       
  • The skeleton XML for an EDDTableFromDatabase dataset is:
    <dataset type="EDDTableFromDatabase" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
        <!-- The format varies for each type of database, but will be something like: 
          For MariaDB:    jdbc:mariadb://xxx.xxx.xxx.xxx:3306/databaseName
          For MySQL:      jdbc:mysql://xxx.xxx.xxx.xxx:3306/databaseName
          For Amazon RDS: jdbc:mysql://xxx.xxx.xxx.xxx:3306/databaseName
          For Oracle:     jdbc:oracle:thin:@xxx.xxx.xxx.xxx:1521:databaseName
          For PostgreSQL: jdbc:postgresql://xxx.xxx.xxx.xxx:5432/databaseName
          where xxx.xxx.xxx.xxx is the host computer's numeric IP address 
          followed by :PortNumber (4 digits), which may be different for your database.
          REQUIRED. -->
      <driverName>...</driverName>
        <!-- The high-level name of the database driver, for example, 
          "org.postgresql.Driver".  You need to put the actual database 
          driver .jar file (for example, postgresql.jdbc.jar) in 
          tomcat/webapps/erddap/WEB-INF/lib.  REQUIRED. -->
      <connectionProperty name="name">value</connectionProperty>
        <!-- The names (for example, "user", "password", and "ssl") and values 
          of the properties needed for ERDDAP to establish the connection
          to the database.  0 or more. -->
      <dataSourceName>...</dataSourceName>  <!-- 0 or 1 -->
      <catalogName>...</catalogName>
        <!-- The name of the catalog which has the schema which has the 
          table, default = "".  OPTIONAL.  Some databases don't use this. -->
      <schemaName>...</schemaName> <!-- The name of the 
        schema which has the table, default = "".  OPTIONAL. -->
      <tableName>...</tableName>  <!-- The name of the 
        table, default = "".  REQUIRED. -->
      <columnNameQuotes>...</columnNameQuotes> <!-- OPTIONAL. Options: " (the default), ', [nothing]. -->
      <orderBy>...</orderBy>  <!-- A comma-separated list of
        sourceNames to be used in an ORDER BY clause at the end of the 
        every query sent to the database (unless the user's request
        includes an &orderBy() filter, in which case the user's 
        orderBy is used).  The order of the sourceNames is important. 
        The leftmost sourceName is most important; subsequent 
        sourceNames are only used to break ties.  Only relevant 
        sourceNames are included in the ORDER BY clause for a given user 
        request.  If this is not specified, the order of the returned 
        values is not specified. Default = "".  OPTIONAL. -->
      <sourceCanSort>no(default)|partial|yes</sourceCanSort> <!-- 0 or 1 -->
      <sourceCanDoDistinct>no(default)|partial|yes</sourceCanDoDistinct> <!-- 0 or 1 -->
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <addAttributes>...</addAttributes> <!-- 0 or 1 -->
      <dataVariable>...</dataVariable> <!-- 1 or more.
         Each dataVariable MUST include a <dataType> tag. See Database DataTypes.
         For database date and timestamp columns, set dataType=double and 
         units=seconds since 1970-01-01T00:00:00Z -->
    </dataset>
    
     

EDDTableFromEDDGrid lets you create an EDDTable dataset from any EDDGrid dataset.

  • Some common reasons for doing this are:
    • This allows the dataset to be queried with OPeNDAP selection constraints (which a user may have requested).
    • The dataset is inherently a tabular dataset.
  • WARNING: For now, this should be considered an EXPERIMENTAL type of dataset. ERDDAP does not put any limits on user requests to this dataset (such as requiring time, latitude, and/or longitude constraints). As a result, this class probably shouldn't be used with very large datasets (for example, satellite datasets) or with remote datasets because queries without time, latitude, and/or longitude constraints may have to sift through the entire dataset to find matching data. For large and/or remote datasets, those requests may fail because they take too long to complete.
  • Comments? If you have any comments about this type of dataset, please email bob dot simons at noaa dot gov.
    Or, you can join the ERDDAP Google Group / Mailing List and post your question there.
  • This class's <reloadEveryNMinutes> is what counts. The enclosed EDDGrid's <reloadEveryNMinutes> is ignored.
  • This class doesn't support <updateEveryNMillis>. The enclosed EDDGrid's <updateEveryNMillis> is what matters.
  • The skeleton XML for an EDDTableFromEDDGrid dataset is:
    <dataset type="EDDTableFromEDDGrid" datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <updateEveryNMillis>...</updateEveryNMillis> <!-- 0 or 1. For EDDTableFromEDDGrid, 
        this calls lowUpdate() of the underlying EDDGrid. -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <addAttributes>...</addAttributes>  <!-- 0 or 1 -->
      <dataset>...</dataset> <!-- 1 
         Any type of EDDGrid dataset.  You can even use an EDDGridFromErddap to access 
         an independent EDDGrid dataset on this server. -->
    </dataset>
    
     

EDDTableFromFileNames creates a dataset from information about a group of files in the server's file system, including a URL for each file so that users can download the files via ERDDAP's "files" system. Unlike all of the EDDTableFromFiles subclasses, this dataset type does not serve data from within the files.

  • EDDTableFromFileNames is useful when:
    • You have a group of files that you want to distribute as whole files because they don't contain "data" in the same way that regular data files have data. For example, image files, video files, Word documents, Excel spreadsheet files, PowerPoint presentation files, or text files with unstructured text.
    • You have a group of files which have data in a format that ERDDAP can't yet read. For example, a project-specific, custom, binary format.
       
  • The data in an EDDTableFromFileNames dataset is a table that ERDDAP creates on-the-fly with information about a group of local files. In the table, there is a row for each file. Four special tags in the datasets.xml for this dataset determine which files will be included in this dataset:
    • <fileDir> - This specifies the source directory in the server's file system with the files for this dataset. The files that are actually located in the server's file system in <fileDir> will appear in the url column of this dataset within a virtual directory named http://serverUrl/erddap/files/datasetID/ .
      For example, if the datasetID is jplMURSST,
      and the <fileDir> is /home/data/mur/ ,
      and that directory has a file named jplMURSST20150103000000.png,
      then the url that will be shown to users for that file will be
      http://serverUrl/erddap/files/jplMURSST/jplMURSST20150103000000.png .
    • <recursive> - If <recursive> is set to true, files in subdirectories of <fileDir> with names which match <fileNameRegex> will appear in the same subdirectories in the "files" URL. The default is false.
    • <pathRegex> - If recursive=true, only directory names which match the pathRegex (default=".*") will be accepted. If recursive=false, this is ignored. This is rarely used, but can be very useful in unusual circumstances.
    • <fileNameRegex> - Only files whose whole file name (not including the directory name) matches the <fileNameRegex> will be included in this dataset. For example, jplMURSST.{14}\.png .
    In the table, there will be columns with:
    • url - The URL that users can use to download the file via ERDDAP's "files" system.
    • name - The file's name (without a directory name).
    • lastModified - The time the file was last modified (stored as doubles with "seconds since 1970-01-01T00:00:00Z"). This variable is useful because users can see if/when the contents of a given file last changed. This variable is a timeStamp variable, so the data may appear as numeric values (seconds since 1970-01-01T00:00:00Z) or as String values (ISO 8601:2004(E) format), depending on the situation.
    • size - The size of the file in bytes, stored as doubles. They are stored as doubles because some files may be larger than ints allow and longs are not supported in some response file types. Doubles will give the exact size, even for very large files.
    • additional columns defined by the ERDDAP administrator, with information extracted from the file name (for example, the time associated with the data in the file) based on two attributes that you specify in the metadata for each additional column/dataVariable:
      • extractRegex - This is a regular expression (external link) (tutorial (external link)). The entire regex must match the entire file name (not including the directory name). The regex must include at least one capture group (a section of a regular expression that is enclosed by parentheses) which ERDDAP uses to determine which section of the file name to extract to become data.
      • extractGroup - This is the number of the capture group (#1 is the first capture group) in the regular expression. The default is 1. A capture group is a section of a regular expression that is enclosed by parentheses.
      Here are two examples:
          <dataVariable>
              <sourceName>time</sourceName>
              <destinationName>time</destinationName>
              <dataType>String</dataType>
              <addAttributes>
                  <att name="extractRegex">jplMURSST(.{14})\.png</att>
                  <att name="extractGroup" type="int">1</att>
                  <att name="units">yyyyMMddHHmmss</att>
              </addAttributes>
          </dataVariable>
          <dataVariable>
              <sourceName>day</sourceName>
              <destinationName>day</destinationName>
              <dataType>int</dataType>
              <addAttributes>
                  <att name="extractRegex">jplMURSST.{6}(..).{6}\.png</att>
                  <att name="extractGroup" type="int">1</att>
                  <att name="ioos_category">Time</att>
              </addAttributes>
          </dataVariable> 
      In the case of the time variable, if a file has the name jplMURSST20150103000000.png, the extractRegex will match the file name, extract the characters which match the first capture group ("20150103000000") as dataType=String, then use the units (see Joda DateTimeFormat (external link)) to interpret that as a time data value (2015-01-03T00:00:00Z).

      In the case of the day variable, if a file has the name jplMURSST20150103000000.png, the extractRegex will match the file name, extract the characters which match the first capture group ("03") as <dataType>=int, yielding a data value of 3.
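The two extractions above can be reproduced in plain Python to check an extractRegex before putting it in datasets.xml (a sketch; ERDDAP's own matching is done in Java, but the regex semantics are the same):

```python
# What ERDDAP does with extractRegex/extractGroup, reduced to plain Python:
# the entire regex must match the entire file name, then the numbered
# capture group is pulled out to become the data value.
import re

file_name = "jplMURSST20150103000000.png"

# The "time" variable's extractRegex: group 1 is the 14-character timestamp.
m = re.fullmatch(r"jplMURSST(.{14})\.png", file_name)
print(m.group(1))  # 20150103000000

# The "day" variable's extractRegex: group 1 is the two day-of-month
# characters, interpreted as an int.
m = re.fullmatch(r"jplMURSST.{6}(..).{6}\.png", file_name)
print(int(m.group(1)))  # 3
```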

  • No <updateEveryNMillis> - This type of dataset doesn't need and can't use the <updateEveryNMillis> tag, because the information served by EDDTableFromFileNames is always perfectly up-to-date: ERDDAP queries the file system in order to respond to each request for data. Even if there are a huge number of files, this approach should work reasonably well. A response may be slow if there are a huge number of files and the dataset hasn't been queried for a while. But for several minutes after that, the operating system keeps the information in a cache, so responses should be very fast.
     
  • You can use the GenerateDatasetsXml program to make the datasets.xml chunk for this type of dataset. You can add/define additional columns with information extracted from the file name, as shown above.
     
  • The skeleton XML for an EDDTableFromFileNames dataset is:
    <dataset type="EDDTableFromFileNames" datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <fileDir>...</fileDir> 
      <recursive>...</recursive>  <!-- true or false (the default) -->
      <pathRegex>...</pathRegex>  <!-- 0 or 1. Only directory names which 
        match the pathRegex (default=".*") will be accepted. -->
      <fileNameRegex>...</fileNameRegex> 
      <addAttributes>...</addAttributes> <!-- 0 or 1 -->
      <dataVariable>...</dataVariable> <!-- 1 or more.
         Each dataVariable MUST include <dataType> tag. -->
    </dataset>
    
     

EDDTableFromFiles is the superclass of all EDDTableFrom...Files classes. You can't use EDDTableFromFiles directly. Instead, use a subclass of EDDTableFromFiles to handle the specific file type:

Currently, no other file types are supported. But it is usually relatively easy to add support for other file types. Contact us if you have a request. Or, if your data is in an old file format that you would like to move away from, we recommend converting the files to be NetCDF v3 .nc files (especially .nc files with the CF Discrete Sampling Geometries (DSG) (external link) Contiguous Ragged Array data structure -- ERDDAP can extract data from them very quickly). NetCDF is a widely supported binary format, allows fast random access to the data, and is already supported by ERDDAP.

Details - The following information applies to all of the subclasses of EDDTableFromFiles.

  • Aggregation - This class aggregates data from local files. Each file holds a (relatively) small table of data.
    • The resulting dataset appears as if all of the file's tables had been combined (all of the rows of data from file #1, plus all of the rows from file #2, ...).
    • The files don't all have to have all of the specified variables. If a given file doesn't have a specified variable, ERDDAP will add missing values as needed.
    • The variables in all of the files MUST have the same values for the add_offset, missing_value, _FillValue, scale_factor, and units attributes (if any). ERDDAP checks, but it is an imperfect test -- if there are different values, ERDDAP doesn't know which is correct and therefore which files are invalid. If this is a problem, you may be able to use NcML or NCO to fix the problem.
       
  • Cached File Information - When an EDDTableFromFiles dataset is first loaded, EDDTableFromFiles reads information from all of the relevant files and creates tables (one row for each file) with information about each valid file and each "bad" (different or invalid) file.
    • The tables are also stored on disk, as NetCDF v3 .nc files in bigParentDirectory/dataset/last2CharsOfDatasetID/datasetID/ in files named:
        dirTable.nc (which holds a list of unique directory names),
        fileTable.nc (which holds the table with each valid file's information),
        badFiles.nc (which holds the table with each bad file's information).
    • To speed up access to an EDDTableFromFiles dataset (but at the expense of using more memory), you can use
      <fileTableInMemory>true</fileTableInMemory>
      to tell ERDDAP to keep a copy of the file information tables in memory.
    • The copy of the file information tables on disk is also useful when ERDDAP is shut down and restarted: it saves EDDTableFromFiles from having to re-read all of the data files.
    • When a dataset is reloaded, ERDDAP only needs to read the data in new files and files that have changed.
    • If a file has a different structure from the other files (for example, different data type for one of the variables, different value for the "units" attribute), ERDDAP adds the file to the list of "bad" files. Information about the problem with the file will be written to the bigParentDirectory/logs/log.txt file.
    • You shouldn't ever need to delete or work with these files. One exception: if you are still making changes to a dataset's datasets.xml setup, you may want to delete these files to force ERDDAP to re-read all of the data files, since the files will be read/interpreted differently. If you ever do need to delete these files, you can do so while ERDDAP is running. (Then set a flag to reload the dataset ASAP.) However, ERDDAP usually notices that the datasets.xml information doesn't match the fileTable information and deletes the file tables automatically.
    • If you want to encourage ERDDAP to update the stored dataset information (for example, if you just added, removed, or changed some files to the dataset's data directory), use the flag system to force ERDDAP to update the cached file information.
       
  • Handling Requests - ERDDAP tabular data requests can put constraints on any variable.
    • When a client's request for data is processed, EDDTableFromFiles can quickly look in the table with the valid file information to see which files might have relevant data. For example, if each source file has the data for one fixed-location buoy, EDDTableFromFiles can very efficiently determine which files might have data within a given longitude range and latitude range.
    • Because the valid file information table includes the minimum and maximum value of every variable for every valid file, EDDTableFromFiles can often handle other queries quite efficiently. For example, if some of the buoys don't have an air pressure sensor, and a client requests data for airPressure!=NaN, EDDTableFromFiles can efficiently determine which buoys have air pressure data.
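    As a sketch (an illustration only, not ERDDAP's actual Java code; the buoy file names and longitudes are hypothetical), the min/max test amounts to a range-overlap check per file:

    ```python
    # Per-file min/max table (one row per valid file); values are hypothetical.
    file_table = [
        # (file name, min longitude, max longitude)
        ("buoy_A.nc", -125.0, -125.0),   # fixed-location buoys, so min == max
        ("buoy_B.nc", -120.0, -120.0),
        ("buoy_C.nc", -115.0, -115.0),
    ]

    def files_possibly_matching(table, lon_min, lon_max):
        """Keep only files whose [min, max] range overlaps the requested range."""
        return [name for name, lo, hi in table
                if hi >= lon_min and lo <= lon_max]

    print(files_possibly_matching(file_table, -122.0, -118.0))  # only buoy_B.nc
    ```

    Only the files that pass this cheap test need to be opened and read.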
       
  • Updating the Cached File Information - Whenever the dataset is reloaded, the cached file information is updated.
    • The dataset is reloaded periodically as determined by the <reloadEveryNMinutes> in the dataset's information in datasets.xml.
    • The dataset is reloaded as soon as possible whenever ERDDAP detects that you have added, removed, touch'd (external link) (to change the file's lastModified time), or changed a datafile.
    • The dataset is reloaded as soon as possible if you use the flag system.
    When the dataset is reloaded, ERDDAP compares the currently available files to the cached file information table. New files are read and added to the valid files table. Files that no longer exist are dropped from the valid files table. Files whose timestamp has changed are read and their information is updated. The new tables replace the old tables in memory and on disk.
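    This comparison can be sketched in Python (an illustration only, not ERDDAP's actual Java code; the file names and timestamps are hypothetical):

    ```python
    # The cached file information and the directory listing, as hypothetical
    # {file name: lastModified time} maps.
    cached  = {"a.nc": 100, "b.nc": 200, "c.nc": 300}
    on_disk = {"a.nc": 100, "b.nc": 250, "d.nc": 400}

    new_files     = sorted(on_disk.keys() - cached.keys())   # read and added
    removed_files = sorted(cached.keys() - on_disk.keys())   # dropped
    changed_files = sorted(name for name in cached.keys() & on_disk.keys()
                           if cached[name] != on_disk[name])  # re-read

    print(new_files, removed_files, changed_files)
    ```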
     
  • Bad Files - The table of bad files and the reasons the files were declared bad (corrupted file, missing variables, incorrect axis values, etc.) is emailed to the emailEverythingTo email address (probably you) every time the dataset is reloaded. You should replace or repair these files as soon as possible.
     
  • Near Real Time Data - EDDTableFromFiles treats requests for very recent data as a special case. The problem: If the files making up the dataset are updated frequently, it is likely that the dataset won't be updated every time a file is changed. So EDDTableFromFiles won't be aware of the changed files. (You could use the flag system, but this might lead to ERDDAP reloading the dataset almost continually. So in most cases, we don't recommend it.) Instead, EDDTableFromFiles deals with this by the following system: When ERDDAP gets a request for data within the last 20 hours (for example, 8 hours ago until Now), ERDDAP will search all files which have any data in the last 20 hours. Thus, ERDDAP doesn't need to have perfectly up-to-date data for all of the files in order to find the latest data. You should still set <reloadEveryNMinutes> to a reasonably small value (for example, 60), but it doesn't have to be tiny (for example, 3).
     
    • Not recommended organization of near-real-time data in the files: If, for example, you have a dataset that stores data for numerous stations (or buoys, or trajectories, ...) for many years, you could arrange the files so that, for example, there is one file per station. But then, every time new data for a station arrives, you have to read a large old file and write a large new file. And when ERDDAP reloads the dataset, it notices that some files have been modified, so it reads those files completely. That is inefficient.
       
    • Recommended organization of near-real-time data in the files: Store the data in chunks, for example, all data for one station/buoy/trajectory for one year (or one month). Then, when a new datum arrives, only the file with this year's (or month's) data is affected.
      • Best: Use NetCDF v3 .nc files with an unlimited dimension (time). Then, to add new data, you can just append the new data without having to read and re-write the entire file. The change is made very efficiently and essentially atomically, so the file isn't ever in an inconsistent state.
      • Otherwise: If you don't/can't use .nc files with an unlimited dimension (time), then, when you need to add new data, you have to read and rewrite the entire affected file (hopefully small because it just has a year's (or month's) worth of data). Fortunately, all of the files for previous years (or months) for that station remain unchanged.
      In both cases, when ERDDAP reloads the dataset, most files are unchanged; only a few, small files have changed and need to be read.
       
  • Directories - The files can be in one directory, or in a directory and its subdirectories (recursively). If there are a large number of files (for example, >1,000), the operating system (and thus EDDTableFromFiles) will operate much more efficiently if you store the files in a series of subdirectories (one per year, or one per month for datasets with very frequent files), so that there are never a huge number of files in a given directory.
     
  • Remote Directories and HTTP Range Requests (AKA Byte Serving, Byte Range Requests) -
    EDDGridFromNcFiles, EDDTableFromMultidimNcFiles, EDDTableFromNcFiles, and EDDTableFromNcCFFiles can sometimes serve data from .nc files on remote servers accessed via HTTP if the server supports Byte Serving (external link) via HTTP range requests (the HTTP mechanism for byte serving). This is possible because netcdf-java (which ERDDAP uses to read .nc files) supports reading data from remote .nc files via HTTP range requests. For more information, see Remote Directories.
     
  • Millions of Files - Some datasets have millions of source files. ERDDAP can handle this, but with mixed results.
    • For requests that just involve variables listed in <subsetVariables>, ERDDAP has all of the needed information already extracted from the datafiles and stored in one file, so it can respond very, very quickly.
    • For other requests, ERDDAP can scan the dataset's cached file information and figure out that only a few of the files might have data which is relevant to the request and thus respond quickly.
    • But for other requests (for example, waterTemperature=18 degrees_C) where any file might have relevant data, ERDDAP has to open a large number of files to see if each of the files has any data which is relevant to the request. The files are opened sequentially. On any operating system and with any storage medium other than solid state drives, this takes a long time (so ERDDAP responds slowly) and really ties up the file system (so ERDDAP responds slowly to other requests).

    Fortunately, there is a solution.

    1. Set up the dataset on a non-public ERDDAP (your personal computer?).
    2. Create and run a script which requests a series of .ncCF files, each with a large chunk of the dataset, usually a time period (for example, all of the data for a given month). Choose the time period so that all of the resulting files are less than 2GB (but hopefully greater than 1GB). If the dataset has near-real-time data, run the script frequently (every 10 minutes? every hour?) to regenerate the file for the current time period (e.g., this month). Requests to ERDDAP for .ncCF files create a NetCDF v3 .nc file that uses the CF Discrete Sampling Geometries (DSG) (external link) Contiguous Ragged Array data structure.
    3. Set up an EDDTableFromNcCFFiles dataset on your public ERDDAP which gets data from the .nc(CF) files. ERDDAP can extract data from these files very quickly. And since there are now dozens or hundreds (instead of millions) of files, even if ERDDAP has to open all of the files, it can do so quickly.
    Yes, this system takes some time and effort to set up, but it works very, very well. Most data requests can be handled 100 times faster than before.
    [Bob knew this was a possibility, but it was Kevin O'Brien who first did this and showed that it works well. Now, Bob uses this for the GTSPP dataset which has about 18 million source files and which ERDDAP now serves via about 500 .nc(CF) files.]
     
  • FTP Trouble/Advice - If you FTP new data files to the ERDDAP server while ERDDAP is running, there is the chance that ERDDAP will be reloading the dataset during the FTP process. It happens more often than you might think! If it happens, the file will appear to be valid (it has a valid name), but the file isn't valid. If ERDDAP tries to read data from that invalid file, the resulting error will cause the file to be added to the table of invalid files. This is not good. To avoid this problem, use a temporary file name when FTP'ing the file, for example, ABC2005.nc_TEMP . Then, the fileNameRegex test (see below) will indicate that this is not a relevant file. After the FTP process is complete, rename the file to the correct name. The renaming process will cause the file to become relevant in an instant.
     
  • File Name Extracts - EDDTableFromFiles has a system for extracting a String from each file name and using that to make a pseudo data variable. Currently, there is no system to interpret these Strings as dates/times. There are several XML tags to set up this system. If you don't need part or all of this system, just don't specify these tags or use "" values.
    • preExtractRegex is a regular expression (external link) (tutorial (external link)) used to identify text to be removed from the start of the file name. The removal only occurs if the regex is matched. This usually begins with "^" to match the beginning of the file name.
    • postExtractRegex is a regular expression used to identify text to be removed from the end of the file name. The removal only occurs if the regex is matched. This usually ends with "$" to match the end of the file name.
    • extractRegex If present, this regular expression is used after preExtractRegex and postExtractRegex to identify a string to be extracted from the file name (for example, the stationID). If the regex isn't matched, the entire file name is used (minus preExtract and postExtract). Use ".*" to match the entire file name that is left after preExtractRegex and postExtractRegex.
    • columnNameForExtract is the data column source name for the extracted Strings. A dataVariable with this sourceName must be in the dataVariables list (with any data type, but usually String).
    For example, if a dataset has files with names like XYZAble.nc, XYZBaker.nc, XYZCharlie.nc, ..., and you want to create a new variable (stationID) when each file is read which will have station ID values (Able, Baker, Charlie, ...) extracted from the file names, you could use these tags:
    • <preExtractRegex>^XYZ</preExtractRegex>
      The initial ^ is a regular expression special character which forces ERDDAP to look for XYZ at the beginning of the file name. This causes XYZ, if found at the beginning of the file name, to be removed (for example, the file name XYZAble.nc becomes Able.nc).
    • <postExtractRegex>\.nc$</postExtractRegex>
      The $ at the end is a regular expression special character which forces ERDDAP to look for .nc at the end of the file name. Since . is a regular expression special character (which matches any character), it is escaped as \. here so that it matches a literal period. This causes .nc, if found at the end of the file name, to be removed (for example, the partial file name Able.nc becomes Able).
    • <extractRegex>.*</extractRegex>
      The .* regular expression matches all remaining characters (for example, the partial file name Able becomes the extract for the first file).
    • <columnNameForExtract>stationID</columnNameForExtract>
      This tells ERDDAP to create a new source column called stationID when reading each file. Every row of data for a given file will have the text extracted from its file name (for example, Able) as the value in the stationID column.
    In most cases, there are numerous values for these extract tags that will yield the same results -- regular expressions are very flexible. But in a few cases, there is just one way to get the desired results.
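    The four steps above can be sketched in Python (an illustration, not ERDDAP's Java code; the XYZ... file names are the hypothetical ones from the example):

    ```python
    import re

    def extract(file_name,
                pre_extract_regex=r"^XYZ",
                post_extract_regex=r"\.nc$",
                extract_regex=r".*"):
        name = re.sub(pre_extract_regex, "", file_name, count=1)   # preExtractRegex
        name = re.sub(post_extract_regex, "", name, count=1)       # postExtractRegex
        match = re.search(extract_regex, name)                     # extractRegex
        # If extractRegex doesn't match, the whole remaining name is used.
        return match.group(0) if match else name

    print([extract(f) for f in ["XYZAble.nc", "XYZBaker.nc", "XYZCharlie.nc"]])
    ```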
     
  • global: sourceNames - Global metadata attributes in each source data file can be promoted to be data. If a variable's <sourceName> has the format global:attributeName, then when ERDDAP is reading the data from a file, ERDDAP will look for a global attribute of that name (for example, PI) and create a column filled with the attribute's value. This is useful when the attribute has different values in different files, because otherwise, users would only see one of those values for the whole dataset. For example,
    <sourceName>global:PI</sourceName>

    When you promote an attribute to be data, ERDDAP removes the corresponding attribute. If you want, you can add a new value for the attribute for the whole dataset by adding <att name="attributeName">newValue</att> to the dataset's global <addAttributes>. For global attributes that ERDDAP requires, for example, institution, you MUST add a new value for the attribute.
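A sketch of what the promotion does (an illustration, not ERDDAP's code; the attribute value and data rows are hypothetical):

```python
# One file's global attribute and data rows (hypothetical values):
file_global_atts = {"PI": "Dr. Able"}
file_rows = [(1.0, 10.2), (2.0, 10.4)]   # (time, temperature)

# sourceName "global:PI" -> a PI column filled with the attribute's value,
# one value per row read from this file:
table = [row + (file_global_atts["PI"],) for row in file_rows]
print(table)
```

A file with a different PI attribute would contribute rows with that file's value, so users can see which rows came from which PI.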

  • variable: sourceNames - A variable's metadata in each file can be promoted to be data. If a variable's <sourceName> has the format variable:variableName:attributeName (for example, variable:instrument:ID), then when ERDDAP is reading the data from a file, ERDDAP will look for the specified attribute (for example, ID) of the specified variable (for example, instrument) and create a column filled with the attribute's value. This is useful when the attribute has different values in different files, because otherwise, users would only see one of those values for the whole dataset.

    When you promote an attribute to be data, ERDDAP removes the corresponding attribute. If you want, you can add a new value for the attribute for the whole dataset by adding <att name="attributeName">newValue</att> to the variable's <addAttributes>. For attributes that ERDDAP requires, for example, ioos_category (depending on your setup), you MUST add a new value for the attribute.

  • "0 files" Error Message - If you run GenerateDatasetsXml or DasDds, or if you try to load an EDDTableFrom...Files dataset in ERDDAP, and you get a "0 files" error message indicating that ERDDAP found 0 matching files in the directory (when you think that there are matching files in that directory):
    • Check that the files really are in that directory.
    • Check the spelling of the directory name.
    • Check the fileNameRegex. It's really, really easy to make mistakes with regexes. For test purposes, try the regex .* which should match all file names.
    • Check that the user who is running the program (e.g., user=tomcat (?) for Tomcat/ERDDAP) has 'read' permission for those files.
    • In some operating systems (for example, SE Linux) and depending on system settings, the user who ran the program must have 'read' permission for the whole chain of directories leading to the directory that has the files.
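    A quick way to test a fileNameRegex outside ERDDAP (an illustration; the file names are hypothetical). ERDDAP, like Java's String.matches, requires the regex to match the WHOLE file name, which Python's re.fullmatch mimics:

    ```python
    import re

    def matching_files(file_names, file_name_regex):
        # Whole-name match, like Java's String.matches / ERDDAP's fileNameRegex.
        return [n for n in file_names if re.fullmatch(file_name_regex, n)]

    names = ["ABC2005.nc", "ABC2006.nc", "ABC2005.nc_TEMP", "readme.txt"]
    print(matching_files(names, r".*\.nc"))  # _TEMP and .txt files are excluded
    print(matching_files(names, r".*"))      # the test regex: matches everything
    ```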
       
  • The skeleton XML for all EDDTableFromFiles subclasses is:
    <dataset type="EDDTableFrom...Files" datasetID="..." active="..." >
      <nDimensions>...</nDimensions>  <!-- This was used prior to ERDDAP version 1.30, 
        but is now ignored. -->
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <updateEveryNMillis>...</updateEveryNMillis> <!-- 0 or 1. For EDDTableFromFiles subclasses, 
        this uses Java's WatchDirectory system to notice new/deleted/changed files, 
        so it should be fast and efficient. -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <specialMode>mode</specialMode>  <!-- This rarely-used, OPTIONAL tag can be used 
        with EDDTableFromThreddsFiles to specify that special, hard-coded rules 
        should be used to determine which files should be downloaded from the server.
        Currently, the only valid mode is SAMOS which is used with datasets from
        http://coaps.fsu.edu/thredds/catalog/samos to download only the files with 
        the last version number. -->
      <sourceUrl>...</sourceUrl>  <!-- For subclasses like EDDTableFromHyraxFiles and 
        EDDTableFromThreddsFiles, this is where you specify the base URL for the files 
        on the remote server.  For subclasses that get data from local files, ERDDAP 
        doesn't use this information to get the data, but does display the information 
        to users. So I usually use "(local files)". -->
      <fileDir>...</fileDir> <!-- The directory (absolute) with the data files. -->
      <recursive>true|false</recursive> <!-- 0 or 1. Indicates if subdirectories
        of fileDir have data files, too. -->
      <pathRegex>...</pathRegex>  <!-- 0 or 1. Only directory names which 
        match the pathRegex (default=".*") will be accepted. -->
      <fileNameRegex>...</fileNameRegex> <!-- 0 or 1. A regular expression (external link) 
        (tutorial (external link)) describing valid data files names, for example, ".*\.nc" for 
        all .nc files. -->
      <accessibleViaFiles>true|false(default)</accessibleViaFiles> <!-- 0 or 1 -->
      <metadataFrom>...</metadataFrom> <!-- The file to get metadata
        from ("first" or "last" (the default) based on file's 
        lastModifiedTime). -->
      <charset>...</charset> 
        <!-- (For EDDTableFromAsciiFiles and EDDTableFromColumnarAsciiFiles only) 
        This OPTIONAL tag specifies the character set (case sensitive!) of the 
        source files, for example, ISO-8859-1 (the default) and UTF-8.  --> 
      <columnNamesRow>...</columnNamesRow> <!-- (For EDDTableFromAsciiFiles only) 
        This specifies the number of the row with the column names in the files. 
        (The first row of the file is "1". Default = 1.)  If you specify 0, ERDDAP
        will not look for column names and will assign names: 
        Column#1, Column#2, ... -->
      <firstDataRow>...</firstDataRow> 
        <!-- (For EDDTableFromAsciiFiles and EDDTableFromColumnarAsciiFiles only) 
        This specifies the number of the first row with data in the files. 
        (The first row of the file is "1". Default = 2.) -->
      <!-- For the next four tags, see File Name Extracts. -->
      <preExtractRegex>...</preExtractRegex>
      <postExtractRegex>...</postExtractRegex>
      <extractRegex>...</extractRegex>
      <columnNameForExtract>...</columnNameForExtract> 
      <sortedColumnSourceName>...</sortedColumnSourceName> 
        <!-- The sourceName of the numeric column that the data files are 
        usually already sorted by within each file, for example, "time".
        Don't specify this or use an empty string if no variable is suitable.
        It is ok if not all files are sorted by this column.
        If present, this can greatly speed up some data requests. 
        For EDDTableFromHyraxFiles, EDDTableFromNcFiles and EDDTableFromThreddsFiles, 
        this must be the leftmost axis variable. 
        EDDTableFromMultidimNcFiles ignores this because it has a better system.
        -->
      <sortFilesBySourceNames>...</sortFilesBySourceNames>
        <!-- This is a space-separated list of sourceNames 
        which specifies how the internal list of files should be sorted
        (in ascending order), for example "id time". 
        It is the minimum value of the specified columns in each file
        that is used for sorting.
        When a data request is filled, data is obtained from the files
        in this order. Thus it determines the overall order of the data
        in the response.  If you specify more than one column name, the
        second name is used if there is a tie for the first column; the
        third is used if there is a tie for the first and second columns; ...
        This is OPTIONAL (the default is fileDir+fileName order). -->
      
      
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <fileTableInMemory>...</fileTableInMemory> <!-- 0 or 1 (true or false (the default)) -->
      <addAttributes>...</addAttributes> <!-- 0 or 1 -->
      <dataVariable>...</dataVariable> <!-- 1 or more -->
        <!-- For EDDTableFromHyraxFiles, EDDTableFromMultidimNcFiles, EDDTableFromNcFiles, and 
        EDDTableFromThreddsFiles, the source's axis variables (for example, time) needn't
        be first or in any specific order. -->
    </dataset>
    
     

EDDTableFromAsciiService is essentially a screen scraper. It is intended to deal with data sources which have a simple web service for requesting data (often an HTML form on a web page) and which can return the data in some structured ASCII format (for example, a comma-separated-value or columnar ASCII text format, often with other information before and/or after the data).

EDDTableFromAsciiService is the superclass of all EDDTableFromAsciiService... classes. You can't use EDDTableFromAsciiService directly. Instead, use a subclass of EDDTableFromAsciiService to handle specific types of services:

Currently, no other service types are supported. But it is usually relatively easy to support other services if they work in a similar way. Contact us if you have a request.

Details - The following information applies to all of the subclasses of EDDTableFromAsciiService.

  • Constraints - ERDDAP tabular data requests can put constraints on any variable. The underlying service may or may not allow constraints on all variables. For example, many services only support constraints on station names, latitude, longitude, and time. So when a subclass of EDDTableFromAsciiService gets a request for a subset of a dataset, it passes as many constraints as possible to the source data service and then applies the remaining constraints to the data returned by the service, before handing the data to the user.
  • Valid Range - Unlike many other dataset types, EDDTableFromAsciiService usually doesn't know the range of data for each variable, so it can't quickly reject requests for data outside of the valid range.
  • Parsing the ASCII Text Response - When EDDTableFromAsciiService gets a response from an ASCII Text Service, it must validate that the response has the expected format and information, and then extract the data. You can specify the format by using various special tags in the chunk of XML for this dataset:
    • <beforeData1> through <beforeData10> tags - You can specify a series of pieces of text (as many as you want, up to 10) that EDDTableFromAsciiService must look for in the header of the ASCII text returned by the service with <beforeData1> through <beforeData10>. For example, this is useful for verifying that the response includes the expected variables using the expected units. The last beforeData tag that you specify identifies the text that occurs right before the data starts.
    • <afterData> - This specifies the text that EDDTableFromAsciiService will look for in the ASCII text returned by the service which signifies the end of the data.
    • <noData> - If EDDTableFromAsciiService finds this text in the ASCII text returned by the service, it concludes that there is no data which matches the request.
  • The skeleton XML for all EDDTableFromAsciiService subclasses is:
    <dataset type="EDDTableFromAsciiService..." datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <sourceUrl>...</sourceUrl>  
      <beforeData1>...</beforeData1> <!-- 0 or 1 -->
      ...
      <beforeData10>...</beforeData10> <!-- 0 or 1 -->
      <afterData>...</afterData> <!-- 0 or 1 --> 
      <noData>...</noData> <!-- 0 or 1 -->
      <addAttributes>...</addAttributes> <!-- 0 or 1 -->
      <dataVariable>...</dataVariable> <!-- 1 or more -->
    </dataset>
    
     

EDDTableFromAsciiServiceNOS makes EDDTable datasets from the ASCII text data services offered by NOAA's National Ocean Service (NOS) (external link). For information on how this class works and how to use this class, see this class's superclass EDDTableFromAsciiService. It is unlikely that anyone other than Bob Simons will need to use this subclass.

Since the data within the response from a NOS service uses a columnar ASCII text format, data variables other than latitude and longitude must have a special attribute which specifies which characters of each data line contain that variable's data, for example,
<att name="responseSubstring">17, 25</att>
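For illustration, assuming the "17, 25" values behave like Java's String.substring (start column inclusive, stop column exclusive, first column 0) -- the data line below is made up, not an actual NOS response:

```python
# Hypothetical fixed-width data line; "17, 25" is assumed to work like
# Java's String.substring(17, 25): start inclusive, stop exclusive.
line = "8454000 Eastport  2.345   12.3456"
water_level = line[17:25].strip()   # characters 17 through 24
print(water_level)
```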
 

EDDTableFromAsciiFiles aggregates data from comma-, tab-, or space-separated tabular ASCII data files.

  • Most often, the files will have column names on the first row and data starting on the second row. (Here, the first row of the file is called row number 1.) But you can use <columnNamesRow> and <firstDataRow> in your datasets.xml file to specify a different row number.
  • ERDDAP allows the rows of data to have different numbers of data values. ERDDAP assumes that the missing data values are the final columns in the row, and assigns the standard missing value to each of them. (added v1.56)
  • ASCII files are easy to work with, but they are not an efficient way to store/retrieve data. For greater efficiency, save the files as NetCDF v3 .nc files (with one dimension, "row", shared by all variables) instead. You can use ERDDAP to generate the new files.
  • See this class' superclass, EDDTableFromFiles, for information on how this class works and how to use this class.
  • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. Because of the total lack of metadata in ASCII files, you will always need to edit the results of GenerateDatasetsXml.
     

EDDTableFromAwsXmlFiles aggregates data from a set of Automatic Weather Station (AWS) XML data files. Some background information is at WeatherBug_Rest_XML_API (external link).

EDDTableFromColumnarAsciiFiles aggregates data from tabular ASCII data files with fixed-width columns.

  • Most often, the files will have column names on the first row and data starting on the second row. (Here, the first row of the file is called row number 1.) But you can use <columnNamesRow> and <firstDataRow> in your datasets.xml file to specify a different row number.
  • The <addAttributes> for each <dataVariable> for these datasets MUST include these two special attributes:
    • <att name="startColumn">integer</att> - specifies the character column in each line where this data variable starts.
    • <att name="stopColumn">integer</att> - specifies the character column in each line that is one past the end of this data variable.
    The first character column is called column #0.
    For example, for this file, which has time values abutting temperature values:
      0         1         2           <-- character column number 10's digit
      0123456789012345678901234567    <-- character column number 1's digit
      time                temp
      2014-12-01T12:00:00Z12.3
      2014-12-02T12:00:00Z13.6
      2014-12-03T12:00:00Z11.0
      
    the time data variable would have
      <att name="startColumn">0</att>
      <att name="stopColumn">20</att>

    and the temp data variable would have
      <att name="startColumn">20</att>
      <att name="stopColumn">24</att>

    These attributes MUST be specified for all variables except fixed-value and file-name-extract variables.
  • ASCII files are easy to work with, but they are not an efficient way to store/retrieve data. For greater efficiency, save the files as NetCDF v3 .nc files (with one dimension, "row", shared by all variables) instead. You can use ERDDAP to generate the new files.
  • See this class' superclass, EDDTableFromFiles, for information on how this class works and how to use this class.
  • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. Because of the difficulty of determining the start and end positions for each data column and the total lack of metadata in ASCII files, you will always need to edit the results from GenerateDatasetsXml.
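Python string slicing uses the same convention as startColumn/stopColumn (first character is 0, stop index one past the end), so the example above can be checked directly (a sketch; the data line comes from the example file):

```python
# The example line from the file above; Python slices use the same
# startColumn/stopColumn convention (0-based, stop is one past the end).
line = "2014-12-01T12:00:00Z12.3"
time_value = line[0:20]    # startColumn=0,  stopColumn=20
temp_value = line[20:24]   # startColumn=20, stopColumn=24
print(time_value, temp_value)
```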
     

EDDTableFromHyraxFiles aggregates data files with several variables, each with one or more shared dimensions (for example, time, altitude (or depth), latitude, longitude), and served by a Hyrax OPeNDAP server (external link).

  • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it.
  • In most cases, each file has multiple values for the leftmost dimension, for example time.
  • The files often (but don't have to) have a single value for the other dimensions (for example, altitude (or depth), latitude, longitude).
  • The files may have character variables with an additional dimension (for example, nCharacters).
  • Hyrax servers can be identified by the "/dods-bin/nph-dods/" or "/opendap/" in the URL.
  • This class screen-scrapes the Hyrax web pages with the lists of files in each directory. Because of this, it is very specific to the current format of Hyrax web pages. We will try to adjust ERDDAP quickly if/when future versions of Hyrax change how the files are listed.
  • The <fileDir> setting is ignored. Since this class downloads and makes a local copy of each remote data file, ERDDAP forces the fileDir to be bigParentDirectory/copy/datasetID/.
  • For <sourceUrl>, use the URL of the base directory of the dataset in the Hyrax server, for example,
    <sourceUrl>http://edac-dap.northerngulfinstitute.org/dods-bin/nph-dods/WCOS/nmsp/wcos/</sourceUrl>
    (although that server is no longer available).
    The sourceUrl web page usually has "OPeNDAP Server Index of [directoryName]" at the top.
  • Since this class always downloads and makes a local copy of each remote data file, you should never wrap this dataset in EDDTableCopy.
  • See this class' superclass, EDDTableFromFiles, for information on how this class works and how to use this class.
  • See the 1D, 2D, 3D, and 4D examples for EDDTableFromNcFiles.
     

EDDTableFromMultidimNcFiles aggregates data from NetCDF (v3 or v4) .nc (or .ncml) files with several variables, each with one or more shared dimensions. The files may have character variables with or without an additional dimension (for example, STRING14). See this class' superclass, EDDTableFromFiles, for information on how this class works and how to use this class.

  • If the .nc files use one of the CF Discrete Sampling Geometries (DSG) (external link) file formats, try using EDDTableFromNcCFFiles before trying this.
  • For new tabular datasets from .nc files, use this option before trying the older EDDTableFromNcFiles. Some advantages of this class are:
    • This class can read more variables from a wider variety of file structures. So when you specify a DimensionsCSV in GenerateDatasetsXml, this class will find more matching variables.
    • This class can often reject files very quickly if they don't match a request's constraints. So reading data from large collections will often go much faster.
    • This class handles true char variables (non-String variables) correctly.
    • This class can trim String variables when the creator didn't use Netcdf-java's writeStrings (which appends char #0 to mark the end of the string).
    • This class is better at dealing with individual files that lack certain variables or dimensions.
    • This class can remove blocks of rows with missing values as specified for CF Discrete Sampling Geometries (DSG) Incomplete Multidimensional Array files (external link).
  • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it.

    The first thing GenerateDatasetsXml does for this type of dataset after you answer the questions is print the ncdump-like structure of the sample file. So if you enter a few goofy answers for the first loop through GenerateDatasetsXml, at least you'll be able to see if ERDDAP can read the file and see what dimensions and variables are in the file. Then you can give better answers for the second loop through GenerateDatasetsXml.

    DimensionsCSV - GenerateDatasetsXml will ask for a "DimensionsCSV" string. This is a comma-separated-value list of source names of a set of dimensions. GenerateDatasetsXml will find the data variables in the sample .nc file which use any or all of those dimensions (and no other dimensions), plus all of the scalar variables in the file, and make the dataset from those data variables.
    If you specify nothing (an empty string), GenerateDatasetsXml will look for the variables with the most dimensions, on the theory that they will be the most interesting, but there may be times when you will want to make a dataset from some other group of data variables that uses some other group of dimensions.
    If you specify a dimension name that doesn't exist (e.g., NO_MATCH), ERDDAP will just find all of the scalar variables.
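To make this concrete, the start of a datasets.xml chunk for this dataset type looks roughly like the following sketch (the datasetID, fileDir, and fileNameRegex here are hypothetical placeholders; GenerateDatasetsXml will fill in the real values and the metadata):

```xml
<dataset type="EDDTableFromMultidimNcFiles" datasetID="myMultidimData" active="true">
    <reloadEveryNMinutes>10080</reloadEveryNMinutes>
    <fileDir>/data/myMultidimData/</fileDir>
    <fileNameRegex>.*\.nc</fileNameRegex>
    <recursive>true</recursive>
    <addAttributes>...</addAttributes>
    <dataVariable>...</dataVariable> <!-- 1 or more -->
</dataset>
```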

EDDTableFromNcFiles aggregates data from NetCDF (v3 or v4) .nc (or .ncml) files with several variables, each with one shared dimension (for example, time) or more than one shared dimension (for example, time, altitude (or depth), latitude, longitude). The files must have the same dimension names. A given file may have multiple values for each of the dimensions and the values may be different in different files. The files may have character variables with an additional dimension (for example, STRING14). See this class' superclass, EDDTableFromFiles, for information on how this class works and how to use this class.

  • If the .nc files use one of the CF Discrete Sampling Geometries (DSG) (external link) file formats, try using EDDTableFromNcCFFiles before trying this.
  • For new tabular datasets from .nc files, try the newer EDDTableFromMultidimNcFiles first.
  • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it.

    The first thing GenerateDatasetsXml does for this type of dataset after you answer the questions is print the ncdump-like structure of the sample file. So if you enter a few goofy answers for the first loop through GenerateDatasetsXml, at least you'll be able to see if ERDDAP can read the file and see what dimensions and variables are in the file. Then you can give better answers for the second loop through GenerateDatasetsXml.

    DimensionsCSV - GenerateDatasetsXml will ask for a "DimensionsCSV" string. This is a comma-separated-value list of source names of a set of dimensions. GenerateDatasetsXml will find the data variables in the .nc files which use that list of dimensions and make the dataset from those data variables. If you specify nothing (an empty string), GenerateDatasetsXml will look for the variables with the most dimensions, on the theory that they will be the most interesting, but there may be times when you will want to make a dataset from some other group of data variables that uses some other group of dimensions.

  • 1D Example: 1D files are somewhat different from 2D, 3D, 4D, ... files.
    • You might have a set of .nc data files where each file has one month's worth of data from one drifting buoy.
    • Each file will have 1 dimension, for example, time (size = [many]).
    • Each file will have one or more 1D variables which use that dimension, for example, time, longitude, latitude, air temperature, ....
    • Each file may have 2D character variables, for example, with dimensions (time,nCharacters).
  • 2D Example:
    • You might have a set of .nc data files where each file has one month's worth of data from one drifting buoy.
    • Each file will have 2 dimensions, for example, time (size = [many]) and id (size = 1).
    • Each file will have 2 1D variables with the same names as the dimensions and using the same-name dimension, for example, time(time), id(id). These 1D variables should be included in the list of <dataVariable>'s in the dataset's XML.
    • Each file will have one or more 2D variables, for example, longitude, latitude, air temperature, water temperature, ...
    • Each file may have 3D character variables, for example, with dimensions (time,id,nCharacters).
  • 3D Example:
    • You might have a set of .nc data files where each file has one month's worth of data from one stationary buoy.
    • Each file will have 3 dimensions, for example, time (size = [many]), lat (size = 1), and lon (size = 1).
    • Each file will have 3 1D variables with the same names as the dimensions and using the same-name dimension, for example, time(time), lat(lat), lon(lon). These 1D variables should be included in the list of <dataVariable>'s in the dataset's XML.
    • Each file will have one or more 3D variables, for example, air temperature, water temperature, ...
    • Each file may have 4D character variables, for example, with dimensions (time,lat,lon,nCharacters).
    • The file's name might have the buoy's name within the file's name.
  • 4D Example:
    • You might have a set of .nc data files where each file has one month's worth of data from one station. At each time point, the station takes readings at a series of depths.
    • Each file will have 4 dimensions, for example, time (size = [many]), depth (size = [many]), lat (size = 1), and lon (size = 1).
    • Each file will have 4 1D variables with the same names as the dimensions and using the same-name dimension, for example, time(time), depth(depth), lat(lat), lon(lon). These 1D variables should be included in the list of <dataVariable>'s in the dataset's XML.
    • Each file will have one or more 4D variables, for example, air temperature, water temperature, ...
    • Each file may have 5D character variables, for example, with dimensions (time,depth,lat,lon,nCharacters).
    • The file's name might have the buoy's name within the file's name.
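For the 4D example, the relevant part of the datasets.xml chunk might look like this sketch (the datasetID, fileDir, and variable names are hypothetical). Note that the four 1D dimension variables appear in the <dataVariable> list alongside the 4D data variables:

```xml
<dataset type="EDDTableFromNcFiles" datasetID="stationProfiles" active="true">
    <fileDir>/data/stationProfiles/</fileDir>
    <fileNameRegex>.*\.nc</fileNameRegex>
    <dataVariable><sourceName>time</sourceName>...</dataVariable>
    <dataVariable><sourceName>depth</sourceName>...</dataVariable>
    <dataVariable><sourceName>lat</sourceName>...</dataVariable>
    <dataVariable><sourceName>lon</sourceName>...</dataVariable>
    <dataVariable><sourceName>water_temperature</sourceName>...</dataVariable>
</dataset>
```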
       

EDDTableFromNcCFFiles aggregates data from NetCDF (v3 or v4) .nc (or .ncml) files which use one of the file formats specified by the CF Discrete Sampling Geometries (DSG) (external link) conventions. See this class' superclass, EDDTableFromFiles, for information on how this class works and how to use this class.

The CF DSG conventions define dozens of file formats and include numerous minor variations. This class deals with all of the variations we are aware of, but we may have missed one (or more). So if this class can't read data from your CF DSG files, please email bob.simons at noaa.gov and include a sample file.
Or, you can join the ERDDAP Google Group / Mailing List and post your question there.

We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it.
 

EDDTableFromNOS handles data from a NOAA NOS (external link) source, which uses SOAP+XML for requests and responses. It is very specific to NOAA NOS's XML. See the sample EDDTableFromNOS dataset in datasets2.xml.
 

EDDTableFromOBIS handles data from an Ocean Biogeographic Information System (OBIS) (external link) server.

  • OBIS servers expect an XML request and return an XML response.
  • Because all OBIS servers serve the same variables the same way (see the OBIS schema (external link)), you don't have to specify much to set up an OBIS dataset in ERDDAP.
  • You MUST include a "creator_email" attribute in the global addAttributes, since that information is used within the license. A suitable email address can be found by reading the XML response from the sourceURL.
  • You may or may not be able to get the global attribute <subsetVariables> to work with a given OBIS server. If you try, just try one variable (for example, ScientificName or Genus).
  • The skeleton XML for an EDDTableFromOBIS dataset is:
    <dataset type="EDDTableFromOBIS" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
      <sourceCode>...</sourceCode>
        <!-- If you read the XML response from the sourceUrl, the 
        source code (for example, GHMP) is the value from one of the 
        <resource><code> tags. -->
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <!-- All ...SourceMinimum and Maximum tags are OPTIONAL -->
      <longitudeSourceMinimum>...</longitudeSourceMinimum> 
      <longitudeSourceMaximum>...</longitudeSourceMaximum> 
      <latitudeSourceMinimum>...</latitudeSourceMinimum> 
      <latitudeSourceMaximum>...</latitudeSourceMaximum> 
      <altitudeSourceMinimum>...</altitudeSourceMinimum> 
      <altitudeSourceMaximum>...</altitudeSourceMaximum> 
      <!-- For timeSource... tags, use yyyy-MM-dd'T'HH:mm:ssZ format. -->
      <timeSourceMinimum>...</timeSourceMinimum> 
      <timeSourceMaximum>...</timeSourceMaximum> 
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ> <!-- 0 or 1 -->
      <addAttributes>...</addAttributes> <!-- 0 or 1.  This MUST include "creator_email" -->
    </dataset>
    
     

EDDTableFromSOS handles data from a Sensor Observation Service (SWE/SOS (external link)) server.

  • This dataset type aggregates data from a group of stations which are all served by one SOS server.
  • The stations all serve the same set of variables (although the source for each station doesn't have to serve all variables).
  • SOS servers expect an XML request and return an XML response.
  • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it. It is not easy to generate the dataset XML for SOS datasets by hand. To find the needed information, you must visit sourceUrl+"?service=SOS&request=GetCapabilities" in a browser; look at the XML; make a GetObservation request by hand; and look at the XML response to the request.
  • With the occasional addition of new types of SOS servers and changes to the old servers, it is getting harder for ERDDAP to automatically detect the server type from the server's responses. The use of <sosServerType> (with a value of IOOS_NDBC, IOOS_NOS, OOSTethys, or WHOI) is now STRONGLY RECOMMENDED. If you have problems with any datasets of this type, try re-running GenerateDatasetsXml for the SOS server. GenerateDatasetsXml will let you try out the different <sosServerType> options until you find the right one for a given server.
  • SOS overview:
    • SWE (Sensor Web Enablement) and SOS (Sensor Observation Service) are OpenGIS® standards (external link). That web site has the standards documents.
    • The OGC Web Services Common Specification ver 1.1.0 (OGC 06-121r3) covers construction of GET and POST queries (see section 7.2.3 and section 9).
    • If you send a getCapabilities xml request to a SOS server (sourceUrl + "?service=SOS&request=GetCapabilities"), you get an xml result with a list of stations and the observedProperties that they have data for.
    • An observedProperty is a formal URI reference to a property. For example, urn:ogc:phenomenon:longitude:wgs84 or http://marinemetadata.org/cf#sea_water_temperature
    • An observedProperty isn't a variable.
    • More than one variable may have the same observedProperty (for example, insideTemp and outsideTemp might both have observedProperty http://marinemetadata.org/cf#air_temperature).
    • If you send a getObservation xml request to a SOS server, you get an xml result with descriptions of field names in the response, field units, and the data. The field names will include longitude, latitude, depth(perhaps), and time.
    • Each dataVariable for an EDDTableFromSOS must include an "observedProperty" attribute, which identifies the observedProperty that must be requested from the server to get that variable. Often, several dataVariables will list the same composite observedProperty.
    • The dataType for each dataVariable may not be specified by the server. If so, you must look at the XML data responses from the server and assign appropriate <dataType>s in the ERDDAP dataset dataVariable definitions.
    • (At the time of writing this) some SOS servers respond to getObservation requests for more than one observedProperty by just returning results for the first of the observedProperties. (No error message!) See the constructor parameter requestObservedPropertiesSeparately.
  • EDDTableFromSOS automatically adds
    <att name="subsetVariables">station_id, longitude, latitude</att>
    to the dataset's global attributes when the dataset is created.
  • SOS servers usually express units with the UCUM (external link) system. Most ERDDAP servers express units with the UDUNITS (external link) system. If you need to convert between the two systems, you can use ERDDAP's web service to convert UCUM units to/from UDUNITS.
  • The skeleton XML for an EDDTableFromSOS dataset is:
    <dataset type="EDDTableFromSOS" datasetID="..." active="..." >
      <sourceUrl>...</sourceUrl>
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <sosServerType>...</sosServerType> <!-- 0 or 1, but STRONGLY RECOMMENDED.
        This lets you specify the type of SOS server (so ERDDAP doesn't have to figure it out).
        Valid values are: IOOS_NDBC, IOOS_NOS, OOSTethys, and WHOI. -->
      <responseFormat>...</responseFormat> <!-- 0 or 1. Use this only if you need to override the default
        responseFormat for the specified sosServerType.  -->
      <stationIdSourceName>...</stationIdSourceName> <!-- 0 or 1. 
        Default="station_id". -->
      <longitudeSourceName>...</longitudeSourceName>
      <latitudeSourceName>...</latitudeSourceName>
      <altitudeSourceName>...</altitudeSourceName>
      <altitudeSourceMinimum>...</altitudeSourceMinimum> <!-- 0 or 1 -->
      <altitudeSourceMaximum>...</altitudeSourceMaximum> <!-- 0 or 1 -->
      <altitudeMetersPerSourceUnit>...</altitudeMetersPerSourceUnit> 
      <timeSourceName>...</timeSourceName>
      <timeSourceFormat>...</timeSourceFormat>
        <!-- timeSourceFormat MUST be either
        * For numeric data: a UDUnits (external link)-compatible string (with the format 
          "units since baseTime") describing how to interpret
          source time values (for example, "seconds since 1970-01-01T00:00:00Z"),
          where the base time is an ISO 8601:2004(E) formatted date time string 
          (yyyy-MM-dd'T'HH:mm:ssZ).
        * For String data: an org.joda.time.format.DateTimeFormat 
          string (which is mostly compatible with java.text.SimpleDateFormat)
          describing how to interpret string times  (for example, the 
          ISO8601TZ_FORMAT "yyyy-MM-dd'T'HH:mm:ssZ").  See Joda DateTimeFormat (external link) -->
      <observationOfferingIdRegex>...</observationOfferingIdRegex>
        <!-- Only observationOfferings with IDs (usually the station names) 
        which match this regular expression (external link) (tutorial (external link)) will be included 
        in the dataset (".+" will catch all station names). -->
      <requestObservedPropertiesSeparately>true|false(default)
        </requestObservedPropertiesSeparately>
      <sourceNeedsExpandedFP_EQ>true(default)|false</sourceNeedsExpandedFP_EQ>
      <addAttributes>...</addAttributes> <!-- 0 or 1 -->
      <dataVariable>...</dataVariable> <!-- 1 or more. 
        * Each dataVariable MUST include the dataType tag.
        * Each dataVariable MUST include the observedProperty attribute. 
        * For IOOS SOS servers, *every* variable returned in the text/csv
          response MUST be included in this ERDDAP dataset definition. -->
    </dataset>
    
     

EDDTableFromThreddsFiles aggregates data files with several variables, each with one or more shared dimensions (for example, time, altitude (or depth), latitude, longitude), and served by a THREDDS OPeNDAP server (external link).

  • We strongly recommend using the GenerateDatasetsXml program to make a rough draft of the datasets.xml chunk for this dataset. You can then edit that to fine tune it.
  • In most cases, each file has multiple values for the leftmost dimension, for example time.
  • The files often (but don't have to) have a single value for the other dimensions (for example, altitude (or depth), latitude, longitude).
  • The files may have character variables with an additional dimension (for example, nCharacters).
  • THREDDS servers can be identified by the "/thredds/" in the URLs. For example,
    http://data.nodc.noaa.gov/thredds/catalog/nmsp/wcos/catalog.html
  • This class reads the catalog.xml files served by THREDDS with the lists of <catalogRefs> (references to additional catalog.xml sub-files) and <dataset>s (data files).
  • The <fileDir> setting is ignored. Since this class downloads and makes a local copy of each remote data file, ERDDAP forces the fileDir to be bigParentDirectory/copy/datasetID/.
  • For <sourceUrl>, use the URL of the catalog.xml file for the dataset in the THREDDS server, for example: for this URL which may be used in a web browser,
    http://data.nodc.noaa.gov/thredds/catalog/nmsp/wcos/catalog.html ,
    use <sourceUrl>http://data.nodc.noaa.gov/thredds/catalog/nmsp/wcos/catalog.xml</sourceUrl> .
  • Since this class always downloads and makes a local copy of each remote data file, you should never wrap this dataset in EDDTableCopy.
  • This dataset type supports an OPTIONAL, rarely-used, special tag, <specialMode>mode</specialMode> which can be used to specify that special, hard-coded rules should be used to determine which files should be downloaded from the server. Currently, the only valid mode is SAMOS which is used with datasets from http://coaps.fsu.edu/thredds/catalog/samos to download only the files with the last version number.
  • See this class' superclass, EDDTableFromFiles, for information on how this class works and how to use this class.
  • See the 1D, 2D, 3D, and 4D examples for EDDTableFromNcFiles.
     

EDDTableFromWFSFiles makes a local copy of all of the data from an ArcGIS MapServer WFS server so the data can then be re-served quickly to ERDDAP users.

  • You need to specify a specially formatted sourceUrl global attribute to tell ERDDAP how to request feature information from the server. Please use this example as a template:
    <att name="sourceUrl">
    http://kgs.uky.edu/arcgis/services/aasggeothermal/WVBoreholeTemperatures/MapServer/WFSServer?
    request=GetFeature&amp;service=WFS&amp;typename=aasg:BoreholeTemperature
    &amp;format=&quot;text/xml;%20subType=gml/3.1.1/profiles/gmlsf/1.0.0/0&quot;</att>
  • You need to add a special global attribute to tell ERDDAP how to identify the names of the chunks of data that should be downloaded. This will probably work for all EDDTableFromWFSFiles datasets:
    <att name="rowElementXPath">/wfs:FeatureCollection/gml:featureMember</att>
  • Since this class always downloads and makes a local copy of each remote data file, you should never wrap this dataset in EDDTableCopy.
  • See this class' superclass, EDDTableFromFiles, for additional information on how this class works and how to use this class.
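Putting those two special global attributes together, the skeleton of an EDDTableFromWFSFiles chunk might look like this sketch (the datasetID is hypothetical; the sourceUrl value is the specially formatted URL described above):

```xml
<dataset type="EDDTableFromWFSFiles" datasetID="myWFSData" active="true">
    <reloadEveryNMinutes>10080</reloadEveryNMinutes>
    <addAttributes>
        <!-- the specially formatted WFS GetFeature URL, as in the example above -->
        <att name="sourceUrl">...</att>
        <att name="rowElementXPath">/wfs:FeatureCollection/gml:featureMember</att>
    </addAttributes>
    <dataVariable>...</dataVariable> <!-- 1 or more -->
</dataset>
```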
     

EDDTableAggregateRows can make an EDDTable dataset from a group of "child" EDDTable datasets.

  • Here are some uses for EDDTableAggregateRows:
    • You could make an EDDTableAggregateRows dataset from two different kinds of files or data sources, for example, a dataset with data up to the end of last month stored in .ncCF files and a dataset with data for the current month stored in a relational database.
    • You could make an EDDTableAggregateRows dataset to deal with a change in source files (for example, the time format changed, or a variable name changed, or dataType/scale_factor/add_offset changed). In this case, one child would get data from files made before the change and the other child would get data from files made after the change. This use of EDDTableAggregateRows is an alternative to using NcML or NCO. Unless there is a distinguishing feature in the file names (so you can use <fileNameRegex> to determine which file belongs to which child dataset), you probably need to store the files for the two child datasets in different directories.
    • You could make an EDDTableAggregateRows dataset which has a shared subset of variables of one or more similar but different datasets, for example, a dataset which makes a Profile dataset from the combination of a Profile dataset, a TimeSeriesProfile dataset, and a TrajectoryProfile dataset (which have some different variables and some variables in common -- in which case you'll have to make special variants for the child datasets, with just the in-common variables).
  • The "source" globalAttributes for the EDDTableAggregateRows is the combined globalAttributes from the first child dataset. The EDDTableAggregateRows can have a global <addAttributes> to provide additional global attributes or override the source global attributes.
  • All child datasets must have the same dataVariables, in the same order, with the same destinationNames, dataTypes, missing_values, _FillValues, and units. The metadata for each variable for the EDDTableAggregateRows dataset comes from the variables in the first child dataset, but EDDTableAggregateRows will update the actual_range metadata to be the actual range for all of the children.
  • Dataset Default Sort Order - The order of the child datasets determines the overall default sort order of the results. Of course, users can request a different sort order for a given set of results by appending &orderBy("comma-separated list of variables") to the end of their query.
  • Skeleton XML - The skeleton XML for an EDDTableAggregateRows dataset is:
    <dataset type="EDDTableAggregateRows" datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <accessibleViaFiles>true|false(default)</accessibleViaFiles> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <updateEveryNMillis>...</updateEveryNMillis> <!-- 0 or 1. -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <dataset>...</dataset> <!-- 1 or more -->
    </dataset>
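For example, to combine archived data with current data (the first use case above), the child datasets are simply listed in order inside the EDDTableAggregateRows chunk. The datasetIDs and child dataset types in this sketch are hypothetical:

```xml
<dataset type="EDDTableAggregateRows" datasetID="combinedBuoyData" active="true">
    <!-- data up to the end of last month, from .ncCF files -->
    <dataset type="EDDTableFromNcCFFiles" datasetID="archivedBuoyData">...</dataset>
    <!-- data for the current month, from a relational database -->
    <dataset type="EDDTableFromDatabase" datasetID="currentBuoyData">...</dataset>
</dataset>
```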
    
     

EDDTableCopy can make a local copy of many types of EDDTable datasets and then re-serve the data quickly from the local copy.

  • EDDTableCopy (and for grid data, EDDGridCopy) is a very easy-to-use and very effective solution to some of the biggest problems with serving data from remote data sources:
    • Accessing data from a remote data source can be slow.
      • They may be slow because they are inherently slow (for example, an inefficient type of server),
      • because they are overwhelmed by too many requests,
      • or because your server or the remote server is bandwidth limited.
    • The remote dataset is sometimes unavailable (again, for a variety of reasons).
    • Relying on one source for the data doesn't scale well (for example, when many users and many ERDDAPs utilize it).
       
  • How It Works - EDDTableCopy solves these problems by automatically making and maintaining a local copy of the data and serving data from the local copy. ERDDAP can serve data from the local copy very, very quickly. Making and using a local copy also relieves the burden on the remote server. And the local copy is a backup of the original, which is useful in case something happens to the original.

    There is nothing new about making a local copy of a dataset. What is new here is that this class makes it *easy* to create and *maintain* a local copy of data from a *variety* of types of remote data sources and *add metadata* while copying the data.

  • <extractDestinationNames> - EDDTableCopy makes the local copy of the data by requesting chunks of data from the remote dataset. EDDTableCopy determines which chunks to request by requesting the &distinct() values for the <extractDestinationNames> (specified in the datasets.xml, see below), which are the space-separated destination names of variables in the remote dataset. For example,
    <extractDestinationNames>drifter profile</extractDestinationNames>
    might yield distinct combinations of values: drifter=tig17,profile=1017, drifter=tig17,profile=1095, ... drifter=une12,profile=1223, drifter=une12,profile=1251, ....

    In situations where one column (for example, profile) is all that is required to uniquely identify a group of rows of data, but there are a very large number of profiles, it may be useful to also specify an additional extractDestinationName (for example, drifter) which serves to subdivide the profiles. That leads to fewer data files in a given directory, which may lead to faster access.

  • Local Files - Each chunk of data is stored in a separate NetCDF file in a subdirectory of bigParentDirectory/copy/datasetID/ (as specified in setup.xml). There is one subdirectory level for all but the last extractDestinationName. For example, data for tig17+1017 would be stored in
     bigParentDirectory/copy/sampleDataset/tig17/1017.nc ,
     and data for une12+1251 would be stored in
     bigParentDirectory/copy/sampleDataset/une12/1251.nc .
    Directory and file names created from data values are modified to make them file-name-safe (for example, spaces are replaced by "x20") -- this doesn't affect the actual data.
     
  • New Data - Each time EDDTableCopy is reloaded, it checks the remote dataset to see what distinct chunks are available. If the file for a chunk of data doesn't already exist, a request to get the chunk is added to a queue. ERDDAP's taskThread processes all the queued requests for chunks of data, one-by-one. You can see statistics for the taskThread's activity on the Status Page and in the Daily Report. (Yes, ERDDAP could assign multiple tasks to this process, but that would use up lots of the remote data source's bandwidth, memory, and CPU time, and lots of the local ERDDAP's bandwidth, memory, and CPU time, neither of which is a good idea.)

    NOTE: The very first time an EDDTableCopy is loaded, lots of requests for chunks of data will (if all goes well) be added to the taskThread's queue, but no local data files will have been created yet. So the constructor will fail, but the taskThread will continue to work and create local files. If all goes well, the next attempt to reload the dataset (in ~15 minutes) will succeed, but initially with a very limited amount of data.

    WARNING: If the remote dataset is large and/or the remote server is slow (that's the problem, isn't it?!), it will take a long time to make a complete local copy. In some cases, the time needed will be unacceptable. For example, transmitting 1 TB of data over a T1 line (1.5 Mbit/s) takes at least 60 days, under optimal conditions. Plus, it uses lots of bandwidth, memory, and CPU time on the remote and local computers. The solution is to mail a hard drive to the administrator of the remote dataset so that s/he can make a copy of the dataset and mail the hard drive back to you. Use that data as a starting point and EDDTableCopy will add data to it. (That is how Amazon's EC2 Cloud Service (external link) handles the problem, even though their system has lots of bandwidth.)

    WARNING: If a given combination of values disappears from the remote dataset, EDDTableCopy does NOT delete the local copied file. If you want to, you can delete it yourself.

  • Recommended Use -
    1. Create the <dataset> entry (the native type, not EDDTableCopy) for the remote data source. Get it working correctly, including all of the desired metadata.
    2. If it is too slow, add XML code to wrap it in an EDDTableCopy dataset.
      • Use a different datasetID (perhaps by changing the old datasetID slightly).
      • Copy the <accessibleTo>, <reloadEveryNMinutes> and <onChange> from the remote EDDTable's XML to the EDDTableCopy's XML. (Their values for EDDTableCopy matter; their values for the inner dataset become irrelevant.)
      • Create the <extractDestinationNames> tag (see above).
      • <orderExtractBy> is an OPTIONAL space-separated list of destination variable names in the remote dataset. When each chunk of data is downloaded from the remote server, the chunk will be sorted by these variables (by the first variable, then by the second variable if the first variable is tied, ...). In some cases, ERDDAP will be able to extract data faster from the local data files if the first variable in the list is a numeric variable ("time" counts as a numeric variable). But choose these variables in a way that is appropriate for the dataset.
    3. ERDDAP will make and maintain a local copy of the data.
       
  • WARNING: EDDTableCopy assumes that the data values for each chunk don't ever change. If/when they do, you need to manually delete the chunk files in bigParentDirectory/copy/datasetID/ which changed and flag the dataset to be reloaded so that the deleted chunks will be replaced. If you have an email subscription to the dataset, you will get two emails: one when the dataset first reloads and starts to copy the data, and another when the dataset loads again (automatically) and detects the new local data files.
     
  • Change Metadata - If you need to change any addAttributes or change the order of the variables associated with the source dataset:
    1. Change the addAttributes for the source dataset in datasets.xml, as needed.
    2. Delete one of the copied files.
    3. Set a flag to reload the dataset immediately. If you do use a flag and you have an email subscription to the dataset, you will get two emails: one when the dataset first reloads and starts to copy the data, and another when the dataset loads again (automatically) and detects the new local data files.
    4. The deleted file will be regenerated with the new metadata. If the source dataset is ever unavailable, the EDDTableCopy dataset will get metadata from the regenerated file, since it is the youngest file.
       
  • Note that EDDGridCopy is very similar to EDDTableCopy, but works with gridded datasets.
     
  • Skeleton XML - The skeleton XML for an EDDTableCopy dataset is:
    <dataset type="EDDTableCopy" datasetID="..." active="..." >
      <accessibleTo>...</accessibleTo> <!-- 0 or 1 -->
      <graphsAccessibleTo>auto|public</graphsAccessibleTo> <!-- 0 or 1 -->
      <accessibleViaFiles>true|false(default)</accessibleViaFiles> <!-- 0 or 1 -->
      <reloadEveryNMinutes>...</reloadEveryNMinutes> <!-- 0 or 1 -->
      <defaultDataQuery>...</defaultDataQuery> <!-- 0 or 1 -->
      <defaultGraphQuery>...</defaultGraphQuery> <!-- 0 or 1 -->
      <fgdcFile>...</fgdcFile> <!-- 0 or 1 -->
      <iso19115File>...</iso19115File> <!-- 0 or 1 -->
      <onChange>...</onChange> <!-- 0 or more -->
      <extractDestinationNames>...</extractDestinationNames>  <!-- 1 -->
      <orderExtractBy>...</orderExtractBy> <!-- 0 or 1 -->
      <fileTableInMemory>...</fileTableInMemory> <!-- 0 or 1 (true or false (the default)) -->
      <dataset>...</dataset> <!-- 1 -->
    </dataset>
    
     

Details

Here are detailed descriptions of common tags and attributes.
  • <convertToPublicSourceUrl> is an OPTIONAL tag within an <erddapDatasets> tag which contains a "from" and a "to" attribute which specify how to convert a matching local sourceUrl (usually an IP number) into a public sourceUrl (a domain name). "from" must have the form "[something]//[something]/". There can be 0 or more of these tags. For more information see <sourceUrl>. For example,
    <convertToPublicSourceUrl from="http://192.168.31.18/" to="http://oceanwatch.pfeg.noaa.gov/" />
    will convert a matching local sourceUrl (such as http://192.168.31.18/thredds/dodsC/satellite/BA/ssta/5day)
    into the public sourceUrl (http://oceanwatch.pfeg.noaa.gov/thredds/dodsC/satellite/BA/ssta/5day).

    But, for security reasons and reasons related to the subscription system, DON'T USE THIS TAG!
    Instead, always use the public domain name in the <sourceUrl> tag and use the /etc/hosts table on your server to convert local domain names to IP numbers without using a DNS server. You can test if a domain name is properly converted into an IP number by using:
    ping some.domain.name
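For example, an /etc/hosts entry like the following (the IP number and domain name are just illustrations) maps the public domain name directly to the local IP number, so <sourceUrl> can use the public name while requests stay on the local network:

```
# /etc/hosts -- map the public domain name to the local IP number
# (example values; use your own server's name and address)
192.168.31.18   oceanwatch.pfeg.noaa.gov
```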

  • <requestBlacklist> is an OPTIONAL tag within an <erddapDatasets> tag which contains a comma-separated list of numeric IP addresses which will be blacklisted.
    • This can be used to fend off a Denial of Service attack, an overly zealous web robot, or an overeager user running multiple scripts at one time.
    • Frequent Crashes - If ERDDAP freezes/stops twice or more in one day, you probably have a troublesome user running several scripts at once and/or someone making a large number of invalid requests. If this happens, you should probably blacklist that user.
    • Any request from a blacklisted address will receive an HTTP Error 403: Forbidden. The accompanying text error message encourages the user to email you to work out the problems. Then, you can encourage them to run just one script at a time and to fix the problems in their script (for example, requesting data from a remote dataset that can't respond before timing out).
    • To block a user, add their numeric IP address to the comma-separated list of IP addresses in <requestBlacklist> in your datasets.xml file. To find a troublesome user's IP address, look in the ERDDAP bigParentDirectory/logs/log.txt file (bigParentDirectory is specified in setup.xml). The IP address for every request is listed on the lines starting with "{{{{#" and is 4 numbers separated by periods, for example, 123.45.67.8 . Searching for "ERROR" will help you find problems such as invalid requests.
    • You can also replace the last number in an IP address with * (for example, 123.45.67.*) to block a range of IP addresses (0-255).
    • For example,
      <requestBlacklist>98.76.54.321, 123.45.68.*</requestBlacklist>
    • You don't need to restart ERDDAP for the changes to <requestBlacklist> to take effect. The changes will be detected the next time ERDDAP checks if any datasets need to be reloaded. Or, you can speed up the process by visiting a setDatasetFlag URL for any dataset.
    • Your ERDDAP daily report includes a list/tally of the most active allowed and blocked requesters.
    • If you want to figure out what domain/institution is related to a numeric IP address, you can use a free, reverse DNS web service like http://network-tools.com/ .
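To get a quick tally of the most active client IP addresses, you can filter the log for request lines and count per IP. This is a sketch: the log path and the sample lines below are made up for illustration; check your own bigParentDirectory/logs/log.txt for the exact line layout.

```shell
# Create a stand-in for bigParentDirectory/logs/log.txt
# (sample lines are hypothetical; real log lines start with "{{{{#").
LOG=/tmp/sample_log.txt
printf '%s\n' \
  '{{{{#1 123.45.67.8 GET /erddap/tabledap/...' \
  '{{{{#2 123.45.67.8 GET /erddap/tabledap/...' \
  '{{{{#3 98.76.54.32 GET /erddap/griddap/...' > "$LOG"

# Keep only request lines, extract the IP field, and count requests per IP,
# most active first.
grep '^{{{{#' "$LOG" | awk '{print $2}' | sort | uniq -c | sort -rn
```

The most active IPs (the top of the output) are the candidates for <requestBlacklist>.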
       
  • <subscriptionEmailBlacklist> is an OPTIONAL tag within an <erddapDatasets> tag which contains a comma-separated list of email addresses which are immediately blacklisted from the subscription system, for example
    <subscriptionEmailBlacklist>bob@badguy.com, john@badguy.com</subscriptionEmailBlacklist>
    If an email address on the list has subscriptions, the subscriptions will be cancelled. If an email address on the list tries to subscribe, the request will be refused.
     
  • <user> is an OPTIONAL tag within an <erddapDatasets> tag that identifies a user's username, password (if authentication=custom), and roles (a comma-separated list). The use of username and password varies slightly based on the value of <authentication> in your ERDDAP's setup.xml file.
    • This is part of ERDDAP's security system for restricting access to some datasets to some users.
    • Make a separate <user> tag for each user.
    • If there is no <user> tag for a client, s/he will only be able to access public datasets, i.e., datasets which don't have an <accessibleTo> tag.
    • username
      For authentication=custom, the username is usually a combination of letters, digits, underscores, and periods.
      For authentication=email, the username is the user's email address. It may be any email address.
      For authentication=google, the username is the user's Google email address. This includes Google-managed accounts like @noaa.gov accounts.
    • password
      For authentication=email and google, don't specify a password attribute.
      For authentication=custom, you must specify a password attribute for each user.
      • The passwords that users enter are case sensitive.
      • setup.xml's <passwordEncoding> determines how passwords are stored in the <user> tags in datasets.xml. In order of increasing security, the options are:
        • MD5 (not recommended) - for the password attribute, specify the MD5 hash digest of the user's password.
        • UEPMD5 (not recommended) - for the password attribute, specify the MD5 hash digest of username:ERDDAP:password . The username and "ERDDAP" are used to salt the hash value, making it more difficult to decode.
        • SHA256 (not recommended) - for the password attribute, specify the SHA-256 hash digest of the user's password.
        • UEPSHA256 (default, recommended) - for the password attribute, specify the SHA-256 hash digest of username:ERDDAP:password . The username and "ERDDAP" are used to salt the hash value, making it more difficult to decode.
      • On Windows, you can generate MD5 password digest values by downloading an MD5 program and using (for example): md5 -djsmith:ERDDAP:actualPassword
      • On Linux/Unix, you can generate MD5 digest values by using the built-in md5sum program (for example):
        echo -n "jsmith:ERDDAP:actualPassword" | md5sum
      • Stored plaintext passwords are case sensitive. The stored forms of MD5 and UEPMD5 passwords are not case sensitive.
      • For example (using UEPMD5), if username="jsmith" and password="myPassword", the <user> tag is:
        <user username="jsmith"
        password="57AB7ACCEB545E0BEB46C4C75CEC3C30"
        roles="JASmith, JASmithGroup" />

        where the stored password was generated with
        md5 -djsmith:ERDDAP:myPassword
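For the recommended UEPSHA256 encoding, you can generate the digest with a few lines of Python's standard hashlib (a sketch; the username and password reuse the example values above):

```python
import hashlib

def uepsha256_digest(username, password):
    """Return the SHA-256 hex digest of username:ERDDAP:password --
    the stored form of a password when <passwordEncoding> is UEPSHA256."""
    plain = f"{username}:ERDDAP:{password}"
    return hashlib.sha256(plain.encode("utf-8")).hexdigest()

# Put the resulting 64-hex-character string in the <user> tag's
# password attribute.
print(uepsha256_digest("jsmith", "myPassword"))
```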

       
  • <dataset> is an OPTIONAL tag within an <erddapDatasets> tag that (if you include all of the information between <dataset> and </dataset>) completely describes one dataset. For example,
    <dataset type="EDDGridFromDap" datasetID="erdPHssta8day" active="true"> ... </dataset>
    There MAY be any number of dataset tags in your datasets.xml file.
    Three attributes MAY appear within a <dataset> tag.
     
    • type="aType" is a REQUIRED attribute within a <dataset> tag which identifies the dataset type (for example, whether it is an EDDGrid/gridded or EDDTable/tabular dataset) and the source of the data (for example, a database, files, or a remote OPeNDAP server). See the List of Dataset Types.
       
    • datasetID="aDatasetID" is a REQUIRED attribute within a <dataset> tag which assigns a short (usually <15 characters), unique, identifying name to a dataset.
      • Valid characters are A-Z, a-z, 0-9, _, and -, but we strongly recommend starting with a letter and then just using A-Z, a-z, 0-9, and _.
      • datasetIDs are case sensitive, but DON'T create two datasetIDs that differ only in upper/lower case letters. It will cause problems on Windows computers (yours and/or a user's computer).
      • Best practices: We recommend using camelCase.
      • Best practices: We recommend that the first part be an acronym or abbreviation of the source institution's name and the second part be an acronym or abbreviation of the dataset's name. When possible, we create a name which reflects the source's name for the dataset. For example, we used datasetID="erdPHssta8day" for a dataset from the NOAA NMFS SWFSC Environmental Research Division (ERD) which is designated by the source to be satellite/PH/ssta/8day.
      • If you want to change a dataset's name, you need to do something to kill off (err, retire) the dataset with the existing name. The two solutions are:
        • Shutdown Tomcat/ERDDAP. Change the name. Restart ERDDAP.
        • Change the name of the dataset and make a dummy, active="false" dataset to kill off (err, retire) the old dataset:
          <dataset type="EDDTableFromNcFiles" datasetID="theOldName" active="false" />
          You can remove that tag after the old dataset is inactive.
           
    • active="boolean" is an OPTIONAL attribute within the <dataset> tag which indicates if a dataset is active (eligible for use in ERDDAP) or not.
      • Valid values are true (the default) and false.
      • Since the default is true, you don't need to use this attribute except to use active="false" to force a dataset's removal as soon as possible (if it is alive in ERDDAP) and to tell ERDDAP not to try to load it in the future.
         

    Several tags can appear between the <dataset> and </dataset> tags.
    There is some variation in which tags are allowed by which types of datasets. See the documentation for a specific type of dataset for details.

    • <accessibleTo> is an OPTIONAL tag within a <dataset> tag that specifies a space-separated list of roles which are allowed to have access to this dataset. For example,
      <accessibleTo>RASmith NEJones</accessibleTo>
      • This is part of ERDDAP's security system for restricting access to some datasets to some users.
      • If this tag is not present, all users (even if they haven't logged in) will have access to this dataset.
      • If this tag is present, this dataset will only be visible and accessible to logged-in users who have one of the specified roles. This dataset won't be visible to users who aren't logged in.
         
    • <graphsAccessibleTo> is an OPTIONAL tag within a <dataset> tag which determines whether graphics and metadata for the dataset are available to the public. It offers a way to partially override the dataset's <accessibleTo> setting. The allowed values are:
      • auto - This value (or the absence of a <graphsAccessibleTo> tag for the dataset) makes access to graphs and metadata from the dataset mimic the dataset's <accessibleTo> setting.
        So if the dataset is private, its graphs and metadata will be private.
        And if the dataset is public, its graphs and metadata will be public.
      • public - This setting makes the dataset's graphs and metadata accessible to anyone, even users who aren't logged in, even if the dataset is otherwise private because it has an <accessibleTo> tag.
         
    • <accessibleViaFiles> is an OPTIONAL tag within a <dataset> tag for EDDGridFromFiles, EDDTableFromFiles, EDDGridCopy, and EDDTableCopy datasets. It can have a value of true or false (the default). For example,
      <accessibleViaFiles>true</accessibleViaFiles>
      If the value is true, ERDDAP will make it so that users can browse and download the dataset's source data files via ERDDAP's "files" system. See the "files" system's documentation for more information.
       
    • <accessibleViaWMS> is an OPTIONAL tag within a <dataset> tag for all EDDGrid subclasses. It can have a value of true (the default) or false. For example,
      <accessibleViaWMS>true</accessibleViaWMS>
      If the value is false, ERDDAP's WMS server won't be available for this dataset. This is commonly used for datasets that have some longitude values greater than 180 (which technically is invalid for WMS services), and for which you are also offering a variant of the dataset with longitude values entirely in the range -180 to 180 via EDDGridLonPM180.
      If the value is true, ERDDAP will try to make the dataset available via ERDDAP's WMS server. But if the dataset is completely unsuitable for WMS (e.g., there is no longitude or latitude data), then the dataset won't be available via ERDDAP's WMS server, regardless of this setting.
       
    • <altitudeMetersPerSourceUnit> is an OPTIONAL tag within the <dataset> tag for EDDTableFromSOS datasets (only!) that specifies a number which is multiplied by the source altitude or depth values to convert them into altitude values (in meters above sea level). For example,
      <altitudeMetersPerSourceUnit>-1</altitudeMetersPerSourceUnit>
      This tag MUST be used if the dataset's vertical axis values aren't meters, positive=up. Otherwise, it is OPTIONAL, since the default value is 1. For example,
      • If the source is already measured in meters above sea level, use 1 (or don't use this tag, since 1 is the default value).
      • If the source is measured in meters below sea level, use -1.
        <altitudeMetersPerSourceUnit>-1</altitudeMetersPerSourceUnit>
      • If the source is measured in km above sea level, use 0.001.
         
    • <defaultDataQuery> is an OPTIONAL tag within a <dataset> tag that tells ERDDAP to use the specified query (the part of the URL after the "?") if the .html fileType (the Data Access Form) is requested with no query.
      • You will probably rarely need to use this.
      • You need to XML-encode or percent-encode (either one, but not both) the default queries since they are in an XML document. For example, & becomes &amp; , < becomes &lt; , > becomes &gt; .
      • Please check your work. It is easy to make a mistake and not get what you want. ERDDAP will try to clean up your errors -- but don't rely on that, since *how* it is cleaned up may change.
      • For griddap datasets, a common use of this is to specify a different default depth or altitude dimension value (for example, [0] instead of [last]).
        In any case, you should always list all of the variables, always use the same dimension values for all variables, and almost always use [0], [last], or [0:last] for the dimension values.
        For example:
        <defaultDataQuery>u[last][0][0:last][0:last],v[last][0][0:last][0:last]</defaultDataQuery>
      • For tabledap datasets, the most common use of this is to specify a different default time range (relative to now, for example, &time>=now-1day).
        Remember that requesting no data variables is the same as specifying all data variables, so usually you can just specify the new time constraint.
        For example:
        <defaultDataQuery>&amp;time&gt;=now-1day</defaultDataQuery>
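As a sketch, Python's standard library can do the XML-encoding for you (the query is the tabledap example above):

```python
from xml.sax.saxutils import escape

# XML-encode a default query so it is safe inside datasets.xml:
# & becomes &amp; , < becomes &lt; , > becomes &gt;
query = "&time>=now-1day"
print(escape(query))  # prints: &amp;time&gt;=now-1day
```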
         
    • <defaultGraphQuery> is an OPTIONAL tag within a <dataset> tag that tells ERDDAP to use the specified query (the part of the URL after the "?") if the .graph fileType (the Make A Graph Form) is requested with no query.
      • You will probably rarely need to use this.
      • You need to XML-encode or percent-encode (either one, but not both) the default queries since they are in an XML document. For example, & becomes &amp; , < becomes &lt; , > becomes &gt; .
      • Please check your work. It is easy to make a mistake and not get what you want. ERDDAP will try to clean up your errors -- but don't rely on that, since *how* it is cleaned up may change.
      • For griddap datasets, the most common use of this is to specify a different default depth or altitude dimension value (for example, [0] instead of [last]) and/or to specify that a specific variable be graphed.
        In any case, you will almost always use [0], [last], or [0:last] for the dimension values.
        For example:
        <defaultGraphQuery>temp[last][0][0:last][0:last]&amp;.draw=surface&amp;.vars=longitude|latitude|temp</defaultGraphQuery>
      • For tabledap datasets, the most common uses of this are to specify different variables to be graphed, a different default time range (relative to now, for example, &time>=now-1day) and/or different default graphics settings (for example, marker type).
        For example:
        <defaultGraphQuery>longitude,latitude,seaTemperature&amp;time&gt;=now-1day&amp;.marker=1|5</defaultGraphQuery>
         
    • <fileTableInMemory> (true or false (the default)) is an OPTIONAL tag that tells ERDDAP where to keep the fileTable (which has information about each source data file):
      • true = in memory (which is faster but uses more memory)
      • false = on disk (which is slower but uses very little memory)
      For example,
      <fileTableInMemory>true</fileTableInMemory>
      If you set this to true for any dataset, keep an eye on the Memory: currently using line at [yourDomain]/erddap/status.html to ensure that ERDDAP still has plenty of free memory.
       
    • <fgdcFile> is an OPTIONAL tag within a <dataset> tag that tells ERDDAP to use a pre-made FGDC file instead of having ERDDAP try to generate the file. Usage:
      <fgdcFile>fullFileName</fgdcFile>
      fullFileName can refer to a local file (somewhere on the server's file system) or the URL of a remote file.
      If fullFileName="" or the file isn't found, the dataset will have no FGDC metadata. So this is also useful if you want to suppress the FGDC metadata for a specific dataset.
      Or, you can put <fgdcActive>false</fgdcActive> in setup.xml to tell ERDDAP not to offer FGDC metadata for any dataset.
       
    • <iso19115File> is an OPTIONAL tag within a <dataset> tag that tells ERDDAP to use a pre-made ISO 19115 file instead of having ERDDAP try to generate the file. Usage:
      <iso19115File>fullFileName</iso19115File>
      fullFileName can refer to a local file (somewhere on the server's file system) or the URL of a remote file.
      If fullFileName="" or the file isn't found, the dataset will have no ISO 19115 metadata. So this is also useful if you want to suppress the ISO 19115 metadata for a specific dataset.
      Or, you can put <iso19115Active>false</iso19115Active> in setup.xml to tell ERDDAP not to offer ISO 19115 metadata for any dataset.
       
    • <onChange> is an OPTIONAL tag within a <dataset> tag that specifies an action which will be done when this dataset is created (when ERDDAP is restarted) and whenever this dataset changes in any way.
      • Currently, for EDDGrid subclasses, any change to metadata or to an axis variable (for example, a new time point for near-real-time data) is considered a change, but a reloading of the dataset is not considered a change (by itself).
      • Currently, for EDDTable subclasses, any reloading of the dataset is considered a change.
      • Currently, only two types of actions are allowed:
        • http:// - If the action starts with "http://", ERDDAP will send an HTTP GET request to the specified URL. The response will be ignored. For example, the URL might tell some other web service to do something.
          • If the URL has a query part (after the "?"), it MUST already be percent encoded. You need to encode special characters in the right-hand-side values of any constraints into the form %HH, where HH is the 2-digit hexadecimal value of the character. Usually, you just need to convert a few of the punctuation characters: % into %25, & into %26, " into %22, = into %3D, + into %2B, | into %7C, and space into %20, and convert all characters above #127 into their UTF-8 form and then percent encode each byte of the UTF-8 form into the %HH format (ask a programmer for help). But in some situations, you need to percent encode all characters other than A-Za-z0-9_-!.~'()* .
          • Since datasets.xml is an XML file, you then need to encode '&', '<', and '>' in the URL as '&amp;', '&lt;', and '&gt;'.
          • Example: For a URL that you might type into a browser as: http://www.company.com/webService?department=R%26D&param2=value2 You should specify an <onChange> tag via (on one line)
            <onChange>http://www.company.com/webService?department=R%26D&amp;param2=value2</onChange>
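The two encoding steps above (percent-encode the constraint value, then XML-encode the whole URL for datasets.xml) can be sketched in Python:

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Step 1: percent-encode the right-hand-side value of the constraint.
# safe="" percent-encodes everything except A-Za-z0-9_.-~ .
value = quote("R&D", safe="")          # -> "R%26D"
url = f"http://www.company.com/webService?department={value}&param2=value2"

# Step 2: XML-encode the URL so it can go inside the <onChange> tag.
print(f"<onChange>{escape(url)}</onChange>")
```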
        • mailto: - If the action starts with "mailto:", ERDDAP will send an email to the subsequent email address indicating that the dataset has been updated/changed.
          For example: <onChange>mailto:john.smith@company.com</onChange>
        If you have a good reason for ERDDAP to support some other type of action, send us an email describing what you want.
      • This tag is OPTIONAL. There can be as many of these tags as you want. Use one of these tags for each action to be performed.
      • This is analogous to ERDDAP's email/URL subscription system, but these actions aren't stored persistently (i.e., they are only stored in an EDD object).
      • To remove a subscription, just remove the <onChange> tag. The change will be noted the next time the dataset is reloaded.
         
    • <reloadEveryNMinutes> is an OPTIONAL tag within a <dataset> tag of almost all dataset types that specifies how often the dataset should be reloaded. For example,
      <reloadEveryNMinutes>60</reloadEveryNMinutes>
      • Generally, datasets that change frequently (for example, get new data files) should be reloaded frequently, for example, every 60 minutes.
      • Datasets that change infrequently should be reloaded infrequently, for example, every 1440 minutes (daily) or 10080 minutes (weekly).
      • This tag is OPTIONAL, but recommended. The default is 10080.
      • An example is: <reloadEveryNMinutes>1440</reloadEveryNMinutes>
      • Note that when a dataset is reloaded, all files in the bigParentDirectory/cache/datasetID directory are deleted.
      • No matter what this is set to, a dataset won't be loaded more frequently than <loadDatasetsMinMinutes> (default = 15), as specified in setup.xml. So if you want datasets to be reloaded very frequently, you need to set both reloadEveryNMinutes and loadDatasetsMinMinutes to small values.
      • Don't set reloadEveryNMinutes to the same value as loadDatasetsMinMinutes, because the elapsed time is likely to be (for example) 14:58 or 15:02, so the dataset will only be reloaded in about half of the major reloads. Instead, use a smaller (for example, 10) or larger (for example, 20) reloadEveryNMinutes value.
      • Regardless of reloadEveryNMinutes, you can manually tell ERDDAP to reload a specific dataset as soon as possible via a flag file.
      • Proactive vs Reactive - Note that ERDDAP's reload system is proactive -- datasets are reloaded soon after their reloadEveryNMinutes time is up (i.e., they become "stale", but never very stale), whether the dataset is getting requests from users or not. So ERDDAP datasets are always up-to-date and ready for use. This is in contrast to THREDDS' reactive approach: a user's request is what tells THREDDS to check if a dataset is stale (it may be very stale). If it is stale, THREDDS makes the user wait (often for a few minutes) while the dataset is reloaded.
      • For Curious Programmers - In ERDDAP, the reloading of all datasets is handled by two single-purpose threads. One thread initiates a minor reload (if it finds a flag file) or a major reload (which checks all datasets to see if they need to be reloaded). The other thread does the actual reload of the datasets, one at a time. These threads work in the background, ensuring that all datasets are kept up-to-date. The thread which actually does the reloads prepares a new version of a dataset, then swaps it into place (essentially replacing the old version atomically). So it is very possible that the following sequence of events occurs (it's a good thing):
        1. ERDDAP starts reloading a dataset (making a new version) in the background.
        2. User 'A' makes a request to the dataset. ERDDAP uses the current version of the dataset to create the response. (That is good. There was no delay for the user, and the current version of the dataset should never be very stale.)
        3. ERDDAP finishes creating the new reloaded version of the dataset and swaps that new version into production. All subsequent new requests are handled by the new version of the dataset. For consistency, user A's request is still being filled by the original version.
        4. User 'B' makes a request to the dataset and ERDDAP uses the new version of the dataset to create the response.
        5. Eventually user A's and user B's requests are completed (perhaps A's finishes first, perhaps B's finishes first).

        I can hear someone saying, "Just two threads! Ha! That's lame! He should set that up so that reloading of datasets uses as many threads as are needed, so it all gets done faster and with little or no lag." Yes and no. The problem is that loading more than one dataset at a time creates several hard new problems. They all need to be solved or dealt with. The current system works well and has manageable problems (for example, potential for lag before a flag is noticed). (If you need help managing them, email bob dot simons at noaa dot gov .) Note that the related updateEveryNMillis system works within response threads, so it can and does lead to multiple datasets being updated (not the full reload) simultaneously.
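As mentioned above, you can tell ERDDAP to reload a dataset as soon as possible via a flag file: an empty file in bigParentDirectory/flag/ whose name is the datasetID. A sketch (the directory and datasetID below are examples; use your actual bigParentDirectory from setup.xml and your own datasetID):

```shell
# Create an empty flag file named after the datasetID; ERDDAP notices
# it at the next check and reloads that dataset as soon as possible.
BIG_PARENT_DIR=/tmp/erddapData          # example: your bigParentDirectory
mkdir -p "$BIG_PARENT_DIR/flag"
touch "$BIG_PARENT_DIR/flag/erdPHssta8day"
```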

    • <updateEveryNMillis> is an OPTIONAL tag within a <dataset> tag of some dataset types that helps ERDDAP work with datasets that change very frequently (as often as roughly every second). Unlike ERDDAP's regular, proactive, <reloadEveryNMinutes> system for completely reloading each dataset, this OPTIONAL additional system is reactive (triggered by a user request) and quicker because it is incremental (just updating the information that needs to be updated). For example, if a request to an EDDGridFromDap dataset occurs more than the specified number of milliseconds since the last update, ERDDAP will see if there are any new values for the leftmost (usually "time") dimension and, if so, just download those new values before handling the user's request. This system is very good at keeping a rapidly changing dataset up-to-date with minimal demands on the data source, but at the cost of slightly slowing down the processing of some user requests.
      • To use this system, add (for example):
        <updateEveryNMillis>1000</updateEveryNMillis>
        right after the <reloadEveryNMinutes> tag for the dataset in datasets.xml. The number of milliseconds that you specify can be as small as 1 (to ensure that the dataset is always up-to-date). A value of 0 (the default) turns off the system.
      • Due to their incremental nature, updates should finish very quickly, so users should never have to wait a long time.
      • If a second data request arrives before the previous update has finished, the second request won't trigger another update.
      • Throughout the documentation, we will try to use the word "reload" for regular, full dataset reloads, and "update" for these new incremental, partial updates.
      • For testing purposes, some diagnostics are printed to log.txt if <logLevel> is set to "all" in setup.xml.
      • If you use incremental updates and especially if the leftmost, for example, time, axis is large, you may want to set <reloadEveryNMinutes> to a larger number (1440?), so that updates do most of the work to keep the dataset up-to-date, and full reloads are done infrequently.
      • Note: this new update system updates metadata (for example, time actual_range, time_coverage_end, ...) but doesn't trigger onChange (email or touch URL) or change the RSS feed (perhaps it should...).
      • For all datasets that use subclasses of EDDGridFromFiles and EDDTableFromFiles:
        • WARNING: when you add a new data file to a dataset by copying it into the directory that ERDDAP looks at, there is a danger that ERDDAP will notice the partially written file, try to read it, fail because the file is incomplete, declare the file to be a "bad" file, and remove it (temporarily) from the dataset.
          To avoid this, we STRONGLY RECOMMEND that you copy a new file into the directory with a temporary name (for example, 20150226.ncTmp) that doesn't match the dataset's fileNameRegex (*\.nc), then rename the file to the correct name (for example, 20150226.nc). If you use this approach, ERDDAP will ignore the temporary file and only notice the correctly named file when it is complete and ready to be used.
        • If you modify existing data files in place (for example, to add a new data point), <updateEveryNMillis> will work well if the changes appear atomically (in an instant) and the file is always a valid file. For example, the netcdf-java library allows additions to the unlimited dimension of a "classic" .nc v3 file to be made atomically.
          <updateEveryNMillis> will work badly if the file is invalid while the changes are being made.
        • <updateEveryNMillis> will work well for datasets where one or a few files change in a short amount of time.
        • <updateEveryNMillis> will work poorly for datasets where a large number of files change in a short amount of time (unless the changes appear atomically). For these datasets, it is better to not use <updateEveryNMillis> and to set a flag to tell ERDDAP to reload the dataset.
        • <updateEveryNMillis> does not update the information associated with the <subsetVariables>. Normally, this is not a problem, because the subsetVariables have information about things that don't change very often (for example, the list of station names, latitudes, and longitudes). If the subsetVariables data changes (for example, when a new station is added to the dataset), then contact the flag URL for the dataset to tell ERDDAP to reload the dataset. Otherwise, ERDDAP won't notice the new subsetVariable information until the next time the dataset is reloaded (<reloadEveryNMinutes>).
        • Our generic recommendation is to use:
          <reloadEveryNMinutes>1440</reloadEveryNMinutes>
          <updateEveryNMillis>10000</updateEveryNMillis>
      • TROUBLE? On Linux computers, if you are using <updateEveryNMillis> with EDDGridFromFiles or EDDTableFromFiles classes, you may see a problem where a dataset fails to load (occasionally or consistently) with the error message: "IOException: User limit of inotify instances reached or too many open files". The cause may be a bug in Java which causes inotify instances not to be garbage collected. This problem is avoided in ERDDAP v1.66 and higher, so the best solution is to switch to the latest version of ERDDAP.
          If that doesn't solve the problem (that is, if you have a really large number of datasets using <updateEveryNMillis>), you can fix this problem by calling (as root):
          echo 20000 > /proc/sys/fs/inotify/max_user_watches
          echo 500 > /proc/sys/fs/inotify/max_user_instances

          Or, use higher numbers if the problem persists. The default for watches is 8192. The default for instances is 128.
      • For Curious Programmers - these incremental updates, unlike ERDDAP's full reloadEveryNMinutes system, occur within user request threads. So, any number of datasets can be updating simultaneously. There is code (and a lock) to ensure that only one thread is working on an update for any given dataset at any given moment. Allowing multiple simultaneous updates was easy; allowing multiple simultaneous full reloads would be harder.
         
    • <sourceCanConstrainStringEQNE> is an OPTIONAL tag within an EDDTable <dataset> tag that specifies if the source can constrain String variables with the = and != operators.
      • For EDDTableFromDapSequence, this applies to the outer sequence String variables only. It is assumed that the source can't handle any constraints on inner sequence variables.
      • This tag is OPTIONAL. Valid values are true (the default) and false.
      • For EDDTableFromDapSequence OPeNDAP DRDS servers, this should be set to true (the default).
      • For EDDTableFromDapSequence Dapper servers, this should be set to false.
      • An example is:
        <sourceCanConstrainStringEQNE>true</sourceCanConstrainStringEQNE>
         
    • <sourceCanConstrainStringGTLT> is an OPTIONAL tag within an EDDTable <dataset> tag that specifies if the source can constrain String variables with the <, <=, >, and >= operators.
      • For EDDTableFromDapSequence, this applies to the outer sequence String variables only. It is assumed that the source can't handle any constraints on inner sequence variables.
      • This tag is OPTIONAL. Valid values are true (the default) and false.
      • For EDDTableFromDapSequence OPeNDAP DRDS servers, this should be set to true (the default).
      • For EDDTableFromDapSequence Dapper servers, this should be set to false.
      • An example is:
        <sourceCanConstrainStringGTLT>true</sourceCanConstrainStringGTLT>
         
    • <sourceCanConstrainStringRegex> is an OPTIONAL tag within an EDDTable <dataset> tag that specifies if the source can constrain String variables by regular expressions, and if so, what the operator is.
      • Valid values are "=~" (the DAP standard), "~=" (mistakenly supported by many DAP servers), or "" (indicating that the source doesn't support regular expressions).
      • This tag is OPTIONAL. The default is "".
      • For EDDTableFromDapSequence OPeNDAP DRDS servers, this should be set to "" (the default).
      • For EDDTableFromDapSequence Dapper servers, this should be set to "" (the default).
      • An example is:
        <sourceCanConstrainStringRegex>=~</sourceCanConstrainStringRegex>
         
    • <sourceCanDoDistinct> is an OPTIONAL tag within an EDDTableFromDatabase <dataset> tag that specifies if the source database should handle &distinct() constraints in user queries.
      • This tag is OPTIONAL. Valid values are no (ERDDAP handles distinct; the default), partial (the source handles distinct and ERDDAP handles it again), and yes (the source handles distinct).
      • If you are using no and ERDDAP is running out of memory when handling distinct, use yes.
      • If you are using yes and the source database handles distinct too slowly, use no.
      • partial gives you the worst of both worlds: it is slow (because the database's handling of distinct is slow) and ERDDAP may still run out of memory.
      • Note that databases interpret DISTINCT as a request for just unique rows of results, whereas ERDDAP interprets it as a request for a sorted list of unique rows of results. If you set this to partial or yes, ERDDAP automatically also tells the database to sort the results.
      • One small difference in the results:
        With no|partial, ERDDAP will sort "" at start of results (before non-"" strings).
        With yes, the database may (Postgres will) sort "" at end of results (after non-"" strings).
        This presumably also affects the sorting of short words vs. longer words that start with the short word. For example, ERDDAP will sort "Simon" before "Simons".
      • An example is:
        <sourceCanDoDistinct>yes</sourceCanDoDistinct>
         
    • <sourceCanOrderBy> is an OPTIONAL tag within an EDDTableFromDatabase <dataset> tag that specifies if the source database should handle &orderBy (and variants) constraints in user queries.
      • This tag is OPTIONAL. Valid values are no (ERDDAP handles orderBy; the default), partial (the source handles orderBy and ERDDAP handles it again), and yes (the source handles orderBy).
      • If you are using no and ERDDAP is running out of memory when handling orderBy, use yes.
      • If you are using yes and the source database handles orderBy too slowly, use no.
      • partial gives you the worst of both worlds: it is slow (because the database's handling of orderBy is slow) and ERDDAP may still run out of memory.
      • One small difference in the results:
        With no|partial, ERDDAP will sort "" at start of results (before non-"" strings).
        With yes, the database may (Postgres will) sort "" at end of results (after non-"" strings).
        This presumably also affects the sorting of short words vs. longer words that start with the short word. For example, ERDDAP will sort "Simon" before "Simons".
      • An example is:
        <sourceCanOrderBy>yes</sourceCanOrderBy>
         
    • <sourceNeedsExpandedFP_EQ> is an OPTIONAL tag within an EDDTable <dataset> tag that specifies whether the source needs help with queries with <numericVariable>=<floatingPointValue> (and !=, >=, <=). Valid values are true (the default) and false. For example,
      <sourceNeedsExpandedFP_EQ>false</sourceNeedsExpandedFP_EQ>
      • For some data sources, numeric queries involving =, !=, <=, or >= may not work as desired with floating point numbers. For example, a search for longitude=220.2 may fail if the value is stored as 220.20000000000001.
      • This problem arises because floating point numbers are not represented exactly within computers (external link).
      • If sourceNeedsExpandedFP_EQ is set to true (the default), ERDDAP modifies the queries sent to the data source to avoid this problem. It is always safe and fine to leave this set to true.
         
    • <sourceUrl> is a common tag within a <dataset> tag that specifies the url source of the data.
      • An example is:
        <sourceUrl>http://oceanwatch.pfeg.noaa.gov/thredds/dodsC/satellite/VH/chla/1day</sourceUrl>
      • In ERDDAP, all datasets will have a "sourceUrl" in the combined global attributes which are shown to the users.
      • For most dataset types, this tag is REQUIRED. See the dataset type's description to find out if this is REQUIRED or not.
      • For some datasets, the separate <sourceUrl> tag is not allowed. Instead, you must provide a "sourceUrl" global attribute, usually in the global <addAttributes>. If there is no actual source URL (for example, if the data is stored in local files), this attribute often just has a placeholder value, for example, <att name="sourceUrl">(local files)</att> .
      • For most datasets, this is the base of the url that is used to request data. For example, for DAP servers, this is the url to which .dods, .das, .dds, or .html could be added.
      • If the URL has a query part (after the "?"), it MUST be already percent encoded (external link). You just need to encode special characters in the right-hand-side values of any constraints into the form %HH, where HH is the 2 digit hexadecimal value of the character. Usually, you just need to convert a few of the punctuation characters: % into %25, & into %26, " into %22, = into %3D, + into %2B, | into %7C, space into %20, and convert all characters above #127 into their UTF-8 form and then percent encode each byte of the UTF-8 form into the %HH format (ask a programmer for help). But in some situations, you need to percent encode all characters other than A-Za-z0-9_-!.~'()* .
      • Since datasets.xml is an XML file, you MUST also encode '&', '<', and '>' in the URL as '&amp;', '&lt;', and '&gt;'.
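        For example (an entirely hypothetical server, dataset, and constraint), a source URL that constrains a String variable species to the value "big fish" would have the quotes percent encoded as %22, the space as %20, and the & that separates constraints XML-encoded as &amp;:

```xml
<sourceUrl>http://www.example.com/dap/fish?species=%22big%20fish%22&amp;year=2014</sourceUrl>
```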
      • For most dataset types, ERDDAP adds the original sourceUrl (the "localSourceUrl" in the source code) to the global attributes (where it becomes the "publicSourceUrl" in the source code). When the data source is local files, ERDDAP adds sourceUrl="(local files)" to the global attributes as a security precaution. When the data source is a database, ERDDAP adds sourceUrl="(source database)" to the global attributes as a security precaution. If some of your datasets use non-public sourceUrl's (usually because their computer is in your DMZ or on a local LAN) you can use <convertToPublicSourceUrl> tags to specify how to convert the local sourceUrls to public sourceUrls.
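        For example (with hypothetical addresses), if a dataset's source is only reachable on your LAN, a tag like the following (near the top of datasets.xml) is the general form for telling ERDDAP how to convert the local sourceUrl into the public sourceUrl shown to users:

```xml
<convertToPublicSourceUrl from="http://192.168.31.18/" to="http://oceanwatch.pfeg.noaa.gov/" />
```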
         
    • <addAttributes> is an OPTIONAL tag for each dataset and for each variable which lets ERDDAP administrators control the metadata attributes associated with a dataset and its variables.
      • ERDDAP combines the attributes from the dataset's source ("sourceAttributes") and the "addAttributes" which you define in datasets.xml (which have priority) to make the "combinedAttributes", which are what ERDDAP users see. Thus, you can use addAttributes to redefine the values of sourceAttributes, add new attributes, or remove attributes.
      • The <addAttributes> tag encloses 0 or more <att> subtags, which are used to specify individual attributes.
      • Each attribute consists of a name and a value (which has a specific data type, for example, double).
      • There can be only one attribute with a given name. If there are more, the last one has priority.
      • The value can be a single value or a space-separated list of values.
      • Syntax
        • The order of the <att> subtags within addAttributes is not important.
        • The <att> subtag format is
          <att name="name" [type="type"] >value</att>
        • The destination name of all attributes MUST start with a letter (A-Z, a-z) and MUST contain only the characters A-Z, a-z, 0-9, or '_'.
        • If an <att> subtag has no value or a value of null, that attribute will be removed from the combined attributes.
          For example, <att name="rows" /> will remove rows from the combined attributes.
          For example, <att name="coordinates">null</att> will remove coordinates from the combined attributes.
        • The OPTIONAL type value for <att> subtags indicates the data type for the values. The default type is string. An example of a string attribute is:
          <att name="creator_name">NASA/GSFC OBPG</att>
          • Valid types for single values are byte (8-bit signed integer), unsignedShort (16-bit unsigned integer), short (16-bit signed integer), int (32-bit signed integer), long (64-bit signed integer), float (32-bit floating point), double (64-bit floating point), and string. For example,
            <att name="scale_factor" type="float">0.1</att>
          • Valid types for space-separated lists of values (or single values) are byteList, unsignedShortList, shortList, intList, longList, floatList, and doubleList. For example,
            <att name="actual_range" type="doubleList">10.34 23.91</att>
            There is no stringList. Store the String values as a multi-line String. For example,
            <att name="history">2011-08-05T08:55:02Z ATAM - made CF-1.6 compliant.
            2012-04-08T08:34:58Z ATAM - Changed 'height' from double to float.</att>

             
    • Global Attributes / Global <addAttributes> -
      <addAttributes> is an OPTIONAL tag within the <dataset> tag which is used to change attributes that apply to the entire dataset.
      • Use the global <addAttributes> to change the dataset's global attributes. ERDDAP combines the global attributes from the dataset's source (sourceAttributes) and the global addAttributes which you define in datasets.xml (which have priority) to make the global combinedAttributes, which are what ERDDAP users see. Thus, you can use addAttributes to redefine the values of sourceAttributes, add new attributes, or remove attributes.
      • See the <addAttributes> information which applies to global and variable <addAttributes>.
      • FGDC (external link) and ISO 19115-2/19139 (external link) Metadata - Normally, ERDDAP will automatically generate ISO 19115-2/19139 and FGDC (FGDC-STD-001-1998) XML metadata files for each dataset using information from the dataset's metadata. So, good dataset metadata leads to good ERDDAP-generated ISO 19115 and FGDC metadata. Please consider putting lots of time and effort into improving your datasets' metadata (which is a good thing to do anyway). Most of the dataset metadata attributes which are used to generate the ISO 19115 and FGDC metadata are from the ACDD metadata standard (external link) and are so noted below.
      • Many global attributes are special in that ERDDAP looks for them and uses them in various ways. For example, a link to the infoUrl is included on web pages with lists of datasets, and other places, so that users can find out more about the dataset.
      • When a user selects a subset of data, globalAttributes related to the variable's longitude, latitude, altitude (or depth), and time ranges (for example, Southernmost_Northing, Northernmost_Northing, time_coverage_start, time_coverage_end) are automatically generated or updated.
      • A simple sample global <addAttributes> is:
        <addAttributes> 
          <att name="Conventions">COARDS, CF-1.6, ACDD-1.3</att>
          <att name="infoUrl">http://coastwatch.pfeg.noaa.gov/infog/PH_ssta_las.html</att>
          <att name="institution">NOAA CoastWatch, West Coast Node</att>
          <att name="title">SST, Pathfinder Ver 5.0, Day and Night, Global</att>
          <att name="cwhdf_version" />
        </addAttributes>  
        The empty cwhdf_version attribute causes the source cwhdf_version attribute (if any) to be removed from the final, combined list of attributes.
      • Supplying this information helps ERDDAP do a better job and helps users understand the datasets.
        Good metadata makes a dataset usable.
        Insufficient metadata makes a dataset useless.
        Please take the time to do a good job with metadata attributes.

      Comments about global attributes that are special in ERDDAP:
       

      • acknowledgment (from the ACDD (external link) metadata standard) is a RECOMMENDED way to acknowledge the group or groups that provided support (notably, financial) for the project that created this data. For example,
        <att name="acknowledgment">AVISO</att>
      • cdm_altitude_proxy is just for EDDTable datasets that don't have an altitude or depth variable but do have a variable that is a proxy for altitude or depth (for example, pressure, sigma, bottleNumber); you may use this attribute to identify that variable. For example,
        <att name="cdm_altitude_proxy">pressure</att>
        If the cdm_data_type is Profile or TrajectoryProfile and there is no altitude or depth variable, cdm_altitude_proxy MUST be defined. If cdm_altitude_proxy is defined, ERDDAP will add the following metadata to the variable: _CoordinateAxisType=Height and axis=Z.
      • cdm_data_type (from the ACDD (external link) metadata standard) is a global attribute that indicates the Unidata Common Data Model (external link) data type for the dataset. For example,
        <att name="cdm_data_type">Point</att>
        The CDM is still evolving and may change again. ERDDAP complies with the Discrete Sampling Geometries (external link) chapter of the CF 1.6 (external link) metadata conventions (previously called the CF Point Observation Conventions).
        • Either the dataset's global sourceAttributes or its global <addAttributes> MUST include the cdm_data_type attribute. A few dataset types (like EDDTableFromObis) will set this automatically.
        • For EDDGrid datasets, the cdm_data_type options are Grid (the default and by far the most common type for EDDGrid datasets), MovingGrid, Other, Point, Profile, RadialSweep, TimeSeries, TimeSeriesProfile, Swath, Trajectory, and TrajectoryProfile. Currently, EDDGrid does not require that any related metadata be specified, nor does it check that the data matches the cdm_data_type. That will probably change in the near future.
        • EDDTable uses cdm_data_type in a rigorous way. If a dataset doesn't comply with the cdm_data_type's requirements, the dataset will fail to load and will generate an error message. (That's a good thing, in the sense that the error message will tell you what is wrong so that you can fix it.)

          For all of these datasets, in the Conventions and Metadata_Conventions global attributes, please refer to CF-1.6 (not CF-1.0, 1.1, 1.2, 1.3, 1.4, or 1.5), since CF-1.6 is the first version to include the changes related to Discrete Sampling Geometry (DSG) conventions.

          For EDDTable datasets, the cdm_data_type options (and related requirements) are

          • Point - for a dataset with unrelated points.
            • As with all cdm_data_types other than Other, Point datasets MUST have longitude, latitude, and time variables.
          • Profile - for data from multiple depths at one or more longitude,latitude locations.
            • The dataset MUST include the globalAttribute cdm_profile_variables, where the value is a comma-separated list of the variables which have the information about each profile. Thus, for a given profile, the values of these variables will be constant.
            • One of the variables MUST have the variable attribute cf_role=profile_id to identify the variable that uniquely identifies the profiles. If no other variable is suitable, consider using the time variable.
          • TimeSeries - for data from a set of stations with fixed longitude,latitude(,altitude).
            • The dataset MUST include the globalAttribute cdm_timeseries_variables, where the value is a comma-separated list of the variables which have the information about each station. Thus, for a given station, the values of these variables will be constant.
            • One of the variables MUST have the variable attribute cf_role=timeseries_id to identify the variable that uniquely identifies the stations.
            • It is okay if the longitude and latitude vary slightly over time. If the longitude and latitude don't vary, include them in the cdm_timeseries_variables. If they do vary, don't include them in the cdm_timeseries_variables.
          • TimeSeriesProfile - for profiles from a set of stations.
            • The dataset MUST include the globalAttribute cdm_timeseries_variables, where the value is a comma-separated list of the variables which have the information about each station. Thus, for a given station, the values of these variables will be constant.
            • The dataset MUST include the globalAttribute cdm_profile_variables, where the value is a comma-separated list of the variables which have the information about each profile. Thus, for a given profile, the values of these variables will be constant.
            • One of the variables MUST have the variable attribute cf_role=timeseries_id to identify the variable that uniquely identifies the stations.
            • One of the variables MUST have the variable attribute cf_role=profile_id to identify the variable that uniquely identifies the profiles. (A given profile_id only has to be unique for a given timeseries_id.) If no other variable is suitable, consider using the time variable.
          • Trajectory - for data from a set of longitude,latitude(,altitude) paths called trajectories.
            • The dataset MUST include the globalAttribute cdm_trajectory_variables, where the value is a comma-separated list of the variables which have the information about each trajectory. Thus, for a given trajectory, the values of these variables will be constant.
            • One of the variables MUST have the attribute cf_role=trajectory_id to identify the variable that uniquely identifies the trajectories.
          • TrajectoryProfile - for profiles taken along trajectories.
            • The dataset MUST include the globalAttribute cdm_trajectory_variables, where the value is a comma-separated list of the variables which have the information about each trajectory. Thus, for a given trajectory, the values of these variables will be constant.
            • The dataset MUST include the globalAttribute cdm_profile_variables, where the value is a comma-separated list of the variables which have the information about each profile. Thus, for a given profile, the values of these variables will be constant.
            • One of the variables MUST have the variable attribute cf_role=trajectory_id to identify the variable that uniquely identifies the trajectories.
            • One of the variables MUST have the variable attribute cf_role=profile_id to identify the variable that uniquely identifies the profiles. (A given profile_id only has to be unique for a given trajectory_id.) If no other variable is suitable, consider using the time variable.
          • Other - has no requirements. Use it if the dataset doesn't fit one of the other options.
          Related notes:
          • All EDDTable datasets with a cdm_data_type other than "Other" MUST have longitude, latitude, and time variables.
          • Datasets with profiles MUST have an altitude variable, a depth variable, or a cdm_altitude_proxy variable.
          • If you can't make a dataset comply with all of the requirements for the ideal cdm_data_type, use "Point" (which has few requirements) or "Other" (which has no requirements) instead.
          • This information is used by ERDDAP in various ways, for example, when making .ncCF files (.nc files which comply with the Contiguous Ragged Array Representations associated with the dataset's cdm_data_type, as defined in the newly ratified Discrete Sampling Geometries (external link) chapter of the CF 1.6 (external link) metadata conventions, which were previously named "CF Point Observation Conventions").
          • Hint: Usually, a good starting point for subsetVariables is the combined values of the cdm_..._variables. For example, for TimeSeriesProfile, start with the cdm_timeseries_variables plus the cdm_profile_variables.
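      As a sketch (the variable names are hypothetical), the key pieces for a TimeSeries dataset are a global <addAttributes> naming the cdm_data_type and the per-station variables, plus a cf_role attribute on the station-identifier variable:

```xml
<addAttributes>
  <att name="cdm_data_type">TimeSeries</att>
  <att name="cdm_timeseries_variables">station_id, longitude, latitude</att>
  <att name="subsetVariables">station_id, longitude, latitude</att>
</addAttributes>
<dataVariable>
  <sourceName>station_id</sourceName>
  <addAttributes>
    <att name="cf_role">timeseries_id</att>
  </addAttributes>
</dataVariable>
```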
      • contributor_name (from the ACDD (external link) metadata standard) is the RECOMMENDED way to identify a person, organization, or project which contributed to this dataset (for example, the original creator of the data, before it was reprocessed by the creator of this dataset). For example,
        <att name="contributor_name">NOAA OceanWatch - Central Pacific</att>
        If "contributor" doesn't really apply to a dataset, omit this attribute. Compared to creator_name, this is sometimes more focused on the funding source.
      • contributor_role (from the ACDD (external link) metadata standard) is the RECOMMENDED way to identify the role of contributor_name. For example,
        <att name="contributor_role">Source of Level 2b data</att>
        If "contributor" doesn't really apply to a dataset, omit this attribute.
      • Conventions (from the CF (external link) metadata standard) is STRONGLY RECOMMENDED. (It may be REQUIRED in the future.) The value is a comma-separated list of metadata standards that this dataset follows. For example:
        <att name="Conventions">COARDS, CF-1.6, ACDD-1.3</att>
        The common metadata conventions used in ERDDAP are:
        • COARDS Conventions (external link) is the precursor to CF.
        • Climate and Forecast (CF) Conventions (external link) is the source of many of the recommended and required attributes in ERDDAP. The current version of CF is identified as "CF-1.6".
        • The NetCDF Attribute Convention for Dataset Discovery (ACDD) is the source of many of the recommended and required attributes in ERDDAP. The original 1.0 version of ACDD (a brilliant piece of work by Ethan Davis) was identified as Unidata Dataset Discovery v1.0 (external link). The current (starting in 2015) 1.3 version of ACDD is identified as ACDD-1.3 (external link). If your datasets have been using Unidata Dataset Discovery v1.0, we encourage you to switch your datasets to use ACDD-1.3.
        If your dataset follows some additional metadata standard, please add the name to the CSV list in the Conventions attribute.
      • coverage_content_type (from the ISO 19115 (external link) metadata standard) is the RECOMMENDED way to identify the type of gridded data (in EDDGrid datasets). For example,
        <att name="coverage_content_type">modelResult</att>
        The only allowed values are auxiliaryInformation, image, modelResult, physicalMeasurement (the default when ISO 19115 metadata is generated), qualityInformation, referenceInformation, and thematicClassification. (Don't use this tag for EDDTable datasets.)
      • creator_name (from the ACDD (external link) metadata standard) is the RECOMMENDED way to identify the person, organization, or project (if not a specific person or organization), most responsible for the creation (or most recent reprocessing) of this data. For example,
        <att name="creator_name">NOAA NMFS SWFSC ERD</att>
        If the data was extensively reprocessed (for example, satellite data from level 2 to level 3 or 4), then usually the reprocessor is listed as the creator and the original creator is listed via contributor_name. Compared to project, this is more flexible, since it may identify a person, an organization, or a project.
      • creator_email (from the ACDD (external link) metadata standard) is the RECOMMENDED way to identify an email address (correctly formatted) that provides a way to contact the creator. For example,
        <att name="creator_email">erd.data@noaa.gov</att>
      • creator_url (from the ACDD (external link) metadata standard) is the RECOMMENDED way to identify a URL for the organization that created the dataset, or a URL with the creator's information about this dataset (but that is more the purpose of infoUrl). For example,
        <att name="creator_url">http://www.pfeg.noaa.gov</att>
      • date_created (from the ACDD (external link) metadata standard) is the RECOMMENDED way to identify the date on which the data was first created (for example, processed into this form), in ISO 8601 format. For example,
        <att name="date_created">2010-01-30</att>
        If data is periodically added to the dataset, this is the first date that the original data was made available.
      • date_modified (from the ACDD (external link) metadata standard) is the RECOMMENDED way to identify the date on which the data was last modified (for example, when an error was fixed or when the latest data was added), in ISO 8601 format. For example,
        <att name="date_modified">2012-03-15</att>
      • date_issued (from the ACDD (external link) metadata standard) is the RECOMMENDED way to identify the date on which the data was first made available to others, in ISO 8601 format, for example, 2012-03-15. For example,
        <att name="date_issued">2010-07-30</att>
        For example, the dataset may have a date_created of 2010-01-30, but was only made publicly available 2010-07-30. date_issued is less commonly used than date_created and date_modified. If date_issued is omitted, it is assumed to be the same as the date_created.
      • drawLandMask - This is a RECOMMENDED global attribute used by ERDDAP (it is not from any metadata standard) which specifies the default value for the "Draw Land Mask" option on the dataset's Make A Graph form (datasetID.graph) and for the &.land parameter in a URL requesting a graph/map of the data. For example,
        <att name="drawLandMask">over</att>
        (However, if drawLandMask is specified in a variable's attributes, that value has precedence.)
        • For EDDGrid datasets, this specifies whether the land mask on a map is drawn over or under the grid data. over is recommended for oceanographic data (so that grid data over land is obscured by the landmask). under is recommended for all other data.
        • For EDDTable datasets: over makes the land mask on a map visible (land appears as a uniform gray area). over is commonly used for purely oceanographic datasets. under makes the land mask invisible (topography information is displayed for ocean and land areas). under is commonly used for all other data.
        • If any other value (or no value) is specified, the drawLandMask value from setup.xml is used. If none is specified there, over is the default.
      • featureType (from the CF (external link) metadata standard) is IGNORED and/or REPLACED. If the dataset's cdm_data_type is appropriate, ERDDAP will automatically use it to create a featureType attribute. So there is no need for you to add it.

        However, if you are using EDDTableFromNcCFFiles to create a dataset from files that follow the CF Discrete Sampling Geometries (DSG) standard (external link), the files themselves must have featureType correctly defined, so that ERDDAP can read the files correctly. That is part of the CF DSG requirements for that type of file.

      • history (from the CF (external link) and ACDD (external link) metadata standards) is a RECOMMENDED multi-line string global attribute with a line for every processing step that the data has undergone. For example,
        <att name="history">2011-08-05T08:55:02Z CMOR: Rewrote data to comply with CF standards.
        2012-04-08T08:34:58Z CMOR: Converted 'height' type from 'd' to 'f'.</att>
        • Ideally, each line has an ISO 8601:2004(E) formatted date+timeZ (for example, 2011-08-05T08:55:02Z) followed by a description of the processing step.
        • ERDDAP creates this if it doesn't already exist.
        • If it already exists, ERDDAP will append new information to the existing information.
        • history is important because it allows clients to backtrack to the original source of the data.
      • infoUrl is a REQUIRED global attribute with the URL of a web page with more information about this dataset (usually at the source institution's web site). For example,
        <att name="infoUrl">http://www.globec.org/</att>
        • Either the dataset's global sourceAttributes or its global <addAttributes> MUST include this attribute.
        • infoUrl is important because it allows clients to find out more about the data from the original source.
        • ERDDAP displays a link to the infoUrl on the dataset's Data Access Form (datasetID.html), Make A Graph web page (datasetID.graph), and other web pages.
        • If the URL has a query part (after the "?"), it MUST already be percent encoded. Encode special characters in the right-hand-side values of any constraints into the form %HH, where HH is the 2-digit hexadecimal value of the character. Usually, you just need to convert a few of the punctuation characters: % into %25, & into %26, " into %22, = into %3D, + into %2B, | into %7C, and space into %20, and convert all characters above #127 into their UTF-8 form and then percent encode each byte of the UTF-8 form into the %HH format (ask a programmer for help). But in some situations, you need to percent encode all characters other than A-Za-z0-9_-!.~'()* .
        • Since datasets.xml is an XML file, you MUST also encode '&', '<', and '>' in the URL as '&amp;', '&lt;', and '&gt;'.
        • infoUrl is unique to ERDDAP. It is not from any metadata standard.
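        For example, a hypothetical infoUrl whose query part contains a space and an ampersand would first be percent encoded and then XML-encoded for datasets.xml (the URL and its parameters are made up for illustration):
        <att name="infoUrl">https://www.example.org/search?title=sea%20surface%20temperature&amp;format=html</att>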
      • institution (from the CF and ACDD metadata standards) is a REQUIRED global attribute with the short version of the name of the institution which is the source of this data (usually an acronym, usually <20 characters). For example,
        <att name="institution">NASA GSFC</att>
        • Either the dataset's global sourceAttributes or its global <addAttributes> MUST include this attribute.
        • ERDDAP displays the institution whenever it displays a list of datasets. If an institution is longer than 20 characters, only the first 20 characters will be visible in the list of datasets (but the whole institution can be seen by putting the mouse cursor over the adjacent "?" icon).
        • If you add institution to the list of <categoryAttributes> in ERDDAP's setup.xml file, users can easily find datasets from the same institution via ERDDAP's "Search for Datasets by Category" on the home page.
      • keywords (from the ACDD metadata standard) is a RECOMMENDED comma-separated list of words and short phrases (for example, GCMD Science Keywords) that describe the dataset in a general way, without assuming any other knowledge of the dataset (for example, for oceanographic data, include ocean). For example,
        <att name="keywords">Oceans > Ocean Circulation > Ocean Currents,
        ano, circulation, coastwatch, currents, derived, eastward, eastward_sea_water_velocity, experimental, hf radio, meridional, noaa, northward, northward_sea_water_velocity, nuevo, ocean, oceans, radio, radio-derived, scan, sea, seawater, velocity, water, zonal</att>
      • keywords_vocabulary (from the ACDD metadata standard) is a RECOMMENDED attribute: if you are following a guideline for the words/phrases in your keywords attribute (for example, GCMD Science Keywords), put the name of that guideline here. For example,
        <att name="keywords_vocabulary">GCMD Science Keywords</att>
      • license (from the ACDD metadata standard) is a STRONGLY RECOMMENDED global attribute with the license and/or usage restrictions. For example,
        <att name="license">[standard]</att>
        • If "[standard]" occurs in the attribute value, it will be replaced by the standard ERDDAP license from the <standardLicense> tag in messages.xml.
      • Metadata_Conventions is from the outdated ACDD 1.0 metadata standard (which was identified in Metadata_Conventions as "Unidata Dataset Discovery v1.0"). The attribute value was a comma-separated list of metadata conventions used by this dataset.
        If a dataset uses ACDD 1.0, this attribute is STRONGLY RECOMMENDED, for example,
        <att name="Metadata_Conventions">COARDS, CF-1.6, Unidata Dataset Discovery v1.0</att>
        But ERDDAP now recommends ACDD-1.3. If you have switched your datasets to use ACDD-1.3, use of Metadata_Conventions is STRONGLY DISCOURAGED: just use Conventions instead.
      • processing_level (from the ACDD metadata standard) is a RECOMMENDED textual description of the processing level (for example, one of the NASA satellite data processing levels, such as Level 3) or quality control level (for example, Science Quality) of the data. For example,
        <att name="processing_level">3</att>
      • project (from the ACDD metadata standard) is an OPTIONAL attribute to identify the project that the dataset is part of. For example,
        <att name="project">GTSPP</att>
        If the dataset isn't part of a project, don't use this attribute. Compared to creator_name, this is focused on the project (not a person or an organization, which may be involved in multiple projects).
      • publisher_name (from the ACDD metadata standard) is the RECOMMENDED way to identify the person, organization, or project which is publishing this dataset. For example,
        <att name="publisher_name">JPL</att>
        For example, you are the publisher if another person or group created the dataset and you are just re-serving it via ERDDAP. If "publisher" doesn't really apply to a dataset, omit this attribute. Compared to creator_name, the publisher probably didn't significantly modify or reprocess the data; the publisher is just making the data available in a new venue.
      • publisher_email (from the ACDD metadata standard) is the RECOMMENDED way to identify an email address (correctly formatted, for example, john_smith@great.org) that provides a way to contact the publisher. For example,
        <att name="publisher_email">john_smith@great.org</att>
        If "publisher" doesn't really apply to a dataset, omit this attribute.
      • publisher_url (from the ACDD metadata standard) is the RECOMMENDED way to identify a URL for the organization that published the dataset, or a URL with the publisher's information about this dataset (but that is more the purpose of infoUrl). For example,
        <att name="publisher_url">http://podaac.jpl.nasa.gov</att>
        If "publisher" doesn't really apply to a dataset, omit this attribute.
      • sourceUrl is a global attribute with the URL of the source of the data. For example,
        <att name="sourceUrl">http://opendap.co-ops.nos.noaa.gov/ioos-dif-sos/SOS</att>
        • ERDDAP usually creates this global attribute automatically. Two exceptions are EDDTableFromHyraxFiles and EDDTableFromThreddsFiles.
        • If the source is local files and the files were created by your organization, use
          <att name="sourceUrl">(local files)</att>
        • If the source is local database and the data was created by your organization, use
          <att name="sourceUrl">(local database)</att>
        • sourceUrl is important because it allows clients to backtrack to the original source of the data.
        • sourceUrl is unique to ERDDAP. It is not from any metadata standard.
      • standard_name_vocabulary (from the ACDD metadata standard) is a RECOMMENDED attribute to identify the name of the controlled vocabulary from which variable standard_names are taken. For example,
        <att name="standard_name_vocabulary">CF Standard Name Table v29</att>
        for version 29 of the CF standard name table.
      • subsetVariables (for EDDTable datasets only) is a RECOMMENDED global attribute that lets you specify a comma-separated list of <dataVariable> destinationNames to identify variables which have a limited number of values (stated another way: variables for which each of the values has many duplicates). For example,
        <att name="subsetVariables">station_id, longitude, latitude</att>
        If this attribute is present, the dataset will have a datasetID.subset web page (and a link to it on every dataset list) which lets users quickly and easily select various subsets of the data.
        • Each time a dataset is loaded, ERDDAP loads and caches all of the distinct() subsetVariable data. Then, all user requests for distinct() subsetVariable data will be very fast.
        • The order of the destinationNames you specify determines the sort order on the datasetID.subset web page, so you will usually specify the variables in order of decreasing importance. For example, for datasets with time series data for several stations, you might use
          <att name="subsetVariables">station_id, longitude, latitude</att>
          so that the values are sorted by station_id.
        • The suggested usage is: include the feature variables (variables with information about the stations, profiles, and/or trajectories) in the subsetVariables list, and don't include the data variables (e.g., time, temperature, salinity, current speed) in the list. But it is your choice which variables to include in the subsetVariables list.
        • If the number of distinct combinations of these variables is greater than about 1,000,000, you should consider restricting the subsetVariables that you specify to reduce the number of distinct combinations to below 1,000,000; otherwise, the datasetID.subset web pages may be generated slowly.
        • If the number of distinct values of any one subset variable is greater than about 20,000, you should consider not including that variable in the list of subsetVariables; otherwise, it takes a long time to transmit the datasetID.subset, datasetID.graph, and datasetID.html web pages. A compromise is: remove variables from the list that users are not likely to select from a drop down list.
        • You should test each dataset to see if the subsetVariables setting is okay. If the source data server is slow and it takes too long (or fails) to download the data, either reduce the number of variables specified or remove the subsetVariables global attribute.
        • subsetVariables is very useful, so if your dataset is suitable, please create a subsetVariables attribute.
        • EDDTableFromSOS automatically adds
          <att name="subsetVariables">station_id, longitude, latitude</att>
          when the dataset is created.
        • Possible warning: if a user using the datasetID.subset web page selects a value which has a carriageReturn or newline character, datasetID.subset will fail. ERDDAP can't work around this issue because of some HTML details. In any case, it is almost always a good idea to remove the carriageReturn and newline characters from the data. To help you fix the problem, if the EDDTable.subsetVariablesDataTable method in ERDDAP detects data values that will cause trouble, it will email a warning with a list of offending values to the emailEverythingTo email addresses specified in setup.xml. That way, you know what needs to be fixed.
        • Pre-generated subset tables. Normally, when ERDDAP loads a dataset, it requests the distinct() subset variables data table from the data source, just via a normal data request. In some cases, this data is not available from the data source or retrieving from the data source may be hard on the data source server. If so, you can supply a table with the information in a .json or .csv file with the name tomcat/content/erddap/subset/datasetID.json (or .csv). If present, ERDDAP will read it once when the dataset is loaded and use it as the source of the subset data.
          • If there is an error while reading it, the dataset will fail to load.
          • It MUST have the exact same column names (for example, the same case) as <subsetVariables>, but the columns MAY be in any order.
          • It MAY have extra columns (they'll be removed and newly redundant rows will be removed).
          • Time and timestamp columns should have ISO 8601:2004(E) formatted date+timeZ strings (for example, 1985-01-31T15:31:00Z).
          • Missing values should be missing values (not fake numbers like -99).
          • .json files may be a little harder to create but deal with Unicode characters well. .json files are easy to create if you create them with ERDDAP.
          • .csv files are easy to work with, but suitable for ISO 8859-1 characters only. .csv files MUST have column names on the first row and data on subsequent rows.
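          For example, a pre-generated subset table for <att name="subsetVariables">station_id, longitude, latitude</att> could be supplied as tomcat/content/erddap/subset/datasetID.csv with contents like this (the station IDs and positions here are made up for illustration):
          station_id,longitude,latitude
          46011,-120.99,34.88
          46025,-119.05,33.75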
      • summary (from the CF and ACDD metadata standards) is a REQUIRED global attribute with a long description of the dataset (usually <500 characters). For example,
        <att name="summary">VIIRSN Level-3 Standard Mapped Image, Global, 4km, Chlorophyll a, Daily. The Visible and Infrared Imager/Radiometer Suite (VIIRS) is a multi-disciplinary instrument that flies on the National Polar-orbiting Operational Environmental Satellite System (NPOESS) series of spacecraft, including the NPOESS Preparatory Project (NPP).</att>
        • Either the dataset's global sourceAttributes or its global <addAttributes> MUST include this attribute.
        • summary is very important because it allows clients to read a description of the dataset that has more information than the title and thus quickly understand what the dataset is.
        • Advice: please write the summary so it would work to describe the dataset to some random person you meet on the street or to a colleague. Remember to include the Five W's and one H: Who created the dataset? What information was collected? When was the data collected? Where was it collected? Why was it collected? How was it collected?
        • ERDDAP displays the summary on the dataset's Data Access Form (datasetID.html), Make A Graph web page (datasetID.graph), and other web pages. ERDDAP uses the summary when creating FGDC and ISO 19115 documents.
      • title (from the CF and ACDD metadata standards) is a REQUIRED global attribute with the short description of the dataset (usually <80 characters). For example,
        <att name="title">VIIRSN Level-3 Mapped, Global, 4km, Chlorophyll a, Daily</att>
        • Either the dataset's global sourceAttributes or its global <addAttributes> MUST include this attribute.
        • title is important because every list of datasets presented by ERDDAP (other than search results) lists the datasets in alphabetical order, by title. So if you want to specify the order of datasets, or have some datasets grouped together, you have to create titles with that in mind. Many lists of datasets (for example, in response to a category search), show a subset of the full list and in a different order. So the title for each dataset should stand on its own.
        • If a title is longer than 80 characters, only the first and last 40 characters will be visible in the list of datasets (but the whole title can be seen by putting the mouse cursor over the adjacent "?" icon).
        • If the title contains the word "DEPRECATED" (all capital letters), then the dataset will get a lower ranking in searches.
           
    • <axisVariable> is used to describe a dimension (also called "axis").
      For EDDGrid datasets, one or more axisVariable tags is REQUIRED, and all dataVariables always share/use all axis variables. (Why? What if they don't?)
      There MUST be an axis variable for each dimension of the data variables.
      Axis variables MUST be specified in the order that the data variables use them.
      (EDDTable datasets can NOT use <axisVariable> tags.)
      A fleshed out example is:
      <axisVariable>
        <sourceName>MT</sourceName> 
        <destinationName>time</destinationName>
        <addAttributes>
          <att name="units">days since 1902-01-01T12:00:00Z</att>
        </addAttributes>
      </axisVariable> 
      <axisVariable> supports the following subtags:
      • <sourceName> - the data source's name for the variable. This is the name that ERDDAP will use when requesting data from the data source. This is the name that ERDDAP will look for when data is returned from the data source. This is case sensitive. This is REQUIRED.
      • <destinationName> is the name for the variable that will be shown to and used by ERDDAP users.
        • This is OPTIONAL. If absent, the sourceName is used.
        • This is useful because it allows you to change a cryptic or odd sourceName.
        • destinationName is case sensitive.
        • destinationNames MUST start with a letter (A-Z, a-z) and MUST be followed by 0 or more characters (A-Z, a-z, 0-9, and _). ('-' was allowed before ERDDAP version 1.10.) This restriction allows axis variable names to be used as variable names in a programming language (such as Matlab).
        • In EDDGrid datasets, the longitude, latitude, altitude, depth, and time axis variables are special.
      • <addAttributes> defines an OPTIONAL set of attributes (name = value) which are added to the source's attributes for a variable, to make the combined attributes for a variable.
        If the variable's sourceAttributes or <addAttributes> include scale_factor and/or add_offset attributes, their values will be used to unpack the data from the source before distribution to the client (resultValue = sourceValue * scale_factor + add_offset). The unpacked variable will be of the same data type (for example, float) as the scale_factor and add_offset values.
         
    • <dataVariable> is a REQUIRED (for almost all datasets) tag within the <dataset> tag which is used to describe a data variable. There MUST be 1 or more instances of this tag. A fleshed out example is:
      <dataVariable>
        <sourceName>waterTemperature</sourceName>
        <destinationName>sea_water_temperature</destinationName>
        <dataType>float</dataType>
        <addAttributes>
          <att name="ioos_category">Temperature</att>
          <att name="long_name">Sea Water Temperature</att>
          <att name="standard_name">sea_water_temperature</att>
          <att name="units">degree_C</att>
        </addAttributes>
      </dataVariable>  
      <dataVariable> supports the following subtags:
      • <sourceName> - the data source's name for the variable. This is the name that ERDDAP will use when requesting data from the data source. This is the name that ERDDAP will look for when data is returned from the data source. This is case sensitive. This is REQUIRED.

        In an EDDTable dataset, if you want to create a variable (with a fixed value) that isn't in the source dataset, use:
        <sourceName>=fixedValue</sourceName>
        The initial equals sign tells ERDDAP that a fixedValue will follow.
        The other tags for the <dataVariable> work as if this were a regular variable.
        For example, to create a variable called altitude with a fixed value of 0.0 (float), use:
        <sourceName>=0</sourceName>
        <destinationName>altitude</destinationName>
        <dataType>float</dataType>

      • <destinationName> - the name for the variable that will be shown to and used by ERDDAP users.
        • This is OPTIONAL. If absent, the sourceName is used.
        • This is useful because it allows you to change a cryptic or odd sourceName.
        • destinationName is case sensitive.
        • destinationNames MUST start with a letter (A-Z, a-z) and MUST be followed by 0 or more characters (A-Z, a-z, 0-9, and _). ('-' was allowed before ERDDAP version 1.10.) This restriction allows data variable names to be used as variable names in a programming language (like Matlab).
        • In EDDTable datasets, longitude, latitude, altitude (or depth), and time data variables are special.
      • <dataType> - specifies the data type coming from the source. (In some cases, for example, when reading data from ASCII files, it specifies how the data coming from the source should be stored.)
        • This is REQUIRED by some dataset types and IGNORED by others. Dataset types that require this for their dataVariables are: EDDGridFromXxxFiles, EDDTableFromXxxFiles, EDDTableFromMWFS, EDDTableFromNOS, EDDTableFromSOS. Other dataset types ignore this tag because they get the information from the source.
        • Valid values are: double (64-bit floating point), float (32-bit floating point), long (64-bit signed integer), int (32-bit signed integer), short (16-bit signed integer), byte (8-bit signed integer), char (essentially: 16-bit unsigned integer), boolean, and String (any length).
        • "boolean" is a special case.
          • Internally, ERDDAP doesn't support a boolean type because booleans can't store missing values.
          • Also, DAP doesn't support booleans, so there is no standard way to query boolean variables.
          • Specifying "boolean" for the dataType in datasets.xml will cause boolean values to be stored and represented as bytes: 0=false, 1=true.
          • Clients can specify constraints by using the numeric values (for example, "isAlive=1"). But ERDDAP administrators need to use the "boolean" dataType in datasets.xml to tell ERDDAP how to interact with the data source.
        • If you want to change a data variable from the dataType in the source files (for example, short) into some other dataType in the dataset (for example, int), don't use <dataType> to specify what you want. (It works for some types of datasets, but not others.) Instead:
          • Use <dataType> to specify what is in the files (for example, short).
          • In the <addAttributes> for the variable, add a scale_factor attribute with the new dataType (for example, int) and a value of 1, for example,
            <att name="scale_factor" type="int">1</att>
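            Putting those two steps together, a sketch of a <dataVariable> whose files store short values but which ERDDAP serves as int (the variable names here are hypothetical):
            <dataVariable>
              <sourceName>waterTemp</sourceName>
              <destinationName>sea_water_temperature</destinationName>
              <dataType>short</dataType>
              <addAttributes>
                <att name="scale_factor" type="int">1</att>
              </addAttributes>
            </dataVariable>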
      • <addAttributes> - defines a set of attributes (name = value) which are added to the source's attributes for a variable, to make the combined attributes for a variable. This is OPTIONAL.
        If the variable's sourceAttributes or <addAttributes> include scale_factor and/or add_offset attributes, their values will be used to unpack the data from the source before distribution to the client. The unpacked variable will be of the same data type (for example, float) as the scale_factor and add_offset values.
         
    • Variable Attributes / Variable <addAttributes> - <addAttributes> is an OPTIONAL tag within an <axisVariable> or <dataVariable> tag which is used to change the variable's attributes.
      • Use a variable's <addAttributes> to change the variable's attributes. ERDDAP combines a variable's attributes from the dataset's source (sourceAttributes) and the variable's addAttributes which you define in datasets.xml (which have priority) to make the variable's "combinedAttributes", which are what ERDDAP users see. Thus, you can use addAttributes to redefine the values of sourceAttributes, add new attributes, or remove attributes.
      • See the <addAttributes> information which applies to global and variable <addAttributes>.
      • ERDDAP looks for and uses many of these attributes in various ways. For example, the colorBar values are required to make a variable available via WMS, so that maps can be made with consistent colorBars.
      • The longitude, latitude, altitude (or depth), and time variables get lots of appropriate metadata automatically (for example, units).
      • A sample <addAttributes> for a data variable is:
        <addAttributes> 
          <att name="actual_range" type="doubleList">10.34 23.91</att>
          <att name="colorBarMinimum" type="double">0</att>
          <att name="colorBarMaximum" type="double">32</att>
          <att name="ioos_category">Temperature</att>
          <att name="long_name">Sea Surface Temperature</att>
          <att name="numberOfObservations" /> 
          <att name="units">degree_C</att>
        </addAttributes>
        The empty numberOfObservations attribute causes the source numberOfObservations attribute (if any) to be removed from the final, combined list of attributes.
      • Supplying this information helps ERDDAP do a better job and helps users understand the datasets.
        Good metadata makes a dataset usable.
        Insufficient metadata makes a dataset useless.
        Please take the time to do a good job with metadata attributes.

      Comments about variable attributes that are special in ERDDAP:
       

      • actual_range (from CDC COARDS) is a RECOMMENDED variable attribute. For example,
        <att name="actual_range" type="floatList">0.17 23.58</att>
        • If present, it MUST be an array of two values of the same data type as the variable, specifying the actual (not the theoretical or the allowed) minimum and maximum values of the data for that variable.
        • If the data is packed with scale_factor and/or add_offset, actual_range should have packed values.
        • For some data sources (for example, all EDDTableFrom...Files datasets), ERDDAP determines the actual_range of each variable and sets the actual_range attribute. With other data sources (for example, relational databases, Cassandra, DAPPER, Hyrax), it might be troublesome or burdensome for the source to calculate the range, so ERDDAP doesn't request it. In this case, it is best if you can set actual_range (especially for the longitude, latitude, altitude, depth, and time variables) by adding an actual_range attribute to each variable's <addAttributes> for this dataset in datasets.xml, for example,
          <att name="actual_range" type="doubleList">-180 180</att>
        • For numeric time and timestamp variables, the values specified should be the relevant source (not destination) numeric values. For example, if the source time values are stored as "days since 1985-01-01", then the actual_range should be specified in "days since 1985-01-01". And if you want to refer to NOW as the second value for near-real-time data that is periodically updated, you should use NaN . For example, to specify a data range of 1985-01-17 until NOW, use
          <att name="actual_range" type="doubleList">16 NaN</att>
        • If actual_range is known (either by ERDDAP calculating it or by you adding it via <addAttributes>), ERDDAP will display it to the user on the Data Access Form (datasetID.html) and Make A Graph web pages (datasetID.graph) for that dataset and use it when generating the FGDC and ISO 19115 metadata. Also, the last 7 days of time's actual_range are used as the default time subset.
        • If actual_range is known, users can use the min() and max() functions in requests, which is often very useful.
        • For all EDDTable... datasets, if actual_range is known (either by you specifying it or by ERDDAP calculating it), ERDDAP will be able to quickly reject any requests for data outside that range. For example, if the dataset's lowest time value corresponds to 1985-01-17, then a request for all data from 1985-01-01 through 1985-01-16 will be immediately rejected with the error message "Your query produced no matching results." This makes actual_range a very important piece of metadata, as it can save ERDDAP a lot of effort and save the user a lot of time.
        • When a user selects a subset of data and requests a file type that includes metadata (for example, .nc), ERDDAP modifies actual_range in the response file to reflect the subset's range.
        • See also data_min and data_max, which are an alternative way to specify the actual_range.
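        For example, assuming a tabledap dataset with time and sea_water_temperature variables, a request for the most recent week of data might use the max() function like this (the dataset and variable names are hypothetical):
        https://yourServer/erddap/tabledap/datasetID.htmlTable?time,sea_water_temperature&time>=max(time)-7days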
      • Color Bar Attributes - There are several OPTIONAL variable attributes which specify the suggested default attributes for a color bar (used to convert data values into colors on images) for this variable.
        • If present, this information is used as default information by griddap and tabledap whenever you request an image that uses a color bar.
        • For example, when latitude-longitude gridded data is plotted as a coverage on a map, the color bar specifies how the data values are converted to colors.
        • Having these values allows ERDDAP to create images which use a consistent color bar across different requests, even when the time or other dimension values vary.
        • These attribute names were created for use in ERDDAP. They are not from a metadata standard.
        • WMS - The main requirements for a variable to be accessible via ERDDAP's WMS server are:
          • The dataset must be an EDDGrid... dataset.
          • The data variable MUST be a gridded variable.
          • The data variable MUST have longitude and latitude axis variables. (Other axis variables are OPTIONAL.)
          • There MUST be some longitude values between -180 and 180.
          • The colorBarMinimum and colorBarMaximum attributes MUST be specified. (Other color bar attributes are OPTIONAL.)
        • The attributes related to the color bar are:
          • colorBarMinimum specifies the minimum value on the colorBar. For example,
            <att name="colorBarMinimum" type="double">-5</att>
            • If the data is packed with scale_factor and/or add_offset, specify the colorBarMinimum as an unpacked value.
            • Data values lower than colorBarMinimum are represented by the same color as colorBarMinimum values.
            • The attribute should be of type="double", regardless of the data variable's type.
            • The value is usually a nice round number.
            • Best practices: We recommend a value slightly higher than the minimum data value.
            • There is no default value.
          • colorBarMaximum specifies the maximum value on the colorBar. For example,
            <att name="colorBarMaximum" type="double">5</att>
            • If the data is packed with scale_factor and/or add_offset, specify the colorBarMaximum as an unpacked value.
            • Data values higher than colorBarMaximum are represented by the same color as colorBarMaximum values.
            • The attribute should be of type="double", regardless of the data variable's type.
            • The value is usually a nice round number.
            • Best practices: We recommend a value slightly lower than the maximum data value.
            • There is no default value.
          • colorBarPalette specifies the palette for the colorBar. For example,
            <att name="colorBarPalette">WhiteRedBlack</att>
            • All ERDDAP installations support these standard palettes: BlackBlueWhite, BlackRedWhite, BlackWhite, BlueWhiteRed, LightRainbow, Ocean, Rainbow, RedWhiteBlue, ReverseRainbow, Topography, WhiteBlack, WhiteBlueBlack, and WhiteRedBlack.
            • If you have installed additional palettes, you can refer to one of them.
            • If this attribute isn't present, the default is BlueWhiteRed if -1*colorBarMinimum = colorBarMaximum; otherwise the default is Rainbow.
          • colorBarScale specifies the scale for the colorBar. For example,
            <att name="colorBarScale">Log</att>
            • Valid values are Linear and Log.
            • If the value is Log, colorBarMinimum must be greater than 0.
            • If this attribute isn't present, the default is Linear.
          • colorBarContinuous specifies whether the colorBar has a continuous palette of colors, or whether the colorBar has a few discrete colors. For example,
            <att name="colorBarContinuous">false</att>
            • Valid values are the strings true and false.
            • If this attribute isn't present, the default is true.
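          Putting several of these together, a sketch of the color bar attributes for a hypothetical sea surface temperature variable:
          <addAttributes>
            <att name="colorBarMinimum" type="double">0</att>
            <att name="colorBarMaximum" type="double">32</att>
            <att name="colorBarPalette">Rainbow</att>
            <att name="colorBarScale">Linear</att>
            <att name="colorBarContinuous">true</att>
          </addAttributes>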
      • data_min and data_max - These are RECOMMENDED variable attributes defined in the World Ocean Circulation Experiment (WOCE) metadata description. For example,
        <att name="data_min" type="float">0.17</att>
        <att name="data_max" type="float">23.58</att>
        • If present, they are of the same data type as the variable, and specify the actual (not the theoretical or the allowed) minimum and maximum values of the data for that variable.
        • If the data is packed with scale_factor and/or add_offset, data_min and data_max should be packed values.
        • If present, ERDDAP will extract the information and display it to the user on the Data Access Form (datasetID.html) and Make A Graph web pages (datasetID.graph) for that dataset.
        • This is an alternative to actual_range. All of the documentation for actual_range applies to data_min and data_max.
      • drawLandMask - This is an OPTIONAL variable attribute used by ERDDAP (and no metadata standards) which specifies the default value for the "Draw Land Mask" option on the dataset's Make A Graph form (datasetID.graph) and for the &.land parameter in a URL requesting a graph/map of the data. For example,
        <att name="drawLandMask">under</att>
        • For variables in EDDGrid datasets, this specifies whether the land mask on a map is drawn over or under the grid data. over is recommended for oceanographic data (so that grid data over land is obscured by the landmask). under is recommended for all other data.
        • For variables in EDDTable datasets: over makes the land mask on a map visible (land appears as a uniform gray area). over is commonly used for purely oceanographic datasets. under makes the land mask invisible (topography information is displayed for ocean and land areas). under is commonly used for all other data.
        • If any other value (or no value) is specified, the drawLandMask value from the dataset's global attributes is used.
      • ioos_category - This is a REQUIRED variable attribute if <variablesMustHaveIoosCategory> is set to true (the default) in setup.xml; otherwise, it is OPTIONAL.
        For example, <att name="ioos_category">Salinity</att>
        The categories are from NOAA's Integrated Ocean Observing System (IOOS) (external link).
        • As of this writing, we aren't aware of formal definitions of these names.
        • The core names are from Zdenka Willis' .ppt "Integrated Ocean Observing System (IOOS) NOAA's Approach to Building an Initial Operating Capability" and from the US IOOS Blueprint (external link) (page 1-5).
        • It is likely that this list will be revised in the future. If you have requests, please email bob.simons at noaa.gov.
        • ERDDAP supports a larger list of categories than IOOS does because Bob Simons added additional names (mostly based on the names of scientific fields, for example, Biology, Ecology, Meteorology, Statistics, Taxonomy) for other types of data.
        • The current valid values in ERDDAP are Bathymetry, Biology, Bottom Character, Colored Dissolved Organic Matter, Contaminants, Currents, Dissolved Nutrients, Dissolved O2, Ecology, Fish Abundance, Fish Species, Heat Flux, Hydrology, Ice Distribution, Identifier, Location, Meteorology, Ocean Color, Optical Properties, Other, Pathogens, pCO2, Phytoplankton Species, Pressure, Productivity, Quality, Salinity, Sea Level, Statistics, Stream Flow, Surface Waves, Taxonomy, Temperature, Time, Total Suspended Matter, Unknown, Wind, Zooplankton Species, and Zooplankton Abundance.
        • There is some overlap and ambiguity between different terms -- do your best.
        • If you add ioos_category to the list of <categoryAttributes> in ERDDAP's setup.xml file, users can easily find datasets with similar data via ERDDAP's "Search for Datasets by Category" on the home page.
          Try using ioos_category to search for datasets of interest.

        You may be tempted to set <variablesMustHaveIoosCategory> to false so that this attribute isn't required. ("Pfft! What's it to me?") Some reasons to leave it set to true (the default) and use ioos_category are:

        • If setup.xml's <variablesMustHaveIoosCategory> is set to true, GenerateDatasetsXml always creates/suggests an ioos_category attribute for each variable in each new dataset. So why not just leave it in?
        • ERDDAP lets users search for datasets of interest by category. ioos_category is a very useful search category because the ioos_categories (for example, Temperature) are quite broad. This makes ioos_category much better for this purpose than, for example, the much finer-grained CF standard_names (which aren't so good for this purpose because of all the synonyms and slight variations, for example, sea_surface_temperature vs. sea_water_temperature).
          (Using ioos_category for this purpose is controlled by <categoryAttributes> in your setup.xml file.)
          Try using ioos_category to search for datasets of interest.
        • These categories are from NOAA's Integrated Ocean Observing System (IOOS) (external link). These categories are fundamental to IOOS's description of IOOS's mission. If you are in NOAA, supporting ioos_category is a good One-NOAA thing to do. (Watch this One NOAA video (external link) and be inspired!) If you are in some other U.S. or international agency, or work with governmental agencies, or work with some other Ocean Observing System, isn't it a good idea to cooperate with the U.S. IOOS office?
        • Sooner or later, you may want some other ERDDAP to link to your datasets via EDDGridFromErddap and EDDTableFromErddap. If the other ERDDAP requires ioos_category, your datasets must have ioos_category in order for EDDGridFromErddap and EDDTableFromErddap to work.
        • It is psychologically much easier to include ioos_category when you first create the dataset (it's just one more thing ERDDAP requires before the dataset can be added) than to add it after the fact (if you decide to use it in the future).
      • long_name (COARDS (external link), CF (external link) and ACDD (external link) metadata standards) is a RECOMMENDED variable attribute in ERDDAP. For example,
        <att name="long_name">Eastward Sea Water Velocity</att>
        • ERDDAP uses the long_name for labeling axes on graphs.
        • Best practices: Capitalize the words in the long_name as if it were a title (capitalize the first word and all non-article words). Don't include the units in the long_name. The long name shouldn't be very long (usually <20 characters), but should be more descriptive than the destinationName, which is often very concise.
        • If "long_name" isn't defined in the variable's sourceAttributes or <addAttributes>, ERDDAP will generate it by cleaning up the standard_name (if present) or the destinationName.
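As a rough sketch of that fallback (hypothetical Python; ERDDAP's actual cleanup logic handles more cases):

```python
def suggest_long_name(standard_name=None, destination_name=None):
    # If long_name is absent, generate one by cleaning up the
    # standard_name (if present) or the destinationName:
    # replace underscores with spaces and capitalize each word.
    source = standard_name or destination_name or ""
    return " ".join(word.capitalize() for word in source.split("_"))

print(suggest_long_name(standard_name="eastward_sea_water_velocity"))
# Eastward Sea Water Velocity
```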
      • missing_value (default = NaN) and _FillValue (default = NaN) (COARDS (external link) and CF (external link)) are variable attributes which describe a number (for example, -9999) which is used to represent a missing value. For example,
        <att name="missing_value" type="double">-9999</att>
        • ERDDAP supports missing_value and _FillValue, since some data sources assign slightly different meanings to them.
        • If present, they should be of the same data type as the variable.
        • If the data is packed with scale_factor and/or add_offset, the missing_value and _FillValue values should be likewise packed.
        • If a variable uses these special numbers, the missing_value and/or _FillValue attributes are REQUIRED.
        • For some output data formats, ERDDAP will leave these special numbers intact.
        • For other output data formats, ERDDAP will replace these special numbers with NaN or "".
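For the formats where the special numbers are replaced, the effect is like this hypothetical Python sketch (to_destination is an invented name, not an ERDDAP function):

```python
import math

def to_destination(values, missing_value=None, fill_value=None):
    # Replace the declared special numbers with NaN, as ERDDAP does
    # for output formats that support NaN (sketch, not ERDDAP code).
    special = {missing_value, fill_value}
    return [math.nan if v in special else v for v in values]

print(to_destination([12.5, -9999.0, 13.1], missing_value=-9999.0))
```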
      • scale_factor (default = 1) and add_offset (default = 0) (COARDS (external link) and CF (external link)) are OPTIONAL variable attributes which describe data which is packed in a simpler data type via a simple transformation.
        • If present, their data type is different from the source data type and describes the data type of the destination values.
          For example, a data source might have stored float data values with one decimal digit packed as short ints (int16), using scale_factor = 0.1 and add_offset = 0. For example,
          <att name="scale_factor" type="float">0.1</att>
          <att name="add_offset" type="float">0</att>
          In this example, ERDDAP would unpack the data and present it to the user as float data values.
        • If present, ERDDAP will extract the values from these attributes, remove the attributes, and automatically unpack the data for the user:
            destinationValue = sourceValue * scale_factor + add_offset
          Or, stated another way:
            unpackedValue = packedValue * scale_factor + add_offset
        • If you have a collection of gridded .nc files where, for a given variable, some files use one combination of scale_factor + add_offset, and one or more other subsets of the files use other combinations of scale_factor + add_offset, you can use EDDGridFromNcFilesUnpacked. It unpacks the values at a lower level, thereby hiding the differences, so that you can make one dataset from the collection of heterogeneous files.
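The unpacking transformation above, as a minimal Python sketch (hypothetical helper):

```python
def unpack(packed_values, scale_factor=1.0, add_offset=0.0):
    # destinationValue = sourceValue * scale_factor + add_offset
    return [v * scale_factor + add_offset for v in packed_values]

# short int (int16) source values packed with scale_factor=0.1, add_offset=0
print(unpack([123, 257], scale_factor=0.1))   # about [12.3, 25.7]
```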
      • standard_name (from the ACDD (external link) metadata standard) is a RECOMMENDED variable attribute in ERDDAP. (CF (external link) maintains a list of CF standard names (external link)) For example,
        <att name="standard_name">eastward_sea_water_velocity</att>
        • If you add standard_name to variables' attributes and add standard_name to the list of <categoryAttributes> in ERDDAP's setup.xml file, users can easily find datasets with similar data via ERDDAP's "Search for Datasets by Category" on the home page.
        • If you specify a CF standard_name for a variable, the units attribute for the variable doesn't have to be identical to the Canonical Units specified for the standard name in the CF Standard Name table, but the units MUST be convertible to the Canonical Units. For example, all temperature-related CF standard_names have "K" (Kelvin) as the Canonical Units. So your variable MUST have units of K, degrees_C, degrees_F, or some UDUnits variant of those names, since they are all inter-convertible.
        • Best practices: Part of the power of controlled vocabularies (external link) comes from using only the terms in the list. So we recommend sticking to the terms defined in the controlled vocabulary, and we recommend against making up a term if there isn't an appropriate one in the list. If you need additional terms, see if the standards committee will add them to the controlled vocabulary.
      • time_precision
        • time_precision is an OPTIONAL attribute used by ERDDAP (and no metadata standards) for time and timestamp variables, which may be in gridded datasets or tabular datasets, and in axisVariables or dataVariables. For example,
          <att name="time_precision">1970-01-01</att>
          time_precision specifies the precision to be used whenever ERDDAP formats the time values from that variable as strings on web pages, including .htmlTable responses. In file formats where ERDDAP formats times as strings (for example, .csv and .json), ERDDAP only uses the time_precision-specified format if it includes fractional seconds; otherwise, ERDDAP uses the 1970-01-01T00:00:00Z format.
        • Valid values are 1970-01, 1970-01-01, 1970-01-01T00Z, 1970-01-01T00:00Z, 1970-01-01T00:00:00Z (the default), 1970-01-01T00:00:00.0Z, 1970-01-01T00:00:00.00Z, 1970-01-01T00:00:00.000Z. [1970 is not an option because it is a single number, so ERDDAP can't know if it is a formatted time string (a year) or if it is some number of seconds since 1970-01-01T00:00:00Z.]
        • If time_precision isn't specified or the value isn't matched, the default value will be used.
        • Here, as in other parts of ERDDAP, any fields of the formatted time that are not displayed are assumed to have the minimum value. For example, 1985-07, 1985-07-01, 1985-07-01T00Z, 1985-07-01T00:00Z, and 1985-07-01T00:00:00Z are all considered equivalent, although with different levels of precision implied. This matches the ISO 8601:2004 "extended" Time Format Specification (external link).
        • WARNING: You should only use a limited time_precision if all of the data values for the variable have only the minimum value for all of the fields that are hidden.
          • For example, you can use a time_precision of 1970-01-01 if all of the data values have hour=0, minute=0, and second=0 (for example, 2005-03-04T00:00:00Z and 2005-03-05T00:00:00Z).
          • For example, don't use a time_precision of 1970-01-01 if there are non-zero hour, minute, or second values (for example, 2005-03-05T12:00:00Z), because the non-zero hour value wouldn't be displayed. Then, if a user asked for all data with time=2005-03-05, the request would fail unexpectedly.
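The truncation behavior can be sketched in Python (a hypothetical helper; it assumes a full "yyyy-MM-ddTHH:mm:ssZ" input and a precision without fractional seconds, so it is much simpler than ERDDAP's real formatting):

```python
def format_time(full_iso_time, time_precision="1970-01-01T00:00:00Z"):
    # Keep only as much of the formatted time as the precision shows.
    if time_precision.endswith("Z"):
        return full_iso_time[:len(time_precision) - 1] + "Z"
    return full_iso_time[:len(time_precision)]

print(format_time("2005-03-04T00:00:00Z", "1970-01-01"))      # 2005-03-04
print(format_time("2005-03-04T07:12:59Z", "1970-01-01T00Z"))  # 2005-03-04T07Z
```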
      • units (COARDS (external link), CF (external link) and ACDD (external link) metadata standard) defines the units of the data values. For example,
        <att name="units">degree_C</att>
        • "units" is REQUIRED as either a sourceAttribute or an addAttribute for "time" variables and is STRONGLY RECOMMENDED for other variables whenever appropriate (which is almost always).
        • In general, we recommend UDUnits (external link)-compatible units, which are required by the COARDS (external link) and CF (external link) standards.
        • Another common standard is UCUM (external link) - the Unified Code for Units of Measure. OGC (external link) services such as SOS (external link), WCS (external link), and WMS (external link) require UCUM and often refer to UCUM as UOM (Units Of Measure).
        • We recommend that you use one units standard for all datasets in your ERDDAP. You should tell ERDDAP which standard you are using with <units_standard>, in your setup.xml file.
        • For time and timestamp variables, either the variable's sourceAttributes or <addAttributes> (which takes precedence) MUST have units which is either
          • For time axis variables or time data variables with numeric data: UDUnits (external link)-compatible string (with the format units since baseTime) describing how to interpret source time values (for example, seconds since 1970-01-01T00:00:00Z).

            units can be any one of:
            ms, msec, msecs, millis, millisec, millisecs, millisecond, milliseconds,
            s, sec, secs, second, seconds, m, min, mins, minute, minutes, h, hr, hrs, hour, hours,
            d, day, days, week, weeks, mon, mons, month, months, yr, yrs, year, or years.
            Technically, ERDDAP does NOT follow the UDUNITS standard when converting "years since" and "months since" time values to "seconds since". The UDUNITS standard defines a year as a fixed, single value: 3.15569259747e7 seconds. And UDUNITS defines a month as year/12. Unfortunately, most/all datasets that we have seen that use "years since" or "months since" clearly intend the values to be calendar years or calendar months. For example, 3 "months since 1970-01-01" is usually intended to mean 1970-04-01. So, ERDDAP interprets "years since" and "months since" as calendar years and months, and does not strictly follow the UDUNITS standard.

            Ideally, the baseTime is an ISO 8601:2004(E) formatted date time string
            (yyyy-MM-dd'T'HH:mm:ssZ, for example, 1970-01-01T00:00:00Z). ERDDAP tries to work with a wide range of variations of that ideal format; for example, "1970-1-1 0:0:0" is supported. If the time zone information is missing, it is assumed to be the Zulu time zone (AKA GMT). Even if another time zone is specified, ERDDAP never uses Daylight Saving Time.

            You can test ERDDAP's ability to deal with a specific units since baseTime with ERDDAP's
            Time Converter. Hopefully, you can plug in a number (the first time value from the data source?) and a units string, click on Convert, and ERDDAP will be able to convert it into an ISO 8601:2004(E) formatted date time string. It will return an error message if the units string isn't recognizable.

          • For time data variables with String data: an org.joda.time.format.DateTimeFormat string (which is mostly compatible with java.text.SimpleDateFormat) describing how to interpret string times (for example, the ISO8601TZ_FORMAT yyyy-MM-dd'T'HH:mm:ssZ).
            A Z (not the literal 'Z') at the end of the format string tells Java/Joda/ERDDAP to look for the character 'Z' (indicating the Zulu time zone with offset=0) or look for a time zone offset in the form +hh:mm, +hh, -hh:mm, or -hh. Examples of String dates in this format are
            2012-11-20T10:12:59-07:00
            2012-11-20T17:12:59Z
            2012-11-20T17:12:59
            all of which are equivalent times in ERDDAP because ERDDAP's default time zone (relevant for the last example) is Zulu.
            Other examples are
            2012-11-20T10:12 (missing seconds are assumed to be 0)
            2012-11-20T17 (missing minutes are assumed to be 0)
            2012-11-20 (missing hours are assumed to be 0)
            2012-11 (missing date is assumed to be 1)
            See Joda DateTimeFormat (external link) .
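To see that the first three example strings above name the same instant, here is a hypothetical check in Python 3.7+ (Python's strptime stands in for the Joda/Java parsing ERDDAP actually uses):

```python
from datetime import datetime, timezone

fmt = "%Y-%m-%dT%H:%M:%S%z"
with_offset = datetime.strptime("2012-11-20T10:12:59-07:00", fmt)
zulu = datetime.strptime("2012-11-20T17:12:59Z", fmt)
# No zone given: ERDDAP's default time zone is Zulu, so assume UTC.
no_zone = datetime.strptime("2012-11-20T17:12:59",
                            "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)

print(with_offset == zulu == no_zone)   # True
```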
          The main time data variable (for tabular datasets) and the main time axis variable (for gridded datasets) are recognized by the destinationName time and their units metadata (which must be suitable).

          Different Time Units in Different Gridded .nc Files - If you have a collection of gridded .nc files where, for the time variable, one subset of the files uses different time units than one or more other subsets of the files, you can use EDDGridFromNcFilesUnpacked. It converts time values to "seconds since 1970-01-01T00:00:00Z" at a lower level, thereby hiding the differences, so that you can make one dataset from the collection of heterogeneous files.

          TimeStamp Variables - Any other variable (axisVariable or dataVariable, in an EDDGrid or EDDTable dataset) can be a timeStamp variable. Timestamp variables are variables that have time-related units and time data, but have a <destinationName> other than time. TimeStamp variables behave like the main time variable in that they convert the source's time format into "seconds since 1970-01-01T00:00:00Z" and/or ISO 8601:2004(E) format. ERDDAP recognizes timeStamp variables by their time-related "units" metadata, which must match this regular expression "[a-zA-Z]+ +since +[0-9].+" (for numeric dateTimes, for example, "seconds since 1970-01-01T00:00:00Z") or be a dateTime format string containing "yy" or "YY" (for example, "yyyy-MM-dd'T'HH:mm:ssZ"). But please still use the destinationName "time" for the main dateTime variable.

          Always check your work to be sure that the time data that shows up in ERDDAP is the correct time data. Working with time data is always tricky and error prone.

          See more information about time variables.
          ERDDAP has a utility to Convert a Numeric Time to/from a String Time.
          See How ERDDAP Deals with Time.
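The recognition rule for numeric timeStamp units can be checked directly with the regular expression quoted above:

```python
import re

# The documented pattern for numeric dateTime units.
TIMESTAMP_UNITS = re.compile(r"[a-zA-Z]+ +since +[0-9].+")

print(bool(TIMESTAMP_UNITS.match("seconds since 1970-01-01T00:00:00Z")))  # True
print(bool(TIMESTAMP_UNITS.match("degree_C")))                            # False
```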

      • valid_range, or valid_min and valid_max - These are OPTIONAL variable attributes defined in the CF (external link) metadata conventions. For example,
        <att name="valid_range" type="floatList">0.0 40.0</att>
        or
        <att name="valid_min" type="float">0.0</att>
        <att name="valid_max" type="float">40.0</att>
        • If present, they should be of the same data type as the variable, and specify the valid minimum and maximum values of the data for that variable. Users should consider values outside this range to be invalid.
        • ERDDAP does not apply the valid_range. Said another way: ERDDAP does not convert data values outside the valid_range to the _FillValue or missing_value. ERDDAP just passes on this metadata and leaves the application up to you.
          Why? That's what this metadata is for. If the data provider had wanted to, the data provider could have converted the data values outside of the valid_range to be _FillValues. ERDDAP doesn't second guess the data provider. This approach is safer: if it is later shown that the valid_range was too narrow or otherwise incorrect, ERDDAP won't have obliterated the data.
        • If the data is packed with scale_factor and/or add_offset, valid_range, valid_min and valid_max should be the packed data type and values. Since ERDDAP applies scale_factor and add_offset when it loads the dataset, ERDDAP will unpack the valid_range, valid_min and valid_max values so that the destination metadata (shown to users) will indicate the unpacked data type and range.
          Or, if an unpacked_valid_range attribute is present, it will be renamed valid_range when ERDDAP loads the dataset.
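The unpacking of valid_range follows the same transformation as the data values; a minimal Python sketch (hypothetical helper, not ERDDAP code):

```python
def unpack_valid_range(packed_range, scale_factor=1.0, add_offset=0.0):
    # ERDDAP applies the same scale_factor/add_offset it uses for the data,
    # so the destination metadata shows the unpacked range.
    return [v * scale_factor + add_offset for v in packed_range]

print(unpack_valid_range([0, 400], scale_factor=0.1))   # about [0.0, 40.0]
```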

 

Contact

Questions, comments, suggestions? Please send an email to bob dot simons at noaa dot gov and include the ERDDAP URL directly related to your question or comment.

Or, you can join the ERDDAP Google Group / Mailing List by visiting https://groups.google.com/forum/#!forum/erddap (external link) and clicking on "Apply for membership". Once you are a member, you can post your question there or search to see if the question has already been asked and answered.
 


ERDDAP, Version 1.72