NOAA   ERDDAP
Easier access to scientific data

Brought to you by NOAA NMFS SWFSC ERD    
 

The EDDTableFromEML and EDDTableFromEMLBatch Options in GenerateDatasetsXml

[This web page will only be of interest to ERDDAP administrators who work with EML files.]

ERDDAP is a data server that gives users a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. ERDDAP works with a given dataset as either a group of multidimensional gridded variables (e.g., satellite or model data) or as a database-like table (with a column for each type of information and a row for each observation). ERDDAP is Free and Open Source Software, so anyone can download and install ERDDAP to serve their data.

To add a dataset to an ERDDAP installation, the ERDDAP administrator must add a chunk of XML describing the dataset to a file called datasets.xml. (There is thorough documentation for datasets.xml.) Although it is possible to generate the chunk of XML for datasets.xml entirely by hand, ERDDAP comes with a tool called GenerateDatasetsXml which can generate the rough draft of the chunk of XML needed for a given dataset based on some source of information about the dataset.

The first thing GenerateDatasetsXml asks is what type of dataset you want to create. GenerateDatasetsXml has a special option, EDDTableFromEML, which uses the information in an Ecological Metadata Language (EML) XML file to generate the chunk of XML for datasets.xml to create an EDDTableFromAsciiFiles dataset from each data table in an EML file. This works very well for most EML files, mostly because EML files do an excellent job of storing all of the needed metadata for a dataset in an easy-to-work-with format. The information that GenerateDatasetsXml needs to create the datasets is in the EML file, including the URL for the data file, which GenerateDatasetsXml downloads, parses, and compares to the description in the EML file. (Many groups would do well to switch to EML, which is a great system for documenting any tabular scientific dataset, not just ecological data. And many groups that create XML schemas would do well to use EML as a case study for XML schemas that are clear, to the point, not excessively deep (i.e., with too many levels), and easy for humans and computers to work with.)

Questions

Here are all the questions GenerateDatasetsXml will ask, with comments about how you should answer if you want to process just one EML file or a batch of EML files:
  • Which EDDType?
    If you want to process just one file, answer: EDDTableFromEML
    If you want to process a group of files, answer: EDDTableFromEMLBatch
  • Directory to store files?
    Enter the name of the directory that will be used to store downloaded EML and/or data files.
    If the directory doesn't exist, it will be created.
  • (For EDDTableFromEML only) EML URL or local fileName?
    Enter the URL or local file name of an EML file.
  • (For EDDTableFromEMLBatch only) EML dir (URL or local)?
    Enter the name of the directory with the EML files (a URL or a local dir).
    For example: http://sbc.lternet.edu/data/eml/files/
  • (For EDDTableFromEMLBatch only) Filename regex?
    Enter the regular expression which will be used to identify the desired EML files in the EML directory.
    For example: knb-lter-sbc\.\d+
  • Use local files if present (true|false)?
    Enter true to use the existing local EML files and data files, if they exist.
    Enter false to always re-download the EML files and/or data files.
  • accessibleTo?
    If you want the new datasets to be private datasets in ERDDAP,
    specify the name of the group(s) that will be allowed access.
    Recommended for LTER groups: combine "lter" plus the group, e.g., lterSbc .
    See accessibleTo.
  • localTimeZone (e.g., US/Pacific)?
    If a time variable indicates that it has local time values, this time zone will be assigned.
    This must be a value from the TZ column of the list of time zone names.
    Note all of the easy-to-use "US/..." names at the end of the list.
    If you later find that to be incorrect, you can change the time_zone in the chunk of datasets.xml.
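
As a rough illustration of how the "Filename regex" answer selects files, the following Python sketch applies the example pattern to a few made-up file names. The file names and the helper name select_eml_files are hypothetical, and the sketch assumes the regex must match the whole file name. GenerateDatasetsXml itself is written in Java, so test any complicated pattern there; for a simple pattern like this one the two regex dialects agree.

```python
import re

def select_eml_files(file_names, pattern):
    """Return the names that the regex matches in full (whole-name match is an assumption)."""
    regex = re.compile(pattern)
    return [name for name in file_names if regex.fullmatch(name)]

# Hypothetical directory listing, for illustration only.
names = ["knb-lter-sbc.10", "knb-lter-sbc.3017", "knb-lter-sbc.10.bak", "readme.txt"]
selected = select_eml_files(names, r"knb-lter-sbc\.\d+")
print(selected)
```

Note the escaped dot (\.) in the pattern: an unescaped dot would match any character, not just a literal period.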

EML plus ERDDAP is a great combination, since ERDDAP can give users more direct access to the wealth of Knowledge Network for Biocomplexity (KNB) and Long Term Ecological Research (LTER) data and help those projects meet the US government's Public Access to Research Results (PARR) requirements by making the data available via a web service. Also, EML plus ERDDAP seems like a great bridge between scientists in the academic / NSF-funded realm and scientists in the federal agency (NOAA, NASA, USGS) realm.

If you have questions, comments, suggestions, or need help, please send an email to bob dot simons at noaa dot gov .
 

Design Details

Here are the design details of the EDDTableFromEML option in GenerateDatasetsXml.
Some describe differences between how EML and ERDDAP do things and how the converter handles those differences.
Where one of these was a problem, it has now mostly been dealt with/solved.
  • One dataTable = One ERDDAP Dataset
    One EML file may have multiple <dataTable>s. ERDDAP makes one ERDDAP dataset per EML dataTable. The datasetID for the dataset is EMLName_t[tableNumber] (e.g., for SBC, where EMLName isn't just a number) or [system]_EMLName_t[tableNumber] (e.g., for NTL). For example, table #1 in the file knb-lter-sbc.28 becomes ERDDAP datasetID=knb_lter_sbc_28_t1.
     
  • EML vs CF+ACDD
    Almost all of the metadata in the EML files gets into ERDDAP, but in a different format. ERDDAP uses the CF and ACDD metadata standards. They are complementary metadata systems that use key=value pairs for global metadata and for each variable's metadata.
    Yes, the EML representation of the metadata is nicer than the CF+ACDD representation. I'm not suggesting using the CF+ACDD representation as a replacement for the EML. Please think of CF+ACDD as part of the bridge from the EML world to the OPeNDAP/CF/ACDD world.
     
  • Small Changes
    ERDDAP makes a lot of small changes. For example, ERDDAP uses the EML non-DOI alternateIdentifier plus a dataTable number as the ERDDAP datasetID, but slightly changes alternateIdentifier to make it a valid variable name in most computer languages, e.g., knb-lter-sbc.33 dataTable #1 becomes knb_lter_sbc_33_t1.
     
  • DocBook
    EML uses DocBook's markup system to provide structure to blocks of text. CF and ACDD assume plain text. So GenerateDatasetsXml converts the marked up text into plain text that looks like the formatted version of the text. The inline tags are sanitized with square brackets, e.g., [emphasized], and left in the plain text.
     
  • Data Files
    Since the EML dataTable includes the URL of the actual data file, GenerateDatasetsXml will:
    1. Download the data file.
    2. Store it in the same directory as the EML file.
    3. Read the data.
    4. Compare the description of the data in the EML with the actual data in the file.
    5. If GenerateDatasetsXml finds differences, it deals with them, or asks the operator if the differences are okay, or returns an error message. The details are in various items below.
       
  • .zip'd Data Files
    If the referenced data file is a .zip file, it must contain just one file. That file will be used for the ERDDAP dataset. If there is more than one file, ERDDAP will reject that dataset. If needed, this could be modified. (In practice, all SBC LTER zip files have just one data file.)
     
  • StorageType
    If a column's storageType isn't specified, ERDDAP uses its best guess based on the data in the data file. This works pretty well.
     
  • Units
    ERDDAP uses UDUNITS formatting for units. GenerateDatasetsXml is able to convert EML units to UDUNITS cleanly about 95% of the time. The remaining 5% results in a readable description of the units, e.g., "biomassDensityUnitPerAbundanceUnit" in EML becomes "biomass density unit per abundance unit" in ERDDAP. Technically this isn't allowed. I don't think it's so bad under the circumstances. [If necessary, units that can't be made UDUNITS compatible could be moved to the variable's comment attribute.]
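
The readable fallback described under Units can be approximated in a few lines. This Python sketch is not the converter's actual code (which is Java, and which first tries a proper UDUNITS conversion); it only illustrates the idea of splitting a camelCase EML unit name into lowercase words. The function name readable_units is hypothetical.

```python
import re

def readable_units(eml_unit):
    """Split a camelCase EML unit name into space-separated lowercase words."""
    # Insert a space at every boundary between a lowercase letter or digit
    # and the uppercase letter that follows it, then lowercase everything.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", eml_unit)
    return spaced.lower()

print(readable_units("biomassDensityUnitPerAbundanceUnit"))
```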
     
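
The datasetID convention described under "One dataTable = One ERDDAP Dataset" and "Small Changes" above can be sketched as follows. This is a minimal Python illustration, assuming that every character other than a letter, digit, or underscore is replaced with an underscore; the real converter may apply additional rules (for example, the [system]_ prefix used for NTL is not handled here). The function name make_dataset_id is hypothetical.

```python
import re

def make_dataset_id(eml_name, table_number):
    """Build an ERDDAP-style datasetID from an EML name and a dataTable number."""
    # Assumption: any character that is not a letter, digit, or underscore
    # becomes an underscore, so the result is a valid variable name.
    safe_name = re.sub(r"[^A-Za-z0-9_]", "_", eml_name)
    return f"{safe_name}_t{table_number}"

print(make_dataset_id("knb-lter-sbc.28", 1))
```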

Issues with the EML Files

Here are some issues/problems with the EML files that cause problems when a software client (such as the EDDTableFromEML option in GenerateDatasetsXml) tries to interpret/process the EML files.
  • Although there are several issues listed here, they are mostly small, solvable problems. In general, EML is a great system and it has been my pleasure to work with it.
  • These are roughly sorted from worst/most common to least bad/least common.
  • Most are related to small problems in specific EML files (which are not EML's fault).
  • Most can be fixed by simple changes to the EML file or data file.
  • Given that LTER people are building an EML checker to test the validity of EML files, I have added some suggestions below regarding features that could be added to the checker.
Here are the issues:
  • Separate Date and Time Columns
    Some data files have separate columns for date and for time, but no date+time column. Since it is an important part of ERDDAP that the time column display the unified date+time, ERDDAP currently rejects these datasets.
    A solution is to make a new column in the datafile (and describe it in the EML) where the date and time columns are merged into one column.
     
  • Inconsistent Column Names
    The EML files list the data file's columns and their names. Unfortunately, they are often different from the column names in the actual data file. Normally, the column order in the EML file is the same as the column order in the data file, even if the names vary slightly, but not always. GenerateDatasetsXml tries to match the column names. When it can't (which is common), it will stop, show you the EML/data file name pairs, and ask if they are correctly aligned. If you enter 's' to skip a table, GenerateDatasetsXml will print an error message and go on to the next table.
    The solution is to change the column names in the EML file to match the column names in the data file.
     
  • Different Column Order
    There are several cases where the EML specifies the columns in a different order than they exist in the data file. GenerateDatasetsXml will stop and ask the operator if the matchups are okay or if the dataset should be skipped. If it is skipped, there will be an error message in the results file, e.g.:
      <-- SKIPPED (USUALLY BECAUSE THE COLUMN NAMES IN THE DATAFILE ARE IN
      A DIFFERENT ORDER OR HAVE DIFFERENT UNITS THAN IN THE EML file):
      datasetID=knb_lter_sbc_17_t1
      dataFile=all_fish_all_years_20140903.csv
      The data file and EML file have different column names.
      ERDDAP would like to equate these pairs of names:
        SURVEY_TIMING        = notes
        NOTES                = survey_timing
      -->
    The solution is to fix the column order in these EML files so that they match the order in the data files.

    It would be nice if the EML checker checked that the columns and column order in the source file match the columns and column order in the EML file.

  • Incorrect numHeaderLines
    Several dataTables incorrectly state numHeaderLines=1, e.g., ...sbc.4011. This causes ERDDAP to read the first line of data as the column names. I tried to manually SKIP all of these dataTables. They are obvious because the unmatched source col names are all data values. And if there are files that incorrectly have numHeaderLines=0, my system doesn't make it obvious. EXAMPLE in SBC LTER failures file:
      <-- SKIPPED (USUALLY BECAUSE THE COLUMN NAMES IN THE DATAFILE ARE IN
      A DIFFERENT ORDER OR HAVE DIFFERENT UNITS THAN IN THE EML file):
      datasetID=knb_lter_sbc_3017_t1
      dataFile=MC06_allyears_2012-03-03.txt
      The data file and EML file have different column names.
      ERDDAP would like to equate these pairs of names:
        2008-10-01T00:00     = timestamp_local
        2008-10-01T07:00     = timestamp_UTC
        2.27                 = discharge_lps
        -999.0               = water_temperature_celsius
      -->

    It would be nice if the EML checker checked the numHeaderLines value.

  • numHeaderLines = 0
    Some source files don't have column names. ERDDAP accepts that if the EML describes the same number of columns.

    In my opinion: this seems very dangerous. There could be columns in a different order or with different units (see below) and there is no way to catch those problems.

  • DateTime Format Strings
    EML has a standard way to describe date time formats, but there is considerable variation in its use in EML files. (I was previously wrong about this. I see the EML documentation for formatString which appears to match the Java DateTimeFormatter specification, but which lacks the important guidelines about its use, with the result that formatString is often/usually improperly used.) There are several instances with incorrect case, and/or incorrect duplication of a letter, and/or non-standard formatting. That puts the burden on clients, especially software clients. GenerateDatasetsXml tries to convert the incorrectly defined formats in the EML files into the Java/Joda time format that ERDDAP requires.

    It would be nice if the EML checker required strict adherence to this specification and verified that date time values in the data table could be parsed correctly with the specified format.

  • DateTime But No Time Zone
    GenerateDatasetsXml looks for a column with dateTime and a specified time zone (either Zulu: from time units ending in 'Z' or a column name or attribute definition that includes "gmt" or "utc", or local: from "local" in the column name or attribute definition). Also acceptable is a file with a date column but no time column. Also acceptable is a file with no date or time information.

    GenerateDatasetsXml treats all "local" times as being from the time zone which you can specify for a given batch of files, e.g., for SBC LTER, use US/Pacific. The actual time zone is sometimes given in the comments, but not in a form that is easy for a computer program to figure out.

    Files that don't meet these criteria are rejected with the message "NO GOOD DATE(TIME) VARIABLE". Common problems are:

    • There is a column with dates and a column with times, but no dateTime column.
    • There are time units, but the time zone isn't specified.

    Other comments:
    If there is a good date+time column with a time zone, that column will be named "time" in ERDDAP. ERDDAP requires that time column data be understandable/convertible to Zulu/UTC/GMT time zone dateTimes. [My belief is: using local times and different date/time formats (2-digit years! mm/dd/yy vs dd/mm/yy vs ...) in data files forces the end user to do complicated conversions to Zulu time in order to compare data from one dataset with data from another.] So ERDDAP standardizes all time data: For string times, ERDDAP always uses the ISO 8601:2004(E) standard format, for example, 1985-01-02T00:00:00Z. For numeric times, ERDDAP always uses "seconds since 1970-01-01T00:00:00Z". ERDDAP always uses the Zulu (UTC, GMT) time zone to remove the difficulties of working with different time zones and standard time vs. daylight saving time. So GenerateDatasetsXml seeks an EML dataTable column with date+time Zulu. This is hard because EML doesn't use a formal vocabulary/system (like the Java/Joda time format) for specifying the dateTime format:
    If there is a col with numeric time values (e.g., Matlab times) and Zulu timezone (or just dates, with no time columns), it is used as "time".
    If there is a col with date and time data, using the Zulu time zone, it is used as "time" and any other date or time column is removed.
    Else if a col with just date information is found, it is used as the "time" variable (with no time zone).
    If there is a date column and a time column and no combined dateTime column, the dataset is REJECTED -- but the dataset could be made usable by adding a combined dateTime column (preferably, Zulu time zone) to the datafile and adding its description in the EML file.
    EXAMPLE from SBC LTER: http://sbc.lternet.edu/data/eml/files/knb-lter-sbc.10 dataTable #2.

    It would be nice if EML/LTER required the inclusion of a column with Zulu (UTC, GMT) time zone times in all relevant source data files. Next best is to add a system to EML to specify a time_zone attribute using standard names (from the TZ column).

  • Missing missing_value
    Some columns use a missing_value but don't list it in the EML metadata, e.g., precipitation_mm in knb-lter-sbc.5011 uses -999. If no missing value is specified in the EML, GenerateDatasetsXml automatically searches for common missing values (e.g., 99, -99, 999, -999, 9999, -9999, etc.) and creates that metadata. But other missing missing_values are not caught.

    It would be nice if the EML checker looked for missing missing_values.

  • Small Problems
    There are a lot of small problems (spelling, punctuation) which will probably only be found by a human inspecting each dataset.

    It would be nice if the EML checker looked for spelling and grammatical errors. This is a difficult problem because words in science are often flagged by spell checkers. Human editing is probably needed.

  • Invalid Unicode Characters
    Some of the EML content contains invalid Unicode characters. These are probably characters from the Windows charset that were incorrectly copied and pasted into the UTF-8 EML files. GenerateDatasetsXml sanitizes these characters to e.g., [#128], so they are easy to search for in the ERDDAP datasets.xml file.

    It would be nice if the EML checker checked for this. It is easy to find and easy to fix.

  • Different Column Units
    Some EML dataTables define columns that are inconsistent with the columns in the data file, notably because they have different units. GenerateDatasetsXml flags these. It is up to the operator to decide if the differences are okay or not. These appear in the failures file as "SKIPPED" dataTables. EXAMPLE in SBC LTER failures file:
      <-- SKIPPED (USUALLY BECAUSE THE COLUMN NAMES IN THE DATAFILE ARE IN
      A DIFFERENT ORDER OR HAVE DIFFERENT UNITS THAN IN THE EML file):
      datasetID=knb_lter_sbc_3_t1
      dataFile=SBCFC_Precip_Daily_active_logger.csv
      The data file and EML file have different column names.
      ERDDAP would like to equate these pairs of names:
        Daily_Precipitation_Total_mm = Daily_Precipitation_Total_inch
        Flag_Daily_Precipitation_Total_mm = Flag_Daily_Precipitation_Total_inch
      -->

    It would be nice if the EML checker checked that the units match. Unfortunately, this is probably impossible to catch and then impossible to resolve without contacting the dataset creator, given that the source file doesn't include units. The discrepancy for the example above was only noticeable because the units were included in the source column name and the EML column name. How many other dataTables have this problem but are undetectable?

  • Different Versions of EML
    GenerateDatasetsXml is designed to work with EML 2.1.1. Other versions of EML will work to the extent that they match 2.1.1 or that GenerateDatasetsXml has special code to deal with them. This is a rare problem. When it occurs, the solution is to convert your files to EML 2.1.1, or send the EML file to bob.simons at noaa.gov, so I can make changes to GenerateDatasetsXml to deal with the differences.
     
  • Trouble Parsing the Data File
    Rarely, a dataTable may be rejected with the error "unexpected number of items on line #120 (observed=52, expected=50)". An error message like this means that a line in the datafile had a different number of values than the other lines. It may be a problem in ERDDAP (e.g., not parsing the file correctly) or in the file. EXAMPLE from SBC LTER: http://sbc.lternet.edu/data/eml/files/knb-lter-sbc.10 dataTable #3, see datafile=LTER_monthly_bottledata_registered_stations_20140429.txt
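
Before running GenerateDatasetsXml, a data provider could look for the kind of problem described under "Trouble Parsing the Data File" with a small check like the one below. This is a hedged Python sketch, not ERDDAP's parser: it assumes a delimited text file, treats the first row's item count as the expected count, and uses made-up sample data. The function name find_bad_lines is hypothetical.

```python
import csv
import io

def find_bad_lines(text, delimiter=","):
    """Return (row_number, observed, expected) for each row whose item count
    differs from the first row's item count."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    expected = None
    problems = []
    for row_number, row in enumerate(rows, start=1):
        if expected is None:
            expected = len(row)  # the first row sets the expected count
        elif len(row) != expected:
            problems.append((row_number, len(row), expected))
    return problems

# Made-up sample: the third row has one item too many.
sample = "station,date,temp\nA,2014-04-29,12.1\nB,2014-04-30,12.4,extra\n"
print(find_bad_lines(sample))
```

Row numbers equal physical line numbers only when no quoted field contains a line break.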
     
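
The fix suggested under "Separate Date and Time Columns" and "DateTime But No Time Zone" (adding a combined dateTime column, preferably in the Zulu time zone) might look like this for a single row. This is a Python sketch under stated assumptions: the date and time strings use the formats yyyy-MM-dd and HH:mm, the local zone is known, and the function name to_zulu is hypothetical. America/Los_Angeles is the canonical tz database name for the zone also listed as US/Pacific.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

def to_zulu(date_text, time_text, tz_name):
    """Merge separate local date and time strings into one ISO 8601 Zulu string."""
    # Assumed input formats: "2008-10-01" and "00:00".
    local = datetime.strptime(f"{date_text} {time_text}", "%Y-%m-%d %H:%M")
    local = local.replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

print(to_zulu("2008-10-01", "00:00", "America/Los_Angeles"))
```

The sample values mirror the timestamp_local / timestamp_UTC pair shown in the numHeaderLines example above (local midnight on 2008-10-01 is 07:00 UTC, because daylight saving time was in effect).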

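
The symptom described under "Incorrect numHeaderLines" (the supposed column names are all data values) lends itself to a simple automated check, which an EML checker could apply. The following Python sketch is only a heuristic of my own construction, not something ERDDAP or the EML checker is known to do: it flags a header row when every name parses as a number or looks like an ISO 8601 date or dateTime. The function names are hypothetical.

```python
import re

# Matches values such as 2008-10-01, 2008-10-01T00:00, or 2008-10-01T00:00:00.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2})?)?")

def looks_like_data(value):
    """Return True if the value parses as a number or an ISO-style date(time)."""
    value = value.strip()
    if ISO_DATE.fullmatch(value):
        return True
    try:
        float(value)
        return True
    except ValueError:
        return False

def header_is_suspect(column_names):
    """Return True if every supposed column name looks like a data value."""
    return bool(column_names) and all(looks_like_data(name) for name in column_names)

print(header_is_suspect(["2008-10-01T00:00", "2008-10-01T07:00", "2.27", "-999.0"]))
```

Requiring every name to look like data keeps false alarms low, but it would miss a file whose first data row contains text values; a checker might prefer a looser threshold.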
 

Contact

Questions, comments, suggestions? Please send an email to bob dot simons at noaa dot gov .
 

ERDDAP, Version 1.78