Bob's Guidelines for Data Distribution Systems
[The opinions expressed in this document are Bob Simons' personal opinions (email: bob dot simons at noaa dot gov) and do not necessarily reflect any position of my coworkers, bosses, ERD, the National Oceanic and Atmospheric
Administration, or the U.S. Government.]
The NOAA NMFS SWFSC
Research Division (ERD)
and the NOAA NESDIS
CoastWatch West Coast Regional Node
are working on ways to improve how scientific data is distributed.
We have evolved a set of "best practice" guidelines related to data distribution systems that
we have found to be useful.
These guidelines also help us evaluate systems. For example, if a system isn't
fast, that's a problem.
These are guidelines, not hard and fast rules. If you want to do something else, that's your choice.
We didn't invent any of these guidelines. We looked at many guidelines and kept the ones that made sense and worked for us.
There are probably lots of other useful guidelines, but this is our current list:
Comments? Questions? Suggestions?
See also the Disclaimer.
Last modified: 2011-09-15
We are trying to find best practices for distributing scientific data.
At ERD, we work with oceanographic data from buoys, cruises, tags, satellites, computer models, etc.
In general, we want to distribute the data via web pages
and web services.
While the best practices here may substantially overlap with the best practices for other problems,
it is important to separate them. If you are trying to solve a different problem,
these best practices may or may not apply.
Since we are thinking about how to distribute data, it is easy to forget (or minimize)
the user's point of view: "How do I get the data into my favorite application?".
That would be a serious mistake.
In the end, if a system doesn't meet the user's needs,
doesn't fit into the user's work flow, or isn't easy to use,
the user won't use it and all our work will be for naught.
The user experience must be a primary focus.
Features of the data and requests to consider:
- Some of the data is relatively static. Some of it is changing frequently (near-real-time).
- Some of the data is publicly available. Some of it has restricted access.
- Some of the data is stored locally in various formats.
Some of it is stored remotely and accessible via a web page or web service.
Unfortunately, some of it is privately held and not accessible.
- Some of the datasets are very small (e.g., a table with a few columns and rows).
Some are very large (e.g., satellite and model datasets spanning years, with >1 TB of data).
- A user may make one or many small data requests (e.g., one value from one sensor) or
one or many large data requests (e.g., >1 GB chunks of satellite or model data).
So when you are designing, developing, and using your system, think from the user's point of view.
(For more info, see
Software for Use
by Larry L. Constantine and Lucy A.D. Lockwood, especially the Five Rules of Usability and Six Principles of Usability in Appendix B, and
Ten Usability Heuristics
by Jakob Nielsen.) Ask yourself:
- Is the system easy for the user to install?
Better yet, can you design a system where the users don't have to install anything?
No-installation, server-based solutions have the additional advantage that users
always use the latest version, starting the instant you release it.
- Is the system reasonably easy for a new user to get started with?
- Does the system have helpful information to help the user learn more?
- Does the system allow advanced users to work efficiently (and not get in their way)?
- Does the system make common tasks easy to do and less common tasks at least possible (ideally easy, too)?
- Does the system minimize the effort and thought needed to use it?
Some user interface designers advocate counting the clicks needed to complete each task.
If you keep the user's point of view in mind when writing data distribution software,
it will be easier to make your software fast.
Speed and efficiency will be a requirement, not an afterthought.
Clients are diverse: scientists working in their discipline,
scientists in related disciplines, decision makers, teachers, students, fishermen,
surfers, other members of the public, etc.
And clients within a given group are diverse: one scientist may want
to bring data into Matlab for analysis, another into R (a statistics program),
another into ArcGIS, and another into a custom FORTRAN modelling program, etc.
Some will be interested in forecasts, some in the latest data,
and others in long historic time series.
Some will be interested in just their local area; others in the whole world.
We need to design systems with these diverse needs in mind.
And we need to accept that new clients and new needs will appear
as time goes by. The system needs to be able to evolve.
In fact, we probably won't be able to predict all of our clients' needs.
So it is probably best to make our systems (at least some of them) as
flexible as possible.
Plug-ins and libraries (client-side solutions) can be great.
They work at the point-of-need.
Sometimes they are the best or only solution.
However, they have disadvantages:
- You may have to make different versions for different operating systems
or for different versions of the software they plug into.
- Some users are reluctant to install them.
- Some users don't have permission to install any software.
- Some users will have difficulty installing them.
- Plug-ins may break when the application is updated.
- If there is a new version of the plug-in (e.g., with a bug fix),
it is impossible to get all users to update.
You probably don't even know who they are.
We prefer server-side solutions: web applications/services running on servers.
They don't need a new plug-in or library.
They don't have these problems.
When there is a new version, one administrator updates one program on one server
and the changes apply to all users, immediately.
For developers and users, that is a wonderful feature.
All of your potential clients matter, not just the ones using a certain browser.
When making a web page/ web application, it can be tempting to take advantage
of the great new features offered by a certain browser or by the latest version of HTML.
Don't do it. Web pages should work on all reasonably modern browsers.
It isn't that hard to do: just use the features supported by all reasonably modern browsers.
If a feature isn't universally supported, don't use it.
Yes, sometimes you have to have code to do things one way for some browsers
and another way for other browsers. But try to minimize this.
You will be glad you did:
- You won't have to write those annoying warnings (e.g.,
"This page only works with Google Chrome 13.034 or above").
- Your tech support efforts will be minimized.
- The page won't require changes every time a browser is updated.
- Part of the beauty of the web is its universality.
And more important, your users will thank you (or at least they'll have one
less thing to complain about).
When thinking about clients' needs,
there is an important distinction between humans working with a web browser
and automated computer programs.
The qualities of a system interface that a human wants
when searching for and requesting data via a web browser
are usually different from the qualities of a system interface that
a computer program needs when it is gathering and processing data.
Humans using web browsers like nice web pages. Humans can read and understand information
and make decisions. Humans can read directions and follow them (although they might not like to).
Humans like a rich browsing experience: for example, interesting images and sophisticated user interface widgets.
Compared to humans using web browsers, computer programs don't care about a rich browsing experience
and they often can't interact with sophisticated user interface widgets.
They don't understand directions or other information on a web page,
but they are really good at following pre-defined instructions to do simple, well-defined tasks.
Since it is easier to write computer programs when the
task to be done is simple,
it is easier to write computer programs to access data from very well-defined,
very well structured, and very simple web services.
Computer programs need an easy way to
find out what data is available (in some standardized format),
generate requests for subsets of data, get the response,
easily parse the response, and process the response.
Note that a system can offer web pages (for humans) and
web services (for computers). For example, most
DAP servers offer a
Data Access Form, which is a web page
for every dataset.
When the user fills out the form, the web page reformats
the request into a URL and sends it to the server (the web service), which returns the
requested data (usually in a human-readable format).
But a computer program or script could generate those URLs
and just use the web service parts of the system.
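A script that uses the web-service side of such a system directly just builds the same kind of URL that a Data Access Form would generate. A minimal sketch (the server address, dataset ID, and variable names below are hypothetical):

```python
# Sketch: build a DAP-style subset-request URL, the way a script would,
# instead of filling out a Data Access Form by hand.
# The server address, dataset ID, and variable names are hypothetical.

def subset_url(base, dataset, file_type, variables, constraints):
    """Build a DAP-style request URL for a subset of a tabular dataset."""
    query = ",".join(variables) + "".join(constraints)
    return f"{base}/{dataset}.{file_type}?{query}"

url = subset_url(
    "https://example.com/erddap/tabledap",   # hypothetical server
    "buoy_data",                             # hypothetical dataset ID
    "csv",                                   # ask for a machine-friendly format
    ["time", "sea_surface_temperature"],
    ["&time>=2011-09-01", "&time<2011-09-02"],
)
# A program would then fetch this URL (e.g., with urllib) and parse the
# CSV response; a person could paste the same URL into a browser.
print(url)
```

The same URL serves both audiences: a human can inspect it in a browser, and a script can fetch and parse it.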
One nice feature of many data servers that offer web services is that they have a way
to get data from similar remote web services.
For example, a THREDDS server can serve datasets from another THREDDS server.
And ERDDAP can get data from several types of remote web services (for example, DAP, SOS, OBIS, and other ERDDAPs).
(See our plan for federations of ERDDAPs and other data servers.)
Such federations of data servers make it easy for one site to serve data from multiple remote servers,
as if all of the data were on the aggregating server.
Users can get data from lots of datasets just by going to that aggregating server.
Federations can also disseminate data via push and pull technologies.
If a dataset is only available via web pages, not web services,
such federations are not possible.
Users have to go to one web page to get data from one dataset
and another web page (perhaps at another web site) to get data from another dataset, etc.
It is extremely useful if every dataset is at least available via a web service,
because other web pages and web services
can be built on top of the original web service.
For example, this makes
interoperable data servers like ERDDAP possible:
ERDDAP can get data from several types of remote web services (for example, DAP, SOS, OBIS, and other ERDDAPs)
and make it available to users via other protocols (for example,
the user can make a DAP request to get data from a SOS server).
Or, anyone can create a new web page (using data from a remote web service)
to simplify a given task for a group of users.
The web service may be one-size-fits all, but the web services and web pages
that are built on top can be highly customized.
An example of this is Cara Wilson's
BloomWatch web page. The page has HTML image tags that refer to ERDDAP URLs that
request an image to be generated from the latest data available
for a specific dataset, for a specific geographic region. Thus, the images aren't static images.
Whenever a user visits the BloomWatch page, the user sees the latest images,
generated automatically by ERDDAP.
Best of all, the authors of the web pages and web services that are built on top
of the original web service don't need to coordinate with the administrators of the original web service.
- The original web service makes the data available.
- The new web pages or web services use the data.
- They communicate through the original web service's interface, thus they are
Loosely Coupled Systems.
- This allows new products, highly customized to users' needs,
to be made in ways that the original web service's designers might never have imagined.
- This decentralized approach to collaboration is a good way to deal with the fact that
Clients Have Diverse Needs.
- It allows anyone to participate in the data distribution process,
leads to solutions that are tailored to the user's needs, and minimizes the
need for centralized coordination.
Some data servers distribute data that is stored locally.
Other data servers (e.g., THREDDS, LAS, and ERDDAP) can distribute data that is stored locally or remotely.
A potential problem with remote data is that access to it may be slow.
The solution is to make a local copy of the data.
Some data servers (e.g., ERDDAP) have the ability to actively get all of the
available data from a remote data source and maintain
a local copy of the data.
There are also systems designed to actively push data to remote servers
as soon as new data is available
(e.g., Unidata's Internet
Data Distribution System).
Similarly, but not as full-featured, ERDDAP's EDDGridFromErddap uses ERDDAP's subscription system and flag system so that it will be notified immediately when new data is available.
- By combining ERDDAP's EDDGridCopy and EDDGridFromErddap, data can be efficiently pushed between ERDDAPs.
- Push technologies disseminate data very quickly (within seconds).
- Web services
make data dissemination via push and pull technologies possible.
- This architecture puts each ERDDAP administrator in charge of determining where the data
for his/her ERDDAP comes from.
- Other ERDDAP administrators can do the same. There is no need for central coordination.
- If many ERDDAP administrators link to each other's ERDDAPs, a data distribution network forms.
- Data will be quickly, efficiently, and automatically disseminated from data sources
(ERDDAPs and other servers)
to data re-distribution sites (ERDDAPs) anywhere in the network.
- A given ERDDAP can be both a source of data for some datasets and a re-distribution site
for other datasets.
- The resulting network is roughly similar to data distribution networks set up with
Unidata's IDD/IDM, but less rigidly structured.
In the most common and simple sense, the web works when a computer
program (such as a web browser) sends a request to a specific URL
(the address of some resource on the web) and gets a response (for example, a web page
or a file).
This is the essence of a RESTful web service.
Because this approach is so fundamental to the way the web works,
almost all types of computer programs
and computer languages have the ability to contact a URL and get the response.
As a result, it is very convenient (though not essential) if
a data distribution system uses URLs to specify the data being requested.
URLs are great.
- The URL can completely specify what is being requested: which subset of which dataset,
and the file type for the response.
- Humans are comfortable working with URLs.
- Computer programs (web browsers, Matlab, R, Excel, ArcGIS, shell scripts, etc.)
and computer languages (C, Java, Python, etc.) are good at getting data via URLs.
- You can email URLs to co-workers, bookmark URLs, write URLs in your notes, etc.
That's a lot easier than dealing with directions like: click on this, then click on that,
then down at the bottom click on ...
- Web pages can embed image URLs in <img> tags.
- Web services that use URLs for data requests make it easy to
build other web pages or web services on top of them.
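One practical detail when building such URLs: characters like ':' in ISO 8601 timestamps should be percent-encoded so the URL stays valid wherever it travels. A small sketch (the server and dataset names are hypothetical):

```python
# Sketch: a URL that completely specifies a request can be bookmarked,
# emailed, or embedded in a web page. The server/dataset are hypothetical.
from urllib.parse import quote

base = "https://example.com/erddap/griddap/sst_dataset"  # hypothetical

# Percent-encode the ':' characters in the ISO 8601 timestamp.
start = quote("2011-09-01T00:00:00Z")
png_url = f"{base}.png?sst[({start})]"

# Embedded in an <img> tag, the URL makes a web page that always shows
# a freshly generated image:
img_tag = f'<img src="{png_url}" alt="sea surface temperature">'
print(img_tag)
```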
What about XML?!
Some people advocate XML-based systems, which require that the requests and responses be formatted as XML.
The idea is that XML could be the lingua franca for Internet communications, particularly
between computer programs.
While we believe that there are some good uses for XML (for example, configuration files),
we are not convinced that XML has been an unqualified success for requesting and receiving data.
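As a rough illustration of the verbosity concern, here is the same small table encoded as XML and as CSV (the element names and values are made up):

```python
# Toy comparison: three observations encoded as XML and as CSV.
# The element names and values are made up for illustration.
xml = (
    "<observations>"
    "<obs><time>2011-09-15T00:00:00Z</time><sst>14.2</sst></obs>"
    "<obs><time>2011-09-15T01:00:00Z</time><sst>14.3</sst></obs>"
    "<obs><time>2011-09-15T02:00:00Z</time><sst>14.1</sst></obs>"
    "</observations>"
)
csv = (
    "time,sst\n"
    "2011-09-15T00:00:00Z,14.2\n"
    "2011-09-15T01:00:00Z,14.3\n"
    "2011-09-15T02:00:00Z,14.1\n"
)
# The XML version is already more than twice the size, and the gap grows
# with the number of rows, since the tags are repeated for every value.
print(len(xml), len(csv))
```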
The big potential problems are:
- XML-based systems are often very verbose, so transmission is slow when a lot of data is sent.
- Because of the difficulty in forming requests and parsing responses,
XML-based systems rely on custom computer programs to send and receive information,
so they usually require that custom software be installed on the client's computer.
- Because of the difficulty in forming requests and parsing responses,
XML is usually only a solution for Web Services,
not Web Pages.
So XML-based systems generally aren't useful for handling the end-user's-needs part of
a data distribution system.
- XML-based systems seem to quickly become very complex.
It would be okay if they just used a few schemas, but systems often grow to use dozens
or even hundreds of schemas.
The resulting systems may be precisely defined (or not!), but writing a client
to convert the XML to some other format can become very difficult.
See Keep It Simple for reasons to avoid complex systems.
See L. Richardson and Sam Ruby's
RESTful Web Services
(the web page and the book) for a discussion of RESTful vs. XML-based services.
In keeping with the guideline Focus on the User's Point of View,
it is important to make it easy for users to get started with and use your system.
There is a great book on designing web sites called
Don't Make Me Think!
by Steve Krug that makes the point that both human clients and computer clients appreciate
easy-to-use web pages and web services.
For users, if it isn't simple, they won't use it (or they'll grumble if they have to use it).
It is also best if the system architecture is simple.
Simple systems are easier to design, build, use, and maintain.
They are less prone to bugs. They are more efficient.
That is why KISS
(Keep It Simple, Stupid) is a venerable design principle.
Over time, systems seem to become complex all by themselves,
so you need to make a constant effort to keep them simple.
Anders Hejlsberg, the author of Turbo Pascal and Delphi, and now the lead architect of
Microsoft's C#, said,
"I think simplicity is always a winner. If you can find a simpler solution to
something -- that has certainly for me been a guiding principle. Always try to make it simpler."
Speed matters, from a user standpoint and from a cost standpoint.
- If the system is slow (for example, takes more than about 8 seconds to respond),
people will think it isn't working and give up.
- If the system is slow, then a large number of requests
(notably from a computer program or another web service)
will overwhelm your service and you'll have to buy additional, faster computers
to meet the demand. That's expensive.
- If the system is even a little slow, people won't enjoy working with it.
- If they don't enjoy working with it, they are less likely to use it.
Don't measure the system's speed just as the CPU time to process the request.
Start timing when the user sends the request;
consider the time until the user first starts to see the response;
stop timing when the user has received and parsed the entire response.
For XML-based systems in particular, the transmission and parsing times
may be long if lots of data is being transmitted.
Your system might be a success. People might actually use it.
Do capacity planning ahead of time
so you have a plan for expanding the system to meet the needs of more and more users.
(See our plan for scaling up ERDDAP
to meet the needs of a large number of users.)
For general information on designing scalable, high capacity, fault-tolerant systems,
see Michael T. Nygard's book Release It!.
Murphy's Law ("Anything that can go wrong, will go wrong") applies to data distribution systems.
Remote data sources will fail in all sorts of ways.
Local servers and disk drives (even RAIDs) will fail.
Networks will slow to a crawl, behave erratically, and fail.
If a system is designed to properly handle 99.999% of
all situations, you can be sure that the remaining 0.001% of situations will also occur.
If a certain type of error can only happen if two or more unrelated, unlikely events occur
simultaneously (leading you to believe the error will never occur in reality), be assured
that the error is more likely to occur than you think.
We may wish to design a system that tries to overcome these failures (retry!).
But that just leads to a system where one failure causes
a cascade of other failures, as uncompleted tasks pile up in the system. Instead, it is
better to accept that some datasets are currently unavailable and that some requests can't be
fulfilled right now. Systems (e.g., ERDDAP) may have extensive error checking and error handling
routines, but the system will still fail in ways that the designers hadn't even thought of.
History is full of examples.
See The Black Swan
for a more thorough discussion.
Since things are going to fail, check your work.
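One defensive pattern consistent with this fail-fast advice is to give every remote call a small, fixed retry budget and then give up cleanly, rather than retrying forever and letting unfinished work pile up. A minimal sketch (the fetch function is a stand-in for any remote request):

```python
# Sketch: bounded retries with a clean failure, instead of endless retrying.
# 'fetch' is a stand-in for any remote call (e.g., requesting remote data).
import time

def fetch_with_budget(fetch, attempts=3, delay=2.0):
    """Try a remote call a few times, then give up cleanly.

    Failing fast keeps one bad data source from backing up the whole system:
    the caller can report "dataset currently unavailable" and move on.
    """
    last_error = None
    for i in range(attempts):
        try:
            return fetch()
        except Exception as e:          # in practice, catch narrower errors
            last_error = e
            if i < attempts - 1:
                time.sleep(delay)       # brief pause, not an endless loop
    raise RuntimeError("data source currently unavailable") from last_error
```

The caller treats the final RuntimeError as "this dataset is unavailable right now," rather than queuing the request for later.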
If you are developing software, use automated unit and system tests internally during development.
Automated tests can catch most (but not all) bugs before the software is released.
If you are the administrator of a data distribution system, test your system.
Test every dataset at least once.
Is all of the metadata appearing correctly?
Can you send a request and get the correct response reasonably quickly?
If you have a server-based data distribution system,
it is great that when clients access your system they always use the latest version of the software.
But the downside is that if the server is down, nobody can use your system.
Systems will fail in all sorts of ways, expected and unexpected.
So it makes sense to have a system that checks if your system is functioning correctly.
At ERD, we wrote and use
The Network Resource Checker, which is freely available, but alternatives are available.
Every few minutes NetCheck gets the responses from URLs for all of our services
and looks for specific text strings in the responses (for example, the names of all of the datasets).
If a response is too slow or the required text strings aren't present, NetCheck
sends an email to people who have subscribed to that test.
Our systems may go down sometimes (don't they all?), but at least we find out about it quickly
so that we can react quickly.
The NetCheck tests are also very useful when we make changes to the system,
to make sure we didn't break anything.
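A NetCheck-style test can be sketched in a few lines; the URL, required strings, and the alerting step here are placeholders, not NetCheck's actual implementation:

```python
# Sketch of a NetCheck-style monitoring test: fetch a page, time it, and
# verify that required text (e.g., dataset names) is present.
import time
from urllib.request import urlopen

def find_problems(text, elapsed, required_strings, max_seconds=30):
    """Return a list of problems found in one response (empty = test passed)."""
    problems = []
    if elapsed > max_seconds:
        problems.append(f"too slow: {elapsed:.1f} s")
    for s in required_strings:
        if s not in text:
            problems.append(f"missing required text: {s!r}")
    return problems

def check_page(url, required_strings, max_seconds=30):
    """Fetch a URL, time it, and report problems (or the failure itself)."""
    start = time.time()
    try:
        text = urlopen(url, timeout=max_seconds).read().decode("utf-8", "replace")
    except Exception as e:
        return [f"request failed: {e}"]
    return find_problems(text, time.time() - start, required_strings, max_seconds)

# A real monitor would loop over many such tests every few minutes and
# email subscribers when one fails, e.g.:
# problems = check_page("https://example.com/erddap/index.html",
#                       ["buoy_data", "sst_dataset"])
```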
Since we know that users' needs are diverse and changing,
it makes sense to use a software design and development process
that is well suited to this, such as the
Agile Software Development approach.
Some of the principles of Agile Software Development are:
- Sufficient (not excessive) planning. Most importantly, plan on the system changing over time.
- Sufficient (not excessive) documentation.
- Frequent (every month or two?) releases of working software.
- The initial releases will be simple and not have all of the features you want,
but they should meet some of the more important needs of the clients.
As additional versions of the software are released, additional features can
be added to meet additional needs of the clients.
- You always have working software, starting very early on.
- You can adjust course after every release
via a dialog between software developers, users, and decision-makers.
This lets the system evolve as needs and priorities change.
- Frequent releases facilitate a cycle of regular feedback and continuous improvement,
which has been part of many
highly successful systems for improving the quality and efficiency of products.
It is important to contrast this approach with attempting to
completely design the system before building anything.
Although planning is important, it can become paralyzing if one tries to build the perfect plan.
There is a danger of planning and planning and planning and never
actually building a system that does anything.
As Voltaire said, "The perfect is the enemy of the good."
Planning large systems is very difficult.
As one works with a large problem like data distribution, one learns more
about the problem. One thinks of other solutions to parts of the problem.
And the problem often evolves over time as new requirements are added
(for example, when a new client program like Google Earth appears).
Agile development takes advantage of this, allowing the system (and the system's design) to evolve
as our understanding of the problem evolves.
The Agile Software Development process seems to have been designed
with a deep appreciation of the Pareto Principle
(also known as the 80/20 rule),
which basically says that some things are more important than others,
so it makes sense to work on the most important things first.
At the start of work on each version of the software (every month or two?),
you can decide which features are most important to work on next.
By taking this approach, you can usually get the most important features implemented first.
Clients get a useful system soon, rather than having to wait a long, long time until the software is finished.
One interpretation of the 80/20 rule for software development is
that you can implement the most important 80% of the desired
features for 20% of the total effort. The remaining 20% of the features may be harder
to implement and require the remaining 80% of the total effort.
You can quibble with the percentages, but the idea is sound.
It makes sense to work initially toward meeting the most important needs of most of the clients.
Work on the important things first.
The software used for data distribution systems can be reusable, or custom made for a specific site.
Both have advantages and disadvantages.
- Because custom systems are tailor made for their situation, they can be highly optimized
for the data that will be served and to meet the needs of specific groups of users.
But custom systems take time, money, and expertise (programmers) to develop.
- Because reusable systems are reused, it makes sense to put (relatively)
more time and money into developing them, because the fruits of the effort are
enjoyed by many sites, which magnifies the return-on-investment by the number of installations (10x? 100x?).
When a site chooses to use an existing reusable system,
it can be up and running in a day, and at no cost if the system is open source.
In many cases, reusable systems can be somewhat customized for a given site,
giving you the best of both worlds.
- As a result, reusable systems are preferred unless the need for special
features and the availability of time, money, and expertise make a custom
system worth the extra cost.
Given that Clients Have Diverse Needs
(really diverse when you consider human clients and computer program clients),
it seems unlikely that one data distribution system will meet everyone's needs.
It seems more likely that a one-size-fits-all solution won't be the perfect solution for everyone.
If you want to work toward building the one great system, great.
Good luck to you. (Really!) But beware of these dangers:
- The Needs Problem: it is unlikely that you have thought of everyone's needs.
- The Geared-Towards Problem: given that diverse clients have diverse needs,
it seems unlikely that your system can be the optimal system for all users.
It seems more likely that it will be geared toward a few types of users
or a few tasks, and that it will be mediocre (or bad) at other tasks.
- The Planning Paralysis Problem:
Although planning is important, it can become paralyzing if one tries to plan the perfect system.
There is a danger of planning and planning and planning and never
actually building a system that does anything.
- The Double-Edged-Sword Problem: if you build a system that
is really good in one respect, that strength is often also the system's
greatest weakness. For example, if you design a file structure that is so
sophisticated that it can handle the most complex data structures that you
could possibly imagine (HDF5 comes to mind), that's great. Really. But then you
have to write computer programs to process each of those amazingly complex data
structures to get the data into your client program (for example, Matlab, R, Excel, ArcGIS)
if that can even be done. Good luck with that.
- The Switch-Over Problem: Even if you could design and build a perfect system, it is unlikely
that you could convince all of the administrators of all of the existing data servers and all of the clients
to dump their existing solutions and switch to the new system.
Lots of people have lots of effort invested in the current systems
(which is why
Building Web Pages and Web Services on Top of Other Web Services
is so interesting -- it doesn't require any changes to existing web services).
- The Big-World Problem: Even if a large institution (NOAA?) built a really great system designed
to satisfy everyone's needs and even if they mandated that everyone in the institution
switch to the new system, there would still be a whole world of other data servers
and clients outside of that institution.
It would be great if you could build a perfect system that would meet everyone's needs
and then have everyone switch to it.
But that seems exceedingly unlikely.
Perhaps it is better to accept that a given data server may serve the needs of a
given community (or part of that community) well,
but not be well suited to some other community.
Perhaps it is better to accept that the world of data servers will (and should) remain heterogeneous.
Okay. The world has lots of types of data servers and probably always will.
The problem is that most are
unable to make their data available to other types of client programs.
DAP client software works with DAP servers, but not SOS or OBIS servers.
SOS client software works with SOS servers, but not DAP or OBIS servers.
OBIS client software works with OBIS servers, but not DAP or SOS servers.
Some programs (notably Matlab) allow plug-ins to be added to enable
requesting data from additional types of data servers, but that is the exception to the rule.
The general situation is: different types of data servers aren't interoperable.
That's a problem.
The solution is interoperability.
The reality is, the world has lots of types of data servers and, in general,
each client program can only get data from one type of data server, not others.
A partial solution is a system that seeks to make different data servers interoperable.
ERDDAP is one such system. ERDDAP can read data from
several types of data servers (DAP, OBIS, SOS, SOAP, databases, and local files).
ERDDAP can act as a DAP or a WMS server, to serve data in different ways to different clients.
And ERDDAP can return data in lots of different data (file) formats
(for example, .asc, .csv, .dods, ESRI .asc (for ArcGIS), HTML Table, .json, .mat, .nc, .tsv, .xhtml)
and image file formats (for example, .geotif, Google Earth .kml, .pdf, .png, and transparent .png).
ERDDAP is a good example of Building Web Pages and Web Services on Top of Other Web Services. ERDDAP is built to get data from existing web services.
And as a web service itself, other web pages and services can be built on top of ERDDAP
(for example, The CoastWatch Browser,
which gets data from an ERDDAP installation).
THREDDS has also made a few steps in this direction.
It can read data from several types of data files.
It can act as a DAP, WCS, and WMS server to serve the data in different ways to different clients.
And there are solutions that are more limited in scope (but are useful), such as
tools that make it easy to access data from a THREDDS server from within ArcGIS.
An important barrier to full interoperability is inconsistency.
It is hard to work with two datasets that use different metadata standards,
use different variable names for the same types of information,
or use different units for the same type of information.
- Searches for datasets with similar information may fail if different names are used.
For example, if you search for "sea_surface_temperature", the program may not find
datasets that just have "temp", "Temperature", or "SST".
- Comparing the time values in different datasets is a particularly challenging problem
because times are expressed in different ways
in different datasets: days since 1954-01-01, seconds since 1970-01-01, 17Jan1985,
Day 17 in 1985, 1985.06456345, with different (sometimes implicit) time zones, etc.
- Comparable information in different datasets may use different units.
For example, if you want to compare data using degrees_F and data using degrees_C,
you have to convert one of them first.
- Some datasets use longitude values from -180 to 180, others use 0 to 360.
This makes it hard to compare values from the two datasets.
Some software programs (for example, ArcGIS) insist on -180 to 180.
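Normalizations like these are small to write; the hard part is agreeing on the target conventions. A sketch, assuming we standardize on degrees_C and longitudes in -180 to 180 (a choice, not a universal rule):

```python
# Sketch: tiny normalizations that make datasets comparable.
# The target conventions (degrees_C, longitude -180..180) are our choice.

def fahrenheit_to_celsius(deg_f):
    """Convert degrees_F to degrees_C."""
    return (deg_f - 32.0) * 5.0 / 9.0

def lon_180(lon):
    """Map a longitude from the 0..360 convention to -180..180."""
    return ((lon + 180.0) % 360.0) - 180.0

print(fahrenheit_to_celsius(212.0))  # 100.0
print(lon_180(240.0))                # -120.0
```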
A partial solution is to use standards whenever possible.
Perhaps someday there will be unified standards that meet everyone's needs.
But for now, different communities have different standards and they are very useful.
Another partial solution: Building systems that can convert metadata from one standard
into metadata from another standard. Unfortunately, there are inherent limitations with this.
It would be surprising if the metadata used for one metadata standard
had all of the information needed, and at the correct level of detail,
to convert to another metadata standard.
For example, one standard might have a Contact attribute, while
another might have ContactName, ContactEmail, ContactPhone, and ContactAddress.
Another partial solution: Interoperability programs like ERDDAP
give the administrator the opportunity to add or modify each dataset's metadata.
So, for example, all units metadata could be converted to UDUnits.
ERDDAP also deals with the time units problem by converting all times to
UDUnits-compatible "seconds since 1970-01-01T00:00:00Z" (when formatting times as numbers),
or ISO 8601:2004 "extended" format,
for example, 1985-01-17T12:00:00Z (when formatting times as Strings).
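The conversion itself is mechanical once the dataset's epoch and unit are known. A simplified stand-in for what such a server does internally (the epoch and values are examples; real datasets declare their own units):

```python
# Sketch: convert "days since 1954-01-01" style time values to
# "seconds since 1970-01-01T00:00:00Z" and to ISO 8601 strings.
# The epoch and values are examples; real datasets declare their own units.
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_unix_seconds(value, unit_seconds, epoch):
    """Convert e.g. 'days since <epoch>' to seconds since 1970-01-01."""
    t = epoch + timedelta(seconds=value * unit_seconds)
    return (t - UNIX_EPOCH).total_seconds()

def to_iso8601(unix_seconds):
    """Format seconds since 1970-01-01 as an ISO 8601 'extended' string."""
    t = UNIX_EPOCH + timedelta(seconds=unix_seconds)
    return t.strftime("%Y-%m-%dT%H:%M:%SZ")

epoch_1954 = datetime(1954, 1, 1, tzinfo=timezone.utc)
secs = to_unix_seconds(17.0, 86400, epoch_1954)   # "days since 1954-01-01"
print(to_iso8601(secs))
```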
It is tempting to wish that datasets will be forever unchanged after they are
assembled, quality checked, and released. In practice, datasets are often revised in major
and minor ways. For this reason, it is useful to keep the expert(s) who created the dataset
relatively close to the dataset's initial distribution server so that they can
remain in charge of it. That doesn't mean that the dataset can't be re-served by other
servers, just that the connection between the expert and their data shouldn't be broken.
That brings up the issue of provenance.
Even with very detailed metadata, questions about datasets will arise that
can't be answered by reading the metadata. For this reason, it is important that
the metadata include provenance information -- the steps the dataset took to
get to where the user found it. In the CF standard, for example,
provenance information can be stored in the "history" attribute,
with one line for each processing step. The "summary" attribute may also
include contact information. With this information, the user can contact
the expert in charge of the dataset when they have questions.
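Maintaining such a "history" attribute is easy to automate: append one timestamped line per processing step. A sketch, using a plain dict as a stand-in for a file's global attributes (in practice they would live in the file's metadata, e.g., a netCDF global attribute):

```python
# Sketch: record provenance in a CF-style "history" attribute,
# one timestamped line per processing step. A plain dict stands in
# for the file's global attributes here.
from datetime import datetime, timezone

def add_history(attributes, step, when=None):
    """Append 'TIMESTAMP step' as a new line of the 'history' attribute."""
    when = when or datetime.now(timezone.utc)
    line = when.strftime("%Y-%m-%dT%H:%M:%SZ ") + step
    old = attributes.get("history", "")
    attributes["history"] = (old + "\n" + line) if old else line
    return attributes

# Hypothetical processing steps for an example dataset:
attrs = {"summary": "Example dataset. Contact: data-team at example.com"}
add_history(attrs, "original data downloaded from the buoy archive")
add_history(attrs, "regridded to a 0.1 degree grid")
print(attrs["history"])
```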
Please email bob dot simons at noaa dot gov.
DISCLAIMER OF ENDORSEMENT
Any reference obtained from this server to a specific commercial product,
process, or service does not constitute or imply an endorsement by CoastWatch,
NOAA, or the United States Government of the product, process, or service, or
its producer or provider. The views and opinions expressed in any referenced
document do not necessarily state or reflect those of CoastWatch, ERD,
NOAA, or the United States Government.
DISCLAIMER FOR EXTERNAL LINKS
The appearance of external links on this World Wide Web site does not
constitute endorsement by the
Department of Commerce/National
Oceanic and Atmospheric Administration
of external Web sites or the information, products or services contained
therein. For other than authorized activities, the Department of Commerce/NOAA does not
exercise any editorial control over the information you may find at these locations. These
links are provided consistent with the stated purpose of this Department of Commerce/NOAA web site.
DISCLAIMER OF LIABILITY
Neither the data providers, ERD, CoastWatch, NOAA, nor the United States Government,
nor any of their employees or contractors, makes any warranty, express or implied,
including warranties of merchantability and fitness for a particular purpose,
or assumes any legal liability for the accuracy, completeness, or usefulness,
of any information at this site.