In our first round of communications, we had a number of good conversations with campus scientists. Many people were leaving for the summer, so a second request was sent out in mid-August. Some follow-up meetings are scheduled for the next couple of weeks, notably with Bernie Hurley of The Library. We still need to get time with CDL; Patrick is trying to get in touch with Tricia Cruze. Chris, Patrick, and Noah had a good conversation with Clifford Lynch (CNI) on August 10.
The following notes are unorganized fragments for discussion. Some of this content will go into the white paper. This borrows heavily from the discussion with Clifford Lynch.
The scientists we talked to agreed this was very important and very difficult. Not surprisingly, what "data management" means is not well defined. Where does it start and end? We might use a framework (OAIS, CDL) to help define this more specifically.
Drivers: Scientific authenticity and integrity, and data reuse. This is related to NSF cyberinfrastructure initiatives. Clifford points out that this is not just an NSF issue. Some agencies have done work here, and others are likely to follow. Similarly, this is not just a US issue -- it is global. Therefore, we should expect requirements for scientific data management to become only more important over time.
Nobody knows what will be included in the NSF requirement. However, it is likely that many of the specifics will be left to individual NSF directorates to allow for flexibility at the domain level. Clifford thinks it is unlikely that NSF will ask for preservation in perpetuity. Aim for ten years initially.
Compliance: It is not likely that compliance will be carefully monitored, nor that it will be a big driver for the scientific community. It is very difficult to measure compliance in a detailed way. Clifford prognosticated that at some point in the next few years, there would be an incident where scientific data were called into question (like ClimateGate) and that this would give additional motivation for scientists and institutions to take data management activities more seriously. Similarly, a natural disaster could expose the fragility of scientific data. Interestingly, some journals are now requiring deposit of data sets -- by tying publication to this requirement, they are helping create the market for accessible repositories.
Domain-based repositories are emerging right now. Will there be a mega-repository, or will the domain-based and institutional repositories become nodes in a network? Probably the latter. Some of these are: ...
DataNet: DataONE and Data Conservancy... Not likely to have clear services in the very near future.
UK and Europe ...
There are a great many issues to contend with. Scientists who responded brought up several familiar ones. We have good quotes we can include in the white paper.
We should both get some campus discussion moving on this topic and start reaching out to our off-campus partners, some of whom have already been working on scientific data management. The driver should be competitiveness first, compliance second.
Though it is too early to call these recommendations, some of the next steps that I am likely to mention include: