
In our first round of communications, we had a number of good conversations with campus scientists.  Many people were on the way out for the summer, so a second request was sent out in mid-August.  Some follow-up meetings are scheduled in the next couple of weeks, notably with Bernie Hurley of The Library.  We do need to get time with CDL; Patrick is trying to get in touch with Tricia Cruze.  Chris, Patrick, and Noah had a good conversation with Clifford Lynch (CNI) on August 10.

The following notes are unorganized fragments for discussion.  Some of this content will go into the white paper.  This borrows heavily from the discussion with Clifford Lynch.

The scientists we talked to agreed that data management is very important and very difficult.  Not surprisingly, what "data management" means is not absolutely clear.  Where does it start and end?  We might use a framework (OAIS, CDL) to help define this more specifically.

Drivers: Scientific authenticity and integrity, and data reuse.  This is related to NSF cyberinfrastructure initiatives.  Clifford points out that this is not just an NSF issue.  Some agencies have done work here, and others are likely to follow.  Similarly, this is not just a US issue -- it is global.  Therefore, we should expect requirements for scientific data management to become more important over time.

Nobody knows what will be included in the NSF requirement.  However, it is likely that many of the specifics will be left to individual NSF directorates to allow for flexibility at the domain level.  Clifford thinks it is unlikely that NSF will be asking for preservation in perpetuity.  Aim for ten years initially.

Compliance: It is not likely that compliance will be carefully monitored, nor that it will be a big driver for the scientific community.  It is very difficult to measure compliance in a detailed way.  Clifford prognosticated that at some point in the next few years, there would be an incident where scientific data were called into question (like ClimateGate) and that this would give additional motivation for scientists and institutions to take data management activities more seriously.  Similarly, a natural disaster could expose the fragility of scientific data.  Interestingly, some journals are now requiring deposit of data sets -- by tying publication to this requirement, they are helping create the market for accessible repositories.

Domain-based repositories are emerging right now.  Will there be a mega-repository, or will the domain-based and institutional repositories become nodes in a network?  Probably the latter.  Some of these are: ... 

  • UC DATA, ICPSR, and census-based data.  An example of a scientific community that has figured out more of this.
  • GenBank and Dryad.

DataNet: DataONE and Data Conservancy...  Not likely to have clear services in the very near future.

UK and Europe ...

There are oh so many issues to contend with.  Scientists who responded brought up several familiar ones.  We have good quotes we can include in the white paper.

  • Some data are confidential (especially where human subjects are involved)
  • Most scientists will need to or want to place restrictions on public data sharing, especially to allow time for publication. 
  • Some fields have cultures that strongly inhibit data sharing.  In others (fields that are very new or that have no data standards), data are almost unusable outside the laboratory where they were created.  Sometimes review or reuse of scientific data is possible only for the very small number of people who are familiar with, for example, a specific piece of equipment.
  • In other fields, data sharing is already required (NASA)
  • Organizationally, this is a complex problem.  Who should be responsible for ensuring that the campus is moving in the right direction?  At most other universities that are working on this, the primary organization is an academic one or the library.  UC Berkeley seems to be different in this regard.

Recommendations

We should both get some campus discussion moving on this topic and start reaching out to our partners off campus, some of whom have already been working on scientific data management.  The driver should be competitiveness first, compliance second.

Though it is too early to call these recommendations, some of the next steps that I am likely to mention include:

  • Some kind of statement from the campus (from the CIO and/or Academic Senate) endorsing principles for scientific data management (e.g., standards-based preservation and access within to-be-established constraints)
  • Initial packaging of some existing IST services (storage and backup) that can be offered to the campus scientific community; pointing people to existing domain-based repositories where they exist; helping with grant-writing language and so on
  • Convening discussions about how we can move UCB scientific data management forward, including identifying someone responsible for leading the conversation