Please join us for the next Research IT Reading Group, on Thursday 14 July, as we discuss management of research data in the active phase of its lifecycle, and what services the campus can and should offer to support research data management during that active phase.
When: Thursday, July 14 from 12 - 1pm
As data science and the challenges of Big Data expand across new academic frontiers, how will researchers craft solutions to their data storage needs?
Over the course of a research project's lifecycle, how will researchers meet the storage needs for the active phase of their research, as opposed to other project storage needs (backup, recovery, public data sharing, archiving)?
This Research IT Reading Group session will characterize the needs around active research data storage. It will describe work being done to frame the problem and outline approaches to solutions for the UCB research community. Please join us in discussion of this emerging topic.
Please read the following articles prior to the meeting:
Also, please feel free to review the following (short) web pages, which describe storage solution providers currently under consideration by the Research Data Management team:
Presenting: Jason Christopher, Research IT
Aaron Culich, Research IT
Anthony Suen, BIDS
Aron Roberts, Research IT
Bernard Li, LBNL/BRC
Camille Crittenden, CITRIS
Giulia Hill, Library Systems Office
Jason Huff, CGRL
John Kratz, CDL
John White, LBNL/BRC
Maurice Manning, Research IT
Patrick Schmitz, Research IT
Perry Willet, CDL
Quinn Dombrowski, Research IT
Rick Jaffe, Research IT
Ronald Sprouse, Linguistics
Scott Peterson, Doe Library
Stephen Abrams, CDL
Steve Masover, Research IT
Steven Carrier, School of Education
[Jason's presentation: see slide deck (PDF)]
Huge range of needs across research domains, from DOE lab experiments generating 150-200TB of data per experiment or per day, to domains where data is stored using DIY solutions ranging from optical disks and flash drives to cloud services and network-attached storage arrays.
Researchers are often unaware that Box and Google Drive are available to them as storage resources.
Key question to researchers: how do storage needs change over the course of a research lifecycle?
Deliverable: "a guidance grid" to help researchers on campus understand and choose among available options. This is a work in progress. Also a decision matrix, addressing the questions (and their sequence) to ask when consulting with a researcher about her needs.
Consulting expertise exists on campus, but is not "evenly distributed" (available to all who need to make use of it).
Camille: Any patterns in which domains are amenable to moving to some of the new models proposed?
Jason: Not a comprehensive enough survey to say ... but the people we're talking to have been open to considering services that are available, e.g., Box and Google Drive. Most of the researchers interested in these solutions have ~10TB or so of data to handle. Interesting use case from Stanford: 1.5PB migrated up to Google Drive (in 256 parallel streams) -- but they haven't pulled it back out yet, so we'll see how that goes.
John White: Notes that there is no API for Google Drive, which leaves Linux environment / command-line users unsupported.
Patrick: Globus as another new solution.
Jason Huff: 200-300 biologists supported. Seeing among this population the same sort of mishmash, from thumb drives to external drives, etc. We're excited to see the Condo Storage offering develop for those who compute on Savio. Finding that new faculty -- with a clean slate -- are the most amenable to adopting the services and consultation offered by the RDM program. Hope is that their discussions with colleagues about how they've benefitted from campus-supported solutions will help to spread interest in these.
Discussion of researchers' attraction (or indifference) to relief from sysadmin responsibility (for storage or compute) for themselves, their staff, and their grad students. An interesting point raised at a recent CASC meeting (reported by Patrick) is a rising concern that a source of HPC sysadmins is drying up due to the centralization of this type of resource -- that is, the PhD students who cut their teeth running clusters for their research groups, then bail on their programs or on an academic career and turn instead to getting paid for the sysadmin skills they developed as grad students.
Camille: Communications plan to reach new faculty?
Jason: Working with CSS-IT to be sure that our services (including consultation) are on new-faculty orientation agendas.
Aaron & Jason: Best for us (BRC & RDM consultants) if researchers engage with us early. Sometimes that's earlier than a researcher thinks it is appropriate to engage a consultant, or a researcher may think their data or computation needs are too small-scale to bother Research IT with ... but we'd like any help we can get encouraging researchers to speak with us as early as possible.