
What we'll be discussing (from e-mail of 13 Nov 2014):

Our next Research IT Reading Group topic will be the first of a two-part, back-to-back series on Research Data Storage. Coming up first, we'll focus on the Space Sciences Lab & Google Drive. Google recently announced unlimited (no quota) storage for all Google Apps for Education customers, including UC Berkeley. [The Google Drive aspect of this meeting was deferred until 4 December]

The presentation and discussion will be facilitated by Chris Hoffman. Bill Allison (IST - Architecture & Platform Integration) and Jon Loran (Space Sciences Laboratory) will present.

[Part II of this series will be on 12/04, with a focus on research storage on Google Drive, which now offers unlimited storage to Google Apps for Education customers, including UC Berkeley.]

Please read/review the following in advance of the 11/20 meeting
==> Google's announcement of unlimited storage for Apps for Education customers (30 Sep 2014)

==> The Berkeley Ground Systems home page, describing a major center of SSL operations and data infrastructure [minimally, please read the Facility and Ground Station description pages]

==> The THEMIS/ARTEMIS project home page, as an example of the research conducted at SSL [minimally, please read the Mission : Overview page]

==> Scaling Astronomy to the Next Decade (PDF), a slide deck from a 2010 Scientific Data Management class at U. Washington, describing some of the data challenges and infrastructure for the Large Synoptic Survey Telescope (LSST)

Additional/optional readings:
More information about the LSST and its plans for data:

More about data for the SSL HESSI project:

An astronomer talks about big data (video and/or summary):

Comparison tool being developed to help provide guidance about collaboration tools at Berkeley:



James McCarthy (SSL)
Jon Loran (SSL)
Mark Baroni (Library)
Paul Payne (Library)
Ron Sprouse (Linguistics)
Perry Willett (CDL)
Ian Crew (IST-API)
Jack Shnell (IST storage team)
Parrish McCorkle (IST storage team)
Lynne Cunningham (Art History)
Scott Peterson (Doe Library)

Chris Hoffman (CIO-Research IT)
David Greenbaum (CIO-Research IT)
John Lowe (CIO-Research IT)
Quinn Dombrowski (CIO-Research IT)
Rick Jaffe (CIO-Research IT)
Steve Masover (CIO-Research IT)


Chris: Bill wants to be involved in this discussion, will have a deeper dive into Google Apps for Education in the future
Met with SSL several weeks ago, have infrastructure
Remote, need to have data storage close to researchers, unique work
Google announcement about unlimited storage — unlimited is great, but can you use it? When is it useful?
Unlimited storage sounds great, but where and when is it helpful to researchers?
What challenges limit broader use of cloud based storage services such as Google Drive?
What challenges limit broader use of centralized storage services such as those offered by UC Berkeley?
What is unique and familiar about data management workflows and needs described at SSL?


2nd largest ORU on campus

Operate independently, remote to campus, separate network that doesn't benefit from high speed campus network

Berkeley MOC - jewel of SSL. Autonomous data center, an oddity / unheard of among space science centers. Able to build and operate less expensively than it would have cost ($1M/year) to have data center hosted elsewhere. Budget experience based on 12 years of operation.

Data center is highly firewalled, only way in is, essentially, through a NASA T1 line, itself highly firewalled.

Storage and project SOCs ingest data from third parties and handle connections to project science workstations and remote researchers (see slide for diagram). Satellite data comes in through NASA network.

Sprinkled through SSL data is ITAR-sensitive material (must be kept in the United States). There are severe penalties for anyone responsible for leaking data to a foreign national; there have been prosecutions, and people are understandably nervous. The university has made clear that it will not stand by anyone who acts irresponsibly in such matters. Some relaxation of these restrictions on data going forward, but data previously restricted under ITAR remains so.

Satellite operations require data availability (critical) -- serious issue if network goes down.

Data processing requires low latency and high bandwidth. Vast majority of SSL data is accessed as block reads/writes. If this data were to move into the cloud, there are questions about how it would be accessed: how would existing software use this data without an extensive rewrite?

Chris: Given that some data is open, other ITAR-restricted, you have a problem in asking scientists to treat multiple sets with different restrictions appropriately when both are inputs to a research inquiry. How to handle?
Jon Loran: This is really hard, you can't really wrangle the many (hundreds of) scientists for whom these issues are not a central concern.
John Lowe: Foreign nationals?
Jon Loran: Many work with us, yes.

David: Mandated data sharing?
Jon Loran: Written into the contracts we set up. We have procedures for transferring data. Generally speaking, "there's a lot of surface area" -- we do what we can, and try to control the "leaky" areas across that surface. Reminders not to transmit engineering information in e-mail.
Jack: Access methods?
Jon Loran: Transmission over internet. Software packages (commercial) for access: IDL, command line interpretive, array arithmetic. Scientists share software they have written to do analysis.

Jack: Notes similarities to project underway with BRC-HPC project. Interested in exploiting synergies.

Perry Willett: Tools for access? How do scientists choose?
Jon Loran: Well ... they attend talks, hear from other scientists what has proven useful in relevant contexts.
Chris: Reading the web sites, noticing how project-specific the data sets are, how opaque to people not directly involved in the projects. ...
Rick Jaffe: So is there pressure to make materials more accessible, returning to David's question?
Jon Loran: There is definitely interest, but we're not doing a great job of that yet. Partial, slow progress. But this is real work and, again, not central to project foci. SETI@home was a brilliant idea in this direction. So there are some interactions. Have heard about distribution of supercomputers to homes for heating (Ian: in Germany).

[Technical discussion of storage technologies: Jack Shnell, Parrish, Jon Loran]. GPFS, Nexenta, software-defined storage (and its recent buzz). Procurement, licensing, aggregation of demand, sales representatives .... Chris will work with Jon on assessing whether Nexenta demand might be aggregated across the campus. Procurement data mining [note]

Paul: storage at this scale to be available from IST to campus as a service?
Parrish: goal is to roll out as an archival storage service that rivals Amazon Glacier in pricing

Chris: Big Data -- how does that term look to SSL?
Jon Loran: not sure our data really is "big" in modern terms. Scientists are not looking toward storage and analysis in the modes (formats, databases) currently applied to Big Data. Not putting our data in databases.

Chris: Google Drive announcement re: free storage. Curious about potential use.
Parrish: Don't know enough about it. Free sounds good. Interface, security, access?
Ian: A Google API exists, but the kinks are still being worked out in enabling it so that API access does not give superuser access to the whole Berkeley instance. Google Drive sync, web client. Supports single-file storage up to 5TB (someone in Higher Ed has actually tested this). The permission scheme for Google Drive is not very granular (e.g., edit privilege implies delete), so people need to be quite careful. SharePoint and Box may have more appropriate permission schemes in some use cases. All data on Google is encrypted both in transit and at rest.
Jack S: Forces developers into realm of being filesystem administrators as well. Can be problematic.
Ian: Box could be HIPAA compliant, but this is not yet implemented for UCB's Box contract/engagement.

Chris: Knowledge base article (44390) developed by Ian and others on guidance re: selection of appropriate content management solutions. Link in readings suggested for this meeting. We plan to develop more detailed guidelines along these lines, including how much these services can be trusted to be available.

Mark B: What happens if someone scripts a change to the permissions of hundreds of files in error, and the changes need to be rolled back but are not noticed until 6 weeks later?
Paul: Business continuity. What happens when someone does not cooperate in transferring control of data to institution?
Ian: Security policy. CalNet Special Purpose Accounts (SPAs -- dept CalNet accts) -- can be used to log into Box, create a folder, share with individuals. From then on anything that goes into that folder or subfolder is owned by SPA. Article on KnowledgeBase on how that works in detail; simple to execute, but worth thinking through.


Chris: Next time, will continue this discussion. Likely to focus on Amazon S3, possibly on data security in Amazon context.
Steve: for 12/4, will send announcement next week.



