Attendees (from e-mail of 13 Nov 2014):
James McCarthy (SSL)
Jon Loran (SSL)
Mark Baroni (Library)
Paul Payne (Library)
Ron Sprouse (Linguistics)
Perry Willett (CDL)
Ian Crew (IST-API)
Jack Shnell (IST storage team)
Parrish McCorkle (IST storage team)
Lynne Cunningham (Art History)
Scott Peterson (Doe Library)
Chris Hoffman (CIO-Research IT)
David Greenbaum (CIO-Research IT)
John Lowe (CIO-Research IT)
Quinn Dombrowski (CIO-Research IT)
Rick Jaffe (CIO-Research IT)
Steve Masover (CIO-Research IT)
INTRO PRESENTATION: Jon Loran, IT Officer, SSL
Second-largest Organized Research Unit (ORU) on campus
Operates independently, remote from campus, on a separate network that doesn't benefit from the high-speed campus network
Berkeley MOC (Mission Operations Center) - jewel of SSL. Autonomous data center, an oddity / unheard of among space science centers. Able to build and operate it less expensively than it would have cost ($1M/year) to have the data center hosted elsewhere. Budget experience based on 12 years of operation.
Data center is highly firewalled, only way in is, essentially, through a NASA T1 line, itself highly firewalled.
Storage and project SOCs ingest data from third parties and handle connections to project science workstations and remote researchers (see slide for diagram). Satellite data comes in through the NASA network.
Sprinkled throughout SSL data is ITAR-sensitive material (must be kept in the United States); severe penalties for anyone responsible for leaking data to a foreign national. There have been prosecutions, and people are understandably nervous. The university has made clear that it will not stand by anyone who acts irresponsibly in such matters. Some relaxation of these restrictions on data going forward, but data previously restricted under ITAR remains so.
Satellite operations require data availability (critical) -- a serious issue if the network goes down.
Data processing requires low latency and high bandwidth. Vast majority of SSL data is accessed as block reads/writes. If this data were to move into the cloud, questions about how it would be accessed: how would existing software use this data without extensive rewrite?
Chris: Given that some data is open, other ITAR-restricted, you have a problem in asking scientists to treat multiple sets with different restrictions appropriately when both are inputs to a research inquiry. How to handle?
Jon Loran: This is really hard, you can't really wrangle the many (hundreds) of scientists for whom these issues are not a central concern.
John Lowe: Foreign nationals?
Jon Loran: Many work with us, yes.
David: Mandated data sharing?
Jon Loran: Written into the contracts we set up. We have procedures for transferring data. Generally speaking, "there's a lot of surface area" -- we do what we can, and try to control the "leaky" areas across that surface. Reminders not to transmit engineering information in e-mail.
Jack: Access methods?
Jon Loran: Transmission over internet. Software packages (commercial) for access: IDL, command line interpretive, array arithmetic. Scientists share software they have written to do analysis.
Jack: Notes similarities to a project underway with BRC-HPC. Interested in exploiting synergies.
Perry Willett: Tools for access? How do scientists choose?
Jon Loran: Well ... they attend talks, hear from other scientists what has proven useful in relevant contexts.
Chris: Reading the web sites, noticing how project-specific the data sets are, how opaque to people not directly involved in the projects. ...
Rick Jaffe: So is there pressure to make materials more accessible, returning to David's question?
Jon Loran: There is definitely interest, but we're not doing a great job of that yet. Partial, slow progress. But this is real work and, again, not central to project foci. SETI@home was a brilliant idea in this direction. So there are some interactions. Have heard about distribution of supercomputers to homes for heating (Ian: in Germany).
[Technical discussion of storage technologies: Jack Shnell, Parrish, Jon Loran]. GPFS, Nexenta (http://www.nexenta.com/), software-defined storage (and its recent buzz). Procurement, licensing, aggregation of demand, sales representatives .... Chris will work with Jon on assessing whether Nexenta demand might be aggregated across the campus. Procurement data mining [note]
Paul: storage at this scale to be available from IST to campus as a service?
Parrish: goal is to roll out as an archival storage service that rivals Amazon Glacier in pricing
Chris: Big Data -- how does that term look to SSL?
Jon Loran: not sure our data really is "big" in modern terms. Scientists are not looking toward storage and analysis in the modes (formats, databases) currently applied to Big Data. Not putting our data in databases.
Chris: Google Drive announcement re: free storage. Curious about potential use.
Parrish: Don't know enough about it. Free sounds good. Interface, security, access?
Ian: A Google API exists, but enabling it such that API access does not give superuser access to the whole Berkeley instance is still having the kinks worked out. Google Drive sync, web client. Supports single files up to 5 TB (someone in Higher Ed has actually tested this). The permission scheme for Google Drive is not very granular (e.g., edit privilege implies delete), so people need to be quite careful. SharePoint and Box may have more appropriate permission schemes in some use-case contexts. All data on Google is encrypted both in transit and at rest.
Jack S: Forces developers into realm of being filesystem administrators as well. Can be problematic.
Ian: Box could be HIPAA compliant, but this is not yet implemented for UCB's Box contract/engagement.
Chris: Knowledge base article (44390) developed by Ian and others on guidance re: selection of appropriate content management solutions. Link in readings suggested for this meeting. We plan to develop more detailed guidelines along these lines, including how much these services can be trusted to be available.
Mark B: What happens if someone scripts a change to permissions of hundreds of files in error -- changes that need to be rolled back but aren't noticed until six weeks later?
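[Aside: Mark's scenario implies the kind of safeguard below -- snapshot permissions before any bulk change so they can be restored later. This is only an illustrative local-filesystem sketch (function names hypothetical), not a feature of Google Drive or Box, whose APIs would need their own equivalent.]

```python
import json
import os
import stat
import tempfile

def snapshot_permissions(paths):
    """Record current mode bits so a bulk change can be rolled back later."""
    return {p: stat.S_IMODE(os.stat(p).st_mode) for p in paths}

def restore_permissions(snapshot):
    """Roll back to the recorded modes, e.g. weeks after a bad script run."""
    for path, mode in snapshot.items():
        os.chmod(path, mode)

# Demo on a throwaway file.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
os.chmod(tmp.name, 0o640)

snap = snapshot_permissions([tmp.name])   # take snapshot before the change
os.chmod(tmp.name, 0o600)                 # the erroneous bulk change
restore_permissions(snap)                 # recovery, however much later

assert stat.S_IMODE(os.stat(tmp.name).st_mode) == 0o640
os.unlink(tmp.name)
```

The snapshot dict is JSON-serializable (`json.dump`), so it could be written to disk alongside the bulk-change script and kept as the rollback record.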
Paul: Business continuity. What happens when someone does not cooperate in transferring control of data to institution?
Ian: Security policy. CalNet Special Purpose Accounts (SPAs -- departmental CalNet accounts) can be used to log into Box, create a folder, and share it with individuals. From then on, anything that goes into that folder or its subfolders is owned by the SPA. An article on the KnowledgeBase explains how that works in detail; simple to execute, but worth thinking through.
Chris: Next time, will continue this discussion. Likely to focus on Amazon S3, possibly on data security in Amazon context.
Steve: for 12/4, will send announcement next week.