When: Thursday, June 2 from noon - 1pm
Where: 200C Warren Hall, 2195 Hearst St (see building access instructions on parent page).
Event format: The reading group is a brown bag lunch (bring your own) with a short <20 min talk followed by ~40 min group discussion.
Presenter: Joe Near, AMPLab
Joe Near, a postdoctoral researcher in Dawn Song's group in UC Berkeley's Computer Science Division, will lead a discussion on Helio, a scalable system for distributed, secure, collaborative data analytics. Helio addresses the security and privacy challenges raised by the explosion in large-scale data analytics on systems that are vulnerable to attack, both from the outside (by untrusted programs or queries) and from within (by compromised infrastructure providers).
Helio provides strong, provable, end-to-end security guarantees, approaches the performance of today’s data analytics frameworks, and provides a friendly programming model in which programmers need not write security checks. Helio uses information flow analysis to ensure that programs do not leak data, and leverages Intel’s SGX extensions to protect running tasks against compromised infrastructure operators.
Prior to the meeting, please review:
Optional readings for those interested in a deeper look at Helio:
Bootstrapping Privacy Compliance in Big Data Systems (Shayak Sen, Saikat Guha, Anupam Datta, Sriram K. Rajamani, Janice Tsai and Jeannette M. Wing). LEGALEASE is the policy language underpinning our data capsule policies. It's designed to be easy for data owners to write and automatically enforceable.
Differential Privacy (Cynthia Dwork). Differential privacy is a formalization of one notion of privacy, designed to allow analytics that reveal trends in large datasets without harming the privacy of individual participants. Helio allows data owners to specify that analysts must use differentially private analytics on their data capsules.
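As a concrete illustration of the differential privacy idea above (not from the reading itself), here is a minimal sketch of the Laplace mechanism for a counting query. The function names and record format are hypothetical; a counting query has sensitivity 1, so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.

```python
import random

def laplace_noise(scale):
    # The difference of two independent Exp(1) draws, scaled,
    # is a Laplace(0, scale) sample.
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def dp_count(records, predicate, epsilon):
    # Counting queries have sensitivity 1, so noise with
    # scale 1/epsilon suffices for epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

With a small epsilon (strong privacy) the noise is large; with a large epsilon the noisy count tracks the true count closely.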
Presenting: Joe Near, Computer Science
Aaron Culich, Research IT
Aron Roberts, Research IT
Barbara Gilson, SAIT
Betsy Cooper, Center for Long Term Cybersecurity
Camille Crittenden, CITRIS
Chris Hoffman, Research IT
Jamie Wittenberg, Research IT
Jason Christopher, Research IT
Leon Wong, IST-Security
Matthew Campbell, Research IT
Nicholas M--, Computer Science
Oliver Heyer, ETS
Patrick Schmitz, Research IT
Richard Katz, SAIT
Scott Peterson, Doe Library
Steve Masover, Research IT
Steven Carrier, School of Education
Presentation (see slides)
The data owner (DO) and data analyst (DA) are often assumed to be the same person; in practice they need not be, and the DA is not necessarily trusted by the DO. The program the analyst writes might not be trustworthy, and the computation infrastructure (e.g., the cloud) may not be trusted either. Finally, the result set may expose more about the initial data set than the data owner intended -- which may itself constitute a security/privacy issue.
1. data enclosed in a "capsule" that includes a Use Control Policy; data capsule is encrypted
2. analytical model also enclosed in a "capsule" protected by a "Residual Policy"
3. query results generated by an analyst's query also encrypted in a "capsule" + residual program
At this point it may make sense to declassify the data, assuming all conforms to data owner's policies.
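The three capsule steps above can be sketched as a toy model (illustrative only; the class and field names are hypothetical). Real capsules are encrypted; here `payload` stands in for ciphertext, and the residual policy of a result capsule is modeled as the union of its inputs' policies, released only at declassification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capsule:
    # Toy stand-in for a Helio-style capsule: `payload` for the
    # (really encrypted) data, `policies` for the use-control /
    # residual policy set.
    payload: object
    policies: frozenset

def derive(result_payload, *inputs):
    # A result capsule inherits the union of its inputs' policies
    # (the "residual policy" of step 3).
    residual = frozenset().union(*(c.policies for c in inputs))
    return Capsule(result_payload, residual)

def declassify(capsule, satisfied):
    # Declassification releases the payload only once every residual
    # obligation (e.g., "differential privacy was applied") is met.
    if capsule.policies <= satisfied:
        return capsule.payload
    raise PermissionError("residual policy not satisfied")
```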
The analyst need not change anything in her program (e.g., an Apache Spark job) because of Helio or the security it provides, assuming the program is analyzable and, where necessary, convertible using a set of rewrite rules included in Helio that satisfy the data owner's specified security policies.
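To make that point concrete, an analyst's program might look like the ordinary aggregation below, with no security code of any kind. Plain Python stands in for a Spark job to keep the sketch self-contained, and the record format and function name are hypothetical.

```python
from collections import Counter

def diagnosis_counts(records):
    # records: iterable of (patient_id, diagnosis) pairs.
    # Nothing here is Helio-specific: no policy checks, no encryption
    # calls. Helio's information flow analysis and rewrite rules would
    # operate on a program like this as-is.
    return Counter(diagnosis for _, diagnosis in records)
```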
SGX provides protection against insecure infrastructure; Intel provides an attestation service that verifies that a provider who claims to be using SGX is in fact doing so.
Current performance evaluations have been done only on a four-laptop "cluster" (among the earliest machines with Intel's SGX included in the processor).
Enclave programs must be written in C++ and compiled using Intel-provided tools. Helio group is developing a higher-level language that can be compiled into the necessary executables.
Case Study: Health eHeart Study. Longitudinal study on heart disease at UCSF Med Center.
Who controls/owns the keys? -- Keys owned by data owner; DO delegates to Helio Key Server
How is data chopped up when the set is large? -- Spark does this, and Spark is untrusted: it partitions the encrypted data and deploys the partitions to enclaves on the worker nodes.
How are policies expressed? -- LEGALEASE is the language in which policies are expressed. Domain-specific languages for expressing policy are what would be useful in the real world, but the group has not yet done much work to build interfaces from such policy DSLs to LEGALEASE.
Given the fast evolution and complexity of real-world policy, how gnarly do real-world expressions of policy get in LEGALEASE? -- This is actually a hard question, because data owners don't necessarily (or even often) know what policies they want. In the Health eHeart study, the policy the data owners said they wanted worked out to 50 lines of policy language, but JN isn't at all certain that the owners actually expressed the policy they really wanted (or should have wanted).
How many groups / data sets / analyses has this development of policy been tried with? -- Not as many as JN would like. One other group is the principal example, and it wasn't easy; what we ended up writing was ~35 lines.
If you own your own hardware, what need is there for Helio? -- Uber, for example, has lots of highly sensitive data and its own data center. They would rather not encrypt, because of the performance overhead, so they would like to 'lose' the SGX piece of Helio.
Access to resources under copyright (as in Hathi Trust Research Center)? -- JN would like to learn more about that. Jamie W will send more information on this.
Working with cloud providers? -- Not yet. Can't specify an SGX machine from Amazon, but we hope / expect that will go away in a couple of years. (Microsoft is doing a lot of SGX research, maybe something will come from that quarter.)
What about using Helio as a framework for working out what policies are required, even if the owners don't analyze with Spark? -- JN hopes to get there; he expects the Spark-integrated aspects (enforcement) could run on alternative technologies. One can imagine running the auditing piece, but not the enforcement part, on an alternative technology.
What are goals for platform? -- For platform generally, would like to see Spark project incorporate some of these ideas. Not working to build something to be folded into Spark directly, but looking to influence where Spark goes. Our projects interested in (1) declassification framework piece -- how do you enforce differential privacy -- integrate that piece better with analyst workflow, what we have works but it's cumbersome; (2) build higher level language for policies; (3) go further with user defined function part of this, without requiring UDFs to be written in C++.
What will it take to explain what's in the Helio system to a lawyer who has to determine whether an analytical system conforms to HIPAA or FISMA? -- A hard question. JN would like to encode a set of requirements that a lawyer could verify as aligning with HIPAA or FISMA. But then there's the problem of proving that "it works" to the lawyer, who does not have the technical or mathematical chops to analyze it. Over time one might hope that expertise in the different pieces of the privacy/security puzzle (e.g., deidentification) will develop further, so that lawyers would, essentially, be willing to trust classes of certified experts.