Our next Research IT Reading Group topic will be: Cloud-based Simulation to Optimize Study Design and Analysis for Health Impacts, Th 30 July / noon / 200C Warren Hall
When: Thursday, July 30th from noon - 1pm
Where: 200C Warren Hall, 2195 Hearst St (see building access instructions on parent page).
Event format: The reading group is a brown bag lunch (bring your own) with a short ~20 min talk followed by ~40 min group discussion.
Jennifer Ahern, PhD, MPH is Associate Professor of Epidemiology and Chancellor's Professor of Public Health
Scott Zimmerman, MPH is a Research Data Analyst 3 in the Ahern Research group, Division of Epidemiology, School of Public Health
Ellie Colson, MPH is a Research Data Analyst 3 in the Ahern Research group, Division of Epidemiology, School of Public Health
We will describe our work building a cloud-based simulation tool for Public Health analyses that helps users identify the optimal study design and analysis combination for a specific program or policy. Epidemiologists, public health professionals, government agencies and non-profits will be able to use the system to rigorously plan studies and data analyses for the evaluation of policies and programs.
Please review the following prior to the 7/30 meeting:
⇒ Project Overview Poster, presented at the 2015 BIDS Data Science Faire and the 2015 Society for Epidemiologic Research Conference, provides a 1-page overview of the project background and example results.
⇒ Recommended Background Papers: The choice of study design and analysis approach for health effect assessments has been informed by general frameworks, including those outlined in the papers below, that discuss pros and cons of different designs and analyses for the examination of health impacts of non-randomized policies and programs. While these frameworks provide useful broad guidelines, our project aims to improve study quality by incorporating rigorous quantitative assessment of which specific design and analysis combination is best to answer the scientific question of interest, given characteristics of the program and potential biases.
⇒ To secure computational resources for the early stage of the project we applied for, and were awarded, grants with Microsoft Azure and Amazon Web Services; additional resources used include the campus institutional cluster called Savio, available through the BRC Faculty Computing Allowance. Our applications to the cloud providers are included here:
Jennifer Ahern, School of Public Health
Ellie Colson, School of Public Health
Scott Zimmerman, School of Public Health
Bill Allison, IST-API
Steve Carrier, School of Education
Norm Cheng, TPO
Jason Christopher, Research IT
Aaron Culich, Research IT
Quinn Dombrowski, Research IT
Chris Hoffman, Research IT
Rick Jaffe, Research IT
Greg Merritt, Connected Corridors
Krishna Muriki, Research IT
Chris Paciorek, Statistics & Research IT
Aron Roberts, Research IT
Patrick Schmitz, Research IT
Brian Peterson, Connected Corridors
Camille Villa, Research IT
Leon Wong, Security
Raymond Yee, D-Lab
Presentation slides (PDF)
- Project addresses the need for a rigorous system for evaluating research methodologies as they are defined, in order to improve the reliability and accuracy of public health studies that assess health impacts of policies and programs, specifically those to do with criminal justice in California.
- Current stage of work (studysimulator.com): concluding setup; will soon test and refine simulation algorithms and improve UI/UX.
- Angular.js on front end, node.js and Express on the back end; R on the analytic layer (Azure, AWS, Savio); MySQL database. Interested in being able to switch between analytic infrastructures (Azure/AWS/Savio layer) without a lot of fuss, rewriting, or configuration.
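One way to keep the analytic layer swappable, sketched here as an assumption rather than as the project's actual design, is to hide each infrastructure behind a common interface and select it by configuration. All names below (`AnalyticBackend`, `LocalBackend`, `BACKENDS`, `get_backend`, the `simulate.R` script name) are hypothetical, and the real Azure, AWS, and Savio submission logic is not shown.

```python
from abc import ABC, abstractmethod


class AnalyticBackend(ABC):
    """Hypothetical common interface for one analytic infrastructure."""

    @abstractmethod
    def submit(self, script, params):
        """Queue an analysis script with parameters; return a job identifier."""


class LocalBackend(AnalyticBackend):
    """Stand-in backend that only records what would be run."""

    def __init__(self):
        self.jobs = []

    def submit(self, script, params):
        self.jobs.append((script, params))
        return f"local-{len(self.jobs)}"


# Registry keyed by a configuration value. Azure, AWS, and Savio backends
# would be registered here under the same interface (not implemented).
BACKENDS = {"local": LocalBackend}


def get_backend(name):
    """Instantiate the backend named in configuration."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name}") from None
```

With this shape, switching infrastructures would be a one-line configuration change, e.g. `get_backend("local").submit("simulate.R", {"n": 1000})`, while the web tier stays unaware of where the R code runs.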
Scott suggests some Open Questions to discuss
** Cluster Computing **
- Discussion point: how much database chatter is there, and how scalable is that? Consider migrating from MySQL to PostgreSQL for db clustering. Greg and Scott get killed by latency (30 ms) when they have to generate a large number of rows, analyze them, and return a result within 60 s.
- Discussion point: isolation of Savio from the web. Krishna explains: web activity is not permitted to launch a Savio job. However, a long-running process can poll a database that is written to from the web, looking for events that trigger a new Savio job. Looking into a RESTful API to interact with the Slurm scheduler used for Savio (we have this working on another cluster currently).
- Synchrony is an issue for Connected Corridors, which has a real-time service. Not so much with the StudySimulator app: the assumption is that the user would receive results in an asynchronous notification (e.g., email).
- ... scale / sustainability / Savio Condo as a sustainability option / use the researcher's Amazon grant to do the computational steps as a means of containing costs to the project
- Question from Aron about generality of the project: how will these methodologies apply to study designs outside public health / criminal justice? Jen responds: a standalone simulator that addresses a wide range of disciplinary methods is possible, depending on what we're funded to develop -- we're starting with Public Health because that's where we are. The public health database that we have is applicable to studying health impacts well beyond criminal justice. (Aron's spin-byte: "a wizard for researchers" to assess their methodologies.)
- Q: How long does it take to get grants from cloud providers? For the StudySimulator project, 2 weeks for Azure; ~7 months for AWS. However, the structure of what's asked and how applications are processed is changing as the vendors tune their processes.
- Discussion of database contention and options to avoid it: from spinning up multiple databases (not a good option for several reasons, including orchestration overhead and the desire to be able to check whether an analysis has already been run, so as not to duplicate computation unnecessarily); to writing interim results to local nodes and writing only small metadata all the way to the db.
- Discussion of how cron or always-running jobs can run on one of the Master Nodes (not the compute nodes) as an intermediary between a web server and compute jobs on Savio.