When: Thursday, March 10 from noon - 1pm
Where: 200C Warren Hall, 2195 Hearst St (see building access instructions on parent page).
Event format: The reading group is a brown bag lunch (bring your own). This session will be an open discussion of ideas.
Partial list of featured participants: Elaine Angelino (AMPLab); Matthias Bussonnier (BIDS); Shreyas Cholia (LBNL/NERSC); Ryan Lovett (Statistics Compute Facility); Fernando Perez (BIDS); et al.
Facilitators: Kevin Koy, Patrick Schmitz
Research IT's Berkeley Research Computing (BRC) and the Berkeley Institute for Data Science (BIDS) will partner to deploy a new JupyterHub server that has significant computing capacity on the server itself, and can also submit compute-intensive tasks to the Comet super-computing resource at SDSC. The server is being provided by partners at the San Diego Supercomputer Center (SDSC) under the NSF grant supporting the Pacific Research Platform. BRC and BIDS will jointly support the server to facilitate experiments aimed at discovering how to constructively support JupyterHub technologies for use by campus researchers, and in particular how to scale the associated computational resources for JupyterHub users. These experiments will also relate to the BRC Program's current work to deploy a JupyterHub server that can "spawn" jobs into the campus's Savio HPC cluster. This reading group meeting will be a discussion among interested parties about what sort of experiments we might conduct, who is interested in participating, and more generally, how to best explore and leverage these ongoing and upcoming threads of work to support UC Berkeley's research community.
Prior to the meeting, participants are invited to:
Facilitating: Patrick Schmitz (Research IT); Kevin Koy (BIDS)
Aaron Culich, Research IT
Aron Roberts, Research IT
Barbara Gilson, SAIT
Bernard Li, LBNL/BRC
Camille Crittenden, CITRIS
Chris Hoffman, Research IT
Chris Paciorek, SCF/BRC
David Greenbaum, Research IT
Elaine Angelino, AMPLab
Fernando Perez, BIDS
Jack Shnell, IST-Storage
Jason Christopher, Research IT
John Lowe, Research IT
Jon Stiles, D-Lab
Ken Geis, COIS
Krishna Muriki, LBNL/BRC
Matthias Bussonnier, BIDS
Michael Jennings, LBNL/BRC
Patrick Schmitz, Research IT
Quinn Dombrowski, Research IT
Rick Jaffe, Research IT
Ron Sprouse, Linguistics
Ryan Lovett, SCF
Shreyas Cholia, LBNL/NERSC
Steve Masover, Research IT
Yong Qin, LBNL/BRC
==> PRP work includes building and distributing boxes, including DTNs and JupyterHub boxes (confusingly, both called "FIONA boxes"; note the JH boxes are not DTNs). Berkeley is getting one. It has significant compute and storage, and is configured to spawn jobs into the Comet HPC cluster at SDSC.
==> Question for this group to discuss: what shall we do with it? What sorts of experiments should we design to explore how best to serve campus researchers with a JupyterHub resource?
==> How do research needs for JupyterHub differ from (or resemble) the use cases in campus instruction (e.g., DS8)?
==> What should the researcher user experience be vis-a-vis spawning a job into an HPC cluster?
==> How to manage AuthN and access to the JupyterHub box that Berkeley will get from the PRP project.
==> How will we manage the experimental nature of this box -- e.g., that nothing on it will be backed up, it's not a candidate for storing essential data.
==> No local queue management / priority management for this box (which includes a couple of GPU processors). How to manage the local computation power on this box?
Fernando: brief intro to JupyterHub. "Notebook" concept will be renamed to "Jupyter Lab" (includes text editor, notebook environment, and shell access to underlying resource). Text support now covers both Markdown and LaTeX. Support for programming languages is extensible via kernel implementations; the project implements a reference kernel for Python. Cf. try.jupyter.org.
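As a rough illustration of the extensibility Fernando describes: Jupyter finds a kernel through a small "kernel spec" file (kernel.json) that says how to launch the kernel process. The sketch below only builds such a spec. The language name "mylang" and the launcher module "mylang_kernel" are hypothetical and were not discussed at the meeting; a real kernel would also have to implement the Jupyter messaging protocol.

```python
import json
import sys


def make_kernel_spec(display_name, language, launcher_module):
    """Return a dict in the layout Jupyter reads from a kernel.json file."""
    return {
        # Command used to start the kernel process. Jupyter replaces the
        # literal "{connection_file}" with the path of a JSON file that
        # holds the ports and signing key for this kernel session.
        "argv": [sys.executable, "-m", launcher_module, "-f", "{connection_file}"],
        "display_name": display_name,
        "language": language,
    }


# "mylang" and "mylang_kernel" are hypothetical names used for illustration.
spec = make_kernel_spec("MyLang (sketch)", "mylang", "mylang_kernel")
kernel_json = json.dumps(spec, indent=2)
print(kernel_json)
```

The point is that adding a language means supplying a launcher plus this small descriptor; the notebook front end itself does not change.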
Kevin: BIDS could be a convener of researchers in multiple disciplines, including many who have not traditionally used HPC clusters.
Yong: Current effort is at proposal stage. Provide JupyterHub "landing node" (interactive node) to allow users to log into Savio environment; do basic activities on the local box's resources (terminal interface); but also allow user to spawn heavy computation jobs into the Savio cluster's compute nodes.
Shreyas: Similar at NERSC, on the Cori system. Scratch and global file system access, and batch job access (SLURM); can process results either in real time or after job is completed. Not spawning, but dispatching to cluster resources; but expect to eventually spawn. Have been talking to Andrea Zonca re: SLURM spawner.
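To make the "spawn into the cluster" idea from Yong and Shreyas concrete, here is a minimal sketch of the kind of SLURM batch script a spawner might submit on a user's behalf. It only composes the script text. The partition name "savio", the user name, and the helper function are assumptions for illustration; an actual SLURM spawner would also pass hub credentials and environment to the job, which is omitted here.

```python
def build_sbatch_script(user, partition, minutes, nodes=1):
    """Compose a SLURM batch script that starts one user's notebook server."""
    if minutes <= 0 or nodes < 1:
        raise ValueError("minutes must be positive and nodes at least 1")
    hours, mins = divmod(minutes, 60)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name=jupyter-{user}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        # Wall-clock limit in hours:minutes:seconds form.
        f"#SBATCH --time={hours:02d}:{mins:02d}:00",
        "",
        "# Start the single-user notebook server on the allocated node.",
        "jupyterhub-singleuser",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical user and partition, for illustration only.
script = build_sbatch_script("alice", partition="savio", minutes=120)
print(script)
```

The resulting text would be handed to sbatch. The time limit matters for the user-experience questions above, since the notebook session ends when the job's allocation does.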
Fernando Perez notes on Google Drive
Patrick: Various phases of research within the JH notebook environment: from editing to large computational jobs. How to make that decision/transition point fit into researcher workflow smoothly.
Shreyas: Spawning to parallel job is different from running big compute job on a single big node.
Fernando: IPython Parallel -- some experience in this vein, spawning multiple Python processes controlled by a head node. This is specific to Python, not generalized across languages, but doing this is something we're considering. This IPython Parallel functionality permitted a research group to interactively monitor, steer, and visualize an MPI job running on a supercomputer -- among other use cases.
Chris: how does data transfer to spawned processes?
Fernando: There is a way to broadcast data to and from nodes (scatter/gather), but not MPI and therefore not efficient. More typical use case, assume the computation is being done on a resource that has access to large data storage (and data transfer resources), and that it will find the data there.
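The scatter/gather pattern Fernando mentions can be sketched in a few lines. IPython Parallel exposes scatter and gather operations on its engine views; the version below deliberately avoids that library and uses local threads as stand-ins for engines, so it shows only the data flow (split, compute per worker, reassemble), not the real API or its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(data, n_workers):
    """Split data into n_workers contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` workers each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def gather(chunks):
    """Concatenate per-worker results back into one list, in worker order."""
    return [item for chunk in chunks for item in chunk]


def square_all(chunk):
    """Stand-in for the work each engine would do on its own chunk."""
    return [x * x for x in chunk]


data = list(range(10))
chunks = scatter(data, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(square_all, chunks))
result = gather(partial_results)
print(chunks)
print(result)
```

Every item crosses the controller twice in this pattern, which is why it suits small inputs and why the more typical case has each worker read large data directly from shared storage.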
Jack: Network transfer? (PLS: PRP JH box ought to be installed network-close to Science DMZ, but not on it due to policy (what's permitted on Science DMZ). Machine is configured with 2 40-gig NICs.)
PLS: What are some experiments we might run with the FIONA box from PRP project? Could include spawning into Comet.
==> Ryan: experiment that examines how well Swarm and Kubernetes (http://kubernetes.io/) spawn jobs into the box's local resources, including GPUs. Michael Jennings believes that resources will be allocated first-come-first-served, so the question may involve additional resource-managing technology.
==> Ryan: AuthN -- CILogon ... (Fernando: someone has written a Shib authenticator as well). Fernando: raises question of whether non-Berkeley people can/should use the box, including industry people developing aspects of Jupyter. [Patrick suggests that guest ID is likely (if imperfect) workaround for non-Berkeley people.]
==> Fernando: In instructional context? [Not for DS8; but what if you have a class of 25 people, what's a good resource for that kind of group]
==> Ryan: Profile spawner (Michael Milligan is developer). Different environments spawned on different resources depending on context/access/role. PLS notes that this may touch on the user experience question.
==> Chris: Can this environment be used to coordinate data movement in and out? (Fernando asks Camille: are there use cases in PRP/CITRIS context. PLS: this box is not a DTN. Michael: sounded like expectation was that it would sit atop a parallel file system, where data would live -- and which would be supported by its own data transfer infrastructure.) Camille thinks there are a number of domains where we might find use cases. ---- Network - Data - Domain experiment.
==> Shreyas: lots of support needed to customize environment, bring in libraries specifically necessary to a given research project. (PLS: Docker spawner to isolate? Shreyas: still need to manage getting libraries needed into the virtualized environments.)
==> Michael: Experiment involving security. What happens when multiple users are running on a single resource? ZeroMQ not encrypted (signed but not encrypted, says Fernando; signature-key stored by user, protected by SSH-equivalent technology). Impersonation, command-injection, other issues? Not sure how this experiment might be constructed: but something having to do with JH concurrency.
==> Michael: What if users were able to spin up nodes that then went to sleep until and unless computation needed to be done? As a way to more smoothly manage transition between light load on JH box resources and heavier load appropriate for spawning into beefier nodes.
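On the security item in the list above ("signed but not encrypted"): the sketch below shows what that distinction means in practice, using an HMAC-SHA256 signature over the serialized parts of a message, which is the general scheme the Jupyter messaging protocol documents. The key and message contents are made up for illustration; a real session key comes from the kernel's connection file.

```python
import hashlib
import hmac
import json


def sign(key, parts):
    """Return the hex HMAC-SHA256 digest over the message parts, in order."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        mac.update(part)
    return mac.hexdigest()


def verify(key, parts, signature):
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(sign(key, parts), signature)


# Hypothetical key and message, for illustration only.
key = b"hypothetical-session-key"
header = json.dumps({"msg_type": "execute_request", "username": "alice"}).encode()
content = json.dumps({"code": "print('hi')"}).encode()
parts = [header, b"{}", b"{}", content]  # header, parent header, metadata, content
signature = sign(key, parts)

# Signed is not encrypted: anyone who can see the bytes can read the code.
readable = json.loads(content)["code"]

# But altering the content without the key makes verification fail.
tampered = parts[:3] + [json.dumps({"code": "print('tampered')"}).encode()]
print(readable, verify(key, parts, signature), verify(key, tampered, signature))
```

So on a shared box the signature guards against forged or altered messages only as long as the key stays private to the user, while confidentiality of the traffic has to come from somewhere else.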
PLS: Good set here, it need not be definitive.
Fernando: We need to know WHO are the researchers whose work will provide the context for these experiments.
Camille: 5/12 meeting at I House -- lunchtime roundtable on PRP on 5/11 -- if there's a preliminary set of findings to discuss, that might be of interest.
PLS: We have not talked with Jon (PRP) about how/when we get back to him with results/findings.
Kevin: Every other Thursday we have a meeting at BIDS with a set of researchers who might be candidates to participate in these experiments.
Aaron: D-Lab? Workshop? This box as venue to experiment. PLS: problem is that we don't have any support resources. Kevin: In June (7-9?), "Image Processing Across Domains" event (3-day), where 3rd day is a hands-on hacking day. Other similar events.
PLS: 6 months of experiments initially; then consider a phase II starting in Fall based on results.
Rick: A future question -- how can a service be developed around this type of offering.
PLS: Not ready to consider this now vis-a-vis the PRP's box. Yong's work is where we're headed in Savio context in the near term vis-a-vis service offering.
Fernando: Currently researchers just wait, lock up their local machine, when jobs become big. Because transition to another environment "is painful" ... solving this, making a smoother path, would be a genuine contribution.
Elaine: AMPLab experimentation in this direction (migration) on an Amazon resource.
Fernando: AMPLab might be an interesting group to talk to around recruiting researchers / use cases.
Michael: Concern that the user experience should not lead researchers to spawn jobs and leave them running, only to be "charged" for usage they weren't expecting (e.g., against FCA allocation).
[various]: strong interest in a "visible meter" being right in the user's field of view, so that this kind of unexpected 'error' and cost doesn't happen
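A "visible meter" of the kind requested could be as simple as a running estimate of elapsed time multiplied by cores and a charge rate. The sketch below assumes a flat rate per core-hour and invents the job name, core count, and rate; actual accounting rules for any given allocation may differ.

```python
from datetime import datetime, timedelta


def estimate_charge(start, now, cores, rate_per_core_hour):
    """Estimate usage so far as elapsed hours x cores x rate."""
    elapsed_hours = max((now - start).total_seconds() / 3600, 0)
    return elapsed_hours * cores * rate_per_core_hour


def meter_line(job_name, start, now, cores, rate_per_core_hour):
    """Format a one-line status suitable for showing next to a notebook."""
    hours = max((now - start).total_seconds() / 3600, 0)
    charge = estimate_charge(start, now, cores, rate_per_core_hour)
    return (f"{job_name}: running {hours:.1f} h on {cores} cores, "
            f"~{charge:.1f} service units so far")


# Hypothetical job: 20 cores at 1.0 unit per core-hour, running for 90 minutes.
start = datetime(2016, 3, 10, 12, 0)
now = start + timedelta(minutes=90)
line = meter_line("notebook-job", start, now, cores=20, rate_per_core_hour=1.0)
print(line)
```

Refreshing a line like this in the user's field of view is one way to make a forgotten, still-running job hard to miss.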
Fernando: freezing and moving containers ... terminal.com (Fernando is on the board of this) ... very fast ... proprietary ... if something like this were available to UCB researchers that would solve a lot of problems vis-a-vis smoothing the transition between differently-scaled resources.