Our next Research IT Reading Group will feature a presentation from Scott French and Richard Gerber of the National Energy Research Scientific Computing Center (NERSC) User Services Group. Scott and Richard will provide an overview of what's available at NERSC, who can get access, and how, as well as discuss opportunities for collaboration between campus and NERSC.
When: Thursday, February 12, from noon to 1pm
Where: 200C Warren Hall, 2195 Hearst St (see building access instructions on parent page).
Event format: The reading group is a brown bag lunch (bring your own) with a short ~20 min talk followed by ~40 min group discussion.
Please review the following in advance of the 2/12 meeting:
* General background on NERSC: http://www.nersc.gov/about/
* Computational systems offered: http://www.nersc.gov/users/computational-systems/
* Storage systems offered: http://www.nersc.gov/users/data-and-file-systems/file-systems/
* How the allocations process works: http://www.nersc.gov/users/accounts/allocations/
... especially, how to get your first allocation: http://www.nersc.gov/users/accounts/allocations/first-allocation/
... and eligibility: http://www.nersc.gov/users/accounts/allocations/overview/
* A bit of background on the consulting services offered by the User Services Group: http://www.nersc.gov/users/getting-help/consulting-services/
* Representative examples of user trainings and associated resources: http://www.nersc.gov/users/training/
Facilitator: Patrick Schmitz
Presenters: Richard Gerber & Scott French, NERSC
Attendees:
Aaron Culich (RIT)
Aron Roberts (RIT)
Chris Hoffman (RIT)
Chris Picoriek (SCF)
David Greenbaum (RIT)
James McCarthy (SSL)
Michael Jennings (RIT)
Patrick Schmitz (RIT)
Perry Willets (CDL)
Richard Gerber (NERSC)
Rick Jaffe (RIT)
Scott French (NERSC)
Scott Peterson (Doe Library)
Steve Masover (RIT)
Patrick: Framing questions for this discussion include what resources NERSC offers, and how researchers and research support staff can collaborate with NERSC, both in taking advantage of services and in sharing references, information, approaches to support, and referrals.
NERSC's mission is "To accelerate scientific discovery" for DOE-relevant scientific research, via high performance computing and data analysis: "Enabling users to tackle some of the most computationally challenging scientific problems today."
NERSC's mission is DOE-linked, whereas Argonne and Oak Ridge have broader latitude to choose projects to support. Medical research, personal health, and anything with PII are not in DOE's / NERSC's portfolio. On the other hand, genomics in a bio-energy context is in bounds. Argonne, Oak Ridge and similar labs solicit users who run large-scale jobs (many cores simultaneously).
The security model differs from most: access is controlled, but not as locked down as at some other labs. There is a lot of proactive monitoring of activity on NERSC hardware, in order to identify and curtail inappropriate use quickly.
DAG asks whether it's possible to produce an analog to the "Science View of Workflow" slide for Berkeley faculty who use NERSC services. Richard notes that QCD and fusion are probably not Berkeley fields; materials science and climate would likely be the heaviest areas of use by UCB researchers. [See comment at bottom of this page for Richard's diagram in response to this question.]
Discussion of software designed to accommodate failure. At NERSC, 1-2 nodes (of 6000) fail per day. Jobs designed to use large fractions of NERSC's available computational power tend not to be designed to recover gracefully from failure. This partly explains the relatively short duration of jobs running on tens of thousands of nodes.
Discussion of an effort to strip the Linux kernel down to the minimum number of libraries necessary to run a binary. This did in fact cut down on the "jitter" that system interrupts introduced into job runs, but in practice the inability to make some basic system calls made the nodes unusable, so capabilities were added back.
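To put rough numbers on the failure point above, here is a back-of-the-envelope sketch. It takes the quoted figure of 1-2 failures per day across 6000 nodes and assumes (this is an assumption, not something stated in the discussion) that failures are independent and arrive at a constant rate, i.e. a simple Poisson model. The function name and the 1.5 failures/day midpoint are illustrative choices.

```python
import math

TOTAL_NODES = 6000       # node count quoted in the discussion
FAILURES_PER_DAY = 1.5   # midpoint of the quoted 1-2 failures per day (assumption)

def job_failure_probability(nodes_used, hours,
                            failures_per_day=FAILURES_PER_DAY,
                            total_nodes=TOTAL_NODES):
    """Chance that at least one node assigned to a job fails during the run.

    Assumes independent failures at a constant rate (Poisson model),
    which is a simplification of real hardware behavior.
    """
    per_node_hourly_rate = failures_per_day / total_nodes / 24.0
    expected_failures = per_node_hourly_rate * nodes_used * hours
    return 1.0 - math.exp(-expected_failures)

# Compare a small long job, a half-machine short job, and a full-machine day-long job.
for nodes_used, hours in [(100, 24), (3000, 6), (6000, 24)]:
    p = job_failure_probability(nodes_used, hours)
    print(f"{nodes_used:>5} nodes for {hours:>2} h: P(failure) = {p:.3f}")
```

Under this model the failure risk grows with the product of node count and run time, which is consistent with very wide jobs being kept short when they cannot recover from a node loss.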
On the question of scaling up from laptop to Savio to NERSC: what are the obstacles?
Richard: Don't develop on Windows ;-) ... Socket communications between nodes might be a point of architectural change that gives trouble. Anything that has to swap modules in and out from disk (Python, R) tends to bottleneck on attempts to load .so files (shared libraries).
Michael J: Loading libraries is not an issue. Boot libraries are in RAM, tend to be cached, and are local to each node. If an app is made up of shared libraries, that may be a design error; but at the least you'll want to ensure they are loaded locally before jobs begin. Global scratch is where writes ought to be made. Loading happens at the beginning of a job, but reads and writes over NFS during a job put a significant load on the cluster.
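One way to picture "loaded locally before jobs begin" is a staging step that copies an application's shared libraries from a shared file system to node-local storage before the real work starts. The sketch below is only an illustration: the function name is hypothetical, and the temporary directories stand in for whatever shared and node-local paths a given cluster actually provides.

```python
import shutil
import tempfile
from pathlib import Path

def stage_shared_libraries(shared_dir, local_dir):
    """Copy shared libraries found under shared_dir into local_dir.

    Matches plain and versioned names (libfoo.so, libfoo.so.1) and
    returns the list of copied paths.
    """
    local = Path(local_dir)
    local.mkdir(parents=True, exist_ok=True)
    staged = []
    for candidate in sorted(Path(shared_dir).rglob("*")):
        if candidate.is_file() and ".so" in candidate.suffixes:
            target = local / candidate.name
            shutil.copy2(candidate, target)
            staged.append(target)
    return staged

# Demo with throwaway directories standing in for the shared and node-local file systems.
with tempfile.TemporaryDirectory() as shared, tempfile.TemporaryDirectory() as local:
    (Path(shared) / "libdemo.so").write_bytes(b"placeholder")
    (Path(shared) / "notes.txt").write_text("not a library")
    staged_names = [p.name for p in stage_shared_libraries(shared, local)]

print(staged_names)
```

How the application is then pointed at the local copies (for example through `LD_LIBRARY_PATH` on Linux) depends on the system and is not covered here.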
ESnet aids large data transfers between facilities.
Allocations: Program managers decide whether projects not funded by DOE are compatible with DOE office missions, and they are generous in trying to see linkage between such projects and DOE's mission. Allocation of 3bn hours this year; ~99%. An adjustment is made mid-year for two reasons: to ensure that allocations go to projects that will make use of them, and to avoid end-of-year rushes to use hours that can't be accommodated by the actual resources. Both scientific progress (value) and readiness to use the resources are considered in allocating resources. ALCC (ASCR Leadership Computing Challenge): large, competitive allocations; about 30% of applicants are selected for an allocation. Most allocations are granted in Aug/Sep, but program managers hold back a small quantity for projects that come in mid-year. Startup allocations (perhaps most interesting to Berkeley faculty not yet engaged with NERSC) are available to facilitate on-boarding, with a view to later running "much larger allocations."
PLS: Collaboration between NERSC and UCB's Research IT might be most interesting in this on-boarding of researchers at the edges of domains served by NERSC.
Scott French: Machine learning methods are an avenue for social sciences and other text analysis projects to become eligible for NERSC resources.
Perry Willets: Interested in the tape archive.
Scott: Where to keep large quantities of data for potential reproduction of research is always an issue.
Chris Hoffman: next reading group will be on "Research Data Storage: Parallel file systems for HPC, archival storage, & beyond"