SDSU 2007 Intro to HPC, Amit Majumdar (PDF, PPT)
Comparison of HPC policies at peer institutions
"UCB Research Computing High Performance Infrastructure" proposal
1. What is the "condo" HPC model? What are its strengths and weaknesses?
2. In the HPC policies article, one of the biggest questions for the condo model is what the "queue" or access policies should be. What seems to be the general trend among universities?
3. "Writing effective parallel applications is difficult". What sort of services do PIs need from central HPC support staff? What is the division of labor?
4. What sort of HPC infrastructure is needed on campus to support data science?
5. What is your assessment of the "UCB Research Computing High Performance Infrastructure" proposal?
- Tiered infrastructure? HPC is the "high" tier; some people need parallel processing, but not necessarily HPC
- HPC may not be appropriate for data science problems
- Condo model: a central group sponsors, manages, and subsidizes part of the cost of HPC resources; PIs add to those resources (choosing from a set of options); people who buy hardware get priority ("insurance") for the jobs they run
- Ops costs are high (the draft underestimates them); cover them from overhead funds, or try to recoup them?
- Some availability for people without their own contributions
- The proposal is thin on analysis of needs and costs
- Depreciation issues in accepting donation of equipment from UCOP
- Centralized storage service?
- Various small clusters exist (~25 a few years ago)
- Power, cooling, and space cost about half the hardware cost each year, in addition to staff costs
- Is the CIO eliminating recharge costs for the data center? We still need to understand what the costs are
- Why offer a new service, vs. Berkeley acting as a broker to existing services (LBL, supercomputing centers)?
- What are the alternatives to the condo model? People don't know about the options and don't participate; with enough friction, even free resources aren't worth using compared with Amazon
- Ease of bringing in applications
- Software consulting is one of the hardest needs to meet, and it is necessary for success
- Faculty who wrote letter interviewed colleagues at 5 other institutions
- Are PIs the right people for the governing council, or departments? Ops people?
- XSEDE offers 10k-node systems; are people who go out to the cloud doing "HPC"?
- Some classes of computing require a specific architecture; they need at least a sandbox for trying that architecture, one that can scale out to some kind of public infrastructure. Other work runs fine on cloud infrastructure
- What labels do we need to get people to participate who otherwise wouldn't? For traditional customers: what do they have now, and what do they want? Consider the future of Amazon
- Long queues in XSEDE mean local systems are needed for debugging (but XSEDE time is free despite the queues)
- A penny per hour works for small jobs, but for jobs that run for months it gets expensive (at that point, apply for XSEDE time, which is free)
- Job scheduling requires significant accounting work
- henyey cluster for astrophysics: different queues, and you can set up an account. A mini-condo?
- EECS: embedded software consultants in research groups; AMPLab is writing code for cancer genomics
- Not a consulting center where people come to you
- Departments as the basic unit of organization, especially if supporting access broadly (give non-traditional departments an allocation and see what they do with it)
- Parallels the access or governance model; distribute overhead
- LBL may not be willing to run an access model for the whole campus
- XSEDE is a victim of its own success: long queues make it poor for learning, students, etc.
- Is a model that works well at XSEDE's scale the model we need here? Not a new XSEDE node, but support people
- How much HPC do we need, vs support for people doing computationally intensive research, regardless of architecture?
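To make the condo priority rule concrete (buyers get priority on the hardware they bought, while others can use idle capacity), here is a minimal Python sketch. The Node/Job classes, the placement order, and the choice to preempt guest jobs on an owner's node are all hypothetical assumptions for illustration, not the proposal's policy or any real scheduler's behavior.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Job:
    job_id: int
    pi: str  # PI (research group) submitting the job


@dataclass
class Node:
    name: str
    owner: Optional[str]  # PI who bought the node; None = centrally funded (hypothetical)
    running: Optional[Job] = None


def place_job(nodes: List[Node], job: Job) -> Tuple[Optional[str], Optional[Job]]:
    """Return (node name, preempted job) for a new job, or (None, None) if it must wait.

    Hypothetical condo policy, in priority order:
      1. an idle node the submitting PI bought
      2. an idle centrally funded node
      3. the PI's own node running another group's guest job (the guest is preempted)
      4. any other idle node, used as a low-priority guest
    """
    for node in nodes:  # 1. own idle node
        if node.owner == job.pi and node.running is None:
            node.running = job
            return node.name, None
    for node in nodes:  # 2. idle shared node
        if node.owner is None and node.running is None:
            node.running = job
            return node.name, None
    for node in nodes:  # 3. owner's "insurance": reclaim a node from a guest
        if node.owner == job.pi and node.running is not None and node.running.pi != job.pi:
            preempted = node.running
            node.running = job
            return node.name, preempted
    for node in nodes:  # 4. borrow someone else's idle node
        if node.running is None:
            node.running = job
            return node.name, None
    return None, None  # everything busy: the job waits in the queue
```

For example, with one node bought by "alice" and one shared node, two jobs from "bob" fill both nodes, and a later job from "alice" reclaims her node by preempting bob's guest job. Whether owners should preempt guests or simply wait is exactly the kind of queue/access question raised in discussion question 2.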
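A rough worked version of the power/cooling/space estimate, taking "1/2 cost of hardware each year" at face value, with H the hardware purchase cost and an assumed (hypothetical) hardware lifetime of T = 4 years:

```latex
\begin{aligned}
C_{\text{facilities}} &= \tfrac{1}{2} H \cdot T = \tfrac{1}{2} H \cdot 4 = 2H \\
C_{\text{total}} &= H + C_{\text{facilities}} = 3H \qquad \text{(before staff costs)}
\end{aligned}
```

Under these assumptions, facilities alone cost twice the purchase price over the hardware's life, which is one reason the ops-cost estimate in the draft matters.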
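The penny-per-hour point is easy to check with arithmetic. This sketch assumes the rate is $0.01 per core-hour (the notes don't say whether it is per core or per node) and uses hypothetical job sizes; it works in integer cents to avoid floating-point rounding.

```python
def job_cost_cents(cores: int, hours: int, cents_per_core_hour: int = 1) -> int:
    """Cost of a job in cents at a flat per-core-hour rate (assumed 1 cent)."""
    return cores * hours * cents_per_core_hour


small = job_cost_cents(cores=1, hours=10)            # a short single-core job
long_run = job_cost_cents(cores=256, hours=24 * 90)  # ~3 months on 256 cores (hypothetical)
print(f"small job: ${small / 100:.2f}")     # prints "small job: $0.10"
print(f"long job:  ${long_run / 100:.2f}")  # prints "long job:  $5529.60"
```

The per-hour rate is negligible for the small job, but linear scaling in cores and hours turns a months-long run into thousands of dollars, which is the point at which a free XSEDE allocation becomes attractive despite the queues.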
Patrick's additional notes and research spurred by the readings
- A more important question than "what is the condo model?" is "what is HPC, how do we think about its different areas, and how does that fit into the story of supporting research and teaching?"
- Need to bring together people with interest in various aspects of this
- What happens when the deluge arrives and we have to tell people to wait for resources to be provisioned? Amazon is an instant solution and can scale incrementally. EECS has a production OpenStack cluster: develop for Amazon, then build something Amazon-compatible for cost containment. Azure is not in direct plans (HP, Dell, and Rackspace are all using OpenStack)
- Short survey of signatories: how is your system set up? What do you need? What filesystem? etc.
- Value in an interest group of people working on this (starting with Stanford); Aaron will bring in Harvard; potential for a Google Hangout with East Coast people
- The original condo model at Harvard no longer exists; the model didn't work well