
Readings:

- “Sustainability” – perspective, background & context (introductory slides from David Lifka, Cornell), discusses the 10 plagues that HPC and cyberinfrastructure facilities in general are facing

- the Executive Summary from the recently published XSEDE Cloud Survey Report. The full report is very valuable and is posted at
  https://www.ideals.illinois.edu/handle/2142/45766

Optional readings:

The full agenda for the CASC workshop contains links to most of the presentations:
  https://www.cac.cornell.edu/srccii/agenda.aspx

Here are a few presentations that were especially interesting:

- Models, challenges and opportunities at advanced computing facilities, Dr. Jan Odegard, Director, Ken Kennedy Institute, Rice University (nicely represents the theme discussed by many presenters, "what keeps me up at night")
  https://www.cac.cornell.edu/srccii/Decks/Odegard.pdf

- UCLA Perspective, Dr. Jim Davis, Vice Provost, Office of Information Technology, UCLA (see what our colleagues at UCLA are doing in this space)
  https://www.cac.cornell.edu/srccii/Decks/Davis.pdf

- The MGHPCC Data Center and Consortium, John Goodhue, Executive Director, Massachusetts Green HPC Center (multi-university collaboration in Massachusetts)
  https://www.cac.cornell.edu/srccii/Decks/Goodhue.pdf

- Models, challenges and opportunities. Administrative costs? Resources in greatest need? Dr. John Towns, Principal Investigator, XSEDE (what does the PI for the NSF Extreme Science and Engineering Discovery Environment worry about?)
  https://www.cac.cornell.edu/srccii/Decks/Towns.pdf

Notes

People were willing to share information, including funding, staffing, and challenges
NSF funded a pre-workshop on sustainable funding (but NSF representatives couldn't attend due to the government shutdown)

1. Which of David Lifka's plagues are most relevant to research computing at Berkeley? Are any plagues missing? That is, what should we be most concerned about here?

2. Regarding the XSEDE report on cloud computing, they identify some challenges that limit the use of these resources, emphasizing two issues in particular:
... "Executing a tightly coupled HPC application in a virtual machine environment may not be the best use of production resources. It is important to pick the environment best suited to your application. Time to access and overall cost-performance are other factors worth considering..."
... "Several survey respondents reported that they were surprised by the cost to move data when they received their monthly bill. Most cloud service providers charge by the GB to move data out of the cloud..."
Are these accurate, or is this report already out of date in this fast-moving area? What might this look like in two years?
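
For illustration, here is a minimal sketch of how per-GB egress charges add up over a month, the kind of bill that surprised survey respondents. The tier boundaries and prices below are made-up placeholders, not any provider's actual rates.

```python
# Rough estimate of monthly data-egress charges for a cloud workload.
# The tier sizes and per-GB prices are illustrative placeholders only,
# NOT any provider's actual rates; check current pricing.

ILLUSTRATIVE_TIERS = [
    # (tier size in GB, price per GB in USD); None = all remaining GB
    (1, 0.00),        # small free allowance (assumed)
    (10_240, 0.09),   # next ~10 TB (assumed)
    (40_960, 0.085),  # next ~40 TB (assumed)
    (None, 0.07),     # everything beyond that (assumed)
]

def egress_cost(gb_out, tiers=ILLUSTRATIVE_TIERS):
    """Estimated cost (USD) to move gb_out GB out of the cloud in a month."""
    remaining, cost = gb_out, 0.0
    for size, price in tiers:
        if remaining <= 0:
            break
        chunk = remaining if size is None else min(remaining, size)
        cost += chunk * price
        remaining -= chunk
    return cost

if __name__ == "__main__":
    for gb in (100, 1_000, 10_000, 50_000):
        print(f"{gb:>7,} GB out -> ~${egress_cost(gb):,.2f}")
```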

3. What else struck you in the CASC readings?

Gary Jung's group - 23k cores ("medium-large" size), separate from NERSC
CASC - good for benchmarking activities in size, effort, investment, etc.
What technologies are going mainstream
Not a lot of help from national orgs for individual institutions
Suggestion of reallocating IT resources from things IT no longer does, towards scientific computing

Discussion is around what presenters are doing in support of research computing across whole campus
Wide range of institutions, organizations
10-20% faculty in dual roles; also CIO types
Not a technical conversation, sociological/funding/relationship
"Plagues" are daunting, but emphasis wasn't on doom-and-gloom
Regardless of size, group of people going through same issues, want to work together
Disappointing not to have reps from NSF/NIH to talk about shaping funding; some radical suggestions for changing how money is doled out

UCLA - longstanding campuswide investments (Hoffman cluster)
Discussion of what keeps people up at night
Don't do a strategic planning effort; do a bottom-up campus coordination effort

Lots about relationships: faculty, staff, lack of empathy, what it takes
"rule of 7 touches" - going out and understanding people
importance of having long-term trusting relationships
Never enough collaboration, communication, funding, "free"

Infrastructure for supporting big data (outside of just storage)
People aren't sure what category genomics data falls into (not HIPAA, but what is it?)
Not a lot of clear guidance

2 genomics clusters - one up here
Folks at Santa Cruz think about privacy issues
Not dealing with gigantic data sets yet on that side, mostly manageable for now

Who's succeeding in sustainability?
No horror stories about losing funding
But as services grow under a high-touch model, they have to keep arguing for central investment
Johns Hopkins - "postmodern condo", reliable recurring costs, even as contributions increased
"sustainable" as "cost model"
LBNL has tried a number of different models, depends on what you want to accomplish
First made the institutional system free (to get people there), but always said they would have to charge at some point

Aggregation as key component of sustainability
Value for you to contribute your nodes to the cluster

Apprehension due to failure of New Mexico facility
NM - wanted to do HPC, funded a sizable system, but then didn't want to fund it anymore and started charging people to use it (this was the death of it)
Recently disassembled, remembered as a failure

Bandwidth costs a lot from commercial providers - is it really that expensive for them?
Vendor lock-in component

Can do HPC in the cloud, but it's hard to get the same experience as on dedicated hardware (though you can pay for it)
If you're building an application from scratch, build for scale/cloud, but build locally to constrain costs
Can get better cost utilization locally
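
A minimal sketch of the "build for scale, run locally" idea (not from the workshop): write the work as independent, stateless tasks and drive them through a generic executor, so the same code runs on a laptop today and can be pointed at a bigger pool or a cloud batch service later. The task function and workload here are hypothetical.

```python
# Sketch: express work as independent tasks driven by a generic executor.
# The same pattern runs on a laptop now and scales out later by swapping
# the executor; analyze_chunk is a hypothetical placeholder workload.

from concurrent.futures import ProcessPoolExecutor

def analyze_chunk(chunk_id):
    """One independent, stateless unit of work (embarrassingly parallel)."""
    total = 0.0
    for i in range(1, 100_000):
        total += (chunk_id % 7 + 1) / i
    return total

def run(chunk_ids, max_workers=None):
    """Fan the chunks out across whatever pool of workers is available."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_chunk, chunk_ids))

if __name__ == "__main__":
    results = run(range(8), max_workers=4)  # small and local today
    print(f"processed {len(results)} chunks")
```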

If you're not thinking of scaling out, and you build your application that way, you never get there w/o fully re-engineering
"learning trough", "learning squiggle" - world is moving around you in a learning trough, stuck with compilers from 10 years ago

What is a reasonable scale in the near term? Lots of business questions - how many people will use it, making enough investment without over-investing, etc.
As cloud matures, can get out within a few years
Tell researchers to build for the cloud - you can make the leap as the cost goes down (scaling back locally is just an economic decision / price point)
Some things went to HPC just because it was there (embarrassingly parallel)
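
A back-of-the-envelope sketch of the sizing question above: compare the amortized local cost per used core-hour with a cloud per-core-hour price and find the break-even utilization. Every number below is an illustrative placeholder, not real pricing from any vendor or site.

```python
# Back-of-the-envelope local vs. cloud cost comparison. All inputs are
# illustrative placeholders, not real vendor or site pricing.

HOURS_PER_YEAR = 8760

def local_cost_per_core_hour(hardware_cost, cores, lifetime_years,
                             annual_ops_cost, utilization):
    """Amortized cost per used core-hour of a locally owned cluster."""
    annual_capital = hardware_cost / lifetime_years
    used_core_hours = cores * HOURS_PER_YEAR * utilization
    return (annual_capital + annual_ops_cost) / used_core_hours

def breakeven_utilization(hardware_cost, cores, lifetime_years,
                          annual_ops_cost, cloud_price_per_core_hour):
    """Utilization above which the local cluster beats the cloud price."""
    annual_capital = hardware_cost / lifetime_years
    return (annual_capital + annual_ops_cost) / (
        cores * HOURS_PER_YEAR * cloud_price_per_core_hour)

if __name__ == "__main__":
    # Hypothetical inputs: a $1.5M, 5,000-core system over 5 years with
    # $300k/year for power, space, and staff, vs. $0.05 per cloud core-hour.
    local = local_cost_per_core_hour(1_500_000, 5_000, 5, 300_000, 0.7)
    be = breakeven_utilization(1_500_000, 5_000, 5, 300_000, 0.05)
    print(f"local cost at 70% utilization: ${local:.3f}/core-hour")
    print(f"break-even utilization vs. cloud: {be:.0%}")
```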

Architecture consulting as they're building things - out in the domains
Hard problems in scaling computation - conversations (human problem too)

 
