Please join the Research IT Reading Group for a discussion with David Schlegel about future astronomical surveys.
David Schlegel, Senior Scientist, Physics Division, LBNL; and BIDS Senior Fellow
Please review the following prior to the July 16th meeting:
When: Thursday, July 16 from noon - 1pm
Where: 200C Warren Hall, 2195 Hearst Ave (see building access instructions on parent page).
Event format: The reading group is a brown bag lunch (bring your own) with a short ~20 min talk followed by ~40 min group discussion.
Presenting: David Schlegel (LBNL)
Attending:
Steve Carrier (School of Education)
Jason Christopher (Research IT)
Aaron Culich (Research IT)
Quinn Dombrowski (Research IT)
Barbara Gilson (IST)
David Greenbaum (Research IT)
Chris Hoffman (Research IT)
Rick Jaffe (Research IT)
Michael Jennings (Research IT/LBNL)
Gary Jung (Research IT)
James McCarthey (SSL)
Aron Roberts (Research IT)
Patrick Schmitz (Research IT)
Jamie Wittgenstein (Research IT)
Camille Villa (Research IT)
- What constitutes "big data" as researchers in different disciplines conceive it (e.g., is "big" simply "bigger than I have resources to handle"?)
- What does / can / should the campus offer to support data storage, movement, and analysis on the scale David is describing (sky survey data), and what should major research centers or discipline-based organizations host?
Gary: Dr. David Schlegel is a Sr Scientist at LBNL and a Senior Fellow at BIDS. He received his PhD at Berkeley, was at Princeton for a time after earning his PhD, and returned to Berkeley in 2004.
David Schlegel presentation (link to slides as PDF):
Will talk about his research; about whether astronomical data sets are "big"; and about data problems that may be big by any measure.
Dark Energy accounts for ~70% of the energy density of the universe, but was unknown until 1998. Discovered because distant supernovae appeared a little too faint from Earth-based observations. We don't know what Dark Energy is, but what we can measure is that there's too much "space" between us and distant objects, which implies that something is causing the universe to accelerate more quickly than it would if there were not "dark energy."
BOSS @Sloan Telescope has gathered the best data to date for measuring dark energy; it's also the most broadly used astronomical data set (in part because we have made it more easily accessible than others).
YouTube video: "A Flight Through the Universe" -- https://www.youtube.com/watch?v=08LBltePDZw
Candidate explanations of Dark Energy: our theory of gravity is wrong (though Einstein probably wasn't); a new field constant in time and space (vacuum energy); a new field that is dynamic in time and space.
Successor to the Sloan telescope is DESI, the Dark Energy Spectroscopic Instrument, at Kitt Peak, Arizona. Not only big, but "way cool." Will start operations in 2019. Will expand the map from about 1/2% of the visible universe to about 5% -- which is actually pretty big.
Before we make 3D images we have to make 2D images of the sky from which we can pick out which objects we want to map 3-dimensionally.
Largest image of the sky to date is SDSS (1998-2008), ~1 Tpix (terapixel), containing 250 million galaxies and 250 million stars. Legacy Imaging Survey (2014-2018) will represent objects that are 10x fainter than SDSS: legacysurvey.org/viewer. LSST will be bigger yet.
And the question: will LSST data be "big data"? Moore's law in this context: storage capacity on magnetic media (hard drives) doubles every ~1.6 years -- will this data still be big by the time it is collected and needs to be stored?
Looking back, evidence tells us that "peak storage" occurs 2.3 years into any survey.
Doing the arithmetic, LSST at its most aggressively estimated crunch point in 2023 fits on 22 hard drives (assuming Moore's Law holds with respect to storage capacity). That's big, but not big relative to what we've dealt with in the past. Pushback has been offered in response to this assertion: e.g., network speeds don't go up as fast as storage; but DS doesn't think (based on his look at the numbers) that the speed-growth curve for networks is all that different from storage.
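The drive-count arithmetic can be sketched as follows. The 1.6-year doubling time is from the talk; the 2015 baseline of 8 TB per drive and the ~5.6 PB dataset size are hypothetical values chosen only to reproduce the quoted 22-drive figure:

```python
def drives_needed(data_tb, year, base_year=2015, base_drive_tb=8.0,
                  doubling_years=1.6):
    """Hard drives needed to store data_tb terabytes in a given year,
    assuming per-drive capacity doubles every doubling_years
    (the storage scaling rate cited in the talk)."""
    capacity_tb = base_drive_tb * 2 ** ((year - base_year) / doubling_years)
    return data_tb / capacity_tb

# Hypothetical ~5.6 PB dataset at the 2023 crunch point:
print(round(drives_needed(5600, 2023)))  # -> 22
```

The point of the exercise is that under exponential capacity growth, a fixed survey dataset keeps shrinking relative to contemporary hardware.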
Optical image capturing size/capacity isn't scaling as fast as storage/compute. DS thinks that astronomers (including himself) haven't been doing as good a job as possible of pushing the technology. [David Greenbaum asks whether other industries -- military, surveillance -- *are* scaling more quickly. DS says that telemetry -- radio transmission of data -- seems to be the rate-limiting factor for space missions.] If you could collect all data on all photons with the LSST instrument, that would be 6.9e16 photons ≈ 550 PB/night. But that level of completeness in data collection is not going to happen soon; we're not even close.
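The ~550 PB/night figure can be reproduced with simple arithmetic; the 8 bytes per photon detection used here is an assumption (chosen because it makes the numbers match), not a figure from the talk:

```python
def nightly_petabytes(photons_per_night=6.9e16, bytes_per_photon=8):
    """PB/night if every photon detection were recorded; bytes_per_photon
    is an assumed encoding size, chosen to match the quoted ~550 PB/night."""
    return photons_per_night * bytes_per_photon / 1e15

print(f"{nightly_petabytes():.0f} PB/night")  # -> 552 PB/night
```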
Standard data reduction of astronomical data: "catalog reductions." Catalogs are then matched to others derived from different images using same/different instruments ... then we ask questions based on what information is available from those matched catalogs. This mode of inquiry is limited by mismatches between catalogs, as well as spurious data/objects in catalogs (e.g., reflections in the optics).
DS vision for future astronomical survey data: forward-model all raw data. Ask scientific questions of the raw data, not the (corrupted) derivatives such as matched catalogs. Eliminate the data pipelines that create loss of 'true' information in the data being queried.
A "hard" problem: the Near Earth Object search (there was a congressional mandate in 1998 to find 90% of "killer asteroids" -- the rocks that stand a decent chance of colliding with Earth). The current approach to this problem is fundamentally the same as that used in 1890: seek differences by eye between two plates (only the comparison is done digitally now).
4e16 positions in the asteroid belt to resolve, even without accounting for 3 dimensions of velocity (by which this number needs to be multiplied). Chelyabinsk asteroid was at about the threshold of detectability.
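To see why adding velocity makes the search explode, multiply the position count by a 3-D velocity grid. The 4e16 position cells are from the talk; the 100 bins per velocity axis is purely illustrative:

```python
def orbit_hypotheses(position_cells=4e16, velocity_bins_per_axis=100):
    """Total cells to test once the 3 velocity dimensions are included.
    position_cells is from the talk; the velocity binning is an assumption."""
    return position_cells * velocity_bins_per_axis ** 3

print(f"{orbit_hypotheses():.1e}")  # -> 4.0e+22
```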
Quinn: tools vs. methods. It seems methods haven't scaled as tools have. Why is that?
DS: People are used to dealing with catalogs, and they are useful in meaningful applications. Utility depends on what you're doing. But you are right, methodology has not scaled. Part of it is that fewer and fewer people are trained to deal with data as it comes off the telescope.
Aaron: How much of data collection is constrained by storage capacity?
DS: Conventionally what we have easy access to is derivatives. It's worth mentioning that the "pipelining" processes may not be as reproducible as one would like. Not reproducible because of a 'sociological problem' -- not because there is a technical barrier. It's not how people are used to working.
David Greenbaum: What one thing would you like to have that you don't have now, in terms of infrastructure?
DS: That's a question I think we're going to hit in the next several months. At the scale of data sets we're talking about -- every pixel ever collected -- the model testing for the "killer asteroid" problem involves so many permutations and factors in phase space that it will probably become big and hard. Perhaps I ought to come back for another discussion then...