Globus and Box provide opportunities to move data between cyberinfrastructure (CI) components and to automate manual data-transfer processes. We will look at examples of how each is being used by Berkeley researchers, discuss possibilities for improving and extending these examples, and talk about some limitations of the current versions of the Globus and Box SDKs (Software Development Kits).
When: Thursday, Jan 26, 2017 from 12 - 1pm
Where: 200C Warren Hall, 2195 Hearst St (see building access instructions on parent page).
What: Integrating Globus and Box in research workflows
Presenting: Maurice Manning, Research IT
Please (lightly) review prior to the meeting:
For those who would like to look at scripts or Jupyter Notebooks, these resources might be of interest (optional):
Presenting: Maurice Manning, Research IT
Aaron Culich, Research IT
Andy Lyons, UC Division of Ag Resources
Anna Sackmann, Library
Aron Roberts, Research IT
Barbara Gilson, SAIT
Bill Allison, IST-API & Campus CTO
Chris Hoffman, Research IT
Deb Mccaffey, Research IT
Erik Latrope, Network
Jason Christopher, Research IT
Jon Hayes, bConnected
John Lowe, Research IT
Kelly Rowland, Nuclear Engineering & Research IT
Krishna Muriki, LBNL & Research IT
Leon Wong, Security
Michael Chang, EDW
Patrick Schmitz, Research IT
Perry Willett, CDL
Quinn Dombrowski, Research IT
Rick Jaffe, Research IT
Steve Masover, Research IT
Yong Qin, LBNL & Research IT
Next Week: How should NSF support cyberinfrastructure for the next decade?
Maurice did not go over the SDKs in detail, but demonstrated use of Globus (globus.org/app/transfer)
Use cases: instruments that collect large data files and need to push them up to Savio for computation. A similar case: many small text files that need to be moved to compute or storage.
Globus Python script:
o watch a (local/instrument) folder; when files matching specified file name patterns change, use Globus to transfer them from the (local/instrument) folder to an appropriate directory on Savio.
o for AuthN, use a RefreshToken, which permits a transfer job to run for a long period of time (weeks or months, even)
o must pre-authenticate to Savio in the Globus web client before running the Python script (the token obtained in this process is valid for four days) [Patrick suggests adding a test to the script that tells users if they're not logged in]
o could monitor deletions (the Globus SDK permits this) and mirror them on the destination storage, but the script currently does not do this
o need a Globus client ID (obtained from Globus app developer interface)
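The steps listed above could be sketched roughly as follows. This is not Maurice's script; it is a minimal illustration that assumes the third-party `watchdog` and `globus-sdk` packages (written against the globus-sdk 1.x to 3.x style, where `TransferData` takes the client as its first argument). The file patterns, local folder, Savio directory, and all function names are hypothetical placeholders.

```python
import fnmatch
import posixpath
from pathlib import Path

# Hypothetical placeholders, not real paths or patterns from the meeting.
PATTERNS = ["*.tif", "*.csv"]
LOCAL_ROOT = Path("/data/instrument")
REMOTE_ROOT = "/global/scratch/username/incoming"


def matches(path, patterns=PATTERNS):
    """Return True if the file name matches any of the glob patterns."""
    name = Path(path).name
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def remote_path(local_path, local_root=LOCAL_ROOT, remote_root=REMOTE_ROOT):
    """Map a file under local_root to the same relative path under remote_root."""
    rel = Path(local_path).relative_to(local_root)
    return posixpath.join(remote_root, rel.as_posix())


def make_transfer_client(client_id, refresh_token):
    """Build a TransferClient whose access token is renewed from a refresh token."""
    import globus_sdk  # third-party: pip install globus-sdk

    auth_client = globus_sdk.NativeAppAuthClient(client_id)
    authorizer = globus_sdk.RefreshTokenAuthorizer(refresh_token, auth_client)
    return globus_sdk.TransferClient(authorizer=authorizer)


def submit_file(tc, src_endpoint, dst_endpoint, local_path):
    """Submit a one-file Globus transfer and return its task ID."""
    import globus_sdk

    tdata = globus_sdk.TransferData(
        tc, src_endpoint, dst_endpoint, label="instrument upload"
    )
    tdata.add_item(str(local_path), remote_path(local_path))
    return tc.submit_transfer(tdata)["task_id"]


def watch(tc, src_endpoint, dst_endpoint, folder=LOCAL_ROOT):
    """Start watching `folder`; return the running observer (call .stop() to end)."""
    from watchdog.events import FileSystemEventHandler  # pip install watchdog
    from watchdog.observers import Observer

    class Handler(FileSystemEventHandler):
        def _maybe_send(self, event):
            # Ignore directories and files that do not match the patterns.
            if not event.is_directory and matches(event.src_path):
                submit_file(tc, src_endpoint, dst_endpoint, event.src_path)

        on_created = _maybe_send
        on_modified = _maybe_send

    observer = Observer()
    observer.schedule(Handler(), str(folder), recursive=True)
    observer.start()
    return observer
```

A caller would build the client with `make_transfer_client(CLIENT_ID, REFRESH_TOKEN)` and pass it to `watch(...)` along with the two endpoint IDs. The sketch submits one transfer per file event, which is simple but may be wasteful for many small files; batching several items into one `TransferData` would be a natural refinement.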
Multiple steps are required to get this started (see above). Depending on the friction involved, this can be off-putting:
==> the script is started in the morning and shut down at night, in person rather than remotely; in one case this covers several instruments in several different buildings, which creates logistical friction
==> how to organize the data once it's moved up, and how access permissions on that data are set, given which user is logged in to Savio in the script's context
Rick raises use case for sensor data collection that generates ~1000 files/day collected on a campus-local computer, and researchers want to make a 'protection' copy in the cloud.
Patrick asks about command line vs. GUI: are there users who don't want to see a command line, and can Globus work through a GUI? Maurice suggests that he could run the Globus SDK from an IPython notebook, but has not done that yet.
Yong asks how the script detects partially-written files when the write from the instrument takes a while. [Maurice doesn't recall how the folder-watching module, watchdog, handles that, but it does. There's a use case involving appending to a file that hasn't been resolved yet.]
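One common way to guard against partially-written files, independent of whatever watchdog does internally, is to wait until a file's size and modification time stop changing before transferring it. The sketch below uses only the standard library; the function name and default values are illustrative assumptions, and it does not solve the append use case (a file that is appended to only occasionally can look stable between writes).

```python
import os
import time


def wait_until_stable(path, interval=1.0, checks=3, timeout=60.0):
    """Poll `path` until its size and mtime are unchanged for `checks`
    consecutive comparisons. Return True if it settled, False on timeout."""
    deadline = time.monotonic() + timeout
    last = None      # (size, mtime) seen on the previous poll
    stable = 0       # number of consecutive unchanged comparisons
    while time.monotonic() < deadline:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            # File vanished or not created yet: start over.
            last, stable = None, 0
        else:
            current = (st.st_size, st.st_mtime_ns)
            if current == last:
                stable += 1
                if stable >= checks:
                    return True
            else:
                last, stable = current, 0
        time.sleep(interval)
    return False
```

A watcher could call `wait_until_stable(event.src_path)` before submitting the transfer and skip or retry the file when it returns False.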
Box Python SDK in IPython Notebook
o need to run a script on Savio to generate and store a RefreshToken for Box in an appropriate Savio directory; this is then referenced/read by the IPython notebook script
o need a Box client ID (obtained from Box app developer interface)
o for software that is complex to set up (e.g., Tesseract, OCR software -- Adam Anderson use case), setup in a Singularity container makes the setup 'ceremony' easily portable/reusable
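The token handling described in this list might look roughly like the following. It is a sketch rather than the notebook's actual code: it assumes the third-party `boxsdk` package, and the token file location and function names are hypothetical. Because Box issues a new refresh token each time one is used, the `store_tokens` callback writes the new pair back to the file.

```python
import json
import os
from pathlib import Path

# Hypothetical location; the notes only say "an appropriate Savio directory".
TOKEN_FILE = Path.home() / ".box_tokens.json"


def save_tokens(access_token, refresh_token, path=TOKEN_FILE):
    """Write the token pair to `path`, created with owner-only permissions.
    (The 0o600 mode applies only when the file is newly created.)"""
    fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        json.dump({"access_token": access_token, "refresh_token": refresh_token}, fh)


def load_tokens(path=TOKEN_FILE):
    """Return (access_token, refresh_token) read from `path`."""
    data = json.loads(Path(path).read_text())
    return data["access_token"], data["refresh_token"]


def make_box_client(client_id, client_secret, path=TOKEN_FILE):
    """Build a Box client from stored tokens, persisting refreshed tokens."""
    from boxsdk import Client, OAuth2  # third-party: pip install boxsdk

    access_token, refresh_token = load_tokens(path)
    oauth = OAuth2(
        client_id=client_id,
        client_secret=client_secret,
        access_token=access_token,
        refresh_token=refresh_token,
        # Called by the SDK with the new (access_token, refresh_token) pair.
        store_tokens=lambda a, r: save_tokens(a, r, path),
    )
    return Client(oauth)
```

The one-time script on Savio would finish its login flow by calling `save_tokens(...)`; the notebook would then only need `make_box_client(CLIENT_ID, CLIENT_SECRET)`.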
Box to Savio, run computation on Savio, obtain results -- all from within the IPython notebook. But here there are many setup stages that make the top of the notebook look complex and, to some, intimidating.
IPython Notebook contains some utilities, e.g., for
==> getting IDs from Box that correspond to folders and files (must specify that ID, rather than file/folder name, to identify an object to Box for transfer)
==> converting a PDF to a series of PNG files (Tesseract performs OCR on images)
==> assembling the OCR output (one text file per page) into properly ordered single files that correspond to the input PDFs
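Two of these utilities can be sketched with the standard library alone. The first looks up a Box ID by name among items that expose `.name`, `.id`, and `.type` attributes, as Box SDK items do; the second joins per-page text files in numeric page order. The `<stem>-<page>.txt` naming convention and both function names are assumptions made for illustration, not the notebook's actual conventions.

```python
import re
from pathlib import Path


def find_item_id(items, name, item_type=None):
    """Return the id of the first item called `name` (optionally restricted
    to item_type "file" or "folder"), or None if there is no match."""
    for item in items:
        if item.name == name and (item_type is None or item.type == item_type):
            return item.id
    return None


# Assumed page-file naming: <stem>-<page number>.txt, e.g. report-12.txt
_PAGE_RE = re.compile(r"^(?P<stem>.+)-(?P<page>\d+)\.txt$")


def assemble_pages(page_dir, out_dir):
    """Join per-page text files into one <stem>.txt per input document,
    ordered by numeric page number. Return the list of files written."""
    groups = {}
    for path in Path(page_dir).iterdir():
        m = _PAGE_RE.match(path.name)
        if m:
            groups.setdefault(m["stem"], []).append((int(m["page"]), path))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for stem, pages in sorted(groups.items()):
        # Sort on the integer page number so page 10 follows page 9, not page 1.
        pages.sort(key=lambda pair: pair[0])
        out = out_dir / f"{stem}.txt"
        out.write_text("\n".join(p.read_text() for _, p in pages))
        written.append(out)
    return written
```

With a Box client from boxsdk 2.x or later, a call such as `find_item_id(client.folder(folder_id="0").get_items(), "scans", "folder")` would search the root folder (ID "0") for a folder named "scans"; the folder name here is hypothetical.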
Discussion on the "Pythonness" and stability -- are there parts of these disparate implementations that are common and reusable across research use cases? And are the people who are using these capable of following (or even modifying?) the Python code -- as opposed to employing a programmer or Python-wise researcher?
Yong points out that Jupyter widgets might be used as wrappers for some of these code modules.
Erik: Use a DTN (data transfer node) for Box transfers if better transfer performance is desired.