Conversation
hrodmn commented on 2026-03-19T01:38:46Z:

Should we target the Hub for this instead of the ADE?

HarshiniGirish commented on 2026-03-23T15:34:00Z:

When I run the notebook on the Hub it fails due to a Python package compatibility issue in the Hub environment. A dependency fails when the notebook tries to import:

earthaccess <- import("earthaccess")
maap_module <- import("maap.maap", convert = FALSE)
MAAP <- maap_module$MAAP
maap <- MAAP()
hrodmn commented on 2026-03-19T01:38:47Z:

Great use of the

I don't love that we have to download these entire granule files to work on them in R. You could add something like "if your workflow is taking too long due to the download process, consider using the Python workflow" and then link to the NISAR Python notebook.
hrodmn commented on 2026-03-19T01:38:47Z:

This is a really nice snippet, but we should update it to clip out a specific area of interest (in projected coordinates) rather than grid-cell indexes.
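For reference, a projected-coordinate clip along the lines suggested here could look roughly like the following. This is a hedged sketch, not the notebook's actual code: the file name and the UTM extent values are placeholders.

```r
# Hypothetical sketch: clip to an area of interest given in projected
# (map) coordinates instead of grid-cell indexes. All values are placeholders.
library(terra)

r <- rast("nisar_hhhh.tif")  # placeholder raster, e.g. one layer pulled from the GCOV file

# AOI in the raster's projected CRS: xmin, xmax, ymin, ymax (placeholders)
aoi <- ext(580000, 600000, 4140000, 4160000)

# Clamp the AOI to the raster bounds, then crop
aoi <- intersect(aoi, ext(r))
clipped <- crop(r, aoi)
```

Clamping with intersect() before crop() avoids errors when the requested AOI extends outside the raster bounds.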
hrodmn
left a comment
@HarshiniGirish really nice job on this one. It is succinct and to the point, and the formatting is so clean 🫶 .
I have a few change requests:
- Replace the local-download method with some kind of cloud-native data access path. I know support for reading from S3 in R is not great, but I think there is a solution out there.
- Change the subset operation at the end to use projected coordinates instead of grid cell indexes. This might be a bit of work, but it will be what users want to be able to do.
For the cloud-optimized read solution there are a few possibilities:
- Use rhdf5 instead of hdf5r: see https://huber-group-embl.github.io/rhdf5/articles/rhdf5_cloud_reading.html
- Maybe we could use GDAL drivers via the terra package to load the file lazily (without downloading the entire file), but I am not really sure how well terra handles the complex HDF5 data structure.
I tried this:

library(terra)

vsis3_path <- "/vsis3/sds-n-cumulus-prod-nisar-products/NISAR_L2_GCOV_BETA_V1/NISAR_L2_PR_GCOV_002_109_D_063_4005_DHDH_A_20251012T182508_20251012T182531_X05010_N_P_J_001/NISAR_L2_PR_GCOV_002_109_D_063_4005_DHDH_A_20251012T182508_20251012T182531_X05010_N_P_J_001.h5"

# got temporary credentials from a Python session
setGDALconfig("AWS_SECRET_ACCESS_KEY", "...")
setGDALconfig("AWS_ACCESS_KEY_ID", "...")
setGDALconfig("AWS_SESSION_TOKEN", "...")
setGDALconfig("AWS_REGION", "us-west-2")

# Enable the virtual file system cache
setGDALconfig("VSI_CACHE", "TRUE")
# Set the size of that cache (e.g., 500 MB);
# this prevents re-downloading the same blocks during analysis
setGDALconfig("VSI_CACHE_SIZE", "500000000")
# Increase the global block cache (the default is usually too small);
# this can be a percentage of RAM or a specific byte value
setGDALconfig("GDAL_CACHEMAX", "20%")

cube <- sds(vsis3_path)
cube
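For the rhdf5 route from the first bullet above, a minimal sketch might look like the following. This is an untested assumption-laden sketch: it assumes an rhdf5 build with the S3 virtual file driver enabled, the URL and dataset path are placeholders, and I am not certain the `s3credentials` list accepts a session token, which may matter for these temporary credentials.

```r
# Hypothetical rhdf5 cloud-read sketch (untested against NISAR data).
library(rhdf5)

creds <- list(
  aws_region        = "us-west-2",
  access_key_id     = "...",  # temporary credentials, e.g. from the s3credentials endpoint
  secret_access_key = "..."
)

# Placeholder HTTPS URL for the granule object
s3_url <- "https://<bucket>.s3.us-west-2.amazonaws.com/<granule>.h5"

# List the file structure without downloading the whole granule
h5ls(s3_url, s3 = TRUE, s3credentials = creds)

# Read a single dataset (the HDF5 path here is a placeholder)
hh <- h5read(s3_url, "/science/LSAR/GCOV/grids/frequencyA/HHHH",
             s3 = TRUE, s3credentials = creds)
```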
@hrodmn thanks for the feedback! I also wanted to mention the main issues I ran into while making these updates. The initial rhdf5 cloud-read approach was not opening the authenticated S3-backed HDF5 file reliably, so I moved to the GDAL /vsis3/ + terra route instead. After that, terra::sds() was able to reach the file, but it produced extent-mismatch warnings because the GCOV HDF5 contains many datasets that do not all behave like one aligned raster stack. I also hit a layer-selection mismatch at one point: the HHHH layer existed, but the matching logic was too strict. Finally, the AOI initially extended outside the raster bounds, so I constrained it to the valid extent.
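The last two fixes described here (looser layer matching and clamping the AOI) could be sketched like this. It is a hedged reconstruction, not the notebook's actual code: the path and extent values are placeholders.

```r
# Hypothetical sketch: tolerant HHHH layer selection plus AOI clamping.
library(terra)

vsis3_path <- "/vsis3/<bucket>/<granule>.h5"  # placeholder; see the earlier snippet
cube <- sds(vsis3_path)

# Match the HHHH layer by substring rather than exact equality, since
# subdataset names carry long HDF5 path prefixes
hh_idx <- which(grepl("HHHH", names(cube), fixed = TRUE))
hh <- cube[[hh_idx[1]]]

# Constrain the AOI (placeholder coordinates) to the raster's valid extent
aoi <- ext(580000, 600000, 4140000, 4160000)
aoi <- intersect(aoi, ext(hh))
hh_aoi <- crop(hh, aoi)
```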
cc: @wildintellect
Since reticulate is having trouble on the Hub, let's use a different approach for EDL authentication and granule search. As @wildintellect mentioned yesterday, there is an R package called earthdatalogin (https://boettiger-lab.github.io/earthdatalogin/index.html) that can be used for EDL purposes. I could not get its built-in functions to fetch S3 credentials for NISAR, but here is what I came up with for a reticulate-free solution:

library(earthdatalogin)
library(httr2)
library(rstac)
# get S3 credentials using httr2::request with EDL token set in auth header
edl_token <- edl_set_token()
resp <- request("https://nisar.asf.earthdatacloud.nasa.gov/s3credentials") |>
req_auth_bearer_token(edl_token) |>
req_perform()
creds <- resp_body_json(resp)
# search the ASF STAC endpoint for a NISAR granule
items <- stac("https://cmr.earthdata.nasa.gov/stac/ASF") |>
stac_search(
collections = "NISAR_L2_GCOV_BETA_V1_1",
limit=1
) |>
get_request() |>
items_next()
item <- items$features[[1]]
# get the S3-prefixed asset href
s3_asset_key <- names(item$assets)[startsWith(names(item$assets), "s3")]
s3_link <- item$assets[[s3_asset_key]]$href
vsis3_path <- sub("^s3://", "/vsis3/", s3_link)

Now I am getting stuck when trying to read the /vsis3 link with terra on the Hub (this was working last week...). I don't know if it is related, but when I check terra's GDAL version I get 3.8.4, which is not the same as what I get when I run
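To confirm whether that version mismatch is real, one quick check (assuming the command-line GDAL tools are on the PATH) is to compare the GDAL that terra is linked against with the system one:

```r
# Compare terra's bundled/linked GDAL with the GDAL on the system PATH
library(terra)

terra::gdal()                  # version of GDAL that terra was built with
system("gdalinfo --version")   # version of the command-line GDAL tools
```

If the two differ, terra and the CLI tools may behave differently against the same /vsis3/ path.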
https://gist.github.com/HarshiniGirish/7f401d4feaa9ff4b0e4df436d06b27d3 (will set up a proper PR once the methodology is finalized)

Thank you @hrodmn. I was able to authenticate with Earthdata, request temporary ASF S3 credentials, query the ASF STAC collection, and correctly resolve the actual .h5 science asset instead of the browse/thumbnail asset. I also confirmed that the /vsis3/ path and HDF5 subdatasets are valid through GDAL.

The main issue is that direct streamed HDF5 reads are not reliable in the current notebook R/terra runtime, even though the remote file itself is accessible. Because of that, I cleaned up the notebook so it now handles authentication, STAC lookup, asset selection, and subdataset path construction in R, and then prints the exact GDAL commands needed to stream only the required variables and create small output rasters for use in the notebook. I chose this approach because it keeps the workflow cloud-based without downloading the full .h5, while avoiding the runtime limitations we kept hitting with direct terra access.

That said, this workflow is not very straightforward in the current notebook environment. I would prefer an easier and more stable approach if possible; looking forward to your feedback.
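As an illustration of the "print the exact GDAL commands" step, a sketch like the following could assemble a gdal_translate call that streams a single subdataset into a small GeoTIFF. The subdataset path, output name, and AOI corners are placeholders; note that -projwin expects upper-left then lower-right corners.

```r
# Hypothetical sketch: build a gdal_translate command that streams one HDF5
# subdataset from S3 and writes a small clipped GeoTIFF. Placeholders throughout.
vsis3_path <- "/vsis3/<bucket>/<granule>.h5"
subdataset <- sprintf('HDF5:"%s"://science/LSAR/GCOV/grids/frequencyA/HHHH', vsis3_path)

# AOI corners in the raster's projected CRS (placeholders)
xmin <- 580000; xmax <- 600000; ymin <- 4140000; ymax <- 4160000

# -projwin takes ulx uly lrx lry
cmd <- sprintf("gdal_translate -projwin %.0f %.0f %.0f %.0f '%s' hhhh_subset.tif",
               xmin, ymax, xmax, ymin, subdataset)
cat(cmd, "\n")
```

Printing the command keeps the notebook cloud-based while letting the heavy read run through the CLI GDAL rather than the R runtime that has been unreliable.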