
NISAR Access with R #574

Open

HarshiniGirish wants to merge 3 commits into MAAP-Project:develop from HarshiniGirish:nisar_r

Conversation

@HarshiniGirish
Collaborator

No description provided.

@HarshiniGirish HarshiniGirish requested a review from hrodmn March 18, 2026 18:39
@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@review-notebook-app

review-notebook-app bot commented Mar 19, 2026

View / edit / reply to this conversation on ReviewNB

hrodmn commented on 2026-03-19T01:38:46Z
----------------------------------------------------------------

Should we target the Hub for this instead of the ADE?


HarshiniGirish commented on 2026-03-23T15:34:00Z
----------------------------------------------------------------

I was running the notebook on the Hub and it was failing due to a Python package compatibility issue in the Hub environment. A dependency tries to import `ssl` from `urllib3.util.ssl_`, but the installed urllib3 version in the notebook environment no longer exposes that symbol when I run the cell below:

```r
earthaccess <- import("earthaccess")

maap_module <- import("maap.maap", convert = FALSE)

MAAP <- maap_module$MAAP
maap <- MAAP()
```
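If the failure is the known incompatibility between older dependencies and urllib3 2.x (which dropped some `urllib3.util.ssl_` symbols), one possible workaround is pinning urllib3 below 2.0 in the Python environment that reticulate uses. This is a sketch, not a confirmed fix:

```r
library(reticulate)

# Pin urllib3 to the 1.x series in the active Python environment
# (assumes the import failure is the urllib3 2.x incompatibility)
py_install("urllib3<2", pip = TRUE)

# Restart the kernel, then retry the imports
earthaccess <- import("earthaccess")
maap_module <- import("maap.maap", convert = FALSE)
```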

@review-notebook-app

review-notebook-app bot commented Mar 19, 2026

View / edit / reply to this conversation on ReviewNB

hrodmn commented on 2026-03-19T01:38:47Z
----------------------------------------------------------------

Great use of the tempdir() here. I wonder what the impact of downloading many gigabytes to the /tmp directory on the Hub is though. @wildintellect might know.

I don't love that we have to download these entire granule files to work on them in R. You could add something like "if your workflow is taking too long due to the download process consider using the Python workflow" and then link to the NISAR Python notebook.


@review-notebook-app

review-notebook-app bot commented Mar 19, 2026

View / edit / reply to this conversation on ReviewNB

hrodmn commented on 2026-03-19T01:38:47Z
----------------------------------------------------------------

This is a really nice snippet but we should update this to clip out a specific area of interest (in projected coordinates) rather than grid cell indexes.


Contributor

@hrodmn hrodmn left a comment


@HarshiniGirish really nice job on this one. It is succinct and to the point, and the formatting is so clean 🫶.

I have a few change requests:

  1. Replace the local full-download method with some kind of cloud-native data access path. I know support for reading from S3 in R is not great, but I think there is a solution out there.
  2. Change the subset operation at the end to use projected coordinates instead of grid cell indexes. This might be a bit of work, but it will be what users want to be able to do.

For the cloud-optimized read solution there are a few possibilities:

Use rhdf5 instead of hdf5r: see https://huber-group-embl.github.io/rhdf5/articles/rhdf5_cloud_reading.html

Maybe we could use GDAL drivers via the terra package to load the file lazily (without downloading the entire file), but I am not really sure how well terra handles the complex HDF5 data structure.

I tried this:

```r
library(terra)

vsis3_path <- "/vsis3/sds-n-cumulus-prod-nisar-products/NISAR_L2_GCOV_BETA_V1/NISAR_L2_PR_GCOV_002_109_D_063_4005_DHDH_A_20251012T182508_20251012T182531_X05010_N_P_J_001/NISAR_L2_PR_GCOV_002_109_D_063_4005_DHDH_A_20251012T182508_20251012T182531_X05010_N_P_J_001.h5"

# got creds from a python session
setGDALconfig("AWS_SECRET_ACCESS_KEY", "...")
setGDALconfig("AWS_ACCESS_KEY_ID", "...")
setGDALconfig("AWS_SESSION_TOKEN", "...")
setGDALconfig("AWS_REGION", "us-west-2")

# Enable the virtual file system cache
setGDALconfig("VSI_CACHE", "TRUE")

# Set the size of that cache (e.g., 500 MB);
# this prevents re-downloading the same blocks during analysis
setGDALconfig("VSI_CACHE_SIZE", "500000000")

# Increase the global block cache (default is usually too small);
# this can be a % of your RAM or a specific byte value
setGDALconfig("GDAL_CACHEMAX", "20%")

cube <- sds(vsis3_path)
cube
```
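If `sds()` does open the file, the individual HDF5 subdatasets can be inspected and a single polarization pulled out by name. This is a sketch: the exact GCOV subdataset names (e.g. `HHHH` under `/science/LSAR/GCOV/grids/frequencyA/`) are assumptions that should be verified with `describe()`:

```r
library(terra)

# List the subdataset descriptions without reading any pixels
# (vsis3_path and GDAL config as defined above)
info <- describe(vsis3_path)

s <- sds(vsis3_path)
names(s)  # HDF5 subdataset names

# pick the subdataset whose name contains "HHHH" (loose match on purpose)
idx <- which(grepl("HHHH", names(s)))[1]
hh <- s[idx]
hh
```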

Copy link
Collaborator Author

I am running the notebook on the hub it was failing due to a Python package compatibility issue in the Hub environment. A dependency is trying to import ssl from urllib3.util.ssl_, but the installed urllib3 version in the notebook environment no longer exposes that symbol when I try to run the below cell

earthaccess <- import("earthaccess")

maap_module <- import("maap.maap", convert = FALSE)

MAAP     <- maap_module$MAAP

maap     <- MAAP()


View entire conversation on ReviewNB

@HarshiniGirish
Collaborator Author

@hrodmn thanks for the feedback!
I've now implemented the requested notebook changes. I replaced the local full-download workflow with a cloud-native access path using direct S3 access through GDAL `/vsis3/` and terra, and I updated the subsetting step to use an AOI in projected coordinates instead of grid cell indices. I also added a note in the notebook that if the R workflow is slow or unstable, users should consider the Python NISAR workflow instead.

I also wanted to mention the main issues I ran into while making these updates. The initial rhdf5 cloud-read approach was not opening the authenticated S3-backed HDF5 file reliably, so I moved to the GDAL /vsis3/ + terra route instead. After that, terra::sds() was able to reach the file, but it produced extent-mismatch warnings because the GCOV HDF5 contains many datasets that do not all behave like one aligned raster stack. I also ran into a layer-selection mismatch at one point, where the HHHH layer existed but the matching logic was too strict. On top of that, the AOI initially extended outside the raster bounds, so I constrained it to the valid extent.
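For reference, the projected-coordinate subsetting and valid-extent constraint can be sketched like this with terra (the AOI numbers and the `hh` raster are placeholders; the key point is intersecting the AOI with the raster's extent before cropping):

```r
library(terra)

# hh: a SpatRaster loaded from the GCOV subdataset (placeholder)
# AOI in the raster's projected CRS (values are illustrative only)
aoi <- ext(400000, 420000, 3800000, 3820000)

# constrain the AOI to the raster's valid extent to avoid out-of-bounds errors
aoi <- intersect(aoi, ext(hh))

hh_aoi <- crop(hh, aoi)
```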

@HarshiniGirish HarshiniGirish requested a review from hrodmn March 23, 2026 15:56
@HarshiniGirish
Collaborator Author

cc: @wildintellect

@hrodmn
Contributor

hrodmn commented Mar 24, 2026

Since reticulate is having trouble on the Hub, let's use a different approach for EDL authentication and granule search.

As @wildintellect mentioned yesterday, there is an R package called earthdatalogin (https://boettiger-lab.github.io/earthdatalogin/index.html) that can be used for EDL purposes. I could not get its built-in functions to work for getting S3 credentials for NISAR, but here is what I came up with for a reticulate-free solution:

```r
library(earthdatalogin)
library(httr2)
library(rstac)

# get S3 credentials using httr2::request with the EDL token set in the auth header
edl_token <- edl_set_token()

resp <- request("https://nisar.asf.earthdatacloud.nasa.gov/s3credentials") |>
  req_auth_bearer_token(edl_token) |>
  req_perform()

creds <- resp_body_json(resp)

# search the ASF STAC endpoint for a NISAR granule
items <- stac("https://cmr.earthdata.nasa.gov/stac/ASF") |>
  stac_search(
    collections = "NISAR_L2_GCOV_BETA_V1_1",
    limit = 1
  ) |>
  get_request() |>
  items_next()

item <- items$features[[1]]

# get the S3-prefixed asset href
s3_asset_key <- names(item$assets)[startsWith(names(item$assets), "s3")]
s3_link <- item$assets[[s3_asset_key]]$href

vsis3_path <- sub("^s3://", "/vsis3/", s3_link)
```
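One missing piece between the credential request and the terra read may be exporting the temporary credentials as GDAL config options. The JSON field names below (`accessKeyId`, etc.) are assumptions based on the usual Earthdata s3credentials response and should be checked against the actual payload:

```r
library(terra)

# hand the temporary ASF S3 credentials to GDAL before opening /vsis3/ paths
# (field names assumed; inspect `creds` to confirm)
setGDALconfig("AWS_ACCESS_KEY_ID", creds$accessKeyId)
setGDALconfig("AWS_SECRET_ACCESS_KEY", creds$secretAccessKey)
setGDALconfig("AWS_SESSION_TOKEN", creds$sessionToken)
setGDALconfig("AWS_REGION", "us-west-2")
```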

Now I am getting stuck when trying to read the /vsis3 link with terra on the Hub (this was working last week...).

```
> cube <- terra::sds(vsis3_path)
Error: [sds] file does not exist: /vsis3/sds-n-cumulus-prod-nisar-products/NISAR_L2_GCOV_BETA_V1/NISAR_L2_PR_GCOV_002_109_D_064_4005_DHDH_A_20251012T182530_20251012T182605_X05010_N_F_J_001/NISAR_L2_PR_GCOV_002_109_D_064_4005_DHDH_A_20251012T182530_20251012T182605_X05010_N_F_J_001.h5
In addition: Warning message:
A header you provided implies functionality that is not implemented (GDAL error 17)
```

I don't know if it is related, but when I check terra's GDAL version I get 3.8.4, which is not the same as what I get when I run `gdalinfo --version` in the Hub's terminal. I am able to open the same /vsis3 path (with proper S3 credentials set) directly with gdalinfo.
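The two GDAL versions can be compared directly from R: terra reports the library it is linked against, while `system()` queries the CLI on the PATH:

```r
library(terra)

gdal()                         # GDAL version terra is linked against
system("gdalinfo --version")   # GDAL CLI version on the Hub's PATH
```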

@HarshiniGirish
Collaborator Author

https://gist.github.com/HarshiniGirish/7f401d4feaa9ff4b0e4df436d06b27d3 (will open a proper PR once the methodology is finalised)

Thank you @hrodmn!

I was able to authenticate with Earthdata, request temporary ASF S3 credentials, query the ASF STAC collection, and correctly resolve the actual .h5 science asset instead of the browse/thumbnail asset. I also confirmed the /vsis3/ path and HDF5 subdatasets are valid through GDAL.

The main issue is that direct streamed HDF5 reads are not reliable in the current notebook R/terra runtime, even though the remote file itself is accessible. Because of that, I cleaned the notebook so it now handles authentication, STAC lookup, asset selection, and subdataset path construction in R, and then prints the exact GDAL commands needed to stream only the required variables and create small output rasters for use in the notebook. I chose this approach because it keeps the workflow cloud-based without downloading the full .h5, while avoiding the runtime limitations we kept hitting with direct terra access.

What I found is that this workflow is not very straightforward in the current notebook environment. I would prefer an easier and more stable approach if possible; looking forward to your feedback.
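For illustration, here is a sketch of the kind of GDAL command the notebook could print to stream a single variable and write a small output raster. The HDF5 subdataset path and the `-projwin` window are hypothetical placeholders, not the real GCOV layout:

```r
# Build a gdal_translate call that streams only one subdataset from S3
# and writes a small GeoTIFF; every path and number here is a placeholder.
sd_path <- paste0('HDF5:"', vsis3_path, '"://science/LSAR/GCOV/grids/frequencyA/HHHH')

# -projwin takes ulx uly lrx lry in the raster's projected CRS
cmd <- sprintf(
  'gdal_translate -projwin %d %d %d %d "%s" hhhh_subset.tif',
  400000L, 3820000L, 420000L, 3800000L, sd_path
)
cat(cmd, "\n")
system(cmd)
```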
