Remote files and fsspec caching¶

Author: Luis Lopez @betolink

When working with large binary files stored in the cloud, tools like fsspec play a critical role in abstracting file access. These tools are input/output drivers to file systems. Libraries that understand how to read these binary formats do not know anything about remote files and fsspec is in charge of translating byte requests to remote byte requests (HTTP).

Most of the file formats in the scientific world were not designed for efficient cloud access as they require non-sequential reads, meaning client libraries can jump to specific byte ranges to read only what's needed, like a row group in a Parquet file or a variable slice in NetCDF. However, the way fsspec fetches data from remote sources is controlled by its caching strategy. This strategy determines how data is fetched, stored locally, and re-used. If this strategy isn't properly aligned with the access patterns of the file format, performance and reliability can (and will) quickly degrade.

Imagine reading a NetCDF file hosted on S3 with a default block cache of 5 MB. If a query requires multiple small reads scattered throughout the file, and each of these reads falls in a different 5 MB block, fsspec will repeatedly download new blocks, resulting in excessive data transfer and latency. Worse, if the file is very large and caching is disabled or limited (e.g. using simplecache with no chunk awareness), access may become painfully slow — or even fail if many tiny reads cause timeouts or bandwidth limits to be exceeded.

By tuning the cache size, choosing the right caching backend (like blockcache or filecache), and understanding how your format is read by your analysis tools (e.g., xarray, pandas, pyarrow), you can greatly reduce redundant downloads, improve throughput, and make interactive analysis over the cloud viable.

Aligning cache strategies on a per-file basis is not a simple task, therefore earthaccess won't try to fine tune your requests except for changing the default behavior in fsspec from readahead to blockcache that offers better access patterns and a more efficient cache invalidation strategy for binary files. Based on file size earthaccess will cache starting at 4MB up to 16MB for very large files (more than 1GB).

The max in-memory caching size will be 16MBx32 blocks = 512MB per file. All of this is configurable at runtime with the new open_kwargs parameter. Some examples of the usage of this new config dictionary are detailed in this notebook.

Efficiency: Avoids repeated downloads of the same data during random-access reads.
Performance: Reduces latency and bandwidth usage, especially for cloud-hosted files.
Scalability: Supports parallel or repeated access patterns without overloading remote servers.

NetCDF and HDF5 files¶

A considerable amount of remote sensing data is stored in HDF5 and NetCDF formats. They are hierarchical, self-describing formats widely used in Earth science due to their ability to store multidimensional arrays, metadata, and complex data structures efficiently. However, their design assumes relatively fast random access to local storage, which becomes a bottleneck when accessed over high-latency networks like HTTP or S3.

HDF5 and NetCDF files are typically organized into chunks—contiguous blocks of data that are compressed and stored separately. When reading a subset of a dataset (e.g., a time slice or spatial region), only the relevant chunks should be fetched. But without proper caching, each byte-range request for a chunk can result in a separate HTTP GET request, leading to thousands of small requests (Ro to Rn in the figure) for large analyses. The diagram below illustrates how traditional HDF5 access results in scattered, inefficient requests, while cloud-optimized layouts enable sequential or sparse access with fewer round trips:

No description has been provided for this image

This is where fsspec's blockcache becomes essential. By aligning the cache block size with the HDF5/NetCDF chunk size (or a multiple thereof), we minimize redundant fetches and ensure that once a chunk is downloaded, it remains available for reuse during subsequent accesses. For example, if a NetCDF variable uses 1 MB chunks, a 4–16 MB block cache ensures that multiple chunks are retained per block, reducing round trips.

Additionally, because HDF5 files often contain shared metadata (e.g., dimensions, attributes, group structures), caching the beginning of the file (where some of this metadata resides) can dramatically speed up repeated opens or inspections. earthaccess leverages blockcache with sensible defaults to preserve both data and metadata across requests, making workflows using xarray or h5py over S3 much more performant.

Cloud optimized files (Zarr / COG / etc / Co-HDF5)¶

As cloud-native workflows mature, new formats and adaptations have emerged to address the limitations of traditional binary formats in object storage environments. These cloud-optimized formats are designed for efficient, parallel, and distributed access over HTTP/S3 with minimal latency.

Zarr: A chunked, compressed, and directory-structured format that natively supports cloud storage. Each chunk is stored as a separate object, enabling true parallel I/O and eliminating the need for byte-range requests. When used with fsspec, Zarr datasets can be read efficiently even with minimal caching, since each chunk is an independent resource.
Cloud Optimized GeoTIFF (COG): A GeoTIFF variant with internal tiling and overviews organized so that key image metadata is at the front of the file, and tiles are laid out sequentially. This allows fast previews and spatial subsetting via byte-range requests without downloading the entire file.
Co-HDF5 (Cloud-Optimized HDF5): An emerging pattern where HDF5 files are rechunked and reorganized to improve cloud access performance. This includes placing metadata at the beginning, aligning chunks to network block boundaries, and minimizing cross-chunk references.

These formats work best when paired with intelligent caching strategies. While Zarr benefits less from large block caches (since each chunk is already a discrete object), COG and Co-HDF5 still rely heavily on byte-range requests and thus benefit significantly from blockcache with appropriately sized blocks.

Credit: Earthmover "What is Zarr"

Ultimately, the key to high-performance cloud data access lies in matching the caching strategy to the data layout, access pattern, and underlying format. With earthaccess, we aim to provide defaults that "just work" while giving advanced users full control via open_kwargs to tune behavior as needed.

💡 Pro Tip: Always inspect your data’s chunking structure (e.g., using zarr.tree() or h5py.File(...).visititems()) before setting cache parameters. Aligning cache and chunk sizes can yield 10x+ performance improvements.

xarray and `earthaccess.open()`¶

xarray has become the de facto library to open and analyze n-dimensional scientific data in Python, thus the use of fsspec to open our remote files has the advantage of being interoperable with any xarray backend(or file engine) that can accept "file-like" objects.

In [1]:

Copied!

import earthaccess as ea
import xarray as xr

ea.login()
import earthaccess as ea
import xarray as xr

ea.login()

Out[1]:

<earthaccess.auth.Auth at 0x7b4706c50ad0>

In [2]:

Copied!





results = ea.search_data(
    short_name="ATL03",
    version="006",
    cloud_hosted=True,
    temporal=("2025-01", "2025-08"),
    bounding_box=(33.98, 31.19, 34.65, 31.51),
    count=1,
)
results[0]
results = ea.search_data(
    short_name="ATL03",
    version="006",
    cloud_hosted=True,
    temporal=("2025-01", "2025-08"),
    bounding_box=(33.98, 31.19, 34.65, 31.51),
    count=1,
)
results[0]

Out[2]:

Data: ATL03_20250104003608_02892602_006_01.h5

Size: 517.13 MB

Cloud Hosted: True

Now that we have results from CMR we can open the files, in this case a single HDF5 level 2 (non gridded) swath. One thing that this particular collection will highlight is that when dealing with non cloud-optimized files each dimension and variable will add overhead on the range requests. ATL03 contains more than 900 of them, and if we use libraries like xarray that try eagerly to fetch metadata to decode variables and dimensions our open_dataset() operation in xarray will be exponentially slow.

Let's open the file and inspect the new caching defaults from earthaccess!

In [3]:

Copied!

file_handler = ea.open(results)[0]  # only one item in the list
file_handler.f.cache
file_handler = ea.open(results)[0]  # only one item in the list
file_handler.f.cache

Out[3]:

        <BlockCache:
            block size  :   8388608
            block count :   65
            file size   :   542254177
            cache hits  :   0
            cache misses:   0
            total requested bytes: 0>

Here we are seeing the new caching defaults when we open files using earthaccess, instead of readahead we are using the blockcache and earthaccess adjusted the block size to 8MB. This will helps us not re-fetching data when we in turn use xr.open_dataset()

In [4]:

Copied!

%%time

# no engine defined, xarray will guess based on file type in this case h5netcdf
ds = xr.open_dataset(file_handler, group="/gt1l/heights")
ds
%%time

# no engine defined, xarray will guess based on file type in this case h5netcdf
ds = xr.open_dataset(file_handler, group="/gt1l/heights")
ds

CPU times: user 545 ms, sys: 178 ms, total: 723 ms
Wall time: 33.3 s

Out[4]:

<xarray.Dataset> Size: 280MB
Dimensions:         (delta_time: 5595411, ds_surf_type: 5)
Coordinates:
  * delta_time      (delta_time) datetime64[ns] 45MB 2025-01-04T00:36:08.1495...
    lat_ph          (delta_time) float64 45MB ...
    lon_ph          (delta_time) float64 45MB ...
Dimensions without coordinates: ds_surf_type
Data variables:
    dist_ph_across  (delta_time) float32 22MB ...
    dist_ph_along   (delta_time) float32 22MB ...
    h_ph            (delta_time) float32 22MB ...
    pce_mframe_cnt  (delta_time) uint32 22MB ...
    ph_id_channel   (delta_time) uint8 6MB ...
    ph_id_count     (delta_time) uint8 6MB ...
    ph_id_pulse     (delta_time) uint8 6MB ...
    quality_ph      (delta_time) int8 6MB ...
    signal_conf_ph  (delta_time, ds_surf_type) int8 28MB ...
    weight_ph       (delta_time) uint8 6MB ...
Attributes:
    Description:  Contains arrays of the parameters for each received photon.
    data_rate:    Data are stored at the photon detection rate.

xarray.Dataset

Dimensions:
- delta_time: 5595411
- ds_surf_type: 5
Coordinates: (3)
- delta_time
  (delta_time)
  datetime64[ns]
  2025-01-04T00:36:08.149526176 .....
  contentType :
  referenceInformation
  description :
  The transmit time of a given photon, measured in seconds from the ATLAS Standard Data Product Epoch. Note that multiple received photons associated with a single transmit pulse will have the same delta_time. The ATLAS Standard Data Products (SDP) epoch offset is defined within /ancillary_data/atlas_sdp_gps_epoch as the number of GPS seconds between the GPS epoch (1980-01-06T00:00:00.000000Z UTC) and the ATLAS SDP epoch. By adding the offset contained within atlas_sdp_gps_epoch to delta time parameters, the time in gps_seconds relative to the GPS epoch can be computed.
  long_name :
  Elapsed GPS seconds
  source :
  Operations
  standard_name :
  time
```
array(['2025-01-04T00:36:08.149526176', '2025-01-04T00:36:08.149626208',
       '2025-01-04T00:36:08.149726176', ..., '2025-01-04T00:44:35.167339200',
       '2025-01-04T00:44:35.183339168', '2025-01-04T00:44:35.196339200'],
      shape=(5595411,), dtype='datetime64[ns]')
```
- lat_ph
  (delta_time)
  float64
  ...
  contentType :
  modelResult
  description :
  Latitude of each received photon. Computed from the ECF Cartesian coordinates of the bounce point.
  long_name :
  Latitude
  source :
  ATL03g ATBD, Section 3.4
  standard_name :
  latitude
  units :
  degrees_north
  valid_max :
  90.0
  valid_min :
  -90.0
```
[5595411 values with dtype=float64]
```
- lon_ph
  (delta_time)
  float64
  ...
  contentType :
  modelResult
  description :
  Longitude of each received photon. Computed from the ECF Cartesian coordinates of the bounce point.
  long_name :
  Longitude
  source :
  ATL03g ATBD, Section 3.4
  standard_name :
  longitude
  units :
  degrees_east
  valid_max :
  180.0
  valid_min :
  -180.0
```
[5595411 values with dtype=float64]
```
Data variables: (10)
- dist_ph_across
  (delta_time)
  float32
  ...
  contentType :
  modelResult
  description :
  Across-track distance projected to the ellipsoid of the received photon from the reference ground track. This is based on the Along-Track Segment algorithm described in Section 3.1.
  long_name :
  Distance off RGT.
  source :
  ATL03 ATBD, Section 3.1
  units :
  meters
```
[5595411 values with dtype=float32]
```
- dist_ph_along
  (delta_time)
  float32
  ...
  contentType :
  modelResult
  description :
  Along-track distance in a segment projected to the ellipsoid of the received photon, based on the Along-Track Segment algorithm. Total along track distance can be found by adding this value to the sum of segment lengths measured from the start of the most recent reference groundtrack.
  long_name :
  Distance from equator crossing.
  source :
  ATL03 ATBD, Section 3.1
  units :
  meters
```
[5595411 values with dtype=float32]
```
- h_ph
  (delta_time)
  float32
  ...
  contentType :
  physicalMeasurement
  description :
  Height of each received photon, relative to the WGS-84 ellipsoid including the geophysical corrections noted in Section 6. Please note that neither the geoid, ocean tide nor the dynamic atmosphere (DAC) corrections are applied to the ellipsoidal heights.
  long_name :
  Photon WGS84 Height
  source :
  ATL03g ATBD, Section 3.4
  standard_name :
  height
  units :
  meters
```
[5595411 values with dtype=float32]
```
- pce_mframe_cnt
  (delta_time)
  uint32
  ...
  contentType :
  referenceInformation
  description :
  The major frame counter is read from the digital flow controller in a given PCE card. The counter identifies individual major frames across diag and science packets. Used as part of the photon ID.
  long_name :
  PCE Major frame counter
  source :
  Retained from prior a_alt_science_ph packet
  units :
  counts
```
[5595411 values with dtype=uint32]
```
- ph_id_channel
  (delta_time)
  uint8
  ...
  contentType :
  referenceInformation
  description :
  Channel number assigned for each received photon event. This is part of the photon ID. Values range from 1 to 120 to span all channels and rise/fall edges. Values 1 to 60 are for falling edge; PCE1 (1 to 20), PCE 2 (21 to 40) and PCE3 (41 to 60). Values 61 to 120 are for rising edge; PCE1 (61 to 80), PCE 2 (81 to 100) and PC3 (101 to 120).
  long_name :
  Receive channel id
  source :
  Derived as part of Photon ID
  units :
  1
  valid_max :
  120
  valid_min :
  1
```
[5595411 values with dtype=uint8]
```
- ph_id_count
  (delta_time)
  uint8
  ...
  contentType :
  referenceInformation
  description :
  The photon event counter is part of photon ID and counts from 1 for each channel until reset by laser pulse counter.
  long_name :
  photon event counter
  source :
  Derived as part of Photon ID
  units :
  counts
```
[5595411 values with dtype=uint8]
```
- ph_id_pulse
  (delta_time)
  uint8
  ...
  contentType :
  referenceInformation
  description :
  The laser pulse counter is part of photon ID and counts from 1 to 200 and is reset for each new major frame.
  long_name :
  laser pulse counter
  source :
  Derived as part of Photon ID
  units :
  counts
```
[5595411 values with dtype=uint8]
```
- quality_ph
  (delta_time)
  int8
  ...
  contentType :
  qualityInformation
  description :
  Indicates the quality of the associated photon. 0=nominal, 1=possible_afterpulse, 2=possible_impulse_response_effect, 3=possible_tep. Use this flag in conjunction with signal_conf_ph to identify those photons that are likely noise or likely signal.
  flag_meanings :
  nominal possible_afterpulse possible_impulse_response_effect possible_tep
  flag_values :
  [0 1 2 3]
  long_name :
  Photon Quality
  source :
  ATL03 ATBD
  units :
  1
  valid_max :
  3
  valid_min :
  0
```
[5595411 values with dtype=int8]
```
- signal_conf_ph
  (delta_time, ds_surf_type)
  int8
  ...
  contentType :
  qualityInformation
  description :
  Confidence level associated with each photon event selected as signal. 0=noise. 1=added to allow for buffer but algorithm classifies as background; 2=low; 3=med; 4=high). This parameter is a 5xN array where N is the number of photons in the granule, and the 5 rows indicate signal finding for each surface type (in order: land, ocean, sea ice, land ice and inland water). Events not associated with a specific surface type have a confidence level of -1. Events evaluated as TEP returns have a confidence level of -2.
  flag_meanings :
  possible_tep not_considered noise buffer low medium high
  flag_values :
  [-2 -1 0 1 2 3 4]
  long_name :
  Photon Signal Confidence
  source :
  ATL03 ATBD, Section 5, Conf
  units :
  1
  valid_max :
  4
  valid_min :
  -2
```
[27977055 values with dtype=int8]
```
- weight_ph
  (delta_time)
  uint8
  ...
  contentType :
  modelResult
  description :
  Computed weight of each photon. The weight is calculated by a windowed KNN algorithm using the distances between each photon and its K nearest neighbors. Values range from 0 to 255 where 255 is the most heavily weighted photon and would be considered likely signal.
  long_name :
  Photon weight
  source :
  ATBD Section 5
  units :
  1
  valid_max :
  255
  valid_min :
  0
```
[5595411 values with dtype=uint8]
```

Indexes: (1)

delta_time

PandasIndex

PandasIndex(DatetimeIndex(['2025-01-04 00:36:08.149526176',
               '2025-01-04 00:36:08.149626208',
               '2025-01-04 00:36:08.149726176',
               '2025-01-04 00:36:08.149726176',
               '2025-01-04 00:36:08.149726176',
               '2025-01-04 00:36:08.149826176',
               '2025-01-04 00:36:08.149926176',
               '2025-01-04 00:36:08.149926176',
               '2025-01-04 00:36:08.150026176',
               '2025-01-04 00:36:08.150026176',
               ...
               '2025-01-04 00:44:35.099739168',
               '2025-01-04 00:44:35.103839200',
               '2025-01-04 00:44:35.117839200',
               '2025-01-04 00:44:35.118639200',
               '2025-01-04 00:44:35.157039200',
               '2025-01-04 00:44:35.164639200',
               '2025-01-04 00:44:35.165339200',
               '2025-01-04 00:44:35.167339200',
               '2025-01-04 00:44:35.183339168',
               '2025-01-04 00:44:35.196339200'],
              dtype='datetime64[ns]', name='delta_time', length=5595411, freq=None))

Attributes: (2)
Description :
Contains arrays of the parameters for each received photon.
data_rate :
Data are stored at the photon detection rate.

Better caching¶

xarray decoded a few variables here and touched many parts of the file, what does this mean regardless file access? if we inspect the cache again will see the number of requests made and how many of them hit a cached block.

In [5]:

Copied!

file_handler.f.cache
file_handler.f.cache

Out[5]:

        <BlockCache:
            block size  :   8388608
            block count :   65
            file size   :   542254177
            cache hits  :   1762
            cache misses:   22
            total requested bytes: 184549376>

Defaults are bad for random access: testing readahead on HDF5, ATL03 data¶

Now let's explore what happens with fsspec defaults by overriding the config with open_kwargs.

In [6]:

Copied!





default_fsspec_open = {
    "cache_type": "readahead",
    "block_size": 5 * 1024 * 1024,  # 5MB
}

file_handler = ea.open(results, open_kwargs=default_fsspec_open)[0]
file_handler.f.cache
default_fsspec_open = {
    "cache_type": "readahead",
    "block_size": 5 * 1024 * 1024,  # 5MB
}

file_handler = ea.open(results, open_kwargs=default_fsspec_open)[0]
file_handler.f.cache

Out[6]:

        <ReadAheadCache:
            block size  :   5242880
            block count :   0
            file size   :   542254177
            cache hits  :   0
            cache misses:   0
            total requested bytes: 0>

In [7]:

Copied!

%%time

ds = xr.open_dataset(file_handler, group="/gt1l/heights")
# we print the cache info again
file_handler.f.cache
%%time

ds = xr.open_dataset(file_handler, group="/gt1l/heights")
# we print the cache info again
file_handler.f.cache

CPU times: user 6.06 s, sys: 1.54 s, total: 7.6 s
Wall time: 11min 2s

Out[7]:

        <ReadAheadCache:
            block size  :   5242880
            block count :   0
            file size   :   542254177
            cache hits  :   1158
            cache misses:   604
            total requested bytes: 3120435184>

This we can see we requested an order of magnitude more information with considerable more cache misses despite the block_size being similar in size. Most of it is due how readahead does cache invalidation for random access.

Old defaults:

3120435184 requested bytes
604 cache misses
Wall time: 11min

New defaults:

184549376 requested bytes
22 cache misses
Wall time: 29.7 s

In [8]:

Copied!

print(f"{int(3120435184 / 184549376)} times more data requested.")
print(f"{int(3120435184 / 184549376)} times more data requested.")

16 times more data requested.

Here we are seeing over an order of magnitude improvement without changing anything but the way we fetch data. Of course things can improve even further if we deal with netCDF4/HDF5 files with fewer variables and/or especially with cloud-optimized HDF5.

We can test that with version 7 of the same file.

In [9]:

Copied!





results = ea.search_data(
    short_name="ATL03",
    version="007",
    cloud_hosted=True,
    temporal=("2025-01", "2025-08"),
    bounding_box=(33.98, 31.19, 34.65, 31.51),
    count=1,
)
results[0]
results = ea.search_data(
    short_name="ATL03",
    version="007",
    cloud_hosted=True,
    temporal=("2025-01", "2025-08"),
    bounding_box=(33.98, 31.19, 34.65, 31.51),
    count=1,
)
results[0]

Out[9]:

Data: ATL03_20250104003608_02892602_007_01.h5

Size: 544.0 MB

Cloud Hosted: True

In [10]:

Copied!

file_handler = ea.open(results, open_kwargs=default_fsspec_open)[0]
file_handler.f.cache
file_handler = ea.open(results, open_kwargs=default_fsspec_open)[0]
file_handler.f.cache

Out[10]:

        <ReadAheadCache:
            block size  :   5242880
            block count :   0
            file size   :   570425344
            cache hits  :   0
            cache misses:   0
            total requested bytes: 0>

In [11]:

Copied!

%%time

ds = xr.open_dataset(file_handler, group="/gt1l/heights")
# we print the cache info again
file_handler.f.cache
%%time

ds = xr.open_dataset(file_handler, group="/gt1l/heights")
# we print the cache info again
file_handler.f.cache

CPU times: user 316 ms, sys: 55.7 ms, total: 371 ms
Wall time: 16.2 s

Out[11]:

        <ReadAheadCache:
            block size  :   5242880
            block count :   0
            file size   :   570425344
            cache hits  :   1274
            cache misses:   13
            total requested bytes: 68816555>

Wait, what happened now?¶

We didn't optimize the cache and yet this file in version 7 took only ~17 seconds to open, why? well this is a cloud optimized HDF5 file, hence all the metadata fits in the first request fsspec does. But this is not always the case, depending on how big a file is and how many variables and metadata has. Now we's use new earthaccess defaults and see if we can improve this even more.

Note that cloud optimized HDF5 and NetCDF4 is a very recent and for now not a lot of datasets will be released in this format

In [12]:

Copied!

%%time

# best case scenario, block caching on a cloud optimized HDF5 file.
file_handler = ea.open(results)[0]
ds = xr.open_dataset(file_handler, group="/gt1l/heights")

file_handler.f.cache
%%time

# best case scenario, block caching on a cloud optimized HDF5 file.
file_handler = ea.open(results)[0]
ds = xr.open_dataset(file_handler, group="/gt1l/heights")

file_handler.f.cache

CPU times: user 217 ms, sys: 52.3 ms, total: 269 ms
Wall time: 6.71 s

Out[12]:

        <BlockCache:
            block size  :   8388608
            block count :   68
            file size   :   570425344
            cache hits  :   1287
            cache misses:   3
            total requested bytes: 25165824>

Benchmarks¶

blockcache is better than readahead for binary files (HDF5, netCDF4, HDF) that require random access. earthaccess now uses blockcache by default and the block size is adjusted by file size.
- small files < 100MB: 4MB
- medium files < 1GB: 8MB
- big file: > 1GB: 16MB
optimal caching depends very much on the access pattern and the internal chunking of a given dataset, as of now that information is not easily available beforehand, a database that contains this kind of information will be really helpful to the community.

Here is a table that shows the performance gains by this new caching schema for a few selected popular binary datasets in the NASA Earth archives.

CAUTION: Running this can take up to 1 hour if we are not in-region.

In [13]:

Copied!





%%time

import time

import earthaccess as ea
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tqdm
import xarray as xr

ea.login()

# Some popular NASA datasets
popular_collections = [
    "C2723758340-GES_DISC",
    "C2724036633-LARC_CLOUD",
    "C3385049989-OB_CLOUD",
    "C2532426483-ORNL_CLOUD",
    "C2670138092-NSIDC_CPRD",
]

collection_labels = {
    "C2723758340-GES_DISC": "GPM Precipitation (GES DISC)",
    "C2724036633-LARC_CLOUD": "TEMPO L2 (LARC)",
    "C3385049989-OB_CLOUD": "PACE OCI (OB)",
    "C2532426483-ORNL_CLOUD": "Daymet daily (ORNL)",
    "C2670138092-NSIDC_CPRD": "ATL06 Lidar (NSIDC)",
}


def open_file(concept_id: str, optimized: bool, run_id: int = 0):
    default_fsspec_open = {
        "cache_type": "readahead",
        "block_size": 5 * 1024 * 1024,  # 5MB
    }

    results = ea.search_data(concept_id=concept_id, cloud_hosted=True, count=1)
    if not results:
        return {
            "concept_id": concept_id,
            "optimized": optimized,
            "run_id": run_id,
            "time": None,
            "block_size": None,
            "file_size": None,
            "cache_hits": None,
            "cache_misses": None,
            "total_requested_bytes": None,
            "error": "No files found",
        }

    try:
        if not optimized:
            f_h = ea.open(results, open_kwargs=default_fsspec_open)[0]
        else:
            f_h = ea.open(results)[
                0
            ]  # Uses earthaccess's blockcache + adaptive block size

        t_start = time.time()
        ds = xr.open_datatree(f_h, phony_dims="sort", decode_timedelta=True)
        execution_time = time.time() - t_start
        ds.close()

        cache = f_h.f.cache
        return {
            "concept_id": concept_id,
            "optimized": optimized,
            "run_id": run_id,
            "time": execution_time,
            "block_size": getattr(cache, "blocksize", None),
            "file_size": getattr(cache, "size", None),
            "cache_hits": getattr(cache, "hit_count", 0),
            "cache_misses": getattr(cache, "miss_count", 0),
            "total_requested_bytes": getattr(cache, "total_requested_bytes", 0),
            "error": None,
        }

    except Exception as e:
        return {
            "concept_id": concept_id,
            "optimized": optimized,
            "run_id": run_id,
            "time": None,
            "block_size": None,
            "file_size": None,
            "cache_hits": None,
            "cache_misses": None,
            "total_requested_bytes": None,
            "error": str(e),
        }


n_runs = 3
print(f"Starting benchmark: {n_runs} runs per dataset and mode\n")
benchmarks = []

for run in range(n_runs):
    print(f"Run {run + 1} of {n_runs}")

    # Optimized (earthaccess defaults)
    print("Optimized (blockcache)")
    for cid in tqdm.tqdm(popular_collections):
        result = open_file(cid, optimized=True, run_id=run)
        benchmarks.append(result)

    # Default (readahead)
    print("Non-optimized (readahead)")
    for cid in tqdm.tqdm(popular_collections):
        result = open_file(cid, optimized=False, run_id=run)
        benchmarks.append(result)


df = pd.DataFrame(benchmarks)
%%time

import time

import earthaccess as ea
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tqdm
import xarray as xr

ea.login()

# Some popular NASA datasets
popular_collections = [
    "C2723758340-GES_DISC",
    "C2724036633-LARC_CLOUD",
    "C3385049989-OB_CLOUD",
    "C2532426483-ORNL_CLOUD",
    "C2670138092-NSIDC_CPRD",
]

collection_labels = {
    "C2723758340-GES_DISC": "GPM Precipitation (GES DISC)",
    "C2724036633-LARC_CLOUD": "TEMPO L2 (LARC)",
    "C3385049989-OB_CLOUD": "PACE OCI (OB)",
    "C2532426483-ORNL_CLOUD": "Daymet daily (ORNL)",
    "C2670138092-NSIDC_CPRD": "ATL06 Lidar (NSIDC)",
}


def open_file(concept_id: str, optimized: bool, run_id: int = 0):
    default_fsspec_open = {
        "cache_type": "readahead",
        "block_size": 5 * 1024 * 1024,  # 5MB
    }

    results = ea.search_data(concept_id=concept_id, cloud_hosted=True, count=1)
    if not results:
        return {
            "concept_id": concept_id,
            "optimized": optimized,
            "run_id": run_id,
            "time": None,
            "block_size": None,
            "file_size": None,
            "cache_hits": None,
            "cache_misses": None,
            "total_requested_bytes": None,
            "error": "No files found",
        }

    try:
        if not optimized:
            f_h = ea.open(results, open_kwargs=default_fsspec_open)[0]
        else:
            f_h = ea.open(results)[
                0
            ]  # Uses earthaccess's blockcache + adaptive block size

        t_start = time.time()
        ds = xr.open_datatree(f_h, phony_dims="sort", decode_timedelta=True)
        execution_time = time.time() - t_start
        ds.close()

        cache = f_h.f.cache
        return {
            "concept_id": concept_id,
            "optimized": optimized,
            "run_id": run_id,
            "time": execution_time,
            "block_size": getattr(cache, "blocksize", None),
            "file_size": getattr(cache, "size", None),
            "cache_hits": getattr(cache, "hit_count", 0),
            "cache_misses": getattr(cache, "miss_count", 0),
            "total_requested_bytes": getattr(cache, "total_requested_bytes", 0),
            "error": None,
        }

    except Exception as e:
        return {
            "concept_id": concept_id,
            "optimized": optimized,
            "run_id": run_id,
            "time": None,
            "block_size": None,
            "file_size": None,
            "cache_hits": None,
            "cache_misses": None,
            "total_requested_bytes": None,
            "error": str(e),
        }


n_runs = 3
print(f"Starting benchmark: {n_runs} runs per dataset and mode\n")
benchmarks = []

for run in range(n_runs):
    print(f"Run {run + 1} of {n_runs}")

    # Optimized (earthaccess defaults)
    print("Optimized (blockcache)")
    for cid in tqdm.tqdm(popular_collections):
        result = open_file(cid, optimized=True, run_id=run)
        benchmarks.append(result)

    # Default (readahead)
    print("Non-optimized (readahead)")
    for cid in tqdm.tqdm(popular_collections):
        result = open_file(cid, optimized=False, run_id=run)
        benchmarks.append(result)


df = pd.DataFrame(benchmarks)

Starting benchmark: 3 runs per dataset and mode

Run 1 of 3
Optimized (blockcache)

100%|███████████████████████████████████████████████████████████████████████| 5/5 [01:42<00:00, 20.48s/it]

Non-optimized (readahead)

100%|██████████████████████████████████████████████████████████████████████| 5/5 [11:46<00:00, 141.24s/it]

Run 2 of 3
Optimized (blockcache)

100%|███████████████████████████████████████████████████████████████████████| 5/5 [01:49<00:00, 21.98s/it]

Non-optimized (readahead)

100%|██████████████████████████████████████████████████████████████████████| 5/5 [12:01<00:00, 144.28s/it]

Run 3 of 3
Optimized (blockcache)

100%|███████████████████████████████████████████████████████████████████████| 5/5 [01:36<00:00, 19.24s/it]

Non-optimized (readahead)

100%|██████████████████████████████████████████████████████████████████████| 5/5 [12:21<00:00, 148.27s/it]

CPU times: user 54.4 s, sys: 3.94 s, total: 58.3 s
Wall time: 41min 18s

In [17]:

Copied!





df["dataset"] = df["concept_id"].map(collection_labels)
df["mode"] = df["optimized"].map({True: "Optimized", False: "Default (readahead)"})
df["file_size_mb"] = df["file_size"] / (1024 * 1024)
df["total_requests"] = df["cache_hits"] + df["cache_misses"]
df["hit_rate"] = df["cache_hits"] / df["total_requests"].where(
    df["total_requests"] > 0, 1
)
df["requested_mb"] = df["total_requested_bytes"] / (1024 * 1024)


success_df = df[df["error"].isna()].copy()
failed = df[df["error"].notna()]

if len(failed) > 0:
    print(failed[["dataset", "mode", "run_id", "error"]].drop_duplicates())

if len(success_df) == 0:
    raise RuntimeError("All tests failed. Check authentication and connectivity.")

# Average metrics over n runs
grouped = (
    success_df.groupby(["dataset", "mode"])
    .agg(
        time_mean=("time", "mean"),
        time_std=("time", "std"),
        hit_rate_mean=("hit_rate", "mean"),
        requested_mb_mean=("requested_mb", "mean"),
        block_size=("block_size", "first"),  # should be stable
        file_size_mb=("file_size_mb", "first"),
    )
    .reset_index()
)


pivot = grouped.pivot(index="dataset", columns="mode")
pivot.columns = [f"{metric}_{mode}" for metric, mode in pivot.columns]
pivot = pivot.sort_index()


pivot["speedup_x"] = (
    pivot["time_mean_Default (readahead)"] / pivot["time_mean_Optimized"]
).round(2)
pivot["data_savings_%"] = (
    (
        pivot["requested_mb_mean_Default (readahead)"]
        - pivot["requested_mb_mean_Optimized"]
    )
    / pivot["requested_mb_mean_Default (readahead)"]
    * 100
).round(1)


final_cols = [
    "time_mean_Optimized",
    "time_mean_Default (readahead)",
    "speedup_x",
    "hit_rate_mean_Optimized",
    "hit_rate_mean_Default (readahead)",
    "requested_mb_mean_Optimized",
    "requested_mb_mean_Default (readahead)",
    "data_savings_%",
    "block_size_Optimized",
    "block_size_Default (readahead)",
    "file_size_mb_Optimized",
]

summary = pivot[final_cols].copy()

summary.rename(
    columns={
        "time_mean_Optimized": "Time (s) - Opt",
        "time_mean_Default (readahead)": "Time (s) - Default",
        "speedup_x": "Speedup (x)",
        "hit_rate_mean_Optimized": "Hit Rate - Opt",
        "hit_rate_mean_Default (readahead)": "Hit Rate - Default",
        "requested_mb_mean_Optimized": "Req MB - Opt",
        "requested_mb_mean_Default (readahead)": "Req MB - Default",
        "data_savings_%": "Data Saved (%)",
        "block_size_Optimized": "Block MB - Opt",
        "block_size_Default (readahead)": "Block MB - Default",
        "file_size_mb_Optimized": "File Size (MB)",
    },
    inplace=True,
)


for col in ["Block MB - Opt", "Block MB - Default"]:
    summary[col] = (summary[col] / (1024 * 1024)).round(1)

# Summary
print("\n" + "=" * 100)
print(f"SUMMARY: Averaged Over {n_runs} Runs")
print("=" * 100)
print(summary.round(3))

# Cache Hit Rate Comparison

plot_data = grouped.set_index(["dataset", "mode"])["hit_rate_mean"].unstack("mode")
datasets = plot_data.index
x = np.arange(len(datasets))
width = 0.35

plt.figure(figsize=(10, 6))
bars1 = plt.bar(
    x - width / 2,
    plot_data["Default (readahead)"],
    width,
    label="Default (readahead)",
    color="gray",
    alpha=0.8,
)
bars2 = plt.bar(
    x + width / 2,
    plot_data["Optimized"],
    width,
    label="Optimized (blockcache)",
    color="seagreen",
    alpha=0.9,
)

plt.ylabel("Cache Hit Rate", fontsize=12)
plt.title(
    "Cache Hit Rate: Optimized vs Default (Averaged Over 3 Runs)", fontsize=14, pad=20
)
plt.xticks(x, [lbl.replace(" ", "\n") for lbl in datasets], rotation=0)
plt.ylim(0, 1.05)
plt.legend(fontsize=11)
plt.grid(axis="y", linestyle="--", alpha=0.4)


def add_labels(bars):
    for bar in bars:
        height = bar.get_height()
        plt.text(
            bar.get_x() + bar.get_width() / 2,
            height + 0.02,
            f"{height:.3f}",
            ha="center",
            va="bottom",
            fontsize=9,
        )


add_labels(bars1)
add_labels(bars2)

plt.tight_layout()
plt.show()
df["dataset"] = df["concept_id"].map(collection_labels)
df["mode"] = df["optimized"].map({True: "Optimized", False: "Default (readahead)"})
df["file_size_mb"] = df["file_size"] / (1024 * 1024)
df["total_requests"] = df["cache_hits"] + df["cache_misses"]
df["hit_rate"] = df["cache_hits"] / df["total_requests"].where(
    df["total_requests"] > 0, 1
)
df["requested_mb"] = df["total_requested_bytes"] / (1024 * 1024)


success_df = df[df["error"].isna()].copy()
failed = df[df["error"].notna()]

if len(failed) > 0:
    print(failed[["dataset", "mode", "run_id", "error"]].drop_duplicates())

if len(success_df) == 0:
    raise RuntimeError("All tests failed. Check authentication and connectivity.")

# Average metrics over n runs
grouped = (
    success_df.groupby(["dataset", "mode"])
    .agg(
        time_mean=("time", "mean"),
        time_std=("time", "std"),
        hit_rate_mean=("hit_rate", "mean"),
        requested_mb_mean=("requested_mb", "mean"),
        block_size=("block_size", "first"),  # should be stable
        file_size_mb=("file_size_mb", "first"),
    )
    .reset_index()
)


pivot = grouped.pivot(index="dataset", columns="mode")
pivot.columns = [f"{metric}_{mode}" for metric, mode in pivot.columns]
pivot = pivot.sort_index()


pivot["speedup_x"] = (
    pivot["time_mean_Default (readahead)"] / pivot["time_mean_Optimized"]
).round(2)
pivot["data_savings_%"] = (
    (
        pivot["requested_mb_mean_Default (readahead)"]
        - pivot["requested_mb_mean_Optimized"]
    )
    / pivot["requested_mb_mean_Default (readahead)"]
    * 100
).round(1)


final_cols = [
    "time_mean_Optimized",
    "time_mean_Default (readahead)",
    "speedup_x",
    "hit_rate_mean_Optimized",
    "hit_rate_mean_Default (readahead)",
    "requested_mb_mean_Optimized",
    "requested_mb_mean_Default (readahead)",
    "data_savings_%",
    "block_size_Optimized",
    "block_size_Default (readahead)",
    "file_size_mb_Optimized",
]

summary = pivot[final_cols].copy()

summary.rename(
    columns={
        "time_mean_Optimized": "Time (s) - Opt",
        "time_mean_Default (readahead)": "Time (s) - Default",
        "speedup_x": "Speedup (x)",
        "hit_rate_mean_Optimized": "Hit Rate - Opt",
        "hit_rate_mean_Default (readahead)": "Hit Rate - Default",
        "requested_mb_mean_Optimized": "Req MB - Opt",
        "requested_mb_mean_Default (readahead)": "Req MB - Default",
        "data_savings_%": "Data Saved (%)",
        "block_size_Optimized": "Block MB - Opt",
        "block_size_Default (readahead)": "Block MB - Default",
        "file_size_mb_Optimized": "File Size (MB)",
    },
    inplace=True,
)


for col in ["Block MB - Opt", "Block MB - Default"]:
    summary[col] = (summary[col] / (1024 * 1024)).round(1)

# Summary
print("\n" + "=" * 100)
print(f"SUMMARY: Averaged Over {n_runs} Runs")
print("=" * 100)
print(summary.round(3))

# Cache Hit Rate Comparison

plot_data = grouped.set_index(["dataset", "mode"])["hit_rate_mean"].unstack("mode")
datasets = plot_data.index
x = np.arange(len(datasets))
width = 0.35

plt.figure(figsize=(10, 6))
bars1 = plt.bar(
    x - width / 2,
    plot_data["Default (readahead)"],
    width,
    label="Default (readahead)",
    color="gray",
    alpha=0.8,
)
bars2 = plt.bar(
    x + width / 2,
    plot_data["Optimized"],
    width,
    label="Optimized (blockcache)",
    color="seagreen",
    alpha=0.9,
)

plt.ylabel("Cache Hit Rate", fontsize=12)
plt.title(
    "Cache Hit Rate: Optimized vs Default (Averaged Over 3 Runs)", fontsize=14, pad=20
)
plt.xticks(x, [lbl.replace(" ", "\n") for lbl in datasets], rotation=0)
plt.ylim(0, 1.05)
plt.legend(fontsize=11)
plt.grid(axis="y", linestyle="--", alpha=0.4)


def add_labels(bars):
    for bar in bars:
        height = bar.get_height()
        plt.text(
            bar.get_x() + bar.get_width() / 2,
            height + 0.02,
            f"{height:.3f}",
            ha="center",
            va="bottom",
            fontsize=9,
        )


add_labels(bars1)
add_labels(bars2)

plt.tight_layout()
plt.show()

====================================================================================================
SUMMARY: Averaged Over 3 Runs
====================================================================================================
                              Time (s) - Opt  Time (s) - Default  Speedup (x)  \
dataset                                                                         
ATL06 Lidar (NSIDC)                   10.357             507.886        49.04   
Daymet daily (ORNL)                    2.406               2.025         0.84   
GPM Precipitation (GES DISC)           4.206              57.330        13.63   
PACE OCI (OB)                          3.581              24.653         6.88   
TEMPO L2 (LARC)                       19.316              68.545         3.55   

                              Hit Rate - Opt  Hit Rate - Default  \
dataset                                                            
ATL06 Lidar (NSIDC)                    0.999               0.810   
Daymet daily (ORNL)                    0.998               0.998   
GPM Precipitation (GES DISC)           0.979               0.452   
PACE OCI (OB)                          0.990               0.887   
TEMPO L2 (LARC)                        0.916               0.592   

                              Req MB - Opt  Req MB - Default  Data Saved (%)  \
dataset                                                                        
ATL06 Lidar (NSIDC)                   12.0          1996.835            99.4   
Daymet daily (ORNL)                    4.0             1.922          -108.1   
GPM Precipitation (GES DISC)           8.0           134.247            94.0   
PACE OCI (OB)                          8.0            67.849            88.2   
TEMPO L2 (LARC)                      112.0           288.875            61.2   

                              Block MB - Opt  Block MB - Default  \
dataset                                                            
ATL06 Lidar (NSIDC)                      4.0                 5.0   
Daymet daily (ORNL)                      4.0                 5.0   
GPM Precipitation (GES DISC)             4.0                 5.0   
PACE OCI (OB)                            4.0                 5.0   
TEMPO L2 (LARC)                          8.0                 5.0   

                              File Size (MB)  
dataset                                       
ATL06 Lidar (NSIDC)                   11.307  
Daymet daily (ORNL)                    1.922  
GPM Precipitation (GES DISC)           7.276  
PACE OCI (OB)                          8.435  
TEMPO L2 (LARC)                      292.735

TL;DR; `earthaccess.open()` then vs now¶

All of this to say that with the upcoming release of earthaccess we'll enjoy speedups in our workflows from marginal to order of magnitude depending on the dataset.

In [18]:

Copied!





plt.figure(figsize=(10, 6))

# Extract data
speedup_values = summary["Speedup (x)"].values
datasets = summary.index

# Color: red if < 1.0 (slower), green if >= 1.0 (faster)
colors = ["firebrick" if s < 1.0 else "seagreen" for s in speedup_values]

bars = plt.bar(
    datasets, speedup_values, color=colors, alpha=0.85, edgecolor="black", linewidth=0.8
)

# Labels and title
plt.ylabel("Speedup Factor (x)", fontsize=12)
plt.title("Speedup: Optimized vs Default (Higher = Better)", fontsize=14, pad=20)
plt.xticks(rotation=0)
plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.axhline(1.0, color="red", linestyle="--", linewidth=1.2, label="No difference")

# Add value labels on bars
for bar, speedup in zip(bars, speedup_values):
    color = "white" if speedup < 1.0 else "black"
    plt.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + max(speedup_values) * 0.05,
        f"{speedup:.2f}x",
        ha="center",
        va="bottom",
        fontsize=10,
        color=color,
        fontweight="bold",
    )

plt.legend(fontsize=11)
plt.tight_layout()
plt.show()
plt.figure(figsize=(10, 6))

# Extract data
speedup_values = summary["Speedup (x)"].values
datasets = summary.index

# Color: red if < 1.0 (slower), green if >= 1.0 (faster)
colors = ["firebrick" if s < 1.0 else "seagreen" for s in speedup_values]

bars = plt.bar(
    datasets, speedup_values, color=colors, alpha=0.85, edgecolor="black", linewidth=0.8
)

# Labels and title
plt.ylabel("Speedup Factor (x)", fontsize=12)
plt.title("Speedup: Optimized vs Default (Higher = Better)", fontsize=14, pad=20)
plt.xticks(rotation=0)
plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.axhline(1.0, color="red", linestyle="--", linewidth=1.2, label="No difference")

# Add value labels on bars
for bar, speedup in zip(bars, speedup_values):
    color = "white" if speedup < 1.0 else "black"
    plt.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + max(speedup_values) * 0.05,
        f"{speedup:.2f}x",
        ha="center",
        va="bottom",
        fontsize=10,
        color=color,
        fontweight="bold",
    )

plt.legend(fontsize=11)
plt.tight_layout()
plt.show()